K-means - Python Video Tutorial | LinkedIn Learning, formerly Lynda.com (2024)

From the course: Applied Machine Learning: Algorithms

K-means

- [Instructor] In this video, we're going to talk about clustering. Specifically, we're going to talk about K-means clustering. K-means clustering is what we call an unsupervised machine learning algorithm for making clusters from data. The unsupervised portion means that we feed data in without providing any labels, and the algorithm, based on the data we passed in, does some processing. In this case, it gives us cluster labels. There are other unsupervised machine learning algorithms; later we'll look at one called PCA. PCA is also unsupervised, meaning it doesn't take labels with the data, and it does dimension reduction.

The algorithm for K-means clustering is as follows. We choose a k, where k is the number of clusters. The algorithm then picks k points (three, in our example) from the data we passed in; it calls these points centroids. It goes through and labels each data point with the centroid it is closest to. After it's done labeling, it recenters the centroids: for everything labeled cluster one, it moves that centroid to the middle of the cluster; same for cluster two, and so on. Then it repeats the process, finding the points closest to each moved centroid, and eventually the centroids settle into a stable state.

I'll show you an example right here. We're using the scikit-learn library, which has clustering in it. Let me walk through the code. At the top we have our imports. We're going to use the KMeans class. We're also importing the datasets module to load the Iris dataset, and we'll use Matplotlib to do some plotting. We load the Iris dataset into a variable called dataset, and then I make a variable called "X".
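The setup described so far might look like this; the variable names follow the narration, but the exact notebook code isn't shown in this transcript, so treat it as a sketch:

```python
from sklearn import datasets
from sklearn.cluster import KMeans   # the clustering algorithm from the video
import matplotlib.pyplot as plt      # used for plotting later on

# Load the classic Iris dataset: 150 rows, one per flower
dataset = datasets.load_iris()

# X is the feature matrix: rows are samples, columns are
# petal/sepal measurements (4 features per flower)
X = dataset.data
print(X.shape)  # (150, 4)
```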
Capital X is a common convention you'll see throughout the scikit-learn library. In linear algebra, a capital variable describes a matrix, a two-dimensional group of data with rows and columns. When we pass data into scikit-learn, it's generally in two dimensions: each row represents a sample, and the columns are features that describe the data. You can also have a y dataset; y is used for supervised learning.

The dataset we have, the Iris dataset, is a classic dataset for machine learning. It's got 150 rows, each row representing a different flower from the Iris family, and there are three different species: setosa, versicolor, and virginica. The features in the X variable describe the shape and dimensions of the petals and sepals of each of those 150 flowers.

The next thing I'm going to do is make a list to track my centroids, and then loop 10 times, running the algorithm I just described once per pass through this little for loop down here. At the top of the loop I make a model, telling KMeans I want three clusters. Then I call model.fit. One of the nice things about scikit-learn, which we'll use throughout this course, is that it has a consistent interface; once you learn that interface, it's really easy to use. Unsupervised learning algorithms use fit, and you just pass in X. For supervised learning algorithms, which we'll see later, you call fit with both X and y. Next, we predict our labels: because this algorithm can make predictions, we pass in our X and it gives back a label for each row.
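A minimal sketch of that loop, assuming one fresh model per round with a capped max_iter so the centroid movement is visible between rounds; the video doesn't spell out these exact parameters, so they're assumptions:

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# Run the algorithm once per round, letting it take one more
# iteration each time so we can watch the centroids move
centroids = []
for i in range(10):
    model = KMeans(n_clusters=3, init="random", n_init=1,
                   max_iter=i + 1, random_state=0)
    model.fit(X)                # unsupervised: fit takes X only, no y
    labels = model.predict(X)   # one cluster id (0, 1, or 2) per row
    centroids.append(model.cluster_centers_)
```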
Again, scikit-learn uses the same interface all over the place. We'll see that classification and regression models also have a predict method. This is really nice because once you understand the basics of scikit-learn, its consistent interface makes your life a lot easier.

On the next line, we create a Matplotlib figure and axis, and then plot a scatter plot on top of that. I'll also pull out the cluster centers; you can see that I'm accessing model.cluster_centers_. One thing to note in scikit-learn is that attributes ending in an underscore are learned when we call fit. So when we called fit up above, it determined where those cluster centers are. I plot those as stars on my plot, and I give the plot a title showing which iteration I'm on. Then, if my round i is greater than zero, I also plot the previous centroids. At the end of the for loop, I keep track of the current centroids. What this will show us is that as we move through the algorithm, the centroids move along as well. And then at the end, outside of the for loop, I plot the original data so you can compare it with what's going on here.

Let's run this. If you're not familiar with using VS Code and notebooks, all you have to do to run this is hit the little triangle up here; I like to just hold down Control and hit Enter. You can see at the top a little bar indicating that it is running, and then we'll scroll down here. Okay, it looks like it just ran, and here are our plots, a series of them. Here's the first iteration. You can see three stars here, along with green labels, purple labels, and yellow labels.
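Putting the plotting step together might look like this; the feature columns plotted, the marker size, and the colors are assumptions rather than details taken from the video:

```python
import matplotlib
matplotlib.use("Agg")   # draw off-screen; a notebook would show plots inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = model.predict(X)

# Trailing-underscore attributes, like cluster_centers_, are learned by fit
centers = model.cluster_centers_   # shape (3, 4): one centroid per cluster

fig, ax = plt.subplots()
# Scatter the first two feature columns, colored by predicted cluster
ax.scatter(X[:, 0], X[:, 1], c=labels)
# Mark each centroid with a star, as in the video
ax.scatter(centers[:, 0], centers[:, 1], marker="*", s=200, c="red")
ax.set_title("iteration 0")
plt.close(fig)
```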
After we've done this iteration, the stars will move, centering inside their labels. In the next iteration you can see that the stars have migrated a little, and the boundaries between the different clusters have moved as well. Here's another iteration: we're slowly shifting those centroids a little bit each time. I'll just scroll down to the bottom here; you can run this on your own and look at what's happening at the individual steps.

Here is the original data, and it's colored by the target. Note that I did not include the target in the data I passed into the algorithm; I only passed in the dimensions. Here I'm labeling the points by the type of Iris, and remember there are three different types. On the upper left there's one type, in the middle we have the green type, and there's some overlap there with the yellow type on the right. However, if you look up above, our clustering algorithm did a decent job. It's not perfect: it cleanly picked out the upper-left group, but it isn't able to resolve the overlap we see between the other two. Still, a decent job.

Okay, in this video I gave an introduction to the K-means clustering algorithm. It's an unsupervised algorithm: you pass in data, tell it how many clusters you want, and it returns labels grouping the pieces of data that fall in the same cluster.
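The final comparison plot, the original data colored by the true species, could be sketched like this, again with the feature columns and styling as assumptions:

```python
import matplotlib
matplotlib.use("Agg")   # render off-screen; a notebook shows plots inline
import matplotlib.pyplot as plt
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data

# The target (species id 0, 1, or 2) was never passed to KMeans --
# it's used only here, to color the original data for comparison
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=dataset.target)
ax.set_title("Original data, colored by Iris species")
plt.close(fig)
```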

Contents

    • Applied machine learning: Algorithms 49s
    • What you should know 1m 4s
    • K-means 7m 46s
    • (Locked) K evaluation 8m 10s
    • (Locked) Understanding clusters 7m 59s
    • (Locked) Other algorithms 2m 44s
    • (Locked) Challenge: Apply KNN 54s
    • (Locked) Solution: Apply KNN 7m 24s
    • PCA 3m 38s
    • (Locked) Structure of components 4m 8s
    • (Locked) Components 5m 23s
    • (Locked) Scatter plot 2m 18s
    • (Locked) Other algorithms 4m
    • (Locked) Challenge: Utilize PCA 34s
    • (Locked) Solution: Utilize PCA 4m 24s
    • Linear regression algorithm 4m 34s
    • (Locked) scikit-learn 4m 18s
    • (Locked) Real-world example 6m 14s
    • (Locked) Assumptions 6m 4s
    • (Locked) Challenge: Develop a linear regression model 34s
    • (Locked) Solution: Develop a linear regression model 2m 7s
    • Logistic regression algorithm 1m 19s
    • (Locked) Basic example 1m 49s
    • (Locked) Assumptions 4m 7s
    • (Locked) Challenge: Construct a logistic regression model 12s
    • (Locked) Solution: Construct a logistic regression model 4m 50s
    • (Locked) Decision tree algorithm 4m 12s
    • (Locked) Real-world example 6m 28s
    • (Locked) Random Forest and XGBoost 4m 9s
    • (Locked) Challenge: Design a decision tree model 25s
    • (Locked) Solution: Design a decision tree model 4m
    • (Locked) Next steps 1m 26s