Today I will be looking at K-Means, one of the most common unsupervised machine learning algorithms for clustering problems. Clustering refers to a family of methods for finding subgroups, or clusters, within datasets. It is in this sense that clustering can be thought of as an unsupervised learning method: we are trying to discover structure that is not necessarily known in advance.
The K in K-Means refers to the number of clusters we wish to partition the data into. The algorithm assumes that every observation belongs to exactly one cluster and that clusters do not overlap. This prompts the question: how do we decide what a good number of clusters is? That question probably warrants an entire blog post by itself; there are statistical criteria, such as the Bayesian Information Criterion and the Akaike Information Criterion, devoted to this purpose. These are beyond the scope of what we are looking at today, so a simplistic answer is: the number that creates the smallest variation within clusters. This strikes at the heart of one of the weaknesses of K-Means: the number of clusters K must be chosen by the person running the experiment. It either requires domain-specific knowledge of the data at hand, or running multiple models until an adequate result is found.
K-Means works by randomly selecting a centroid for each of the K clusters from among the data points. After this, the algorithm iterates until the centroids remain static, taking the following steps: 1) It assigns each observation to the cluster with the nearest centroid in terms of Euclidean distance. 2) It recalculates each centroid as the mean of the coordinates of all observations assigned to that cluster.
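The two steps above are simple enough to sketch from scratch in NumPy. This is just an illustrative sketch of the loop described, not what scikit-learn does internally:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal K-Means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 0: pick k random observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each observation to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids remain static
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

In practice you would use a library implementation, which adds refinements such as smarter initialisation (k-means++) and multiple restarts.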
Let’s cast our minds back to a previous blog post, in which I looked at a regression problem with the Boston house price dataset. I was looking for variables that were highly correlated with the average house price of Boston suburbs. One of them was LSTAT, the percentage of the population defined as ‘lower status’. Here is a scatter plot of the two.
If we want to cluster these observations, it is as simple as instantiating a KMeans class from the scikit-learn library, defining the number of clusters desired and calling the fit method on the data. In this case we have arbitrarily chosen K = 3. Scikit-learn’s KMeans class produces labels after it has been fit, which can be passed into the colour parameter of the scatter plot. Here is the code.
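A sketch of what that code might look like. Here `lstat` and `medv` are stand-in arrays generated for illustration; in the post itself they would be the two columns pulled from the Boston dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Stand-in data shaped roughly like the LSTAT / house-price scatter
rng = np.random.default_rng(42)
lstat = rng.uniform(2, 38, 200)
medv = 35 - 0.6 * lstat + rng.normal(0, 3, 200)

# Stack the two variables into an (n_samples, 2) feature matrix
X = np.column_stack([lstat, medv])

# K = 3, chosen arbitrarily
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# The fitted labels colour the scatter plot
plt.scatter(lstat, medv, c=km.labels_)
plt.xlabel('LSTAT')
plt.ylabel('Average house price')
plt.savefig('clusters.png')  # or plt.show() when working interactively
```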
And here is the result.
Of course, one of the characteristics of K-Means is that it is not always obvious what the optimum number of clusters is.
Here we loop through several potential options.
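One common way to run such a loop is the so-called elbow method: fit a model for each candidate K, record the within-cluster sum of squared distances (scikit-learn exposes this as `inertia_`), and look for the point where adding clusters stops paying off. A sketch, using synthetic blobs in place of the post’s data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with a known cluster structure
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
# Plotting k against inertia reveals an "elbow" where the curve flattens;
# that bend is a reasonable choice for K.
```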
One of the more interesting ways that K-Means has been used is in classifying Satellite data. This has been useful, for example, in identifying areas in danger of suffering a forest fire or drought. This is possible because pictures can be represented in vectors of RGB (Red, Green, Blue) values. More modern methods make use of Neural Nets to classify this type of data, but it is an interesting application of K-Means.
To do so with Python only requires GDAL, Matplotlib, NumPy, scikit-learn and some satellite data.
Eos offers its Land Viewer service, which allows the user to search for satellite imagery on the fly for analytics. It’s quite amazing and I would recommend giving it a look if you haven’t already done so. I chose a tip of land by Lake Burullus in Egypt for this example, for no other reason than that I wanted a picture with numerous types of terrain. Hopefully, it will be a good example of the effect of K-Means on RGB matrices. Here is the original.
The images can be saved in .tiff format, which GDAL can read into a raster, a dot-matrix data structure. This can then be converted into a vector, on which we can run K-Means. The picture loses some of its information, but other aspects can be emphasised, which helps in the kinds of problems described above.
In this case, the image takes the shape of an array of (619, 1237). To run K-Means we need to flatten this to an array of shape (-1, 1). As a quick aside, NumPy has a built-in method for this, helpfully called .reshape. By passing in -1 for the first parameter, we are telling NumPy to flatten the array so that it has as many rows as necessary to retain the data, with a single column. In this case it takes the shape of (765703, 1).
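The reshape step on its own, using a dummy array with the same dimensions as the image:

```python
import numpy as np

# Stand-in for the (619, 1237) raster band read from the .tiff
img = np.zeros((619, 1237))

# -1 tells NumPy to infer the number of rows; 1 forces a single column
flat = img.reshape(-1, 1)
print(flat.shape)  # (765703, 1)
```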
Then it is as simple as instantiating a KMeans class, fitting the flattened data, taking the labels produced and reshaping them to the dimensions of the original image.
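Those steps might look like the following sketch. The random array here is a stand-in for the raster band that GDAL would read from the .tiff, and `n_init=1` is just to keep the example quick:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the raster band; in the post this array comes from GDAL
img = np.random.default_rng(0).integers(0, 256, size=(619, 1237)).astype(float)
flat = img.reshape(-1, 1)

# Fit K-Means on the flattened pixel values (5 clusters chosen for illustration)
km = KMeans(n_clusters=5, n_init=1, random_state=0).fit(flat)

# Reshape the labels back to the dimensions of the original image
clustered = km.labels_.reshape(img.shape)
```

The `clustered` array can then be passed straight to Matplotlib for display.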
Helpfully, Matplotlib has the .imshow() method, which can display a 2-D raster as an image. Let’s take a look at the result.
Interesting! This seems like a good place to leave it this time. Thanks for reading!