K-Means Clustering in Python

Feb 27, 2023

The amount of data generated on the Internet is staggering, driven by its ever-growing use in our society.

Although an individual piece of data is easy to understand, the sheer volume makes processing difficult.

We need tools built for large-scale data analysis to manage this.

Machine learning and data mining techniques allow us to analyze large quantities of data in a comprehensible manner. One such technique is k-means, an unsupervised machine learning method for clustering data.

It groups unlabeled data into a predetermined number of clusters (k) based on similarity.

Introduction to K-Means Algorithm

The K-means clustering algorithm computes centroids and repeats the process until it finds the optimal ones.

It assumes that the number of clusters is known in advance. This algorithm is also known as flat clustering. The letter K in K-means denotes the number of clusters the method identifies in the data.

This method assigns data points to clusters so that the sum of squared distances between the data points and their cluster's centroid is as small as possible. Note that the less variation there is within a cluster, the more similar its data points are.

Working with K-Means Algorithm

These stages will allow us to understand how K-Means clustering works.

1: First, provide K, the number of clusters the algorithm should generate.
2: Second, randomly select K data points as starting points and assign each data point to a cluster.
3: Next, calculate the cluster centroids.
4: Repeat the following steps until you find the ideal centroids, that is, until the assignment of data points to clusters no longer changes.
4.1 First, calculate the squared distances between the data points and the centroids.
4.2 Now, assign each data point to the cluster whose centroid is closest to it.
4.3 Next, recalculate the centroid of each cluster by averaging all data points in that cluster.

K-means implements the Expectation-Maximization strategy to solve the problem. The Expectation-step assigns data points to the closest cluster. The Maximization-step computes the centroid for each cluster.
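The two steps can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: the function name kmeans_sketch, its parameters, and the synthetic data are all hypothetical and chosen only for this example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Expectation step: squared distance from every point to every
        # centroid, then assign each point to its nearest centroid.
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # Maximization step: move each centroid to the mean of its points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Synthetic data: three blobs of 50 points each (assumed for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans_sketch(X, k=3)
print(centroids)
```

Because the starting centroids are random, a different seed can lead to a different final clustering, which is one reason library implementations usually run the algorithm several times.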

How Does the K-means Algorithm Work?

The K-means algorithm for data mining begins with a group of randomly chosen centroids, which serve as the starting point for every cluster. It then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It stops creating and optimizing clusters when either of these conditions is met:

The centroids have stabilized: their values no longer change because the clustering has succeeded.
The defined number of iterations has been reached.
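In scikit-learn, these two stopping conditions correspond to the tol and max_iter parameters of the KMeans class. The sketch below is only an illustration: the synthetic data is assumed, and the parameter values are scikit-learn's defaults written out explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 100 points each (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2))
               for loc in ([1, 0], [-2, 2], [-2, -2])])

# max_iter caps the number of iterations; tol sets how small the centroid
# movement must be before a run is declared converged.
kmeans = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0)
kmeans.fit(X)

# n_iter_ reports how many iterations the best run actually needed.
print("iterations used:", kmeans.n_iter_)
```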

K-means clustering is a popular unsupervised machine learning algorithm used to identify clusters in data.

Clustering algorithms are widely used in a variety of applications, including image segmentation, anomaly detection, and customer segmentation.

In this blog post, we will explore the K-means clustering algorithm in depth and provide an example program with step-by-step instructions for clustering data using K-means.

What is K-means Clustering?

K-means clustering is a type of unsupervised learning algorithm used to group similar data points together into clusters.

The algorithm works by first randomly initializing a set of cluster centers, and then iteratively optimizing the location of these centers to minimize the distance between each data point and its nearest center. The algorithm terminates when the cluster centers no longer move significantly.

The K-means algorithm is called “K-means” because it divides the data into K clusters, where K is a user-defined parameter representing the number of clusters desired.

The algorithm works by finding the K cluster centers that minimize the sum of the squared distances between each data point and its nearest cluster center.
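To make this objective concrete, the sum of squared distances can be computed by hand and compared with the inertia_ attribute that scikit-learn stores on a fitted KMeans object. This is a small sketch on assumed synthetic data; the variable names assigned_centers and wcss are chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 100 points each (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2))
               for loc in ([1, 0], [-2, 2], [-2, -2])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the center of the cluster each point was assigned to, then sum
# the squared distances between every point and that center.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss = ((X - assigned_centers) ** 2).sum()

print("manual sum of squared distances:", wcss)
print("scikit-learn inertia_:", kmeans.inertia_)
```

The two printed values should agree up to floating-point rounding.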

Example Program: Clustering Data using K-means

Now that we’ve discussed the basics of K-means clustering, let’s walk through an example program that demonstrates how to cluster data using K-means in Python. We’ll use the scikit-learn library to implement K-means clustering and matplotlib to visualize the results.

Step 1: Import Libraries and Generate Sample Data

The first step in our program is to import the necessary libraries and generate some sample data to cluster. We’ll use NumPy to generate the sample data and matplotlib to plot it.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data: three groups of 100 points each
np.random.seed(0)
X = np.vstack((np.random.randn(100, 2) * 0.75 + np.array([1, 0]),
               np.random.randn(100, 2) * 0.25 + np.array([-0.5, 0.5]),
               np.random.randn(100, 2) * 0.5 + np.array([-0.5, -0.5])))

# Plot sample data
plt.scatter(X[:, 0], X[:, 1])
plt.title("Sample Data")
plt.show()

In this code, we first import NumPy and matplotlib. We then generate 300 data points in 2D space with three distinct clusters using NumPy’s random.randn() function. We plot the sample data using matplotlib to visualize the clusters.

Step 2: Initialize K-means Algorithm

The next step in our program is to initialize a K-means clustering algorithm. We’ll use scikit-learn’s KMeans class to do this.

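# Initialize K-means clustering algorithm with 3 clusters
from sklearn.cluster import KMeans
# n_init and random_state are set explicitly so repeated runs give the same result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)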

In this code, we import the KMeans class from scikit-learn and initialize a K-means clustering algorithm with 3 clusters.

Step 3: Fit the K-means Model to the Data

Now that we have initialized a K-means clustering algorithm, the next step is to fit the algorithm to the sample data using the fit() method.

# Fit the K-means model to the data
kmeans.fit(X)

Step 4: Get Cluster Centers and Labels

Once the K-means algorithm has been fit to the data, we can extract the cluster centers and labels using the cluster_centers_ and labels_ attributes of the fitted KMeans object.
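A self-contained sketch of this step might look like the following. It regenerates the sample data from Step 1 so it can run on its own; the variable names centers and labels are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Regenerate the sample data from Step 1 so this snippet runs on its own.
np.random.seed(0)
X = np.vstack((np.random.randn(100, 2) * 0.75 + np.array([1, 0]),
               np.random.randn(100, 2) * 0.25 + np.array([-0.5, 0.5]),
               np.random.randn(100, 2) * 0.5 + np.array([-0.5, -0.5])))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# One row per cluster center, one label per data point.
centers = kmeans.cluster_centers_
labels = kmeans.labels_

print("Cluster centers:\n", centers)
print("First ten labels:", labels[:10])
```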

Step 5: Plot the Clusters

The final step in our program is to plot the clusters. We’ll use matplotlib to plot the data points colored by their assigned cluster.
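One way to write this step is sketched below. It repeats the data generation and fitting so it can run on its own; the plot title, marker size, and center color are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Regenerate the sample data and fit the model so this snippet runs on its own.
np.random.seed(0)
X = np.vstack((np.random.randn(100, 2) * 0.75 + np.array([1, 0]),
               np.random.randn(100, 2) * 0.25 + np.array([-0.5, 0.5]),
               np.random.randn(100, 2) * 0.5 + np.array([-0.5, -0.5])))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Color each data point by its assigned cluster label.
plt.scatter(X[:, 0], X[:, 1], c=labels)
# Draw the cluster centers as large red stars.
plt.scatter(centers[:, 0], centers[:, 1], marker="*", s=300, c="red")
plt.title("K-means Clustering")
plt.show()
```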

In this code, we plot the data points colored by their assigned cluster using the c parameter of the scatter() function. We also plot the cluster centers as stars using the marker and s parameters. Finally, we show the plot using matplotlib.

The combined output is:

Initial Data:
[[ 1.20024279 0.94152786]
[-1.06407133 0.32227104]
[ 0.93284903 -0.34803466]
[-1.07508205 0.32625187]
[-1.29344817 0.02026062]
[-1.07353087 0.33556923]
[ 0.95257523 -0.3679664 ]
[-1.1164242 0.41944988]
[-1.21484139 0.31131073]
[ 0.87304039 -0.42110185]
[-0.89728749 0.46521114]
[ 0.79975271 -0.63596077]
[ 1.0597567 0.93188322]
[-0.83206736 0.36060756]
[-0.8005183 0.44387677]
[-0.80013863 0.4524418 ]
[ 0.95020021 -0.37234786]
[-0.84243138 0.38241585]
[ 0.92553729 -0.33848863]
[ 1.02125907 1.06215682]
[ 0.93319152 -0.39132274]
[-0.87436845 0.31714505]
[-0.97047355 0.31931624]
[ 0.87320691 -0.38135194]
[ 0.88111763 -0.39581727]
[-0.8287985 0.37095979]
[ 0.97988822 -0.46777224]
[ 0.90643051 -0.37801775]
[ 1.07375819 1.03047572]
[ 0.96182773 -0.39557954]
[-0.93243978 0.40632804]
[ 0.98817998 -0.3687023 ]
[ 1.03167414 0.99578756]
[-1.09832715 0.40409171]
[-1.21163797 0.37564084]

Conclusion

K-means clustering is a popular unsupervised machine learning algorithm used to identify clusters in data. In this blog post, we walked through an example program that demonstrated how to cluster data using K-means in Python.

We used the scikit-learn library to implement the algorithm and matplotlib to visualize the results. By following this example, you should now have a solid understanding of how K-means clustering works and how to implement it in Python.