K-mean clustering and its real use-case in the security domain

Sweta Sardar
3 min readAug 11, 2021

What is K-Mean Cluster

K-means clustering is a simple and powerful unsupervised machine learning algorithm that is used to solve clustering problems. It follows a simple procedure of classifying a given data set into a number of clusters, defined by the letter “k,” which is fixed beforehand.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

Where to use k-means clustering?

K-means clustering has uses in search engines, market segmentation, statistics and even astronomy.

It is used for clustering analysis, especially in data mining and statistics. It aims to partition a set of observations into a number of clusters (k), resulting in the partitioning of the data into Voronoi cells.

How the K-means algorithm works

To process the learning data, the K-means algorithm in data mining starts with the first group of randomly selected centroids, which are used as the beginning points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids

It halts creating and optimizing clusters when either:

  • The centroids have stabilized — there is no change in their values because the clustering has been successful.
  • The defined number of iterations has been achieved.

Some use cases of k-means :

Cyber-profiling criminals

Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene.

Call record detail analysis

A call detail record (cdr) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. Most of the telecom companies use CDR information for fraud detection by clustering the user profiles, reducing customer churn by usage activity, and targeting the profitable customers by using RFM analysis.

Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect frauds is crucial.

Document classification

Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarity in document groups.

💠Keep Learning Keep Sharing💠

😊 !! Thank You !! 😊

--

--