An algorithm to find the value of k in k-means clustering algorithm

Introduction

This article aims to present an algorithm which helps you to sort out the problem of determining the value of k of K-means algorithm.

[edit]The Algorithm

  • Lets say we have 'n' points with us. Make a 2-D matrix of n*n dimension, i.e. M[n][n]. Its a distance matrix , which keeps the distance of one point to all other points.
For example:- Lets we have 10 points such as- A(2,6), B(3,4), C(3,8), D(4,7), E(6,2), F(6,4), G(7,3), H(7,4), I(8,5), J(7,6).
The M[10][10] matrix would be looking something like this-
Point A(2,6)Point B(3,4)Point C(3,8)Point D(4,7)Point E(6,2)Point F(6,4)Point G(7,3)Point H(7,4)Point I(8,5)Point J(7,6)
Point A(2,6)-2.232.232.235.645.645.835.386.025.38
Point B(3,4)2.23-43.163.6134.1245.14.47
Point C(3,8)2.234-1.416.7856.45.655.914.47
Point D(4,7)2.233.161.41-4.473.654.244.473.16
Point E(6,2)5.643.616.784.47-21.412.233.64.12
Point F(6,4)5.64353.62-1.4112.232.23
Point G(7,3)5.834.126.451.411.41-12.233
Point H(7,4)5.3845.654.242.2311-1.412
Point I(8,5)6.025.15.914.473.62.232.231.41-1.41
Point J(7,6)5.384.474.473.164.122.23321.41-
The entry of this matrix is filled by computing the distance between two points. for example: The distance between point(2,6) and point(3,4) can be calculated as sqrt((2-3)2+(6-4)2)
  • On the basis of this matrix, build another matrix namely L[n][n], here the value of n is 10, of course.
Matrix L[10][10] looks something like this-
Point A(2,6)Point B(3,4)Point C(3,8)Point D(4,7)Point E(6,2)Point F(6,4)Point G(7,3)Point H(7,4)Point I(8,5)Point J(7,6)
Point A(2,6)-100000000
Point B(3,4)1-00000000
Point C(3,8)10-0000000
Point D(4,7)001-000000
Point E(6,2)0000-01000
Point F(6,4)00000-0100
Point G(7,3)000000-100
Point H(7,4)0000010-00
Point I(8,5)00000001-0
Point J(7,6)000000001-
On the basis of this matrix, value of k can be determined easily. For instance, in this example points B and C can be clustered together. Similarly, points F,G ans I can be clustered in one cluster. point A can be clustered alone. point D can be clustered alone. point E can be clustered alone point H can be clustered alone point J can be clustered alone
  • Hence the value of K is 7.

[edit]Future Scope

Value of k can be further optimised as follows-
In the example illustrated above point A is closer to point B. Hence point A can be clustered into B and C. And so as the case with G. Forwarding in this way, we can obtained an optimised value of K.

Comments