An algorithm to find the value of k in the k-means clustering algorithm
Introduction
This article presents a simple algorithm for determining the value of k for the k-means clustering algorithm.
The Algorithm
- Let's say we have n points. Build an n*n matrix M[n][n]. It is a distance matrix, which stores the distance from each point to every other point.
For example, suppose we have 10 points: A(2,6), B(3,4), C(3,8), D(4,7), E(6,2), F(6,4), G(7,3), H(7,4), I(8,5), J(7,6).
The matrix M[10][10] looks something like this:
| | Point A(2,6) | Point B(3,4) | Point C(3,8) | Point D(4,7) | Point E(6,2) | Point F(6,4) | Point G(7,3) | Point H(7,4) | Point I(8,5) | Point J(7,6) |
|---|---|---|---|---|---|---|---|---|---|---|
| Point A(2,6) | - | 2.24 | 2.24 | 2.24 | 5.66 | 4.47 | 5.83 | 5.39 | 6.08 | 5.00 |
| Point B(3,4) | 2.24 | - | 4.00 | 3.16 | 3.61 | 3.00 | 4.12 | 4.00 | 5.10 | 4.47 |
| Point C(3,8) | 2.24 | 4.00 | - | 1.41 | 6.71 | 5.00 | 6.40 | 5.66 | 5.83 | 4.47 |
| Point D(4,7) | 2.24 | 3.16 | 1.41 | - | 5.39 | 3.61 | 5.00 | 4.24 | 4.47 | 3.16 |
| Point E(6,2) | 5.66 | 3.61 | 6.71 | 5.39 | - | 2.00 | 1.41 | 2.24 | 3.61 | 4.12 |
| Point F(6,4) | 4.47 | 3.00 | 5.00 | 3.61 | 2.00 | - | 1.41 | 1.00 | 2.24 | 2.24 |
| Point G(7,3) | 5.83 | 4.12 | 6.40 | 5.00 | 1.41 | 1.41 | - | 1.00 | 2.24 | 3.00 |
| Point H(7,4) | 5.39 | 4.00 | 5.66 | 4.24 | 2.24 | 1.00 | 1.00 | - | 1.41 | 2.00 |
| Point I(8,5) | 6.08 | 5.10 | 5.83 | 4.47 | 3.61 | 2.24 | 2.24 | 1.41 | - | 1.41 |
| Point J(7,6) | 5.00 | 4.47 | 4.47 | 3.16 | 4.12 | 2.24 | 3.00 | 2.00 | 1.41 | - |
Each entry of the matrix is the Euclidean distance between the corresponding pair of points. For example, the distance between A(2,6) and B(3,4) is sqrt((2-3)^2 + (6-4)^2) ≈ 2.24.
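As a minimal sketch, the distance matrix for the example above can be computed in a few lines of Python; here the labels A through J are assumed to correspond to list indices 0 through 9, and `math.dist` (Python 3.8+) is used for the Euclidean distance:

```python
import math

# The ten example points A..J from the article, stored as (x, y) tuples
# (index 0 is A, index 1 is B, and so on).
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
n = len(points)

# M[i][j] holds the Euclidean distance between point i and point j.
M = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

print(round(M[0][1], 2))  # distance A-B, prints 2.24
```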
- On the basis of this matrix, build another matrix, namely L[n][n] (here the value of n is 10, of course). Set L[i][j] = 1 if point j is the nearest neighbour of point i, and 0 otherwise; when two points are equally close, the one listed first wins.
The matrix L[10][10] looks something like this:
| | Point A(2,6) | Point B(3,4) | Point C(3,8) | Point D(4,7) | Point E(6,2) | Point F(6,4) | Point G(7,3) | Point H(7,4) | Point I(8,5) | Point J(7,6) |
|---|---|---|---|---|---|---|---|---|---|---|
| Point A(2,6) | - | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Point B(3,4) | 1 | - | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Point C(3,8) | 0 | 0 | - | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Point D(4,7) | 0 | 0 | 1 | - | 0 | 0 | 0 | 0 | 0 | 0 |
| Point E(6,2) | 0 | 0 | 0 | 0 | - | 0 | 1 | 0 | 0 | 0 |
| Point F(6,4) | 0 | 0 | 0 | 0 | 0 | - | 0 | 1 | 0 | 0 |
| Point G(7,3) | 0 | 0 | 0 | 0 | 0 | 0 | - | 1 | 0 | 0 |
| Point H(7,4) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | - | 0 | 0 |
| Point I(8,5) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | - | 0 |
| Point J(7,6) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | - |
On the basis of this matrix, the value of k can be determined easily: points that mark each other as nearest neighbour are clustered together, and every remaining point forms a cluster on its own. In this example, points A and B can be clustered together, as can C and D, and F and H. Points E, G, I and J each form a cluster alone.
- Hence the value of k is 7.
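The whole procedure can be sketched in Python under the reading above: L marks each point's nearest neighbour, mutual nearest neighbours are paired into clusters, and every remaining point counts as its own cluster (ties broken toward the lower index). This is a sketch of one consistent interpretation, not a reference implementation:

```python
import math

# The ten example points A..J from the article, as indices 0..9.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
n = len(points)

# Step 1: the distance matrix M.
M = [[math.dist(p, q) for q in points] for p in points]

# Step 2: nn[i] is the index of i's nearest neighbour
# (min() keeps the first minimum, so ties go to the lower index).
nn = [min((j for j in range(n) if j != i), key=lambda j: M[i][j])
      for i in range(n)]

# L[i][j] = 1 exactly when j is the nearest neighbour of i.
L = [[1 if j == nn[i] else 0 for j in range(n)] for i in range(n)]

# Step 3: mutual nearest neighbours form a two-point cluster;
# every other point is a cluster on its own.
clustered = set()
k = 0
for i in range(n):
    if i in clustered:
        continue
    if nn[nn[i]] == i:
        clustered.update({i, nn[i]})
    k += 1

print(k)  # prints 7 for this example
```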
Future Scope
The value of k can be optimised further as follows:
In the example illustrated above, the singleton clusters sit close to existing pairs. G's nearest neighbour H already belongs to the cluster {F, H}, so G can be merged into that cluster, and similarly for E, I and J. Proceeding in this way, we can obtain a smaller, optimised value of k.
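One way to carry this merging through to completion is to merge every point with the cluster containing its nearest neighbour, which amounts to taking the connected components of the nearest-neighbour graph. This is only one possible reading of the optimisation step, sketched here with a small union-find:

```python
import math

# Same ten example points A..J as before, indices 0..9.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
n = len(points)
M = [[math.dist(p, q) for q in points] for p in points]
nn = [min((j for j in range(n) if j != i), key=lambda j: M[i][j])
      for i in range(n)]

# Union-find over the edges (i, nn[i]): merging every point with its
# nearest neighbour collapses chains of singletons into nearby clusters.
parent = list(range(n))

def find(x):
    # Find the root of x, halving the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for i in range(n):
    parent[find(i)] = find(nn[i])

k_opt = len({find(i) for i in range(n)})
print(k_opt)  # prints 3 for this example
```

With this rule the three mutual pairs survive, and E, G, I and J are absorbed into the cluster around F and H, so k shrinks from 7 to 3.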