Hello everyone,
I'm using the k-means example as basis for a custom implementation and I noticed the following behavior: If during an iteration no point is assigned to a particular cluster, this cluster will then "disappear". This happens because SelectNearestCenter() outputs <centroidId, point> tuples, (where centroidId is the chosen center by the point) and these are then grouped by centroidId to compute the new centers. If no point selects a particular centroid, this centroid will not appear in subsequent iterations. For example, assume we have the points { (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and {2, (5, 0)}. Initially, point (2, 0) will be assigned to centroid 1, but then after centroid 1 moves closer to (-10, 0) point(2, 0) will not be reassigned to cluster 2. Is this intended behavior? This seemed odd to me, but I couldn't really find any resources that define the "correct" behavior.. It seems that handling such a situation is implementation-specific. I think that if we keep it this way, we might want to add a comment in the example though :) Cheers, V. |
I think the behaviour is correct. If a cluster has not points then it
has no centroid. If it has no centroid no points could ever be assigned to it again in the future since there is no way of calculating a distance. On Tue, Feb 24, 2015 at 6:57 PM, Vasiliki Kalavri <[hidden email]> wrote: > Hello everyone, > > I'm using the k-means example as basis for a custom implementation and I > noticed the following behavior: If during an iteration no point is assigned > to a particular cluster, this cluster will then "disappear". > This happens because SelectNearestCenter() outputs <centroidId, point> > tuples, (where centroidId is the chosen center by the point) and these are > then grouped by centroidId to compute the new centers. If no point selects > a particular centroid, this centroid will not appear in subsequent > iterations. > > For example, assume we have the points > { (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and {2, > (5, 0)}. > Initially, point (2, 0) will be assigned to centroid 1, but then after > centroid 1 moves closer to (-10, 0) point(2, 0) will not be reassigned to > cluster 2. > > Is this intended behavior? > This seemed odd to me, but I couldn't really find any resources that define > the "correct" behavior.. It seems that handling such a situation is > implementation-specific. I think that if we keep it this way, we might want > to add a comment in the example though :) > > Cheers, > V. |
Apache's commons-math implementation offers various strategies for handling
this scenarios: http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.clustering/KMeansPlusPlusClusterer.java.html (take a look at the EmptyClusterStrategy enum options) 2015-02-24 23:28 GMT+01:00 Aljoscha Krettek <[hidden email]>: > I think the behaviour is correct. If a cluster has not points then it > has no centroid. If it has no centroid no points could ever be > assigned to it again in the future since there is no way of > calculating a distance. > > On Tue, Feb 24, 2015 at 6:57 PM, Vasiliki Kalavri > <[hidden email]> wrote: > > Hello everyone, > > > > I'm using the k-means example as basis for a custom implementation and I > > noticed the following behavior: If during an iteration no point is > assigned > > to a particular cluster, this cluster will then "disappear". > > This happens because SelectNearestCenter() outputs <centroidId, point> > > tuples, (where centroidId is the chosen center by the point) and these > are > > then grouped by centroidId to compute the new centers. If no point > selects > > a particular centroid, this centroid will not appear in subsequent > > iterations. > > > > For example, assume we have the points > > { (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and > {2, > > (5, 0)}. > > Initially, point (2, 0) will be assigned to centroid 1, but then after > > centroid 1 moves closer to (-10, 0) point(2, 0) will not be reassigned to > > cluster 2. > > > > Is this intended behavior? > > This seemed odd to me, but I couldn't really find any resources that > define > > the "correct" behavior.. It seems that handling such a situation is > > implementation-specific. I think that if we keep it this way, we might > want > > to add a comment in the example though :) > > > > Cheers, > > V. > |
Thanks for the replies guys!
@Aljoscha: I get your point, but I would actually expect either an error message or the lonely centroid to move. k-means is supposed to cluster data in k clusters. If you end up with < k, something must have gone wrong.. :s @Alex: very helpful resource, thanks. I will probably use one of these strategies. Cheers, V. On 25 February 2015 at 15:36, Alexander Alexandrov < [hidden email]> wrote: > Apache's commons-math implementation offers various strategies for handling > this scenarios: > > > http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.clustering/KMeansPlusPlusClusterer.java.html > > (take a look at the EmptyClusterStrategy enum options) > > 2015-02-24 23:28 GMT+01:00 Aljoscha Krettek <[hidden email]>: > > > I think the behaviour is correct. If a cluster has not points then it > > has no centroid. If it has no centroid no points could ever be > > assigned to it again in the future since there is no way of > > calculating a distance. > > > > On Tue, Feb 24, 2015 at 6:57 PM, Vasiliki Kalavri > > <[hidden email]> wrote: > > > Hello everyone, > > > > > > I'm using the k-means example as basis for a custom implementation and > I > > > noticed the following behavior: If during an iteration no point is > > assigned > > > to a particular cluster, this cluster will then "disappear". > > > This happens because SelectNearestCenter() outputs <centroidId, point> > > > tuples, (where centroidId is the chosen center by the point) and these > > are > > > then grouped by centroidId to compute the new centers. If no point > > selects > > > a particular centroid, this centroid will not appear in subsequent > > > iterations. > > > > > > For example, assume we have the points > > > { (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and > > {2, > > > (5, 0)}. > > > Initially, point (2, 0) will be assigned to centroid 1, but then after > > > centroid 1 moves closer to (-10, 0) point(2, 0) will not be reassigned > to > > > cluster 2. > > > > > > Is this intended behavior? > > > This seemed odd to me, but I couldn't really find any resources that > > define > > > the "correct" behavior.. It seems that handling such a situation is > > > implementation-specific. I think that if we keep it this way, we might > > want > > > to add a comment in the example though :) > > > > > > Cheers, > > > V. > > > |
Free forum by Nabble | Edit this page |