(DEPRECATED) Apache Flink Mailing List archive.

k-means example behavior

Vasiliki Kalavri

k-means example behavior

Hello everyone,

I'm using the k-means example as the basis for a custom implementation and I
noticed the following behavior: if no point is assigned to a particular
cluster during an iteration, that cluster "disappears".
This happens because SelectNearestCenter() outputs <centroidId, point>
tuples (where centroidId is the center chosen by the point), and these are
then grouped by centroidId to compute the new centers. If no point selects
a particular centroid, no group exists for it, so it does not appear in
subsequent iterations.

For example, assume we have the points
{ (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and {2,
(5, 0)}.
In the first iteration all three points, including (2, 0), are assigned to
centroid 1, so centroid 2 receives no points and is dropped. Centroid 1 then
moves towards (-10, 0), to (-16/3, 0), but point (2, 0) cannot be reassigned
to cluster 2 because that cluster no longer exists.
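
A minimal sketch of this behavior in plain Python, not the Flink example itself: `nearest_centroid` and `kmeans_step` are hypothetical stand-ins for SelectNearestCenter() and the group-by-centroidId step, and Euclidean distance is assumed.

```python
from collections import defaultdict
import math


def nearest_centroid(point, centroids):
    """Return the id of the centroid closest to `point` (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(point, centroids[cid]))


def kmeans_step(points, centroids):
    """One iteration: assign points, then rebuild centroids from the groups only."""
    groups = defaultdict(list)
    for p in points:
        groups[nearest_centroid(p, centroids)].append(p)
    # A centroid id that no point selected never becomes a key in `groups`,
    # so it is silently absent from the returned dict.
    return {
        cid: tuple(sum(coord) / len(members) for coord in zip(*members))
        for cid, members in groups.items()
    }


points = [(-10.0, 0.0), (-8.0, 0.0), (2.0, 0.0)]
centroids = {1: (0.0, 0.0), 2: (5.0, 0.0)}

after_one = kmeans_step(points, centroids)
print(after_one)  # only centroid id 1 is left
```

After this single step, (2, 0) is farther from the moved centroid 1 than it was from the original centroid 2, but there is nothing left to compare against.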

Is this intended behavior?
This seemed odd to me, but I couldn't really find any resources that define
the "correct" behavior. It seems that handling such a situation is
implementation-specific. If we keep it this way, we might want to add a
comment to the example, though :)

Cheers,
V.

Aljoscha Krettek

Re: k-means example behavior

I think the behaviour is correct. If a cluster has no points, then it
has no centroid. If it has no centroid, no points can ever be
assigned to it again, since there is no way of
calculating a distance.
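
Written out with the usual centroid definition (the symbols $C_j$ for the set of points assigned to cluster $j$ and $\mu_j$ for its centroid are notation assumed here, not taken from the example code):

```latex
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x ,
\qquad
C_j = \emptyset \;\Rightarrow\; |C_j| = 0 \;\Rightarrow\; \mu_j \text{ is undefined}
```

With $\mu_j$ undefined, the distance $\lVert x - \mu_j \rVert$ cannot be evaluated for any point $x$ in the next iteration.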

On Tue, Feb 24, 2015 at 6:57 PM, Vasiliki Kalavri
<[hidden email]> wrote:

> Hello everyone,
>
> I'm using the k-means example as the basis for a custom implementation and I
> noticed the following behavior: if no point is assigned to a particular
> cluster during an iteration, that cluster "disappears".
> This happens because SelectNearestCenter() outputs <centroidId, point>
> tuples (where centroidId is the center chosen by the point), and these are
> then grouped by centroidId to compute the new centers. If no point selects
> a particular centroid, no group exists for it, so it does not appear in
> subsequent iterations.
>
> For example, assume we have the points
> { (-10, 0), (-8, 0), (2, 0) } and the initial centroids {1, (0, 0)} and {2,
> (5, 0)}.
> In the first iteration all three points, including (2, 0), are assigned to
> centroid 1, so centroid 2 receives no points and is dropped. Centroid 1 then
> moves towards (-10, 0), to (-16/3, 0), but point (2, 0) cannot be reassigned
> to cluster 2 because that cluster no longer exists.
>
> Is this intended behavior?
> This seemed odd to me, but I couldn't really find any resources that define
> the "correct" behavior. It seems that handling such a situation is
> implementation-specific. If we keep it this way, we might want to add a
> comment to the example, though :)
>
> Cheers,
> V.

aalexandrov

Re: k-means example behavior

Apache's commons-math implementation offers various strategies for handling
these scenarios:

http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.clustering/KMeansPlusPlusClusterer.java.html

(take a look at the EmptyClusterStrategy enum options)
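
As a rough illustration of what such a strategy can look like, here is a sketch of the idea behind a farthest-point option: a missing centroid is re-seeded at the point that lies farthest from its own centroid. This is not the commons-math code; `reseed_empty_clusters` and its parameters are hypothetical names, and the sketch does not guard against emptying another cluster when it steals that cluster's only point.

```python
import math


def reseed_empty_clusters(points, assignment, centroids, expected_ids):
    """Give every missing centroid id the point farthest from its own centroid.

    points       -- list of coordinate tuples
    assignment   -- list of centroid ids, parallel to `points`
    centroids    -- dict of centroid id -> coordinates for non-empty clusters
    expected_ids -- all k centroid ids that should exist
    """
    centroids = dict(centroids)      # work on copies, leave inputs untouched
    assignment = list(assignment)
    for cid in expected_ids:
        if cid in centroids:
            continue
        # Pick the point with the largest distance to its current centroid.
        idx = max(
            range(len(points)),
            key=lambda i: math.dist(points[i], centroids[assignment[i]]),
        )
        centroids[cid] = points[idx]
        assignment[idx] = cid
    return centroids, assignment


# State after the first iteration of the example from this thread:
# every point belongs to centroid 1 and centroid 2 has vanished.
points = [(-10.0, 0.0), (-8.0, 0.0), (2.0, 0.0)]
assignment = [1, 1, 1]
centroids = {1: (-16.0 / 3.0, 0.0)}

new_centroids, new_assignment = reseed_empty_clusters(
    points, assignment, centroids, expected_ids=[1, 2]
)
print(new_centroids, new_assignment)
```

In the thread's example this places centroid 2 on (2, 0), the point farthest from centroid 1, so k clusters exist again for the next iteration.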

2015-02-24 23:28 GMT+01:00 Aljoscha Krettek <[hidden email]>:

> I think the behaviour is correct. If a cluster has no points then it
> has no centroid. If it has no centroid no points could ever be
> assigned to it again in the future since there is no way of
> calculating a distance.

Vasiliki Kalavri

Re: k-means example behavior

Thanks for the replies, guys!

@Aljoscha: I get your point, but I would actually expect either an error
message or the lonely centroid to move.
k-means is supposed to partition the data into k clusters. If you end up with
fewer than k, something must have gone wrong. :s
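
A sketch of how an update step could make the situation explicit, again in plain Python rather than Flink: `update_centroids` and its `on_empty` parameter are hypothetical. The "error" mode matches the error message I would expect; the "keep" mode is an extra assumption of this sketch that carries an unselected centroid over unchanged, which is simpler than moving it.

```python
from collections import defaultdict
import math


def update_centroids(points, centroids, on_empty="error"):
    """Recompute centroids; `on_empty` decides what happens to unselected ones."""
    groups = defaultdict(list)
    for p in points:
        cid = min(centroids, key=lambda c: math.dist(p, centroids[c]))
        groups[cid].append(p)

    updated = {}
    # Iterate over the *old* centroids so that no id can drop out unnoticed.
    for cid, old_position in centroids.items():
        members = groups.get(cid)
        if members:
            updated[cid] = tuple(sum(c) / len(members) for c in zip(*members))
        elif on_empty == "keep":
            updated[cid] = old_position  # carry the centroid over unchanged
        else:
            raise ValueError(f"centroid {cid} received no points")
    return updated


points = [(-10.0, 0.0), (-8.0, 0.0), (2.0, 0.0)]
centroids = {1: (0.0, 0.0), 2: (5.0, 0.0)}

step1 = update_centroids(points, centroids, on_empty="keep")
step2 = update_centroids(points, step1, on_empty="keep")
print(step1)
print(step2)
```

With "keep", centroid 2 survives the first iteration at (5, 0), and in the second iteration point (2, 0) is reassigned to it once centroid 1 has moved away.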

@Alex: very helpful resource, thanks. I will probably use one of these
strategies.

Cheers,
V.

On 25 February 2015 at 15:36, Alexander Alexandrov <
[hidden email]> wrote:

> Apache's commons-math implementation offers various strategies for handling
> these scenarios:
>
>
> http://commons.apache.org/proper/commons-math/jacoco/org.apache.commons.math3.stat.clustering/KMeansPlusPlusClusterer.java.html
>
> (take a look at the EmptyClusterStrategy enum options)