[HN Gopher] Introduction to K-Means Clustering
       ___________________________________________________________________
        
       Introduction to K-Means Clustering
        
       Author : gk1
       Score  : 168 points
       Date   : 2022-03-14 15:12 UTC (7 hours ago)
        
 (HTM) web link (www.pinecone.io)
 (TXT) w3m dump (www.pinecone.io)
        
       | jdeaton wrote:
       | After having implemented k-means ~5 times in undergraduate
       | classes, it was interesting to learn that k-means is just a
       | special case of expectation maximization.
        
         | petters wrote:
         | ...which in turn is just a special case of block coordinate
         | descent. One set of variables is fixed, while another set is
         | optimized and vice versa.
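         
          A minimal sketch of that alternating view in plain NumPy: the
          assignment step fixes the centroids and optimizes the labels, and
          the update step fixes the labels and optimizes the centroids
          (illustrative code, not taken from the article):
         
              import numpy as np
         
              def kmeans(X, k, iters=100, seed=0):
                  rng = np.random.default_rng(seed)
                  C = X[rng.choice(len(X), k, replace=False)].astype(float)
                  for _ in range(iters):
                      # fix centroids, optimize assignments ("E"-like step)
                      d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
                      labels = d.argmin(axis=1)
                      # fix assignments, optimize centroids ("M"-like step)
                      new = np.array([X[labels == j].mean(axis=0)
                                      if np.any(labels == j) else C[j]
                                      for j in range(k)])
                      if np.allclose(new, C):
                          break
                      C = new
                  return C, labels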
        
       | amelius wrote:
       | How do you benchmark a clustering algorithm?
        
       | mistrial9 wrote:
       | times are changing online -- I definitely recall a two volume,
       | hard bound, >$100USD edition of "Clustering Methods" in the
        | university book store and did not buy it. I was mocked by my
        | non-native-English colleague, who has a Master's in Computer
        | Science from a great (non-English) university, about whether I
        | could even read that work. Now I mainly look at the Scikit-learn
        | docs.
       | Good work pinecone dot IO.
        
         | monkeybutton wrote:
          | I think it's now turned into somewhat of a competition. If you
          | want your algorithm/package adopted, you have to make it open
          | source and provide good documentation like this:
         | https://lifelines.readthedocs.io/en/latest/Survival%20Analys...
         | https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.h...
        
           | danuker wrote:
            | Props to Lifelines. It arranged my thoughts on how to handle
            | survival data by requiring the data to be in a clear format.
        
       | oofbey wrote:
       | K-means has a lot going for it, but it's not very good at
       | recovering "natural" clusters. Like if you have a dataset that
       | you know actually is grouped in a certain way, k-means very often
       | will not discover those clusters. Other more complex clustering
       | algorithms will do a more reliable job. If you have to implement
       | clustering yourself for some reason, k-means' simplicity is
       | great. But how often do you need to write your own algorithm
       | these days?
        
       | [deleted]
        
       | notafraudster wrote:
       | One thing that's nice about k-means is that it can quite
        | literally be grasped intuitively by, say, a 10-year-old.
       | k-nearest neighbors is similarly graspable. I think these are
       | wonderful techniques to introduce kids to data science without
       | having to get into the math. Obviously you're going to have to
       | abstract away high-dimensionality and a lot of the underlying
       | math of "how" -- but that's ok!
       | 
       | I'm not sure I've found either to be massively useful on real-
       | world data, but that's OK too!
        
       | globular-toast wrote:
        | One thing not mentioned with k-means is that it requires your data
        | to be in Euclidean space. You can't just come up with a distance
        | function and do clustering with it. That's one reason the other
       | techniques often make sense. In fact, those techniques often
       | don't care what the data actually is, they only need the
       | distances, and even then they don't need all of them.
       | 
       | Still, k-means is enormously practical despite its shortcomings.
        
         | n4r9 wrote:
         | An alternative I've used is k-medoids:
         | 
         | https://en.m.wikipedia.org/wiki/K-medoids
         | 
         | This is especially useful in cases where the distance between
         | points is given by a matrix, and cannot be derived from the
         | points themselves. For example, the drive time between
         | locations on a road network.
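         
          A rough usage sketch with scikit-learn-extra's KMedoids and a
          precomputed distance matrix; the drive-time numbers below are made
          up:
         
              import numpy as np
              from sklearn_extra.cluster import KMedoids
         
              # hypothetical symmetric drive times (minutes) between 5 sites
              drive_time = np.array([
                  [ 0., 10., 25., 40., 35.],
                  [10.,  0., 20., 45., 30.],
                  [25., 20.,  0., 15., 50.],
                  [40., 45., 15.,  0., 55.],
                  [35., 30., 50., 55.,  0.],
              ])
         
              km = KMedoids(n_clusters=2, metric="precomputed", random_state=0)
              labels = km.fit_predict(drive_time)
              print(labels, km.medoid_indices_)   # cluster ids and medoid rows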
        
         | whiplash451 wrote:
         | Not sure why this was down voted.
         | 
          | K-means in its standard form does make a Euclidean assumption
          | (it does not minimize a distance but the within-cluster
          | variance).
          | 
          | It is not correct to use arbitrary distances, because k-means
          | may stop converging with other distance functions.
         | 
         | See this thread for more info:
         | https://stats.stackexchange.com/questions/81481/why-does-k-m...
        
           | isoprophlex wrote:
            | That's a really pedantic take; nothing precludes you from
            | using whatever distance function you want to get the distance
            | to the k centroids and act accordingly.
        
             | amilios wrote:
             | Losing convergence guarantees is somewhat important!
        
               | isoprophlex wrote:
                | I totally agree with your point, but sometimes you can
                | get away with being handwavey about it and making sure the
                | centroids aren't skipping all over the place... and
                | that's good enough.
        
         | isoprophlex wrote:
         | But you can. Get the distance to all your centroids, assign to
         | closest centroid, update centroid location.
         | 
         | The distance function can be anything.
        
           | globular-toast wrote:
           | What's a centroid? It's the point that minimises the sum of
           | squared Euclidean distances between itself and every point in
           | the cluster.
        
             | isoprophlex wrote:
              | It's a tuple of floats that you move slightly towards each
              | point it has just seen (when it is the closest centroid to
              | that point), optionally with some rate of forgetting.
             | 
             | You shuffle your big ass wad of data, and loop over each
             | element doing the above, optionally decreasing the
             | forgetting rate.
             | 
             | You track the location of your centroids and if your data /
             | loss fn isn't too pathological, they won't misbehave too
             | badly and you have handwavey assurance of convergence...
             | Definitions be damned.
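         
              A tiny sketch of that online update, assuming squared
              Euclidean distance and a decaying learning ("forgetting")
              rate; the function and its defaults are illustrative:
         
                  import numpy as np
         
                  def online_kmeans(X, k, epochs=5, lr=0.1, decay=0.999,
                                    seed=0):
                      rng = np.random.default_rng(seed)
                      X = np.asarray(X, dtype=float).copy()
                      C = X[rng.choice(len(X), k, replace=False)].copy()
                      for _ in range(epochs):
                          rng.shuffle(X)       # shuffle the wad of data
                          for x in X:
                              j = np.argmin(((C - x) ** 2).sum(axis=1))
                              C[j] += lr * (x - C[j])   # nudge centroid
                              lr *= decay      # decrease forgetting rate
                      return C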
        
               | civilized wrote:
               | So, you made up your own algorithm that is kinda like
               | k-means, this algorithm might sometimes work, and this
               | shows that k-means works in non-Euclidean spaces?
        
         | CardenB wrote:
         | What prevents you from using a different distance than
         | Euclidean distance?
        
           | danuker wrote:
            | Suppose you use prices and square meters to cluster
            | apartments. Suppose you also use a simple metric, like:
            | distance <- sqrt(d_price^2 + d_area^2)
           | 
           | Without normalizing the units somehow, so that they have a
           | meaningfully similar numeric expression, you are essentially
           | only clustering by price, which is in the tens- or hundreds-
           | of-thousands.
           | 
           | You use very little information about the area, because that
           | is in the hundreds, at most thousands.
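         
            A quick illustration of that scaling problem with made-up
            price/area numbers; standardizing the features first lets the
            area matter too:
         
                import numpy as np
                from sklearn.cluster import KMeans
                from sklearn.preprocessing import StandardScaler
         
                # hypothetical apartments: [price in EUR, area in m^2]
                X = np.array([[200_000., 40.], [210_000., 42.],
                              [205_000., 100.], [215_000., 98.]])
         
                km = KMeans(n_clusters=2, n_init=10, random_state=0)
                raw = km.fit_predict(X)
                scaled = km.fit_predict(StandardScaler().fit_transform(X))
                print(raw)     # tends to split by price alone, mixing areas
                print(scaled)  # splits by area once both features are scaled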
        
             | nerdponx wrote:
             | This has nothing to do with using non-Euclidean distance,
             | though. This is a separate issue related to having features
             | on different scales.
        
             | jacquesm wrote:
             | That is not a proper distance function.
        
               | danuker wrote:
               | What do you mean? Is it not the Euclidean distance?
               | 
               | https://en.wikipedia.org/wiki/Euclidean_distance
        
           | [deleted]
        
           | amilios wrote:
           | The way k-means is constructed is not based on distances.
           | 
           | K-means minimizes within-cluster variance. If you look at the
           | definition of variance, it is identical to the sum of squared
            | Euclidean distances from the centroid.
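         
            In symbols, the objective k-means minimizes (written out here
            for reference, not quoted from the article):
         
                \min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{x \in C_j}
                    \lVert x - \mu_j \rVert^2,
                \qquad \mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x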
        
             | nerdponx wrote:
             | Do you lose convergence guarantees if you minimize "sum of
             | squared distances from the centroid" using your other-than-
             | Euclidean distance metric?
        
         | fmakunbound wrote:
         | It's been ages since I've used K-means, but I thought you had a
         | lot of freedom in how you define your distance function.
        
       | tus666 wrote:
       | Why not just do a proper course in data science?
        
         | nerdponx wrote:
         | Why read articles about computer science when you could take a
         | proper course in computer science?
        
           | tus666 wrote:
           | Because it is far more generic and likely to be actually
           | useful.
           | 
           | Could you imagine someone doing K-means outside of a specific
           | data science application?
        
             | magicalhippo wrote:
             | > Could you imagine someone doing K-means outside of a
             | specific data science application?
             | 
             | I implemented k-means clustering to reduce noise in a path
             | tracer. I've never done a data science course or anything
             | like that.
        
       | lichtenberger wrote:
       | That was one of the first algorithms I learned at the university,
       | thanks for posting :-)
        
       | Bostonian wrote:
       | Mixtures of multivariate normals can be regarded as a method of
       | fuzzy clustering, where each point has a probability of coming
       | from each of the component Gaussians. The EM (expectation
       | maximization) algorithm can be used to fit mixture models.
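         
        A short sketch of that soft-clustering idea with scikit-learn's
        GaussianMixture, which is fitted by EM; the data below is synthetic:
         
            import numpy as np
            from sklearn.mixture import GaussianMixture
         
            rng = np.random.default_rng(0)
            # two synthetic Gaussian blobs in 2-D
            X = np.vstack([rng.normal(0, 1, (100, 2)),
                           rng.normal(5, 1, (100, 2))])
         
            gm = GaussianMixture(n_components=2, random_state=0).fit(X)
            probs = gm.predict_proba(X)  # soft membership per component
            hard = gm.predict(X)         # arg-max gives hard labels
            print(probs[:3].round(3), hard[:5])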
        
       | sideproject wrote:
        | Simplicity wins in many cases, and that's true of K-Means
        | clustering. There are many clustering algorithms far superior to
        | K-Means, but its simplicity and its easy-to-apply approach to most
        | available datasets let pretty much anyone with programming skills
        | understand it, implement it quickly, get results, and have that
        | feeling of "oh, hey, that's my first data science thingy".
        | Simplicity is very powerful.
        
       | jacquesm wrote:
        | K-Means clustering is one of those tools that, once you understand
        | it, becomes more and more useful over time. You keep finding new
        | ways to apply it; it's a bit like Quicksort, Bresenham, or Divide
        | and Conquer in that sense.
        
         | csunbird wrote:
          | It is also super intuitive. The logic makes sense immediately
          | when you see it for the first time.
        
         | Helmut10001 wrote:
         | Working in spatial data science, I rarely find applications
         | where k-means is the best tool. The problem is that it is
         | difficult to know how many clusters you can expect on maps. Is
         | it 5; 500; or 10,000? Here HDBSCAN [1] shines because it will
          | cluster _and_ select the most suitable number of clusters by
          | cutting the single-linkage cluster tree.
         | 
         | [1]: https://github.com/scikit-learn-contrib/hdbscan
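         
          A minimal usage sketch of the linked hdbscan package; the data and
          the min_cluster_size value are only examples:
         
              import numpy as np
              import hdbscan
         
              rng = np.random.default_rng(0)
              # three synthetic blobs; the algorithm picks k on its own
              X = np.vstack([rng.normal(c, 0.3, (200, 2)) for c in (0, 3, 6)])
         
              labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(X)
              n_clusters = len(set(labels) - {-1})   # -1 marks noise points
              print(n_clusters, "clusters found")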
        
           | falkenb0t wrote:
           | I found myself having to do some clustering a few months back
           | and learned about mean-shift clustering [1]. I found it to be
            | a nice clustering algorithm for when you don't know the number
            | of clusters ahead of time.
            | 
            | That said, it's worth noting that there are methods for
            | determining the optimal number of clusters for k-means (though
            | personally I found them to be costly and subject to
            | overfitting), like using the silhouette coefficient or the
            | elbow method [2].
           | 
           | [1]: https://www.geeksforgeeks.org/ml-mean-shift-clustering/
           | [2]: https://towardsdatascience.com/k-means-clustering-how-
           | it-wor...
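         
            A brief sketch of sweeping k and scoring each fit with the
            silhouette coefficient in scikit-learn; the data and the range
            of k are illustrative:
         
                import numpy as np
                from sklearn.cluster import KMeans
                from sklearn.metrics import silhouette_score
         
                rng = np.random.default_rng(0)
                X = np.vstack([rng.normal(c, 0.5, (100, 2))
                               for c in (0, 4, 8)])
         
                scores = {}
                for k in range(2, 8):
                    labels = KMeans(n_clusters=k, n_init=10,
                                    random_state=0).fit_predict(X)
                    scores[k] = silhouette_score(X, labels)
         
                best_k = max(scores, key=scores.get)
                print(scores, "-> best k:", best_k)   # should favour k = 3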
        
             | CuriouslyC wrote:
              | A bigger problem with k-means than figuring out the number
              | of clusters is the fact that the model assumes multivariate
              | Gaussian spheres. That's a bad model for most non-toy data.
        
               | Royi wrote:
                | It's not only limited to Gaussian spheres but also to
                | isotropic clusters (unless the data is preprocessed or the
                | distance is defined by the Mahalanobis distance).
        
               | jacquesm wrote:
               | Can you explain this please?
        
             | nerdponx wrote:
             | Note also that specifically for one-dimensional data, there
             | is a globally optimal solution to the k-means clustering
              | problem. There is an R package that implements it with a C++
              | core [1], and also a Python wrapper [2].
             | 
             | This implementation is also surprisingly fast, so you can
             | use it to brute-force check many different numbers of
             | clusters and check using silhouette distance. The advantage
             | over traditional k-means is that you don't need to check
             | multiple initializations for any given number of clusters,
             | because the algorithm is deterministic and guaranteed to
             | find the global optimum.
             | 
             | [1]: https://cran.r-project.org/package=Ckmeans.1d.dp
             | 
             | [2]: https://github.com/djdt/ckwrap
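         
              For context, a quadratic-time dynamic-programming sketch of
              the 1-D problem; the linked package uses a much faster
              algorithm, so this is only meant to show why a global optimum
              is computable in one dimension:
         
                  import numpy as np
         
                  def optimal_kmeans_1d(x, k):
                      """Optimal 1-D k-means via O(k*n^2) DP."""
                      x = np.asarray(x, dtype=float)
                      order = np.argsort(x)
                      xs = x[order]
                      n = len(xs)
                      ps = np.concatenate(([0.0], np.cumsum(xs)))
                      ps2 = np.concatenate(([0.0], np.cumsum(xs * xs)))
         
                      def sse(i, j):  # cost of the segment xs[i:j]
                          s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
                          return s2 - s * s / m
         
                      dp = np.full((k + 1, n + 1), np.inf)
                      cut = np.zeros((k + 1, n + 1), dtype=int)
                      dp[0, 0] = 0.0
                      for c in range(1, k + 1):
                          for j in range(c, n + 1):
                              for i in range(c - 1, j):
                                  cand = dp[c - 1, i] + sse(i, j)
                                  if cand < dp[c, j]:
                                      dp[c, j], cut[c, j] = cand, i
         
                      # walk the cut points back to recover the labels
                      labels_sorted = np.empty(n, dtype=int)
                      j = n
                      for c in range(k, 0, -1):
                          i = cut[c, j]
                          labels_sorted[i:j] = c - 1
                          j = i
                      labels = np.empty(n, dtype=int)
                      labels[order] = labels_sorted  # undo the sort
                      return labels, dp[k, n]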
        
               | Royi wrote:
               | Is there a paper or a post which describes the solution?
        
               | nerdponx wrote:
               | Yes, the R package was written by the authors of the
               | paper, and the CRAN info page references the paper.
        
             | civilized wrote:
              | The elbow method exists in many fields. It essentially
              | boils down to "hope that an obviously good choice exists".
        
           | jacquesm wrote:
            | Yes, that is a good point: if you don't have a good idea of
            | how many subdivisions you want, then K-Means won't help you
            | and HDBSCAN is better.
           | 
           | I always liken the latter to 'automatic feature detection'.
        
         | NicoJuicy wrote:
          | I'd be interested to hear some of those examples, actually; I
          | know what it is, but I haven't used it in practice (yet).
        
           | jacquesm wrote:
           | Just one example from recent practice, to automatically
           | collect a number of texts (pdfs) into a set of groups for
           | easier subsequent processing (summarization). Without K-Means
           | you'd have a pretty hard time figuring out for each
           | subsequent document which group you want to add it to or
            | whether or not you need to make a new group. With K-Means,
            | given a fairly simple 'distance' function, all of the grouping
            | is automatic. In fact, once you have K-Means coded up, the
            | distance function _is_ the problem to solve. If you can't
            | meaningfully define a distance then you probably won't be
            | able to use it.
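         
            A hedged sketch of one way to do that kind of grouping, using
            TF-IDF vectors and scikit-learn's KMeans; the documents and the
            choice of k are placeholders, not the parent's actual pipeline:
         
                from sklearn.cluster import KMeans
                from sklearn.feature_extraction.text import TfidfVectorizer
         
                # stand-ins for text extracted from the PDFs
                docs = ["invoice for march services rendered",
                        "quarterly sales report and forecast",
                        "invoice for april services rendered",
                        "meeting notes on sales strategy"]
         
                X = TfidfVectorizer(stop_words="english").fit_transform(docs)
                km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
                for group, doc in zip(km.labels_, docs):
                    print(group, doc)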
        
             | NicoJuicy wrote:
              | Do you have an example of the groups (= output) and what you
              | are using as variables for input?
              | 
              | E.g. wouldn't KNN be easier if you want to classify new
              | groups later on? I'm probably missing something, but
              | defining the groups upfront seems more labor intensive
              | (note: novice in this area).
        
               | dilyevsky wrote:
                | Not the parent, but I've tried using k-means for a similar
                | task, and also LDA, and found LDA to be better,
                | particularly if your documents have a mixture of topics.
        
               | flashfaffe2 wrote:
                | It depends on the task, but using K-means implies you have
                | an idea of how many clusters you might have (unsupervised
                | learning). KNN assumes the groups already exist.
        
               | jacquesm wrote:
                | Well, you don't so much define the groups up front as
                | define the _number_ of groups up front; the distance
                | function is what will eventually result in the mapping of
                | inputs to groups if you use the groups (for instance:
                | sample texts in those groups) as a part of the scoring
                | system.
        
             | nerdponx wrote:
             | This is probably the biggest (only?) advantage of k-means
             | in real problems: it's easy to put new data points into
             | existing clusters. You can do this with mean-shift
             | clustering as well.
             | 
             | It's not obvious how to do this with HDBSCAN or
             | hierarchical clustering. Maybe you'd have to draw some kind
             | of boundary around each cluster after fixing the
             | parameters.
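         
              That advantage is essentially the predict step: once the
              centroids are fixed, a new point simply goes to the nearest
              one. A tiny illustration with scikit-learn and made-up
              numbers:
         
                  import numpy as np
                  from sklearn.cluster import KMeans
         
                  X = np.array([[0.0, 0.0], [0.2, 0.1],
                                [5.0, 5.0], [5.1, 4.8]])
                  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
         
                  new_points = np.array([[0.1, 0.3], [4.9, 5.2]])
                  print(km.predict(new_points))  # nearest existing cluster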
        
               | anonymousDan wrote:
               | You mean to update the clusters incrementally as new data
               | arrives?
        
             | importantbrian wrote:
              | Interesting, did you try any other techniques like HDBSCAN?
             | I've found HDBSCAN to produce more meaningful clusters for
             | a similar workflow.
        
         | time_to_smile wrote:
          | My experience has been the opposite: K-Means clustering is
          | generally pretty useless in the real world with real data.
         | 
         | In general clustering algorithms tend not to be very useful
         | because they are ultimately just very poorly defined latent
         | variable models. Most people starting out with clustering don't
         | even understand that they're making the assumption that K
         | latent variables are responsible for generating the observed
         | data.
         | 
          | K-means in particular has loads of problems for practical
          | applications: the final location of the means is heavily
          | dependent on the initial locations; the final location can also
          | be heavily dependent on the particular dataset used (try
          | bootstrap resampling to get the variance of your means); it
          | assumes the data rests on a Euclidean surface; and making
          | geometric assumptions about high-dimensional spaces is almost
          | always a bad idea (proximity in Euclidean terms in
          | high-dimensional spaces often misses things that are nearly
          | identical except on one feature that is quite different).
         | 
          | With over a decade of modeling experience and building data
          | science products, I have never seen a product ship that relies
          | on k-means, and never seen an analysis that relies on it that
          | isn't coming from someone junior in the field.
         | 
          | K-means is a great demo of the expectation maximization
          | algorithm, which is a useful tool to understand, but generally I
          | would avoid k-means, and clustering in general, unless you have
          | an explicit understanding of why you are doing it. Even then,
          | there is probably a more stable and appropriate latent variable
          | model you could use.
        
           | nerdponx wrote:
            | > With over a decade of modeling experience and building data
            | science products, I have never seen a product ship that relies
            | on k-means, and never seen an analysis that relies on it that
            | isn't coming from someone junior in the field.
           | 
           | I used it exactly once, and it worked quite well :) But that
           | was because I was clustering 1-dimensional data, which has a
           | globally-optimal solution and there was a fast implementation
           | for it available (see my post
           | https://news.ycombinator.com/item?id=30675594). So it was
           | okay in that instance to brute-force check a lot of different
           | cluster numbers and then post-process the results with some
           | heuristics.
        
       | anonymousDan wrote:
       | The more I learn about machine learning, the more it seems to me
       | that the real key is to understand optimization algorithms (both
       | analytical and numerical) and how/when they are likely to
       | converge. Am I wrong? Are there good books summarizing the
       | relationship between all the different approaches to optimization
       | used in ML?
        
         | cinntaile wrote:
         | You're basically trying to fit data to some model and to help
         | with that you have a cost function that you try to optimize.
         | Any decent machine learning curriculum will have an
         | optimization course.
        
       ___________________________________________________________________
       (page generated 2022-03-14 23:00 UTC)