[HN Gopher] Introduction to K-Means Clustering
___________________________________________________________________

Introduction to K-Means Clustering

Author : gk1
Score  : 168 points
Date   : 2022-03-14 15:12 UTC (7 hours ago)

(HTM) web link (www.pinecone.io)
(TXT) w3m dump (www.pinecone.io)

| jdeaton wrote:
| After having implemented k-means ~5 times in undergraduate
| classes, it was interesting to learn that k-means is just a
| special case of expectation maximization.

| petters wrote:
| ...which in turn is just a special case of block coordinate
| descent. One set of variables is fixed while another set is
| optimized, and vice versa.

| amelius wrote:
| How do you benchmark a clustering algorithm?

| mistrial9 wrote:
| Times are changing online -- I definitely recall a two-volume,
| hard-bound, >$100 USD edition of "Clustering Methods" in the
| university book store and did not buy it. I was mocked by my
| non-native-English colleague, with a Masters in Computer Science
| from a great (non-English) university, about my ability to even
| read that work. Now, I look at the Scikit-learn docs mainly.
| Good work pinecone dot IO.

| monkeybutton wrote:
| I think it's now turned into somewhat of a competition. If you
| want your algorithm/package adopted, you have to make it open
| source and provide good documentation like this:
| https://lifelines.readthedocs.io/en/latest/Survival%20Analys...
| https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.h...

| danuker wrote:
| Props to Lifelines. It arranged my thoughts on how to handle
| survival data by requiring the data in a clear format.

| oofbey wrote:
| K-means has a lot going for it, but it's not very good at
| recovering "natural" clusters. Like if you have a dataset that
| you know actually is grouped in a certain way, k-means very
| often will not discover those clusters. Other, more complex
| clustering algorithms will do a more reliable job. If you have
| to implement clustering yourself for some reason, k-means'
| simplicity is great. But how often do you need to write your own
| algorithm these days?

| [deleted]

| notafraudster wrote:
| One thing that's nice about k-means is that it can quite
| literally be grasped intuitively by, say, a 10-year-old.
| k-nearest neighbors is similarly graspable. I think these are
| wonderful techniques to introduce kids to data science without
| having to get into the math. Obviously you're going to have to
| abstract away high dimensionality and a lot of the underlying
| math of "how" -- but that's ok!
|
| I'm not sure I've found either to be massively useful on
| real-world data, but that's OK too!

| globular-toast wrote:
| One thing not mentioned with k-means is that it requires your
| data to be in Euclidean space. You can't just come up with a
| distance function and do clustering with it. That's one reason
| the other techniques often make sense. In fact, those techniques
| often don't care what the data actually is; they only need the
| distances, and even then they don't need all of them.
|
| Still, k-means is enormously practical despite its shortcomings.

| n4r9 wrote:
| An alternative I've used is k-medoids:
|
| https://en.m.wikipedia.org/wiki/K-medoids
|
| This is especially useful in cases where the distance between
| points is given by a matrix and cannot be derived from the
| points themselves. For example, the drive time between locations
| on a road network.

| whiplash451 wrote:
| Not sure why this was downvoted.
|
| K-means in its standard form does make a Euclidean hypothesis
| (it does not minimize a distance but the in-cluster variance).
|
| It is not correct to use arbitrary distances, because k-means
| may stop converging with other distance functions.
|
| See this thread for more info:
| https://stats.stackexchange.com/questions/81481/why-does-k-m...
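
To make whiplash451's point concrete, here is a minimal sketch of the
standard Lloyd iterations in NumPy (an illustration only, assuming
numeric data in an (n_samples, n_features) array; it is not the
article's code). Both steps are defined in terms of squared Euclidean
distances and cluster means, which is why the quantity being reduced
is within-cluster variance rather than an arbitrary distance:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain Lloyd's algorithm: assign to the nearest centroid, re-average."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
        for _ in range(n_iter):
            # Assignment step: squared Euclidean distance to every centroid.
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: each centroid becomes the mean of its cluster.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels

Swapping an arbitrary distance into the assignment step while keeping
the mean in the update step is exactly what can break the convergence
guarantee discussed in the replies below.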
| isoprophlex wrote:
| That's a really pedantic take; nothing precludes you from using
| whatever distance function you want to get the distance to the k
| centroids and act accordingly.

| amilios wrote:
| Losing convergence guarantees is somewhat important!

| isoprophlex wrote:
| I totally agree with your point, but sometimes you can get away
| with being handwavey about it and making sure the centroids
| aren't skipping all over the place... and that's good enough.

| isoprophlex wrote:
| But you can. Get the distance to all your centroids, assign to
| the closest centroid, update the centroid location.
|
| The distance function can be anything.

| globular-toast wrote:
| What's a centroid? It's the point that minimises the sum of
| squared Euclidean distances between itself and every point in
| the cluster.

| isoprophlex wrote:
| It's a tuple of floats that you move slightly towards the
| closest point it has just seen, optionally with some rate of
| forgetting.
|
| You shuffle your big-ass wad of data and loop over each element
| doing the above, optionally decreasing the forgetting rate.
|
| You track the location of your centroids, and if your data /
| loss fn isn't too pathological, they won't misbehave too badly
| and you have handwavey assurance of convergence... Definitions
| be damned.

| civilized wrote:
| So, you made up your own algorithm that is kinda like k-means,
| this algorithm might sometimes work, and this shows that k-means
| works in non-Euclidean spaces?

| CardenB wrote:
| What prevents you from using a different distance than Euclidean
| distance?

| danuker wrote:
| Suppose you use prices and square meters to cluster apartments.
| Suppose you also use a simple metric, like:
|
|     distance <- sqrt(d_price^2 + d_area^2)
|
| Without normalizing the units somehow, so that they have a
| meaningfully similar numeric expression, you are essentially
| only clustering by price, which is in the tens or hundreds of
| thousands.
|
| You use very little information about the area, because that is
| in the hundreds, at most thousands.

| nerdponx wrote:
| This has nothing to do with using non-Euclidean distance,
| though. This is a separate issue related to having features on
| different scales.

| jacquesm wrote:
| That is not a proper distance function.

| danuker wrote:
| What do you mean? Is it not the Euclidean distance?
|
| https://en.wikipedia.org/wiki/Euclidean_distance

| [deleted]

| amilios wrote:
| The way k-means is constructed is not based on distances.
|
| K-means minimizes within-cluster variance. If you look at the
| definition of variance, it is identical to the sum of squared
| Euclidean distances from the centroid.

| nerdponx wrote:
| Do you lose convergence guarantees if you minimize "sum of
| squared distances from the centroid" using your
| other-than-Euclidean distance metric?

| fmakunbound wrote:
| It's been ages since I've used K-means, but I thought you had a
| lot of freedom in how you define your distance function.
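
As a small sketch of danuker's scaling point, with made-up apartment
numbers: on the raw features, the price column (hundreds of
thousands) dominates the squared Euclidean distance, while
standardizing the features first lets the area matter as well.
scikit-learn is assumed here only for brevity:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical apartments: [price, area in square meters].
    X = np.array([
        [250_000.0, 80.0],
        [252_000.0, 78.0],
        [249_000.0, 30.0],
        [251_000.0, 32.0],
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(km.fit_predict(X))  # raw units: the split follows price alone
    print(km.fit_predict(StandardScaler().fit_transform(X)))  # area now matters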
| tus666 wrote:
| Why not just do a proper course in data science?

| nerdponx wrote:
| Why read articles about computer science when you could take a
| proper course in computer science?

| tus666 wrote:
| Because it is far more generic and likely to be actually useful.
|
| Could you imagine someone doing K-means outside of a specific
| data science application?

| magicalhippo wrote:
| > Could you imagine someone doing K-means outside of a specific
| data science application?
|
| I implemented k-means clustering to reduce noise in a path
| tracer. I've never done a data science course or anything like
| that.

| lichtenberger wrote:
| That was one of the first algorithms I learned at university.
| Thanks for posting :-)

| Bostonian wrote:
| Mixtures of multivariate normals can be regarded as a method of
| fuzzy clustering, where each point has a probability of coming
| from each of the component Gaussians. The EM (expectation
| maximization) algorithm can be used to fit mixture models.

| sideproject wrote:
| Simplicity wins in many cases, and so it is with K-Means
| clustering. There are many clustering algorithms that are far
| superior to K-Means, but its simplicity and easy-to-apply
| approach to most datasets let pretty much anyone with
| programming skills understand it, implement it quickly, get
| results, and get that feeling of "oh, hey, that's my first data
| science thingy". Simplicity is very powerful.

| jacquesm wrote:
| K-Means Clustering is one of those tools that, once you
| understand it, becomes more and more useful over time. You keep
| finding new ways to apply it; it's a bit like Quicksort,
| Bresenham or Divide and Conquer in that sense.

| csunbird wrote:
| It is also super intuitive. The logic makes sense immediately
| when you see it for the first time.

| Helmut10001 wrote:
| Working in spatial data science, I rarely find applications
| where k-means is the best tool. The problem is that it is
| difficult to know how many clusters to expect on maps. Is it 5,
| 500, or 10,000? Here HDBSCAN [1] shines, because it will cluster
| _and_ select the most suitable number of clusters at which to
| cut the single-linkage cluster tree.
|
| [1]: https://github.com/scikit-learn-contrib/hdbscan

| falkenb0t wrote:
| I found myself having to do some clustering a few months back
| and learned about mean-shift clustering [1]. I found it to be a
| nice clustering algorithm for when you don't know the number of
| clusters ahead of time.
|
| That said, it's worth noting that there are methods for
| determining the optimal number of clusters for k-means (though
| personally I found them to be costly and subject to
| overfitting), like the silhouette coefficient or the elbow
| method [2].
|
| [1]: https://www.geeksforgeeks.org/ml-mean-shift-clustering/
| [2]: https://towardsdatascience.com/k-means-clustering-how-it-wor...

| CuriouslyC wrote:
| A bigger problem with k-means than figuring out the number of
| clusters is the fact that the model assumes multivariate
| Gaussian spheres. That's a bad model for most non-toy data.

| Royi wrote:
| Not only limited to Gaussian spheres, but also to isotropic ones
| (unless the data is preprocessed or the distance is defined by
| the Mahalanobis distance).

| jacquesm wrote:
| Can you explain this please?

| nerdponx wrote:
| Note also that specifically for one-dimensional data there is a
| globally optimal solution to the k-means clustering problem.
| There is an R package that implements it with a C++ core [1],
| and also a Python wrapper [2].
|
| This implementation is also surprisingly fast, so you can use it
| to brute-force check many different numbers of clusters and
| compare them using the silhouette score. The advantage over
| traditional k-means is that you don't need to check multiple
| initializations for any given number of clusters, because the
| algorithm is deterministic and guaranteed to find the global
| optimum.
|
| [1]: https://cran.r-project.org/package=Ckmeans.1d.dp
|
| [2]: https://github.com/djdt/ckwrap
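
The brute-force pattern falkenb0t and nerdponx describe can be
sketched with ordinary scikit-learn k-means (not the one-dimensional
Ckmeans.1d.dp implementation mentioned above): fit a range of
candidate k values and keep the one with the best silhouette score.
X is assumed to be an (n_samples, n_features) array:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k_by_silhouette(X, k_min=2, k_max=10, random_state=0):
        """Try each candidate k; return the k with the highest silhouette score."""
        scores = {}
        for k in range(k_min, k_max + 1):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=random_state).fit_predict(X)
            scores[k] = silhouette_score(X, labels)
        return max(scores, key=scores.get), scores

As falkenb0t notes, this gets costly: every candidate k means another
full clustering run.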
| Royi wrote:
| Is there a paper or a post which describes the solution?

| nerdponx wrote:
| Yes, the R package was written by the authors of the paper, and
| the CRAN info page references the paper.

| civilized wrote:
| The elbow method exists in many fields. It essentially boils
| down to "hope that an obviously good choice exists".

| jacquesm wrote:
| Yes, that is a good point: if you don't have a good idea of how
| many subdivisions you want, then K-Means won't help you and
| HDBSCAN is better.
|
| I always liken the latter to 'automatic feature detection'.

| NicoJuicy wrote:
| I'd be interested to hear some of those examples, actually. I
| know what it is, but I haven't used it in practice (yet).

| jacquesm wrote:
| Just one example from recent practice: automatically collecting
| a number of texts (PDFs) into a set of groups for easier
| subsequent processing (summarization). Without K-Means you'd
| have a pretty hard time figuring out, for each subsequent
| document, which group you want to add it to or whether you need
| to make a new group. With K-Means, given a fairly simple
| 'distance' function, all of the grouping is automatic. In fact,
| once you have K-Means coded up, the distance function _is_ the
| problem to solve. If you can't meaningfully define a distance
| then you probably won't be able to use it.

| NicoJuicy wrote:
| Do you have an example of the groups (= output) and what you are
| using as variables for input?
|
| E.g. wouldn't KNN be easier if you want to classify new groups
| later on? I'm probably missing something, but defining the
| groups upfront seems more labor-intensive (note: novice in this
| area).

| dilyevsky wrote:
| Not the parent, but I've tried using k-means for a similar task,
| as well as LDA, and found LDA to be better, particularly if your
| documents have a mixture of topics.

| flashfaffe2 wrote:
| It depends on the task, but using K-means implies you have an
| idea of the clusters you might have (unsupervised learning). KNN
| assumes the groups already exist.

| jacquesm wrote:
| Well, you don't so much define the groups up front as you define
| the _number_ of groups up front; the distance function is what
| will eventually result in the mapping of inputs to groups, if
| you use the groups (for instance: sample texts in those groups)
| as a part of the scoring system.

| nerdponx wrote:
| This is probably the biggest (only?) advantage of k-means in
| real problems: it's easy to put new data points into existing
| clusters. You can do this with mean-shift clustering as well.
|
| It's not obvious how to do this with HDBSCAN or hierarchical
| clustering. Maybe you'd have to draw some kind of boundary
| around each cluster after fixing the parameters.

| anonymousDan wrote:
| You mean to update the clusters incrementally as new data
| arrives?

| importantbrian wrote:
| Interesting. Did you try any other techniques like HDBSCAN? I've
| found HDBSCAN to produce more meaningful clusters for a similar
| workflow.
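
For reference, a rough sketch of the text-grouping workflow jacquesm
describes. The thread doesn't say which features or distance he used,
so TF-IDF vectors are an assumption here, and the document strings
are made-up stand-ins. The final predict call is the "put new data
points into existing clusters" property nerdponx mentions:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy stand-ins for the PDFs being grouped.
    texts = ["invoice for march", "quarterly earnings report",
             "meeting notes from monday", "agenda for the standup"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)

    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(model.labels_)

    # A new document is simply assigned to the nearest existing centroid.
    print(model.predict(vectorizer.transform(["notes from friday's meeting"])))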
| time_to_smile wrote:
| My experience has been the opposite: K-Means clustering is
| generally pretty useless in the real world with real data.
|
| In general, clustering algorithms tend not to be very useful
| because they are ultimately just very poorly defined latent
| variable models. Most people starting out with clustering don't
| even understand that they're making the assumption that K latent
| variables are responsible for generating the observed data.
|
| K-means in particular has loads of problems for practical
| applications: the final location of the means is heavily
| dependent on the initial locations; the final location can also
| be heavily dependent on the particular dataset used (try
| bootstrap resampling to get the variance of your means); it
| assumes the data rests on a Euclidean surface; and making
| geometric assumptions about high-dimensional spaces is almost
| always a bad idea (closeness in Euclidean terms in
| high-dimensional spaces often misses things that are nearly
| identical except on one feature that is quite different).
|
| With over a decade of modeling experience and building data
| science products, I have never seen a product ship that relies
| on k-means, and never seen an analysis that relies on it that
| isn't coming from someone junior in the field.
|
| K-means is a great demo of the expectation maximization
| algorithm, which is a useful tool to understand, but generally I
| would avoid k-means, and clustering in general, unless you have
| an explicit understanding of why you are doing it. Even then,
| there is probably a more stable and appropriate latent variable
| model you could use.

| nerdponx wrote:
| > With over a decade of modeling experience and building data
| science products, I have never seen a product ship that relies
| on k-means, and never seen an analysis that relies on it that
| isn't coming from someone junior in the field.
|
| I used it exactly once, and it worked quite well :) But that was
| because I was clustering 1-dimensional data, which has a
| globally optimal solution, and there was a fast implementation
| for it available (see my post
| https://news.ycombinator.com/item?id=30675594). So it was okay
| in that instance to brute-force check a lot of different cluster
| numbers and then post-process the results with some heuristics.

| anonymousDan wrote:
| The more I learn about machine learning, the more it seems to me
| that the real key is to understand optimization algorithms (both
| analytical and numerical) and how/when they are likely to
| converge. Am I wrong? Are there good books summarizing the
| relationship between all the different approaches to
| optimization used in ML?

| cinntaile wrote:
| You're basically trying to fit data to some model, and to help
| with that you have a cost function that you try to optimize. Any
| decent machine learning curriculum will have an optimization
| course.
___________________________________________________________________
(page generated 2022-03-14 23:00 UTC)