[HN Gopher] Introduction to Locality-Sensitive Hashing (2018)
___________________________________________________________________
 
Introduction to Locality-Sensitive Hashing (2018)
 
Author : signa11
Score  : 145 points
Date   : 2022-12-23 06:30 UTC (16 hours ago)
 
(HTM) web link (tylerneylon.com)
(TXT) w3m dump (tylerneylon.com)
 
  | AR_1 wrote:
  | A well-written article. The problem of efficient peer discovery
  | in a large network of nodes will become even more pronounced in
  | the near future.
  |
  | Another design using proximity-based hashing is to use a
  | particle swarm optimization approach [1]
  |
  | [1] Proximity-based Networking: Small world overlays optimized
  | with particle swarm optimization
  | https://arxiv.org/pdf/2006.02006.pdf
 
  | eruci wrote:
  | I currently use a modified Hilbert curve implementation for
  | nearest-neighbor searches; I will check this out to see how it
  | compares.
  |
  | My use case is here: https://geocode.xyz/2451481561372 ->
  | https://geocode.xyz/45.17310,-75.62670 /
  | https://geocode.xyz/2451481561373 ->
  | https://geocode.xyz/45.17300,-75.62670
 
    | glass_of_water wrote:
    | How do you deal with false negatives when using Hilbert
    | curves for nearest-neighbor searches? Based on my limited
    | understanding, points close in space can be far from each
    | other on the curve. Do you use multiple curves or something?
 
  | lgvld wrote:
  | Thank you for sharing! Very interesting. I love visual
  | explanations [1]. If I remember correctly, LSH can be used in
  | computer graphics, for instance when computing fluid
  | simulations (where _a lot_ of particles are involved).
  |
  | The author also wrote an article in which he illustrates RNNs
  | in an original and stunning way:
  | https://tylerneylon.com/a/randnn/
  |
  | [1]: Not related, but you might also enjoy the work of Bartosz
  | Ciechanowski: https://ciechanow.ski/
 
  | abalaji wrote:
  | This is a fantastic article! Though, for the first half, I do
  | think folks would benefit from having a rough understanding of
  | the nuances of hash maps, if I may plug my own rough notes [1]
  |
  | [1]
  | https://github.com/adithyabsk/datastructures/blob/main/docs/...
 
  | tylerneylon wrote:
  | Hi, I'm the author of this article. I'm grateful for the
  | positive comments! I plan to explore FALCONN (pointed to by
  | gajjanag) and the beautiful article of Bartosz Ciechanowski
  | pointed out by lgvld.
  |
  | My original motivation for writing this was to explain (in a
  | follow-up article) an ANN algorithm I published in 2010:
  | https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94
  |
  | I never wrote that up (as a non-academic article), but the fact
  | that people like this article is a nudge for me to do that.
  |
  | The above paper was the first LSH algorithm that
  | deterministically returns all close-enough points and
  | guarantees no false positives for distant-enough points. I've
  | heard there are also other algorithms with that property
  | published since. If I have time, I would like to one day see
  | how that algorithm compares against the others in Erik
  | Bernhardsson's benchmarks.
 
    | ctxc wrote:
    | Forgive my lack of knowledge, but I was lost the instant I
    | found the first equation :(
    |
    | H is the hash function and x are the coordinates. But what
    | are R and Z?
 
      | tylerneylon wrote:
      | R is the set of real numbers.
      |
      | Z is the set of integers {..., -2, -1, 0, 1, 2, ...}.
      |
      | Z is called "Z" due to influential German mathematicians
      | who chose the letter to stand for "Zahlen" ("numbers" in
      | German).
      |
      | As a fun fact, the letter "N" stands for "natural numbers,"
      | meaning either the positive integers, or the positive
      | integers and zero. However, that letter (N) is _not_
      | standardized because mathematicians disagree about whether
      | or not to include zero. When I separately told a couple of
      | my professors this in grad school, they didn't believe me
      | -- one of them thinking it was crazy for zero to be
      | excluded, the other thinking it was crazy to include it!
      |
      | https://en.wikipedia.org/wiki/Natural_number
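To make the R and Z exchange above concrete, here is a minimal
Python sketch (an editorial illustration, not the article's code) of
a grid-style hash h : R^d -> Z^d: each coordinate is floored into an
integer cell id, and the cell width w is an assumed parameter.
 
    import math
 
    def grid_hash(point, w=1.0):
        """Map a point in R^d to a cell id in Z^d by flooring each
        coordinate divided by the (assumed) cell width w."""
        return tuple(math.floor(x / w) for x in point)
 
    # Nearby points usually share a cell id...
    print(grid_hash((45.1731, -75.6267)))  # (45, -76)
    print(grid_hash((45.1730, -75.6267)))  # (45, -76)
    # ...but two close points straddling a cell boundary do not,
    # which is why practical LSH schemes combine several shifted or
    # randomized grids.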
  | dang wrote:
  | Discussed at the time:
  |
  | _Introduction to Locality-Sensitive Hashing_ -
  | https://news.ycombinator.com/item?id=27614381 - June 2021 (63
  | comments)
 
  | [deleted]
 
  | zbird wrote:
  | I like meself a well-written and formatted article. Thank you.
 
  | tipsytoad wrote:
  | Fantastic article! How does this compare to ANN for the use
  | case?
  |
  | https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
 
    | gpderetta wrote:
    | Reminds me of the random projections dimensionality reduction
    | scheme, which is often used with LSH.
    |
    | In fact the described forest-of-trees scheme can probably be
    | interpreted as an LSH.
    |
    | Disclaimer: I haven't touched this stuff for more than 10
    | years. I don't know what the state of the art is now.
 
      | uniqueuid wrote:
      | I agree that random hyperplanes underpin both, but as far
      | as I recall, usually only one set of random planes is used
      | in LSH (i.e. one set of trees). The granularity of
      | approximate proximity rests in the number of identical
      | signs, i.e. being on the same side of the plane.
      |
      | There is another good and more technical explanation (using
      | bands) in chapter 3 of Mining of Massive Datasets by
      | Leskovec, Rajaraman and Ullman.
 
        | gpderetta wrote:
        | As I said, it has been a long time!
        |
        | I have dim memories of using k random basis vectors to
        | convert high-dimensionality feature vectors to k
        | dimensions, and doing this m times to generate multiple
        | projections as part of an LSH scheme. Min-hashing might
        | have been involved.
 
          | senderista wrote:
          | IIRC, minhashing is used to approximate Jaccard
          | similarity (a set-theoretic measure), while random
          | hyperplanes (aka simhashing) are used to approximate
          | cosine similarity (a geometric/algebraic measure). So
          | they solve different problems, even though some
          | problems can be cast in terms of either framework.
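A minimal sketch of the random-hyperplane (simhash) family described
above, assuming dense vectors in R^d: each of k random hyperplanes
contributes one sign bit to a signature, and two vectors agree on
any given bit with probability 1 - theta/pi, where theta is the
angle between them. The names and parameters here are illustrative,
not from the thread.
 
    import random
 
    def random_planes(k, d, seed=0):
        """k random hyperplane normals in R^d; Gaussian entries
        make the normal directions uniform on the sphere."""
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(d)]
                for _ in range(k)]
 
    def simhash(vec, planes):
        """k-bit signature: bit i is 1 iff vec lies on the
        positive side of hyperplane i."""
        bits = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits
 
    planes = random_planes(k=16, d=3)
    a = simhash([1.0, 2.0, 3.0], planes)
    b = simhash([1.1, 2.0, 2.9], planes)  # nearly the same direction
    # bin(a ^ b) has few 1 bits with high probability; bucketing on
    # bands of these bits gives the LSH lookup described above.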
  | foobar502 wrote:
  | I read the article and googled a bit. What are more example use
  | cases for LSH beyond those described?
 
    | madiator wrote:
    | Here's an example I can think of. Suppose you have a bunch of
    | text documents, and you know that some documents are similar
    | but not identical (e.g. plagiarized and slightly modified).
    | You want to find out which documents are similar.
    |
    | You can first run the contents through some sort of embedding
    | model (e.g. the recent OpenAI embedding model [1]), and then
    | apply LSH on those embeddings. The documents that have the
    | same LSH value would have had very similar embeddings, and
    | thus very similar content.
    |
    | [1] https://beta.openai.com/docs/guides/embeddings
 
    | tylerneylon wrote:
    | The use case I see the most in my career is using LSH to help
    | solve the "ANN" problem = approximate nearest neighbors
    | (typically with ranked results). I've seen ANN used many
    | times for near-duplicate detection and in recommendation
    | systems.
    |
    | Although I don't have access to the proprietary code used,
    | it's very likely that an LSH algorithm is behind the scenes
    | in every modern search engine (to avoid serving duplicates),
    | many modern ranking systems such as Elasticsearch (because
    | items are typically vectorized and retrieved based on these
    | vectors), and most recommendation systems (for similar
    | reasons as ranking). For example, all of these pages probably
    | have an LSH algorithm at some point (either batch processing
    | before your request, or in some cases real-time lookups):
    |
    | * Every search result page on Google
    | * Every product page on Amazon (similar products)
    | * All music suggestions on Spotify or similar
    | * Every video recommendation from TikTok, YouTube, or
    |   Instagram
    |
    | etc.
 
      | molodec wrote:
      | Another interesting use case for LSH is search result
      | caching, used by Amazon:
      | https://www.linkedin.com/feed/update/urn:li:activity:6943348...
      | https://www.amazon.science/blog/more-efficient-caching-for-p...
 
      | senderista wrote:
      | Yes, e.g. many IR systems use cosine similarity to compute
      | query vector/term vector similarity, and simhashing
      | approximates cosine similarity. OTOH, some IR systems
      | instead use a set-theoretic measure, Jaccard similarity,
      | which can be approximated by minhashing.
 
  | gajjanag wrote:
  | By the way, https://github.com/FALCONN-LIB/FALCONN contains a
  | really good LSH implementation. Also see
  | https://www.mit.edu/~andoni/LSH/ if you want to know more about
  | the research literature.
  |
  | The fastest way for Euclidean space that I know of that works
  | well in practice is via Leech lattice decoding:
  | https://www.semanticscholar.org/paper/Maximum-likelihood-dec...
  | or
  | https://www.semanticscholar.org/paper/Efficient-bounded-dist...
  |
  | It is possible to create an implementation based on the above
  | that decodes 24-dimensional points to the closest Leech lattice
  | vector in < 1 microsecond per point on my AMD Ryzen laptop.
  | Combine this with some fast random projections/Johnson-
  | Lindenstrauss transforms as described in the article to form
  | the LSH.
  |
  | This LSH family is unfortunately not present in FALCONN, but
  | the alternatives in FALCONN are pretty good.
  |
  | Source: thought extensively about LSH for my PhD.
 
  | [deleted]
 
  | gumby wrote:
  | Interesting concept! Until I read it I thought locality was
  | exactly what I would _not_ want in a hash function.
  |
  | I love it when I read something that upends my assumptions and
  | opens up new possibilities.
___________________________________________________________________
(page generated 2022-12-23 23:00 UTC)