[HN Gopher] Introduction to Locality-Sensitive Hashing (2018)
       ___________________________________________________________________
        
       Introduction to Locality-Sensitive Hashing (2018)
        
       Author : signa11
       Score  : 145 points
       Date   : 2022-12-23 06:30 UTC (16 hours ago)
        
 (HTM) web link (tylerneylon.com)
 (TXT) w3m dump (tylerneylon.com)
        
       | AR_1 wrote:
        | A well-written article. The problem of efficient peer discovery
        | in a large network of nodes will become even more pronounced in
        | the near future.
       | 
        | Another design using proximity-based hashing is a particle swarm
        | optimization approach [1].
       | 
       | [1]
       | 
       | Proximity-based Networking: Small world overlays optimized with
       | particle swarm optimization
       | 
       | https://arxiv.org/pdf/2006.02006.pdf
        
       | eruci wrote:
        | I currently use a modified Hilbert curve implementation for
        | nearest-neighbor searches; I'll check this out to see how it
        | compares.
        | 
        | My use case is here: https://geocode.xyz/2451481561372 ->
        | https://geocode.xyz/45.17310,-75.62670 /
        | https://geocode.xyz/2451481561373 ->
        | https://geocode.xyz/45.17300,-75.62670
        
         | glass_of_water wrote:
         | How do you deal with false negatives when using Hilbert curves
         | for nearest neighbor searches? Based on my limited
          | understanding, points close in space can be far from each other
         | on the curve. Do you use multiple curves or something?
        
       | lgvld wrote:
       | Thank you for sharing! Very interesting. I love visual
       | explanations [1]. If I remember correctly, LSH can be used in
        | computer graphics, for instance when computing fluid simulations
       | (where _a lot_ of particles are involved).
       | 
        | The author also wrote an article in which he illustrates RNNs in
       | an original and stunning way: https://tylerneylon.com/a/randnn/
       | 
       | [1]: Not related, but you might also enjoy the work of Bartosz
       | Ciechanowski: https://ciechanow.ski/
        
       | abalaji wrote:
        | This is a fantastic article! That said, for the first half, I
        | think folks would benefit from a rough understanding of the
        | nuances of hash maps. If I may plug my own rough notes: [1]
       | 
       | [1]
       | https://github.com/adithyabsk/datastructures/blob/main/docs/...
        
       | tylerneylon wrote:
       | Hi, I'm the author of this article. I'm grateful for the positive
       | comments! I plan to explore FALCONN (pointed to by gajjanag) and
       | the beautiful article of Bartosz Ciechanowski pointed out by
       | lgvld.
       | 
       | My original motivation for writing this was to explain (in a
       | follow-up article) an ANN algorithm I published in 2010:
       | https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94
       | 
       | I never wrote that up (as a non-academic article), but the fact
       | that people like this article is a nudge for me to do that.
       | 
       | The above paper was the first LSH algorithm which
       | deterministically returns all close-enough points and guarantees
        | no false positives for distant-enough points. I've heard that
        | other algorithms with that property have been published since. If
       | I have time, I would like to one day see how that algorithm
       | compares against the others in Erik Bernhardsson's benchmarks.
        
         | ctxc wrote:
         | Forgive my lack of knowledge, but I was lost the instant I
         | found the first equation :(
         | 
          | H is the hash function and x are the coordinates. But what
          | are R and Z?
        
           | tylerneylon wrote:
           | R is the set of real numbers.
           | 
            | Z is the set of integers {..., -2, -1, 0, 1, 2, ...}.
           | 
            | Z is called "Z" due to influential German mathematicians who
            | chose the letter to stand for "Zahlen" ("numbers" in German).
           | 
           | As a fun fact, the letter "N" stands for "natural numbers,"
           | meaning either positive integers, or positive integers and
           | zero. However, that letter (N) is _not_ standardized because
           | mathematicians disagree about whether or not to include zero.
           | When I separately told a couple of my professors this in grad
           | school, they didn't believe me -- one of them thinking it was
           | crazy for zero to be excluded, the other thinking it was
           | crazy to include it!
           | 
           | https://en.wikipedia.org/wiki/Natural_number
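            | 
            | Back to the original question: a hash H from R to Z just
            | takes a real number and returns an integer. A minimal sketch
            | of such a hash in Python (a floor-of-a-scaled-coordinate
            | style; the parameters w and b here are illustrative, not the
            | article's exact formula):
            | 
            |   import math
            | 
            |   def grid_hash(x, w=1.0, b=0.0):
            |       # x is a real number (an element of R); the result is
            |       # an integer (an element of Z). Nearby values of x
            |       # tend to fall into the same bucket of width w, which
            |       # is what makes the hash locality-sensitive.
            |       return math.floor((x + b) / w)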
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Introduction to Locality-Sensitive Hashing_ -
       | https://news.ycombinator.com/item?id=27614381 - June 2021 (63
       | comments)
        
       | [deleted]
        
       | zbird wrote:
       | I like meself a well-written and formatted article. Thank you.
        
       | tipsytoad wrote:
       | Fantastic article! How does this compare to ANN for the use case?
       | 
       | https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
        
         | gpderetta wrote:
          | Reminds me of the random projections dimensionality reduction
          | scheme which is often used with LSH.
          | 
          | In fact, the described forest-of-trees scheme can probably be
          | interpreted as an LSH.
          | 
          | Disclaimer: I haven't touched this stuff in more than 10
          | years, so I don't know what the state of the art is now.
        
           | uniqueuid wrote:
            | I agree that random hyperplanes underpin both, but as far as
            | I recall, usually only one set of random planes is used in
            | LSH (i.e. one set of trees). The granularity of approximate
            | proximity comes from the number of identical signs, i.e. how
            | often two points fall on the same side of a plane.
            | 
            | There is another good and more technical explanation (using
            | bands) in chapter 2 of Mining of Massive Datasets by
            | Leskovec, Rajaraman and Ullman.
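            | 
            | A minimal sketch of that sign-based idea in Python (my own
            | illustration, not code from the article):
            | 
            |   import numpy as np
            | 
            |   def sign_signatures(X, n_planes=16, seed=0):
            |       # X: (n_points, dim) array of real-valued vectors.
            |       # Each row of `planes` is a random hyperplane normal;
            |       # the sign of the dot product says which side of that
            |       # plane a point falls on.
            |       rng = np.random.default_rng(seed)
            |       planes = rng.normal(size=(n_planes, X.shape[1]))
            |       return (X @ planes.T) > 0
            | 
            |   def matching_signs(sig_a, sig_b):
            |       # More matching signs -> smaller angle between the two
            |       # original vectors, with high probability.
            |       return int(np.sum(sig_a == sig_b))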
        
             | gpderetta wrote:
             | As I said, it has been a long time!
             | 
              | I have dim memories of using k random basis vectors to
              | convert high-dimensional feature vectors to k dimensions,
              | and doing this m times to generate multiple projections
              | as part of an LSH scheme. Min-hashing might have been
              | involved.
        
               | senderista wrote:
                | IIRC, minhashing is used to approximate Jaccard
               | similarity (a set-theoretic measure), while random
               | hyperplanes (aka simhashing) is used to approximate
               | cosine similarity (a geometric/algebraic measure). So
               | they solve different problems, even though some problems
               | can be cast in terms of either framework.
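                | 
                | A rough sketch of the minhash side (an illustration
                | with an arbitrary choice of hash construction, not any
                | particular library's API):
                | 
                |   import hashlib
                | 
                |   def minhash(items, k=64):
                |       # One signature entry per seeded hash function:
                |       # the minimum hash value over the set's items.
                |       sig = []
                |       for i in range(k):
                |           sig.append(min(
                |               int(hashlib.sha1(
                |                   f"{i}:{x}".encode()).hexdigest(), 16)
                |               for x in items))
                |       return sig
                | 
                |   def jaccard_estimate(sig_a, sig_b):
                |       # The fraction of matching minima estimates the
                |       # Jaccard similarity of the two original sets.
                |       same = sum(a == b for a, b in zip(sig_a, sig_b))
                |       return same / len(sig_a)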
        
       | foobar502 wrote:
        | Read the article and googled a bit - what are some more example
        | use cases for LSH beyond those described?
        
         | madiator wrote:
         | Here's an example I can think of. Suppose you have a bunch of
         | text documents, and you know that some documents are similar
         | but not identical (e.g. plagiarized and slightly modified). You
         | want to find out which documents are similar.
         | 
         | You can first run the contents through some sort of embedding
         | model (e.g. the recent OpenAI embedding model [1]), and then
          | apply LSH to those embeddings. Documents that end up with the
          | same LSH value will tend to have very similar embeddings, and
          | thus very similar content.
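          | 
          | A rough sketch of that pipeline, assuming the embeddings are
          | already computed (the random-hyperplane hash here is one
          | illustrative choice of LSH family, not the only one):
          | 
          |   import numpy as np
          |   from collections import defaultdict
          | 
          |   def lsh_buckets(embeddings, doc_ids, n_planes=16, seed=0):
          |       # embeddings: (n_docs, dim) array, one row per document.
          |       rng = np.random.default_rng(seed)
          |       planes = rng.normal(size=(n_planes, embeddings.shape[1]))
          |       # Which side of each random hyperplane each doc lies on.
          |       signs = (embeddings @ planes.T) > 0
          |       buckets = defaultdict(list)
          |       for doc_id, sig in zip(doc_ids, signs):
          |           buckets[tuple(sig)].append(doc_id)
          |       return buckets
          | 
          | Documents that land in the same bucket are near-duplicate
          | candidates; a final exact similarity check within each bucket
          | weeds out false positives.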
         | 
         | [1] https://beta.openai.com/docs/guides/embeddings
        
         | tylerneylon wrote:
         | The use case I see the most in my career is to use LSH to help
         | solve the "ANN" problem = approximate nearest neighbors
         | (typically with ranked results). I've seen ANN used many times
         | for near-duplicate detection and in recommendation systems.
         | 
         | Although I don't have access to the proprietary code used, it's
         | most likely that an LSH algorithm is behind the scenes in every
         | modern search engine (to avoid serving duplicates), many modern
         | ranking systems such as Elasticsearch (because items are
         | typically vectorized and retrieved based on these vectors), and
         | most recommendation systems (for similar reasons as ranking).
         | For example, all of these pages probably have an LSH algorithm
         | at some point (either batch processing before your request, or
         | in some cases real-time lookups):
         | 
          | * Every search result page on Google
          | * Every product page on Amazon (similar products)
          | * All music suggestions on Spotify or similar
          | * Every video recommendation from TikTok, YouTube, or Instagram
         | 
         | etc.
        
           | molodec wrote:
           | Another interesting use case for LSH is search results
           | caching. Used by Amazon https://www.linkedin.com/feed/update/
           | urn:li:activity:6943348...
           | https://www.amazon.science/blog/more-efficient-caching-
           | for-p...
        
           | senderista wrote:
           | Yes, e.g. many IR systems use cosine similarity to compute
           | query vector/term vector similarity, and simhashing
           | approximates cosine similarity. OTOH, some IR systems instead
            | use a set-theoretic measure, Jaccard similarity, which can
           | be approximated by minhashing.
        
       | gajjanag wrote:
       | By the way, https://github.com/FALCONN-LIB/FALCONN contains a
       | really good LSH implementation. Also see
       | https://www.mit.edu/~andoni/LSH/ if you want to know more about
       | the research literature.
       | 
        | The fastest approach for Euclidean space that I know of that
        | works well in practice is via Leech lattice decoding:
       | https://www.semanticscholar.org/paper/Maximum-likelihood-dec... ,
       | or https://www.semanticscholar.org/paper/Efficient-bounded-
       | dist... .
       | 
       | It is possible to create an implementation based on the above
        | that decodes 24-dimensional points to the closest Leech lattice
        | vector in < 1 microsecond per point on my AMD Ryzen laptop.
        | Combine with some fast random projections/Johnson-Lindenstrauss
       | as described in the article to form the LSH.
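        | 
        | For that projection step, a minimal sketch of a plain Gaussian
        | random projection down to 24 dimensions (just the
        | Johnson-Lindenstrauss part; the Leech lattice decoder itself is
        | far more involved and not shown here):
        | 
        |   import numpy as np
        | 
        |   def project_to_24d(X, seed=0):
        |       # X: (n_points, dim) array. A random Gaussian matrix
        |       # approximately preserves pairwise distances (up to a
        |       # common scale) when mapping to a lower dimension.
        |       rng = np.random.default_rng(seed)
        |       R = rng.normal(size=(X.shape[1], 24)) / np.sqrt(24)
        |       return X @ R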
       | 
       | This LSH family is unfortunately not present in FALCONN, but the
       | alternatives in FALCONN are pretty good.
       | 
       | Source: thought extensively about LSH for my PhD.
        
         | [deleted]
        
       | gumby wrote:
       | Interesting concept! Until I read it I thought locality was
       | exactly what I would _not_ want in a hash function.
       | 
       | I love it when I read something that upends my assumptions and
       | opens up new possibilities.
        
       ___________________________________________________________________
       (page generated 2022-12-23 23:00 UTC)