[HN Gopher] Vectors are over, hashes are the future
       ___________________________________________________________________
        
       Vectors are over, hashes are the future
        
       Author : jsilvers
       Score  : 98 points
       Date   : 2022-10-07 16:59 UTC (6 hours ago)
        
 (HTM) web link (www.algolia.com)
 (TXT) w3m dump (www.algolia.com)
        
       | nelsondev wrote:
       | Seems the author is proposing LSH instead of vectors for doing
       | ANN?
       | 
       | There are benchmarks here, http://ann-benchmarks.com/ , but LSH
       | underperforms the state of the art ANN algorithms like HNSW on
       | recall/throughput.
       | 
       | LSH I believe was state of the art 10ish years ago, but has since
       | been surpassed. Although the caching aspect is really nice.
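        | 
        | (For illustration, a minimal single-table random-hyperplane LSH
        | sketch in Python, with a recall@10 check against brute force;
        | the data, dimensions, and hash width here are made up, and a
        | real benchmark would use a tuned LSH with many tables, or HNSW.)
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   n, d, bits, k = 5000, 64, 8, 10
        |   data = rng.normal(size=(n, d))
        |   planes = rng.normal(size=(d, bits))  # random hyperplanes
        | 
        |   def key(v):
        |       # one bit per hyperplane: which side of it v falls on
        |       return tuple(v @ planes > 0)
        | 
        |   buckets = {}
        |   for i, v in enumerate(data):
        |       buckets.setdefault(key(v), []).append(i)
        | 
        |   def recall_at_k(q):
        |       # exact top-k by brute force vs top-k from q's bucket only
        |       d_all = np.linalg.norm(data - q, axis=1)
        |       exact = set(np.argsort(d_all)[:k])
        |       cand = buckets.get(key(q), [])
        |       def dist(i):
        |           return np.linalg.norm(data[i] - q)
        |       near = sorted(cand, key=dist)[:k]
        |       return len(exact & set(near)) / k
        | 
        |   queries = rng.normal(size=(100, d))
        |   print(np.mean([recall_at_k(q) for q in queries]))  # well below 1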
        
         | kvathupo wrote:
         | To elaborate on Noe's comment, the article is suggesting the
         | use of LSH where the hashing function is learned by a neural
         | network such that similar vectors correspond to similar hashes
         | via Hamming weight (whilst enforcing some load factor). In
         | effect, a good hash is generated by a neural network. It
          | appears Elastiknn chooses the hash function a priori? Not
          | sure, not my area of knowledge.
         | 
         | This approach seems feasible tbh. For example, a stock's
         | historical bids/asks probably don't deviate greatly from month
         | to month. That said, the generation of a good hash is dependent
         | on the stock ticker, and a human doesn't have the time to find
         | a good one for every stock at scale.
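          | 
          | (A rough sketch of the lookup side of that idea, assuming the
          | hash is already learned; a fixed random projection stands in
          | for the network here, and similarity is the Hamming distance
          | between the resulting bit codes.)
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(1)
          |   proj = rng.normal(size=(128, 64))  # stand-in for a learned net
          | 
          |   def neural_hash(x):
          |       # binarize a projection: similar inputs flip few bits
          |       bits = (x @ proj > 0).astype(np.uint8)
          |       return np.packbits(bits)       # 64 bits -> 8 bytes
          | 
          |   def hamming(a, b):
          |       # XOR the packed codes, then count differing bits
          |       return int(np.unpackbits(a ^ b).sum())
          | 
          |   x = rng.normal(size=128)
          |   y = x + 0.05 * rng.normal(size=128)   # near-duplicate of x
          |   z = rng.normal(size=128)              # unrelated vector
          |   print(hamming(neural_hash(x), neural_hash(y)))  # small
          |   print(hamming(neural_hash(x), neural_hash(z)))  # ~32 of 64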
        
         | Noe2097 wrote:
         | LSH is a _technique_, whose performance vastly/mostly depends
         | on the hashing function and on how this function enables
         | neighborhood exploration.
         | 
          | It might not be trendy, but that doesn't mean it can't work
          | as well as or better than HNSW. It all depends on the hashing
          | function you come up with.
        
           | a-dub wrote:
           | when combined with minhashing it approximates jaccard
           | similarity, so it seems it would be bounded by that.
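            | 
            | (a quick toy minhash in python, just to show the jaccard
            | connection; real systems use proper hash families, not
            | python's built-in hash.)
            | 
            |   import random
            | 
            |   def minhash(tokens, num_hashes=128, seed=42):
            |       # one salt per slot; keep the smallest salted hash
            |       rnd = random.Random(seed)
            |       salts = [rnd.getrandbits(32)
            |                for _ in range(num_hashes)]
            |       return [min(hash((s, t)) for t in tokens)
            |               for s in salts]
            | 
            |   def est_jaccard(sa, sb):
            |       # fraction of matching slots estimates jaccard(a, b)
            |       return sum(x == y for x, y in zip(sa, sb)) / len(sa)
            | 
            |   a = set("the quick brown fox jumps over".split())
            |   b = set("the quick brown fox leaps over".split())
            |   print(len(a & b) / len(a | b))               # true jaccard
            |   print(est_jaccard(minhash(a), minhash(b)))   # estimate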
        
         | a-dub wrote:
         | 10? no, it's more like 20+. lsh was a core piece of the google
         | crawler. it was used for high performance fuzzy deduplication.
         | 
         | see ullman's text: mining massive datasets. it's free on the
         | web.
        
           | johanvts wrote:
            | I think LSH was only introduced in '99 by Indyk et al. I
           | would say it was a pretty active research area 10 years ago.
        
             | a-dub wrote:
             | right, but massive scale production use in the google
             | crawler to index the entire internet when that was at the
             | bleeding edge was state of the art before the art was even
             | really recognized as an art.
             | 
             | i don't even think they called it ANN. it was high
             | performance, scalable deduplication. (which is, in fact,
             | just fast/scalable lossy clustering)
             | 
             | collaborative filtering was kind of a cute joke at the
             | time. meanwhile they had lsh, in production, actually
             | deduplicating the internet.
        
         | molodec wrote:
         | It is true that HNSW outperforms LSH on recall and throughput,
          | but for some use cases LSH outperforms HNSW. This week I
          | deployed to prod a new system for streaming short-text
          | clustering using LSH. I used algorithms from this crate, which
          | I also built: https://github.com/serega/gaoya
         | 
          | An HNSW index is slow to construct, so it is best suited for
          | search or recommendation engines where you build the index and
          | then serve. For workloads where you continuously mutate the
          | index, like streaming clustering/deduplication, LSH
          | outperforms HNSW.
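          | 
          | (A toy sketch of the streaming case, not the gaoya API: a
          | banded MinHash index that is mutated as items arrive, so each
          | new item is checked against the index and then added to it.
          | The band/row counts control the similarity threshold.)
          | 
          |   import random
          |   from collections import defaultdict
          | 
          |   BANDS, ROWS = 32, 4                  # 128 minhash slots
          |   _rnd = random.Random(7)
          |   SALTS = [_rnd.getrandbits(32) for _ in range(BANDS * ROWS)]
          | 
          |   def minhash(tokens):
          |       # one slot per salt: smallest salted hash over the tokens
          |       return [min(hash((s, t)) for t in tokens) for s in SALTS]
          | 
          |   class StreamingLSH:
          |       def __init__(self):
          |           self.tables = [defaultdict(set) for _ in range(BANDS)]
          | 
          |       def _keys(self, sig):
          |           # each band of the signature becomes one bucket key
          |           for b in range(BANDS):
          |               yield b, tuple(sig[b * ROWS:(b + 1) * ROWS])
          | 
          |       def query(self, sig):
          |           # anything sharing a full band is a dup candidate
          |           cands = set()
          |           for b, key in self._keys(sig):
          |               cands |= self.tables[b][key]
          |           return cands
          | 
          |       def insert(self, item_id, sig):
          |           for b, key in self._keys(sig):
          |               self.tables[b][key].add(item_id)
          | 
          |   index = StreamingLSH()
          |   docs = ["a b c d e f", "a b c d e g", "x y z"]
          |   for i, text in enumerate(docs):
          |       sig = minhash(text.split())
          |       print(i, "candidates:", index.query(sig))
          |       index.insert(i, sig)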
        
       | whycombinetor wrote:
        | The article's 0.65 vs 0.66 float64 example doesn't indicate
        | much, since neither 0.65 nor 0.66 has a terminating
        | representation in base 2...
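        | 
        | (Easy to check from a Python prompt; Decimal shows the exact
        | value of the nearest binary64 double.)
        | 
        |   from decimal import Decimal
        | 
        |   print(Decimal(0.65))   # a long expansion, not exactly 0.65
        |   print(Decimal(0.66))   # likewise not exactly 0.66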
        
       | whatever1 wrote:
        | Omg NN "research" is just heuristics on top of heuristics on top
        | of mumbo jumbo.
       | 
       | Hopefully someone who knows math will enter the field one day and
       | build the theoretical basis for all this mess and allow us to
       | make real progress.
        
         | auraham wrote:
         | Old post of Yann LeCun [1]:
         | 
         | > But another important goal is inventing new methods, new
         | techniques, and yes, new tricks. In the history of science and
         | technology, the engineering artifacts have almost always
         | preceded the theoretical understanding: the lens and the
         | telescope preceded optics theory, the steam engine preceded
         | thermodynamics, the airplane preceded flight aerodynamics,
         | radio and data communication preceded information theory, the
         | computer preceded computer science.
         | 
         | [1]
         | https://www.reddit.com/r/MachineLearning/comments/7i1uer/n_y...
        
       | sramam wrote:
       | (I know nothing about the area.)
       | 
       | Am I incorrect in thinking we are headed to future AIs that jump
       | to conclusions? Or is it just my "human neural hash" being
       | triggered in error?!
        
       | [deleted]
        
       | mrkeen wrote:
       | > The analogy here would be the choice between a 1 second flight
       | to somewhere random in the suburb of your choosing in any city in
       | the world versus a 10 hour trip putting you at the exact house
       | you wanted in the city of your choice.
       | 
       | Wouldn't the first part of the analogy actually be:
       | 
       | A 1 second flight that will probably land at your exact
       | destination, but could potentially land you anywhere on earth?
        
       | steve76 wrote:
        
       | olliej wrote:
       | So my interpretation of the neural hash approach is largely that
       | it is essentially trading a much larger number of very small
       | "neurons" vs a smaller number of floats. Given that I'd be
       | curious about what the total size difference is.
       | 
        | I could see the hash approach, at a functional level, resulting
        | in different features essentially getting a different number of
        | bits directly, which would be approximately equivalent to having
        | a NN with variable-precision floats, all in a very hand-wavy
        | way.
        | 
        | E.g. we could say a NN/NH needs N bits of information to work
        | accurately, in which case you're trading off the format of, and
        | the operations on, those N bits.
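        | 
        | (Back-of-the-envelope with made-up numbers, just to frame the
        | size question: a typical dense float embedding vs a binary
        | hash.)
        | 
        |   dims, bytes_per_float = 768, 4    # hypothetical float32 vector
        |   hash_bits = 256                   # hypothetical binary hash
        | 
        |   vector_bytes = dims * bytes_per_float   # 3072 bytes per item
        |   hash_bytes = hash_bits // 8             # 32 bytes per item
        |   print(vector_bytes // hash_bytes)       # 96x smaller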
        
       | gk1 wrote:
       | This is a rehash (pardon me) of this post from 2021:
       | https://www.search.io/blog/vectors-versus-hashes
       | 
       | The demand for vector embedding models (like those released by
       | OpenAI, Cohere, HuggingFace, etc) and vector databases (like
       | https://pinecone.io -- disclosure: I work there) has only grown
       | since then. The market has decided that vectors are not, in fact,
       | over.
        
         | packetlost wrote:
          | Pinecone seems interesting. Is the storage backend open source?
          | I've been working on a persistent hashmap database that's
          | somewhat similar (albeit not done) and that should have lower
          | RAM requirements than Bitcask (i.e. larger-than-RAM keysets).
        
       | fzliu wrote:
       | Hashes are fine, but to say that "vectors are over" is just plain
       | nonsense. We continue to see vectors as a core part of production
       | systems for entity representation and recommendation (example:
       | https://slack.engineering/recommend-api) and within models
       | themselves (example: multimodal and diffusion models). For folks
       | into metrics, we're building a vector database specifically for
       | storing, indexing, and searching across massive quantities of
       | vectors (https://github.com/milvus-io/milvus), and we've seen
       | close to exponential growth in terms of total downloads.
       | 
       | Vectors are just getting started.
        
         | PaulHoule wrote:
         | Frequently people use vectors as a hash. It's a bit like a
         | fashionista declaring clothes obsolete.
        
         | kvathupo wrote:
          | Clickbait title aside :^), I'd agree. Neural hashes seem to
          | be a promising advancement imo, but I question their impact on
         | the convergence time of AI models. In the pecking order of
         | neural network bottlenecks, I'd imagine it's not terribly
         | expensive to access training data from some database. Rather,
         | hardware considerations for improving parallelism seem to be
         | the biggest hurdle [1].
         | 
         | [1] - https://www.nvidia.com/en-us/data-center/nvlink/
        
         | jurschreuder wrote:
          | For searching on faces I also needed to find vectors in a
          | database.
          | 
          | I used random projection hashing to increase the search speed,
          | because you can just match directly (or at least narrow down
          | the search) instead of calculating the Euclidean distance for
          | each row.
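          | 
          | (Roughly what that looks like, assuming the embeddings already
          | exist; the sizes are invented. Bucket by the sign pattern of a
          | few random projections, and only compute Euclidean distance
          | inside the matching bucket.)
          | 
          |   import numpy as np
          |   from collections import defaultdict
          | 
          |   rng = np.random.default_rng(0)
          |   dim, n_planes = 128, 10              # 10 bits, ~1024 buckets
          |   planes = rng.normal(size=(dim, n_planes))
          | 
          |   def bucket_key(v):
          |       # the sign of each projection gives one bit of the key
          |       return tuple(v @ planes > 0)
          | 
          |   db = rng.normal(size=(5000, dim))    # stand-in for embeddings
          |   index = defaultdict(list)
          |   for i, v in enumerate(db):
          |       index[bucket_key(v)].append(i)
          | 
          |   def search(q):
          |       # exact distance only for rows in q's bucket
          |       cand = index.get(bucket_key(q), [])
          |       def dist(i):
          |           return np.linalg.norm(db[i] - q)
          |       return min(cand, key=dist, default=None)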
        
         | gauddasa wrote:
         | True. The title is just clickbait and what we find inside is
         | suggestions for dimensionality reduction by a person who
         | appears to be on the verge of reinventing autoencoders
         | disguised as neural hashes. Is it a mere coincidence that the
         | article fails to mention autoencoders?
        
       | aaaaaaaaaaab wrote:
        | Phew, I thought you wanted to ditch std::vector for hash maps!
        
       | PLenz wrote:
       | Hashes are just short, constrained membership vectors
        
       | robotresearcher wrote:
       | A state vector can represent a point in the state space of
       | floating-point representation, a point in the state space of a
       | hash function, or any other discrete space.
       | 
       | Vectors didn't go anywhere. The article is discussing which
       | function to use to interpret a vector.
       | 
       | Is there a special meaning of 'vector' here that I am missing? Is
       | it so synonymous in the ML context with 'multidimensional
       | floating point state space descriptor' that any other use is not
       | a vector any more?
        
         | Firmwarrior wrote:
         | The title probably makes a lot more sense in the context of
         | where it was originally posted
         | 
         | I was as confused and annoyed as you were, though, since I
         | don't have a machine learning background
        
       | cratermoon wrote:
       | And then there's this:
       | https://news.ycombinator.com/item?id=33125640
        
       ___________________________________________________________________
       (page generated 2022-10-07 23:00 UTC)