[HN Gopher] Vectors are over, hashes are the future
___________________________________________________________________

  Vectors are over, hashes are the future

  Author : jsilvers
  Score  : 98 points
  Date   : 2022-10-07 16:59 UTC (6 hours ago)

  (HTM) web link (www.algolia.com)
  (TXT) w3m dump (www.algolia.com)

| nelsondev wrote:
| Seems the author is proposing LSH instead of vectors for doing
| ANN?
|
| There are benchmarks here, http://ann-benchmarks.com/ , but LSH
| underperforms state-of-the-art ANN algorithms like HNSW on
| recall/throughput.
|
| LSH was, I believe, state of the art around ten years ago, but it
| has since been surpassed. The caching aspect is really nice,
| though.

  | kvathupo wrote:
  | To elaborate on Noe's comment, the article is suggesting the
  | use of LSH where the hashing function is learned by a neural
  | network, such that similar vectors correspond to similar
  | hashes under Hamming distance (whilst enforcing some load
  | factor). In effect, a good hash is generated by a neural
  | network. It appears Elastiknn chooses the hash function a
  | priori? Not sure, not my area of knowledge.
  |
  | This approach seems feasible tbh. For example, a stock's
  | historical bids/asks probably don't deviate greatly from month
  | to month. That said, the generation of a good hash depends on
  | the stock ticker, and a human doesn't have the time to find a
  | good one for every stock at scale.

  | Noe2097 wrote:
  | LSH is a _technique_, whose performance largely depends on the
  | hashing function and on how that function enables neighborhood
  | exploration.
  |
  | It might not be trendy, but that doesn't mean it can't work as
  | well as, or better than, HNSW. It all depends on the hashing
  | function you come up with.

    | a-dub wrote:
    | when combined with minhashing it approximates jaccard
    | similarity, so it seems it would be bounded by that.

  | a-dub wrote:
  | 10? no, it's more like 20+. lsh was a core piece of the google
  | crawler. it was used for high-performance fuzzy deduplication.
  |
  | see ullman's text: mining massive datasets. it's free on the
  | web.

    | johanvts wrote:
    | I think LSH was only introduced in '99 by Indyk et al. I
    | would say it was a pretty active research area 10 years ago.

      | a-dub wrote:
      | right, but massive-scale production use in the google
      | crawler to index the entire internet, when that was at the
      | bleeding edge, was state of the art before the art was
      | even really recognized as an art.
      |
      | i don't even think they called it ANN. it was high-
      | performance, scalable deduplication. (which is, in fact,
      | just fast/scalable lossy clustering)
      |
      | collaborative filtering was kind of a cute joke at the
      | time. meanwhile they had lsh, in production, actually
      | deduplicating the internet.

  | molodec wrote:
  | It is true that HNSW outperforms LSH on recall and throughput,
  | but for some use cases LSH outperforms HNSW. Just this week I
  | deployed to prod a new system for streaming clustering of
  | short text using LSH, built on algorithms from this crate that
  | I also wrote: https://github.com/serega/gaoya
  |
  | An HNSW index is slow to construct, so it is best suited for
  | search or recommendation engines where you build the index
  | once and then serve it. For workloads where you continuously
  | mutate the index, like streaming clustering/deduplication, LSH
  | outperforms HNSW.

| whycombinetor wrote:
| The article's 0.65 vs 0.66 float64 example doesn't indicate much,
| since neither 0.65 nor 0.66 has a terminating representation in
| base 2...
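
A quick sketch of the point above (illustrative Python, not from
the article): Decimal(float) reveals the exact value a float64
actually stores.

    from decimal import Decimal

    # Neither 0.65 nor 0.66 terminates in base 2, so float64 keeps
    # the nearest representable binary fraction instead.
    print(Decimal(0.65))        # the stored value; not exactly 0.65
    print(Decimal(0.66))        # the stored value; not exactly 0.66
    print(0.66 - 0.65 == 0.01)  # False: the rounding errors differ

Any comparison between such constants is therefore already a
comparison between approximations.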
| whatever1 wrote:
| Omg, NN "research" is just heuristics on top of heuristics on top
| of mumbo jumbo.
|
| Hopefully someone who knows math will enter the field one day,
| build the theoretical basis for all this mess, and allow us to
| make real progress.

  | auraham wrote:
  | Old post by Yann LeCun [1]:
  |
  | > But another important goal is inventing new methods, new
  | techniques, and yes, new tricks. In the history of science and
  | technology, the engineering artifacts have almost always
  | preceded the theoretical understanding: the lens and the
  | telescope preceded optics theory, the steam engine preceded
  | thermodynamics, the airplane preceded flight aerodynamics,
  | radio and data communication preceded information theory, the
  | computer preceded computer science.
  |
  | [1] https://www.reddit.com/r/MachineLearning/comments/7i1uer/n_y...

| sramam wrote:
| (I know nothing about the area.)
|
| Am I incorrect in thinking we are headed toward future AIs that
| jump to conclusions? Or is it just my "human neural hash" being
| triggered in error?!

  | [deleted]

| mrkeen wrote:
| > The analogy here would be the choice between a 1 second flight
| to somewhere random in the suburb of your choosing in any city in
| the world versus a 10 hour trip putting you at the exact house
| you wanted in the city of your choice.
|
| Wouldn't the first part of the analogy actually be: a 1 second
| flight that will probably land at your exact destination, but
| could potentially land you anywhere on earth?

| steve76 wrote:

| olliej wrote:
| My interpretation of the neural hash approach is that it is
| essentially trading a much larger number of very small "neurons"
| for a smaller number of floats. Given that, I'd be curious what
| the total size difference is.
|
| I could see the hash approach, at a functional level, resulting
| in different features getting a different number of bits
| directly, which would be approximately equivalent to a NN with
| variable-precision floats, all in a very hand-wavy way.
|
| E.g., we could say a NN/NH needs N bits of information to work
| accurately, in which case you're just trading the format of, and
| the operations on, those N bits.

| gk1 wrote:
| This is a rehash (pardon me) of this post from 2021:
| https://www.search.io/blog/vectors-versus-hashes
|
| The demand for vector embedding models (like those released by
| OpenAI, Cohere, HuggingFace, etc.) and vector databases (like
| https://pinecone.io -- disclosure: I work there) has only grown
| since then. The market has decided that vectors are not, in
| fact, over.

  | packetlost wrote:
  | Pinecone seems interesting. Is the storage backend open
  | source? I've been working on a persistent hashmap database
  | that's somewhat similar (albeit not done) and should have
  | lower RAM requirements than Bitcask (i.e., larger-than-RAM
  | keysets).

| fzliu wrote:
| Hashes are fine, but to say that "vectors are over" is just plain
| nonsense. We continue to see vectors as a core part of production
| systems for entity representation and recommendation (example:
| https://slack.engineering/recommend-api) and within models
| themselves (example: multimodal and diffusion models). For folks
| into metrics: we're building a vector database specifically for
| storing, indexing, and searching across massive quantities of
| vectors (https://github.com/milvus-io/milvus), and we've seen
| close to exponential growth in terms of total downloads.
|
| Vectors are just getting started.

  | PaulHoule wrote:
  | Frequently people use vectors as a hash. It's a bit like a
  | fashionista declaring clothes obsolete.

  | kvathupo wrote:
  | Click-bait title aside : ^ ), I'd agree. Neural hashes seem to
  | be a promising advancement imo, but I question their impact on
  | the convergence time of AI models. In the pecking order of
  | neural network bottlenecks, I'd imagine it's not terribly
  | expensive to access training data from some database. Rather,
  | hardware considerations for improving parallelism seem to be
  | the biggest hurdle [1].
  |
  | [1] - https://www.nvidia.com/en-us/data-center/nvlink/

  | jurschreuder wrote:
  | For searching on faces I also needed to find vectors in a
  | database.
  |
  | I used random-projection hashing to increase the search speed,
  | because you can match directly (or at least narrow down the
  | search) instead of calculating the Euclidean distance for each
  | row.
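
A minimal sketch of that random-projection hashing idea
(illustrative Python with NumPy; the dimensions, bit count, and
data are made up): each random hyperplane contributes one bit of
the hash, so nearby vectors tend to land in the same bucket and
can be matched by lookup instead of a full Euclidean-distance
scan.

    from collections import defaultdict

    import numpy as np

    rng = np.random.default_rng(42)
    dim, n_bits = 128, 16  # vector size, hash length in bits

    # One random hyperplane per hash bit.
    planes = rng.normal(size=(n_bits, dim))

    def hash_vector(v):
        # Each bit records which side of a hyperplane v falls on.
        return tuple(bool(b) for b in (planes @ v) > 0)

    # Bucket every vector by its hash; lookup replaces a full scan.
    data = rng.normal(size=(1000, dim))
    index = defaultdict(list)
    for i, v in enumerate(data):
        index[hash_vector(v)].append(i)

    query = data[0] + 0.01 * rng.normal(size=dim)  # near-duplicate of row 0
    print(index[hash_vector(query)])  # very likely includes row 0

In practice you would use several independent hash tables (or
probe neighboring buckets at a small Hamming distance) to trade
memory for recall -- which is the recall/throughput trade-off the
LSH-versus-HNSW comparison earlier in the thread is about.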
  | gauddasa wrote:
  | True. The title is just clickbait, and what we find inside are
  | suggestions for dimensionality reduction by a person who
  | appears to be on the verge of reinventing autoencoders,
  | disguised as neural hashes. Is it a mere coincidence that the
  | article fails to mention autoencoders?

| aaaaaaaaaaab wrote:
| Phew, I thought you wanted to ditch std::vector for hash maps!

| PLenz wrote:
| Hashes are just short, constrained membership vectors.

| robotresearcher wrote:
| A state vector can represent a point in the state space of a
| floating-point representation, a point in the state space of a
| hash function, or any other discrete space.
|
| Vectors didn't go anywhere. The article is discussing which
| function to use to interpret a vector.
|
| Is there a special meaning of 'vector' here that I am missing?
| Is it so synonymous in the ML context with 'multidimensional
| floating-point state space descriptor' that any other use is not
| a vector any more?

  | Firmwarrior wrote:
  | The title probably makes a lot more sense in the context of
  | where it was originally posted.
  |
  | I was as confused and annoyed as you were, though, since I
  | don't have a machine learning background.

| cratermoon wrote:
| And then there's this:
| https://news.ycombinator.com/item?id=33125640
___________________________________________________________________
(page generated 2022-10-07 23:00 UTC)