[HN Gopher] Billion-Scale Approximate Nearest Neighbor Search [pdf]
___________________________________________________________________

Billion-Scale Approximate Nearest Neighbor Search [pdf]

Author : tim_sw
Score  : 105 points
Date   : 2023-05-04 12:44 UTC (2 days ago)

(HTM) web link (wangzwhu.github.io)
(TXT) w3m dump (wangzwhu.github.io)

| victor106 wrote:
| I've been in software development for the past 15 years and I
| can't understand any of this. What are some good resources to
| get up to speed on this stuff?
| WinLychee wrote:
| This series of articles from Pinecone seems accessible:
| https://www.pinecone.io/learn/vector-indexes/ . Stepping back a
| moment, the problem "I want to find the top-k most similar
| items to a given item from a collection" comes up in a variety
| of domains, and a common approach is nearest-neighbor search.
| Depending on the number of dimensions and the number of items,
| you may need tricks to speed up the search. In recent years
| we've seen an increase in scale and the invention of some new
| tricks to speed this up.
|
| In a bit more detail, the idea is to convert your data into
| vectors embedded in some d-dimensional space and compute the
| distance between a query vector and all other vectors in the
| space. This is an O(dN) computation. ANN covers a number of
| techniques that give a speedup by letting you reduce the number
| of required comparisons.
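A minimal numpy sketch of the brute-force O(dN) baseline described
above (the dataset sizes and names here are illustrative, not from
the linked slides):

    import numpy as np

    def knn_brute_force(query, vectors, k=10):
        """Exact top-k by scanning every vector: O(d*N) per query."""
        dists = np.linalg.norm(vectors - query, axis=1)  # N distances, O(d) each
        idx = np.argpartition(dists, k)[:k]              # k smallest, unordered
        return idx[np.argsort(dists[idx])]               # sort just those k

    rng = np.random.default_rng(0)
    db = rng.standard_normal((100_000, 128)).astype(np.float32)  # N=100k, d=128
    q = rng.standard_normal(128).astype(np.float32)
    print(knn_brute_force(q, db, k=5))

Everything discussed below (IVF, PQ, HNSW, hashing) is about avoiding
that full scan.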
| shoo wrote:
| see also - the section titled "Distance function" from
| https://en.wikipedia.org/wiki/Curse_of_dimensionality
|
| > One way to illustrate the "vastness" of high-dimensional
| Euclidean space is to compare the proportion of an inscribed
| hypersphere with radius r and dimension d, to that of a
| hypercube with edges of length 2r. [...] As the dimension d of
| the space increases, the hypersphere becomes an insignificant
| volume relative to that of the hypercube
|
| E.g. in two dimensions, the ratio of the area of a circle of
| radius r to the area of a square of side length 2r is
| pi/4 ~ 0.785. As the number of dimensions becomes large, this
| ratio drops toward zero.
|
| There's also some discussion of how using distance functions to
| measure distances from reference points becomes less useful as
| the number of dimensions of the space increases -- all
| distances start to look similar -- although
|
| > research has shown this to only hold in the artificial
| scenario when the one-dimensional distributions are independent
| and identically distributed.
| esafak wrote:
| It crops up in machine learning.
| https://en.wikipedia.org/wiki/Nearest_neighbor_search
| tysam_and wrote:
| High-dimensional nearest neighbor search scales very poorly
| with the number of dimensions and/or the number of examples
| IIRC, so approximate methods offer the best time/value
| tradeoffs. Oftentimes the application doesn't require a perfect
| match anyway, just something 'close enough'.
|
| Because things live in high dimensions, and they're often
| related to each other, you get vast 'sparse' gaps where
| essentially no data lives. Hence the good speed of approximate
| methods, and a good information-per-FLOP tradeoff compared to
| the full exact search.
| convexstrictly wrote:
| Video: https://www.youtube.com/watch?v=iI8e3kU11eU
| swaraj wrote:
| Is this different from Pinecone?
| beoberha wrote:
| From what I understand, Pinecone is a wrapper around a lot of
| this. It uses these libraries under the hood (or their own
| implementations) to store and search your vectors for you, all
| in a nice managed PaaS offering.
| ianbutler wrote:
| Nice, this will get you acquainted with many of the methods
| currently in use. As my small addition to the list of things
| you can learn:
| https://ai.googleblog.com/2020/07/announcing-scann-
| efficient... https://github.com/google-research/google-
| research/tree/mast...
|
| Google released SCaNN a few years ago, and back then it wasn't
| implemented in many tools like Milvus etc. Maybe it is now;
| it's been a few years since I needed a scalable vector db. I
| believe it's still state of the art, and their novel approach
| to quantization was the key insight.
|
| I didn't see it included in the PDF from a quick scan (hah!)
| bravura wrote:
| I remember SCaNN. Are there easy-to-use implementations
| available now? You also seem unsure.
|
| Personally, I love a cheat-sheet like the one on slide 7 of the
| presentation. It's a game-tree for which ANN _implementation_
| to use, given your problem setting and the assumption that you
| don't want to do groundbreaking ANN research; rather, you have
| an ANN problem you need to solve on the path to your actual
| problem. Maybe that's why SCaNN wasn't cited?
|
| A quick google reveals that SCaNN is available on PyPI from
| Google: https://pypi.org/project/scann/ Do any users have
| feedback or experiences to share?
|
| [edit: GPT4, comparing hnswlib and SCaNN: "Note that SCaNN
| recently ceased to be supported by Google Research, and there
| won't be any further improvements, bug fixes, or documentation
| updates from them. However, you can still use and get support
| from the community. This recent change in support could shift
| preference towards hnswlib for future applications."]
| ianbutler wrote:
| GPT4 is only trained on data up to 2021, and the library has
| commits as recent as two months ago, so I think that may be a
| hallucination. A quick google also reveals nothing like that.
|
| It's been a while, but I also remember HNSW having undesirable
| characteristics for large-scale vector search (compared to
| other methods), so I'm not sure there. I did do a quick search
| and saw there are methods that make it more suitable, but again
| it's been 3 years now since I've used anything in this area.
|
| Edit: I got curious: https://ann-benchmarks.com/
|
| So I was misremembering a bit; HNSW is pretty good overall,
| depending on your implementation!
| sgt101 wrote:
| > I remember SCaNN. Are there easy-to-use implementations
| available now?
|
| Yes, on GCP in Vertex AI. GPT4 is talking utter horseshit.
| https://cloud.google.com/vertex-ai/docs/matching-
| engine/ann-...
| jhj wrote:
| SCaNN (the paper) is roughly two different things:
|
| 1. a SIMD-optimized form of product quantization (PQ), where
| the code-to-distance lookup can be performed in SIMD registers
|
| 2. anisotropic quantization, to bias the database towards
| returning better maximum inner product search (MIPS)
| candidates, versus usual quantization (such as PQ) that aims to
| minimize compressed-vector reconstruction error. In MIPS it is
| much more likely that the query data set is of a completely
| different distribution than the database vectors.
|
| If your application needs L2 lookup rather than MIPS (which
| does not admit a metric), then only the PQ part is relevant.
| For cosine similarity, you can get that by normalizing all
| vectors to the surface of a hypersphere, in which case it has
| the same ordering as L2 (see "L2 normalized Euclidean distance"
| on https://en.wikipedia.org/wiki/Cosine_similarity ).
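A minimal numpy sketch of that normalization trick (sizes here are
illustrative); for unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), which
is why the two orderings agree:

    import numpy as np

    def normalize(v):
        """Project each row onto the unit hypersphere."""
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    db = normalize(rng.standard_normal((1000, 64)))
    q = normalize(rng.standard_normal((1, 64)))

    # Ascending L2 distance and descending cosine similarity
    # produce the same ranking on unit vectors.
    l2_order  = np.argsort(np.linalg.norm(db - q, axis=1))
    cos_order = np.argsort(-(db @ q.T).ravel())
    assert (l2_order == cos_order).all()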
| SCaNN is implemented in Faiss CPU. On the GPU the fast PQ part
| is less relevant due to the larger register set and the
| throughput- (rather than latency-) optimized nature of the
| hardware, but the GPU version is more geared towards batch
| lookup in any case.
|
| https://github.com/facebookresearch/faiss/wiki/Fast-
| accumula...
|
| https://github.com/facebookresearch/faiss/wiki/Indexing-1M-
| v...
|
| We have not found the anisotropic quantization part to be that
| useful for large datasets, but results may vary. Graph-based
| ANN techniques tend to be better than IVF-based strategies in
| many cases for these smaller datasets.
|
| (I'm the author of GPU Faiss)
| chuckcode wrote:
| Thanks for the details! A few follow-up questions:
|
| - I've seen neural nets using int8 for matrix multiplication to
| reduce memory size [1]. Do you think something similar could be
| useful in the ANN space?
|
| - Do you know of any studies using Faiss that look at
| speed/cost tradeoffs of RAM vs flash vs disk for storage?
|
| - Are there recommended ways to update a Faiss index with
| streaming data, e.g. updating the vectors continuously?
|
| Seems like there are more and more use cases for Faiss as
| neural nets become core to workflows. I would like to figure
| out configurations that minimize carbon usage in addition to
| latency and recall metrics.
|
| [1] https://arxiv.org/abs/2208.07339
|
| (edit for formatting)
| jhj wrote:
| Regarding reduced precision: depending upon what you are trying
| to do, I think it doesn't work quite as well in similarity
| search as it does for, say, neural networks.
|
| If you are concerned about recall of the true nearest neighbor
| (k=1): in many datasets I've seen (especially large ones), the
| float32 distances from a query vector to candidate nearest
| neighbors may differ by only some thousands of ULPs under
| brute-force search. Done in float16, the true nearest neighbor
| would tie with (or even fall behind, due to rounding error)
| other proximate vectors. If you are performing approximate
| lookup and you have the luxury of reranking -- you store the
| compressed/approximate index for lookup, but return a larger
| candidate set like k=100 or k=1000 and refine the results with
| true distances computed from the uncompressed vectors via brute
| force, so you have to keep all the original vectors around --
| then this problem can go away.
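A sketch of that compressed-lookup-then-exact-rerank pattern using
Faiss (the index types and parameters below are illustrative choices,
not a recommendation from the comment above):

    import numpy as np
    import faiss

    d, nlist, M = 128, 1024, 16
    rng = np.random.default_rng(0)
    xb = rng.standard_normal((100_000, d)).astype(np.float32)

    # Compressed first-pass index: inverted lists + product quantization.
    quantizer = faiss.IndexFlatL2(d)
    ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M, 8)  # M codes, 8 bits each
    ivfpq.train(xb)

    # The refine wrapper also keeps the original float32 vectors and
    # re-ranks a larger candidate set with exact distances.
    index = faiss.IndexRefineFlat(ivfpq)
    index.add(xb)
    index.k_factor = 10          # fetch 10*k approximate candidates, rerank to k
    D, I = index.search(xb[:5], 10)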
| If instead you are looking at recall@100 (is the true nearest
| neighbor reported within the top k=100?) or set intersection (of
| the k=100 approximate nearest neighbors, how much overlap is
| there with the true set of the 100 nearest neighbors?), then
| this doesn't matter as much.
|
| Certainly a lot of the options in the Faiss library are geared
| towards compression and quantization anyway (e.g., storing a
| billion-vector, high-dimensional dataset in 8/16 GB of memory),
| so this is a tradeoff as with everything else.
|
| In Faiss there are the scalar quantizer indexes which do store
| vectors in int4/int8 etc., for which int8 GEMM (accumulating in
| int32) would be great, but using it would require that the
| query vector itself be quantized to int4/int8 as well. This is
| the difference between "asymmetric distance computation" (ADC),
| where you compute the distance between int4-encoded vectors (or
| product-quantized vectors, etc.) and a float32 query vector --
| reconstructing the database vectors in floating point and
| comparing in floating point -- and symmetric distance
| computation, where you have to convert the query vector to
| int4, say, and compare in the quantized regime. ADC tends to
| work a lot better than symmetric computation, so this is why we
| don't use pure int8 GEMM. But maybe in many applications (NN
| inference, say, instead of image database search) the non-ADC
| comparison would be OK.
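A toy numpy sketch of ADC with a product quantizer, to make the
asymmetry concrete (the dimensions and centroid counts are arbitrary;
real implementations such as Faiss use trained codebooks and SIMD):

    import numpy as np

    # Toy product quantizer: d=8 dims split into M=2 subspaces of 4 dims,
    # each subspace quantized to 256 centroids (one uint8 code per subspace).
    rng = np.random.default_rng(0)
    d, M, ksub = 8, 2, 256
    dsub = d // M
    centroids = rng.standard_normal((M, ksub, dsub)).astype(np.float32)

    def encode(x):
        """Database side: each vector becomes M uint8 codes."""
        codes = np.empty((len(x), M), dtype=np.uint8)
        for m in range(M):
            sub = x[:, m*dsub:(m+1)*dsub]
            dists = ((sub[:, None, :] - centroids[m][None]) ** 2).sum(-1)
            codes[:, m] = dists.argmin(1)
        return codes

    def adc_distances(q, codes):
        """Asymmetric side: the float32 query is never quantized. Per
        subspace, precompute a 256-entry table of exact query-to-centroid
        distances, then score each database vector by table lookups."""
        tables = np.stack([((q[m*dsub:(m+1)*dsub] - centroids[m]) ** 2).sum(-1)
                           for m in range(M)])        # shape (M, ksub)
        return sum(tables[m][codes[:, m]] for m in range(M))

    xb = rng.standard_normal((1000, d)).astype(np.float32)
    q = rng.standard_normal(d).astype(np.float32)
    print(adc_distances(q, encode(xb)).argmin())  # approximate nearest neighbor

Only the database side pays the compression error; the query stays
exact, which is why ADC usually beats the symmetric variant.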
| fzliu wrote:
| SCaNN is coming to Milvus very soon (we've seen some demand
| from the community). Stay tuned.
| eternalban wrote:
| Both are 2020.
| eachro wrote:
| I see HN pieces on SIMD optimizations for numerical computing
| every so often. Are these the sort of optimizations that a
| hobbyist with, say, a normal laptop (either a MacBook Air or a
| consumer-grade ThinkPad) can actually tinker with? Or am I
| going to have to rent some AWS machine and run on some specific
| architecture to be able to mess around?
| boulos wrote:
| Absolutely. Most development work for these things is just fine
| on a laptop.
|
| Each processor family has its own supported SIMD instructions,
| though. For a long time, SSE / AVX / AVX2 were the only game in
| town on x86. AVX-512, introduced in Skylake, was _primarily_
| available only on server parts, was a mixed bag, and didn't
| have support on AMD machines until just recently (with Genoa /
| Zen4).
|
| A modern Mac laptop with an M-series chip will have ARM NEON
| vector instructions, just like an iPhone or similar. There is a
| newish ARM vector instruction set called Scalable Vector
| Extensions (SVE) that Apple doesn't support, but Graviton 3
| does. Generally, improvements on your laptop _often_ translate
| to the server side, and you can definitely test correctness
| locally.
|
| There are a few least-common-denominator SIMD libraries out
| there, but realistically, for many problems you'll get the most
| mileage from writing some intrinsics yourself for your target
| platform.
| nwoli wrote:
| Basically all CPUs have SIMD nowadays. You can even make use of
| it in wasm / on the web (though currently not supported on
| iOS). Just compile with 'emcc -msimd128 main.c'
| lmeyerov wrote:
| Yes, regular computers have had SIMD for decades. Ex: Intel had
| SSE and now AVX. Fancier servers will have the same, except
| maybe wider and with some other tricks.
|
| If you go down the vectorization path, unless you are a crufty
| CPU DB vendor, I'd skip ahead to GPU for the same work. The
| ecosystem is built to do the same thing but faster and easier.
| CPU makers are slowly changing to look more like GPUs, so just
| skip ahead...
| sebasv_ wrote:
| If you're asking how widely available SIMD is: it has been
| common in consumer hardware for two decades. To issue SIMD
| instructions manually you will need a compiled language that
| supports it, like Rust or C. But the compiler can also apply it
| for you as an optimization.
| kzrdude wrote:
| This is a Rust crate doc, but it's a nice place to browse for
| different CPU features and what they unlock:
| https://docs.rs/target-features/latest/target_features/docs/...
|
| If you go to the x86 CPU list, it will tell you which features
| are enabled if you optimize for a particular one:
| https://docs.rs/target-features/latest/target_features/docs/...
|
| On Linux you can run lscpu to check which CPU features are
| detected.
| perfmode wrote:
| I use optimized SIMD instructions on iOS to do on-device linear
| algebra for the cooking app I work on as a hobby.
| https://theflavor.app/
| manx wrote:
| I only see a landing page with missing images (on mobile). How
| do I use it?
| nwoli wrote:
| Spotify uses random hyperplanes to produce a hash where each
| bit is the sign of the dot product of the query and one of the
| hyperplanes (i.e. "is it above or below" the hyperplane). Works
| really well.
| AndrewKemendo wrote:
| Is there a blog post or something on this?
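A minimal numpy sketch of the random-hyperplane hashing nwoli
describes, where each hash bit is the sign of a dot product (the
hyperplane count and dimensions below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_bits = 128, 32
    planes = rng.standard_normal((n_bits, d))   # one hyperplane per hash bit

    def simhash(v):
        """Each bit records which side of a hyperplane v falls on."""
        bits = (planes @ v) > 0
        return int(np.packbits(bits.astype(np.uint8)).view(np.uint32)[0])

    # Vectors with a small angle between them tend to agree on most bits:
    a = rng.standard_normal(d)
    b = a + 0.1 * rng.standard_normal(d)        # small perturbation of a
    print(bin(simhash(a) ^ simhash(b)).count("1"), "differing bits of", n_bits)

Bucketing vectors by this code (or prefixes of it) yields candidate
sets far smaller than the full database, which is the speedup.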