[HN Gopher] Billion-Scale Approximate Nearest Neighbor Search [pdf]
       ___________________________________________________________________
        
       Billion-Scale Approximate Nearest Neighbor Search [pdf]
        
       Author : tim_sw
       Score  : 105 points
       Date   : 2023-05-04 12:44 UTC (2 days ago)
        
 (HTM) web link (wangzwhu.github.io)
 (TXT) w3m dump (wangzwhu.github.io)
        
       | victor106 wrote:
        | Having been in software development for the past 15 years, I
        | can't understand any of this. What are some good resources to
        | get up to speed on this stuff?
        
         | WinLychee wrote:
         | This series of articles from Pinecone seems accessible
         | https://www.pinecone.io/learn/vector-indexes/ . Stepping back a
         | moment, the problem "I want to find the top-k most similar
         | items to a given item from a collection" comes up in a variety
         | of domains, and a common approach is nearest-neighbor search.
         | Depending on the number of dimensions and the number of items,
         | you may need tricks to speed up the search. In recent years
         | we've seen an increase in scale and invention of some new
         | tricks to speed this up.
         | 
          | In a bit more detail, the idea is to convert your data into
          | vectors embedded in some d-dimensional space and compute the
          | distance between a query vector and all other vectors in the
          | space. This is an O(dN) computation. ANN covers a number of
          | techniques that give a speedup by reducing the number of
          | required comparisons.
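          | 
          | A minimal numpy sketch of that exact brute-force search, just
          | to make the O(dN) cost concrete (array names and sizes here
          | are illustrative):
          | 
          |     import numpy as np
          | 
          |     def knn_bruteforce(xq, xb, k=10):
          |         # squared L2 distance to every vector: O(d*N)
          |         dists = ((xb - xq) ** 2).sum(axis=1)
          |         # take the k smallest, then order them by distance
          |         idx = np.argpartition(dists, k)[:k]
          |         return idx[np.argsort(dists[idx])]
          | 
          |     xb = np.random.rand(100_000, 128).astype("float32")
          |     xq = np.random.rand(128).astype("float32")
          |     top10 = knn_bruteforce(xq, xb, k=10)
          | 
          | ANN indexes (IVF, HNSW, LSH, etc.) avoid touching every row
          | of xb for every query, which is where the speedup comes from.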
        
         | shoo wrote:
         | see also - the section titled "Distance function" from
         | https://en.wikipedia.org/wiki/Curse_of_dimensionality
         | 
         | > One way to illustrate the "vastness" of high-dimensional
         | Euclidean space is to compare the proportion of an inscribed
         | hypersphere with radius r and dimension d, to that of a
          | hypercube with edges of length 2r. [...] As the dimension d
         | of the space increases, the hypersphere becomes an
         | insignificant volume relative to that of the hypercube
         | 
          | E.g. in two dimensions, the ratio of the area of a circle of
          | radius r=1 to the area of a square of side length 2r is pi / 4
          | (approx. 0.785). As the number of dimensions becomes large,
          | this ratio drops toward zero.
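          | 
          | A quick sketch in Python of that ratio for a few values of d,
          | using the standard formula for the volume of a d-ball (the
          | specific dimensions printed are just examples):
          | 
          |     from math import pi, gamma
          | 
          |     def ball_to_cube_ratio(d):
          |         # volume of the unit d-ball divided by the volume
          |         # of the enclosing hypercube of side length 2
          |         ball = pi ** (d / 2) / gamma(d / 2 + 1)
          |         return ball / 2.0 ** d
          | 
          |     for d in (2, 3, 10, 100):
          |         print(d, ball_to_cube_ratio(d))
          |     # 2 -> ~0.785, 3 -> ~0.524, 10 -> ~0.0025, 100 -> ~2e-70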
         | 
          | There's also some discussion of how using distance functions to
          | measure distances from reference points becomes less useful as
          | the number of dimensions of the space increases -- all
          | distances become similar -- although
         | 
         | > research has shown this to only hold in the artificial
         | scenario when the one-dimensional distributions are independent
         | and identically distributed.
        
         | esafak wrote:
         | It crops up in machine learning.
         | 
         | https://en.wikipedia.org/wiki/Nearest_neighbor_search
        
         | tysam_and wrote:
          | High-dimensional nearest neighbor search scales very poorly
          | with the number of dimensions and/or examples IIRC, so
          | approximate methods give the best time/value tradeoffs.
         | 
         | Oftentimes the applications don't require a perfect match
         | anyways, just something 'close enough'.
         | 
          | Because the data lives in high dimensions, and items are often
          | related to each other, you end up with vast 'sparse' gaps
          | where essentially no data lives.
          | 
          | That's what lets approximate methods run fast and gives them a
          | good information-per-FLOP tradeoff compared to the full
          | exhaustive search.
        
       | convexstrictly wrote:
       | Video: https://www.youtube.com/watch?v=iI8e3kU11eU
        
       | swaraj wrote:
        | Is this different from Pinecone?
        
         | beoberha wrote:
          | From what I understand, Pinecone is a wrapper around a lot of
          | this. It uses these libraries under the hood (or its own
          | implementations) to store your vectors and perform search over
          | them for you, all in a nice managed PaaS offering.
        
       | ianbutler wrote:
        | Nice, this will get you acquainted with many of the methods in
        | current use. As my small addition to the list of things you can
        | learn: https://ai.googleblog.com/2020/07/announcing-scann-
        | efficient... https://github.com/google-research/google-
        | research/tree/mast...
       | 
        | Google released SCaNN a few years ago, and back then it wasn't
        | implemented in many tools like Milvus etc. Maybe it is now; it's
        | been a few years since I needed a scalable vector DB. I believe
        | it's still state of the art, and they use a novel approach to
        | quantization, which was the key intuition.
       | 
       | I didn't see it included in the PDF from a quick scan (hah!)
        
         | bravura wrote:
         | I remember SCaNN. Are there easy to use implementations
         | available now?
         | 
         | You also seem unsure.
         | 
         | Personally, I love a cheat-sheet like in Slide 7 of the
         | presentation. It's a game-tree for what ANN _implementation_ to
          | use, given your problem setting and the assumption you don't
         | want to do groundbreaking ANN research; rather, you have an ANN
         | problem you need to solve on the path to your actual problem.
         | Maybe that's why SCaNN wasn't cited?
         | 
         | A quick google reveals that SCaNN is available on pypi from
         | Google: https://pypi.org/project/scann/ Do any users have any
         | feedback or experiences to share?
         | 
          | [edit: GPT4, comparing hnswlib and SCaNN: "Note that SCaNN
         | recently ceased to be supported by Google Research, and there
         | won't be any further improvements, bug fixes, or documentation
         | updates from them. However, you can still use and get support
         | from the community. This recent change in support could shift
         | preference towards hnswlib for future applications."]
        
           | ianbutler wrote:
           | GPT4 is only trained on data up to 2021, and the library has
           | commits as recent as two months ago. So I think that may be a
           | hallucination. A quick google also reveals nothing like that.
           | 
            | It's been a while, but I also remember HNSW having
            | undesirable characteristics for large-scale vector search
            | (compared to other methods), so I'm not sure there.
           | 
           | I did do a quick search and saw there are methods that make
           | it more suitable, but again it's been 3 years now since I've
           | used anything in this area.
           | 
           | Edit: I got curious, https://ann-benchmarks.com/
           | 
            | So I was misremembering a bit; HNSW is pretty good overall
            | depending on your implementation!
        
           | sgt101 wrote:
           | > I remember SCaNN. Are there easy to use implementations
           | available now?
           | 
           | Yes, on gcp in vertex AI.
           | 
           | GPT4 is talking utter horseshit.
           | 
           | https://cloud.google.com/vertex-ai/docs/matching-
           | engine/ann-...
        
           | jhj wrote:
           | SCaNN (the paper) is roughly two different things:
           | 
            | 1. a SIMD-optimized form of product quantization (PQ), where
            | the code-to-distance lookup can be performed in SIMD
            | registers
           | 
           | 2. anisotropic quantization to bias the database towards
           | returning better maximum inner product search (MIPS)
           | candidates, versus usual quantization (such as PQ) that aims
           | to minimize compressed vector reconstruction error. In MIPS
           | it is much more likely that the query data set may be of a
           | completely different distribution than the database vectors.
           | 
            | If your application needs L2 lookup rather than MIPS (which
            | does not admit a metric), then only the PQ part is relevant.
            | For cosine similarity, you can get that by normalizing all
            | vectors to the surface of a hypersphere, in which case the
            | ranking has the same order as L2 (see "L2 normalized
            | Euclidean distance" on
            | https://en.wikipedia.org/wiki/Cosine_similarity ).
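            | 
            | A small sketch of that normalize-then-L2 recipe with the
            | Faiss Python bindings (dataset shapes are made up; this is
            | just the standard flat index, nothing SCaNN-specific):
            | 
            |     import faiss
            |     import numpy as np
            | 
            |     d = 128
            |     xb = np.random.rand(10_000, d).astype("float32")
            |     xq = np.random.rand(5, d).astype("float32")
            | 
            |     # normalize rows to unit length, in place
            |     faiss.normalize_L2(xb)
            |     faiss.normalize_L2(xq)
            | 
            |     # exact L2 search; on unit-norm vectors the ranking
            |     # matches the ranking by cosine similarity
            |     index = faiss.IndexFlatL2(d)
            |     index.add(xb)
            |     D, I = index.search(xq, 10)  # top-10 ids per query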
           | 
            | SCaNN is implemented in Faiss CPU. On the GPU, the fast PQ
            | part is less relevant due to the greater register set size
            | and the throughput-optimized (rather than latency-optimized)
            | nature of the hardware, but the GPU version is more geared
            | towards batch lookup in any case.
           | 
           | https://github.com/facebookresearch/faiss/wiki/Fast-
           | accumula...
           | 
           | https://github.com/facebookresearch/faiss/wiki/Indexing-1M-v.
           | ..
           | 
            | We have not found the anisotropic quantization part to be
            | that useful for large datasets, but results may vary. Graph-
            | based ANN techniques tend to be better than IVF-based
            | strategies in many cases for these smaller datasets.
           | 
           | (I'm the author of GPU Faiss)
        
             | chuckcode wrote:
              | Thanks for the details! A few follow-up questions:
             | 
             | - I've seen neural nets using int8 for matrix
             | multiplication to reduce memory size [1]. Do you think
             | something similar could be useful in the ANN space?
             | 
             | - Do you know of any studies using Faiss looking at
              | speed/cost tradeoffs of RAM vs flash vs disk for storage?
             | 
             | - Are there recommended ways to update Faiss index with
             | streaming data, e.g. updating the vectors continuously?
             | 
              | It seems like there are more and more use cases for Faiss
              | as neural nets become core to workflows. I'd like to try to
              | figure out configurations that are optimized to minimize
              | carbon usage in addition to latency and recall metrics.
             | 
             | [1] https://arxiv.org/abs/2208.07339
             | 
             | (edit for formatting)
        
               | jhj wrote:
               | Regarding reduced precision, depending upon what you are
               | trying to do, I think it doesn't work quite as well in
               | similarity search as it does for, say, neural networks.
               | 
                | If you are concerned about recall of the true nearest
                | neighbor (k=1): in many datasets I've seen (especially
                | large ones), the float32 distances from a query vector
                | to the candidate nearest neighbor vectors may differ by
                | only some thousands of ULPs when performing brute-force
                | search. Done in float16, that would make the true
                | nearest neighbor indistinguishable from (or perhaps even
                | ranked behind, due to rounding error) other proximate
                | vectors. If you are performing approximate lookup and
                | have the luxury of reranking (you store the compressed /
                | approximate index for lookup, but return a larger
                | candidate set like k=100 or k=1000 and refine the
                | results using true distances computed by brute force
                | from the uncompressed vectors, so you have to keep all
                | the original vectors around), then this problem can go
                | away.
               | 
                | If, however, you are looking at recall@100 (is the true
                | nearest neighbor reported within the top k=100) or set
                | intersection (of the k=100 approximate nearest neighbors,
                | how much overlap is there with the true set of the 100
                | nearest neighbors), then this doesn't matter as much.
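                | 
                | For reference, those two metrics can be computed along
                | these lines (a tiny sketch; the names are made up here):
                | 
                |     def recall_at_k(true_nn, approx_ids):
                |         # is the true NN within the returned top-k?
                |         return float(true_nn in approx_ids)
                | 
                |     def topk_overlap(true_ids, approx_ids):
                |         # fraction of the true top-k recovered by
                |         # the approximate top-k
                |         inter = set(true_ids) & set(approx_ids)
                |         return len(inter) / len(true_ids)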
               | 
               | Certainly a lot of the options in the Faiss library are
               | geared towards compression and quantization anyways
               | (e.g., storing a billion high-dimensional vector dataset
               | in 8/16 GB of memory) so this is a tradeoff as with
               | everything else.
               | 
                | In Faiss there are the scalar quantizer indexes, which
                | do store vectors in int4/int8 etc., and for which int8
                | GEMM (accumulating in int32) would be great, but using
                | it would require that the query vector itself be
                | quantized to int4/int8 as well. This is the difference
                | between "asymmetric distance computation" (ADC), where
                | you compute the distance between int4-encoded vectors
                | (or product-quantized vectors, etc.) and a float32
                | query vector by reconstructing the database vectors in
                | floating point and comparing in floating point, and
                | symmetric distance computation, where you have to
                | convert the query vector to int4 (say) and compare in
                | the quantized regime. ADC tends to work a lot better
                | than symmetric computation, which is why we don't use
                | pure int8 GEMM, but maybe in many applications (NN
                | inference, say, instead of image database search) the
                | non-ADC comparison would be OK.
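                | 
                | A toy numpy sketch of the asymmetric vs. symmetric
                | comparison for a uniform int8 scalar quantizer (a
                | simplification of what Faiss actually does, just to
                | show the idea; the data range is assumed):
                | 
                |     import numpy as np
                | 
                |     lo, hi = -1.0, 1.0        # assumed data range
                |     scale = (hi - lo) / 255.0
                | 
                |     def encode(x):            # float -> uint8 codes
                |         q = np.round((x - lo) / scale)
                |         return np.clip(q, 0, 255).astype(np.uint8)
                | 
                |     def decode(c):            # uint8 codes -> float
                |         return c * scale + lo
                | 
                |     xb = np.random.uniform(lo, hi, (1000, 64))
                |     xq = np.random.uniform(lo, hi, 64)
                |     codes = encode(xb)
                | 
                |     # ADC: reconstruct the database vectors and
                |     # compare against the original float query
                |     adc = ((decode(codes) - xq) ** 2).sum(axis=1)
                | 
                |     # symmetric: quantize the query too and compare
                |     # entirely in the quantized regime
                |     xq_q = decode(encode(xq))
                |     sdc = ((decode(codes) - xq_q) ** 2).sum(axis=1)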
        
         | fzliu wrote:
         | SCaNN is coming to Milvus very soon (we've seen some demand
         | from the community). Stay tuned.
        
         | eternalban wrote:
         | both are 2020.
        
       | eachro wrote:
        | I see HN pieces on SIMD optimizations for numerical computing
        | every so often. Are these the sort of optimizations that a
        | hobbyist with, say, a normal laptop (either a MacBook Air or a
        | consumer-grade ThinkPad) can actually end up tinkering with as
        | well? Or am I going to have to rent some AWS machine running
        | some specific architecture to be able to mess around with this?
        
         | boulos wrote:
         | Absolutely. Most development work for these things is just fine
         | on a laptop.
         | 
         | Each processor family has its own supported SIMD instructions
         | though. For a long time, SSE / AVX / AVX2 were the only game in
          | town on x86. AVX-512, introduced in Skylake, was _primarily_
          | available only on server parts, was a mixed bag, and didn't
          | have support on AMD machines until just recently (with Genoa /
          | Zen4).
         | 
         | A modern Mac laptop with an M-series chip will have ARM NEON
         | vector instructions, just like on an iPhone or similar. There
         | is a new-ish ARM vector instruction set called Scalable Vector
         | Extensions (SVE) that Apple doesn't support, but Graviton 3
         | does. But generally improvements on your laptop _often_
         | translate onto the server side, and you can definitely test
         | correctness.
         | 
         | There are a few least-common denominator SIMD libraries out
         | there, but realistically for many problems, you'll get the most
         | mileage from writing some intrinsics yourself for your target
         | platform.
        
         | nwoli wrote:
         | Basically all CPUs have simd nowadays. You can even make use of
         | it in wasm/the web. (Though currently not supported on iOS.)
         | Just compile with 'emcc -msimd128 main.c'
        
         | lmeyerov wrote:
          | Yes, regular computers have had SIMD for decades. Ex: Intel had
          | SSE and now AVX. Fancier servers will have the same, except
          | maybe wider and with some other tricks.
         | 
         | If you go down the vectorization path, unless you are a crufty
         | CPU DB vendor, I'd skip ahead to GPU for the same. Ecosystem is
         | built to do the same but faster and easier. CPU makers are
         | slowly changing to look more like GPUs, so just skip ahead...
        
         | sebasv_ wrote:
         | If you're asking how widely available SIMD is, it has been
         | common in consumer hardware for 2 decades. To perform SIMD
         | instructions manually you will need a compiled language that
          | supports it, like Rust or C. But the compiler can often emit it
          | for you automatically as an optimization (auto-vectorization).
        
         | kzrdude wrote:
         | This is a Rust crate doc, but it's a nice place to browse for
         | different cpu features and what they unlock:
         | https://docs.rs/target-features/latest/target_features/docs/...
         | 
         | If you go to the x86 cpu list it will tell you what features
         | are enabled if you would optimize for a particular one
         | https://docs.rs/target-features/latest/target_features/docs/...
         | 
         | On linux you can run lscpu to check what cpu features are
         | detected.
        
         | perfmode wrote:
         | I use optimized SIMD instructions on iOS to do on-device linear
          | algebra for the cooking app I work on as a hobby.
         | 
         | https://theflavor.app/
        
           | manx wrote:
           | I only see a landing page with missing images (on mobile).
           | How do I use it?
        
       | nwoli wrote:
        | Spotify uses random hyperplanes to produce a hash where each bit
        | is the sign of the dot product of the query with one of the
        | hyperplanes (i.e. "is it above or below" that hyperplane). Works
        | really well.
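        | 
        | A tiny numpy sketch of that signed-random-projection hash (the
        | dimension and number of hyperplanes here are just examples, not
        | Spotify's actual settings):
        | 
        |     import numpy as np
        | 
        |     d, n_bits = 128, 16
        |     rng = np.random.default_rng(0)
        |     planes = rng.standard_normal((n_bits, d))
        | 
        |     def hash_vector(v):
        |         # one bit per hyperplane: which side v falls on
        |         return (planes @ v > 0).astype(np.uint8)
        | 
        |     # vectors pointing in similar directions share most
        |     # bits, so Hamming distance between hashes roughly
        |     # tracks the angle between the original vectors
        |     h = hash_vector(rng.standard_normal(d))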
        
         | AndrewKemendo wrote:
         | Is there a blog post or something on this?
        
       ___________________________________________________________________
       (page generated 2023-05-06 23:00 UTC)