[HN Gopher] Solr's Dense Vector Search for indexing and searchin...
       ___________________________________________________________________
        
       Solr's Dense Vector Search for indexing and searching dense
       numerical vectors
        
       Author : kordlessagain
       Score  : 84 points
       Date   : 2022-09-05 15:24 UTC (7 hours ago)
        
 (HTM) web link (solr.apache.org)
 (TXT) w3m dump (solr.apache.org)
        
       | lovelearning wrote:
        | A much-awaited enhancement. It saves the trouble of deploying
        | a separate vector DB like Milvus.
        | 
        | I don't like the query syntax, though. A more developer-
        | friendly indexing+query flow may be possible: vectorize fields
        | and queries transparently using a lib like DL4J running in the
        | same JVM. That could further simplify both app development and
        | deployment.
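        | 
        | The Solr 9 syntax in question looks roughly like this (a
        | sketch; field and type names here are illustrative, not from
        | the docs):
        | 
        |       <!-- schema: a dense vector field type and field -->
        |       <fieldType name="knn_vector" class="solr.DenseVectorField"
        |                  vectorDimension="4" similarityFunction="cosine"/>
        |       <field name="vector" type="knn_vector" indexed="true" stored="true"/>
        | 
        |       # query: top-10 nearest neighbors to a literal vector
        |       q={!knn f=vector topK=10}[0.12, -0.31, 0.44, 0.08]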
        
       | lmeyerov wrote:
        | Can this do something like a 100M+ index on a single node?
        | 
        | It seems like all the VC-funded OSS options are targeting more
        | like 1M rows per server, which doesn't really make sense for
        | most of our use cases.
        
       | QuadmasterXLII wrote:
       | Question: I have ~10,000 128 element query vectors, and want to
       | find the nearest neighbor (cosine similarity) for each of them in
       | a dataset of ~1,000,000 target vectors. I can do this using brute
       | force search on a GPU in a few minutes, which is fast but still a
       | serious bottleneck for me. Is this an appropriate size of dataset
       | and task for acceleration with some sort of vector database or
       | algorithm more intelligent than brute force search?
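        | 
        | (For reference, the brute-force GPU baseline described above
        | looks roughly like this PyTorch sketch; names and block size
        | are illustrative.)
        | 
        |       import torch
        |       import torch.nn.functional as F
        | 
        |       # 10k queries, 1M targets, 128 dims, as described above;
        |       # unit-normalized rows make dot product == cosine similarity
        |       Q = F.normalize(torch.randn(10_000, 128, device="cuda"), dim=1)
        |       A = F.normalize(torch.randn(1_000_000, 128, device="cuda"), dim=1)
        | 
        |       best = torch.empty(Q.shape[0], dtype=torch.long, device="cuda")
        |       for i in range(0, Q.shape[0], 256):       # block to bound memory
        |           sims = Q[i:i + 256] @ A.T             # (256, 1M) similarities
        |           best[i:i + 256] = sims.argmax(dim=1)  # nearest target per query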
        
         | ianbutler wrote:
          | Use an approximate method like faiss, then do exact cosine
          | similarity on the results of that.
          | 
          | Short answer: most of these databases use some type of
          | precomputation to make approximate nearest neighbor search
          | faster. HNSW[0], FAISS[1], ScaNN[2], etc. are all methods for
          | approximate nearest neighbors that use different techniques
          | to speed up that approximation. For your use case it will
          | likely result in a speedup.
         | 
         | [0] https://www.pinecone.io/learn/hnsw/ [1]
         | https://engineering.fb.com/2017/03/29/data-infrastructure/fa...
          | [2] https://ai.googleblog.com/2020/07/announcing-scann-
          | efficient...
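          | 
          | A minimal faiss sketch of that flow, assuming float32
          | vectors sized as in the question (index parameters are
          | illustrative):
          | 
          |       import numpy as np
          |       import faiss
          | 
          |       d = 128
          |       targets = np.random.rand(1_000_000, d).astype("float32")
          |       queries = np.random.rand(10_000, d).astype("float32")
          | 
          |       # Unit-normalize so inner product == cosine similarity.
          |       faiss.normalize_L2(targets)
          |       faiss.normalize_L2(queries)
          | 
          |       # HNSW graph index over the targets (M=32 links per node).
          |       index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
          |       index.add(targets)
          | 
          |       # Approximate top-10 per query; scores are exact cosine
          |       # sims because this flat index stores the full vectors.
          |       sims, ids = index.search(queries, 10)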
        
         | ParanoidShroom wrote:
          | What about Annoy? https://github.com/spotify/annoy I used
          | this in the past. It probably has its limitations, but it
          | worked great for me.
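          | 
          | A quick sketch of the Annoy flow, sized to match the
          | question (tree count is illustrative; more trees, better
          | recall):
          | 
          |       import numpy as np
          |       from annoy import AnnoyIndex
          | 
          |       f = 128
          |       vecs = np.random.rand(1_000_000, f).astype("float32")
          | 
          |       t = AnnoyIndex(f, "angular")   # angular distance ~ cosine
          |       for i, v in enumerate(vecs):
          |           t.add_item(i, v)
          |       t.build(10)                    # build 10 random-projection trees
          | 
          |       # index of the (approximate) nearest neighbor of vecs[0]
          |       nearest = t.get_nns_by_vector(vecs[0], 1)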
        
         | cschmidt wrote:
         | You can fit 1M vectors in the free tier of www.pinecone.io if
         | you want to experiment. I'm not sure how fast having that many
         | query vectors would be. (I'm a happy Pinecone customer, but
         | only use a single query vector.)
        
           | cschmidt wrote:
           | Huh, at one point you could have multiple queries, but it
           | looks like that is deprecated now.
           | 
           | https://www.pinecone.io/docs/api/operation/query/
           | 
            | So maybe it wouldn't work for your use case.
        
         | tarr11 wrote:
          | Solr uses Lucene's approximate nearest neighbor (ANN)
          | implementation.
         | 
         | This site has some nice information on how ANN performs for
         | vector search.
         | 
         | http://ann-benchmarks.com/
        
           | generall wrote:
            | There is a more relevant end-to-end benchmark of vector
            | search engines, not just algorithms:
           | https://qdrant.tech/benchmarks/
        
           | QuadmasterXLII wrote:
           | Thanks!
        
         | cschmidt wrote:
          | I hesitate to mention this, because you probably know it and
          | are already doing it this way. But another poster mentioned
          | "and then do cosine similarity". In this case, you'll want to
          | preprocess both matrices so that every row has unit norm.
          | Then cosine similarity is simply a matrix multiply between
          | the two matrices (one transposed), plus a pass over the
          | results to find the top-k per query using a max-heap type
          | data structure.
        
           | kordlessagain wrote:
           | Could this matrix be compressed to binary form for storage in
           | a binary index?
        
             | [deleted]
        
             | cschmidt wrote:
              | That wouldn't really help. Let me explain in a bit more
              | detail. The result depends on the query matrix, which
              | will be different for each set of queries. We have a
              | query matrix Q of dimension 10,000x128, and another
              | vector matrix A that is 1,000,000x128. We preprocess both
              | Q and A so each row has unit norm:
              | 
              |       Q[i,:] /= norm(Q[i,:])
              |       A[k,:] /= norm(A[k,:])
              | 
              | With that preprocessing, the cosine similarity of row i
              | of Q and row k of A is:
              | 
              |       cossim(i,k) = dot(Q[i,:], A[k,:])
             | 
             | If you multiply QxA.T (10,000 x 128)x(128, 1M) you get a
             | result matrix (10,000 x 1M) with all the cosine similarity
             | values for each combination of query and vector.
             | 
              | If you make a pass across each row with a priority
              | queue, you can find the top-n cosine similarity values
              | for that query in O(1,000,000 x log n) time.
             | 
             | Now you could store the resulting matrix, but Q is going to
             | change for each call, and we really only care about the
             | top-n values for each query, so storing it wouldn't really
             | accomplish anything.
             | 
             | Edited: fixed lots of typos
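              | 
              | A minimal numpy sketch of the above, with argpartition
              | standing in for the priority-queue pass and blocking to
              | bound memory (the full 10,000x1M float32 result would be
              | ~40 GB):
              | 
              |       import numpy as np
              | 
              |       def top_n(Q, A, n=1, block=256):
              |           # Q and A are already row-normalized as above.
              |           ids = np.empty((Q.shape[0], n), dtype=np.int64)
              |           for s in range(0, Q.shape[0], block):
              |               sims = Q[s:s + block] @ A.T   # (block, 1M)
              |               # indices of the n largest sims per row, unsorted
              |               part = np.argpartition(-sims, n - 1, axis=1)[:, :n]
              |               vals = np.take_along_axis(sims, part, axis=1)
              |               order = np.argsort(-vals, axis=1)  # best first
              |               ids[s:s + block] = np.take_along_axis(part, order, axis=1)
              |           return ids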
        
               | kordlessagain wrote:
                | I asked GPT-3 about it, using an array of vectors for
                | fragments of this page, weighted by relevance to the
                | query (using np.dot(v1, v2)). These are used to build
                | the prompt submitted to the OpenAI APIs. I'm interested
                | in storing these vectors in a very fast DB for
                | memories.
               | 
               | pastel-mature-herring~> Could this matrix be compressed
               | to binary form for storage in a binary index?
               | 
               | angelic-quokka|> It is possible to compress the matrix to
               | binary form for storage in a binary index, but this would
               | likely decrease the accuracy of the cosine similarity
               | values.
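                | 
                | A rough sketch of that prompt assembly, assuming the
                | fragment vectors are precomputed and unit-normalized
                | (function and parameter names are illustrative):
                | 
                |       import numpy as np
                | 
                |       def build_prompt(query_vec, fragments, frag_vecs, k=5):
                |           # relevance of each fragment via np.dot
                |           scores = frag_vecs @ query_vec
                |           best = np.argsort(-scores)[:k]   # k most relevant
                |           return "\n\n".join(fragments[i] for i in best)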
        
         | fzliu wrote:
         | There's no one answer to this, but I'd say that anything past
         | 10k vectors would benefit greatly from a vector database. A
         | vector DB will abstract away the building of a vector index
         | along with other core database features such as caching,
         | failover, replication, horizontal scaling, etc. Milvus
         | (https://milvus.io) is open-source and always my go-to choice
         | for this (disclaimer: I'm a part of the Milvus community). An
         | added bonus of Milvus is that it supports GPU-accelerated
         | indexing and search in addition to batched queries and inserts.
         | 
         | All of this assumes you're okay with a bit of imprecision -
         | vector search with modern indexes is inherently probabilistic,
         | e.g. your recall may not be 100%, but it will be close. Using a
         | flat indexing strategy is still an option, but you lose a lot
         | of the speedup that comes with a vector database.
        
         | thirdtrigger wrote:
         | Agreed with fzliu, you can also use https://weaviate.io
         | (disclaimer, I'm affiliated with Weaviate). You might also like
         | this article which describes why one might want to use a vector
         | search engine: https://db-engines.com/en/blog_post/87
        
           | QuadmasterXLII wrote:
           | I'll look into weaviate.
        
         | [deleted]
        
       | binarymax wrote:
       | Dense vector search in Solr is a welcome addition, but getting
       | started requires a lot of pieces that aren't included.
       | 
        | So I made this a couple of months ago to make it super easy to
        | get started with this tech. If you have a sitemap, you can
        | start the docker compose and index your website with a single
        | command.
       | 
       | https://github.com/maxdotio/neural-solr
       | 
       | Enjoy!
        
         | kordlessagain wrote:
         | Thanks for this. Very useful. Any interest in adding a crawler?
         | https://github.com/kordless/grub-2.0
        
       | andre-z wrote:
        | It was to be expected after the recent ES releases. However,
        | dedicated vector search engines offer better performance and
        | more advanced features. Qdrant https://github.com/qdrant/qdrant
        | is written in Rust: fast, stable, and super easy to deploy.
        | (Disclaimer: affiliated with the project.)
        
       | bratao wrote:
        | Shameless plug from someone not related to the project: try
        | https://vespa.ai , fully open-source, very mature hybrid search
        | with dense and approximate vector search. A breeze to deploy
        | and maintain compared to ES and Solr. If I could name a single
        | secret ingredient for my startup, it would be Vespa.
        
         | forrest2 wrote:
         | Vespa looks pretty compelling; indexing looks like a dream.
         | 
          | I'd recommend basically anything else over a customized ES /
          | Solr cluster; they're some of the least fun clusters to
          | manage. Great for simple use cases and anything you see in a
          | tutorial, but the moment you walk off the beaten path with
          | them, best of luck.
          | 
          | Just an anecdote.
        
           | binarymax wrote:
            | Solr has its quirks for sure, but I've seen multi-terabyte
            | indices running with great relevance and performance. I
            | would call it a mechanic's search engine: very powerful,
            | but you need to get your hands dirty.
        
         | mountainriver wrote:
          | Vespa seemed like a total mess compared to Milvus when I
          | picked them up.
        
           | peterstjohn wrote:
           | Two big reasons for Vespa over Milvus 1.x:
           | 
           | * Filtering
           | 
           | * String-based IDs
           | 
            | (A caveat: I haven't used Milvus 2.x recently, which does
            | fix these issues but brings in a bunch of other
            | dependencies like Kafka or Pulsar.)
        
         | lmeyerov wrote:
          | Can Vespa index 100M+ vectors on a regular CPU server with
          | ordinary RAM? Is it any faster with a GPU (T4 / A10)?
        
         | kofejnik wrote:
         | omg so cool, thank you!
        
       | stoicjumbotron wrote:
        | Different from Solr, I know, but any thoughts on Lunr?
       | https://github.com/olivernn/lunr.js
        
         | kordlessagain wrote:
         | Whoosh is cool too:
         | https://whoosh.readthedocs.io/en/latest/intro.html
        
       | dsign wrote:
        | After Apple's attempt to use "neural search" to spy on its
        | customers, the term has been left with a bad rep.
        
         | visarga wrote:
          | It doesn't have a bad reputation; it's cosine similarity done
          | faster by approximation, something that's part of many ML
          | papers and systems these days.
        
       ___________________________________________________________________
       (page generated 2022-09-05 23:01 UTC)