[HN Gopher] Solr's Dense Vector Search for indexing and searchin...
___________________________________________________________________
 
Solr's Dense Vector Search for indexing and searching dense
numerical vectors
 
Author : kordlessagain
Score  : 84 points
Date   : 2022-09-05 15:24 UTC (7 hours ago)
 
(HTM) web link (solr.apache.org)
(TXT) w3m dump (solr.apache.org)
 
| lovelearning wrote:
| A much-awaited enhancement. Saves the trouble of having to deploy
| a separate vector DB like Milvus.
|
| I don't like the query syntax, though. Maybe a more developer-
| friendly indexing+query flow is possible: vectorize fields and
| queries transparently using a lib like DL4J running in the same
| JVM. That could further simplify both app development and
| deployment.
|
| lmeyerov wrote:
| Can this do something like a 100M+ index on a single node?
|
| It seems like all the VC-funded OSS options are targeting more
| like 1M rows per server, which doesn't really make sense for most
| of our use cases.
|
| QuadmasterXLII wrote:
| Question: I have ~10,000 128-element query vectors, and want to
| find the nearest neighbor (cosine similarity) for each of them in
| a dataset of ~1,000,000 target vectors. I can do this using
| brute-force search on a GPU in a few minutes, which is fast but
| still a serious bottleneck for me. Is this an appropriate size of
| dataset and task for acceleration with some sort of vector
| database or algorithm more intelligent than brute-force search?
|
| ianbutler wrote:
| Use an approximate method like FAISS and then do cosine
| similarity on the results of that.
|
| The short answer is that most of these databases use some type of
| precomputation to make approximate nearest-neighbor search
| faster. HNSW[0], FAISS[1], ScaNN[2], etc. are all methods of
| doing approximate nearest neighbors that use different techniques
| to speed up that approximation. For your use case it will likely
| result in a speedup.
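For context on the feature the thread is discussing: the indexing and query flow in Solr 9 looks roughly like this. This is a sketch based on the documented DenseVectorField type and the {!knn} query parser; the field names and the 4-dimensional vector are illustrative only.

```xml
<!-- schema.xml fragment: declare a dense vector field type.
     vectorDimension and names are illustrative; Solr 9 documents
     cosine, dot_product, and euclidean similarity functions. -->
<fieldType name="knn_vector" class="solr.DenseVectorField"
           vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
```

A top-10 nearest-neighbor query then uses the knn query parser, passing the query vector inline:

    q={!knn f=vector topK=10}[0.1, 0.2, 0.3, 0.4]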
|
| [0] https://www.pinecone.io/learn/hnsw/
| [1] https://engineering.fb.com/2017/03/29/data-infrastructure/fa...
| [2] https://ai.googleblog.com/2020/07/announcing-scann-efficient...
|
| ParanoidShroom wrote:
| What about Annoy? https://github.com/spotify/annoy I used this
| in the past. It will probably have its limitations, but it
| worked great for me.
|
| cschmidt wrote:
| You can fit 1M vectors in the free tier of www.pinecone.io if
| you want to experiment. I'm not sure how fast having that many
| query vectors would be. (I'm a happy Pinecone customer, but
| only use a single query vector.)
|
| cschmidt wrote:
| Huh, at one point you could have multiple queries, but it
| looks like that is deprecated now.
|
| https://www.pinecone.io/docs/api/operation/query/
|
| So maybe it wouldn't work for your use case.
|
| tarr11 wrote:
| Solr is using Lucene's approximate nearest neighbor (ANN)
| implementation.
|
| This site has some nice information on how ANN performs for
| vector search.
|
| http://ann-benchmarks.com/
|
| generall wrote:
| There is a more relevant benchmark of vector search engines
| end-to-end, not just algorithms:
| https://qdrant.tech/benchmarks/
|
| QuadmasterXLII wrote:
| Thanks!
|
| cschmidt wrote:
| I hesitate to mention this, because you probably know it and
| are doing it this way. But another poster mentioned "and then
| do cosine similarity". In this case, you're going to want to
| preprocess and normalize each row of both matrices to have
| unit norm. Then cosine similarity is simply a matrix multiply
| between the two matrices (one transposed), and a pass over the
| results to find the top-k per query using a max-queue-type
| data structure.
|
| kordlessagain wrote:
| Could this matrix be compressed to binary form for storage in
| a binary index?
|
| [deleted]
|
| cschmidt wrote:
| That wouldn't really help. Let me explain in a bit more
| detail. The results depend on the query matrix, which will
| be different for each set of queries.
| We have a query matrix Q of dimension 10,000x128, and another
| vector matrix A that is 1,000,000x128. We preprocess both Q and
| A so each row has unit norm:
|
|     Q[i,:] /= norm(Q[i,:])
|     A[k,:] /= norm(A[k,:])
|
| With that preprocessing, the cosine similarity of a given row i
| of Q and row k of A is:
|
|     cossim(i,k) = dot(Q[i,:], A[k,:])
|
| If you multiply Q x A.T ((10,000 x 128) x (128 x 1M)), you get a
| result matrix (10,000 x 1M) with all the cosine similarity
| values for each combination of query and vector.
|
| If you make a pass across each row with a priority queue, you
| can find the top-n cosine similarity values in time
| O(1,000,000 x n) per query.
|
| Now you could store the resulting matrix, but Q is going to
| change for each call, and we really only care about the top-n
| values for each query, so storing it wouldn't really accomplish
| anything.
|
| Edited: fixed lots of typos
|
| kordlessagain wrote:
| I asked GPT-3 about it using an array of vectors of fragments
| of this page, weighted by relevance (using np.dot(v1, v2)) to
| the query. This is used to build the prompt for submission to
| the OpenAI APIs. I'm interested in storing these vectors in a
| very fast DB for memories.
|
| pastel-mature-herring~> Could this matrix be compressed to
| binary form for storage in a binary index?
|
| angelic-quokka|> It is possible to compress the matrix to
| binary form for storage in a binary index, but this would
| likely decrease the accuracy of the cosine similarity values.
|
| fzliu wrote:
| There's no one answer to this, but I'd say that anything past
| 10k vectors would benefit greatly from a vector database. A
| vector DB will abstract away the building of a vector index
| along with other core database features such as caching,
| failover, replication, horizontal scaling, etc. Milvus
| (https://milvus.io) is open-source and always my go-to choice
| for this (disclaimer: I'm a part of the Milvus community).
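cschmidt's normalize-then-multiply recipe above can be sketched in NumPy. This is a sketch with small illustrative sizes (the thread's actual shapes are 10,000x128 queries against 1,000,000x128 targets), using argpartition in place of the priority-queue pass; both give top-n in time linear in the number of targets.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 128))    # query vectors (10,000 in the thread)
A = rng.standard_normal((1000, 128))   # target vectors (1,000,000 in the thread)

# Normalize each row to unit norm, so cosine similarity reduces to a dot product.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
A /= np.linalg.norm(A, axis=1, keepdims=True)

# One matrix multiply gives every query-target cosine similarity: shape (100, 1000).
sims = Q @ A.T

# Top-n target indices per query. argpartition is O(targets) per row,
# analogous to the priority-queue pass described above.
n = 5
top = np.argpartition(-sims, n, axis=1)[:, :n]

# The n candidates per row are unordered; sort them by similarity, descending.
rows = np.arange(Q.shape[0])[:, None]
order = np.argsort(-sims[rows, top], axis=1)
top_n = top[rows, order]               # (100, 5) indices into A, best first
```

As cschmidt notes, `sims` depends on the current batch of queries, so only `top_n` (and perhaps the corresponding similarity values) is worth keeping.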
| An added bonus of Milvus is that it supports GPU-accelerated
| indexing and search in addition to batched queries and inserts.
|
| All of this assumes you're okay with a bit of imprecision -
| vector search with modern indexes is inherently probabilistic,
| e.g. your recall may not be 100%, but it will be close. Using a
| flat indexing strategy is still an option, but you lose a lot
| of the speedup that comes with a vector database.
|
| thirdtrigger wrote:
| Agreed with fzliu; you can also use https://weaviate.io
| (disclaimer: I'm affiliated with Weaviate). You might also like
| this article, which describes why one might want to use a
| vector search engine: https://db-engines.com/en/blog_post/87
|
| QuadmasterXLII wrote:
| I'll look into Weaviate.
|
| [deleted]
|
| binarymax wrote:
| Dense vector search in Solr is a welcome addition, but getting
| started requires a lot of pieces that aren't included.
|
| So I made this a couple of months ago to make it super easy to
| get started with this tech. If you have a sitemap, you can
| start the Docker Compose and index your website with one
| command line.
|
| https://github.com/maxdotio/neural-solr
|
| Enjoy!
|
| kordlessagain wrote:
| Thanks for this. Very useful. Any interest in adding a crawler?
| https://github.com/kordless/grub-2.0
|
| andre-z wrote:
| It was to be expected after the recent ES releases. However,
| dedicated vector search engines offer better performance and
| more advanced features. Qdrant https://github.com/qdrant/qdrant
| is written in Rust. Fast, stable, and super easy to deploy.
| (disclaimer: affiliated with the project)
|
| bratao wrote:
| Shameless plug from someone not related to the project. Try
| https://vespa.ai - fully open-source, very mature hybrid search
| with dense and approximate vector search. A breeze to deploy
| and maintain compared to ES and Solr. If I could name a single
| secret ingredient for my startup, it would be Vespa.
|
| forrest2 wrote:
| Vespa looks pretty compelling; indexing looks like a dream.
|
| I'd recommend basically anything else over a customized ES /
| Solr cluster. They are some of the least fun clusters to
| manage. Great for simple use cases / anything you see in a
| tutorial. The moment you walk off the beaten path with them,
| best of luck.
|
| Just an anecdote.
|
| binarymax wrote:
| Solr has its quirks for sure, but I've seen multi-terabyte
| indices running with great relevance and performance. I would
| call it a mechanic's search engine: it is very powerful, but
| you need to get your hands dirty.
|
| mountainriver wrote:
| Vespa seemed like a total mess compared to Milvus when I picked
| them up.
|
| peterstjohn wrote:
| Two big reasons for Vespa over Milvus 1.x:
|
| * Filtering
|
| * String-based IDs
|
| (A caveat: I haven't used Milvus 2.x recently, which does fix
| these issues, but it brings in a bunch of other dependencies
| like Kafka or Pulsar.)
|
| lmeyerov wrote:
| Can Vespa index 100M+ vectors on a regular-RAM CPU server? Any
| faster with a GPU (T4 / A10)?
|
| kofejnik wrote:
| omg so cool, thank you!
|
| stoicjumbotron wrote:
| Different from Solr, I know, but thoughts on Lunr?
| https://github.com/olivernn/lunr.js
|
| kordlessagain wrote:
| Whoosh is cool too:
| https://whoosh.readthedocs.io/en/latest/intro.html
|
| dsign wrote:
| After Apple's attempt to use "neural search" to spy on its
| customers, the term has been left with a bad rep.
|
| visarga wrote:
| It doesn't have a bad reputation; it's cosine similarity done
| faster by approximation, something that's part of many ML
| papers and systems these days.
___________________________________________________________________
(page generated 2022-09-05 23:01 UTC)