[HN Gopher] Show HN: Epsilla - Open-source vector database with ...
       ___________________________________________________________________
        
       Show HN: Epsilla - Open-source vector database with low query
       latency
        
        Hey HN! We are building Epsilla
        (https://github.com/epsilla-cloud/vectordb), an open-source,
        self-hostable vector database for semantic similarity search
        that specializes in low query latency.
        
        When do we need a vector database? For example, GPT-3.5 has a
        16k-token context window limit. If we want it to answer a
        question about a 300-page book, we cannot put the whole book
        into the context; we have to choose the sections of the book
        that are most relevant to the question. A vector database
        specializes in ranking and picking the most relevant content
        from a large pool of documents based on semantic similarity.
        
        Most vector databases use Hierarchical Navigable Small World
        (HNSW) graphs to index vectors for high-precision vector
        search, and their latency degrades significantly when the
        precision target is higher than 95%.
        
        At a previous company, we worked on building a parallel graph
        traversal engine. We realized that the bottleneck in HNSW
        performance is the large number of sequential traversal steps,
        which don't fully leverage multi-core CPU resources. After some
        research, we found algorithms such as SpeedANN that target this
        problem but have not yet been adopted by industry, so we built
        the Epsilla vector database to turn that research into a
        production system.
        
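        To make the bottleneck concrete, here is a simplified,
        single-layer sketch of the greedy best-first search that
        HNSW-style indexes run (real HNSW is multi-layer; graph, dist,
        entry, and ef are illustrative placeholders). Each iteration
        depends on the node popped in the previous one, so the hot loop
        cannot be spread across cores:
        
            import heapq
        
            def greedy_search(graph, dist, query, entry, ef=8):
                # graph: node -> list of neighbors; dist(node, query) -> float
                visited = {entry}
                candidates = [(dist(entry, query), entry)]  # min-heap: closest first
                best = [(-dist(entry, query), entry)]       # max-heap of top-ef so far
                while candidates:
                    d, node = heapq.heappop(candidates)
                    if len(best) >= ef and d > -best[0][0]:
                        break  # nothing left that can beat our worst result
                    # Sequential dependency: the next node to expand is only
                    # known after this node's neighbors have been scored.
                    for nb in graph[node]:
                        if nb not in visited:
                            visited.add(nb)
                            heapq.heappush(candidates, (dist(nb, query), nb))
                            heapq.heappush(best, (-dist(nb, query), nb))
                            if len(best) > ef:
                                heapq.heappop(best)
                return sorted((-d, n) for d, n in best)
        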
        With Epsilla, we are shooting for 10x lower vector search
        latency compared to HNSW-based vector databases. We did an
        initial benchmark against the top open-source vector databases:
        https://medium.com/@richard_50832/benchmarking-epsilla-with-...
        
        We provide a Docker image so you can install the Epsilla
        backend locally, along with a Python client and a JavaScript
        client to connect to and interact with it. Quickstart:
        
            docker pull epsilla/vectordb
            docker run --pull=always -d -p 8888:8888 epsilla/vectordb
            pip install pyepsilla
            git clone https://github.com/epsilla-cloud/epsilla-python-client.git
            cd epsilla-python-client/examples
            python hello_epsilla.py
        
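        For reference, hello_epsilla.py boils down to roughly the
        following (a sketch based on the Python client; the exact API
        may differ in the latest release, and the table and field names
        here are illustrative):
        
            from pyepsilla import vectordb
        
            # Connect to the Epsilla backend started by the Docker container
            client = vectordb.Client(host="localhost", port="8888")
            client.load_db(db_name="MyDB", db_path="/tmp/epsilla_demo")
            client.use_db(db_name="MyDB")
        
            # The embedding is just another field in the table schema
            client.create_table(
                table_name="MyTable",
                table_fields=[
                    {"name": "ID", "dataType": "INT"},
                    {"name": "Doc", "dataType": "STRING"},
                    {"name": "Embedding", "dataType": "VECTOR_FLOAT", "dimensions": 4},
                ],
            )
        
            # Insert a few records, then rank them against a query vector
            client.insert(
                table_name="MyTable",
                records=[
                    {"ID": 1, "Doc": "Berlin", "Embedding": [0.05, 0.61, 0.76, 0.74]},
                    {"ID": 2, "Doc": "London", "Embedding": [0.19, 0.81, 0.75, 0.11]},
                ],
            )
            status_code, response = client.query(
                table_name="MyTable",
                query_field="Embedding",
                query_vector=[0.35, 0.55, 0.47, 0.94],
                limit=2,
            )
            print(response)
        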
        We just started a month ago. We'd love to hear what you think
        and, more importantly, what you wish to see in the future. We
        are thinking about a serverless vector database in the cloud
        with a consumption-based pricing model, and we are eager to get
        your feedback.
        
       Author : songrenchu
       Score  : 64 points
       Date   : 2023-08-14 15:21 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | vessenes wrote:
       | I'm curious about your approach on where you draw the line for
       | database features; I don't have a perspective on what's right,
       | just trying to get informed.
       | 
       | There are a bunch of possible areas to circle or ignore when
       | making an ML-capable database of some sort. In rough order of
       | data complexity:
       | 
       | 1. Embeddings (context-free vectors, just an ID and the vector)
       | 
       | 2. Metadata + Embedding (source data, JSON)
       | 
       | 3. Binary Data + Metadata + Embedding (add documents)
       | 
       | Then there are tooling questions: in this matrix you'd want to
       | decide if you're going to allow inference, and if so, will it be
       | arbitrary, service-based, etc. against the documents, and if so,
       | how will you store the results?
       | 
       | I'm curious how you're thinking about the design space. The
       | embedding-only route is conceptually appealing because it's
       | simple. In a larger engineering project, there's a tension
       | between "where do I keep all this data," "how do I process and
       | reprocess all this data", and "where do I keep the results of all
       | the processing", and to me there aren't clear bright-line
       | architectures that seem "best of".
       | 
       | Put another way, 15 years ago, we went memcached -> redis 1 ->
       | redis (whatever it is now), and at the same time, we went
       | mysql/postgres/oracle -> nosql json stores; today all of these
       | have relatively well-defined use cases, (and for most of them
       | sqlite is the best choice, obviously).
       | 
       | How are you seeing the ML db scene playing out, and where do you
       | think the sqlite of this space will land on architecture?
        
         | songrenchu wrote:
          | Thank you for the insightful question! Just reading it got
          | me thinking a lot.
          | 
          | From the database perspective, instead of dividing the table
          | schema into 3 parts (ID, metadata, embedding), we designed
          | it in a way closer to SQL: we treat the vector as just
          | another data type and let users define any number of fields
          | in a table. ID is just an annotation on a field (a composite
          | key might be overkill for now). There will be another debate
          | on whether schemaful or schemaless is the right approach,
          | but we can leave that for now.
         | 
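          | A sketch of what a table definition looks like under that
          | design (field names are illustrative, and the "primaryKey"
          | annotation is an assumption about how the ID annotation
          | might be expressed):
          | 
          |     table_fields = [
          |         {"name": "ID", "dataType": "INT", "primaryKey": True},
          |         {"name": "Title", "dataType": "STRING"},
          |         {"name": "Embedding", "dataType": "VECTOR_FLOAT",
          |          "dimensions": 768},
          |     ]
          | 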
          | With this foundation, we already cover 1 and 2. On our
          | roadmap we also plan to cover 3, with multi-modal data type
          | support. We think the real advantage of embeddings is on
          | unstructured data (documents, images, video, audio, etc.):
          | storing embeddings of multi-modal data and connecting them
          | through semantic relevance will open up big opportunities.
          | This also fits the table-and-fields design, since we can
          | introduce cross-table embedding indexes that connect data of
          | different shapes.
         | 
          | The multi-modal perspective raises the question of where to
          | store that data. One option is to provide a generic binary
          | data type that lets users store anything. Another, which
          | most enterprises will prefer, is to integrate us with a
          | larger data warehouse / data lake system; that opens up the
          | requirement for us to support streaming data in and out with
          | Kafka connectors, Spark connectors, etc.
         | 
          | I totally agree that SQLite works well in a huge number of
          | scenarios, and now there is DuckDB too. We also see other
          | players like LanceDB taking this approach to become the
          | SQLite of the vector DB space. We are pretty close to
          | announcing our Python in-process package, so Docker / a
          | separate server will no longer be a must-have.
         | 
          | Inference is a broader direction for us for now. We are open
          | to exploring this space and seeing whether a serverless
          | architecture in the cloud can provide extra efficiency
          | benefits to the market.
        
       | behnamoh wrote:
       | How long until you sell it to BigCompany? I just don't get why
       | there are numerous vector databases all with similar
       | functionality.
        
         | songrenchu wrote:
          | You are right, there are numerous vector databases on the
          | market. Most of them (including us) are still pretty early,
          | with a lot of enterprise-readiness features left to build:
          | role-based / privilege-based access control, authN/Z
          | integration, data versioning, backup/restore, fault
          | tolerance, data streaming in/out, etc. We have first-hand
          | experience with enterprise-level product development and
          | sales from our previous jobs at a series D graph database
          | startup, and we will apply those learnings to make Epsilla
          | enterprise-ready in the next few months.
        
         | theolivenbaum wrote:
          | Because wrapping HNSW is just that easy. Same with all the
          | ChatGPT-based tools popping up at a dozen a week: it's easy
          | to throw something together and see if it sticks.
        
       | sidhantgandhi wrote:
       | "Hippocampus of AI" in a readme is a yellow flag
        
         | songrenchu wrote:
          | Thank you for pointing it out. We just removed it from the
          | README.
        
       | VoVAllen wrote:
       | Why did you choose SpeedANN instead of other new indexes such as
       | DiskANN? And you changed the color of epsilla in every benchmark
       | figure, which is quite confusing
        
         | songrenchu wrote:
          | Thank you for sharing! DiskANN was published in 2019 and
          | SpeedANN in 2022. DiskANN is a disk-based ANNS solution
          | focused on the scenario where the vectors don't fit into
          | memory. SpeedANN is an in-memory solution that specializes
          | in low-latency queries, which is the scenario we want to
          | tackle for now. We can extend our engine to support DiskANN
          | and other index algorithms based on our customers'
          | requirements. Thanks for pointing out the benchmark figures;
          | we just fixed them to use consistent colors.
        
         | [deleted]
        
       | vosper wrote:
       | Vector databases seem to be a dime a dozen, now, as well as being
       | built into Elasticsearch and available as Postgres extensions.
       | 
       | As far as I know they're all relatively undifferentiated in
       | performance and features.
       | 
       | Is there a viable long-term business here?
        
         | snordgren wrote:
         | This one has more reason to exist than most vector DBs since
         | it's not just a wrapper around hnswlib.
        
       | itake wrote:
       | imho, vectorDbs need to scale horizontally. Simply running on a
       | single host doesn't cut it anymore.
        
         | songrenchu wrote:
          | You are right. We designed our storage in a segment-based
          | way, with a configurable segment size, so in the future it
          | can scale horizontally across multiple workers on one
          | machine and across a multi-machine cluster. Search then
          | becomes a two-stage search: find the top K in each segment,
          | then a global merger (which can also be horizontally scaled)
          | merges the results from all segments.
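          | 
          | A sketch of the two-stage flow (illustrative Python; in the
          | real system, stage 1 would run in parallel across segments
          | rather than in a list comprehension):
          | 
          |     import heapq
          | 
          |     def two_stage_search(segments, query, k, dist):
          |         # Stage 1: top-k inside each segment, independently.
          |         # Each segment is an iterable of (id, vector) pairs.
          |         per_segment = [
          |             heapq.nsmallest(k, ((dist(v, query), vid)
          |                                 for vid, v in seg))
          |             for seg in segments
          |         ]
          |         # Stage 2: global merge of the sorted candidate lists.
          |         return heapq.nsmallest(k, heapq.merge(*per_segment))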
        
       | social_quotient wrote:
       | Regarding the embedding vectors - is there a maximum limit to
       | their dimensionality? Also, can you share insights into how the
       | precision remains consistent at 99.9% even with high-dimension
       | vectors?
        
         | songrenchu wrote:
          | For now we don't put a limit on the dimensionality of the
          | vectors, so the machine can hold up to #vectors *
          | #dimensions * sizeof(float) bytes of vector data in memory.
          | We currently support only dense vectors; in the future we
          | will work on sparse vector support for much higher
          | dimensionality. I think you are referring to the "curse of
          | dimensionality" problem. Here are my thoughts: in a
          | graph-based index such as SpeedANN or HNSW, each vector is
          | treated as a node in the graph, and the index is a
          | nearest-neighbor graph. Unlike spatial partition-based
          | indices, the topology quality of the nearest-neighbor graph
          | is independent of the dimensionality of the vectors. Our
          | benchmark uses 960-dimension vectors, but we will experiment
          | more with sparse vectors in the future.
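          | 
          | As a rough example of that memory bound (float32; index
          | overhead not included):
          | 
          |     num_vectors, dims, bytes_per_float = 1_000_000, 960, 4
          |     gib = num_vectors * dims * bytes_per_float / 1024**3
          |     print(f"{gib:.2f} GiB")  # ~3.58 GiB of raw vector data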
        
       | yding wrote:
       | Congrats on the launch!
        
       ___________________________________________________________________
       (page generated 2023-08-14 23:00 UTC)