hngopher.com

       [HN Gopher] Sonic: Fast, lightweight and schema-less search backend
       ___________________________________________________________________
        
       Sonic: Fast, lightweight and schema-less search backend
        
       Author : rcarmo
       Score  : 479 points
       Date   : 2022-10-24 11:17 UTC (11 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | pavelevst wrote:
       | It would be nice if it could be a replacement for the logging
       | stack; Elastic is super hungry for RAM.
        
         | IceWreck wrote:
         | Look at zincsearch. It's another lightweight Elastic
         | alternative, and they advertise logging as a use case.
        
         | francoismassot wrote:
         | You can have a look at Quickwit (https://quickwit.io), it's a
         | search engine made for logs :). It's still pretty young and...
         | there are far fewer features than in ES.
         | 
         | (disclaimer: I'm one of the cofounders)
        
       | marginalia_nu wrote:
       | * We imported ~1,000,000 messages of dynamic length (some very
       | long, eg. emails);
       | 
       | * Once imported, the search index weights 20MB (KV) + 1.4MB (FST)
       | on disk;
       | 
       | This is almost unbelievably succinct! If you encode the document
       | features into 8 bits per document, and thus completely forego the
       | need to store the document ID by indexing them implicitly, that
       | alone is 1 MB.
       | 
       | Getting meaningful search out of, on average, 21 bytes per
       | document is seriously impressive.
       | 
       | [For reference, this sentence is 42 bytes.]
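
The arithmetic behind that estimate can be checked with a short sketch. It assumes "MB" means 10**6 bytes and that the corpus is exactly 1,000,000 messages; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted from the README.
# Assumptions: 1 MB = 10**6 bytes, exactly 1,000,000 documents.
KV_BYTES = 20_000_000    # 20 MB key-value store
FST_BYTES = 1_400_000    # 1.4 MB FST
DOCS = 1_000_000

# Average index footprint per document.
bytes_per_doc = (KV_BYTES + FST_BYTES) / DOCS

# Cost of storing just 8 bits (1 byte) of features per document,
# with document IDs implied by position rather than stored.
implicit_feature_bytes = DOCS * 1

print(f"{bytes_per_doc:.1f} bytes per document")
print(f"{implicit_feature_bytes / 10**6:.1f} MB for 8 bits per document")
```

Under those assumptions the index averages a little over 21 bytes per message, and one byte per message already accounts for 1 MB of it.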
        
         | mattb314 wrote:
         | Wonder if this has anything to do with the sliding window:
         | 
         | > Sonic only keeps the N most recently pushed results for a
         | given word, in a sliding window way (the sliding window width
         | can be configured)
         | 
         | Default window looks like 1k documents. I read this as saying
         | that super common words are basically dropped from the index
         | (only 1k out of many thousands of docs retained), but I don't
         | know enough about the internals to be sure. Not sure if this
         | actually hurts search results in practice, seems like an ok
         | trade off for help docs at least.
        
           | nightpool wrote:
           | I wonder how easy it would be to change "most recently
           | pushed" to something like a redis sorted set where each
           | document has a score and only the top N results are retained
           | when sorted by their separate score value? That would allow
           | you to sort by pageviews / popularity in a more useful way.
           | But it fails entirely when looking for uncommon intersections
           | of common words, which feels like it makes it useless for
           | most actual full-text search use-cases :(
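
The score-based retention idea can be sketched as a toy inverted index that keeps only the N highest-scored documents per word. This is a hypothetical illustration, not Sonic's or Redis's implementation; `TopNIndex`, `push`, and `query` are made-up names, and tokenization is a plain whitespace split.

```python
import heapq
from collections import defaultdict


class TopNIndex:
    """Toy inverted index keeping the N highest-scored documents per word."""

    def __init__(self, n):
        self.n = n
        # word -> min-heap of (score, doc_id); the heap root is the weakest entry.
        self.postings = defaultdict(list)

    def push(self, doc_id, score, text):
        for word in set(text.lower().split()):
            heap = self.postings[word]
            if len(heap) < self.n:
                heapq.heappush(heap, (score, doc_id))
            elif (score, doc_id) > heap[0]:
                # Evict the lowest-scored document for this word.
                heapq.heapreplace(heap, (score, doc_id))

    def query(self, word):
        # Return document IDs, highest score first.
        entries = self.postings.get(word.lower(), [])
        return [doc_id for _, doc_id in sorted(entries, reverse=True)]


index = TopNIndex(n=2)
index.push("a", 10, "sonic search")
index.push("b", 500, "sonic backend")
index.push("c", 90, "sonic index")
print(index.query("sonic"))
```

With N = 2, document "a" is dropped from the postings for "sonic" because it has the lowest score, regardless of when it was pushed. The intersection problem remains: a document evicted from one word's list can no longer match a multi-word query.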
        
           | 411111111111111 wrote:
           | It's definitely a great trade-off to make for efficiency,
           | but it makes it inherently unusable for most of
           | Elasticsearch's use cases.
           | 
           | Looking at it from a practical example such as log search
           | (almost everyone I know has used
           | kibana/logstash/elasticsearch at some point): you'd be able
           | to search for things like tracingId/requestId but adding more
           | filters such as logLevel, requestType or serviceName would be
           | impossible.
           | 
           | It has its niche, but calling it an Elasticsearch
           | alternative really is a stretch.
        
             | rabuse wrote:
             | Also the ability to weight fields when fetching results to
             | boost relevancy, which is needed for a lot of my use cases.
        
       | syrusakbary wrote:
       | Long ago I was searching for lightweight search engines that could
       | run on the Edge, as ElasticSearch -while very popular- is also
       | quite heavy and relies on the Lucene/JVM.
       | 
       | Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]...
       | all delightfully made in Rust. My favorite, and the closest one
       | to ElasticSearch (for its features) is probably Tantivy.
       | 
       | I'd recommend that anyone check out these three projects and
       | choose what best fits their needs... it's awesome to see that more
       | projects are becoming available by the day!
       | 
       | [1]: https://github.com/quickwit-oss/tantivy
       | 
       | [2]: https://github.com/meilisearch/meilisearch
        
         | codedokode wrote:
         | There is also Sphinx search, which was open source before
         | version 3.0.
        
           | snikolaev wrote:
           | And its open source continuation - Manticore Search [1]
           | 
           | [1] https://manticoresearch.com/
        
         | croes wrote:
         | Do they support document access control like ES does?
        
           | sanxiyn wrote:
           | Yes, Meilisearch supports ES-like document access control.
        
         | alserio wrote:
         | I've looked up tantivy and quickwit. Quickwit uses tantivy as
         | the engine. It has decoupled storage (awesome, only recently
         | elastic announced something comparable) but is oriented towards
         | log processing and explicitly warns against its use to power a
         | user-facing site search. Do you happen to know if there's
         | anything like that with the same minimal footprint that can
         | scale up and, importantly, down to serve the needs of highly
         | variable traffic websites? Right now I'm looking at something
         | with clustering capabilities and decoupled storage (e.g. on S3)
         | like quickwit.
        
           | francoismassot wrote:
           | One of the reasons for not using Quickwit for user facing
           | search is the latency: for example, you pay 70ms of latency
           | when you make a request on AWS S3... and generally you expect
           | latency below that figure. Decoupling compute and storage
           | while keeping very low latency may then be impossible
           | unless you end up caching all your data on disk :).
           | 
           | You can have a look at lnx (https://lnx.rs/) that is based on
           | tantivy and is performing quite well. It's not yet
           | distributed but the author Chillfish8 has some thoughts about
           | how to do it.
        
             | alserio wrote:
             | Thank you! I'll look into it
        
       | dewey wrote:
       | Another interesting alternative:
       | https://github.com/meilisearch/meilisearch - I'm using it in one
       | of my (small) projects and I had a good experience with it, also
       | very helpful community.
        
       | Thaxll wrote:
       | So I can use that to inject millions of logs daily and it will do
       | sharding and rebalancing automatically?
        
         | sanxiyn wrote:
         | No. Sonic is a single node server and not distributed.
        
       | daitangio wrote:
       | Nice. I have done some tests with SQLite, and I find its index
       | module very interesting, also because it offers stemming, which
       | seems to be missing here: am I wrong?
       | 
       | SQLite has stemming only for English out of the box, but I find
       | it a real necessity for a good drop-in ES replacement.
       | 
       | My two cents
        
         | rcarmo wrote:
         | It is great and works, but sonic has broader applications (I
         | found it because it was actually being used as a way to index
         | an existing SQLite database that pointed to file storage).
        
       | PedroBatista wrote:
       | While I get the wants-and-needs since ElasticSearch has a
       | voracious appetite for RAM, I get the feeling most people think
       | search engines are a simple thing where you can just import some
       | lib, fool around for a bit and call it a day.
       | 
       | The truth is that ElasticSearch/Solr/Lucene is orders of
       | magnitude more complex and powerful than these "alternatives".
       | All this is mostly fine as long as everyone is on the same page
       | regarding the expectations.
       | 
       | Most people don't need ElasticSearch for their use cases on the
       | surface, but I feel they expect top-notch mind-reading results
       | and that requires something like ElasticSearch and someone who
       | knows the field.
       | 
       | Having said all of that, Meilisearch and this are quite fine.
        
         | keyle wrote:
         | Yeah there needs to be some kind of acid test that will compare
         | these products on equal footing and show the pitfalls.
        
           | DeathArrow wrote:
           | Here is a performance benchmark:
           | https://db-benchmarks.com/test-hn/#manticore-search-columnar...
        
           | jasfi wrote:
           | That would be great. However if you wanted to benchmark
           | relevance ranking, how would you do that?
        
             | sanxiyn wrote:
             | You need a dataset and an evaluation metric. The usual
             | evaluation metric is NDCG (Normalized Discounted Cumulative
             | Gain): https://en.wikipedia.org/wiki/NDCG
             | 
             | An example dataset is BEIR (BEnchmarking Information
             | Retrieval), published in NIPS 2021:
             | https://github.com/beir-cellar/beir
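
For readers unfamiliar with the metric, here is a minimal sketch of NDCG. It assumes the linear-gain formulation (relevance divided by log2 of rank + 1; an exponential-gain variant also exists) and that `relevances` holds the graded relevance of every judged document in the order the engine returned them. The function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: later ranks contribute less."""
    # rank is 0-based here, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ranked = relevances[:k] if k is not None else relevances
    # Ideal ranking: the same judgments sorted from most to least relevant.
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg([3, 2, 1]))  # ranking already in ideal order
print(ndcg([1, 2, 3]))  # best document ranked last
```

A perfectly ordered result list scores 1.0, and any ranking that pushes relevant documents further down scores lower, which is what makes the number comparable across engines on the same judged dataset.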
        
           | sanxiyn wrote:
           | This is very very difficult, but Tantivy tried: see
           | https://github.com/quickwit-oss/search-benchmark-game
        
         | Spivak wrote:
         | I think the upshot is that if you have no idea what all the
         | advanced features of ES even are then you probably don't need
         | ES because it's not turnkey.
         | 
         | If you utter the phrase "I just want search" then it really is
         | a matter of just using one of these lightweight projects and
         | libs because your needs are simple.
        
         | alessmar wrote:
         | I would like to suggest https://typesense.org/. It has some
         | features that make it a better choice than Meilisearch.
        
           | paraboul wrote:
           | Can you elaborate on said features?
           | 
           | I migrated from typesense to Meilisearch on a project after I
           | found it had much better search accuracy. I can't exactly
           | explain why, but overall Meilisearch results feel more
           | relevant by default.
        
             | jabo wrote:
             | I work on Typesense. Mind if I ask which version of
             | Typesense and Meilisearch you tried this on? And if this
             | was on some public dataset I can use?
             | 
             | I'd love to take a closer look.
        
               | paraboul wrote:
               | Hey jabo,
               | 
               | I migrated in April 2021 (latest version of typesense &
               | meilisearch at that time).
               | 
               | I don't have a public dataset as it was a fairly large
               | ecommerce catalog with close to 500k entries. And again,
               | it was just my own perception which is hard to define. I
               | just found that Typesense was a bit off compared to
               | Meilisearch on search accuracy, and of course could
               | totally be different today with a more recent release.
        
               | jabo wrote:
               | Got it, thank you for sharing that. Typesense was at
               | v0.19.0 around that time. Two prominent issues we had in
               | that version were how we handled matches across multiple
               | fields and how we handled "keyword stuffing".
               | 
               | We're now at v0.24.rc, and we've iterated quite a lot on
               | improving relevancy since then, as more users shared
               | their datasets with us and gave us feedback over the last
               | 1.5 years.
               | 
               | If you get a chance to try out Typesense again in the
               | future, I'd love to hear how relevance feels with the
               | latest version, out of the box for your dataset.
        
             | snikolaev wrote:
             | There are actually benchmarks that allow measuring search
             | relevancy objectively, e.g. BEIR[1]. The Manticore Search
             | team made an effort to submit a PR to include it in the
             | list. The results are here [2]. Unfortunately the BEIR team
             | seems to be too busy to review a whole pile of PRs,
             | including one about Vespa. Nevertheless it would be nice to
             | have both Meilisearch and Typesense there too, since it's
             | interesting what performance those non-tf-idf based search
             | engines would show compared to BM25-based and vector search
             | engines.
             | 
             | [1] https://github.com/beir-cellar/beir
             | [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...
        
       | eric4smith wrote:
       | What about relevancy?
       | 
       | There's not much mention of that. I'm always on the lookout for
       | something lightweight that improves on PostgreSQL full text.
        
         | sanxiyn wrote:
         | Sonic doesn't do any ranking other than latest first.
        
           | eric4smith wrote:
           | Ouch
        
       | giancarlostoro wrote:
       | Now if it were drop-in capable and still more efficient, that
       | would be impressive and I would count the days until Elastic buys
       | you out.
        
       | cies wrote:
       | Other:
       | 
       | https://www.meilisearch.com/
       | 
       | https://github.com/quickwit-oss/tantivy
       | 
       | https://github.com/toshi-search/Toshi
       | 
       | https://github.com/typesense/typesense
        
       | didip wrote:
       | Somewhat related, this guy: https://github.com/mosuka/ seems to
       | be very passionate about search services.
       | 
       | He built two distributed search services:
       | 
       | - https://github.com/mosuka/phalanx, written in Go.
       | 
       | - https://github.com/mosuka/bayard, written in Rust.
        
       | erikcw wrote:
       | One of the features I like in ES that I haven't seen in
       | alternatives is "Percolate queries" (queries where you feed the
       | service a document and it returns a list of queries that you've
       | indexed that would match that document - basically inverting the
       | whole process).
       | 
       | Does anyone know of any alternatives that support this use case?
       | 
       | https://www.elastic.co/guide/en/elasticsearch/reference/mast...
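
The inversion described above can be illustrated with a toy sketch. It is not Elasticsearch's percolator API: `Percolator`, `register`, and `percolate` are hypothetical names, and each stored query is simplified to "all of these words must appear".

```python
class Percolator:
    """Toy reverse search: index queries, then match documents against them."""

    def __init__(self):
        # query_id -> set of terms that must all be present in a document.
        self.queries = {}

    def register(self, query_id, query_text):
        self.queries[query_id] = frozenset(query_text.lower().split())

    def percolate(self, document_text):
        tokens = set(document_text.lower().split())
        # A stored query matches when its terms are a subset of the document's.
        return sorted(qid for qid, terms in self.queries.items() if terms <= tokens)


p = Percolator()
p.register("rust-search", "rust search")
p.register("go-only", "go")
print(p.percolate("Sonic is a search backend written in Rust"))
```

A real implementation would index the stored queries so that each incoming document is not compared against every query; the linear scan here only shows the direction of the match.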
        
         | snikolaev wrote:
         | Yes. Manticore Search does. Here's an interactive course[1]
         | about it, it's a little bit outdated though. More info in the
         | docs[2]
         | 
         | [1] https://play.manticoresearch.com/pq
         | [2] https://manual.manticoresearch.com/Creating_an_index/Local_i...
        
       | thedougd wrote:
       | Just another plug for Lucene or the library route. I had a simple
       | use case to offer a search/autocomplete API for the employee
       | directory of ~50,000 records. The source of truth was only
       | updated once a day. We ran a job that reindexed daily and
       | published the index as a file (< 15 megabytes) to where the
       | service could access it.
       | 
       | That service worked beautifully. Results were returned in 10-20ms
       | and we only ever made software updates to handle the occasional
       | CVE. It did, however, take quite a bit of fiddling initially to
       | get the query results to match the user expectations. For
       | example, weighting first vs last vs full name.
        
       | codedokode wrote:
       | I am not sure if it can be called an "alternative". ElasticSearch
       | has thousands of features and settings while this library seems
       | to be just a simple inverted index implementation only for text
       | search.
       | 
       | By the way, if you are looking for a lightweight "alternative" to
       | ElasticSearch you might look at the Sphinx search engine (although
       | it doesn't have as many features as ES and it has been
       | closed-source since version 3.0).
        
         | snikolaev wrote:
         | > you might look at sphinx search engine
         | 
         | Manticore Search [1] forked from the latest open source version
         | and has been continually improved for more than 5 years.
         | 
         | [1] https://manticoresearch.com/
         | 
         | > although it doesn't have as many features as ES
         | 
         | Manticore, unlike Sphinx, is much closer to Elasticsearch in
         | terms of feature set.
        
       | 9dev wrote:
       | Every time someone comes up with an alternative to a software
       | behemoth like Elasticsearch, what they actually mean is: "An
       | alternative to the 10% of functionality of $tool _that is
       | interesting to me_".
       | 
       | This is surely an impressive engineering feat, but hardly a
       | replacement for the myriad of query possibilities Elasticsearch
       | offers.
        
         | coldtea wrote:
         | "[...] alternative to a software behemoth like Elasticsearch,
         | what they actually mean is: "An alternative to the 10% of
         | functionality of $tool that are interesting to me"
         | 
         | Which is perfectly fine. A lot of tools become so general and
         | bloated, that there are large groups that would be fine with
         | many different 10% subsets of their features...
         | 
         | Kind of like how I don't need MS Word or OpenOffice Writer, any
         | simple text editing program with a few basic features (like
         | printing, bold/italics, and word count) will do for my needs...
        
           | 9dev wrote:
           | I'm not opposed to that, however, the chance of _their 10%_
           | and _my 10%_ overlapping is rather slim. Just like you only
           | need basic formatting, and I require footnotes in my
           | documents. Nothing wrong with either, but I'd be upset if
           | you tried to sell me GEdit as a replacement for OpenOffice
           | Writer.
        
         | manigandham wrote:
         | True, but most deployments are also just generic searching of
         | records like Algolia rather than using all the low-level
         | functionality.
         | 
         | Typesense is probably the most complete competitor in that
         | regard: https://typesense.org/
         | 
         | Other alternatives here:
         | https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
        
         | jethro_tell wrote:
         | Well, and how do you solve excessive RAM usage for a search
         | engine? Generally you write the indexes/search trees to disk,
         | which may or may not be ideal.
        
         | GrinningFool wrote:
         | From the opening line of the README:
         | 
         | "Sonic can be used as a simple alternative to super-heavy and
         | full-featured search backends such as Elasticsearch in some
         | use-cases."
         | 
         | Seems pretty up-front about it, and doesn't claim to be a
         | full-featured alternative.
        
           | lolinder wrote:
           | Agreed, they do a good job of hedging it. I think OP was
           | probably pre-empting the usual comments along the lines of
           | "yep, $tool is super bloated, $smallerTool proves that those
           | other guys building $tool are bad engineers."
        
         | marginalia_nu wrote:
         | To be fair, there is often a better reason to replace only a
         | portion of ES' functionality (doing so can save a lot of
         | computation and space) than to replace ES itself, since it
         | already exists and does a good job if what you need is the full
         | kit.
         | 
         | I found myself last week reimplementing 10% of RoaringBitmap's
         | functionality as a homebrew replacement, because doing so was
         | 500% faster. Not that RB isn't great, but it's designed for a
         | general problem space, and not my particular problem.
        
         | tensor wrote:
         | My guess is that the majority of people using ES could actually
         | use something simpler like this.
        
         | jamil7 wrote:
         | Agreed, although to be fair to the actual author (assuming they
         | didn't post this here) the readme is a lot more upfront about
         | its capabilities.
         | 
         | > Sonic can be used as a simple alternative to super-heavy and
         | full-featured search backends such as Elasticsearch in some
         | use-cases.
        
         | rlex wrote:
         | And almost none of them offer an ES-compatible API. Off the top
         | of my head, I can think of Manticore
         | (https://manticoresearch.com/), which offers at least a subset
         | of the Elasticsearch API.
        
           | sanxiyn wrote:
           | Quickwit does too (a subset anyway):
           | https://github.com/quickwit-oss/quickwit
        
         | papruapap wrote:
         | tbh "Text Search" is a vague description for this kind of
         | software, so I guess everyone goes with "Elasticsearch-like".
        
         | RicoElectrico wrote:
         | Honestly most of the "alternative to" programs do not meet the
         | expectations they set by dropping a big known name. So much so,
         | that I think people are doing FOSS a disservice by comparing to
         | those who they can't meaningfully overtake.
         | 
         | The only exceptions could be small single feature utilities.
        
           | graftak wrote:
           | To me it seems the "alternative to" part is more damaging in
           | that sense than dropping a big name. The name is used to put
           | a complicated piece of software in a context many people are
           | familiar with. The same thing happens with the
           | "Tinder/Uber/Airbnb for <x>..." type of services.
           | 
           | The friction is introduced where it's not made crystal clear
           | how it's similar, and which concepts are different or missing
           | altogether. Then it will cause unmet expectations.
           | 
           | Perhaps it's better to say "inspired by ..." or "similar to
           | ..." to make a more precise statement.
        
         | ianbutler wrote:
         | I don't think your opinion is wrong, but I do think
         | ElasticSearch has a lot of features that many people consider
         | bloat depending on their work, and scaling and doing general
         | dev ops for ES can be an absolute slog. Lightweight
         | alternatives that cut down to a set of core features for some
         | niche seem like a good idea to me.
        
           | 9dev wrote:
           | It's totally fine that many people consider stuff bloat, but
           | other people don't. I've built a highly specialised search
           | engine for manufacturing companies on top of Elasticsearch,
           | and I _decidedly_ need vector queries, TF-IDF queries,
           | geospatial range queries, and heaps of other, niche features
           | you probably never used before.
           | 
           | Having a lightweight search engine is fine, but calling it an
           | alternative to Elasticsearch is not doing either justice.
        
             | ianbutler wrote:
             | That's very assumptive of you. I have in fact used most of
             | those features, and note I said their opinion was not
             | wrong. In their readme they said it's a replacement for
             | some use cases which is upfront and fine.
             | 
             | Vector queries aren't niche, Elastic however only tacked on
             | a proper (non HNSW) implementation in the last year and a
             | half. Geospatial isn't niche, anyone working with location
             | data will work with those queries. TF-IDF is a basic
             | ranking algo / signal.
             | 
             | Maybe Elasticsearch is good for you because they have all
             | their features in aggregate. But I can name a tool that
             | focuses specifically on each area and query type and is
             | better for that specific subset of functionality.
             | 
             | So my point still stands, if all you need are specific
             | features Elastic is too much. You need all of it and that's
             | fine too.
        
               | sanxiyn wrote:
               | I mean, Sonic doesn't store term frequency at all, so it
               | can't do TF-IDF. It probably doesn't want to. If you need
               | any ranking other than latest first, Sonic is not for
               | you.
        
               | _tom_ wrote:
               | The problem with subsets is everyone wants a different
               | subset. It's why popular software almost always bloats.
               | Everyone wants some different features.
        
               | pbowyer wrote:
               | > But I can name a tool that focuses specifically on each
               | area and query type and is better for that specific
               | subset of functionality.
               | 
               | Please do name them, because I for one would like to
               | never run ElasticSearch again for faceted, full-text and
               | specialised search.
        
         | osigurdson wrote:
         | I agree, but ES should re-write their core engine to be more
         | lightweight, otherwise a viable competitor will emerge.
        
           | snorremd wrote:
           | Projects like Meilisearch are already coming for
           | Elasticsearch's lunch: https://www.meilisearch.com. I think
           | there is a market for fast, lightweight alternatives like
           | Meili that offer a fully featured open source experience.
           | 
           | With Elasticsearch, many of the features, security being one,
           | are locked away behind commercial licenses. With Meili it
           | seems they are, for the time being anyway, going with a
           | proper open source version. I understand Elastic needs to
           | earn money, and I get their licensing model to accomplish
           | this. But Meili will probably steal away a good portion of
           | customers interested in self hosting their search solution.
        
             | osigurdson wrote:
             | I'm not sure what this competitor will be but > ES will
             | have the following properties:
             | 
             | - written in rust or maybe just C
             | - extremely lightweight and high performance
             | - single small binary that runs anywhere
             | - designed to run in Kubernetes from the ground up
             | - scales dynamically up/down
             | - zero downtime upgrades
             | - rigorous security built into the core offering
             | - fully open source
             | - wire compatibility with ES
             | 
             | I hope that ES themselves do this. There are pretty
             | significant barriers to creating a serious competitor to ES
             | (unlike something like MongoDB for example which seems to
             | have a very limited role in the future).
        
         | felipellrocha wrote:
         | That is exactly what they are, and I don't think they hide it?!
         | So, I don't know what the issue is. This is the kind of
         | innovation that keeps us moving forward.
        
       | atesti wrote:
       | >Also, Sonic only keeps the N most recently pushed results for a
       | given word, in a sliding window way (the sliding window width can
       | be configured)
       | 
       | Does this mean that it only ever finds at most N documents per
       | word? Even searches for "A and B" would probably not find
       | everything, even if fewer than N documents contain A and B,
       | because they might have been removed with the sliding window
       | already for A or B alone. Is that correct?
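
The scenario in the question can be reproduced with a toy model of a per-word sliding window. This is a sketch under stated assumptions, not Sonic's actual storage: `SlidingWindowIndex` is a hypothetical class, the window is a fixed-length deque per word, and a multi-word query is a plain set intersection.

```python
from collections import defaultdict, deque


class SlidingWindowIndex:
    """Toy index that keeps only the N most recently pushed doc IDs per word."""

    def __init__(self, window):
        # deque(maxlen=window) silently drops the oldest entry when full.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def query_all(self, *words):
        # AND query: intersect the retained postings of every word.
        sets = [set(self.postings[w.lower()]) if w.lower() in self.postings else set()
                for w in words]
        return set.intersection(*sets) if sets else set()


index = SlidingWindowIndex(window=2)
index.push(1, "a b")  # the only document containing both words
index.push(2, "a")
index.push(3, "a")    # pushes doc 1 out of the window for "a"
print(index.query_all("a", "b"))
```

Only one document contains both words, well under the window of 2, yet the AND query comes back empty because document 1 was already evicted from the list for "a".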
        
         | sanxiyn wrote:
         | As far as I can tell, yes, this is correct.
        
         | Aeolun wrote:
         | Huh? Yeah. I can keep my index size down by throwing results
         | away as well.
         | 
         | Every time you think it's somehow magic, someone has to dump a
         | bucket of cold water over your head.
        
       | eerikkivistik wrote:
       | About 2 weeks ago, I was searching for an alternative to Elastic
       | for this exact use case. Funny how the world works, now I have my
       | answer: "someone has built it".
        
       | habibur wrote:
       | First thing I looked for is how long it takes to delete a
       | document from the index.
       | 
       | Looks like it rebuilds the whole index periodically and that's
       | very processor intensive. The delete will be reflected after a
       | rebuild.
        
       | IYasha wrote:
       | But does it scale?
        
         | sanxiyn wrote:
         | No, it doesn't.
        
       | keroro wrote:
       | There's also Meilisearch, which is another Elasticsearch
       | alternative written in Rust.
       | 
       | Comparison to elasticsearch:
       | https://docs.meilisearch.com/learn/what_is_meilisearch/compa...
       | 
       | Github: https://github.com/meilisearch/meilisearch
       | 
       | Website: https://www.meilisearch.com/
        
       | mhitza wrote:
       | The readme doesn't offer enough information to accept that it can
       | be an alternative to Elasticsearch. From what I can gather by
       | skimming the information, it can only do word-level matching and
       | isn't some form of TF-IDF-type index (as is Lucene, which
       | stands behind Solr/ElasticSearch).
        
         | sanxiyn wrote:
         | Yes, it doesn't do any ranking at all. Results are returned in
         | the reverse order of indexing.
        
       | vlovich123 wrote:
       | Using a 32-bit ID is an interesting choice. It means you can only
       | index about 4 billion (2^32) documents per bucket. I wonder if
       | using a varint encoding would give you even more savings while
       | handling > 4 billion documents, at the cost of slightly more
       | expensive serialization/deserialization (which should be
       | negligible in the grand scheme of everything else being done).
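
A minimal sketch of the varint idea, using the common LEB128-style scheme (7 payload bits per byte, high bit set when more bytes follow). It is only an illustration of the encoding, not something Sonic is known to use; the function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode one varint from data; return (value, offset after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


for doc_id in (5, 300, 2**32 - 1, 2**40):
    encoded = encode_varint(doc_id)
    assert decode_varint(encoded)[0] == doc_id
    print(doc_id, len(encoded), "bytes")
```

Small IDs take one to three bytes instead of a fixed four, and IDs beyond 2^32 remain representable. The trade-off is that IDs close to 2^32 need five bytes, so the savings depend on how the IDs are distributed.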
        
       | speps wrote:
       | Does anyone know of an alternative for the time series side of
       | Elastic?
        
         | gkorland wrote:
         | You might want to check Redis-Stack -
         | https://redis.io/docs/stack. It's a stack on top of Redis,
         | which comes bundled with RedisTimeSeries, RediSearch, and
         | RedisJSON (also includes RedisGraph and RedisBloom).
        
         | snikolaev wrote:
         | Manticore Search. Here's a blog post with detailed comparison
         | [1]
         | 
         | [1] https://manticoresearch.com/blog/manticore-alternative-to-el...
        
       | pipeline_peak wrote:
       | If they keep introducing hipster names like Deno and Sonic, no
       | one will know what anything means anymore.
        
       | endisneigh wrote:
       | I wish someone would write a full text engine that supports
       | pluggable storage engines.
        
         | ilyt wrote:
        
       | AndrewKemendo wrote:
       | If anyone has been successful compiling this with VSCode on Win10
       | please let me know how you get CLANG/LLVM to play nicely with
       | VSCode.
       | 
       | I'd like to avoid compiling LLVM from source if I can
        
       | hardwaresofton wrote:
       | Wow it's weird that this comes up, I'm actually running a site I
       | am going to repost to HN today that I want to use as a testbed
       | for search engines (kind of like an extension to my recent
       | collaboration with supabase[0]).
       | 
       | Right now I've got the site going on just Postgres FTS + trigram
       | and it's pretty darn fast, looks like I need to test sonic too.
       | 
       | Going to burn some midnight oil (in my timezone, anyway) and get
       | it out -- though sonic isn't implemented yet!
       | 
       | Anyway to make this comment useful to people, here's my short
       | list of engines that I want to run in parallel:
       | 
       | - MeiliSearch (https://github.com/meilisearch/MeiliSearch)
       | 
       | - TypeSense (https://github.com/typesense/typesense)
       | 
       | - Lyra (https://github.com/LyraSearch/lyra)
       | 
       | - OpenSearch (https://github.com/opensearch-project/OpenSearch)
       | 
       | - ZincSearch (https://github.com/prabhatsharma/zinc)
       | 
       | - Sonic (https://github.com/valeriansaliou/sonic)
       | 
       | There isn't enough out there comparing all these for the simple
       | typical fuzzy search/search box usecase, so I'm adapting a little
       | podcast search site I made to try and use all of these at the
       | same time. So far only Postgres though, will try and add
       | Meilisearch today and post it!
       | 
       | Like other people are pointing out, most of these engines won't
       | have all the features of ES (or more accurately Lucene) but I am
       | pretty convinced that most of the time it doesn't _actually_
       | matter and if someone is searching on your site excessively maybe
       | there's a problem with your UX (unless you're a search engine or
       | repository of information).
       | 
       | [0]: https://supabase.com/blog/postgres-full-text-search-vs-
       | the-r...
        
         | Bilal_io wrote:
         | Hey that's a great list of tools.
         | 
         | Are you aware of any that can be used client side like Lyra and
         | supports faceted search?
         | 
         | I've been looking for a solution and cannot find it, even an
         | algorithm and/or a data structure can be helpful. I attempted
         | coming up with a solution myself but ended up with frustration
         | when it came to making the facets dynamic and update as other
         | filters are applied.
         | 
         | I read a couple of papers and one stood out [0], which
         | introduces category theory as a solution to faceted filtering.
         | I understood it in theory, but it still does not seem
         | straightforward to implement, and I haven't attempted it yet.
         | 
         | 0.
         | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...
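         | 
         | One common way to get facets that update as other filters are
         | applied is to count each facet over the documents that match
         | every filter except the one on that facet itself. A rough
         | brute-force sketch under that assumption (not the approach
         | from the paper; `matches`, `facet_counts` and the sample data
         | are made up for illustration):

```python
from collections import Counter

# Hypothetical sample documents with two facet fields.
DOCS = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]


def matches(doc, filters, skip=None):
    """True if doc satisfies every active filter except the one on `skip`."""
    return all(
        doc.get(field) in allowed
        for field, allowed in filters.items()
        if field != skip
    )


def facet_counts(docs, facet_fields, filters):
    """Count facet values, ignoring each facet's own filter for its counts."""
    counts = {}
    for field in facet_fields:
        counts[field] = Counter(
            doc[field]
            for doc in docs
            if field in doc and matches(doc, filters, skip=field)
        )
    return counts


# With a filter on color=red, the size facet narrows to red items,
# while the color facet still shows counts for every color.
counts = facet_counts(DOCS, ["color", "size"], {"color": {"red"}})
```

         | This scans every document once per facet, so it is only
         | reasonable for small client-side collections; a real index
         | would precompute per-value document sets instead.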
        
           | hardwaresofton wrote:
           | So for client-side search, I generally know of Lunr.js:
           | 
           | https://lunrjs.com/docs/index.html
           | 
           | There are some others but I can't find them at this moment --
           | a bunch of the other projects I find are somewhat abandoned,
           | lunr is actually on my list of things to use (because it
           | makes the most sense to just ship a pre-built index with the
           | first like... 5 letters maybe of typeahead, no matter how
           | fast the backend is)
        
             | Bilal_io wrote:
             | Thanks for the link. This unfortunately is not what I am
             | looking for. Faceted filters are a different beast.
        
         | hawski wrote:
         | Thank you for this comparison. I would also like to know how
         | Bleve Search (https://github.com/blevesearch/bleve) turns out.
         | 
         | For many years now I have had a small search engine project in
         | my free-time pipeline, but I haven't even started crawling yet,
         | and I intend to sit down to the searching part after some of
         | that.
        
           | hardwaresofton wrote:
           | You're right I should put bleve on there as well. This isn't
           | even the whole list. Toshi
           | (https://github.com/toshi-search/Toshi) is also out there...
        
             | snikolaev wrote:
             | If you decide to add Manticore Search to the list feel free
             | to ping me at sergey@manticoresearch.com if you need help
             | with preparing the ingestion scripts etc.
        
               | hardwaresofton wrote:
               | Oh! Damn it I forgot about manticore -- I had seen it
               | before but forgot to include it.
               | 
               | Eventually all of these projects will be highlighted on
               | Awesome F/OSS (https://awsmfoss.com), but for now I'm
               | just going to dump my bookmarks here for other people,
               | since I'm leaving awesome projects out:
               | 
               | Search Engines
               | 
               | AWS OpenSearch
               | https://github.com/opensearch-project/OpenSearch
               | 
               | https://github.com/opensearch-project/OpenSearch-Dashboards
               | 
               | https://github.com/opensearch-project/perftop
               | 
               | https://github.com/go-ego/riot
               | 
               | https://groonga.org/ https://github.com/groonga/groonga
               | 
               | https://github.com/meilisearch/MeiliSearch
               | 
               | https://github.com/mosuka/bayard
               | 
               | https://github.com/nezaboodka/nevod
               | 
               | https://github.com/searx/searx
               | 
               | https://github.com/stryku/okon
               | 
               | https://github.com/toshi-search/Toshi
               | 
               | https://github.com/typesense/typesense
               | 
               | https://github.com/valeriansaliou/sonic
               | 
               | Algolia
               | 
               | https://github.com/marconi1992/algolite
               | 
               | https://quickwit.io/
               | 
               | https://github.com/quickwit-inc/quickwit
               | 
               | https://docs.meilisearch.com/
               | 
               | https://github.com/prabhatsharma/zinc
               | 
               | phalanx https://github.com/blugelabs/bluge
               | https://github.com/mosuka/phalanx
               | https://github.com/mosuka/blast
               | 
               | ManticoreSearch
               | 
               | https://github.com/manticoresoftware/manticoresearch
               | 
               | https://github.com/manticoresoftware/docker
               | 
               | https://manticoresearch.com/blog/manticore-alternative-
               | to-el...
               | 
               | https://manticoresearch.com/
               | 
               | https://manual.manticoresearch.com/Introduction
               | 
               | https://forum.manticoresearch.com/t/manticore-search-
               | cheatsh...
               | 
               | https://forum.manticoresearch.com/
               | 
               | Whoosh https://whoosh.readthedocs.io/en/latest/
               | 
               | https://pypi.org/project/Whoosh/
               | 
               | lyra https://github.com/nearform/lyra
               | 
               | https://nearform.github.io/lyra/
               | 
               | https://github.com/LyraSearch/lyra
               | 
               | https://lyrasearch.io/
               | 
               | flexsearch
               | 
               | https://github.com/nextapps-de/flexsearch#performance-
               | benchm...
               | 
               | https://pagefind.app/docs/
               | 
               | Lucene
               | 
               | https://github.com/apache/lucene
               | 
               | https://lucene.apache.org/
               | 
               | ZincSearch
               | 
               | https://zincsearch.com/
               | 
               | Solr
               | 
               | https://solr.apache.org/
               | 
               | https://solr.apache.org/operator/
               | 
               | https://solr.apache.org/guide/solr/latest/getting-
               | started/so...
               | 
               | https://github.com/apache/solr
               | 
               | https://solr.apache.org/guide/solr/latest/deployment-
               | guide/s...
               | 
               | Konnu https://gitlab.com/shadowislord/konnu
               | 
               | Quickwit QuickWit + Clickhouse
               | 
               | https://clickhouse.com/docs/en/guides/developer/full-
               | text-se...
               | 
               | https://clickhouse.com/docs/en/sql-
               | reference/functions/strin...
               | 
               | There is no way I can get to running _all_ of these (this
               | project was supposed to be quick!!), but I will run the
               | ones I noted earlier, and probably manticore too, since it
               | was high on my list and looks quite polished.
        
               | _tom_ wrote:
               | I'd encourage you to maintain and publish your list of
               | search engines. Even if you aren't supporting them.
               | 
               | The list has value on its own, especially if you maintain
               | it.
        
               | kapilvt wrote:
               | + xapian which has been around a while, and while gpl
               | licensed, is quite capable https://xapian.org/
        
               | donio wrote:
               | Xapian is great, especially when you need a C/C++
               | library rather than a separate service. Kinda like an
               | sqlite for search. Some of my favorite tools like notmuch
               | and recoll use it.
        
               | fzliu wrote:
               | + Milvus (https://github.com/milvus-io/milvus) for large
               | scale similarity/semantic search.
        
               | thirdtrigger wrote:
               | + Weaviate for vector based search. Has a BSD-3 license.
               | https://weaviate.io/developers/weaviate/current/
        
         | nightpool wrote:
         | > and if someone is searching on your site excessively maybe
         | there's a problem with your UX (unless you're a search engine
         | or repository of information).
         | 
         | I don't understand this comment. Why would you search something
         | that *isn't*, in some senses, a repository of information? I
         | would say almost every website needs to have search in some
         | sense, and it's *because* sites function as a repository of
         | information that they need this search. Think about e.g.
         | Stripe's documentation, or Github's repository / code search.
         | HN is also another great example--I search for stories or
         | comments all the time to try and remember something I read
         | about recently or heard about last week, but couldn't quite
         | remember. I'm hard-pressed to think of a web site I use
         | regularly that *shouldn't* have full-text search, if I'm being
         | honest.
        
           | hardwaresofton wrote:
           | I don't consider use cases like documentation a "repository"
           | of information, but maybe this is just me phrasing it
           | badly. In the literal sense sure it is, but when I think of a
           | "repository of information" I think of wikipedia, amazon
           | search items, etc.
           | 
           | The scale of a documentation site is a very different problem
           | -- you can brute force it in ways that you can't at larger
           | scales.
           | 
           | I agree that HN would be a case of the large repository, but
           | even then what most people want out of HN search is pretty
           | simple/basic keyword search. I think a decent non-frustrating
           | HN search feature could be very basic and get by without most
           | of the advanced features/rabbit holes available in search.
           | 
           | Basically I think most apps fall into the lighter search use
           | case -- command palettes, search inside of apps with a small
           | scale of information, etc.
           | 
           | My comment wasn't that apps _shouldn't_ have full text
           | search -- it was that most that have full text search don't
           | need _complex_ full text search with all the bells and
           | whistles that lucene and other serious search engines
           | provide. These up-and-comers might be enough for a bunch of
           | apps for which search is not the main feature.
        
           | TylerE wrote:
           | Most site searches are basically unusable. Either it isn't
           | very good, is painfully slow, or both.
           | 
           | Just googling site:foo.com/baz <query> almost always produces
           | better results.
        
         | francoismassot wrote:
         | You can also consider lnx, which is based on tantivy and
         | performs quite well (https://lnx.rs/).
        
         | hardwaresofton wrote:
         | Meili is still ingesting documents but we're live:
         | 
         | https://news.ycombinator.com/item?id=33321268
         | 
         | Maybe I should have used their batch thing instead.
        
         | MobiusHorizons wrote:
         | Would it make sense to include Sqlite FTS5 in that mix?
        
           | hardwaresofton wrote:
           | It would, I did for the supabase post but... This is already
           | way too much! I have no idea when I'll actually be able to
           | get to all this as-is.
           | 
           | Waiting for meilisearch to ingest documents right now and the
           | Show HN is going up.
        
       | blacklight wrote:
       | While I really like their lightweight, SQL-like protocol instead
       | of Elasticsearch's fat JSON, I really think that this project
       | could have much more impact if it could be a drop-in replacement
       | for ES.
       | 
       | Even if it offers only a fraction of the features offered by ES,
       | that may be fair enough for at least half of the use-cases out
       | there.
       | 
       | Sonic could have really had a strong selling point: "Use an ES-
       | alternative that works fine in most of the real-world
       | applications, but it's written in Rust and it only takes a
       | fraction of the memory footprint required by ES, and it shouldn't
       | require you to change your application code".
       | 
       | Instead, they are proposing yet another search protocol, that
       | developers have to learn and adopt. That definitely increases the
       | adoption barriers.
        
         | tensor wrote:
         | It's probably fairly easy to write an adapter here.
        
         | xvello wrote:
         | Since Elastic spitefully patched all of their client libraries
         | to fail if the server is not a "genuine" ES server, I don't see
         | what good a drop-in replacement with protocol compatibility
         | would do.
         | 
         | Go client: https://github.com/elastic/go-
         | elasticsearch/blob/3985f2a1554...
         | 
         | Python client: https://github.com/elastic/elasticsearch-
         | py/commit/e72aa3e24...
        
           | snikolaev wrote:
           | Is it prohibited to include `X-Elastic-Product:
           | Elasticsearch` in the output of your server if the user
           | instructs the server to do so? :)
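           | 
           | For context, a sketch of the kind of response-header check
           | being discussed. This is an illustration only, not the
           | actual client code: `is_expected_product` is a hypothetical
           | helper, and only the header name and value come from the
           | question above.

```python
def is_expected_product(headers: dict) -> bool:
    """Return True if the response carries the product header in question."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-elastic-product") == "Elasticsearch"
```

           | A compatible server would pass such a check simply by
           | sending that header, which is why the question of whether
           | it is allowed to matters.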
        
             | hangonhn wrote:
             | I don't see how they can legally have any control over what
             | a 3rd party's software outputs. And more importantly, how
             | would they even enforce such restrictions?
        
               | yvan wrote:
               | I believe Elasticsearch is a trademark.
        
               | jeltz wrote:
               | A trademark does not forbid people from using a name, it
               | only restricts how it can be used in marketing. I do not
               | see how that would be applicable here.
        
               | metadat wrote:
               | Are HTTP headers important or even relevant at all for
               | branding trademark purposes?
               | 
               | Such a concern seems utterly ridiculous.
        
               | mumblemumble wrote:
               | If it really does work this way, then we're all doomed.
               | 
               | https://stackoverflow.com/questions/1114254/why-do-all-
               | brows...
        
               | blowski wrote:
               | I imagine AWS can't put it on the headers of their
               | managed service, and that's what it's about.
        
             | AbraKdabra wrote:
             | Those libraries are open source, just nuke those
             | restrictions and you're good to go. Is it the best way?
             | Maybe not, but it's better than modifying your server
             | responses (and in the worst 1984 case, allowing Elastic to
             | sue you), if you develop such a tool you can always put
             | that distinction in your README.
        
         | markandrewj wrote:
         | Although not exactly the same, Elastic has an SQL query syntax
         | which can be used now as well.
         | 
         | https://www.elastic.co/what-is/elasticsearch-sql
        
         | leros wrote:
         | ElasticSearch is so much more than search. Sonic is very
         | minimal in comparison, so a drop-in replacement doesn't work
         | here.
         | 
         | But yes, Sonic could replace lots of use cases.
        
       | nathell wrote:
       | I've written a full-text search engine as well. I don't tout it
       | as a replacement for Elasticsearch, but it does have a few
       | advantages: it's fast; supports HTML documents; supports Polish
       | inflection (via a full-blown morphological dictionary, not just a
       | stemmer); and has a very compact on-disk format (pre-parsed HTML
       | trees, Huffman-encoded over large alphabets). Oh, and it's 100%
       | Clojure.
       | 
       | It underlies a concordancer GUI called Smyrna:
       | https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl
       | 
       | I haven't touched it in six years, other than a few small
       | changes. But I do plan on revisiting it when time permits.
        
         | johnebgd wrote:
         | That's very cool. I hope you consider open sourcing it so
         | others can contribute.
        
           | nathell wrote:
           | It is open-source already (MIT)! I just need to make other
           | languages more easily pluggable, and factor out the search
           | engine so that it can be used on its own. :)
        
         | _tom_ wrote:
         | Could your stemmer be ported to Lucene? Might get more usage
         | there.
        
       | scottwick wrote:
       | Does anyone have any recommendations of books or other resources
       | that go over the theory behind full-text search? i.e. language
       | processing, data encoding, on-disk storage and retrieval, etc.
        
         | sanxiyn wrote:
         | If you want a book, Managing Gigabytes is still pretty good.
        
         | snikolaev wrote:
         | https://nlp.stanford.edu/IR-book/information-retrieval-book....
        
       | dang wrote:
       | Related:
       | 
       |  _Sonic: Fast, lightweight and schemaless search back end in
       | Rust_ - https://news.ycombinator.com/item?id=19471471 - March
       | 2019 (39 comments)
        
       | excsn wrote:
       | This is not a direct alternative to ElasticSearch. Tantivy is
       | closer to an alternative to ElasticSearch since ES is built on
       | top of Lucene. An alternative could be achieved if built on top
       | of Tantivy.
       | 
       | Sonic here only returns document identifiers so you will never be
       | able to get document information back. This is very useful though
       | if all you want to do is index text data and then get the stored
       | information from another data store.
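       | 
       | A rough sketch of that two-step pattern: the search backend
       | returns only IDs, and the documents come from a separate store.
       | Everything here is a stand-in: `search_ids` fakes the search
       | backend with a hard-coded dict rather than calling Sonic, and
       | the `documents` table is an assumed schema in an in-memory
       | SQLite database.

```python
import sqlite3


def search_ids(query):
    """Hypothetical stand-in for a search backend that returns only IDs."""
    fake_index = {"rust": ["doc:2", "doc:1"]}
    return fake_index.get(query, [])


def fetch_documents(conn, ids):
    """Load the matching rows and return them in the search result order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM documents WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database does not guarantee order, so re-sort by the ID list.
    return [by_id[i] for i in ids if i in by_id]


def make_demo_db():
    """Create an in-memory database with a couple of sample documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [("doc:1", "Intro to Rust"), ("doc:2", "Rust search backends")],
    )
    return conn


conn = make_demo_db()
results = fetch_documents(conn, search_ids("rust"))
```

       | IDs that are no longer in the database are skipped silently,
       | which is one way to tolerate an index that lags behind deletes.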
        
         | codedokode wrote:
         | > Sonic here only returns document identifiers
         | 
         | In many cases that is what you want because you have the data
         | in a database and don't want to duplicate it in Elasticsearch.
        
         | counttheforks wrote:
         | > Sonic here only returns document identifiers so you will
         | never be able to get document information back
         | 
         | Why would you want that anyway? Always thought it was silly to
         | duplicate all your data which will be stored in a real database
         | anyway
        
           | excsn wrote:
           | From a use case I am not experienced with. If you index
           | books, you want the search engine to return highlighted data
           | like google does.
           | 
           | Also, now that I think of it, typically logs/structured data
           | is stored only in ES.
        
         | sanxiyn wrote:
         | Quickwit is a search engine built on top of Tantivy (by the
         | author of Tantivy): https://github.com/quickwit-oss/quickwit
         | 
         | Quickwit supports an Elasticsearch-compatible bulk indexing API.
        
       | croes wrote:
       | Most of the time these ES replacements lack decent access
       | control.
       | 
       | One thing is to find what you search, but the other is not to
       | find what you aren't allowed to see.
        
         | sanxiyn wrote:
         | Meilisearch supports ES-like document access control.
        
       | DeathArrow wrote:
       | >Also, Sonic only keeps the N most recently pushed results for a
       | given word, in a sliding window way (the sliding window width can
       | be configured)
       | 
       | If you discard many potential hits, why not use /dev/null as the
       | search engine?
        
         | Someone1234 wrote:
         | I believe you must have misread what you quoted, because
         | whatever point you're trying to make doesn't really follow from
         | what you quoted.
         | 
         | They let you configure the number of expected results to cache
         | for a given query; the number of cached results is configurable
         | based on your use-case for the results (e.g. if your website
         | only lists 100 results, don't store beyond that).
         | 
         | If more results than that for a given query are returned then
         | they disregard additional results since you told it you won't
         | make use of them. In essence, they're saving you from caching
         | results that you'll never consume.
         | 
         | How you got from this to "just use /dev/null" is a mystery to
         | me. It has to be a misread or misunderstanding.
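         | 
         | To make the quoted sentence concrete, a toy sketch of "keep
         | only the N most recently pushed results per word". This is an
         | assumption about the general idea, not Sonic's implementation;
         | `SlidingWindowIndex` and its methods are invented names, and
         | the newest-first ordering follows the "reverse order of
         | indexing" remark elsewhere in this thread.

```python
from collections import defaultdict, deque


class SlidingWindowIndex:
    """Toy inverted index that keeps at most `window` doc IDs per word."""

    def __init__(self, window=3):
        self.window = window
        # A deque with maxlen drops the oldest entry when it overflows.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        """Index a document under each distinct lowercase word."""
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def query(self, word):
        """Return the retained doc IDs for a word, newest first."""
        return list(reversed(self.postings.get(word.lower(), ())))


index = SlidingWindowIndex(window=2)
index.push(1, "fast search")
index.push(2, "fast index")
index.push(3, "fast backend")
recent = index.query("fast")  # document 1 has fallen out of the window
```

         | Rare words keep all of their documents; only very common
         | words lose their oldest hits, which is where the disk
         | savings would come from.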
        
         | nine_k wrote:
         | This thing looks like a very generic cache. You can of course
         | use /dev/null as a degenerate cache, without any performance
         | benefit though.
        
       | manigandham wrote:
       | Lots of (elastic)search alternatives now, I keep track here:
       | https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
       | 
       | Sonic is good. Typesense is probably what most are looking for as
       | more of an Algolia-like setup: https://typesense.org/
        
       ___________________________________________________________________
       (page generated 2022-10-24 23:00 UTC)