[HN Gopher] Sonic: Fast, lightweight and schema-less search backend
___________________________________________________________________
Author : rcarmo
Score  : 479 points
Date   : 2022-10-24 11:17 UTC (11 hours ago)
(HTM) web link (github.com)

| pavelevst wrote:
| It would be nice if it could be a replacement for a logging stack; elastic is super hungry for RAM.
| IceWreck wrote:
| Look at zincsearch. It's another lightweight elastic alternative and they advertise logging as a use case.
| francoismassot wrote:
| You can have a look at Quickwit (https://quickwit.io), it's a search engine made for logs :). It's still pretty young and... there are far fewer features than in ES.
|
| (disclaimer: I'm one of the cofounders)
| marginalia_nu wrote:
| * We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);
|
| * Once imported, the search index weighs 20MB (KV) + 1.4MB (FST) on disk;
|
| This is almost unbelievably succinct! If you encode the document features into 8 bits per document, and thus completely forgo the need to store the document ID by indexing them implicitly, that alone is 1 MB.
|
| Getting meaningful search out of, on average, 21 bytes per document is seriously impressive.
|
| [For reference, this sentence is 42 bytes.]
| [deleted]
| mattb314 wrote:
| Wonder if this has anything to do with the sliding window:
|
| > Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)
|
| Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don't know enough about the internals to be sure. Not sure if this actually hurts search results in practice; seems like an OK trade-off for help docs at least.
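A toy model of the sliding window quoted by mattb314 can make the trade-off concrete. This is only a sketch under stated assumptions: `WindowedIndex`, `push`, `query` and the window size are hypothetical names, not Sonic's code or API, and doc IDs are assumed to increase with push order so that "highest ID" stands in for "most recent".

```python
from collections import defaultdict, deque


class WindowedIndex:
    """Toy inverted index keeping only the N most recent doc IDs per word."""

    def __init__(self, window=1000):
        self.window = window
        # word -> deque of doc IDs; deque(maxlen=N) evicts the oldest entry.
        self.postings = defaultdict(lambda: deque(maxlen=self.window))

    def push(self, doc_id, text):
        # Index each distinct word of the document once.
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def query(self, *words):
        """Doc IDs containing every word, highest ID first (no ranking)."""
        if not words:
            return []
        sets = [set(self.postings.get(w.lower(), ())) for w in words]
        return sorted(set.intersection(*sets), reverse=True)


index = WindowedIndex(window=3)
index.push(1, "rare the")
for doc_id in range(2, 6):
    index.push(doc_id, "the")

print(index.query("the"))          # the common word keeps only 3 doc IDs
print(index.query("rare"))         # the rare word still finds doc 1
print(index.query("rare", "the"))  # doc 1 was evicted from "the"
```

In this model a common word simply loses its older postings, so any query that combines it with a rarer word can miss documents that contain both.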
| nightpool wrote:
| I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use cases :(
| 411111111111111 wrote:
| It's definitely a great trade-off to make for efficiency, but it makes it inherently unusable for most of Elasticsearch's use cases.
|
| Looking at it from a practical example such as log search (almost everyone I know has used kibana/logstash/elasticsearch at some point): you'd be able to search for things like tracingId/requestId, but adding more filters such as logLevel, requestType or serviceName would be impossible.
|
| It has its niche, but calling it an elasticsearch alternative really is a stretch.
| rabuse wrote:
| Also the ability to weight fields when fetching results to boost relevancy, which is needed for a lot of my use cases.
| syrusakbary wrote:
| Long ago I was searching for lightweight search engines that could run on the Edge, as ElasticSearch, while very popular, is also quite heavy and relies on Lucene/JVM.
|
| Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]... all delightfully made in Rust. My favorite, and the closest one to ElasticSearch (for its features), is probably Tantivy.
|
| I'd recommend anyone to check out these three projects and choose what best fits your needs... it's awesome to see that more projects are becoming available by the day!
|
| [1]: https://github.com/quickwit-oss/tantivy
|
| [2]: https://github.com/meilisearch/meilisearch
| codedokode wrote:
| There is also sphinx search, which was open source before version 3.0.
| snikolaev wrote:
| And its open source continuation - Manticore Search [1]
|
| [1] https://manticoresearch.com/
| croes wrote:
| Do they support document access control like ES does?
| sanxiyn wrote:
| Yes, Meilisearch supports ES-like document access control.
| alserio wrote:
| I've looked up tantivy and quickwit. Quickwit uses tantivy as the engine. It has decoupled storage (awesome, only recently elastic announced something comparable) but is oriented towards log processing and explicitly warns against its use to power a user-facing site search. Do you happen to know if there's anything like that with the same minimal footprint that can scale up and, importantly, down to serve the needs of highly variable traffic websites? Right now I'm looking at something with clustering capabilities and decoupled storage (e.g. on S3) like quickwit.
| francoismassot wrote:
| One of the reasons for not using Quickwit for user-facing search is the latency: for example, you pay 70ms of latency when you make a request on AWS S3... and generally you expect latency below that figure. Decoupling compute and storage while keeping a very low latency may then be impossible unless you end up caching all your data on disk :).
|
| You can have a look at lnx (https://lnx.rs/) that is based on tantivy and is performing quite well. It's not yet distributed but the author Chillfish8 has some thoughts about how to do it.
| alserio wrote:
| Thank you! I'll look into it
| dewey wrote:
| Another interesting alternative: https://github.com/meilisearch/meilisearch - I'm using it in one of my (small) projects and I had a good experience with it, also very helpful community.
| Thaxll wrote:
| So I can use that to inject millions of logs daily and it will do sharding and rebalancing automatically?
| sanxiyn wrote:
| No. Sonic is a single node server and not distributed.
| daitangio wrote:
| Nice.
| I have done some tests with SQLite, and I find its index module very interesting, also because it offers stemming, which seems to be missing here: am I wrong?
|
| SQLite has stemming only for English out of the box, but I find it quite necessary for a good ES drop-in replacement.
|
| My two cents
| rcarmo wrote:
| It is great and works, but sonic has broader applications (I found it because it was actually being used as a way to index an existing SQLite database that pointed to file storage).
| PedroBatista wrote:
| While I get the wants-and-needs since ElasticSearch has a voracious appetite for RAM, I get the feeling most people think search engines are a simple thing where you can just import some lib, fool around for a bit and call it a day.
|
| The truth is that ElasticSearch/Solr/Lucene is orders of magnitude more complex and powerful than these "alternatives". All this is mostly fine as long as everyone is on the same page regarding the expectations.
|
| Most people don't need ElasticSearch for their use cases on the surface, but I feel they expect top-notch mind-reading results and that requires something like ElasticSearch and someone who knows the field.
|
| Having said all of that, Meilisearch and this are quite fine.
| keyle wrote:
| Yeah there needs to be some kind of acid test that will compare these products on equal footing and show the pitfalls.
| DeathArrow wrote:
| Here is a performance benchmark: https://db-benchmarks.com/test-hn/#manticore-search-columnar...
| jasfi wrote:
| That would be great. However if you wanted to benchmark relevance ranking, how would you do that?
| sanxiyn wrote:
| You need a dataset and an evaluation metric.
| The usual evaluation metric is NDCG (Normalized Discounted Cumulative Gain): https://en.wikipedia.org/wiki/NDCG
|
| An example dataset is BEIR (BEnchmarking Information Retrieval), published in NIPS 2021: https://github.com/beir-cellar/beir
| sanxiyn wrote:
| This is very very difficult, but Tantivy tried: see https://github.com/quickwit-oss/search-benchmark-game
| ilyt wrote:
| Spivak wrote:
| I think the upshot is that if you have no idea what all the advanced features of ES even are then you probably don't need ES because it's not turnkey.
|
| If you utter the phrase "I just want search" then it really is a matter of just using one of these lightweight projects and libs because your needs are simple.
| alessmar wrote:
| I would like to suggest https://typesense.org/ It has some features that make it a better choice than Meilisearch
| paraboul wrote:
| Can you elaborate on said features?
|
| I migrated from typesense to Meilisearch on a project after I found it had much better search accuracy. I can't exactly explain why, but overall Meilisearch results feel more relevant by default.
| jabo wrote:
| I work on Typesense. Mind if I ask which version of Typesense and Meilisearch you tried this on? And if this was on some public dataset I can use?
|
| I'd love to take a closer look.
| paraboul wrote:
| Hey jabo,
|
| I migrated in April 2021 (latest version of typesense & meilisearch at that time).
|
| I don't have a public dataset as it was a fairly large ecommerce catalog with close to ~500k entries. And again, it was just my own perception, which is hard to define. I just found that Typesense was a bit off compared to Meilisearch on search accuracy, and of course it could be totally different today with a more recent release.
| jabo wrote:
| Got it, thank you for sharing that. Typesense was at v0.19.0 around that time.
| Two prominent issues we had in that version were how we handled matches across multiple fields and how we handled "keyword stuffing".
|
| We're now at v0.24.rc, and we've iterated quite a lot on improving relevancy since then, as more users shared their datasets with us and gave us feedback over the last 1.5 years.
|
| If you get a chance to try out Typesense again in the future, I'd love to hear how relevance feels with the latest version, out of the box for your dataset.
| snikolaev wrote:
| There are actually benchmarks that allow measuring search relevancy objectively, e.g. BEIR [1]. The Manticore Search team made an effort to submit a PR to include it in the list. The results are here [2]. Unfortunately the BEIR team seems to be too busy to review a whole pile of PRs, including one about Vespa. Nevertheless it would be nice to have both Meilisearch and Typesense there too, since it would be interesting to see what performance those non-tf-idf based search engines would show compared to BM25-based and vector search engines.
|
| [1] https://github.com/beir-cellar/beir
| [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...
| eric4smith wrote:
| What about relevancy?
|
| There's not much mention of that. I'm always on the lookout for something lightweight that improves on PostgreSQL full text.
| sanxiyn wrote:
| Sonic doesn't do any ranking other than latest first.
| eric4smith wrote:
| Ouch
| giancarlostoro wrote:
| Now if it were drop-in capable and still more efficient, that would be impressive and I would count the days until Elastic buys you out.
| cies wrote:
| Other:
|
| https://www.meilisearch.com/
|
| https://github.com/quickwit-oss/tantivy
|
| https://github.com/toshi-search/Toshi
|
| https://github.com/typesense/typesense
| didip wrote:
| Somewhat related, this guy: https://github.com/mosuka/ seems to be very passionate about search services.
|
| He built two distributed search services:
|
| - https://github.com/mosuka/phalanx, written in Go.
|
| - https://github.com/mosuka/bayard, written in Rust.
| erikcw wrote:
| One of the features I like in ES that I haven't seen in alternatives is "Percolate queries" (queries where you feed the service a document and it returns a list of queries that you've indexed that would match that document - basically inverting the whole process).
|
| Does anyone know of any alternatives that support this use case?
|
| https://www.elastic.co/guide/en/elasticsearch/reference/mast...
| snikolaev wrote:
| Yes. Manticore Search does. Here's an interactive course [1] about it; it's a little bit outdated though. More info in the docs [2]
|
| [1] https://play.manticoresearch.com/pq
| [2] https://manual.manticoresearch.com/Creating_an_index/Local_i...
| thedougd wrote:
| Just another plug for Lucene or the library route. I had a simple use case to offer a search/autocomplete API for the employee directory of ~50,000 records. The source of truth was only updated once a day. We ran a job that reindexed daily and published the index as a file (< 15 megabytes) to where the service could access it.
|
| That service worked beautifully. Results were returned in 10-20ms and we only ever made software updates to handle the occasional CVE. It did, however, take quite a bit of fiddling initially to get the query results to match the user expectations. For example, weighting first vs last vs full name.
| codedokode wrote:
| I am not sure if it can be called an "alternative". ElasticSearch has thousands of features and settings while this library seems to be just a simple inverted index implementation only for text search.
|
| By the way, if you are looking for a lightweight "alternative" to ElasticSearch you might look at sphinx search engine (although it doesn't have as many features as ES has and it has been closed-source since version 3.0).
| snikolaev wrote:
| > you might look at sphinx search engine
|
| Manticore Search [1] forked from the latest open source version and has been continually improved for more than 5 years.
|
| [1] https://manticoresearch.com/
|
| > although it doesn't have as many features as ES has
|
| Manticore, unlike Sphinx, is much closer to Elasticsearch in terms of feature set.
| 9dev wrote:
| Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool _that are interesting to me_".
|
| This is surely an impressive engineering feat, but hardly a replacement for the myriad of query possibilities Elasticsearch offers.
| coldtea wrote:
| "...alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me"
|
| Which is perfectly fine. A lot of tools become so general and bloated that there are large groups that would be fine with many different 10% subsets of their features...
|
| Kind of like how I don't need MS Word or OpenOffice Writer; any simple text editing program with a few basic features (like printing, bold/italics, and word count) will do for my needs...
| 9dev wrote:
| I'm not opposed to that, however, the chance of _their 10%_ and _my 10%_ overlapping is rather slim. Just like you only need basic formatting, and I require footnotes in my documents. Nothing wrong with either, but I'd be upset if you tried to sell me GEdit as a replacement for OpenOffice Writer.
| manigandham wrote:
| True, but most deployments are also just generic searching of records like Algolia rather than using all the low-level functionality.
|
| Typesense is probably the most complete competitor in that regard: https://typesense.org/
|
| Other alternatives here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
| jethro_tell wrote:
| Well, and how do you solve excessive RAM usage for a search engine? Generally you write the indexes/search trees to disk, which may or may not be ideal.
| GrinningFool wrote:
| From the opening line of the README:
|
| "Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases."
|
| Seems pretty up-front about it, and doesn't claim to be a full-featured alternative.
| lolinder wrote:
| Agreed, they do a good job of hedging it. I think OP was probably pre-empting the usual comments along the lines of "yep, $tool is super bloated, $smallerTool proves that those other guys building $tool are bad engineers."
| marginalia_nu wrote:
| To be fair, there are often better reasons to replace only a portion of ES' functionality, since doing so can save a lot of computation and space, than to replace ES itself, since it already exists and does a good job if what you need is the full kit.
|
| I found myself last week reimplementing 10% of RoaringBitmap's functionality as a homebrew replacement, because doing so was 500% faster. Not that RB isn't great, but it's designed for a general problem space, and not my particular problem.
| tensor wrote:
| My guess is that the majority of people using ES could actually use something simpler like this.
| jamil7 wrote:
| Agreed, although to be fair to the actual author (assuming they didn't post this here) the readme is a lot more upfront about its capabilities.
|
| > Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases.
| rlex wrote:
| And almost all of them won't offer an ES-compatible API.
| Off the top of my head I can think of manticore (https://manticoresearch.com/), which offers at least a subset of the elasticsearch API
| sanxiyn wrote:
| Quickwit does too (a subset anyway): https://github.com/quickwit-oss/quickwit
| papruapap wrote:
| tbh "Text Search" is a vague description for this kind of software, so I guess everyone goes with elasticsearch-like.
| RicoElectrico wrote:
| Honestly most of the "alternative to" programs do not meet expectations they set by dropping a big known name. So much so, that I think people are doing FOSS a disservice by comparing to those who they can't meaningfully overtake.
|
| The only exceptions could be small single feature utilities.
| graftak wrote:
| To me it seems the "alternative to" part is more damaging in that sense than dropping a big name. The name is used to put a complicated piece of software in a context many people are familiar with. The same thing happens with the "Tinder/Uber/Airbnb for <x>..." type of services.
|
| The friction is introduced where it's not made crystal clear how it's similar, and which concepts are different or missing altogether. Then it will cause unmet expectations.
|
| Perhaps it's better to say "inspired by ..." or "similar to ..." to make a more precise statement.
| ianbutler wrote:
| I don't think your opinion is wrong, but I do think ElasticSearch has a lot of features that many people consider bloat depending on their work, and scaling and doing general dev ops for ES can be an absolute slog. Lightweight alternatives that cut down to a set of core features for some niche seem like a good idea to me.
| 9dev wrote:
| It's totally fine that many people consider stuff bloat, but other people don't. I've built a highly specialised search engine for manufacturing companies on top of Elasticsearch, and I _decidedly_ need vector queries, TF-IDF queries, geospatial range queries, and heaps of other, niche features you probably never used before.
|
| Having a lightweight search engine is fine, but calling it an alternative to Elasticsearch is not doing either justice.
| ianbutler wrote:
| That's very assumptive of you. I have in fact used most of those features, and note I said their opinion was not wrong. In their readme they said it's a replacement for some use cases, which is upfront and fine.
|
| Vector queries aren't niche; Elastic however only tacked on a proper (non HNSW) implementation in the last year and a half. Geospatial isn't niche; anyone working with location data will work with those queries. TF-IDF is a basic ranking algo / signal.
|
| Maybe Elasticsearch is good for you because they have all their features in aggregate. But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.
|
| So my point still stands: if all you need are specific features, Elastic is too much. You need all of it and that's fine too.
| sanxiyn wrote:
| I mean, Sonic doesn't store term frequency at all, so it can't do TF-IDF. It probably doesn't want to. If you need any ranking other than latest first, Sonic is not for you.
| _tom_ wrote:
| The problem with subsets is everyone wants a different subset. It's why popular software almost always bloats. Everyone wants some different features.
| pbowyer wrote:
| > But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.
|
| Please do name them, because I for one would like to never run ElasticSearch again for faceted, full-text and specialised search.
| osigurdson wrote:
| I agree, but ES should rewrite their core engine to be more lightweight, otherwise a viable competitor will emerge.
| snorremd wrote:
| Projects like Meili Search are already coming for Elastic Search's lunch: https://www.meilisearch.com.
| I think there is a market for fast, lightweight alternatives like Meili that offer a fully featured open source experience.
|
| With Elastic Search many of the features, security being one, are locked away behind commercial licenses. With Meili it seems they are, for the time being anyway, going with a proper open source version. I understand Elastic needs to earn money, and I get their licensing model to accomplish this. But Meili will probably steal away a good portion of customers interested in self-hosting their search solution.
| osigurdson wrote:
| I'm not sure what this competitor will be but > ES will have the following properties:
|
| - written in rust or maybe just C
| - extremely lightweight and high performance
| - single small binary that runs anywhere
| - designed to run in Kubernetes from the ground up
| - scales dynamically up/down
| - zero downtime upgrades
| - rigorous security built into the core offering
| - fully open source
| - wire compatibility with ES
|
| I hope that ES themselves do this. There are pretty significant barriers to creating a serious competitor to ES (unlike something like MongoDB for example which seems to have a very limited role in the future).
| felipellrocha wrote:
| That is exactly what they are, and I don't think they hide it?! So, I don't know what the issue is. This is the kind of innovation that keeps us moving forward.
| atesti wrote:
| > Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)
|
| Does this mean that it only ever finds at most N documents per word? Even searches for "A and B" would probably not find everything, even if fewer than N documents contain A and B, because they might have been removed with the sliding window already for A or B alone. Is that correct?
| sanxiyn wrote:
| As far as I can tell, yes, this is correct.
| Aeolun wrote:
| Huh? Yeah.
| I can keep my index size down by throwing results away as well.
|
| Every time you think it's somehow magic, someone has to dump a bucket of cold water over your head.
| marsven_422 wrote:
| eerikkivistik wrote:
| About 2 weeks ago, I was searching for an alternative to Elastic for this exact use case. Funny how the world works, now I have my answer: "someone has built it".
| habibur wrote:
| The first thing I looked for is how long it takes to delete a document from the index.
|
| Looks like it rebuilds the whole index periodically and that's very processor intensive. The delete will be reflected after a rebuild.
| IYasha wrote:
| But does it scale?
| sanxiyn wrote:
| No, it doesn't.
| keroro wrote:
| There's also meilisearch, which is another elasticsearch alternative written in rust.
|
| Comparison to elasticsearch: https://docs.meilisearch.com/learn/what_is_meilisearch/compa...
|
| Github: https://github.com/meilisearch/meilisearch
|
| Website: https://www.meilisearch.com/
| mhitza wrote:
| The readme doesn't offer enough information to accept that it can be an alternative to elasticsearch. From what I can gather by skimming the information, it can only do word-level matching and it isn't some form of TF-IDF type index (as is Lucene, which stands behind Solr/ElasticSearch).
| sanxiyn wrote:
| Yes, it doesn't do any ranking at all. Results are returned in the reverse order of indexing.
| vlovich123 wrote:
| Using a 32 bit ID is an interesting choice. It means you can only index 64-bits per bucket. I wonder if using a varint encoding would give you even more savings while handling > 4 billion documents at the cost of slightly more expensive serialization/deserialization (which should be negligible in the grand scheme of everything else being done).
| speps wrote:
| Does anyone know of an alternative for the time series side of Elastic?
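The varint idea vlovich123 raises above can be sketched with a LEB128-style encoding, where small IDs take one byte and larger ones grow as needed. This illustrates the general technique only, not Sonic's storage format; `encode_varint` and `decode_varint` are hypothetical helper names.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


for doc_id in (5, 300, 2**32 + 1):
    encoded = encode_varint(doc_id)
    assert decode_varint(encoded)[0] == doc_id
    print(doc_id, "->", len(encoded), "byte(s)")
```

One caveat on the expected savings: with 7 payload bits per byte, any ID of 2^28 or more needs at least 5 bytes, which is more than a fixed 32-bit ID, so the gain depends on most stored values being small.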
| gkorland wrote:
| You might want to check Redis-Stack - https://redis.io/docs/stack. It's a stack on top of Redis, which comes bundled with RedisTimeSeries, RediSearch, and RedisJSON (also includes RedisGraph and RedisBloom).
| snikolaev wrote:
| Manticore Search. Here's a blog post with a detailed comparison [1]
|
| [1] https://manticoresearch.com/blog/manticore-alternative-to-el...
| pipeline_peak wrote:
| If they keep introducing hipster names like Deno and Sonic, no one will know what anything means anymore.
| endisneigh wrote:
| I wish someone would write a full text engine that supports pluggable storage engines.
| ilyt wrote:
| AndrewKemendo wrote:
| If anyone has been successful compiling this with VSCode on Win10 please let me know how you get CLANG/LLVM to play nicely with VSCode.
|
| I'd like to avoid compiling LLVM from source if I can
| hardwaresofton wrote:
| Wow it's weird that this comes up, I'm actually running a site I am going to repost to HN today that I want to use as a testbed for search engines (kind of like an extension to my recent collaboration with supabase[0]).
|
| Right now I've got the site going on just Postgres FTS + trigram and it's pretty darn fast, looks like I need to test sonic too.
|
| Going to burn some midnight oil (in my timezone, anyway) and get it out -- though sonic isn't implemented yet!
|
| Anyway to make this comment useful to people, here's my short list of engines that I want to run in parallel:
|
| - MeiliSearch (https://github.com/meilisearch/MeiliSearch)
|
| - TypeSense (https://github.com/typesense/typesense)
|
| - Lyra (https://github.com/LyraSearch/lyra)
|
| - OpenSearch (https://github.com/opensearch-project/OpenSearch)
|
| - ZincSearch (https://github.com/prabhatsharma/zinc)
|
| - Sonic (https://github.com/valeriansaliou/sonic)
|
| There isn't enough out there comparing all these for the simple typical fuzzy search/search box use case, so I'm adapting a little podcast search site I made to try and use all of these at the same time. So far only Postgres though, will try and add Meilisearch today and post it!
|
| Like other people are pointing out, most of these engines won't have all the features of ES (or more accurately Lucene) but I am pretty convinced that most of the time it doesn't _actually_ matter and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).
|
| [0]: https://supabase.com/blog/postgres-full-text-search-vs-the-r...
| Bilal_io wrote:
| Hey that's a great list of tools.
|
| Are you aware of any that can be used client side like Lyra and support faceted search?
|
| I've been looking for a solution and cannot find it; even an algorithm and/or a data structure would be helpful. I attempted coming up with a solution myself but ended up frustrated when it came to making the facets dynamic and update as other filters are applied.
|
| I read a couple of papers and one stood out [0], which introduces category theory as a solution to faceted filtering. I understood it in theory, and it still does not seem straightforward to implement, but I haven't attempted it yet.
|
| 0. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...
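For the dynamic facets Bilal_io describes, one common brute-force approach is to count each facet's values over the documents that pass the filters on every other facet, so the options inside a facet do not collapse once one of them is selected. The sketch below assumes a small in-memory data set and hypothetical names (`matches`, `facet_counts`, `docs`, `selected`); it is not the category-theory method from the linked paper, nor any library's API.

```python
from collections import Counter


def matches(doc, selected, skip=None):
    """True if doc satisfies every selected facet except `skip`."""
    return all(
        doc.get(facet) in values
        for facet, values in selected.items()
        if facet != skip and values
    )


def facet_counts(docs, facets, selected):
    """Per-facet value counts that update as the other filters change."""
    counts = {}
    for facet in facets:
        # Ignore this facet's own filter so its alternatives stay visible.
        counts[facet] = Counter(
            doc[facet]
            for doc in docs
            if facet in doc and matches(doc, selected, skip=facet)
        )
    return counts


docs = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]
selected = {"color": {"red"}}

counts = facet_counts(docs, ["color", "size"], selected)
results = [doc for doc in docs if matches(doc, selected)]
print(dict(counts["color"]), dict(counts["size"]), len(results))
```

This rescans every document once per facet, which is fine for a few thousand records on the client; larger collections would need precomputed per-value ID sets or bitmaps instead.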
| hardwaresofton wrote:
| So for client-side search, I generally know of Lunr.js:
|
| https://lunrjs.com/docs/index.html
|
| There are some others but I can't find them at this moment -- a bunch of the other projects I find are somewhat abandoned, lunr is actually on my list of things to use (because it makes the most sense to just ship a pre-built index with the first like... 5 letters maybe of typeahead, no matter how fast the backend is)
| Bilal_io wrote:
| Thanks for the link. This unfortunately is not what I am looking for. Faceted filters are a different beast.
| hawski wrote:
| Thank you for this comparison. I would also like to know how Bleve Search (https://github.com/blevesearch/bleve) turns out.
|
| For many years now I have had a small search engine project in my free-time pipeline, but I'm not even at the crawling stage yet, and I intend to sit down to the searching part after some of that.
| hardwaresofton wrote:
| You're right I should put bleve on there as well. This isn't even the whole list. Toshi (https://github.com/toshi-search/Toshi) is also out there...
| snikolaev wrote:
| If you decide to add Manticore Search to the list feel free to ping me at sergey@manticoresearch.com if you need help with preparing the ingestion scripts etc.
| hardwaresofton wrote:
| Oh! Damn it I forgot about manticore -- I had seen it before but forgot to include it.
|
| Eventually all of these projects will be highlighted on Awesome F/OSS (https://awsmfoss.com), but for now I'm just going to dump my bookmarks here for other people, since I'm leaving awesome projects out:
|
| Search Engines
|
| AWS OpenSearch https://github.com/opensearch-project/OpenSearch
|
| https://github.com/opensearch-project/OpenSearch-Dashboards
|
| https://github.com/opensearch-project/perftop
|
| https://github.com/go-ego/riot
|
| https://groonga.org/ https://github.com/groonga/groonga
|
| https://github.com/meilisearch/MeiliSearch
|
| https://github.com/mosuka/bayard
|
| https://github.com/nezaboodka/nevod
|
| https://github.com/searx/searx
|
| https://github.com/stryku/okon
|
| https://github.com/toshi-search/Toshi
|
| https://github.com/typesense/typesense
|
| https://github.com/valeriansaliou/sonic
|
| Algolia
|
| https://github.com/marconi1992/algolite
|
| https://quickwit.io/
|
| https://github.com/quickwit-inc/quickwit
|
| https://docs.meilisearch.com/
|
| https://github.com/prabhatsharma/zinc
|
| phalanx https://github.com/blugelabs/bluge
| https://github.com/mosuka/phalanx
| https://github.com/mosuka/blast
|
| ManticoreSearch
|
| https://github.com/manticoresoftware/manticoresearch
|
| https://github.com/manticoresoftware/docker
|
| https://manticoresearch.com/blog/manticore-alternative-to-el...
|
| https://manticoresearch.com/
|
| https://manual.manticoresearch.com/Introduction
|
| https://forum.manticoresearch.com/t/manticore-search-cheatsh...
|
| https://forum.manticoresearch.com/
|
| Whoosh https://whoosh.readthedocs.io/en/latest/
|
| https://pypi.org/project/Whoosh/
|
| lyra https://github.com/nearform/lyra
|
| https://nearform.github.io/lyra/
|
| https://github.com/LyraSearch/lyra
|
| https://lyrasearch.io/
|
| flexsearch
|
| https://github.com/nextapps-de/flexsearch#performance-benchm...
|
| https://pagefind.app/docs/
|
| Lucene
|
| https://github.com/apache/lucene
|
| https://lucene.apache.org/
|
| ZincSearch
|
| https://zincsearch.com/
|
| Solr
|
| https://solr.apache.org/
|
| https://solr.apache.org/operator/
|
| https://solr.apache.org/guide/solr/latest/getting-started/so...
|
| https://github.com/apache/solr
|
| https://solr.apache.org/guide/solr/latest/deployment-guide/s...
|
| Konnu https://gitlab.com/shadowislord/konnu
|
| Quickwit QuickWit + Clickhouse
|
| https://clickhouse.com/docs/en/guides/developer/full-text-se...
|
| https://clickhouse.com/docs/en/sql-reference/functions/strin...
|
| There is no way I can get to running _all_ of these (this project was supposed to be quick!!), but I will run the ones I noted earlier, and probably manticore too, since it was high on my list and it's quite polished looking.
| _tom_ wrote:
| I'd encourage you to maintain and publish your list of search engines. Even if you aren't supporting them.
|
| The list has value on its own, especially if you maintain it.
| kapilvt wrote:
| + xapian which has been around a while, and while gpl licensed, is quite capable https://xapian.org/
| donio wrote:
| Xapian is great, especially when you need a C/C++ library rather than a separate service. Kinda like an sqlite for search. Some of my favorite tools like notmuch and recoll use it.
| fzliu wrote:
| + Milvus (https://github.com/milvus-io/milvus) for large scale similarity/semantic search.
| thirdtrigger wrote:
| + Weaviate for vector based search. Has a BSD-3 license. https://weaviate.io/developers/weaviate/current/
| nightpool wrote:
| > and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).
|
| I don't understand this comment. Why would you search something that *isn't*, in some senses, a repository of information?
I | would say almost every website needs to have search in some | sense, and it's *because* sites function as a repository of | information that they need this search. Think about e.g. | Stripe's documentation, or GitHub's repository / code search. | HN is also another great example--I search for stories or | comments all the time to try and remember something I read | about recently or heard about last week, but couldn't quite | remember. I'm hard-pressed to think of a web site I use | regularly that *shouldn't* have full-text search, if I'm being | honest. | hardwaresofton wrote: | I don't consider use cases like documentation a "repository" | of information, but maybe this is just me phrasing it | badly. In the literal sense sure it is, but when I think of a | "repository of information" I think of Wikipedia, Amazon | search items, etc. | | The scale of a documentation site is a very different problem | -- you can brute force it in ways that you can't at larger | scales. | | I agree that HN would be a case of the large repository, but | even then what most people want out of HN search is pretty | simple/basic keyword search. I think a decent non-frustrating | HN search feature could be very basic and get by without most | of the advanced features/rabbit holes available in search. | | Basically I think most apps fall into the lighter search use | case -- command palettes, search inside of apps with a small | scale of information, etc. | | My comment wasn't that apps _shouldn't_ have full text | search -- it was that most that have full text search don't | need _complex_ full text search with all the bells and | whistles that Lucene and other serious search engines | provide. These up-and-comers might be enough for a bunch of | apps for which search is not the main feature. | TylerE wrote: | Most site searches are basically unusable. Either they aren't | very good, are painfully slow, or both.
| | Just googling site:foo.com/baz <query> almost always produces | better results. | francoismassot wrote: | You can also consider lnx, which is based on tantivy and | performs quite well (https://lnx.rs/). | hardwaresofton wrote: | Meili is still ingesting documents but we're live: | | https://news.ycombinator.com/item?id=33321268 | | Maybe I should have used their batch thing instead. | MobiusHorizons wrote: | Would it make sense to include SQLite FTS5 in that mix? | hardwaresofton wrote: | It would, I did for the supabase post but... This is already | way too much! I have no idea when I'll actually be able to | get to all this as-is. | | Waiting for meilisearch to ingest documents right now and the | Show HN is going up. | blacklight wrote: | While I really like their lightweight, SQL-like protocol instead | of Elasticsearch's fat JSON, I really think that this project | could have much more impact if it could be a drop-in replacement | for ES. | | Even if it offers only a fraction of the features offered by ES, | that may be fair enough for at least half of the use-cases out | there. | | Sonic could have really had a strong selling point: "Use an ES- | alternative that works fine in most of the real-world | applications, but it's written in Rust and it only takes a | fraction of the memory footprint required by ES, and it shouldn't | require you to change your application code". | | Instead, they are proposing yet another search protocol that | developers have to learn and adopt. That definitely increases the | adoption barriers. | tensor wrote: | It's probably fairly easy to write an adapter here. | xvello wrote: | Since Elastic spitefully patched all of their client libraries | to fail if the server is not a "genuine" ES server, I don't see | what good a drop-in replacement with protocol compatibility | would do. | | Go client: https://github.com/elastic/go- | elasticsearch/blob/3985f2a1554...
| | Python client: https://github.com/elastic/elasticsearch- | py/commit/e72aa3e24... | snikolaev wrote: | Is it prohibited to include `X-Elastic-Product: | Elasticsearch` in the output of your server if the user | instructs the server to do so? :) | hangonhn wrote: | I don't see how they can legally have any control over what | a 3rd party's software outputs. And more importantly, how | would they even enforce such restrictions? | yvan wrote: | I believe Elasticsearch is a trademark. | jeltz wrote: | A trademark does not forbid people from using a name, it | only restricts how it can be used in marketing. I do not | see how that would be applicable here. | metadat wrote: | Are HTTP headers important or even relevant at all for | branding trademark purposes? | | Such a concern seems utterly ridiculous. | mumblemumble wrote: | If it really does work this way, then we're all doomed. | | https://stackoverflow.com/questions/1114254/why-do-all- | brows... | blowski wrote: | I imagine AWS can't put it on the headers of their | managed service, and that's what it's about. | AbraKdabra wrote: | Those libraries are open source, just nuke those | restrictions and you're good to go. Is it the best way? | Maybe not, but it's better than modifying your server | responses (and in the worst 1984 case, allowing Elastic to | sue you), if you develop such a tool you can always put | that distinction in your README. | markandrewj wrote: | Although not exactly the same, Elastic has an SQL query syntax | which can be used now as well. | | https://www.elastic.co/what-is/elasticsearch-sql | leros wrote: | ElasticSearch is so much more than search. Sonic is very | minimal in comparison, so a drop in replacement doesn't work | here. | | But yes, Sonic could replace lots of use cases. | nathell wrote: | I've written a full-text search engine as well. 
I don't tout it | as a replacement for Elasticsearch, but it does have a few | advantages: it's fast; supports HTML documents; supports Polish | inflection (via a full-blown morphological dictionary, not just a | stemmer); and has a very compact on-disk format (pre-parsed HTML | trees, Huffman-encoded over large alphabets). Oh, and it's 100% | Clojure. | | It underlies a concordancer GUI called Smyrna: | https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl | | I haven't touched it in six years, other than a few small | changes. But I do plan on revisiting it when time permits. | johnebgd wrote: | That's very cool. I hope you consider open sourcing it so | others can contribute. | nathell wrote: | It is open-source already (MIT)! I just need to make other | languages more easily pluggable, and factor out the search | engine so that it can be used on its own. :) | _tom_ wrote: | Could your stemmer be ported to Lucene? Might get more usage | there. | scottwick wrote: | Does anyone have any recommendations of books or other resources | that go over the theory behind full-text search? e.g. language | processing, data encoding, on-disk storage and retrieval, etc. | sanxiyn wrote: | If you want a book, Managing Gigabytes is still pretty good. | snikolaev wrote: | https://nlp.stanford.edu/IR-book/information-retrieval-book.... | dang wrote: | Related: | | _Sonic: Fast, lightweight and schemaless search back end in | Rust_ - https://news.ycombinator.com/item?id=19471471 - March | 2019 (39 comments) | excsn wrote: | This is not a direct alternative to Elasticsearch. Tantivy is | closer to an alternative to Elasticsearch since ES is built on | top of Lucene. An alternative could be achieved if built on top | of Tantivy. | | Sonic here only returns document identifiers so you will never be | able to get document information back. This is very useful though | if all you want to do is index text data and then get the stored | information from another data store.
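The pattern described in the comment above (index the text, get back only document identifiers, then fetch the stored record from another data store) can be sketched in a few lines of Python. This is a toy in-memory illustration, not Sonic's implementation or its protocol: `IdOnlyIndex`, `documents`, and the `window` parameter are hypothetical names made up for this sketch, and the per-word cap only loosely mimics the "N most recently pushed results for a given word" behaviour quoted elsewhere in this thread.

```python
from collections import defaultdict, deque


class IdOnlyIndex:
    """Toy inverted index that stores document IDs, never document bodies."""

    def __init__(self, window=1000):
        # Each word keeps at most `window` of the most recently pushed IDs;
        # older IDs fall off the left end of the deque automatically.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        # Naive whitespace tokenizer; a real engine would normalize and stem.
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def query(self, terms):
        words = terms.lower().split()
        if not words:
            return []
        # .get() avoids creating empty posting lists for unknown words.
        id_sets = [set(self.postings.get(word, ())) for word in words]
        return sorted(set.intersection(*id_sets))


# Hypothetical primary data store (stands in for a real database).
documents = {
    1: "Sonic is a fast search backend",
    2: "Elasticsearch is built on Lucene",
    3: "Tantivy is a search library in Rust",
}

index = IdOnlyIndex(window=2)
for doc_id, body in documents.items():
    index.push(doc_id, body)

# The index answers with IDs only; the bodies come from the other store.
ids = index.query("search")
results = [documents[doc_id] for doc_id in ids]
print(ids, results)
```

With the deliberately tiny `window=2`, the common word "is" only remembers documents 2 and 3, so a query for "is search" misses document 1 even though it contains both words. That is the trade-off raised earlier in the thread about intersections of common words.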
| codedokode wrote: | > Sonic here only returns document identifiers | | In many cases that is what you want because you have the data | in a database and don't want to duplicate it in Elasticsearch. | counttheforks wrote: | > Sonic here only returns document identifiers so you will | never be able to get document information back | | Why would you want that anyway? Always thought it was silly to | duplicate all your data which will be stored in a real database | anyway. | excsn wrote: | From a use case I am not experienced with. If you index | books, you want the search engine to return highlighted data | like Google does. | | Also, now that I think of it, typically logs/structured data | is stored only in ES. | sanxiyn wrote: | Quickwit is a search engine built on top of Tantivy (by the | author of Tantivy): https://github.com/quickwit-oss/quickwit | | Quickwit supports an Elasticsearch-compatible bulk indexing API. | croes wrote: | Most of the time these ES replacements lack decent access | control. | | One thing is to find what you search for, but the other is not to | find what you aren't allowed to see. | sanxiyn wrote: | Meilisearch supports ES-like document access control. | DeathArrow wrote: | >Also, Sonic only keeps the N most recently pushed results for a | given word, in a sliding window way (the sliding window width can | be configured) | | If you discard many potential hits, why not use /dev/null as the | search engine? | Someone1234 wrote: | I believe you must have misread what you quoted, because | whatever point you're trying to make doesn't really follow from | what you quoted. | | They let you configure the number of expected results to cache | for a given query; the number of cached results is configurable | based on your use-case for the results (e.g. if your website | only lists 100 results, don't store beyond that).
| | If more results than that for a given query are returned then | they disregard additional results since you told it you won't | make use of them. In essence, they're saving you from caching | results that you'll never consume. | | How you got from this to "just use /dev/null" is a mystery to | me. It has to be a misread or misunderstanding. | nine_k wrote: | This thing looks like a very generic cache. You can of course | use /dev/null as a degenerate cache, without any performance | benefit though. | manigandham wrote: | Lots of (elastic)search alternatives now, I keep track here: | https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2... | | Sonic is good. Typesense is probably what most are looking for as | more of an Algolia-like setup: https://typesense.org/ ___________________________________________________________________ (page generated 2022-10-24 23:00 UTC)