[HN Gopher] Evolution of Search Engines Architecture - Algolia S...
___________________________________________________________________
Evolution of Search Engines Architecture - Algolia Search Architecture
Part 1

Author : PretzelFisch
Score  : 154 points
Date   : 2021-08-08 12:36 UTC (10 hours ago)

(HTM) web link (highscalability.com)
(TXT) w3m dump (highscalability.com)

| polote wrote:
| The issue with Algolia is that they have insane technology but it
| is mostly used only to search documentation.
|
| They are struggling to sell their technology to people who need it
| deeply, for a lot of reasons. But one of them is that they are a
| tricky choice. It is not a database technology, so not a
| developer choice, but their technology is also only useful to
| developers.
|
| As a result they have to try to sell their product when you need
| a search but no developers are working on it. That's how you end
| up powering external and internal documentation portals. That's
| really a waste of resources.
| petulla wrote:
| no BERT?
| ramoz wrote:
| I've scaled large transformer-based models that supplement a
| Lucene-based search engine. The architecture supports an
| ensemble approach where Lucene results are first-class and then
| we tailor similarity rankings with the models.
|
| It looks a lot like this:
| https://huggingface.co/blog/bert-cpu-scaling-part-1
|
| We have to store large "index" embeddings on SSDs and use
| LevelDB for value retrievals of the Lucene results.
| lmeyerov wrote:
| Yep, I was surprised -- Google and others have long moved to
| neural search, afaict, where we are seeing things like FAISS
| for indexes based on embeddings, and all sorts of deploy pain
| around training+inference. I knew that was still true for
| Elastic, but hadn't realized it was also true for their
| replacements. So this article is about clustering for pre-neural
| search, and I guess enterprise search is still getting there.
| cabbagehead wrote:
| What's the "snapshot"/"snapshat" use case mentioned?
| ww520 wrote:
| Are suffix trees/arrays used at all? How about n-grams with a Bloom
| filter for filtering documents?
| avereveard wrote:
| idk, this seems more like an evolution of clustering. When I think
| about search engines I think more of the progression toward
| stemming, lemmatization, synonym matching and context matching.
| ramraj07 wrote:
| Also doing it in memory (which is what all the regular search
| engines do, right?)
| jabo wrote:
| No, ElasticSearch for example uses a disk-first approach.
| Grimm1 wrote:
| ES uses a disk-first approach but only on first load, and
| is smart enough to load similar results for frequently
| searched items into cache as well. That's why search return
| times differ significantly between hot and cold queries.
| This is actually such a problem that a lot of the time in
| older ES versions you wind up prewarming the ES cache
| before you can actually let it be used in production. Most
| alternative search engine implementations, and especially
| vector-based search engines, load into memory when brought
| up rather than at query time. ES still isn't great at
| this, which is one reason they're falling behind in modern
| search; that, and their vector search support was kind of
| abysmal as of even 6 months ago.
| ramoz wrote:
| They're falling behind compared to what exactly?
|
| Elastic is great on-disk, especially on SSDs and avoiding
| issues like write amplification.
|
| Loading large indexes in memory isn't simple/cheap and
| when it comes to vectors we're talking apples/oranges I
| feel. Modern search architectures need to embrace
| ensemble approaches, but boolean-based content search is
| often the primary utility in enterprise (and search is
| supplemented by a customizable tf-idf). Using vector-based
| retrieval & similarity is still useful, but it's not
| something you necessarily need Elastic to do for you, and
| the two can co-exist.
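The ensemble idea mentioned in this subthread (lexical results stay
first-class, and a vector similarity signal adjusts the ranking) can be
sketched roughly as follows. This is only an illustrative toy, not how
Lucene, Elastic or FAISS work internally: `tfidf_scores`, `cosine`,
`hybrid_search` and the `alpha` blend weight are hypothetical names
chosen for this example, the TF-IDF is a plain hand-rolled variant, and
the embedding vectors are assumed to come from some model that is not
shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real engines do much more here.
    return text.lower().split()


def tfidf_scores(query, docs):
    """Score each document against the query with a plain TF-IDF sum."""
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log((1 + n) / (1 + df[term])) + 1.0
                score += tf[term] * idf
        scores.append(score)
    return scores


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(query, query_vec, docs, doc_vecs, k=10, alpha=0.5):
    """Lexical retrieval first, then blend in vector similarity."""
    lexical = tfidf_scores(query, docs)
    # Stage 1: only documents with a lexical match become candidates.
    candidates = [i for i, s in enumerate(lexical) if s > 0]
    if not candidates:
        return []
    top = max(lexical[i] for i in candidates)
    # Stage 2: rerank candidates by a weighted mix of the normalized
    # lexical score and the cosine similarity of the embeddings.
    ranked = sorted(
        candidates,
        key=lambda i: alpha * lexical[i] / top
        + (1 - alpha) * cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With `alpha=1.0` this degenerates to pure lexical ranking and with
`alpha=0.0` the lexical stage only acts as a filter, which is one way to
read the "co-exist" point: the vector signal is layered on top rather
than replacing boolean/tf-idf search.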
| Grimm1 wrote:
| I've scaled a cluster that was in the 100s of millions of
| results range. The experience was not great: we were tuning
| for a use case that was decidedly not a typical
| enterprise search problem, and that made it a complete
| pain. So it's great that it works for that particular
| case, and we ultimately made it work ourselves much like
| you're suggesting: we used vector search with something
| like FAISS as a pre-filtering step and then a final
| search through a much reduced set of ids in ES. But it's
| pretty clear a new player could come in here and make a
| much better experience. Basically ES is, if not
| unsuitable, a big pain for large non-enterprise search
| such as web search, where things like vector search are
| one major signal and provide a better search experience.
| And that's the exact problem: there aren't off-the-shelf
| open source solutions if you're not doing a fairly
| standard ecom or internal business search type problem
| like log aggregation or internal documents.
|
| I'm also suggesting that use cases that aren't enterprise
| type search problems are more common than you'd think
| these days.
|
| Edit: Additionally, the thing here is you have classical
| boolean search systems like ES and vector search
| solutions like Milvus, but no one's gotten around to
| making something that does both well. From what I can see
| a lot of the players in the space are trying to go in
| that direction, but it's a slow, painful crawl. That results
| in this type of situation, where we had to do a lot of
| custom gluing of these systems together and keep them in
| parity, which was super annoying, expensive and time
| consuming, but not necessarily performance inhibiting.
| stingraycharles wrote:
| It's a highscalability blog post, though, which usually focuses
| on precisely the clustering, sharding, etc. aspects.
|
| Not saying you're wrong, but it's just a different audience
| that would be interested in the actual search algorithms.
| manojlds wrote:
| No https in 2021?
| ilrwbwrkhv wrote:
| No. In fact most websites don't need HTTPS and the pointless data
| transfer. Wish we could go back a few years on this zeitgeist.
| Xorlev wrote:
| This is false. Just because the page content isn't sensitive,
| that doesn't mean that TLS is worthless.
|
| TLS prevents your run-of-the-mill MITM scenarios. Like ISPs
| inserting ads (something Comcast actually did), or public
| wifi doing the same. Or worse, more malicious scripts.
|
| You could argue that all I'm really looking for in most cases
| is message integrity (signing), but if you're going to do
| that, you might as well just encrypt it too and avoid
| accidents where sensitive information is sent over unencrypted
| channels.
| pornel wrote:
| Every visited HTTP website is a network vulnerability.
|
| It doesn't matter what is supposed to be on these sites. From
| a security perspective they contain a MITM attacker's content.
| They are effectively an API for issuing arbitrary commands to
| the browser. To shut down this attack API, all sites have to
| stop using HTTP, no exceptions.
| merliossu wrote:
| in memory search works well as long as you don't care about
| persisting your data.. for most companies that would like a big
| chunk of their strategic assets
| bwb wrote:
| I am about to roll out search on Shepherd.com and looking at
| using Algolia. I've been impressed with Algolia on Hacker News...
|
| Is anyone else using them? What are your impressions so far?
|
| Much appreciated
| gervwyk wrote:
| We have it configured for https://docs.lowdefy.com
|
| Really happy with the service it provides and the ease of
| implementation. Note that because the docs can take a few
| seconds to load, their crawler times out and misses some
| content some of the time. With better page performance this
| should not be an issue.
|
| (We are actively working on some cool ideas to make Lowdefy
| apps super fast)
| ushakov wrote:
| I'm using MeiliSearch, which is an open source alternative
|
| worth giving a look
|
| https://github.com/meilisearch/MeiliSearch
| gervwyk wrote:
| Did not know about MeiliSearch. Looks really great! Thanks
| for sharing.
| thefounder wrote:
| It's easy to use and set up. If pricing and closed source are OK
| with you then it's worth it. We used them a few years ago and
| then switched to ES. Think of it like a pre-Docker Heroku.
| oakfr wrote:
| Out of curiosity, what made you choose ES over Algolia?
| thefounder wrote:
| As @kirubakaran said it was the price and the closed source
| license. If search becomes a very important part of your
| business you better own it rather than outsource it.
|
| Algolia is great to get started but it doesn't make sense
| at scale. If you have large indexes it's just too
| expensive.
| jabo wrote:
| What did the migration effort look like when moving from
| Algolia to ElasticSearch? Also, were you able to
| replicate the same user experience?
| kirubakaran wrote:
| From the comment, I guess "pricing and closed source"
| became not OK
| jabo wrote:
| I work on an open source alternative to Algolia called
| Typesense.
|
| Algolia is a great product but can get quite expensive at even
| moderate scale. If I had a dollar for every time I've heard
| this from Algolia users switching over...
|
| I recently put together this comparison page, comparing a few
| search engines, including Algolia, you might find interesting:
| https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...
| arbitrandomuser wrote:
| I heard a joke about FTS engines, but Whoosh !
| notdang wrote:
| It's missing the most important thing: speed. We moved to
| Algolia mainly because of this. ElasticSearch and Solr could
| not compete.
| jabo wrote:
| Oh yes. Speed is an important point.
| ElasticSearch & Solr use disk-first indexing (with RAM as just
| a cache), whereas Algolia and Typesense use a RAM-first
| approach where the entire index is stored in memory. This is
| what makes Algolia/Typesense return results much much faster
| than ES/Solr, and lets you build search-as-you-type
| experiences for each keystroke.
|
| I was thinking about adding a row about speed to the
| comparison matrix, but couldn't find a way to express the
| comparison clearly... Imagine a row that said:
|
| Search Speed | Super-fast | Super-fast | Slow? ...
|
| That felt a little off. So I resorted to just mentioning
| primary index location as a proxy.
|
| Open to suggestions on how to express this succinctly.
| Nextgrid wrote:
| What index sizes are we talking about? If it's a few
| hundred gigs there's always the possibility of putting
| the entire ElasticSearch index into a ramdisk, or even
| just leaving lots of "free" RAM meaning the underlying OS
| will use it to speed up I/O transparently. Bare-metal
| machines with insane RAM sizes are a thing, and at
| massive scale could make sense.
|
| I've had great success at a client where simply upgrading
| a DB to an instance with enough RAM to fit 80% of the
| entire data set fixed all performance problems and
| significantly reduced I/O "pressure" at least for reads
| (writes were never a problem).
| jabo wrote:
| I haven't tried to do this myself so I can't speak to it.
|
| But one thing I would add is ElasticSearch is quite
| versatile and flexible, so I wouldn't be surprised if you
| can contort it to get it to work for a wide variety of
| use cases. This is a blessing and a curse - blessing
| because it's so flexible, curse because the flexibility
| breeds complexity and brings with it a steep learning
| curve and operational complexity.
|
| Where I think Algolia / Typesense help is that things
| work out of the box without the learning curve or
| operational overhead.
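For readers following the RAM-first versus disk-first exchange above,
here is a minimal sketch of what "the entire index lives in memory"
can mean for search-as-you-type. It is a toy under stated assumptions,
not a description of how Algolia, Typesense or ElasticSearch are
implemented: `InMemoryIndex`, `add` and `search_prefix` are
hypothetical names, tokenization is a plain whitespace split, and there
is no ranking, typo tolerance or persistence.

```python
import bisect
from collections import defaultdict


class InMemoryIndex:
    """Toy RAM-resident inverted index with prefix lookup."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.terms = []                   # sorted terms for prefix scans

    def add(self, doc_id, text):
        for term in text.lower().split():
            if term not in self.postings:
                # Keep the term list sorted so prefixes form a
                # contiguous range that bisect can locate.
                bisect.insort(self.terms, term)
            self.postings[term].add(doc_id)

    def search_prefix(self, prefix):
        """Return sorted ids of docs with a term starting with prefix."""
        prefix = prefix.lower()
        start = bisect.bisect_left(self.terms, prefix)
        hits = set()
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            hits |= self.postings[term]
        return sorted(hits)
```

Every keystroke is then a binary search plus a short scan over data
already in RAM, with no disk read on the query path. The trade-off the
thread raises still applies: the whole structure has to fit in memory
and must be rebuilt or reloaded after a restart.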
| NicoJuicy wrote:
| Why not place the main algorithm for speed of search, so
| users can look up the difference on another page?
| tommoor wrote:
| Does Typesense support searching in non-Latin languages?
| jabo wrote:
| Yes it does - all languages except logographic ones
| (Chinese, Japanese and Korean) which we are actively
| working on:
| https://github.com/typesense/typesense/issues/228
| kqr wrote:
| Big up-front disclaimer: my job is making software at Loop54
| and my salary comes from happy customers of our service.
|
| One of our goals is similar to yours: browsing an online store
| should be like walking around in a physical store. The
| navigation system on the site should be as adept as a
| knowledgeable store employee in helping you find exactly what
| you're looking for.
|
| At Loop54 many of our customers come from Algolia. It's very
| popular, and nobody ever gets fired for buying Algolia. In that
| sense, it's a safe option.
|
| On the other hand, customers come to us from Algolia because
| Algolia requires a bit of hand-holding and it still doesn't
| quite seem to get what users are really looking for. When our
| prospects run randomised controlled trials, our search
| consistently seems to give users what they want better than
| Algolia does, with less effort. I can ask about specific
| numbers if you want.
|
| However, one strength of Algolia where Loop54 is currently
| behind is the surrounding tooling. For better or worse,
| with Algolia, you'll have more knobs and levers to play with
| (and you'll need them much more often!)
|
| We do have one or two customers that have a majority of books
| in their product catalogues, and we know there are some unique
| challenges that come with that domain.
|
| Loop54 is a very competent, but smaller player. If you think
| it's interesting, it's worth talking to us. I can't evaluate
| how good a fit your site would be for us, but that's why we
| have people who do that for a living!
|
| Edit: I should also say that yes, Loop54 is even more
| expensive. You shouldn't blindly trust us (or any other
| provider.) I would strongly suggest running a randomised
| controlled trial to see whether any expense at all is worth it
| in your case.
|
| I say this in part because I'm a man of science and believe in
| experiments to measure things, but also out of self-interest;
| anyone can throw out impressive marketing, but our search truly
| shines when put to the test against the alternatives.
| Redsquare wrote:
| I am sorry but https://www.loop54.com/pricing is just totally
| snide. No monetary information whatsoever. Why even blag me
| to a pricing page with less than zero pricing honesty?
| geraneum wrote:
| The comfort that they provide is a trap sometimes! Algolia suggests
| that the frontend sends the queries directly to its service instead
| of going through our backend, which is good if you want to have a
| good search engine fast. But don't go for it without considering
| the consequences. It will take over part of the frontend and your
| product will depend on Algolia to the point that implementing a
| single favourite functionality for your users may need to be
| integrated with their service if you're not careful!
| cinntaile wrote:
| It's strange, I don't really like using the HN Algolia search. I
| think it's because the responsiveness doesn't fit HN and the
| results are okay but not great? What are some other big sites
| that use Algolia as their search backend? It would be interesting
| to compare.
| adamveld12 wrote:
| We use it for general search and similar items results on
| www.liveauctioneers.com
| polote wrote:
| > the results are okay but not great
|
| What more do you expect than keyword search ranked by
| upvotes on HN? I find it great honestly, it's fast and doesn't do
| magic
___________________________________________________________________
(page generated 2021-08-08 23:00 UTC)