[HN Gopher] Evolution of Search Engines Architecture - Algolia S...
       ___________________________________________________________________
        
       Evolution of Search Engines Architecture - Algolia Search
       Architecture Part 1
        
       Author : PretzelFisch
       Score  : 154 points
       Date   : 2021-08-08 12:36 UTC (10 hours ago)
        
 (HTM) web link (highscalability.com)
 (TXT) w3m dump (highscalability.com)
        
       | polote wrote:
       | The issue with Algolia is that they have insane technology but it
       | is mostly used only to search documentation.
       | 
       | They are struggling to sell their technology to people who need it
       | deeply, for a lot of reasons. But one of them is that they are a
       | tricky choice. It is not a database technology, so not a
       | developer choice but also their technology is only useful to
       | developers.
       | 
       | As a result they have to try to sell their product when you need
       | a search but no developers are working on it. That's how you end
       | up powering external and internal documentation portals. That's
       | really a waste of resources.
        
       | petulla wrote:
       | no BERT?
        
         | ramoz wrote:
         | I've scaled large transformer based models that supplement a
         | lucene-based search engine. The architecture supports an
         | ensemble approach where Lucene results are first-class and then
         | we tailor similarity rankings with the models.
         | 
         | It looks a lot like this:
         | https://huggingface.co/blog/bert-cpu-scaling-part-1
         | 
         | We have to store large "index" embeddings on SSDs and use
         | leveldb for value retrievals of the lucene results.
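A minimal sketch of the ensemble idea described above, assuming the keyword engine returns (doc_id, score) pairs and that document embeddings live in a key-value store. The names `rerank`, `embedding_store` and `weight` are hypothetical, and a plain dict stands in for the on-disk store; this is not the commenter's actual pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec, lexical_hits, embedding_store, weight=0.5):
    """Blend each hit's lexical score with its embedding similarity.

    lexical_hits: list of (doc_id, lexical_score) from the keyword engine.
    embedding_store: mapping doc_id -> vector (stand-in for a key-value store).
    weight: hypothetical blending factor between the two signals.
    """
    rescored = []
    for doc_id, lex_score in lexical_hits:
        vec = embedding_store.get(doc_id)
        sim = cosine(query_vec, vec) if vec is not None else 0.0
        rescored.append((doc_id, (1 - weight) * lex_score + weight * sim))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up data: three keyword hits and their two-dimensional embeddings.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
hits = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
ranked = rerank([0.0, 1.0], hits, store)
```

The keyword results stay first-class here: only documents the keyword engine already returned are rescored, so the model adjusts the order rather than the candidate set.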
        
         | lmeyerov wrote:
         | Yep I was surprised -- google and others have long moved to
         | neural search, afaict, where we are seeing things like faiss
         | for indexes based on embeddings, and all sorts of deploy pain
         | around training+inference. I knew that was still true for
         | elastic, but hadn't realized also for their replacements. So
         | this article is clustering for pre-neural search, and guess
         | enterprise search is still getting there..
        
       | cabbagehead wrote:
       | What's the "snapshot"/"snapshat" use case mentioned?
        
       | ww520 wrote:
       | Are suffix trees/arrays used at all? How about n-grams with a Bloom
       | filter for filtering documents?
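For readers unfamiliar with the second idea, here is a rough sketch of n-gram filtering with one Bloom filter per document. Nothing in the thread says Algolia works this way; the class, the sizes and the trigram choice are all assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_filter(doc):
    bf = BloomFilter()
    for gram in trigrams(doc):
        bf.add(gram)
    return bf


def candidate_docs(query, filters):
    """Return ids of documents whose filter may contain every query trigram."""
    grams = trigrams(query)
    return [doc_id for doc_id, bf in filters.items()
            if all(bf.might_contain(g) for g in grams)]


docs = {1: "search engine architecture", 2: "bloom filters save memory"}
filters = {doc_id: build_filter(text) for doc_id, text in docs.items()}
matches = candidate_docs("engine", filters)
```

The filter can only rule documents out. Survivors still need a real substring or index check, because Bloom filters allow false positives but never false negatives.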
        
       | avereveard wrote:
       | idk, this seems more like an evolution of clustering; when I think
       | about search engines I think more of the progression toward
       | stemming, lemmatization, synonym matching and context matching.
        
         | ramraj07 wrote:
         | Also doing it in memory (which is what all the regular search
         | engines do right?)
        
           | jabo wrote:
           | No, ElasticSearch for example uses a disk-first approach.
        
             | Grimm1 wrote:
             | ES uses a disk first approach but only on first load, and
             | is smart enough to load similar results for frequently
             | searched items into cache as well. That's why search return
             | times differ significantly between hot and cold queries.
             | This is actually such a problem that a lot of the time in
             | older ES versions you wind up prewarming the ES cache
             | before you can actually let it be used in production. Most
             | alternative search engine implementations, especially
             | vector-based search engines, load into memory first when
             | brought up versus at query time. ES still isn't great at
             | this and is one reason they're falling behind in modern
             | search, that and their vector search support is kind of
             | abysmal as of even 6 months ago.
        
               | ramoz wrote:
               | They're falling behind compared to what exactly?
               | 
               | Elastic is great on-disk, especially on SSDs and avoiding
               | issues like write amplification.
               | 
               | Loading large indexes in memory isn't simple/cheap and
               | when it comes to vectors we're talking apples/oranges I
               | feel. Modern search architectures need to embrace
               | ensemble approaches but boolean-based content search is
               | often the primary utility in enterprise (and search is
               | supplemented by a customizable tf-idf). Using
               | vector-based retrieval & similarity is still useful but
               | not something you necessarily need elastic to do for you,
               | and the two can co-exist.
        
               | Grimm1 wrote:
               | I've scaled a cluster that was in the 100s of millions of
               | results range. The experience was not great, and tuning
               | for our use case, which was decidedly not a typical
               | enterprise search problem, made it a complete pain. So
               | that's great that it works for that particular case, and
               | we ultimately made it work ourselves much like you're
               | suggesting: we used vector search with something like
               | FAISS as a pre-filtering step and then a final search
               | through a much reduced set of ids in ES. But it's pretty
               | clear a new player could come in here and make a much
               | better experience. Basically ES is, if not unsuitable, a
               | big pain for large non-enterprise search such as web
               | search, where things like vector search are one major
               | signal and provide a better search experience. And that's
               | the exact problem: there aren't off-the-shelf open source
               | solutions if you're not doing a fairly standard ecom or
               | internal business search type problem like log
               | aggregation or internal documents.
               | 
               | I'm also suggesting that use cases that aren't enterprise
               | type search problems are more common than you'd think
               | these days.
               | 
               | Edit: Additionally, the thing here is you have classical
               | boolean search systems like ES and vector search
               | solutions like Milvus, but no one's gotten around to
               | making something that does both well. From what I can see
               | a lot of the players in the space are trying to go in
               | that direction, but it's a slow painful crawl. That
               | results in this type of situation, where we had to do a
               | lot of custom gluing of these systems together, and
               | keeping that parity was super annoying, expensive, and
               | time consuming, though not necessarily performance
               | inhibiting.
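The two-stage setup described in this comment (vector search as a pre-filter, then a boolean search over the reduced id set) might look roughly like the sketch below. It uses brute-force dot products in place of FAISS and an in-memory word match in place of ES, so both functions are stand-ins with hypothetical names, not either system's API.

```python
def top_k_by_vector(query_vec, vectors, k):
    """Brute-force nearest neighbours by dot product (stand-in for an ANN index)."""
    scored = sorted(
        vectors.items(),
        key=lambda item: sum(q * v for q, v in zip(query_vec, item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


def boolean_search(terms, texts, allowed_ids):
    """AND-match every term, restricted to the ids the vector stage kept."""
    results = []
    for doc_id in allowed_ids:
        words = set(texts[doc_id].lower().split())
        if all(term in words for term in terms):
            results.append(doc_id)
    return results


# Made-up corpus: embeddings and text are kept in two separate stores,
# which is where the "keeping parity" burden mentioned above comes from.
vectors = {1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.1, 0.9]}
texts = {1: "fast vector search", 2: "fast keyword search", 3: "vector databases"}

shortlist = top_k_by_vector([1.0, 0.0], vectors, k=2)
final = boolean_search(["vector"], texts, shortlist)
```

Document 3 matches the keyword but never reaches the boolean stage, which shows the trade-off of pre-filtering: anything the vector stage drops is gone for good.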
        
         | stingraycharles wrote:
         | It's a highscalability blog post, though, which usually focuses
         | on precisely the clustering, sharding, etc aspects.
         | 
         | Not saying you're wrong, but it's just a different audience
         | that would be interested in the actual search algorithms.
        
       | manojlds wrote:
       | No https in 2021?
        
         | ilrwbwrkhv wrote:
         | No. In fact most websites don't need HTTPS and pointless data
         | transfer. Wish we could go back a few years on this zeitgeist.
        
           | Xorlev wrote:
           | This is false. Just because the page content isn't sensitive,
           | that doesn't mean that TLS is worthless.
           | 
           | TLS prevents your run of the mill MITM scenarios. Like ISPs
           | inserting ads (something Comcast actually did), or public
           | wifi doing the same. Or worse, more malicious scripts.
           | 
           | You could argue that all I'm really looking for in most cases
           | is message integrity (signing), but if you're going to do
           | that, you might as well just encrypt it too and avoid
           | accidents where sensitive information is sent over unencrypted
           | channels.
        
           | pornel wrote:
           | Every visited HTTP website is a network vulnerability.
           | 
           | It doesn't matter what is supposed to be on these sites. From
           | security perspective they contain MITM attacker's content.
           | They are effectively an API for issuing arbitrary commands to
           | the browser. To shut down this attack API, all sites have to
           | stop using HTTP, no exceptions.
        
       | merliossu wrote:
       | in memory search works well as long as you don't care about
       | persisting your data.. for most companies that would like a big
       | chunk of their strategic assets
        
       | bwb wrote:
       | I am about to roll out search on Shepherd.com and looking at
       | using Algolia. I've been impressed with Algolia on Hacker News...
       | 
       | Is anyone else using them? What are your impressions so far?
       | 
       | Much appreciated
        
         | gervwyk wrote:
         | We have it configured for https://docs.lowdefy.com
         | 
         | Really happy with the service it provides and the ease of
         | implementation. Note that because the docs can take a few
         | seconds to load, their crawler times out and misses some
         | content some of the time. With better page performance this
         | should not be an issue.
         | 
         | (We are actively working on some cool ideas to make Lowdefy
         | apps super fast)
        
         | ushakov wrote:
         | i'm using MeiliSearch, which is an open source alternative
         | 
         | worth giving a look
         | 
         | https://github.com/meilisearch/MeiliSearch
        
           | gervwyk wrote:
           | Did not know about MeiliSearch. Looks really great! Thanks
           | for sharing.
        
         | thefounder wrote:
         | It's easy to use and set up. If pricing and closed source are OK
         | with you then it's worth it. We used them a few years ago and
         | then switched to ES. Think of it like pre-Docker Heroku.
        
           | oakfr wrote:
           | Out of curiosity, what made you choose ES over Algolia?
        
             | thefounder wrote:
             | As @kirubakaran said it was the price and the closed source
             | license. If search becomes a very important part of your
             | business you better own it rather than outsource it.
             | 
             | Algolia is great to get started but it doesn't make sense
             | at scale. If you have large indexes it's just too
             | expensive.
        
               | jabo wrote:
               | What did the migration effort look like when moving from
               | Algolia to ElasticSearch? Also, were you able to
               | replicate the same user experience?
        
             | kirubakaran wrote:
             | From the comment, I guess "pricing and closed source"
             | became not OK
        
         | jabo wrote:
         | I work on an open source alternative to Algolia called
         | Typesense.
         | 
         | Algolia is a great product but can get quite expensive at even
         | moderate scale. If I had a dollar for every time I've heard
         | this from Algolia users switching over...
         | 
         | I recently put together this comparison page, comparing a few
         | search engines, including Algolia, you might find interesting:
         | https://typesense.org/typesense-vs-algolia-vs-elasticsearch-...
        
           | arbitrandomuser wrote:
           | I heard a joke about FTS engines, but Whoosh !
        
           | notdang wrote:
           | It's missing the most important thing: speed. We moved to
           | Algolia mainly because of this. Elastic Search and Solr could
           | not compete.
        
             | jabo wrote:
             | Oh yes. Speed is an important point. ElasticSearch & Solr
             | use disk-first indexing (with RAM as just a cache), whereas
             | Algolia and Typesense use a RAM-first approach where the
             | entire index is stored in memory. This is what makes
             | Algolia/Typesense return results much much faster than
             | ES/Solr, and lets you build search-as-you-type experiences
             | for each keystroke.
             | 
             | I was thinking about adding a row about speed to the
             | comparison matrix, but couldn't find a way to express the
             | comparison clearly... Imagine a row that said:
             | 
             | Search Speed | Super-fast | Super-fast | Slow? ...
             | 
             | That felt a little off. So I resorted to just mentioning
             | primary index location as a proxy.
             | 
             | Open to suggestions on how to express this succinctly.
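To make the RAM-first point concrete, here is a toy search-as-you-type index held entirely in memory. It is only a sketch of the general idea; neither Algolia's nor Typesense's real data structures are described in this thread, and `PrefixIndex` is a made-up name.

```python
import bisect


class PrefixIndex:
    """Sorted in-memory term list; a prefix maps to one contiguous slice."""

    def __init__(self, titles):
        # One (lowercased word, original title) pair per word, kept sorted.
        self.entries = sorted(
            (word, title) for title in titles for word in title.lower().split()
        )
        self.keys = [word for word, _ in self.entries]

    def search(self, prefix, limit=5):
        prefix = prefix.lower()
        # Binary search finds the first word that could start with the prefix.
        start = bisect.bisect_left(self.keys, prefix)
        hits = []
        for word, title in self.entries[start:]:
            if not word.startswith(prefix):
                break
            if title not in hits:
                hits.append(title)
            if len(hits) == limit:
                break
        return hits


index = PrefixIndex(["Search Engines", "Sea Shanties", "Elastic Scaling"])

# Simulate one lookup per keystroke.
for typed in ("s", "se", "sea", "sear"):
    print(typed, index.search(typed))
```

Each keystroke costs one binary search plus a short scan, with no disk reads, which is the property that makes per-keystroke querying practical.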
        
               | Nextgrid wrote:
               | What index sizes are we talking about? If it's a few
               | hundred gigs there's always the possibility of putting
               | the entire ElasticSearch index into a ramdisk, or even
               | just leaving lots of "free" RAM meaning the underlying OS
               | will use it to speed up I/O transparently. Bare-metal
               | machines with insane RAM sizes are a thing, and at
               | massive scale could make sense.
               | 
               | I've had great success at a client where simply upgrading
               | a DB to an instance with enough RAM to fit 80% of the
               | entire data set fixed all performance problems and
               | significantly reduced I/O "pressure" at least for reads
               | (writes were never a problem).
        
               | jabo wrote:
               | I haven't tried to do this myself so I can't speak to it.
               | 
               | But one thing I would add is ElasticSearch is quite
               | versatile and flexible, so I wouldn't be surprised if you
               | can contort it to get it to work for a wide variety of
               | use cases. This is a blessing and a curse - blessing
               | because it's so flexible, curse because the flexibility
               | breeds complexity and brings with it a steep learning
               | curve and operational complexity.
               | 
               | Where I think Algolia / Typesense help is that things
               | work out of the box without the learning curve or
               | operational overhead.
        
               | NicoJuicy wrote:
               | Why not list the main algorithm used for search speed, so
               | users can look up the difference on another page?
        
           | tommoor wrote:
           | Does Typesense support searching in non-latin languages?
        
             | jabo wrote:
             | Yes it does - all languages except logographic ones
             | (Chinese, Japanese and Korean) which we are actively
             | working on:
             | https://github.com/typesense/typesense/issues/228
        
         | kqr wrote:
         | Big up-front disclaimer: my job is making software at Loop54
         | and my salary comes from happy customers of our service.
         | 
         | One of our goals is similar to yours: browsing an online store
         | should be like walking around in a physical store. The
         | navigation system on the site should be as adept as a
         | knowledgeable store employee in helping you find exactly what
         | you're looking for.
         | 
         | At Loop54 many of our customers come from Algolia. It's very
         | popular, and nobody ever gets fired for buying Algolia. In that
         | sense, it's a safe option.
         | 
         | On the other hand, customers come to us from Algolia because
         | Algolia requires a bit of hand-holding and it still doesn't
         | quite seem to get what users are really looking for. When our
         | prospects run randomised controlled trials, our search
         | consistently seems to give users what they want better than
         | Algolia does, with less effort. I can ask about specific
         | numbers if you want.
         | 
         | However, another strength of Algolia that Loop54 is currently
         | behind in is in the surrounding tooling. For better or worse,
         | with Algolia, you'll have more knobs and levers to play with
         | (and you'll need them much more often!)
         | 
         | We do have one or two customers that have a majority of books
         | in their product catalogues, and we know there are some unique
         | challenges that come with that domain.
         | 
         | Loop54 is a very competent, but smaller player. If you think
         | it's interesting, it's worth talking to us. I can't evaluate
         | how good a fit your site would be for us, but that's why we
         | have people who do that for a living!
         | 
         | Edit: I should also say that yes, Loop54 is even more
         | expensive. You shouldn't blindly trust us (or any other
         | provider.) I would strongly suggest running a randomised
         | controlled trial to see whether any expense at all is worth it
         | in your case.
         | 
         | I say this in part because I'm a man of science and believe in
         | experiments to measure things, but also out of self-interest;
         | anyone can throw out impressive marketing, but our search truly
         | shines when put to the test against the alternatives.
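One simple way to read the result of such a randomised controlled trial is a two-proportion z-test on conversion rates. The sketch below is a generic statistics example, not Loop54's methodology, and the visitor and conversion counts are invented purely for illustration.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for search provider A.
    conv_b / n_b: conversions and visitors for search provider B.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: users were randomly assigned to A or B.
z, p_value = two_proportion_z(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value only says the difference is unlikely to be noise. Whether the lift justifies the price is a separate business calculation.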
        
           | Redsquare wrote:
           | I am sorry but https://www.loop54.com/pricing is just totally
           | snied. No monetary information whatsoever. Why even blag me
           | to a pricing page with less than zero pricing honesty.
        
       | geraneum wrote:
       | The comfort that they provide is a trap sometimes! Algolia suggests
       | that frontend sends the queries directly to its service instead
       | of going through our backend, which is good if you want to have a
       | good search engine fast. But don't go for it without considering
       | the consequences. It will take over part of the frontend and your
       | product will depend on Algolia to the point that implementing a
       | single favourite functionality for your users may need to be
       | integrated with their service if you're not careful!
        
       | cinntaile wrote:
       | It's strange, I don't really like using the HN Algolia search. I
       | think it's because the responsiveness doesn't fit HN and the
       | results are okay but not great? What are some other big sites
       | that use Algolia as their search backend? It would be interesting
       | to compare.
        
         | adamveld12 wrote:
         | We use it for general search and similar items results on
         | www.liveauctioneers.com
        
         | polote wrote:
         | > the results are okay but not great
         | 
         | What results do you expect beyond keyword search ranked by
         | upvotes on HN? I find it great honestly, it's fast and doesn't do
         | magic.
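Taken literally, "keyword search ranked by upvotes" can be sketched in a few lines. This illustrates the commenter's description only; how the HN search actually ranks results is not stated in the thread, and the stories below are made up.

```python
def keyword_search(query, stories):
    """Keep stories whose title contains every query term, highest points first."""
    terms = query.lower().split()
    matches = [
        story for story in stories
        if all(term in story["title"].lower() for term in terms)
    ]
    return sorted(matches, key=lambda story: story["points"], reverse=True)


# Hypothetical stories for illustration.
stories = [
    {"title": "Search engine architecture", "points": 154},
    {"title": "Building a search engine in Rust", "points": 320},
    {"title": "Why HTTPS matters", "points": 90},
]
results = keyword_search("search engine", stories)
```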
        
       ___________________________________________________________________
       (page generated 2021-08-08 23:00 UTC)