[HN Gopher] Why Writing Your Own Search Engine Is Hard (2004)
       ___________________________________________________________________
        
       Why Writing Your Own Search Engine Is Hard (2004)
        
       Author : georgehill
       Score  : 86 points
       Date   : 2022-07-23 17:34 UTC (5 hours ago)
        
 (HTM) web link (queue.acm.org)
 (TXT) w3m dump (queue.acm.org)
        
       | ldjkfkdsjnv wrote:
       | Theory I have:
       | 
       | Text search on the web will slowly die. People will search video
       | based content, and use the fact that a human spoke the
       | information, as well as comments/upvotes to vet it as trustworthy
        | material. Google search as we know it will slowly decline, the
        | way Facebook is declining. TikTok will steal search market share
        | as its video clips come to span all of human life.
        
         | xnx wrote:
         | Returning text results in response to queries will continue to
         | decline in favor of returning answers and synthesized responses
         | directly. I don't want Google to point me to a page that
         | contains the answer somewhere, when it could provide me an even
         | better summary based on thousands of related pages it has read.
        
           | ldjkfkdsjnv wrote:
            | Right, but the main flaw with Google is that people
            | increasingly don't trust the results, whether they are
            | synthesized or not. And Google is in the adversarial
            | position of wanting to censor certain answers as well as
            | present answers that maximize its own revenue. An
            | alternative (like video-based TikTok) will arise and crush
            | them eventually.
        
       | wizofaus wrote:
       | Doesn't mention the hardest part I found when developing a
       | crawler - dealing with pages whose content is mostly dynamic and
        | generated client side (SPAs). Even using V8 it's hard to do
       | reliably and performantly at scale.
        
         | sanjayts wrote:
         | > Doesn't mention the hardest part ... dealing with pages whose
          | content is mostly dynamic and generated client side (SPAs)
         | 
         | Given this is from 2004 I'm not surprised.
        
           | wizofaus wrote:
           | That was about when I was writing my crawler (not for search
           | but for rules-based analysis). Even in 2004 a lot of key DOM
           | elements were created/modified client side.
        
             | wizofaus wrote:
              | Though I do remember now that we solved it with a separate
              | mechanism for pages that required logging in or had
              | significant client-side rendering: the user could record a
              | macro that was played back in
             | a headless browser. Within a few years though it was
             | obvious a crawler would need to be able to automatically
             | handle client scripts.
        
             | [deleted]
        
       | wolfgang42 wrote:
       | I've been puttering away at making a search engine of my own (I
       | should really do a Show HN sometime); let's see how my experience
       | compares with 18 years ago:
       | 
       | Bandwidth: This is now also cheap; my residential service is 1
       | Gbit. However, the suggestion to wait until you've got indexing
       | working well before optimizing crawling is IMO still spot-on;
       | trying to make a polite, performant crawler that can deal with
        | all the bizarre edge cases
       | (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) on the
       | Web will drag you down. (I bypassed this problem by starting with
       | the Stack Exchange data dumps and Wikipedia crawls, which are a
       | lot more consistent than trying to deal with random websites.)
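        | 
        | As a taste of what "polite" involves, a minimal robots.txt check
        | might look like this (Python sketch only; the user agent string
        | is a made-up placeholder, and real politeness also means
        | per-host rate limits, backoff, and caching robots.txt):
        | 
        |     from urllib import robotparser
        |     from urllib.parse import urlparse
        | 
        |     USER_AGENT = "MyToyCrawler/0.1"  # hypothetical
        | 
        |     def allowed(url: str) -> bool:
        |         # Fetch and parse the site's robots.txt, then ask
        |         # whether this URL may be crawled by our user agent.
        |         parts = urlparse(url)
        |         rp = robotparser.RobotFileParser()
        |         rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        |         rp.read()
        |         return rp.can_fetch(USER_AGENT, url)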
       | 
       | CPU: Computers are _really_ fast now; I'm using a 2-core computer
       | from 2014 and it does what I need just fine.
       | 
       | Disk: SATA is the new thing now, of course, but the difference
       | these days is HDD vs SSD. SSD is faster: but you can design your
       | architecture so that this mostly doesn't matter, and even a
       | "slow" HDD will be running at capacity. (The trick is to do
       | linear streaming as much as possible, and avoid seeks at all
       | costs.) Still, it's probably a good idea to store your production
       | index on an SSD, and it's useful for intermediate data as well;
       | by happenstance more than design I have a large HDD and a small
       | SSD and they balance each other nicely.
       | 
       | Storing files: 100% agree with this section, for the disk-seek
       | reasons I mention above. Also, pages from the same website often
       | compress very well against each other (since they're using the
       | same templates, large chunks of HTML can be squished down
       | considerably), so if you're pressed for space consider storing
       | one GZIPped file per domain. (The tradeoff with zipping is that
       | you can't arbitrarily seek, but ideally you've designed things so
       | you don't need to do that anyway.) Also, WARC is a standard file
       | format that has a lot of tooling for this exact use case.
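        | 
        | A minimal sketch of the one-gzip-file-per-domain idea (Python;
        | the file layout and record format are made up for illustration,
        | not a real tool): write a domain's pages into a single gzip
        | stream so the shared template HTML compresses against itself,
        | and read them back with a plain linear scan, no seeking.
        | 
        |     import gzip, json, os
        | 
        |     def write_domain(domain, pages, root="pages"):
        |         # pages: dict mapping url -> html for one domain
        |         os.makedirs(root, exist_ok=True)
        |         path = os.path.join(root, domain + ".jsonl.gz")
        |         with gzip.open(path, "wt", encoding="utf-8") as f:
        |             for url, html in pages.items():
        |                 f.write(json.dumps({"url": url, "html": html}))
        |                 f.write("\n")
        | 
        |     def read_domain(domain, root="pages"):
        |         # Linear scan over the whole domain's pages.
        |         path = os.path.join(root, domain + ".jsonl.gz")
        |         with gzip.open(path, "rt", encoding="utf-8") as f:
        |             for line in f:
        |                 rec = json.loads(line)
        |                 yield rec["url"], rec["html"]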
       | 
       | Networking: I skipped this by just storing everything on one
       | computer; I expect to be able to continue doing this for a long
       | time, since vertical scaling can get you _very_ far these days.
       | 
       | Indexing: You basically don't need to write _anything_ to get
       | started with this these days! I'm just using bog-standard
       | Elasticsearch with some glue code to do html2text; it's working
       | fine and took all of an afternoon to set up from scratch. (That
       | said, I'm not sure I'll _continue_ using Elastic: it has a ton of
       | features I don't need, which makes it very hard to understand and
       | work with since there's so much that's irrelevant to me. I'm
       | probably going to switch to either straight Lucene or Bleve
       | soon.)
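        | 
        | For a sense of scale, the glue code is roughly of this shape
        | (Python sketch; the index name, field names, and the crude
        | tag-stripping are placeholders for illustration, not a
        | description of an actual setup):
        | 
        |     import hashlib, requests          # pip install requests
        |     from html.parser import HTMLParser
        | 
        |     ES = "http://localhost:9200"      # local Elasticsearch
        | 
        |     class TextExtractor(HTMLParser):
        |         # Crude html2text: keep text nodes, skip script/style.
        |         def __init__(self):
        |             super().__init__()
        |             self.skip, self.chunks = False, []
        |         def handle_starttag(self, tag, attrs):
        |             if tag in ("script", "style"):
        |                 self.skip = True
        |         def handle_endtag(self, tag):
        |             if tag in ("script", "style"):
        |                 self.skip = False
        |         def handle_data(self, data):
        |             if not self.skip and data.strip():
        |                 self.chunks.append(data.strip())
        | 
        |     def index_page(url, html):
        |         p = TextExtractor()
        |         p.feed(html)
        |         doc_id = hashlib.sha1(url.encode()).hexdigest()
        |         # Standard document API: PUT /<index>/_doc/<id>
        |         requests.put(f"{ES}/pages/_doc/{doc_id}", json={
        |             "url": url,
        |             "text": " ".join(p.chunks),
        |         }).raise_for_status()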
       | 
       | Page rank: I added pagerank very early on in the hopes that it
       | would improve my results, and I'm not really sure how helpful it
       | is if your results aren't decent to begin with. However, the
       | march of Moore's law has made it an easy experiment: what Page
       | and Brin's server could compute in a week with carefully
       | optimized C code, mine can do in less than 5 minutes (!) with a
       | bit of JavaScript.
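        | 
        | For reference, the core of that computation is tiny. A minimal
        | power-iteration sketch (Python here rather than JavaScript; the
        | dict-of-lists graph is purely illustrative, a real link graph
        | needs a much more compact representation):
        | 
        |     def pagerank(links, damping=0.85, iterations=50):
        |         # links: dict mapping page -> list of pages it links to
        |         pages = list(links)
        |         rank = {p: 1.0 / len(pages) for p in pages}
        |         for _ in range(iterations):
        |             new = {p: (1 - damping) / len(pages) for p in pages}
        |             for page, outlinks in links.items():
        |                 # Drop links pointing outside the graph; a page
        |                 # with no outlinks shares its rank with everyone.
        |                 targets = [t for t in outlinks if t in rank] or pages
        |                 share = damping * rank[page] / len(targets)
        |                 for t in targets:
        |                     new[t] += share
        |             rank = new
        |         return rank
        | 
        |     # e.g. pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})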
       | 
        | Serving: Again, Elasticsearch will solve this entire problem for
       | you (at least to start with); all your frontend has to do is take
       | the JSON result and poke it into an HTML template.
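        | 
        | Roughly, the serving path can be as small as this (Python
        | sketch, continuing the hypothetical "pages" index and field
        | names from the indexing example above):
        | 
        |     import html, requests             # pip install requests
        | 
        |     ES = "http://localhost:9200"
        | 
        |     def results_page(query, size=10):
        |         # Standard search API: POST /<index>/_search
        |         resp = requests.post(f"{ES}/pages/_search", json={
        |             "query": {"match": {"text": query}},
        |             "size": size,
        |         })
        |         resp.raise_for_status()
        |         hits = resp.json()["hits"]["hits"]
        |         items = "".join(
        |             "<li><a href='{0}'>{0}</a></li>".format(
        |                 html.escape(h["_source"]["url"]))
        |             for h in hits)
        |         return ("<h1>Results for %s</h1><ul>%s</ul>"
        |                 % (html.escape(query), items))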
       | 
       | It's easier than ever to start building a search engine in your
       | own home; the recent explosion of such services (as seen on HN)
       | is an indicator of the feasibility, and the rising complaints
       | about Google show that the demand is there. Come and join us, the
       | water's fine!
        
         | boyter wrote:
         | Please do write about it and your thinking behind it. There is
         | so little out there written in the space.
        
       | t_mann wrote:
        | Would be interesting to see stats from that time on how many
        | people were working on search engines and how it turned out for
        | them. Did they end up getting acquired, at least funded for a
        | while, exited, or just bootstrapped themselves until they
        | realized there'd only be one winner?
        
       | boyter wrote:
       | Glad to see this on the front page. One of those posts I reread
        | every now and then. Better yet, it's written by Anna Patterson,
        | who, in addition to the search engines mentioned at the bottom
        | of the article, wrote chunks of Cuil (interesting even if it
        | failed) and worked on parts of Google's index both before Cuil
        | and, I think, now.
       | 
        | Sadly it's a little out of date. I'd love to see a more modern
        | post by someone: perhaps the authors of Mojeek, Right Dao, or
        | someone else running their own custom index. Heck, I'd pay for
        | something by Matt Wells of Gigablast or those behind Blekko. The
        | whole space is so secretive that only crumbs of information ever
        | reach those who are really interested in it.
       | 
       | If you are into this space or just curious the videos about
       | bitfunnel which forms parts of the bing index are an excellent
       | watch https://www.youtube.com/watch?v=1-Xoy5w5ydM and
       | https://www.clsp.jhu.edu/events/mike-hopcroft-microsoft/#.YT...
        
       | Xeoncross wrote:
        | Yeah, there are certainly more problems these days. For one, the
        | web is larger and more of it is spam, which trips up pure
        | PageRank because spam networks link heavily to each other.
       | 
       | Important sites have a bunch of anti-crawling detection set up
        | (especially news sites). Even worse, the best user-generated
        | content is behind walled gardens in Facebook groups, Slack
        | channels, Quora threads, etc...
       | 
        | The rest of the good sites are JavaScript-heavy and you often
        | have to run headless Chrome to render the page and find the
        | content - but that is detectable, so you end up renting IPs from
        | mobile number farms or trying to build your own 4G network.
       | 
       | On the upside, https://commoncrawl.org/ now exists and makes the
       | prototype crawling work much easier. It's not the full internet,
       | but gives you plenty to work with and test against so you can
       | skip to the part where you figure out if you can produce anything
       | useful should you actually try to crawl the whole internet.
        
         | ArrayBoundCheck wrote:
          | I don't know how people can use the data. There's so much of
          | it! I don't see any hard drives that are 80TB. It seems like
          | people would need some kind of RAID setup that can handle
          | 200+TB of uncompressed data.
        
           | francoismassot wrote:
            | A search index is often made of smaller independent pieces
            | called segments, so you can progressively download and
            | process the data locally, upload the segments to object
            | storage, and run queries on them. That's what we did here for
           | this project: https://quickwit.io/blog/commoncrawl
           | 
           | Also an interesting blog post here:
           | https://fulmicoton.com/posts/commoncrawl/
        
           | Xeoncross wrote:
           | You don't need to download the whole thing. You can parse the
           | WARC files from S3 to only extract the information you want
           | (like pages with content). It's a lot smaller when you only
           | keep the links and text.
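            | 
            | For instance, with the warcio library the filtering step is
            | roughly this (Python sketch; the file path is a placeholder
            | and error handling is omitted):
            | 
            |     # pip install warcio
            |     from warcio.archiveiterator import ArchiveIterator
            | 
            |     def extract_pages(warc_path):
            |         # Yield (url, html_bytes) for each captured response.
            |         with open(warc_path, "rb") as stream:
            |             for record in ArchiveIterator(stream):
            |                 if record.rec_type != "response":
            |                     continue  # skip request/metadata records
            |                 url = record.rec_headers.get_header(
            |                     "WARC-Target-URI")
            |                 body = record.content_stream().read()
            |                 if body:
            |                     yield url, body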
        
         | nonrandomstring wrote:
          | > but that is detectable, so you end up renting IPs from mobile
         | number farms or trying to build your own 4G network.
         | 
         | Something is deeply wrong with such an adversarial ecosystem.
         | If sites don't want to be found and indexed why go to any
         | effort to include them? On the other hand there are millions of
         | small sites out there keen to be found.
         | 
         | The established idea of a "search engine" seems stuck, limited
         | and based on some 90's technology that worked on a 90's web
         | that no longer exists. Surely after 30 years we can build some
         | kind of content discovery layer on top of what's out there?
        
           | noncoml wrote:
           | They don't want to be indexed unless you are Google
        
           | amelius wrote:
           | > Something is deeply wrong with such an adversarial
           | ecosystem. If sites don't want to be found and indexed why go
           | to any effort to include them?
           | 
           | I think it is not about being found. It is more about being
           | copied.
           | 
            | These sites are afraid their content will be stolen, so they
            | only allow Google to crawl them.
        
           | jonhohle wrote:
           | Maybe we need a categorized, hand curated directory of sites
           | that users can submit their own sites to for inclusion and
           | categorization. Maybe like an open directory. Perhaps Mozilla
           | could operate it, or maybe Yahoo!
        
             | groffee wrote:
             | With Goggles[0] (goggles/googles/potato/potato) you can get
             | them. Curated lists by topic.
             | 
             | [0] https://search.brave.com/help/goggles
        
             | noduerme wrote:
             | I know, right? Imagine if you went to the front page of
             | Yahoo! and it was like a curated directory of websites.
             | Like... a _portal_.
             | 
             | It could look something like this: https://web.archive.org/
             | web/20000302042007/http://www1.yahoo...
        
               | wongarsu wrote:
               | We could also make a website where people can submit
                | links to great websites they find, and also allow them to
               | vote on the submissions of other users. That way you have
               | a page filled with the best links, as determined by
               | users. Maybe call it "the homepage of the internet".
               | 
               | You could even add the ability to discuss these links,
               | and add a similar voting system to those discussions.
        
               | zeroonetwothree wrote:
               | Wow this gave me such an overwhelming feeling of
               | nostalgia. I really miss the early years of the web.
        
           | Xeoncross wrote:
           | https://blogsurf.io/ is an example of a small search engine
           | that just stuck to a directory of known blogs instead of
           | indexing the big sites or randomly crawling the web and
           | ending up with mostly gibberish pages from all the spam
           | sites.
        
             | mannyistyping wrote:
              | Thank you for sharing this! I read through the site's
              | about page, and I really enjoy how the creator chose to
              | stick to a specific area for quality over quantity.
        
           | altdataseller wrote:
           | >> If sites don't want to be found and indexed why go to any
           | effort to include them? On the other hand there are millions
           | of small sites out there keen to be found.
           | 
           | Then they should treat all bots equally and block Google as
           | well. If they block Google as well, then yes, we should leave
           | them alone.
           | 
           | Why give unfair treatment to Google? That's anti-competitive
           | behavior and it just prevents new search engines from being
           | created.
        
             | nonrandomstring wrote:
              | Combined with jeffbee's answer, I think I understand:
              | these sites are behaving selectively according to who you
             | are. So we're back to "No Blacks or Irish" on the 2022
             | Internet?
             | 
             | What do you think they have against smaller search engines?
             | I can't quite fathom the motives.
        
               | wolfgang42 wrote:
               | There are a lot of crawlers out there, and many of them
               | are ill-behaved. When GoogleBot crawls your site, you get
               | more visitors. When FizzBuzzBot/0.1.3 comes along, you're
               | more likely to get an overloaded server, weird URLs in
               | your crash logs, spam, or any other manner of mess.
               | 
               | Small search engines getting blocked is just collateral
               | damage from websites trying to solve this problem with a
               | blunt ban-hammer.
        
           | jeffbee wrote:
           | I think that is not what they mean. I think what they meant
           | is the site will detect your headless robot and serve it good
           | content, while serving spam and malware to everyone else. The
           | crawlers need their own distributed networks of unrelated
           | addresses to prevent or detect this behavior.
        
           | thanksgiving wrote:
           | > Something is deeply wrong with such an adversarial
           | ecosystem. If sites don't want to be found and indexed why go
           | to any effort to include them? On the other hand there are
           | millions of small sites out there keen to be found.
           | 
            | I work on a small-to-medium ecommerce website and my code
           | just... sucks. I kind of don't want to admit it but it is
           | true. When there is some Chinese search engine that tries to
           | crawl all the product detail pages during the day (presumably
           | at night for them?), it slows down the site to a crawl. I
           | mean technically I should have the pages set up so they can't
            | pierce through the Cloudflare cache, but it is easier to
            | just ask Cloudflare to challenge users (captcha?) if there
            | are more than n (I think currently set to something small
            | like ten) requests per second from any single source.
           | 
           | I don't understand all the business decisions but yeah, I'd
           | suspect the biggest reason is we simply have poor codebases
           | and can't spend too much time fixing this while we have so
           | many backlog items from marketing to work on...
        
             | [deleted]
        
             | ALittleLight wrote:
             | Why are page loads so slow or demanding? I can't imagine
             | how a web crawler could be DoS'ing you if it's in good
             | faith. What is the TPS? What caching are you doing? What's
             | your stack like?
        
               | Gh0stRAT wrote:
               | Not GP, but from having run a small/niche search engine
               | that got hammered by a crawler in the past:
               | 
                | The web server was a single VM running a Java + Spring
                | app in Tomcat, connecting to an overworked Solr
               | cluster to do the actual faceted searching.
               | 
               | Caches kept most page loads for organic traffic within
               | respectable bounds, but the crawler destroyed our cache
               | hit rate when it was scraping our site and at one point
               | did exhaust a concurrent connection limit of some kind
               | because there were so many slow/timing-out requests in
               | progress at the same time.
        
               | ALittleLight wrote:
               | I would expect that a small to medium e-commerce site
               | would cache all their pages.
        
         | Xeoncross wrote:
          | There isn't a one-size-fits-all approach, but I've never worked
         | on a project that encompasses as many computer science
         | algorithms as a search engine.
         | 
         | - Tries (patricia, radix, etc...)
         | 
         | - Trees (b-trees, b+trees, merkle trees, log-structured merge-
         | tree, etc..)
         | 
         | - Consensus (raft, paxos, etc..)
         | 
         | - Block storage (disk block size optimizations, mmap files,
         | delta storage, etc..)
         | 
          | - Probabilistic filters (HyperLogLog, bloom filters, etc...)
         | 
         | - Binary Search (sstables, sorted inverted indexes)
         | 
         | - Ranking (pagerank, tf/idf, bm25, etc...)
         | 
         | - NLP (stemming, POS tagging, subject identification, etc...)
         | 
         | - HTML (document parsing/lexing)
         | 
         | - Images (exif extraction, removal, resizing / proxying,
         | etc...)
         | 
         | - Queues (SQS, NATS, Apollo, etc...)
         | 
         | - Clustering (k-means, density, hierarchical, gaussian
         | distributions, etc...)
         | 
         | - Rate limiting (leaky bucket, windowed, etc...)
         | 
         | - text processing (unicode-normalization, slugify, sanitation,
         | lossless and lossy hashing like metaphone and document
         | fingerprinting)
         | 
         | - etc...
         | 
         | I'm sure there is plenty more I've missed. There are lots of
         | generic structures involved like hashes, linked-lists, skip-
         | lists, heaps and priority queues and this is just to get 2000's
         | level basic tech.
         | 
         | - https://github.com/quickwit-oss/tantivy
         | 
         | - https://github.com/valeriansaliou/sonic
         | 
         | - https://github.com/mosuka/phalanx
         | 
         | - https://github.com/meilisearch/MeiliSearch
         | 
         | - https://github.com/blevesearch/bleve
         | 
         | - https://github.com/thomasjungblut/go-sstables
         | 
          | A lot of people new to this space mistakenly think you can just
          | throw Elasticsearch or Postgres full-text search in front of
          | terabytes of records and have something decent. That might work
          | for something small like a curated collection of a few hundred
          | sites.
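          | 
          | To pick one item from the list above: BM25 is only a few lines
          | once you have term and document statistics. A toy scoring
          | sketch (Python; in-memory dicts stand in for a real inverted
          | index):
          | 
          |     import math
          | 
          |     def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
          |         # docs: dict mapping doc_id -> list of tokens
          |         n = len(docs)
          |         avgdl = sum(len(d) for d in docs.values()) / n
          |         df = {t: sum(1 for d in docs.values() if t in d)
          |               for t in query_terms}
          |         scores = {}
          |         for doc_id, terms in docs.items():
          |             s = 0.0
          |             for t in query_terms:
          |                 tf = terms.count(t)
          |                 if tf == 0:
          |                     continue
          |                 idf = math.log(
          |                     (n - df[t] + 0.5) / (df[t] + 0.5) + 1)
          |                 norm = k1 * (1 - b + b * len(terms) / avgdl)
          |                 s += idf * tf * (k1 + 1) / (tf + norm)
          |             scores[doc_id] = s
          |         return scores
          | 
          |     # bm25_scores(["search", "engine"],
          |     #             {"d1": "a search engine".split(),
          |     #              "d2": "something else".split()})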
        
           | kreeben wrote:
           | Yes, yes, yes :D There are so many topics in this space that
           | are so interesting it's like a dream. I would add to your
           | list
           | 
           | - sentiment analysis
           | 
           | - roaring bitmaps
           | 
           | - compression
           | 
           | - applied linear algebra
           | 
           | - ai
           | 
            | In a Venn diagram intersecting all of these topics is
            | search. Coding a search engine from scratch is a beautiful
            | way to spend one's days, if you're into programming.
        
           | boyter wrote:
           | > That might work for something small like a curated
           | collection of a few hundred sites.
           | 
           | Probably more like a few million but otherwise 100% true.
           | Once you really need to scale you have to start losing some
           | accuracy or correctness.
           | 
           | It helps that the goal of a search engine is not to find all
           | the results but instead delight the user by finding the
           | things they want.
        
       | [deleted]
        
       | streets1627 wrote:
       | Hey folks, I am one of the co-founders of neeva.com
       | 
       | While writing a search engine is hard, it is also incredibly
       | rewarding. Over the past two years, we have brought up a
       | meaningful crawl / index / serve pipeline for Neeva. Being able
       | to create pages like https://neeva.com/search?q=tomato%20soup or
       | https://neeva.com/search?q=golang+struct+split which are so much
       | better than what is out there in commercial search engines is so
       | worth it.
       | 
        | We are private, ad-free, and customer-paid.
        
       | amelius wrote:
       | The article doesn't touch upon the hardest and most interesting
       | part: NLP and finding the most relevant results. I would like to
       | see a post on this.
        
       ___________________________________________________________________
       (page generated 2022-07-23 23:00 UTC)