[HN Gopher] Why Writing Your Own Search Engine Is Hard (2004)
___________________________________________________________________
Why Writing Your Own Search Engine Is Hard (2004)
Author : georgehill
Score  : 86 points
Date   : 2022-07-23 17:34 UTC (5 hours ago)
(HTM) web link (queue.acm.org)
(TXT) w3m dump (queue.acm.org)
| ldjkfkdsjnv wrote:
| Theory I have:
|
| Text search on the web will slowly die. People will search
| video-based content, and use the fact that a human spoke the
| information, as well as comments/upvotes, to vet it as
| trustworthy material. Google search as we know it will slowly
| die, declining the way Facebook has. TikTok will steal search
| market share as its video clips span all of human life.
| xnx wrote:
| Returning text results in response to queries will continue to
| decline in favor of returning answers and synthesized responses
| directly. I don't want Google to point me to a page that
| contains the answer somewhere when it could provide an even
| better summary based on thousands of related pages it has read.
| ldjkfkdsjnv wrote:
| Right, but the main flaw with Google is that people increasingly
| don't trust the result, whether it is synthesized or not. And
| Google is in the adversarial position of wanting to censor
| certain answers as well as present answers that maximize its own
| revenue. An answer (like video-based TikTok) will arise and
| crush them eventually.
| wizofaus wrote:
| Doesn't mention the hardest part I found when developing a
| crawler: dealing with pages whose content is mostly dynamic and
| generated client-side (SPAs). Even using V8, it's hard to do
| reliably and performantly at scale.
| sanjayts wrote:
| > Doesn't mention the hardest part ... dealing with pages whose
| content is mostly dynamic and generated client-side (SPAs)
|
| Given this is from 2004, I'm not surprised.
| wizofaus wrote:
| That was about when I was writing my crawler (not for search but
| for rules-based analysis). Even in 2004, a lot of key DOM
| elements were created or modified client-side.
| wizofaus wrote:
| Though I do remember now that we solved it by having a separate
| mechanism for accessing pages that required logging in or had
| significant client-side rendering: the user could record a macro
| that was then played back in a headless browser. Within a few
| years, though, it was obvious a crawler would need to be able to
| handle client scripts automatically.
| [deleted]
| wolfgang42 wrote:
| I've been puttering away at making a search engine of my own (I
| should really do a Show HN sometime); let's see how my
| experience compares with 18 years ago:
|
| Bandwidth: This is now also cheap; my residential service is 1
| Gbit. However, the suggestion to wait until you've got indexing
| working well before optimizing crawling is IMO still spot-on;
| trying to make a polite, performant crawler that can deal with
| all the bizarre edge cases on the Web
| (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) will
| drag you down. (I bypassed this problem by starting with the
| Stack Exchange data dumps and Wikipedia crawls, which are a lot
| more consistent than random websites.)
|
| CPU: Computers are _really_ fast now; I'm using a 2-core
| computer from 2014 and it does what I need just fine.
|
| Disk: SATA is the new thing now, of course, but the difference
| these days is HDD vs SSD. SSD is faster, but you can design your
| architecture so that this mostly doesn't matter, and even a
| "slow" HDD will be running at capacity. (The trick is to do
| linear streaming as much as possible and avoid seeks at all
| costs.) Still, it's probably a good idea to store your
| production index on an SSD, and it's useful for intermediate
| data as well; by happenstance more than design I have a large
| HDD and a small SSD, and they balance each other nicely.
|
| Storing files: 100% agree with this section, for the disk-seek
| reasons I mention above. Also, pages from the same website often
| compress very well against each other (since they use the same
| templates, large chunks of HTML can be squished down
| considerably), so if you're pressed for space, consider storing
| one gzipped file per domain. (The tradeoff with zipping is that
| you can't seek arbitrarily, but ideally you've designed things
| so you don't need to do that anyway.) Also, WARC is a standard
| file format that has a lot of tooling for this exact use case.
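(A minimal Python sketch of the one-gzip-stream-per-domain idea
described above; illustrative only, not code from the thread. The
record layout and the class name are invented for the example.)

    import gzip
    from pathlib import Path

    class DomainStore:
        """One long-lived gzip stream per domain.

        A single deflate stream per site means later pages compress
        against the shared templates of earlier ones, and every
        write is a linear append with no seeks.
        """

        def __init__(self, root="crawl"):
            self.root = Path(root)
            self.root.mkdir(exist_ok=True)
            self._streams = {}

        def append(self, domain, url, html: bytes):
            f = self._streams.get(domain)
            if f is None:
                f = self._streams[domain] = gzip.open(
                    self.root / f"{domain}.gz", "wb")
            # Tiny record header (URL and body length), then the body.
            f.write(f"{url}\t{len(html)}\n".encode())
            f.write(html)

        def close(self):
            for f in self._streams.values():
                f.close()

    def read_pages(path):
        """Stream (url, html) pairs back; sequential only, by design."""
        with gzip.open(path, "rb") as f:
            while header := f.readline():
                url, length = header.rstrip(b"\n").split(b"\t", 1)
                yield url.decode(), f.read(int(length.decode()))

(For anything serious, the WARC format mentioned above gives you the
same linear gzipped layout in a standard container, with existing
tooling instead of a homegrown record format.)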
| Networking: I skipped this by just storing everything on one
| computer; I expect to be able to keep doing this for a long
| time, since vertical scaling can get you _very_ far these days.
|
| Indexing: You basically don't need to write _anything_ to get
| started with this these days! I'm just using bog-standard
| Elasticsearch with some glue code to do html2text; it's working
| fine and took all of an afternoon to set up from scratch. (That
| said, I'm not sure I'll _continue_ using Elastic: it has a ton
| of features I don't need, which makes it very hard to understand
| and work with, since there's so much that's irrelevant to me.
| I'm probably going to switch to either straight Lucene or Bleve
| soon.)
|
| Page rank: I added PageRank very early on in the hopes that it
| would improve my results, and I'm not really sure how helpful it
| is if your results aren't decent to begin with. However, the
| march of Moore's law has made it an easy experiment: what Page
| and Brin's server could compute in a week with carefully
| optimized C code, mine can do in less than 5 minutes (!) with a
| bit of JavaScript.
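(For scale: the "bit of JavaScript" above is essentially a few lines
of power iteration. An equivalent minimal sketch in Python, not
wolfgang42's actual code:)

    def pagerank(links, damping=0.85, iters=50):
        """links maps page -> outgoing links; returns page -> rank."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            nxt = {p: (1.0 - damping) / n for p in pages}
            for page, outs in links.items():
                if outs:
                    share = damping * rank[page] / len(outs)
                    for out in outs:
                        if out in nxt:   # ignore links leaving the crawl
                            nxt[out] += share
                else:                    # dangling page: spread evenly
                    for p in pages:
                        nxt[p] += damping * rank[page] / n
            rank = nxt
        return rank

    # Toy graph: b and c both point at a, so a ends up ranked highest.
    print(pagerank({"a": ["b"], "b": ["a"], "c": ["a"]}))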
| Serving: Again, Elasticsearch will solve this entire problem for
| you (at least to start with); all your frontend has to do is
| take the JSON result and poke it into an HTML template.
|
| It's easier than ever to start building a search engine in your
| own home; the recent explosion of such services (as seen on HN)
| is an indicator of the feasibility, and the rising complaints
| about Google show that the demand is there. Come and join us,
| the water's fine!
| boyter wrote:
| Please do write about it and your thinking behind it. There is
| so little written in this space.
| t_mann wrote:
| It would be interesting to see stats from that time on how many
| people were working on search engines and how it turned out for
| them. Did they end up getting acquired, at least funded for a
| while, exited, or just bootstrap themselves until they realized
| there would only be one winner?
| boyter wrote:
| Glad to see this on the front page. It's one of those posts I
| reread every now and then. Better yet, it's written by Anna
| Patterson, who in addition to the searches mentioned at the
| bottom wrote chunks of Cuil (interesting even if it failed) and
| has worked on parts of Google's index, both before Cuil and, I
| think, now.
|
| Sadly it's a little out of date. I'd love to see a more modern
| post by someone. Perhaps the authors of Mojeek, Right Dao, or
| someone else running their own custom index. Heck, I'd pay for
| something by Matt Wells of Gigablast or those behind Blekko. The
| whole space is so secretive that, for those really interested in
| it, only crumbs of information are ever released.
|
| If you are into this space or just curious, the videos about
| BitFunnel, which forms part of the Bing index, are an excellent
| watch: https://www.youtube.com/watch?v=1-Xoy5w5ydM and
| https://www.clsp.jhu.edu/events/mike-hopcroft-microsoft/#.YT...
| Xeoncross wrote:
| Yeah, there are certainly more problems these days. For one, the
| web is larger, and more of it is spam, which causes trouble for
| pure PageRank given networks of sites that heavily link to each
| other.
|
| Important sites have a bunch of anti-crawling detection set up
| (especially news sites). It's even worse that the best user-
| generated content is behind walled gardens in Facebook groups,
| Slack channels, Quora threads, etc.
|
| The rest of the good sites are JavaScript-heavy, and you often
| have to run headless Chrome to render the page and find the
| content - but that is detectable, so you end up renting IPs from
| mobile number farms or trying to build your own 4G network.
|
| On the upside, https://commoncrawl.org/ now exists and makes the
| prototype crawling work much easier. It's not the full internet,
| but it gives you plenty to work with and test against, so you
| can skip to the part where you figure out whether you can
| produce anything useful before you actually try to crawl the
| whole internet.
| ArrayBoundCheck wrote:
| I don't know how people can use the data. There's so much of it!
| I don't see any hard drives that are 80TB. It seems like people
| would need some kind of RAID setup that can handle 200+TB of
| uncompressed data.
| francoismassot wrote:
| A search index is often made of smaller independent pieces
| called segments, so you can progressively download and process
| the data locally, upload the segments to object storage, and run
| queries on them. That's what we did here for this project:
| https://quickwit.io/blog/commoncrawl
|
| Also an interesting blog post here:
| https://fulmicoton.com/posts/commoncrawl/
| Xeoncross wrote:
| You don't need to download the whole thing. You can parse the
| WARC files from S3 and extract only the information you want
| (like pages with content). It's a lot smaller when you only keep
| the links and text.
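(The usual way to do that selective extraction is to stream a
.warc.gz straight off the Common Crawl endpoint with the warcio
library. A minimal sketch; the crawl path below is deliberately
elided, so substitute a real one from a Common Crawl index listing:)

    import requests
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    url = "https://data.commoncrawl.org/crawl-data/.../file.warc.gz"

    with requests.get(url, stream=True) as resp:
        for record in ArchiveIterator(resp.raw):
            if record.rec_type != "response":
                continue  # skip request/metadata records
            uri = record.rec_headers.get_header("WARC-Target-URI")
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" in ctype:
                html = record.content_stream().read()
                print(uri, len(html))  # ...keep links/text, drop the rest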
| nonrandomstring wrote:
| > but that is detectable, so you end up renting IPs from mobile
| number farms or trying to build your own 4G network.
|
| Something is deeply wrong with such an adversarial ecosystem. If
| sites don't want to be found and indexed, why go to any effort
| to include them? On the other hand, there are millions of small
| sites out there keen to be found.
|
| The established idea of a "search engine" seems stuck, limited,
| and based on 90s technology that worked on a 90s web that no
| longer exists. Surely after 30 years we can build some kind of
| content-discovery layer on top of what's out there?
| noncoml wrote:
| They don't want to be indexed unless you are Google.
| amelius wrote:
| > Something is deeply wrong with such an adversarial ecosystem.
| If sites don't want to be found and indexed why go to any effort
| to include them?
|
| I think it is not about being found. It is more about being
| copied.
|
| These sites are afraid their content will be stolen, so they
| only allow Google to crawl them.
| jonhohle wrote:
| Maybe we need a categorized, hand-curated directory of sites
| that users can submit their own sites to for inclusion and
| categorization. Maybe like an open directory. Perhaps Mozilla
| could operate it, or maybe Yahoo!
| groffee wrote:
| With Goggles[0] (goggles/googles/potato/potato) you can get
| them. Curated lists by topic.
|
| [0] https://search.brave.com/help/goggles
| noduerme wrote:
| I know, right? Imagine if you went to the front page of Yahoo!
| and it was like a curated directory of websites. Like... a
| _portal_.
|
| It could look something like this:
| https://web.archive.org/web/20000302042007/http://www1.yahoo...
| wongarsu wrote:
| We could also make a website where people can submit links to
| great websites they find, and also allow them to vote on the
| submissions of other users. That way you have a page filled with
| the best links, as determined by users. Maybe call it "the
| homepage of the internet".
|
| You could even add the ability to discuss these links, with a
| similar voting system for those discussions.
| zeroonetwothree wrote:
| Wow, this gave me such an overwhelming feeling of nostalgia. I
| really miss the early years of the web.
| Xeoncross wrote:
| https://blogsurf.io/ is an example of a small search engine that
| sticks to a directory of known blogs instead of indexing the big
| sites or randomly crawling the web and ending up with mostly
| gibberish pages from all the spam sites.
| mannyistyping wrote:
| Thank you for sharing this! I read through the site's about
| page, and I really enjoy how the creator stuck to a specific
| area, favoring quality over quantity.
| altdataseller wrote:
| >> If sites don't want to be found and indexed why go to any
| effort to include them? On the other hand there are millions of
| small sites out there keen to be found.
|
| Then they should treat all bots equally and block Google as
| well. If they block Google too, then yes, we should leave them
| alone.
|
| Why give unfair treatment to Google? That's anti-competitive
| behavior, and it just prevents new search engines from being
| created.
| nonrandomstring wrote:
| I think I understand, combined with jeffbee's answer, that these
| sites behave selectively according to who you are. So we're back
| to "No Blacks or Irish" on the 2022 Internet?
|
| What do you think they have against smaller search engines? I
| can't quite fathom the motives.
| wolfgang42 wrote:
| There are a lot of crawlers out there, and many of them are
| ill-behaved. When GoogleBot crawls your site, you get more
| visitors. When FizzBuzzBot/0.1.3 comes along, you're more likely
| to get an overloaded server, weird URLs in your crash logs,
| spam, or any other manner of mess.
|
| Small search engines getting blocked is just collateral damage
| from websites trying to solve this problem with a blunt
| ban-hammer.
| jeffbee wrote:
| I think that is not what they mean. I think what they meant is
| that the site will detect your headless robot and serve it good
| content, while serving spam and malware to everyone else.
| Crawlers need their own distributed networks of unrelated
| addresses to prevent or detect this behavior.
| thanksgiving wrote:
| > Something is deeply wrong with such an adversarial ecosystem.
| If sites don't want to be found and indexed why go to any effort
| to include them? On the other hand there are millions of small
| sites out there keen to be found.
|
| I work on a small-to-medium e-commerce website, and my code
| just... sucks. I kind of don't want to admit it, but it is true.
| When some Chinese search engine tries to crawl all the product
| detail pages during our day (presumably at night for them?), it
| slows the site to a crawl. Technically I should have the pages
| set up so crawlers can't pierce through the Cloudflare cache,
| but it is easier to just ask Cloudflare to challenge a user
| (with a captcha?) if there are more than n requests per second
| from any single source (n is currently set to something small,
| like ten).
|
| I don't understand all the business decisions, but yeah, I'd
| suspect the biggest reason is that we simply have a poor
| codebase and can't spend much time fixing this while we have so
| many backlog items from marketing to work on...
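(The per-source threshold described above, challenging anything over
n requests per second, is at heart a sliding-window counter. A
minimal Python sketch of that kind of limiter, not Cloudflare's
actual mechanism:)

    import time
    from collections import defaultdict, deque

    class SlidingWindowLimiter:
        """Allow at most `limit` hits per `window` seconds per source."""

        def __init__(self, limit=10, window=1.0):
            self.limit = limit
            self.window = window
            self.hits = defaultdict(deque)

        def allow(self, source):
            now = time.monotonic()
            q = self.hits[source]
            while q and now - q[0] > self.window:
                q.popleft()      # drop hits that fell out of the window
            if len(q) >= self.limit:
                return False     # caller serves a challenge instead
            q.append(now)
            return True

    limiter = SlidingWindowLimiter(limit=10, window=1.0)
    if not limiter.allow("203.0.113.7"):
        pass  # serve the captcha/challenge page, not the product page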
| [deleted]
| ALittleLight wrote:
| Why are page loads so slow or demanding? I can't imagine how a
| web crawler could be DoS'ing you if it's crawling in good faith.
| What is the TPS? What caching are you doing? What's your stack
| like?
| Gh0stRAT wrote:
| Not GP, but from having run a small/niche search engine that got
| hammered by a crawler in the past:
|
| The webserver was a single VM running a Java + Spring webserver
| in Tomcat, connecting to an overworked Solr cluster to do the
| actual faceted searching.
|
| Caches kept most page loads for organic traffic within
| respectable bounds, but the crawler destroyed our cache hit rate
| when it was scraping our site, and at one point it exhausted a
| concurrent-connection limit of some kind because there were so
| many slow or timing-out requests in progress at the same time.
| ALittleLight wrote:
| I would expect that a small-to-medium e-commerce site would
| cache all its pages.
| Xeoncross wrote:
| There isn't a one-size-fits-all approach, but I've never worked
| on a project that encompasses as many computer science
| algorithms as a search engine:
|
| - Tries (patricia, radix, etc.)
|
| - Trees (b-trees, b+trees, merkle trees, log-structured
| merge-trees, etc.)
|
| - Consensus (raft, paxos, etc.)
|
| - Block storage (disk block size optimizations, mmap files,
| delta storage, etc.)
|
| - Probabilistic filters (hyperloglog, bloom filters, etc.)
|
| - Binary search (sstables, sorted inverted indexes)
|
| - Ranking (pagerank, tf/idf, bm25, etc.; see the sketch after
| this comment)
|
| - NLP (stemming, POS tagging, subject identification, etc.)
|
| - HTML (document parsing/lexing)
|
| - Images (exif extraction, removal, resizing/proxying, etc.)
|
| - Queues (SQS, NATS, Apollo, etc.)
|
| - Clustering (k-means, density, hierarchical, gaussian
| distributions, etc.)
|
| - Rate limiting (leaky bucket, windowed, etc.)
|
| - Text processing (unicode normalization, slugify, sanitization,
| lossless and lossy hashing like metaphone, document
| fingerprinting)
|
| - etc.
|
| I'm sure there is plenty more I've missed. There are also lots
| of generic structures involved, like hashes, linked lists, skip
| lists, heaps, and priority queues - and this is just to get to
| 2000s-level basic tech.
|
| - https://github.com/quickwit-oss/tantivy
|
| - https://github.com/valeriansaliou/sonic
|
| - https://github.com/mosuka/phalanx
|
| - https://github.com/meilisearch/MeiliSearch
|
| - https://github.com/blevesearch/bleve
|
| - https://github.com/thomasjungblut/go-sstables
|
| A lot of people new to this space mistakenly think you can just
| throw Elasticsearch or Postgres full-text search in front of
| terabytes of records and have something decent. That might work
| for something small, like a curated collection of a few hundred
| sites.
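(To make one item on that list concrete: the bm25 ranking function it
mentions fits in a screenful of Python. A minimal sketch over
pre-tokenized documents, illustrative rather than production code:)

    import math
    from collections import Counter

    def bm25(query, docs, k1=1.5, b=0.75):
        """Score each tokenized doc in `docs` against `query`."""
        n = len(docs)
        avg_len = sum(len(d) for d in docs) / n
        df = Counter(t for d in docs for t in set(d))  # doc frequency
        scores = []
        for doc in docs:
            tf = Counter(doc)
            score = 0.0
            for term in query:
                if term not in tf:
                    continue
                idf = math.log(
                    1 + (n - df[term] + 0.5) / (df[term] + 0.5))
                norm = tf[term] * (k1 + 1) / (
                    tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
                score += idf * norm
            scores.append(score)
        return scores

    docs = [["cheap", "tomato", "soup"],
            ["tomato", "plant", "care"],
            ["golang", "struct", "split"]]
    print(bm25(["tomato", "soup"], docs))  # first doc scores highest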
| kreeben wrote:
| Yes, yes, yes :D There are so many topics in this space that are
| so interesting; it's like a dream. I would add to your list:
|
| - sentiment analysis
|
| - roaring bitmaps
|
| - compression
|
| - applied linear algebra
|
| - AI
|
| In a Venn diagram intersecting all of these topics is search.
| Coding a search engine from scratch is a beautiful way to spend
| one's days, if you're into programming.
| boyter wrote:
| > That might work for something small, like a curated collection
| of a few hundred sites.
|
| Probably more like a few million, but otherwise 100% true. Once
| you really need to scale, you have to start losing some accuracy
| or correctness.
|
| It helps that the goal of a search engine is not to find all the
| results but to delight the user by finding the things they want.
| [deleted]
| streets1627 wrote:
| Hey folks, I am one of the co-founders of neeva.com.
|
| While writing a search engine is hard, it is also incredibly
| rewarding. Over the past two years, we have brought up a
| meaningful crawl/index/serve pipeline for Neeva. Being able to
| create pages like https://neeva.com/search?q=tomato%20soup or
| https://neeva.com/search?q=golang+struct+split that are so much
| better than what is out there in commercial search engines is so
| worth it.
|
| We are private, ads-free, and customer-paid.
| amelius wrote:
| The article doesn't touch on the hardest and most interesting
| part: NLP and finding the most relevant results. I would like to
| see a post on this.
___________________________________________________________________
(page generated 2022-07-23 23:00 UTC)