[HN Gopher] Ask HN: Books about full text search?
       ___________________________________________________________________
        
       Ask HN: Books about full text search?
        
       I would love to learn more about full-text search (FTS) at a very
       low level, and I'm looking for books on the topic. Any good
       suggestions?
        
       Author : sopromo
       Score  : 86 points
       Date   : 2022-11-24 17:58 UTC (5 hours ago)
        
       | [deleted]
        
       | pixelmonkey wrote:
       | Take a look at my post "Lucene: The Good Parts"--
       | 
       | https://blog.parse.ly/lucene/
       | 
       | The book mentioned there is Lucene in Action.
       | 
       | And then this YouTube presentation by a Lucene/Elasticsearch
       | committer will give you a nice overview of some related
       | algorithms--
       | 
       | https://youtu.be/eQ-rXP-D80U
        
       | DamonHD wrote:
       | Managing Gigabytes
       | 
       | https://books.google.co.uk/books/about/Managing_Gigabytes.ht...
       | 
       | Old but good!
        
         | CoolestBeans wrote:
         | Came here to recommend Managing Gigabytes as well. People these
         | days are managing far more than gigabytes but the fundamental
         | ideas remain useful.
        
       | 100k wrote:
       | At a general audience level, "Index" is on my list to read. It
       | covers the invention of the index up to digital search engines.
       | https://www.nytimes.com/2022/02/09/books/review-index-histor...
       | 
       | "Introduction to Information Retrieval" is a textbook which is
       | available online: https://nlp.stanford.edu/IR-book/ Here's a review:
       | http://glinden.blogspot.com/2009/02/book-review-introduction...
       | 
       | Another textbook which IMHO is a bit lower level is "Information
       | Retrieval: Implementing and Evaluating Search Engines". The book
       | website is down for me right now, but you can find it on Amazon
       | here:
       | https://www.amazon.com/Information-Retrieval-Implementing-Ev...
       | 
       | Another commenter linked to "Relevant Search", which is great if
       | you want to learn how to effectively use a search engine to
       | improve relevance (as opposed to how to implement a search
       | engine). It's old, but another book in that vein that was really
       | helpful for me earlier in my career is Lucene in Action:
       | https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp...
        
       | tgv wrote:
       | Check the literature of open courses on Text Retrieval. E.g.
       | https://stanford.edu/class/cs276/
        
       | binarymax wrote:
       | "Relevant search" by Doug Turnbull and John Berryman, published
       | by Manning, is THE best book to get started with tuning search
       | engines.
       | 
       | I've been a search engineer for >10 years and this is always the
       | first book I recommend.
       | 
       | https://www.manning.com/books/relevant-search
        
         | softwaredoug wrote:
         | Aww, thanks Max <3
        
       | francoisprunier wrote:
       | Not a book, but this paper from 2019 covers a lot of ground and
       | reviews the different topics extensively:
       | https://tonellotto.github.io/publication/fntir/fntir_main.pd...
        
       | fiedzia wrote:
       | https://www.manning.com/books/relevant-search
       | 
       | Also "Taming Text".
        
         | arooaroo wrote:
         | Manning also have a book on Lucene, the library that powers
         | Solr and ElasticSearch. IIRC the book covered how Lucene
         | actually works under the hood and would therefore act as a good
         | reference on the subject in general.
        
         | gardenfelder wrote:
         | Taming Text is about building a question-answering system; it
         | came out around the time Watson came online. It's not a plan
         | but rather a cookbook of experiments using Apache products like
         | Solr and OpenNLP, and it's a great tutorial on how question
         | answering works.
        
       | vdfs wrote:
       | Lucene in Action is a good introduction to Lucene, which can be
       | helpful for learning ElasticSearch (the most-used FTS engine
       | these days).
        
         | _tom_ wrote:
         | Lucene in Action covers Lucene 3.0, and is from 2010. Current
         | version is 9.4.2. So much has changed.
        
       | cb321 wrote:
       | It's all in the Nim programming language, but if you prefer
       | reading code or running diffs then you might get a vague sense of
       | (some) low level nuts & bolts from:
       | https://github.com/c-blake/nimsearch
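
For readers who would rather see the low-level idea without reading Nim, below is a minimal Python sketch of an inverted index with boolean AND queries, the basic structure most of the books in this thread start from. It is not taken from the nimsearch repository; the names `tokenize`, `build_index`, and `search`, the regex tokenizer, and the sample documents are all assumptions made for illustration. Real engines add compression, ranking, and on-disk layouts that this leaves out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        # Intersect posting lists; a missing term empties the result.
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample corpus: titles mentioned in this thread.
docs = [
    "Lucene in Action",
    "Managing Gigabytes",
    "Introduction to Information Retrieval",
    "Information Retrieval in Practice",
]
index = build_index(docs)
```

Calling `search(index, "information retrieval")` intersects the posting lists for the two terms and returns the ids of the documents containing both. The sketch keeps posting lists as plain sorted Python lists for readability, not for speed.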
        
       | unixhero wrote:
       | Just use Postgres full-text search; it's good enough:
       | http://rachbelaid.com/postgres-full-text-search-is-good-enou...
        
       | ssn wrote:
       | Three reference textbooks are available openly:
       | 
       | * Introduction to Information Retrieval,
       | http://informationretrieval.org/
       | 
       | * Information Retrieval in Practice,
       | http://www.search-engines-book.com/
       | 
       | * Entity-Oriented Search, https://eos-book.org/
       | 
       | Modern Information Retrieval is also a classic reference. Not
       | openly available but some contents are (were?) available online.
       | Their site seems to be down but the Internet Archive has a copy.
       | 
       | Additional resources here:
       | 
       | * https://nlp.stanford.edu/IR-book/information-retrieval.html
       | 
       | * http://web.archive.org/web/20220708135205/http://grupoweb.up...
        
       | brudgers wrote:
       | Not a book but Hellerstein's CS186 from 2015 starting with
       | Lecture 17 gave me a basic understanding (I think).
       | 
       | Playlist
       | https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_Qw...
       | 
       | Also from that lecture series, the low level is always IO. One
       | disk read tends to dwarf n^2 in-memory algorithms.
       | 
       | And IO is all about tuning caches and hardware for the specific
       | structural relationships in the data, the way in which it is
       | accessed, and the hardware everything runs on.
       | 
       | Good luck.
        
       ___________________________________________________________________
       (page generated 2022-11-24 23:00 UTC)