[HN Gopher] Building a full-text search engine in 150 lines of P...
       ___________________________________________________________________
        
       Building a full-text search engine in 150 lines of Python code
        
       Author : bartdegoede
       Score  : 206 points
       Date   : 2021-03-25 16:17 UTC (6 hours ago)
        
 (HTM) web link (bart.degoe.de)
 (TXT) w3m dump (bart.degoe.de)
        
       | KAdot wrote:
       | The article looks suspiciously similar to
       | https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full....
       | Very similar examples, code and structure.
        
         | jeroenhd wrote:
         | Does it? Both are implementations and explanations of a well-
         | known algorithm. Most articles on quicksort will also look
         | alike, but there's no reason to assume the author has
         | plagiarized anything.
        
         | diogenesjunior wrote:
         | Except that one is written in go and the other in python...
        
           | johnwheeler wrote:
           | Yes, but where's the attribution?
        
             | pmiller2 wrote:
             | How many different ways do you think there are to write
             | this code?
        
         | pmiller2 wrote:
         | So? Wikipedia is one of the most convenient, large English
         | corpora available, and I doubt there are many significantly
         | different ways to write the bit of functionality that's built
         | up here. I'm not sure if that's what you're meaning to suggest,
         | or that there was some kind of plagiarism / inspiration going
         | on here.
        
         | argaba wrote:
         | Everything is a paraphrase of the original source - whatever
         | and/or whenever that maybe from.
        
       | mlthoughts2018 wrote:
       | I worked previously for a very high traffic ecommerce company
       | (Alexa top 300 site).
       | 
       | As part of the search team, I worked on a project where we
       | deliberately rewrote the whole product search engine in Python
       | and Cython, including our own algorithms manipulating documents
       | for deletion, low latency reindexing after edits, and more.
       | 
       | We did this because SOLR was too slow and the process of defining
       | custom sort orders (for example, new sort orders defined by
       | machine learning ranking algorithms, and needing to be A/B tested
       | regularly) was awful and performance in SOLR was poor.
       | 
       | It was a really fun project. One of the slogans for our group at
       | the time was "rewriting search in Python _for speed_."
       | 
       | The ultimate system we deployed was insanely fast. I doubt you
       | could have made it faster even writing the entire thing directly
       | in C or C++. It became a reference project in the company to help
       | avoid various flavors of arrogant dismissal of Python as a
       | backend service language for performance critical systems.
        
         | NicoJuicy wrote:
         | Could i send you a mail somewhere? I'd like to have a casual
         | conversation about e-commerce and some of your insights.
        
         | inertiatic wrote:
         | Defining custom sort orders in Solr is as simple as uploading a
         | text file with the values you intend to use for ranking. This
         | is a great feature that is in fact missing from Elasticsearch
         | and saves you so much reindexing time.
         | 
         | There certainly are usecases where Lucene based solutions
         | aren't the best fit. But I think the claim that you couldn't
         | make something faster by moving away from Python is outlandish.
        
           | IncRnd wrote:
           | > There certainly are usecases where Lucene based solutions
           | aren't the best fit. But I think the claim that you couldn't
           | make something faster by moving away from Python is
           | outlandish.
           | 
           | I read that as a statement that they implemented a proper and
           | bespoke algorithm, not that the speed of Python is greater
           | than C. I am surprised that you read it that way. Who in
           | their right mind would say Python speed is faster than C
           | speed?
        
             | inertiatic wrote:
             | You read
             | 
             | >I doubt you could have made it faster even writing the
             | entire thing directly in C or C++.
             | 
             | as
             | 
             | > a statement that they implemented a proper and bespoke
             | algorithm, not that the speed of Python is greater than C.
             | 
             | ?
        
       | happyweasel wrote:
       | As Joel spolsky said: "Just do me a favor and search the damned
       | hard drive, quickly, for the string I typed, using full-text
       | indexes and other technologies that were boring in 1973."
        
       | arafalov wrote:
       | Nice introduction. A good ramp-up from basic splitting to
       | ranking.
       | 
       | It does need to be said that when Lucene had a set of features
       | this small, it was also pretty tiny. And, if those are the needs,
       | one could still download it and it will probably run on modern
       | JVM:
       | 
       | https://archive.apache.org/dist/lucene/java/
       | 
       | lucene-1.4.3.jar 2004-11-29 14:13 316K
       | 
       | It is a bit bigger of course, but that's because it already had
       | stemmers, multilingual support, multiple Query Parsers including
       | phrase, RAM and Disk directory implementations and a bunch of
       | other advanced features one can easily see by unzipping the jar
       | file.
        
       | machiaweliczny wrote:
       | Cool article.
       | 
       | I recently build program in Go that takes wikipedia article and
       | gets all dependencies then using tfidf*count ranks concepts in
       | order of "importance". Seems quite good for math articles to get
       | list of more basic concepts to understand first.
        
         | throwaway320912 wrote:
         | Fun times. I once applied pagerank onto a set of 8000 math
         | articles and ran the result as a web app in 2009/10.
         | 
         | http://web.archive.org/web/20091230103939/http://myyn.org/m
         | 
         | As a gimmick, I created 36 groups, with group 1 containing the
         | most important concepts:
         | 
         | http://web.archive.org/web/20100109055506/http://myyn.org/m/...
         | 
         | Spoiler, the top ten concepts were:
         | 
         | Function * Set * Number * Integer * Real Number * Point *
         | Property * Finite * Ring * Relation Theory
         | 
         | BTW: Glad the archive has a copy, because I do not.
        
       | habibur wrote:
       | Excellent read. Was searching for a full text search engine but
       | not finding any suitable one. Plan to implement one just this
       | way.
        
         | bartdegoede wrote:
         | Author here: you may want to check out something like Whoosh
         | https://whoosh.readthedocs.io/en/latest/intro.html (it's like a
         | clone of Lucene but in pure Python). I've used this to build
         | some basic search for a small Python website and it was more
         | than fast enough for my purposes :-)
        
         | rakoo wrote:
         | SQLite has a pretty good built-in fts engine:
         | https://www.sqlite.org/fts5.html
        
           | habibur wrote:
           | Problem is FTS5 isn't included in the most default
           | installation through package managers [I use Fedora]. And
           | recompiling from source breaks a lot of things, as sqlite
           | libraries are generally linked with all apps that use it.
        
             | rakoo wrote:
             | I admit I only used sqlite through the go driver
             | (https://github.com/mattn/go-sqlite3) where using fts5
             | amounts to one flag during the compile phase.
        
       | ignoramous wrote:
       | Reminds of this David Crawshaw (CTO at Tailscale) presentation on
       | full-text search with SQLite which probably requires 10 lines or
       | less: https://www.youtube-nocookie.com/embed/RqubKSF3wig
        
         | edoceo wrote:
         | Yea! FTS5 FTW! Amazing for what it costs.
        
         | asdfasgasdgasdg wrote:
         | It's not relevant or an indicator of power that this example
         | requires ten lines of SQL. This presentation just uses SQLite's
         | builtin full-text search system [1]. Of course it's going to
         | require less code to call a library than to implement it.
         | 
         | [1]: https://sqlite.org/fts5.html
        
           | ignoramous wrote:
           | Relevant because "building a full-text search engine" is
           | solved using SQLite just as much as it can be solved by
           | writing Python on top of lxml and py-stemmer.
           | 
           | SQLite is no heavyweight dependency ala Apache Lucene. It
           | also helps that it is bundled in billions of Android and iOS
           | devices.
        
       | denysvitali wrote:
       | Does anybody work for Atlassian here? If so, can you please share
       | this with the Confluence team? Thanks...
        
         | boyter wrote:
         | Do people really find the search in confluence that bad? Would
         | they actually be willing to pay to have it fixed?
        
           | denysvitali wrote:
           | I wouldn't pay to fix it, Confluence isn't free. There are a
           | lot of Open Source alternatives that provide such features
           | out of the box. It's the only feature that shouldn't be
           | broken.
           | 
           | I really hate Confluence nowadays (and I really liked in the
           | past, when I was using it randomly): - Counter productive
           | syntax that is not even equal to the one of Jira
           | 
           | - Search functionality that cannot find anything (I spend
           | HOURS trying to find my content, let alone somebody else's).
           | Some days I just wish I had the DB at hand / a folder with
           | the whole pages exported to grep it myself.
           | 
           | - Proprietary file format - good luck migrating away from it
           | 
           | I'm really sad to say these things, I like Atlassian as a
           | company but the more I use Atlassian products, the more I
           | realize they're bloated and have _broken_ features that were
           | implemented quickly just to tick some more boxes on their
           | features sheet.
        
         | tyingq wrote:
         | If you want to feel extra sad about it, see this:
         | https://jira.atlassian.com/browse/CONFSERVER-13499
        
           | nova22033 wrote:
           | _If a wiki page has the word adrecordset, a search on
           | recordset should include the page. Currently only a search on
           | adrecordset or a search with wildcards returns the page._
           | 
           | Wait..this seems unreasonable?
           | 
           | Isn't this how it works with elasticsearch or solr? or even
           | google for that matter..a search for green won't return
           | evergreen..
        
             | tyingq wrote:
             | They don't support leading wildcards. So foo* will find
             | foobar but *bar won't find foobar. Elastic does fine with
             | that as of 7.9.
        
               | denysvitali wrote:
               | Well, in my experience they cannot even find URLs, IP
               | addresses, or wildcarded words like foo*
        
         | bigtones wrote:
         | And the Jira team
        
           | 12ian34 wrote:
           | JQL really isn't that bad is it?
        
             | fteem wrote:
             | Not bad - horrible.
        
       | runningmike wrote:
       | Using nlp techniques to create tokens is for a fast and simple
       | python search often not needed. It makes things slow and python
       | standard string function are often good enough. Issues arise when
       | searching for combinations of words like 'machine learning' in a
       | sentence. Nice read, but the GitHub repro needs a license to be
       | more useful.
        
         | bartdegoede wrote:
         | Added an MIT license. Have fun :-)
        
       | creamytaco wrote:
       | Why would one choose Python for this instead of Go or Rust?
       | 
       | I imagine Python performance would be terrible.
        
         | Thaxll wrote:
         | The article is good and so can be applied to any language, also
         | Python is not that slow.
        
         | roelschroeven wrote:
         | As a lot of the work is done by code library code, there's a
         | very good chance that the code is not all that slow.
        
         | theplague42 wrote:
         | Because it's easier for the average developer to follow Python
         | code? This is a tutorial/demo, not a library.
        
         | bartdegoede wrote:
         | The idea was to illustrate the concepts of an inverted index
         | and tf-idf :-) I've built some stuff like this for a smaller
         | project written in Python because it was Easier(tm) than
         | spinning up an Elasticsearch cluster (ie it definitely didn't
         | need to scale :-D)
        
           | sodapopcan wrote:
           | Yes, I've been really interested in FTS recently and love
           | articles like this. I'm currently implementing a paired-down
           | version myself in Elixir because I'm searching one "table"
           | (not postgres) and I don't want to bring in external
           | dependencies for it.
        
         | neolog wrote:
         | If the dataset is small, speed doesn't matter. Also, there's
         | PyPy if you need speed.
        
       | superyesh wrote:
       | Neat! One production ready python library I love to use in this
       | space is https://radimrehurek.com/gensim/ It is quite mature and
       | can handle a good amount of data well! It offers topic modeling
       | too and can b helpful to find similar documents also.
        
       | andrewmatte wrote:
       | This doesn't account for synonyms. I'd rather use a document
       | embedding search.
        
       ___________________________________________________________________
       (page generated 2021-03-25 23:00 UTC)