[HN Gopher] Building a full-text search engine in 150 lines of P... ___________________________________________________________________ Building a full-text search engine in 150 lines of Python code Author : bartdegoede Score : 206 points Date : 2021-03-25 16:17 UTC (6 hours ago) (HTM) web link (bart.degoe.de) (TXT) w3m dump (bart.degoe.de) | KAdot wrote: | The article looks suspiciously similar to | https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full.... | Very similar examples, code and structure. | jeroenhd wrote: | Does it? Both are implementations and explanations of a well- | known algorithm. Most articles on quicksort will also look | alike, but there's no reason to assume the author has | plagiarized anything. | diogenesjunior wrote: | Except that one is written in go and the other in python... | johnwheeler wrote: | Yes, but where's the attribution? | pmiller2 wrote: | How many different ways do you think there are to write | this code? | pmiller2 wrote: | So? Wikipedia is one of the most convenient, large English | corpora available, and I doubt there are many significantly | different ways to write the bit of functionality that's built | up here. I'm not sure if that's what you're meaning to suggest, | or that there was some kind of plagiarism / inspiration going | on here. | argaba wrote: | Everything is a paraphrase of the original source - whatever | and/or whenever that maybe from. | mlthoughts2018 wrote: | I worked previously for a very high traffic ecommerce company | (Alexa top 300 site). | | As part of the search team, I worked on a project where we | deliberately rewrote the whole product search engine in Python | and Cython, including our own algorithms manipulating documents | for deletion, low latency reindexing after edits, and more. | | We did this because SOLR was too slow and the process of defining | custom sort orders (for example, new sort orders defined by | machine learning ranking algorithms, and needing to be A/B tested | regularly) was awful and performance in SOLR was poor. | | It was a really fun project. One of the slogans for our group at | the time was "rewriting search in Python _for speed_." | | The ultimate system we deployed was insanely fast. I doubt you | could have made it faster even writing the entire thing directly | in C or C++. It became a reference project in the company to help | avoid various flavors of arrogant dismissal of Python as a | backend service language for performance critical systems. | NicoJuicy wrote: | Could i send you a mail somewhere? I'd like to have a casual | conversation about e-commerce and some of your insights. | inertiatic wrote: | Defining custom sort orders in Solr is as simple as uploading a | text file with the values you intend to use for ranking. This | is a great feature that is in fact missing from Elasticsearch | and saves you so much reindexing time. | | There certainly are usecases where Lucene based solutions | aren't the best fit. But I think the claim that you couldn't | make something faster by moving away from Python is outlandish. | IncRnd wrote: | > There certainly are usecases where Lucene based solutions | aren't the best fit. But I think the claim that you couldn't | make something faster by moving away from Python is | outlandish. | | I read that as a statement that they implemented a proper and | bespoke algorithm, not that the speed of Python is greater | than C. I am surprised that you read it that way. Who in | their right mind would say Python speed is faster than C | speed? | inertiatic wrote: | You read | | >I doubt you could have made it faster even writing the | entire thing directly in C or C++. | | as | | > a statement that they implemented a proper and bespoke | algorithm, not that the speed of Python is greater than C. | | ? | happyweasel wrote: | As Joel spolsky said: "Just do me a favor and search the damned | hard drive, quickly, for the string I typed, using full-text | indexes and other technologies that were boring in 1973." | arafalov wrote: | Nice introduction. A good ramp-up from basic splitting to | ranking. | | It does need to be said that when Lucene had a set of features | this small, it was also pretty tiny. And, if those are the needs, | one could still download it and it will probably run on modern | JVM: | | https://archive.apache.org/dist/lucene/java/ | | lucene-1.4.3.jar 2004-11-29 14:13 316K | | It is a bit bigger of course, but that's because it already had | stemmers, multilingual support, multiple Query Parsers including | phrase, RAM and Disk directory implementations and a bunch of | other advanced features one can easily see by unzipping the jar | file. | machiaweliczny wrote: | Cool article. | | I recently build program in Go that takes wikipedia article and | gets all dependencies then using tfidf*count ranks concepts in | order of "importance". Seems quite good for math articles to get | list of more basic concepts to understand first. | throwaway320912 wrote: | Fun times. I once applied pagerank onto a set of 8000 math | articles and ran the result as a web app in 2009/10. | | http://web.archive.org/web/20091230103939/http://myyn.org/m | | As a gimmick, I created 36 groups, with group 1 containing the | most important concepts: | | http://web.archive.org/web/20100109055506/http://myyn.org/m/... | | Spoiler, the top ten concepts were: | | Function * Set * Number * Integer * Real Number * Point * | Property * Finite * Ring * Relation Theory | | BTW: Glad the archive has a copy, because I do not. | habibur wrote: | Excellent read. Was searching for a full text search engine but | not finding any suitable one. Plan to implement one just this | way. | bartdegoede wrote: | Author here: you may want to check out something like Whoosh | https://whoosh.readthedocs.io/en/latest/intro.html (it's like a | clone of Lucene but in pure Python). I've used this to build | some basic search for a small Python website and it was more | than fast enough for my purposes :-) | rakoo wrote: | SQLite has a pretty good built-in fts engine: | https://www.sqlite.org/fts5.html | habibur wrote: | Problem is FTS5 isn't included in the most default | installation through package managers [I use Fedora]. And | recompiling from source breaks a lot of things, as sqlite | libraries are generally linked with all apps that use it. | rakoo wrote: | I admit I only used sqlite through the go driver | (https://github.com/mattn/go-sqlite3) where using fts5 | amounts to one flag during the compile phase. | ignoramous wrote: | Reminds of this David Crawshaw (CTO at Tailscale) presentation on | full-text search with SQLite which probably requires 10 lines or | less: https://www.youtube-nocookie.com/embed/RqubKSF3wig | edoceo wrote: | Yea! FTS5 FTW! Amazing for what it costs. | asdfasgasdgasdg wrote: | It's not relevant or an indicator of power that this example | requires ten lines of SQL. This presentation just uses SQLite's | builtin full-text search system [1]. Of course it's going to | require less code to call a library than to implement it. | | [1]: https://sqlite.org/fts5.html | ignoramous wrote: | Relevant because "building a full-text search engine" is | solved using SQLite just as much as it can be solved by | writing Python on top of lxml and py-stemmer. | | SQLite is no heavyweight dependency ala Apache Lucene. It | also helps that it is bundled in billions of Android and iOS | devices. | denysvitali wrote: | Does anybody work for Atlassian here? If so, can you please share | this with the Confluence team? Thanks... | boyter wrote: | Do people really find the search in confluence that bad? Would | they actually be willing to pay to have it fixed? | denysvitali wrote: | I wouldn't pay to fix it, Confluence isn't free. There are a | lot of Open Source alternatives that provide such features | out of the box. It's the only feature that shouldn't be | broken. | | I really hate Confluence nowadays (and I really liked in the | past, when I was using it randomly): - Counter productive | syntax that is not even equal to the one of Jira | | - Search functionality that cannot find anything (I spend | HOURS trying to find my content, let alone somebody else's). | Some days I just wish I had the DB at hand / a folder with | the whole pages exported to grep it myself. | | - Proprietary file format - good luck migrating away from it | | I'm really sad to say these things, I like Atlassian as a | company but the more I use Atlassian products, the more I | realize they're bloated and have _broken_ features that were | implemented quickly just to tick some more boxes on their | features sheet. | tyingq wrote: | If you want to feel extra sad about it, see this: | https://jira.atlassian.com/browse/CONFSERVER-13499 | nova22033 wrote: | _If a wiki page has the word adrecordset, a search on | recordset should include the page. Currently only a search on | adrecordset or a search with wildcards returns the page._ | | Wait..this seems unreasonable? | | Isn't this how it works with elasticsearch or solr? or even | google for that matter..a search for green won't return | evergreen.. | tyingq wrote: | They don't support leading wildcards. So foo* will find | foobar but *bar won't find foobar. Elastic does fine with | that as of 7.9. | denysvitali wrote: | Well, in my experience they cannot even find URLs, IP | addresses, or wildcarded words like foo* | bigtones wrote: | And the Jira team | 12ian34 wrote: | JQL really isn't that bad is it? | fteem wrote: | Not bad - horrible. | runningmike wrote: | Using nlp techniques to create tokens is for a fast and simple | python search often not needed. It makes things slow and python | standard string function are often good enough. Issues arise when | searching for combinations of words like 'machine learning' in a | sentence. Nice read, but the GitHub repro needs a license to be | more useful. | bartdegoede wrote: | Added an MIT license. Have fun :-) | creamytaco wrote: | Why would one choose Python for this instead of Go or Rust? | | I imagine Python performance would be terrible. | Thaxll wrote: | The article is good and so can be applied to any language, also | Python is not that slow. | roelschroeven wrote: | As a lot of the work is done by code library code, there's a | very good chance that the code is not all that slow. | theplague42 wrote: | Because it's easier for the average developer to follow Python | code? This is a tutorial/demo, not a library. | bartdegoede wrote: | The idea was to illustrate the concepts of an inverted index | and tf-idf :-) I've built some stuff like this for a smaller | project written in Python because it was Easier(tm) than | spinning up an Elasticsearch cluster (ie it definitely didn't | need to scale :-D) | sodapopcan wrote: | Yes, I've been really interested in FTS recently and love | articles like this. I'm currently implementing a paired-down | version myself in Elixir because I'm searching one "table" | (not postgres) and I don't want to bring in external | dependencies for it. | neolog wrote: | If the dataset is small, speed doesn't matter. Also, there's | PyPy if you need speed. | superyesh wrote: | Neat! One production ready python library I love to use in this | space is https://radimrehurek.com/gensim/ It is quite mature and | can handle a good amount of data well! It offers topic modeling | too and can b helpful to find similar documents also. | andrewmatte wrote: | This doesn't account for synonyms. I'd rather use a document | embedding search. ___________________________________________________________________ (page generated 2021-03-25 23:00 UTC)