hngopher.com

       [HN Gopher] Postgres full-text search: A search engine in a data...
       ___________________________________________________________________
        
       Postgres full-text search: A search engine in a database (2021)
        
       Author : craigkerstiens
       Score  : 126 points
       Date   : 2022-07-11 18:07 UTC (4 hours ago)
        
 (HTM) web link (www.crunchydata.com)
 (TXT) w3m dump (www.crunchydata.com)
        
       | itsthecourier wrote:
       | we built a listings site similar to ebay in latin america after
       | buying the operation from an european company,
       | https://corotos.com
       | 
       | we made use of bson in the listings table for metadata and FTS
       | inside postgres. spanish dictionaries, weighting, white-listings
       | of words and including metadata inside the tsvectors have been
       | awesome. really quick search, you never have unsync documents in
       | another part of the stack and you save a lot in storage
       | 
       | glad to help anybody going this path, no regrets.
        
         | cpursley wrote:
         | A writeup or gist of this approach would be awesome.
        
       | mixcocam wrote:
       | I think SQLite is even more batteries included!
       | https://www.sqlite.org/fts5.html
        
         | canadiantim wrote:
         | More difficult to do facets with sqlite vs postgres though
        
         | shakezula wrote:
         | I used SQLite's built in fuzzy search and it is surprisingly
         | powerful. I used it to for full text search over a magic the
         | gathering card database and it was amazing.
        
           | cduzz wrote:
           | I've often wondered why someone doesn't reimplement
           | elasticsearch using sqlite as the back-end datastore, instead
           | of lucene.
           | 
           | Or did I just describe foundationDB?
           | 
           | I guess the real problem is that such a beast wouldn't have
           | kibana, and people _want_ kibana and put up with
           | elasticsearch to get it.
        
       | jfbaro wrote:
       | I have seen this presentation from a company that used PG Full
       | Text search for a pretty complex use case. Interesting ->
       | http://matheusoliveira.s3-website-us-east-1.amazonaws.com/pr...
       | (updated thanks to ddevault)
        
         | RowanH wrote:
         | Good link thinks. Have to say we're moving to pgSearch after
         | implementing SOLR alongside a PG/Rails backend. Makes the stack
         | simpler, less components to worry about (less headaches for gem
         | versioning dependencies with SOLR). Lot to be said for it after
         | reading through what's available now...
        
           | sandGorgon wrote:
           | are you planning to do relevance ? this is one of the
           | blockers for us - TFIDF vs BM25 or something.
           | 
           | From what i know - PG fulltext seaarch does not implement
           | relevance.
        
             | rocmcd wrote:
             | This has been my understanding of the state of Postgres
             | full-text search. It's great if your search requirements
             | are fairly vanilla, but I haven't seen any solutions for
             | more advanced search needs, such boosting, relevance,
             | scoring, etc.
        
             | mamcx wrote:
             | An alternative is to just layer the FTS on top of vanilla
             | sql to get the extra stuff (this is what I do for my
             | eCommerce backend), so is pretty simple to have something
             | alike:                 SELECT ..         -- Get the fts
             | IN ( FTS QUERY)       ORDER BY          -- The relevance is
             | hardcoded? mayber in another table that store th rankings?
             | (           Products,            Inventory,
             | Invoices,..           )
             | 
             | I found is much easier and predictable if I code the
             | "rankings" based on the business logic instead of let the
             | FTS engine guess it. You can store that stuff as normal
             | columns or use the "sources" (ie: products, inventory) as
             | ways to know what could be more important to pull first.
             | 
             | This have the nice property that our search results are ver
             | good and better: Never return non-sensical stuff! (like
             | searching for a apple in the store and get and blog post!)
        
         | ddevault wrote:
         | Please don't use URL shorteners.
         | 
         | http://matheusoliveira.s3-website-us-east-1.amazonaws.com/pr...
        
           | jfbaro wrote:
           | Updated! Thanks
        
       | lettergram wrote:
       | I've written some tips on this before:
       | https://austingwalters.com/fast-full-text-search-in-postgres...
       | 
       | Basically, it's easy to create triggers that'll update an index
       | for you on entry. Then you can create custom indices that have
       | key words, etc.
        
         | Latty wrote:
         | Is there a benefit to using a trigger over a generated stored
         | column?
        
           | lettergram wrote:
           | You can technically do way more robust things in a trigger.
           | If you don't need it then :shrug: but I often create fairly
           | robust updates
        
       | crtxcr wrote:
       | I was a bit disappointed by some limitations back then when I
       | tried it for a project of mine. When searching phrases where
       | ordering matters, phraseto_tsquery() does not quite work for
       | larger documents, as the tsvector position values are quite
       | limited: https://www.postgresql.org/docs/14/textsearch-
       | limitations.ht... Here I had much better success with sqlite's
       | FTS implemention.
        
       | cpursley wrote:
       | Here's the approach I settled on that allows for fuzzy search
       | (also some useful links at the bottom):
       | 
       | https://gist.github.com/cpursley/e3586382c3a42c54ca7f5fef166...
       | 
       | Downside is I haven't found a way to do weighting.
        
       | panabee wrote:
       | slightly off-topic, but if the goal is to store vast amounts of
       | protein and gene nucleotide sequences and allow wildcard searches
       | (e.g., %CTAAACGGG%), would postgres be the recommended approach?
       | 
       | assume 200K proteins/genes with nucleotide sequences ranging from
       | 20 base pairs to 2.3 million base pairs.
        
         | giovannibonetti wrote:
         | Pg_trgrm based indexes works well to make this type of LIKE
         | %...% queries fast
        
       | jonatron wrote:
       | Last time I looked, Postgres full text search doesn't really do
       | CJK without third party parsers, which can be a bit of a deal
       | breaker.
        
         | zkirill wrote:
         | I love Postgres but lack of CJK full-text search is the main
         | reason why I am going with Lucene/Solr for in-app search.
         | 
         | It's possible to make Postgres do amazing things (e.g. job
         | queue with SKIP LOCKED!) but at what point do you draw the line
         | and use the right tool for the job?
        
           | blowski wrote:
           | Any idea how well CJK works in general with Postgres? Or is
           | it just on full text where it's lacking?
        
             | zkirill wrote:
             | In general, we have absolutely no problem with CJK in
             | Postgres. Search is very application specific and some of
             | our users want to do things like kana + kanji search. [1] I
             | don't even know where you would start with that in
             | Postgres.
             | 
             | We also need stuff like language detection and analysis of
             | mixed language data.
             | 
             | A big advantage of going with a dedicated search tool is
             | that it teaches you what you don't know about search, and
             | it turned out that we knew pretty much nothing.
             | 
             | [1] https://www.elastic.co/blog/implementing-japanese-
             | autocomple...
        
             | mistrial9 wrote:
             | the Japanese national phone company NTT is a major
             | postgresql contributor
        
           | chucky_z wrote:
           | When the feature doesn't exist! For existing
           | workflows/functionality; I think using a new tool for a
           | feature that doesn't (and won't for some time) exist is a
           | _perfectly_ good reason to introduce the tool. On top of that
           | you're looking at using something that is mature and well-
           | used, Lucene/Solr.
           | 
           | If you wanted to be cheeky you could go ask your CTO if they
           | would foot the bill for someone like 2nd Quadrant to write
           | CJK full-text search into pgsql. ;)
           | 
           | I've done this for some features/issues in random OSS
           | projects and if you feel someone else has the same problem I
           | always feel paying money to have it implemented is well worth
           | it.
        
           | [deleted]
        
       | simonw wrote:
       | The hardest part of building any search engine is keeping the
       | index up-to-date with changes made to the underlying data store.
       | 
       | It's a solvable problem, but it's always a lot of work to build
       | and to keep working as the database schema changes in the future.
       | 
       | This is why I really like PostgreSQL FTS: it's good enough that
       | for many projects I don't need to use an external search engine
       | any more - and it's way easier to keep the search index up-to-
       | date than if the index lives in a separate system.
       | 
       | I wrote this tutorial on implementing faceted search with
       | PostgreSQL and Django a while ago:
       | https://simonwillison.net/2017/Oct/5/django-postgresql-facet...
        
         | canadiantim wrote:
         | Have you ever considered using PGroonga
         | (https://pgroonga.github.io/) to further extend FTS in
         | postgres? They've got a drilldown feature that is basically
         | their implementation of facets. Always wondered what your
         | thoughts on that might be?
         | 
         | I've got a django setup with postgres and facets implemented as
         | you detailed in your blog (thank you for all the amazing info
         | over the years btw!), now I'm looking to extend it with
         | PGroonga and potentially considering Hasura to try and further
         | extend what I can do while still using Postgres as my single
         | source of truth without need for keeping any other service in
         | sync.
        
         | almog wrote:
         | > The hardest part of building any search engine is keeping the
         | index up-to-date with changes made to the underlying data
         | store.
         | 
         | That makes sense and I'm curious to know if you're referring to
         | the challenges that arise from attempting to implement a
         | distributed heterogeneous transaction that includes system of
         | record (say RDBMS) and derived data (search engine), or did you
         | think about something that's more specific to search engines?
        
         | soperj wrote:
         | How hard is this on the database?
        
       | ravirajx7 wrote:
       | I am feeling real bad now as I used the like query as OP did
       | (although in MySQL) for implementing a custom search over two
       | fields ( Name & Description ) in one of my personal projects. I
       | thought it worked pretty well though and I was so happy about it.
        
         | giovannibonetti wrote:
         | Don't confuse full text search with word part search. If you
         | have a list of names, you're probably looking for the latter
         | one, where LIKE works fine and can be indexed with pg_trgrm.
        
       | phemartin wrote:
       | mongodb has its own full-text implementation with Atlas Search.
       | it has full Lucene capability. it's worth looking into.
       | 
       | https://www.mongodb.com/docs/atlas/atlas-search/
        
         | cpursley wrote:
         | If you're already running postgres, why would you set up mongo
         | for search instead of elastic, etc?
        
       | qohen wrote:
       | FYI, this crunchydata.com blog-post was discussed 11 months ago
       | on HN[0], in a thread awarded 601 points, with 137 comments.
       | 
       | [0] https://news.ycombinator.com/item?id=27973497
        
         | qohen wrote:
         | Among other things in that earlier thread, there's some
         | discussion of aspects of ZomboDB with its developer.
         | 
         |  _ZomboDB is a Postgres extension that enables efficient full-
         | text searching via the use of indexes backed by Elasticsearch._
         | (From the project 's website[0]).
         | 
         | Lots more info in readme at the project's github[1].
         | 
         | [0] http://zombodb.com
         | 
         | [1] https://github.com/zombodb/zombodb
        
       | puika wrote:
       | Mandatory mention of the RUM extension
       | (https://github.com/postgrespro/rum) if this caught your eye.
       | Lots of tutorials and conference presentations out there
       | showcasing the advantages in terms of ranking, timestamps...
        
       | adhoc_slime wrote:
       | Very convenient for me that this article swings around, I'm
       | weighing pros and cons of PGSQL and elastic search for full-text
       | search for the business's use-case.
        
         | avereveard wrote:
         | Also consider full ngram search, so you can do client side
         | stemming and use a single index for all languages. It's not out
         | of the box, but installing the ngram extension is literally one
         | line.
        
       ___________________________________________________________________
       (page generated 2022-07-11 23:00 UTC)