[HN Gopher] Postgres Full-Text Search: A search engine in a data...
       ___________________________________________________________________
        
       Postgres Full-Text Search: A search engine in a database
        
       Author : twakefield
       Score  : 318 points
       Date   : 2021-07-27 15:20 UTC (7 hours ago)
        
 (HTM) web link (blog.crunchydata.com)
 (TXT) w3m dump (blog.crunchydata.com)
        
       | simonw wrote:
       | The Django ORM includes support for PostgreSQL search and I've
       | found it a really productive way to add search to a project:
       | https://docs.djangoproject.com/en/3.2/ref/contrib/postgres/s...
        
       | lettergram wrote:
       | I actually built a search engine back in 2018 using postgresql
       | 
       | https://austingwalters.com/fast-full-text-search-in-postgres...
       | 
       | Worked quite well and still use it daily. Basically doing
       | weighted searches on vectors is slower than my approach, but
       | definitely good enough.
       | 
       | Currently, I can search around 50m HN & Reddit comments in 200ms
       | on the postgresql running on my machine.
        
         | vincnetas wrote:
          | Off-topic, but curious: what are your use cases for searching
          | all HN and Reddit comments? I'm at the beginning of this path,
          | having just crawled HN, but what to do with it is still a bit
          | cloudy.
        
           | lettergram wrote:
           | I built a search engine that quantified the expertise of
           | authors of comments. Then I created what I called "expert
           | rank" that allowed me to build a really good search engine.
           | 
           | Super good if you're at a company or something
           | 
           | https://twitter.com/austingwalters/status/104189476543920128.
           | ..
        
             | turbocon wrote:
              | Wow, that led me down quite a rabbit hole, impressive
             | work.
        
         | rattray wrote:
         | Nice - looks like the ~same approach recommended here of adding
         | a generated `tsvector` column with a GIN index and querying it
          | with `col @@ to_tsquery('english', query)`.
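          | 
          | A minimal sketch of that setup (assuming a hypothetical
          | `posts` table with `title` and `body` columns; generated
          | columns need Postgres 12+):
          | 
          |     ALTER TABLE posts ADD COLUMN search tsvector
          |       GENERATED ALWAYS AS (
          |         to_tsvector('english',
          |           coalesce(title, '') || ' ' || coalesce(body, ''))
          |       ) STORED;
          |     CREATE INDEX posts_search_idx
          |       ON posts USING GIN (search);
          |     SELECT id, title FROM posts
          |     WHERE search @@ to_tsquery('english', 'postgres & index');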
        
           | lettergram wrote:
           | Yeah my internal approach was creating custom vectors which
           | are quicker to search.
        
       | pvsukale3 wrote:
        | If you are using Rails with Postgres, you can use the pg_search
        | gem to build named scopes that take advantage of full-text
        | search.
       | 
       | https://github.com/Casecommons/pg_search
        
       | theandrewbailey wrote:
       | > You could also look into enabling extensions such as unaccent
       | (remove diacritic signs from lexemes) or pg_trgm (for fuzzy
       | search).
       | 
       | Trigrams (pg_trgm) are practically needed for usable search when
       | it comes to misspellings and compound words (e.g. a search for
       | "down loads" won't return "downloads").
       | 
       | I also recommend using websearch_to_tsquery instead of using the
       | cryptic syntax of to_tsquery.
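        | 
        | For example (a sketch against a hypothetical `posts` table with
        | a precomputed `search` tsvector column):
        | 
        |     -- websearch syntax: quoted phrases, OR, -exclusion
        |     SELECT * FROM posts
        |     WHERE search @@
        |       websearch_to_tsquery('english', '"full text" search -solr');
        | 
        |     -- trigram index for fuzzy matching
        |     CREATE EXTENSION IF NOT EXISTS pg_trgm;
        |     CREATE INDEX posts_title_trgm
        |       ON posts USING GIN (title gin_trgm_ops);
        |     -- % is the similarity operator; catches near-misses
        |     -- like 'down loads' vs 'downloads'
        |     SELECT title FROM posts WHERE title % 'down loads';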
        
         | kyrra wrote:
          | Trigrams are amazing. I was doing a side project where I wanted
         | to allow for substring searching, and trigrams seemed to be the
         | only way to do it (easily/well) in postgres. Gitlab did a great
         | writeup on this a few years ago that really helped me
         | understand it:
         | 
         | https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
         | 
         | You can also always read the official docs:
         | 
         | https://www.postgresql.org/docs/current/pgtrgm.html
        
       | Grimm1 wrote:
       | Postgres Full-Text search is a great way to get search running
       | for a lot of standard web applications. I recently used just this
        | in Elixir to set up a simple search by keyword. My only complaint
        | was that Ecto (Elixir's query builder library) doesn't have
        | first-class support for it, and neither does Postgrex, the lower-
        | level connector it uses. Still, using fragments with sanitized SQL
       | wasn't too messy at all.
        
       | mrinterweb wrote:
       | I've seen Elasticsearch set up for applications that would have
       | equal benefit from just using the postgresql db's full-text
       | search they already have access to.
       | 
       | The additional complexity is usually incurred when the data in
       | postgresql changes, and those changes need to be mirrored up to
       | Elasticsearch. Elasticsearch obviously has its uses, but for some
       | cases, postgresql's built in full-text search can make more
       | sense.
        
       | [deleted]
        
       | jcuenod wrote:
       | Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to
       | search my PDF database. SQLite's full-text search is really
       | excellent. The thing that tripped me up for a while was `GROUP
       | BY` with the `snippet`/`highlight` function but that's the point
       | of the blog post.
       | 
       | [0] https://jcuenod.github.io/bibletech/2021/07/26/full-text-
       | sea...
       | 
       | [1] https://www.sqlite.org/fts5.html
        
       | MushyRoom wrote:
       | I was hyped when I found out about it a while ago. Then I wasn't
       | anymore.
       | 
       | When you have 12 locales (kr/ru/cn/jp/..) it's not that fun
       | anymore. Especially on a one man project :)
        
         | freewizard wrote:
          | For small projects with simple full-text search requirements, try
         | this generic parser:
         | https://github.com/freewizard/pg_cjk_parser
        
         | oauea wrote:
         | Why support so many locales in a one man project?
        
           | MushyRoom wrote:
           | It's just a project to learn all the things that are web.
           | It's mainly a database for a game now with most of the
           | information sourced from the game (including its localization
           | files).
           | 
           | I'm slowly transitioning from MariaDB to Postgres - again as
            | a learning experience. There is cool stuff, and there is
            | annoying stuff, like reproducing case-insensitive + accent-
            | insensitive matching (utf8_general_ci) in Postgres.
           | 
           | I've looked into FTS and searching for missing dictionaries
           | to support all the locales but Chinese is one of the harder
           | ones.
        
           | blacktriangle wrote:
            | One (dev) project here, and we're up to 5 locales with a
            | surprisingly small number of customers. The problem is that
            | when your customers are global, all of a sudden a single
            | customer can bring along multiple locales. I very much regret
            | not taking localization far more seriously early in
            | development, but we were blindsided by the interest outside
            | the Anglosphere.
        
           | habibur wrote:
           | To attract more visitors/customers I guess. I plan to extend
           | to 10 languages too. 1 man project.
        
       | tabbott wrote:
       | Zulip's search is powered by this built-in Postgres full-text
        | search feature, and it's been a fantastic experience. There are
        | a few things I love about it:
       | 
        | * One can compose full-text search with other search operators by
        | just doing normal joins on database indexes, which means we can
        | cheaply and performantly support tons of useful operators
        | (https://zulip.com/help/search-for-messages).
       | 
       | * We don't have to build a pipeline to synchronize data between
       | the real database and the search database. Being a chat product,
       | a lot of the things users search for are things that changed
       | recently; so lag, races, and inconsistencies are important to
       | avoid. With the Postgres full-text search, all one needs to do is
       | commit database transactions as usual, and we know that all
       | future searches will return correct results.
       | 
       | * We don't have to operate, manage, and scale a separate service
       | just to support search. And neither do the thousands of self-
       | hosted Zulip installations.
       | 
       | Responding to the "Scaling bottleneck" concerns in comments
       | below, one can send search traffic (which is fundamentally read-
       | only) to a replica, with much less complexity than a dedicated
       | search service.
       | 
       | Doing fancy scoring pipelines is a good reason to use a
       | specialized search service over the Postgres feature.
       | 
       | I should also mention that a weakness of Postgres full-text
       | search is that it only supports doing stemming for one language.
       | The excellent PGroonga extension (https://pgroonga.github.io/)
       | supports search in all languages; it's a huge improvement
       | especially for character-based languages like Japanese. We're
       | planning to migrate Zulip to using it by default; right now it's
       | available as an option.
       | 
       | More details are available here:
       | https://zulip.readthedocs.io/en/latest/subsystems/full-text-...
        
         | stavros wrote:
         | I cannot tell you how much I love Zulip, but I can tell you
         | that I have no friends any more because everyone is tired of me
         | evangelizing it.
        
         | brightball wrote:
         | All of this. It's such a good operational experience that I
         | will actively fight against the introduction of a dedicated
         | search tool unless it's absolutely necessary.
        
       | rattray wrote:
       | TBH I hadn't known you could do weighted ranking with Postgres
       | search before.
       | 
       | Curious there's no mention of zombodb[0] though, which gives you
       | the full power of elasticsearch from within postgres (with
        | _consistency_, no less!). You have to be willing to tolerate slow
        | writes, of course, so using postgres' built-in search
       | functionality still makes sense for a lot of cases.
       | 
       | [0] https://github.com/zombodb/zombodb
        
         | craigkerstiens wrote:
         | Zombo is definitely super interesting and we should probably
         | add a bit in the post about it. Part of the goal here was that
         | you can do a LOT with Postgres, without adding one more system
          | to maintain. Zombo is great if you have Elastic around but want
          | Postgres as the primary interface - but what if you don't want
          | to maintain Elastic?
         | 
          | My ideal, though, is always to start with Postgres and then see
          | if it can solve my problem. I would never say Postgres is the
          | best at everything it can do, but for most things it is good
          | enough without having another system to maintain and wear a
          | pager for.
        
           | rattray wrote:
           | Zombo does at least promise to handle "complex reindexing
           | processes" for you (which IME can be very painful) but yeah,
           | I assume you'd still have to deal with shard rebalancing,
           | hardware issues, network failures or latency between postgres
           | and elastic, etc etc.
           | 
           | The performance and cost implications of Zombo are more
           | salient tradeoffs in my mind - if you want to index one of
           | the main tables in your app, you'll have to wait for a
           | network roundtrip and a multi-node write consensus on every
           | update (~150ms or more[0]), you can't `CREATE INDEX
           | CONCURRENTLY`, etc.
           | 
           | All that said, IMO the fact that Zombo exists makes it easier
           | to pitch "hey lets just build search with postgres for now
           | and if we ever need ES's features, we can easily port it to
           | Zombo without rearchitecting our product".
           | 
           | [0] https://github.com/zombodb/zombodb/issues/640
        
             | zombodb wrote:
             | I wonder what the ZomboDB developers are up to now? What
             | great text-search-in-postgres things could they be secretly
             | working on?
        
       | syoc wrote:
       | My worst search experiences always come from the features
       | applauded here. Word stemming and removing stop words is a big
       | hurdle when you know what you are looking for but get flooded by
       | noise because some part of the search string was ignored. Another
       | issue is having to type out a full word before you get a hit in
       | dynamic search boxes (looking at you Confluence).
        
         | Someone1234 wrote:
          | I'd argue that isn't a problem with the _feature_, but a
         | thoughtless implementation.
         | 
         | A _good_ implementation will weigh verbatim results highest
         | before considering the stop-word stripped or stemmed version.
          | Configuring to_tsvector() to not strip stop words or apply
          | stemming is, in my opinion, a little clunky in Postgres: you'll
          | want to make a new [language] dictionary and
         | then call to_tsvector() using your new dictionary as the first
         | parameter.
         | 
         | After you've set up the dictionary globally, this would look
         | _something_ like:
         | 
         | setweight(to_tsvector('english_no_stem_stop', col), 'A') ||
          | setweight(to_tsvector('english', col), 'B')
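          | 
          | Querying that combined vector then looks roughly like this (a
          | sketch; assume the expression above is stored in a `search`
          | column on a `docs` table and 'english_no_stem_stop' is your
          | custom configuration):
          | 
          |     SELECT id, ts_rank(search, exact || stemmed) AS rank
          |     FROM docs,
          |          plainto_tsquery('english_no_stem_stop', 'cat') exact,
          |          plainto_tsquery('english', 'cat') stemmed
          |     WHERE search @@ (exact || stemmed)
          |     ORDER BY rank DESC
          |     LIMIT 10;
          |     -- ts_rank's default label weights are {0.1, 0.2, 0.4, 1.0}
          |     -- for D/C/B/A, so verbatim 'A' hits outrank stemmed 'B' hits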
         | 
         | I think blaming Postgres for adding stemming/stop-word support
         | because it can be [ab]used for a poor search user experience is
         | like blaming a hammer for a poorly built home. It is just a
         | tool, it can be used for good or evil.
         | 
         | PS - You can do a verbatim search without using to_tsvector(),
         | but that cannot be easily passed into setweight() and you
         | cannot use features like ts_rank().
        
       | shakascchen wrote:
       | No fun doing it for Chinese, especially for traditional Chinese.
       | 
        | I had to install extra software, but on Cloud SQL you can't; you
        | have to run it on your own instances.
        
       | rattray wrote:
       | Something that's missing from this which I'm curious about is how
        | far _can't_ postgres search take you?
       | 
       | That is, what tends to be the "killer feature" that makes teams
       | groan and set up Elasticsearch because you just can't do it in
       | Postgres and your business needs it?
       | 
       | Having dealt with ES, I'd really like to avoid the operational
       | burden if possible, but I wouldn't want to choose an intermediary
       | solution without being able to say, "keep in mind we'll need to
       | budget a 3-mo transition to ES once we need X, Y, or Z".
        
         | nostrademons wrote:
         | Used to work on Google Search, used ES extensively for a
         | startup I founded (which was sort of quasi-search...it was
         | built around feed ranking, where the query is constant and a
         | stream of documents is constantly coming in), and have also
         | used Postgres extensively in other companies.
         | 
         | The big problem with all the off-the-shelf search solutions
         | (RDBMS full-text search, ES, Algolia) is that _search ranking
          | is a complicated and subtle problem_, and frequently depends
          | on signals that are _not in the document itself_. Google's big
         | insight is that how _other_ people talk about a website is more
         | important than how the website talks about itself, and its
         | ranking algorithm weights accordingly.
         | 
         | ES has the basic building blocks to construct such a ranking
         | algorithm. In terms of fundamental infrastructure I found ES to
         | be just as good as Google, and better in some ways. But its
         | out-of-the-box ranking function _sucks_. Expect to put a domain
         | expert just on search ranking and evaluation to get decent
          | results, and they're going to have to delve pretty deeply into
         | advanced features of ES to get there.
         | 
         | AFAICT Postgres search only lets you tweak the ranking
         | algorithm by assigning different weights to fields, assuming
         | that the final document score is a linear combination of
         | individual fields. This is usually not what you want - it's
         | pretty common to have non-linear terms from different signals.
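          | 
          | (Concretely, the knob Postgres exposes is the optional weight
          | array on ts_rank, one factor per A/B/C/D label - a sketch:)
          | 
          |     -- weights are {D, C, B, A}; occurrences are scaled by
          |     -- their label's factor when computing the score
          |     SELECT id,
          |            ts_rank('{0.05, 0.2, 0.5, 1.0}',
          |                    search, query) AS score
          |     FROM docs, to_tsquery('english', 'ranking') AS query
          |     WHERE search @@ query
          |     ORDER BY score DESC;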
        
         | sandGorgon wrote:
          | It doesn't do TF-IDF or BM25 - the current state of the art in
          | search relevance algorithms.
         | 
          | That's where it can't be used for anything serious.
        
           | SigmundA wrote:
            | That's why this really needs to get merged:
           | 
           | https://github.com/postgrespro/rum
        
             | rattray wrote:
             | TF/IDF is listed as a TODO on that repo, and I don't see a
             | PR which promises to provide it.
        
         | Thaxll wrote:
          | PG is average at best for text search; it's not even good.
        
         | nextaccountic wrote:
         | If your search needs outgrow Postgres' native search engine,
         | you can use Postgres search with an ElasticSearch backend,
         | using Zombo
         | 
         | https://github.com/zombodb/zombodb
         | 
         | It basically gives you a new kind of index (create index ..
         | using zombodb(..) ..)
        
         | mjewkes wrote:
         | Exact phrase matching. This generally requires falling back to
         | ILIKE, which is not performant.
        
           | e12e wrote:
           | Even if using ILIKE over the result of an imprecise query?
        
             | [deleted]
        
           | simonw wrote:
           | Exact phrase searching works in PostgreSQL full-text search -
           | here's an example:
           | https://simonwillison.net/search/?q=%22nosql+database%22
           | 
           | I'm using search_type=websearch https://github.com/simonw/sim
           | onwillisonblog/blob/a5b53a24b00...
           | 
           | That's using websearch_to_tsquery() which was added in
           | PostgreSQL 11: https://www.postgresql.org/docs/11/textsearch-
           | controls.html#...
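            | 
            | e.g. (output shown approximately; the stemmer reduces
            | "database" to "databas"):
            | 
            |     SELECT websearch_to_tsquery('english', '"nosql database"');
            |     -- => 'nosql' <-> 'databas'
            |     -- a phrase query using the followed-by operator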
        
             | SahAssar wrote:
             | I had some issues with this recently as I couldn't get a
             | FTS query to find something looking like a path or url in a
             | query. As an example from your site:
             | https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains
             | the exact text https://www.niche-museums.com/, but I cannot
             | find a way to search for that phrase exactly (trying
             | https://simonwillison.net/search/?q=%22www.niche-
             | museums.com... works though). I tried both in my own psql
             | setup and on your site, and it seems like exact phrase
             | searching is limited to the language used, even if it would
             | be an exact string match.
             | 
             | Are there any workarounds for that?
        
           | rattray wrote:
           | Pardon my ignorance - what is exact phrase matching and why
           | doesn't it work with tsvector?
        
         | amichal wrote:
          | Postgres was not good at (for us):
          | 
          | - IDF and other corpus-based relevancy measures; had to hand-
          | roll them
          | 
          | - thesaurus and misspelling support; again, possible of course
          | with preprocessing and by adding config files
          | 
          | - non-Latin-alphabet languages, e.g. Arabic; this needed
          | filesystem access (we used AWS RDS so couldn't do it) to add a
          | dictionary-based stemmer/word breaker
          | 
          | We used ES or Solr for those cases. For English FTS with 100k
          | documents, doing it in PG is super easy and one less dependency.
        
         | some_developer wrote:
          | Anecdotal note:
         | 
         | A few years ago we added yet-another part to our product and,
         | whilst ES worked "okay", we got a bit weary of ES due to "some
         | issues" (some bug in the architecture keeping things not
         | perfect in sync, certain queries with "joins" of types taking
          | long, demand on HW due to the size of the database, no proper
          | multi-node setup due to $$$ and time constraints, etc.; small
         | things piling up over time).
         | 
         | Bright idea: let's see how far Postgres, which is our primary
         | datastore, can take us!
         | 
         | Unfortunately, the feature never made it fully into production.
         | 
         | We thought that on paper, the basic requirements were ideal:
         | 
          | - although the table has multiple hundreds of millions of
          | entries, natural segmentation by customer ID made the
          | individual result sets much smaller
         | 
          | - no weighted search results needed: datetime-based ordering is
          | good enough for this use case, so we thought it would be easy
          | to come up with the "perfect index [tm]"
         | 
         | Alas, we didn't even get that far:
         | 
         | - we identified ("only") 2 columns necessary for the search =>
         | "yay, easy"
         | 
         | - one of those columns was multi-language; though we didn't
         | have specific requirements and did not have to deal with
         | language specific behaviour in ES, we had to decide on one for
          | the TS vectorization (the details elude me as to why "simple"
          | wasn't appropriate for this one column; it was certainly fine
          | for the other one)
         | 
          | - unsure which one, or both, we would need, we created both
          | indices for one of the columns (the difference being the
          | "language")
         | 
         | - we started out with a GIN index (see
         | https://www.postgresql.org/docs/9.6/textsearch-indexes.html )
         | 
         | - creating a single index took > 15 hours
         | 
          | But once the second index was done - before we had even rolled
          | out the feature in the app itself (which at this point was
          | still an ever-changing MVP) - we suddenly got hit by a lot of
          | customer complaints that totally different operations on this
          | table (INSERTs and UPDATEs) were getting slow (like 5-15
          | seconds slow, for something that usually takes a few ms).
         | 
         | Backend developer eyes were wide open O_O
         | 
          | But since we knew that second index had just finished, after
          | checking the Postgres logs we decided to drop the FTS indices
          | and, lo and behold, "performance problem solved".
         | 
         | Communication lines were very short back then (still are today,
          | actually), and it was promptly decided that we would just cut
          | the search functionality from this new part of the product and
          | be done
         | with it. This also solved the problem, basically (guess there's
         | some "business lesson" to be learned here too, not just
         | technical ones).
         | 
          | Since no one within the company argued against this decision,
          | we did not spend more time analyzing the details of the
          | performance issue, though I would have loved to dig into this
          | and get an expert on board to dissect it.
         | 
         | --
         | 
          | A year or so later I had a bit of free time and analyzed an
          | annoying recurring slow-UPDATE problem on a completely
          | different table, also involving FTS on a single column with a
          | GIN index. That's when I stumbled over
         | https://www.postgresql.org/docs/9.6/gin-implementation.html
         | 
         | > Updating a GIN index tends to be slow because of the
         | intrinsic nature of inverted indexes: inserting or updating one
         | heap row can cause many inserts into the index (one for each
         | key extracted from the indexed item). As of PostgreSQL 8.4, GIN
         | is capable of postponing much of this work by inserting new
         | tuples into a temporary, unsorted list of pending entries. When
         | the table is vacuumed or autoanalyzed, or when
         | gin_clean_pending_list function is called, or if the pending
         | list becomes larger than gin_pending_list_limit, the entries
         | are moved to the main GIN data structure using the same bulk
         | insert techniques used during initial index creation. This
         | greatly improves GIN index update speed, even counting the
         | additional vacuum overhead. Moreover the overhead work can be
         | done by a background process instead of in foreground query
         | processing.
         | 
         | In this particular case I was able to solve the occasional slow
         | UPDATE queries with "FASTUPDATE=OFF" on that table and,
         | thinking back about the other issue, it might have solved or
         | minimized the impact.
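          | 
          | For reference, the knobs look something like this (index name
          | is hypothetical):
          | 
          |     -- pay the GIN maintenance cost on every write
          |     -- instead of in bursts
          |     ALTER INDEX posts_search_idx SET (fastupdate = off);
          |     -- or keep fastupdate but flush / bound the pending list
          |     SELECT gin_clean_pending_list('posts_search_idx'::regclass);
          |     ALTER INDEX posts_search_idx
          |       SET (gin_pending_list_limit = 512);  -- in kB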
         | 
         | Back to the original story: yep, this one table can have
         | "peaks" of inserts but it's far from "facebook scale" or
          | whatever: basically 1.5k inserts/second was the absolute rare
          | peak I measured, and usually it's in the <500 area. But I
         | guess it was enough for this scenario to add latency within the
         | database.
         | 
         | --
         | 
          | Thinking back further, I was always "pro" trying to minimize /
          | get rid of ES after learning about
          | http://rachbelaid.com/postgres-full-text-search-is-good-enou...
          | even before we used any FTS feature. It also mentions the
          | GIN/GiST issue, but alas, in our case ElasticSearch is good
          | enough and, despite the troubles we've had with it, actually
          | easier to reason about (so far).
        
         | grncdr wrote:
         | My experience has been that sorting by relevance ranking is
         | quite expensive. I looked into this a bit and found
         | https://github.com/postgrespro/rum (and some earlier slide
         | decks about it) that explains why the GIN index type can't
         | support searching and ranking itself (meaning you need to do
         | heap scans for ranking). This is especially problematic if your
         | users routinely do searches that match a lot of documents and
         | you only want to show the top X results.
         | 
         | Edit: if any of the Crunchy Data people are reading this:
         | support for RUM indexes would be super cool to have in your
         | managed service.
        
           | thom wrote:
           | Top X queries should be optimised with gist indexes.
        
             | grncdr wrote:
             | Ah sounds like I need to update myself on this. I don't
             | recall what the other trade-offs vs GIN indexes are.
        
           | craigkerstiens wrote:
            | Actually that's a great suggestion. We need to take a deeper
            | look at the code itself and our ability to support the
            | extension, but we'll definitely evaluate it during some of
            | our upcoming roadmap planning for Crunchy Bridge
           | (https://www.crunchybridge.com).
        
             | grncdr wrote:
             | Cool cool
             | 
             | If I could have your other ear for a moment: support for
             | _any_ EU-based cloud provider would be super nice. Ever
             | since the Privacy Shield fig leaf was removed, questions
             | about whether we store any data under US jurisdiction have
             | become a lot more frequent.
        
               | craigkerstiens wrote:
               | Do you have a preferred one?
               | 
                | We're currently on the big 3 US ones and in the process
                | of working on our 4th provider. Our goal is very much to
                | deliver the best Postgres experience, whether on bare
                | metal/on-premise or in the cloud, self-hosted or fully
                | managed.
               | 
                | Edit: Always feel free to reach out directly. I'm always
                | happy to spend time with anyone who has questions, and
                | it's usually pretty easy to track me down.
        
         | xcambar wrote:
          | This is an anecdote, not proper feedback, since I wasn't directly
         | involved in the topic.
         | 
         | My company relied on PG as its search engine and everything
         | went well from POC to production. After a few years of
         | production and new clients requiring volumes of data an order
         | of magnitude above our comfort zone, things went south pretty
         | fast.
         | 
          | Not many months, but many sweaty weeks of engineering later, we
          | switched to ES and we're not looking back.
         | 
          | tl;dr: even with great DB engineers (which we had), I'd suggest
         | that scale is a strong limiting factor on this feature.
        
           | dante_dev wrote:
            | Can you tell us the scale you're talking about? We're getting
            | good-enough results with ~1 billion rows.
        
             | lacksconfidence wrote:
             | Can you tell us about your scoring function? Selecting 40M
             | results from a dataset of ~1B and returning the top 10
             | based on some trivial scoring function is easy, any
             | reasonable search system will handle that. The problem is
             | when you have to run your scoring function on all 40M
             | matching docs to decide which 10 are the most relevant.
             | It's even more of a problem when your scoring function
              | captures some of the complexities of the real world, rather
             | than something trivial like tf/idf or bm25.
        
         | _009 wrote:
         | If you are looking to do semantic search (Cosine similarity) +
         | filtering (SQL) on data that can be represented as vectors
         | (audio, text, video, bio) I suggest,
         | https://github.com/ankane/pgvector
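          | 
          | A rough sketch of what that looks like (table and dimensions
          | made up):
          | 
          |     CREATE EXTENSION IF NOT EXISTS vector;
          |     CREATE TABLE items (
          |       id bigserial PRIMARY KEY,
          |       category_id int,
          |       embedding vector(3)
          |     );
          |     -- <=> is cosine distance; mixes freely with SQL filters
          |     SELECT id FROM items
          |     WHERE category_id = 42
          |     ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
          |     LIMIT 10;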
        
         | iav wrote:
         | I moved from ElasticSearch to PG FTS in production, and here
         | are the things I had to give up:
         | 
         | 1. PostgreSQL has a cap on column length, and the search index
         | has to be stored in a column. The length of the column is
         | indeterminate - it is storing every word in the document and
         | where it's located, so a short document with very many unique
         | words (numbers are treated as words too) can easily burst the
         | cap. This means you have to truncate each document before
         | indexing it, and pray that your cap is set low enough. You can
         | use multiple columns but that slows down search and makes
         | ranking a lot more complicated. I truncate documents at 4MB.
         | 
         | 2. PostgreSQL supports custom dictionaries for specific
         | languages, stemmers, and other nice tricks, but none of those
         | are supported by AWS because the dictionary gets stored as a
         | file on the filesystem (it's not a config setting). You can
         | still have custom rules like whether or not numbers count as
         | words.
        
         | gk1 wrote:
         | Semantic search using text embeddings. With Open Distro for
         | Elasticsearch you can store your text embeddings and then
         | perform a nearest-neighbor search[1] to find most similar
         | documents using cosine similarity[2]. Elasticsearch (vanilla)
         | will get this feature with 8.0.
         | 
         | If migrating to ES makes you groan you can use a managed
         | service like Pinecone[3] (disclaimer: I work there) just for
         | storing and searching through text embeddings in-memory through
         | an API while keeping the rest of your data in PG.
         | 
         | [1] Nearest-neighbor searches in Open Distro:
         | https://opendistro.github.io/for-elasticsearch-docs/docs/knn...
         | 
         | [2] More on how semantic similarity is measured:
         | https://www.pinecone.io/learn/semantic-search/
         | 
         | [3] https://www.pinecone.io
        
         | matsemann wrote:
         | I'm not well versed on modern pg for this use, but when I
         | managed a Solr instance ~5 years ago, it was the ranking of the
          | results that was the killer feature. Most systems can find
          | results fast; knowing which results to present is harder.
         | 
         | Our case was a domain specific knowledge base, with certain
         | terms occurring often in many articles. Searching for a term
         | could bring up thousands of results, but few of them were
         | actually relevant to show in context of the search, they just
         | happened to use the term.
        
         | jka wrote:
         | "Faceted search"[1] (aka aggregates in Elasticsearch) tends to
         | be a popular one, to provide user-facing content navigation.
         | 
         | That said, simonw has been on the case[2] demonstrating an
         | implementation of that using Django and PostgreSQL.
         | 
         | [1] - https://en.wikipedia.org/wiki/Faceted_search
         | 
         | [2] - https://simonwillison.net/2017/Oct/5/django-postgresql-
         | facet...
        
           | craigds wrote:
           | This is the key one for us that makes Postgres a non-starter
            | for FTS (we use postgres for everything _else_)
           | 
           | We begrudgingly use Solr instead (we started before ES was
           | really a thing and haven't found a _need_ to switch yet)
           | 
           | When you get more than about two different _types_ of filters
           | (e.g. types of filters could be  'tags', 'categories',
           | 'geotags', 'media type', 'author' etc), the combinatorial
           | explosion of Postgres queries required to provide facet
           | counts gets unmanageable.
           | 
           | For example, when I do a query filtered by
           | `?tag=abc&category=category1`, I need to do these queries:
            | - `... where tag = 'abc' and category_id = 1` (the current
            | results)
            | 
            | - `count(*) ... where tag = 'def' and category_id = 1` (for
            | each other tag present in the results)
            | 
            | - `count(*) ... where tag = 'abc' and category_id = 2` (for
            | each other category present in the results)
            | 
            | - `count(*) ... where category_id = 1`
            | 
            | - `count(*) ... where tag = 'abc'`
            | 
            | - `count(*)`
           | 
           | There are certainly smarter ways to do this than lots of tiny
           | queries, but all this complexity still ends up somewhere and
           | isn't likely to be great for performance.
           | 
           | Whereas solr/elasticsearch have faceting built in and handle
           | this with ease.
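            | 
            | (One of those "smarter ways", sketched over a hypothetical
            | `items` table with a `search` tsvector: a filtered CTE plus
            | one grouped count per facet dimension. Fewer round trips, but
            | the per-dimension branches still multiply, which is the
            | point.)
            | 
            |     WITH base AS (
            |       SELECT * FROM items
            |       WHERE search @@ websearch_to_tsquery('english', 'query')
            |     )
            |     SELECT 'tag' AS facet, tag AS value, count(*)
            |       FROM base WHERE category_id = 1 GROUP BY tag
            |     UNION ALL
            |     SELECT 'category', category_id::text, count(*)
            |       FROM base WHERE tag = 'abc' GROUP BY category_id;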
        
         | brightball wrote:
         | Streaming data ingestion is the biggest. If you're constantly
         | writing data to be searched, this is where ES really outshines
         | everything.
        
         | kayodelycaon wrote:
         | If I recall correctly, Postgres search doesn't scale well. Not
         | sure where it falls apart but it isn't optimized in the same
         | way something like Solr is.
        
           | ezekg wrote:
           | I have a table with over a billion rows and most full-text
            | searches still respond in a few milliseconds. I think
           | this will depend on a lot of factors, such as proper
           | indexing, and filtering down the dataset as much as possible
           | before performing the full-text ops. I've spent a
           | considerable amount of time on optimizing these queries,
           | thanks to tools like PgMustard [0]. Granted, I do still have
           | a couple slow queries (1-10s query time), but that's likely
           | due to very infrequent access i.e. cold cache.
           | 
           | I will say, if you use open source libraries like pg_search,
           | you are unlikely to ever have performant full-text search.
           | Most full-text queries need to be written by hand to actually
           | utilize indexes, instead of the query-soup that these types
           | of libraries output. (No offense to the maintainers -- it's
           | just how it be when you create a "general" solution.)
           | 
           | [0]: https://pgmustard.com
        
             | snack-boye wrote:
             | Silly question, I'm using pg right now and most of my
             | queries are something like this (in english)
             | 
             | Find me some results in my area that contain these
             | categoryIds and are slotted to start between now and next
             | 10 days.
             | 
              | Since it's already quite a filtered set of data, would that
              | mean I should have few issues adding pg text search,
             | because with correct indexing and all, it will usually be
             | applied to a small set of data?
             | 
             | Thanks
        
               | ezekg wrote:
               | I'm not a DBA, so I can't say for certain simply due to a
               | gap in my knowledge. But in my experience, it depends on
               | a lot of factors. Sometimes pg will use an index before
               | performing the search ops, other times a subquery is
               | needed. Check out pgmustard and dig into your slow query
               | plans. :)
        
               | john-shaffer wrote:
               | You might be just fine adding an unindexed tsvector
               | column, since you've already filtered down the results.
               | 
               | The GIN indexes for FTS don't really work in conjunction
               | with other indices, which is why
               | https://github.com/postgrespro/rum exists. Luckily, it
               | sounds like you can use your existing indices to filter
               | and let postgres scan for matches on the tsvector. The
               | GIN tsvector indices are quite expensive to build, so
               | don't add one if postgres can't make use of it!
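                | 
                | i.e. something like this (hypothetical `events` table;
                | the b-tree indexes do the heavy lifting and the tsvector
                | match only runs on the rows that survive the filters):
                | 
                |     SELECT id, title
                |     FROM events
                |     WHERE category_id = ANY ('{3,7}')
                |       AND starts_at BETWEEN now()
                |             AND now() + interval '10 days'
                |       AND to_tsvector('english', title || ' ' || body)
                |             @@ websearch_to_tsquery('english', 'query');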
        
             | kayodelycaon wrote:
             | Oh cool. I was right and wrong. Thanks!
        
         | fizx wrote:
         | Hi, I started an Elasticsearch hosting company, since sold, and
         | have built products on PG's search and SQLite FTS search.
         | 
         | There are in my mind two reasons to not use PG's search.
         | 
         | 1. Elasticsearch allows you to build sophisticated linguistic
         | and feature scoring pipelines to optimize your search quality.
         | This is not a typical use case in PG.
         | 
         | 2. Your primary database is usually your scaling bottleneck
         | even without adding a relatively expensive search workload into
         | the mix. A full-text search tends to be around as expensive as
         | a 5% table scan of the related table. Most DBAs don't like
         | large scan workloads.
        
           | kurko wrote:
           | I agree with reason 1, but reason 2 is an answer for, "should
           | I use PG search in the same PG instance I already have", and
           | that's a different discussion. You can set up a replica for
           | that.
        
           | quietbritishjim wrote:
           | Do you see any particular reasons to use or not use ZomboDB
            | [1]? It claims to let you use ElasticSearch from PG
            | seamlessly, e.g. it manages coherency of which results ought
           | to be returned according to the current transaction. (I've
           | never quite ended up needing to use ES but it's always seemed
           | to me I'd be likely to need ZomboDB if I did.)
           | 
           | [1] https://github.com/zombodb/zombodb
        
             | mixmastamyk wrote:
             | You can do anything, anything at all, at https://zombo.com/
             | 
             |  _" The only limit, is yourself..."_
        
               | FpUser wrote:
               | This one is my favorite for many many years. Good for
               | relaxing.
        
               | habibur wrote:
               | It's still around?
        
               | mumblemumble wrote:
               | That seems like one for the philosophers. If you
               | completely rewrite a Flash site in HTML5, but it looks
               | the same and has the same URL, is it still the same site?
        
               | da_chicken wrote:
               | So, YouTube?
        
               | mixmastamyk wrote:
                | Click the link.
        
             | Conlectus wrote:
             | Though I have only a recreational interest in datastores, I
             | would be pretty wary of a service that claims to strap a
              | Consistent-Unavailable database (Postgres) to an
             | Inconsistent-Available database (ElasticSearch) in a useful
             | way.
             | 
             | Doing so would require ElasticSearch to reach consensus on
             | every read/write, which would remove most of the point of a
             | distributed cluster. Despite this, ZomboDB's documentation
             | says "complex aggregate queries can be answered in parallel
             | across your ElasticSearch cluster".
             | 
             | They also claim that transactions will abort if
             | ElasticSearch runs into network trouble, but the
             | ElasticSearch documentation notes that writes during
             | network partitions don't wait for confirmation of
             | success[1], so I'm not sure how they would be able to
             | detect that.
             | 
             | In short: I'll wait for the Jepsen analysis.
             | 
             | [1] https://www.elastic.co/blog/tracking-in-sync-shard-
             | copies#:~...
        
               | zombodb wrote:
               | > Doing so would require ElasticSearch to reach consensus
               | on every read/write
               | 
               | ZomboDB only requires that ES have a view of its index
               | that's consistent with the active Postgres transaction
               | snapshot. ZDB handles this by ensuring that the ES index
               | is fully refreshed after writes.
               | 
               | This doesn't necessarily make ZDB great for high-update
               | loads, but that's not ZDB's target usage.
               | 
               | > They also claim that transactions will abort if
               | ElasticSearch runs into network trouble...
               | 
               | I had to search my own repo to see where I make this
               | claim. I don't. I do note that network failures between
               | PG & ES will cause the active Postgres xact to abort. On
               | top of that, any error that ES is capable of reporting
               | back to the client will cause the PG xact to abort --
               | ensuring consistency between the two.
               | 
               | Because the ES index is properly refreshed as it relates
               | to the active Postgres transaction, all of ES' aggregate
               | search functions are capable of providing proper MVCC-
               | correct results, using the parallelism provided by the ES
               | cluster.
               | 
               | I don't have the time to detail everything that ZDB does
               | to project Postgres xact snapshots on top of ES, but the
               | above two points are the easy ones to solve.
        
           | NortySpock wrote:
           | Regarding point 2: Shouldn't you be moving your search
           | queries from your transaction server to a separate analysis
           | or read-replica server? OLTP copies to OLAP and suddenly
           | you've separated these two problems.
        
             | blowski wrote:
             | ...and created a new one if the projection is only
             | eventually consistent. No free lunches here.
        
               | zepolen wrote:
               | How is that different from running Postgres and
               | Elasticsearch separately?
        
               | blowski wrote:
               | It's not.
               | 
                | It's the difference between having separate OLAP and OLTP
               | databases, which is what the parent post suggested.
        
               | justinclift wrote:
               | Probably skill set of staff?
               | 
               | For example, if a place has developed fairly good
               | knowledge of PG already, they can "just" (!) continue
               | developing their PG knowledge.
               | 
               | Adding ES into the mix though, introduces a whole new
               | thing that needs to be learned, optimised, etc.
        
           | pmarreck wrote:
           | > sophisticated linguistic and feature scoring pipelines to
           | optimize your search quality
           | 
           | You CAN score results using setweight, although it's likely
           | not as sophisticated as Elasticsearch's
           | 
           | https://www.postgresql.org/docs/9.1/textsearch-controls.html
           | 
            | Disclaimer: I use Postgres full-text search in production and
            | am very happy with it, although maintaining the various
            | triggers and stored procs it requires becomes cumbersome
            | whenever you have to write a migration that alters any of
            | them (or that in fact touches any related field, as you may
            | be required to drop and recreate all of the parts in order
            | not to violate referential integrity)
           | 
            | It is certainly nice not having to worry about one additional
            | dependency when deploying, though.
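            | 
            | For the record, the built-in helper trigger looks roughly
            | like this (hypothetical `posts` table with a `search`
            | tsvector column; on Postgres 12+ a generated column avoids
            | the trigger entirely):
            | 
            |     CREATE TRIGGER posts_search_update
            |       BEFORE INSERT OR UPDATE ON posts
            |       FOR EACH ROW EXECUTE FUNCTION
            |         tsvector_update_trigger(search,
            |           'pg_catalog.english', title, body);
            |     -- (EXECUTE PROCEDURE on versions before Postgres 11)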
        
       | thom wrote:
       | We get really nice results with gist indexes (gist_trgm_ops)
       | searching across multiple entity types to do top X queries. It's
       | very useful to be able to make a stab at a difficult-to-spell
       | foreign football player's name, possibly with lots of diacritics,
       | and get quick results back. I'm always surprised when I find a
       | search engine on any site that is so unkind as to make you spell
       | things exactly.
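        | 
        | For example (hypothetical `players` table):
        | 
        |     CREATE EXTENSION IF NOT EXISTS pg_trgm;
        |     CREATE INDEX players_name_trgm
        |       ON players USING gist (name gist_trgm_ops);
        |     -- <-> is trigram distance; a GiST index can serve the
        |     -- ORDER BY as a nearest-neighbour search
        |     SELECT name FROM players
        |     ORDER BY name <-> 'zlatan ibrahimovich'
        |     LIMIT 10;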
        
       | bityard wrote:
       | I know Postgres and SQLite have mostly different purposes but
       | FWIW, SQLite also has a surprisingly capable full-text search
       | extension built right in: https://www.sqlite.org/fts5.html
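        | 
        | A quick taste (SQLite syntax, not Postgres):
        | 
        |     CREATE VIRTUAL TABLE docs USING fts5(title, body);
        |     INSERT INTO docs (title, body)
        |       VALUES ('hello', 'full text search built into sqlite');
        |     -- snippet(table, column, before, after, ellipsis, tokens)
        |     SELECT title, snippet(docs, 1, '[', ']', '...', 8)
        |     FROM docs
        |     WHERE docs MATCH 'search'
        |     ORDER BY rank;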
        
         | jjice wrote:
         | It's very impressive, especially considering the SQLite version
          | you're already using probably has it enabled. I use it for a
          | small site I run and it works fantastically. It's a little finicky
         | with deletes and updates due to virtual tables in SQLite, but
         | definitely impressive and has its uses.
        
       | SigmundA wrote:
        | I keep wondering if RUM indexes [1] will ever get merged for faster
       | and better ranking (TF/IDF). Really would make PG a much more
       | complete text search engine.
       | 
       | https://github.com/postgrespro/rum
        
       ___________________________________________________________________
       (page generated 2021-07-27 23:00 UTC)