[HN Gopher] Postgres Full-Text Search: A search engine in a data... ___________________________________________________________________ Postgres Full-Text Search: A search engine in a database Author : twakefield Score : 318 points Date : 2021-07-27 15:20 UTC (7 hours ago) (HTM) web link (blog.crunchydata.com) (TXT) w3m dump (blog.crunchydata.com) | simonw wrote: | The Django ORM includes support for PostgreSQL search and I've | found it a really productive way to add search to a project: | https://docs.djangoproject.com/en/3.2/ref/contrib/postgres/s... | lettergram wrote: | I actually built a search engine back in 2018 using postgresql | | https://austingwalters.com/fast-full-text-search-in-postgres... | | Worked quite well and I still use it daily. Basically, doing | weighted searches on vectors is slower than my approach, but | definitely good enough. | | Currently, I can search around 50m HN & Reddit comments in 200ms | on the postgresql instance running on my machine. | vincnetas wrote: | Off-topic, but curious: what are your use cases for searching | all HN and reddit comments? I'm at the beginning of this path, | having just crawled HN, but what to do with it is still a bit | cloudy. | lettergram wrote: | I built a search engine that quantified the expertise of | authors of comments. Then I created what I called "expert | rank" that allowed me to build a really good search engine. | | Super good if you're at a company or something | | https://twitter.com/austingwalters/status/104189476543920128... | turbocon wrote: | Wow, that led me down quite a rabbit hole; impressive | work. | rattray wrote: | Nice - looks like the ~same approach recommended here of adding | a generated `tsvector` column with a GIN index and querying it | with `col @@ to_tsquery('english', query)`. | lettergram wrote: | Yeah, my internal approach was creating custom vectors which | are quicker to search.
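A minimal sketch of the generated-column pattern rattray describes above (the `posts` table and its column names are illustrative, not from the article):

```sql
-- Hypothetical table; a generated tsvector column (PostgreSQL 12+)
-- keeps the search index in sync with the row automatically.
CREATE TABLE posts (
    id     bigserial PRIMARY KEY,
    title  text NOT NULL,
    body   text,
    -- Weighted so title matches rank above body matches.
    search tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', coalesce(body, '')), 'B')
    ) STORED
);

CREATE INDEX posts_search_idx ON posts USING GIN (search);

-- websearch_to_tsquery (PostgreSQL 11+) accepts user-style queries;
-- ts_rank orders results using the A/B weights above.
SELECT id, title,
       ts_rank(search, websearch_to_tsquery('english', 'postgres search')) AS rank
FROM posts
WHERE search @@ websearch_to_tsquery('english', 'postgres search')
ORDER BY rank DESC
LIMIT 10;
```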
| pvsukale3 wrote: | If you are using Rails with Postgres, you can use the pg_search gem to | build named scopes that take advantage of full-text search. | | https://github.com/Casecommons/pg_search | theandrewbailey wrote: | > You could also look into enabling extensions such as unaccent | (remove diacritic signs from lexemes) or pg_trgm (for fuzzy | search). | | Trigrams (pg_trgm) are practically required for usable search when | it comes to misspellings and compound words (e.g. a search for | "down loads" won't return "downloads"). | | I also recommend using websearch_to_tsquery instead of the | cryptic syntax of to_tsquery. | kyrra wrote: | Trigrams are amazing. I was doing a side project where I wanted | to allow substring searching, and trigrams seemed to be the | only way to do it (easily/well) in postgres. Gitlab did a great | writeup on this a few years ago that really helped me | understand it: | | https://about.gitlab.com/blog/2016/03/18/fast-search-using-p... | | You can also always read the official docs: | | https://www.postgresql.org/docs/current/pgtrgm.html | Grimm1 wrote: | Postgres full-text search is a great way to get search running | for a lot of standard web applications. I recently used just this | in Elixir to set up a simple search by keyword. My only complaint | was that Ecto (Elixir's query builder library) doesn't have first- | class support for it, and neither does Postgrex, the lower-level | connector it uses. Still, using fragments with sanitized SQL | wasn't too messy at all. | mrinterweb wrote: | I've seen Elasticsearch set up for applications that would benefit | equally from just using the full-text search of the postgresql db | they already have access to. | | The additional complexity usually comes in when the data in | postgresql changes, and those changes need to be mirrored up to | Elasticsearch. Elasticsearch obviously has its uses, but in some | cases, postgresql's built-in full-text search can make more | sense.
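A sketch of the pg_trgm setup discussed above (the `posts` table and index name are illustrative):

```sql
-- pg_trgm enables the fuzzy and substring matching that plain FTS misses.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index lets LIKE/ILIKE (even with a leading wildcard)
-- and the similarity operators use an index instead of a full scan.
CREATE INDEX posts_title_trgm_idx ON posts USING GIN (title gin_trgm_ops);

-- Substring search:
SELECT id, title FROM posts WHERE title ILIKE '%downlo%';

-- Fuzzy search: similarity() scores 0..1; % is the similarity operator,
-- so "down loads" can still find "downloads".
SELECT id, title, similarity(title, 'down loads') AS score
FROM posts
WHERE title % 'down loads'
ORDER BY score DESC;
```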
| [deleted] | jcuenod wrote: | Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to | search my PDF database. SQLite's full-text search is really | excellent. The thing that tripped me up for a while was `GROUP | BY` with the `snippet`/`highlight` function, but that's the point | of the blog post. | | [0] https://jcuenod.github.io/bibletech/2021/07/26/full-text-sea... | | [1] https://www.sqlite.org/fts5.html | MushyRoom wrote: | I was hyped when I found out about it a while ago. Then I wasn't | anymore. | | When you have 12 locales (kr/ru/cn/jp/..) it's not that fun | anymore. Especially on a one-man project :) | freewizard wrote: | For small projects with simple full-text search requirements, try | this generic parser: | https://github.com/freewizard/pg_cjk_parser | oauea wrote: | Why support so many locales in a one-man project? | MushyRoom wrote: | It's just a project to learn all the things that are web. | It's mainly a database for a game now, with most of the | information sourced from the game (including its localization | files). | | I'm slowly transitioning from MariaDB to Postgres - again as | a learning experience. There is cool stuff, and there is | annoying stuff, like reproducing case-insensitive + accent-insensitive | matching (utf8_general_ci) in Postgres. | | I've looked into FTS and searched for the missing dictionaries | to support all the locales, but Chinese is one of the harder | ones. | blacktriangle wrote: | One (dev) project here, and we're up to 5 locales with a | surprisingly small number of customers. The problem is that when | your customers are global, a single customer can suddenly | bring along multiple locales. I very much regret not taking | localization far more seriously early in development, but we | were blindsided by the interest outside the Anglosphere. | habibur wrote: | To attract more visitors/customers, I guess. I plan to extend | to 10 languages too. 1-man project.
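One way to approximate MariaDB's utf8_general_ci behaviour (case- and accent-insensitive matching) mentioned above is the unaccent extension; a sketch, again using the hypothetical `posts` table:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Accent- and case-insensitive comparison: 'Pokemon' matches 'Pokémon'.
SELECT id, title
FROM posts
WHERE unaccent(lower(title)) = unaccent(lower('Pokémon'));

-- unaccent can also be folded into a text search configuration,
-- following the recipe in the PostgreSQL docs:
CREATE TEXT SEARCH CONFIGURATION en_unaccent (COPY = english);
ALTER TEXT SEARCH CONFIGURATION en_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, english_stem;

-- Now to_tsvector('en_unaccent', ...) strips diacritics before stemming.
```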
| tabbott wrote: | Zulip's search is powered by this built-in Postgres full-text | search feature, and it's been a fantastic experience. There are a | few things I love about it: | | * One can cheaply compose full-text search with other search | operators by just doing normal joins on database indexes, which | means we can cheaply and performantly support tons of useful | operators (https://zulip.com/help/search-for-messages). | | * We don't have to build a pipeline to synchronize data between | the real database and the search database. Being a chat product, | a lot of the things users search for are things that changed | recently; so lag, races, and inconsistencies are important to | avoid. With the Postgres full-text search, all one needs to do is | commit database transactions as usual, and we know that all | future searches will return correct results. | | * We don't have to operate, manage, and scale a separate service | just to support search. And neither do the thousands of self-hosted | Zulip installations. | | Responding to the "Scaling bottleneck" concerns in comments | below, one can send search traffic (which is fundamentally read-only) | to a replica, with much less complexity than a dedicated | search service. | | Doing fancy scoring pipelines is a good reason to use a | specialized search service over the Postgres feature. | | I should also mention that a weakness of Postgres full-text | search is that it only supports stemming for one language. | The excellent PGroonga extension (https://pgroonga.github.io/) | supports search in all languages; it's a huge improvement, | especially for character-based languages like Japanese. We're | planning to migrate Zulip to using it by default; right now it's | available as an option. | | More details are available here: | https://zulip.readthedocs.io/en/latest/subsystems/full-text-...
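The "compose with other operators via normal joins" point above can be sketched as a single query; the schema here is hypothetical (not Zulip's actual tables):

```sql
-- A search like "from:alice in:42 deadline" becomes one SQL statement:
-- the full-text predicate composes with ordinary filters and joins,
-- and all of it runs inside the same transaction-consistent snapshot.
SELECT m.id, m.content
FROM messages m
JOIN users u ON u.id = m.sender_id
WHERE m.search_tsvector @@ websearch_to_tsquery('english', 'deadline')
  AND u.name = 'alice'
  AND m.stream_id = 42
ORDER BY m.sent_at DESC
LIMIT 50;
```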
| stavros wrote: | I cannot tell you how much I love Zulip, but I can tell you | that I have no friends any more because everyone is tired of me | evangelizing it. | brightball wrote: | All of this. It's such a good operational experience that I | will actively fight against the introduction of a dedicated | search tool unless it's absolutely necessary. | rattray wrote: | TBH I hadn't known you could do weighted ranking with Postgres | search before. | | Curious there's no mention of zombodb[0] though, which gives you | the full power of elasticsearch from within postgres (with | _consistency_, no less!). You have to be willing to tolerate slow | writes, of course, so using postgres' built-in search | functionality still makes sense for a lot of cases. | | [0] https://github.com/zombodb/zombodb | craigkerstiens wrote: | Zombo is definitely super interesting, and we should probably | add a bit in the post about it. Part of the goal here was to show | that you can do a LOT with Postgres without adding one more system | to maintain. Zombo is great if you have Elastic around but want | Postgres as the primary interface - but what if you don't want to | maintain Elastic? | | My ideal, though, is always to start with Postgres and then see | if it can solve my problem. I would never say Postgres is the best | at everything it can do, but for most things it is good enough | without having another system to maintain and wear a pager for. | rattray wrote: | Zombo does at least promise to handle "complex reindexing | processes" for you (which IME can be very painful), but yeah, | I assume you'd still have to deal with shard rebalancing, | hardware issues, network failures or latency between postgres | and elastic, etc etc.
| | The performance and cost implications of Zombo are more | salient tradeoffs in my mind - if you want to index one of | the main tables in your app, you'll have to wait for a | network roundtrip and a multi-node write consensus on every | update (~150ms or more[0]), you can't `CREATE INDEX | CONCURRENTLY`, etc. | | All that said, IMO the fact that Zombo exists makes it easier | to pitch "hey, let's just build search with postgres for now, | and if we ever need ES's features, we can easily port it to | Zombo without rearchitecting our product". | | [0] https://github.com/zombodb/zombodb/issues/640 | zombodb wrote: | I wonder what the ZomboDB developers are up to now? What | great text-search-in-postgres things could they be secretly | working on? | syoc wrote: | My worst search experiences always come from the features | applauded here. Word stemming and removing stop words are a big | hurdle when you know what you are looking for but get flooded by | noise because some part of the search string was ignored. Another | issue is having to type out a full word before you get a hit in | dynamic search boxes (looking at you, Confluence). | Someone1234 wrote: | I'd argue that isn't a problem with the _feature_, but with a | thoughtless implementation. | | A _good_ implementation will weigh verbatim results highest | before considering the stop-word-stripped or stemmed version. | Configuring to_tsvector() to not strip stop words or apply | stemming is, in my opinion, a little clunky in | Postgres: you'll want to make a new [language] dictionary and | then call to_tsvector() using your new dictionary as the first | parameter.
| | After you've set up the dictionary globally, this would look | _something_ like: | | setweight(to_tsvector('english_no_stem_stop', col), 'A') || | setweight(to_tsvector('english', col), 'B') | | I think blaming Postgres for adding stemming/stop-word support | because it can be [ab]used for a poor search user experience is | like blaming a hammer for a poorly built home. It is just a | tool; it can be used for good or evil. | | PS - You can do a verbatim search without using to_tsvector(), | but that cannot be easily passed into setweight() and you | cannot use features like ts_rank(). | shakascchen wrote: | No fun doing it for Chinese, especially for traditional Chinese. | | I had to install software, but on Cloud SQL you can't. You have to | do it on your own instances. | rattray wrote: | Something that's missing from this, which I'm curious about, is how | far _can't_ postgres search take you? | | That is, what tends to be the "killer feature" that makes teams | groan and set up Elasticsearch because you just can't do it in | Postgres and your business needs it? | | Having dealt with ES, I'd really like to avoid the operational | burden if possible, but I wouldn't want to choose an intermediary | solution without being able to say, "keep in mind we'll need to | budget a 3-mo transition to ES once we need X, Y, or Z". | nostrademons wrote: | Used to work on Google Search, used ES extensively for a | startup I founded (which was sort of quasi-search... it was | built around feed ranking, where the query is constant and a | stream of documents is constantly coming in), and have also | used Postgres extensively in other companies. | | The big problem with all the off-the-shelf search solutions | (RDBMS full-text search, ES, Algolia) is that _search ranking | is a complicated and subtle problem_, and frequently depends | on signals that are _not in the document itself_.
Google's big | insight is that how _other_ people talk about a website is more | important than how the website talks about itself, and its | ranking algorithm weights accordingly. | | ES has the basic building blocks to construct such a ranking | algorithm. In terms of fundamental infrastructure, I found ES to | be just as good as Google, and better in some ways. But its | out-of-the-box ranking function _sucks_. Expect to put a domain | expert just on search ranking and evaluation to get decent | results, and they're going to have to delve pretty deeply into | advanced features of ES to get there. | | AFAICT Postgres search only lets you tweak the ranking | algorithm by assigning different weights to fields, assuming | that the final document score is a linear combination of | individual fields. This is usually not what you want - it's | pretty common to have non-linear terms from different signals. | sandGorgon wrote: | it doesn't do TF-IDF or BM25 - the current state of the art in | search relevance algorithms. | | that's where it can't be used for anything serious. | SigmundA wrote: | That's why this really needs to get merged: | | https://github.com/postgrespro/rum | rattray wrote: | TF/IDF is listed as a TODO on that repo, and I don't see a | PR which promises to provide it. | Thaxll wrote: | PG is average at best for text search; it's not even good. | nextaccountic wrote: | If your search needs outgrow Postgres' native search engine, | you can use Postgres search with an ElasticSearch backend, | using Zombo | | https://github.com/zombodb/zombodb | | It basically gives you a new kind of index (create index .. | using zombodb(..) ..) | mjewkes wrote: | Exact phrase matching. This generally requires falling back to | ILIKE, which is not performant. | e12e wrote: | Even if using ILIKE over the result of an imprecise query?
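As the replies note, quoted phrases are in fact supported without an ILIKE fallback; a sketch (the `posts` table and `search` column are illustrative):

```sql
-- websearch_to_tsquery turns a quoted string into a phrase query
-- built from the <-> ("followed by") operator:
SELECT websearch_to_tsquery('english', '"nosql database"');
--      'nosql' <-> 'databas'

-- The phrase query uses the same tsvector GIN index as a normal search.
SELECT id, title
FROM posts
WHERE search @@ websearch_to_tsquery('english', '"nosql database"');
```

Note the caveat raised below: matching happens on lexemes after parsing and stemming, so strings the parser splits unusually (URLs, paths) may not behave like literal substring matches.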
| [deleted] | simonw wrote: | Exact phrase searching works in PostgreSQL full-text search - | here's an example: | https://simonwillison.net/search/?q=%22nosql+database%22 | | I'm using search_type=websearch https://github.com/simonw/simonwillisonblog/blob/a5b53a24b00... | | That's using websearch_to_tsquery(), which was added in | PostgreSQL 11: https://www.postgresql.org/docs/11/textsearch-controls.html#... | SahAssar wrote: | I had some issues with this recently, as I couldn't get a | FTS query to find something looking like a path or url. As | an example from your site: | https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains | the exact text https://www.niche-museums.com/, but I cannot | find a way to search for that phrase exactly (trying | https://simonwillison.net/search/?q=%22www.niche-museums.com... | works though). I tried both in my own psql setup and on your | site, and it seems like exact phrase searching is limited to | the language used, even if it would be an exact string match. | | Are there any workarounds for that? | rattray wrote: | Pardon my ignorance - what is exact phrase matching and why | doesn't it work with tsvector? | amichal wrote: | Postgres was not good at (for us): IDF and other corpus-based | relevancy measures (had to hand-roll); thesaurus and | misspelling handling (again possible, of course, with | preprocessing and by adding config files); non-Latin-alphabet | languages, e.g. Arabic, which needed filesystem access (we used | aws rds so couldn't do it) to add a dictionary-based | stemmer/word breaker. | | We used es or solr for those cases.
For English FTS with 100k | documents, doing it in PG is super easy and one less dependency. | some_developer wrote: | Anecdotal note: | | A few years ago we added yet another part to our product and, | whilst ES worked "okay", we got a bit weary of ES due to "some | issues" (some bug in the architecture keeping things not | perfectly in sync, certain queries with "joins" of types taking | long, demand on HW due to the size of the database, no proper | multi-node setup due to $$$ and time constraints, etc.; small | things piling up over time). | | Bright idea: let's see how far Postgres, which is our primary | datastore, can take us! | | Unfortunately, the feature never made it fully into production. | | We thought that on paper, the basic requirements were ideal: | | - although the table has multiple hundreds of millions of | entries, natural segmentation by customer IDs made individual | result sets much smaller | | - no weighted search results needed: datetime-based ordering is | perfect for this use-case, so we thought it would be easy to come | up with the "perfect index [tm]" | | Alas, we didn't even get that far: | | - we identified ("only") 2 columns necessary for the search => | "yay, easy" | | - one of those columns was multi-language; though we didn't | have specific requirements and did not have to deal with | language-specific behaviour in ES, we had to decide on one | language for the TS vectorization (the details of why "simple" | wasn't appropriate for this column elude me; it certainly was | for the other one) | | - unsure which one, or both, we would need, we created both | indices for that column (the difference being the "language") | | - we started out with a GIN index (see | https://www.postgresql.org/docs/9.6/textsearch-indexes.html ) | | - creating a single index took > 15 hours | | But once the second index was done - we had not even rolled out | the feature in the app itself (which at this point was still an | ever-changing MVP) - we suddenly got
hit by a lot of | customer complaints that totally different operations on this | table (INSERTs and UPDATEs) were getting slow (like | 5-15 seconds slow, for something which usually takes tiny ms). | | Backend developer eyes were wide open O_O | | But since we knew that the second index had just finished, after | checking the Postgres logs we decided to drop the FTS indices | and, lo and behold, "performance problem solved". | | Communication lines were very short back then (still are today, | actually) and it was promptly decided we'd just cut the search | functionality from this new part of the product and be done | with it. This also solved the problem, basically (guess there's | some "business lesson" to be learned here too, not just | technical ones). | | Since no one within the company argued against this decision, | we did not spend more time analyzing the details of the | performance issue, though I would have loved to dig into this | and get an expert on board to dissect it. | | -- | | A year or so later, I had a bit of free time and analyzed one | annoying, recurring slow-UPDATE-query problem on a completely | different table, also involving FTS on a single column, there | too using a GIN index. That's when I stumbled over | https://www.postgresql.org/docs/9.6/gin-implementation.html | | > Updating a GIN index tends to be slow because of the | intrinsic nature of inverted indexes: inserting or updating one | heap row can cause many inserts into the index (one for each | key extracted from the indexed item). As of PostgreSQL 8.4, GIN | is capable of postponing much of this work by inserting new | tuples into a temporary, unsorted list of pending entries. When | the table is vacuumed or autoanalyzed, or when | gin_clean_pending_list function is called, or if the pending | list becomes larger than gin_pending_list_limit, the entries | are moved to the main GIN data structure using the same bulk | insert techniques used during initial index creation.
This | greatly improves GIN index update speed, even counting the | additional vacuum overhead. Moreover the overhead work can be | done by a background process instead of in foreground query | processing. | | In this particular case I was able to solve the occasional slow | UPDATE queries with "FASTUPDATE=OFF" on that table and, | thinking back on the other issue, it might have solved or | minimized that impact too. | | Back to the original story: yep, this one table can have | "peaks" of inserts, but it's far from "facebook scale" or | whatever; basically 1.5k inserts / second was the absolute | rare peak I measured, and usually it's in the <500 area. But I | guess it was enough in this scenario to add latency within the | database. | | -- | | Casting my memory back further, I was always "pro" trying to | minimize / get rid of ES after learning about | http://rachbelaid.com/postgres-full-text-search-is-good-enou... | even before we used any FTS feature. It also mentions the | GIN/GiST issue but alas, in our case: ElasticSearch is good | enough and, despite the warts we've had with it, actually easier | to reason about (so far). | grncdr wrote: | My experience has been that sorting by relevance ranking is | quite expensive. I looked into this a bit and found | https://github.com/postgrespro/rum (and some earlier slide | decks about it), which explains why the GIN index type can't | support searching and ranking itself (meaning you need to do | heap scans for ranking). This is especially problematic if your | users routinely do searches that match a lot of documents and | you only want to show the top X results. | | Edit: if any of the Crunchy Data people are reading this: | support for RUM indexes would be super cool to have in your | managed service. | thom wrote: | Top X queries should be optimised with gist indexes. | grncdr wrote: | Ah, sounds like I need to update myself on this. I don't | recall what the other trade-offs vs GIN indexes are.
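The pending-list knobs described in the quoted docs can be applied per index; a sketch (the index name `posts_search_idx` is illustrative):

```sql
-- Disable the GIN pending list so every write pays its index cost up
-- front: slightly slower inserts/updates, but no multi-second stalls
-- when the accumulated pending list is flushed.
ALTER INDEX posts_search_idx SET (fastupdate = off);

-- Alternatively, keep fastupdate on and flush the pending list
-- explicitly during quiet periods (PostgreSQL 9.6+):
SELECT gin_clean_pending_list('posts_search_idx'::regclass);

-- Or cap how large the pending list may grow before an automatic
-- flush (value in kB; the server-wide default is gin_pending_list_limit):
ALTER INDEX posts_search_idx SET (gin_pending_list_limit = 512);
```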
| craigkerstiens wrote: | Actually that's a great suggestion. We need to take a deeper | look at the code itself and our ability to support the extension, | but we'll definitely take it and evaluate it at some of our | upcoming roadmap planning for Crunchy Bridge | (https://www.crunchybridge.com). | grncdr wrote: | Cool cool | | If I could have your other ear for a moment: support for | _any_ EU-based cloud provider would be super nice. Ever | since the Privacy Shield fig leaf was removed, questions | about whether we store any data under US jurisdiction have | become a lot more frequent. | craigkerstiens wrote: | Do you have a preferred one? | | We're currently on the big 3 US ones and in the process of | working on our 4th provider; our goal is very much to | deliver the best Postgres experience whether on bare | metal/on-premise or in the cloud, self hosted or fully | managed. | | Edit: Always feel free to reach out directly. I'm always | happy to spend time with anyone that has questions, and it's | usually pretty easy to track me down. | xcambar wrote: | This is an anecdote, not proper feedback, since I wasn't directly | involved in the topic. | | My company relied on PG as its search engine, and everything | went well from POC to production. After a few years of | production, and new clients requiring volumes of data an order | of magnitude above our comfort zone, things went south pretty | fast. | | Not many months, but many sweaty weeks of engineering later, we | switched to ES and we're not looking back. | | tl;dr: even with great DB engineers (which we had), I'd suggest | that scale is a strong limiting factor on this feature. | dante_dev wrote: | Can you tell us the scale you're talking about? Getting | good-enough results with ~1 billion rows here. | lacksconfidence wrote: | Can you tell us about your scoring function?
Selecting 40M | results from a dataset of ~1B and returning the top 10 | based on some trivial scoring function is easy; any | reasonable search system will handle that. The problem is | when you have to run your scoring function on all 40M | matching docs to decide which 10 are the most relevant. | It's even more of a problem when your scoring function | captures some of the complexities of the real world, rather | than something trivial like tf/idf or bm25. | _009 wrote: | If you are looking to do semantic search (cosine similarity) + | filtering (SQL) on data that can be represented as vectors | (audio, text, video, bio), I suggest | https://github.com/ankane/pgvector | iav wrote: | I moved from ElasticSearch to PG FTS in production, and here | are the things I had to give up: | | 1. PostgreSQL has a cap on column length, and the search index | has to be stored in a column. The length of the column is | indeterminate - it stores every word in the document and | where it's located, so a short document with very many unique | words (numbers are treated as words too) can easily burst the | cap. This means you have to truncate each document before | indexing it and pray that your cap is set low enough. You can | use multiple columns, but that slows down search and makes | ranking a lot more complicated. I truncate documents at 4MB. | | 2. PostgreSQL supports custom dictionaries for specific | languages, stemmers, and other nice tricks, but none of those | are supported by AWS because the dictionary gets stored as a | file on the filesystem (it's not a config setting). You can | still have custom rules, like whether or not numbers count as | words. | gk1 wrote: | Semantic search using text embeddings. With Open Distro for | Elasticsearch you can store your text embeddings and then | perform a nearest-neighbor search[1] to find the most similar | documents using cosine similarity[2]. Elasticsearch (vanilla) | will get this feature with 8.0.
| | If migrating to ES makes you groan you can use a managed | service like Pinecone[3] (disclaimer: I work there) just for | storing and searching through text embeddings in-memory through | an API while keeping the rest of your data in PG. | | [1] Nearest-neighbor searches in Open Distro: | https://opendistro.github.io/for-elasticsearch-docs/docs/knn... | | [2] More on how semantic similarity is measured: | https://www.pinecone.io/learn/semantic-search/ | | [3] https://www.pinecone.io | matsemann wrote: | I'm not well versed on modern pg for this use, but when I | managed a Solr instance ~5 years ago, it was the ranking of the | results that was the killer feature. Finding results fast most | systems can do. Knowing which results to present is harder. | | Our case was a domain specific knowledge base, with certain | terms occurring often in many articles. Searching for a term | could bring up thousands of results, but few of them were | actually relevant to show in context of the search, they just | happened to use the term. | jka wrote: | "Faceted search"[1] (aka aggregates in Elasticsearch) tends to | be a popular one, to provide user-facing content navigation. | | That said, simonw has been on the case[2] demonstrating an | implementation of that using Django and PostgreSQL. | | [1] - https://en.wikipedia.org/wiki/Faceted_search | | [2] - https://simonwillison.net/2017/Oct/5/django-postgresql- | facet... | craigds wrote: | This is the key one for us that makes Postgres a non-starter | for FTS (We use postgres for everything _else_ ) | | We begrudgingly use Solr instead (we started before ES was | really a thing and haven't found a _need_ to switch yet) | | When you get more than about two different _types_ of filters | (e.g. types of filters could be 'tags', 'categories', | 'geotags', 'media type', 'author' etc), the combinatorial | explosion of Postgres queries required to provide facet | counts gets unmanageable. 
| | For example, when I do a query filtered by
| `?tag=abc&category=category1`, I need to do these queries:
| - `... where tag = 'abc' and category_id = 1` (the current results)
| - `count(*) ... where tag = 'def' and category_id = 1` (for each other tag present in the results)
| - `count(*) ... where tag = 'abc' and category_id = 2` (for each other category present in the results)
| - `count(*) ... where category_id = 1`
| - `count(*) ... where tag = 'abc'`
| - `count(*)`
| | There are certainly smarter ways to do this than lots of tiny | queries, but all this complexity still ends up somewhere and | isn't likely to be great for performance. | | Whereas solr/elasticsearch have faceting built in and handle | this with ease. | brightball wrote: | Streaming data ingestion is the biggest. If you're constantly | writing data to be searched, this is where ES really outshines | everything. | kayodelycaon wrote: | If I recall correctly, Postgres search doesn't scale well. Not | sure where it falls apart, but it isn't optimized in the same | way something like Solr is. | ezekg wrote: | I have a table with over a billion rows, and most full-text | searches still respond in around a few milliseconds. I think | this will depend on a lot of factors, such as proper | indexing, and filtering down the dataset as much as possible | before performing the full-text ops. I've spent a | considerable amount of time optimizing these queries, | thanks to tools like PgMustard [0]. Granted, I do still have | a couple of slow queries (1-10s query time), but that's likely | due to very infrequent access, i.e. a cold cache. | | I will say, if you use open source libraries like pg_search, | you are unlikely to ever have performant full-text search. | Most full-text queries need to be written by hand to actually | utilize indexes, instead of the query soup that these types | of libraries output. (No offense to the maintainers -- it's | just how it be when you create a "general" solution.)
| | [0]: https://pgmustard.com | snack-boye wrote: | Silly question; I'm using pg right now and most of my | queries are something like this (in english): | | Find me some results in my area that contain these | categoryIds and are slotted to start between now and the next | 10 days. | | Since it's already quite a filtered set of data, would that | mean I should have few issues adding pg text search, because | with correct indexing and all, it will usually be applied to | a small set of data? | | Thanks | ezekg wrote: | I'm not a DBA, so I can't say for certain, simply due to a | gap in my knowledge. But in my experience, it depends on | a lot of factors. Sometimes pg will use an index before | performing the search ops; other times a subquery is | needed. Check out pgmustard and dig into your slow query | plans. :) | john-shaffer wrote: | You might be just fine adding an unindexed tsvector | column, since you've already filtered down the results. | | The GIN indexes for FTS don't really work in conjunction | with other indices, which is why | https://github.com/postgrespro/rum exists. Luckily, it | sounds like you can use your existing indices to filter | and let postgres scan for matches on the tsvector. The | GIN tsvector indices are quite expensive to build, so | don't add one if postgres can't make use of it! | kayodelycaon wrote: | Oh cool. I was right and wrong. Thanks! | fizx wrote: | Hi, I started an Elasticsearch hosting company (since sold) and | have built products on PG's search and SQLite's FTS search. | | There are, in my mind, two reasons not to use PG's search. | | 1. Elasticsearch allows you to build sophisticated linguistic | and feature-scoring pipelines to optimize your search quality. | This is not a typical use case in PG. | | 2. Your primary database is usually your scaling bottleneck, | even without adding a relatively expensive search workload into | the mix.
A full-text search tends to be about as expensive as | a 5% table scan of the related table. Most DBAs don't like | large scan workloads. | kurko wrote: | I agree with reason 1, but reason 2 is an answer for, "should | I use PG search in the same PG instance I already have", and | that's a different discussion. You can set up a replica for | that. | quietbritishjim wrote: | Do you see any particular reasons to use or not use ZomboDB | [1]? It claims to let you use ElasticSearch from PG | seamlessly, e.g. it manages coherency of which results ought | to be returned according to the current transaction. (I've | never quite ended up needing to use ES but it's always seemed | to me I'd be likely to need ZomboDB if I did.) | | [1] https://github.com/zombodb/zombodb | mixmastamyk wrote: | You can do anything, anything at all, at https://zombo.com/ | | _"The only limit, is yourself..."_ | FpUser wrote: | This one is my favorite for many, many years. Good for | relaxing. | habibur wrote: | It's still around? | mumblemumble wrote: | That seems like one for the philosophers. If you | completely rewrite a Flash site in HTML5, but it looks | the same and has the same URL, is it still the same site? | da_chicken wrote: | So, YouTube? | mixmastamyk wrote: | Klicken-Sie Link. | Conlectus wrote: | Though I have only a recreational interest in datastores, I | would be pretty wary of a service that claims to strap a | Consistent-Unavailable database (Postgres) to an | Inconsistent-Available database (ElasticSearch) in a useful | way. | | Doing so would require ElasticSearch to reach consensus on | every read/write, which would remove most of the point of a | distributed cluster. Despite this, ZomboDB's documentation | says "complex aggregate queries can be answered in parallel | across your ElasticSearch cluster".
| | They also claim that transactions will abort if | ElasticSearch runs into network trouble, but the | ElasticSearch documentation notes that writes during | network partitions don't wait for confirmation of | success[1], so I'm not sure how they would be able to | detect that. | | In short: I'll wait for the Jepsen analysis. | | [1] https://www.elastic.co/blog/tracking-in-sync-shard-copies#:~... | zombodb wrote: | > Doing so would require ElasticSearch to reach consensus | on every read/write | | ZomboDB only requires that ES have a view of its index | that's consistent with the active Postgres transaction | snapshot. ZDB handles this by ensuring that the ES index | is fully refreshed after writes. | | This doesn't necessarily make ZDB great for high-update | loads, but that's not ZDB's target usage. | | > They also claim that transactions will abort if | ElasticSearch runs into network trouble... | | I had to search my own repo to see where I make this | claim. I don't. I do note that network failures between | PG & ES will cause the active Postgres xact to abort. On | top of that, any error that ES is capable of reporting | back to the client will cause the PG xact to abort -- | ensuring consistency between the two. | | Because the ES index is properly refreshed as it relates | to the active Postgres transaction, all of ES' aggregate | search functions are capable of providing proper | MVCC-correct results, using the parallelism provided by the ES | cluster. | | I don't have the time to detail everything that ZDB does | to project Postgres xact snapshots on top of ES, but the | above two points are the easy ones to solve. | NortySpock wrote: | Regarding point 2: Shouldn't you be moving your search | queries from your transaction server to a separate analysis | or read-replica server? OLTP copies to OLAP and suddenly | you've separated these two problems. | blowski wrote: | ...and created a new one if the projection is only | eventually consistent.
No free lunches here. | zepolen wrote: | How is that different from running Postgres and | Elasticsearch separately? | blowski wrote: | It's not. | | It's a difference between having separate OLAP and OLTP | databases, which is what the parent post suggested. | justinclift wrote: | Probably skill set of staff? | | For example, if a place has developed fairly good | knowledge of PG already, they can "just" (!) continue | developing their PG knowledge. | | Adding ES into the mix, though, introduces a whole new | thing that needs to be learned, optimised, etc. | pmarreck wrote: | > sophisticated linguistic and feature scoring pipelines to | optimize your search quality | | You CAN score results using setweight, although it's likely | not as sophisticated as Elasticsearch's | | https://www.postgresql.org/docs/9.1/textsearch-controls.html | | Disclaimer: I use Postgres fulltext search in production, | very happy with it, although maintaining the various triggers | and stored procs it requires to work becomes cumbersome | whenever you have to write a migration that alters any of | them (or that in fact touches any related field, as you may | be required to drop and recreate all of the parts in order | not to violate referential integrity) | | It is certainly nice not having to worry about one additional | dependency when deploying, though | thom wrote: | We get really nice results with GiST indexes (gist_trgm_ops) | searching across multiple entity types to do top X queries. | It's very useful to be able to make a stab at a difficult-to-spell | foreign football player's name, possibly with lots of diacritics, | and get quick results back. I'm always surprised when I find a | search engine on any site that is so unkind as to make you spell | things exactly.
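The trigram setup described above can be sketched like this -- `pg_trgm`, `gist_trgm_ops`, and the `%`/`<->` operators are real Postgres features, but the `players(name)` table is hypothetical:

```sql
-- Enable trigram matching and build a GiST index so fuzzy lookups
-- don't require a full scan.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX players_name_trgm_idx
  ON players USING GIST (name gist_trgm_ops);

-- "%" matches rows whose trigram similarity to the query exceeds the
-- threshold (0.3 by default, tunable via pg_trgm.similarity_threshold);
-- "<->" orders by trigram distance so the closest spellings come first.
SELECT name, similarity(name, 'Zlatan Ibrahimovich') AS sim
FROM players
WHERE name % 'Zlatan Ibrahimovich'
ORDER BY name <-> 'Zlatan Ibrahimovich'
LIMIT 10;
```

Because both the `%` filter and the `<->` ordering are served by the GiST index, a misspelled, diacritic-heavy name can still return near matches quickly, which is the behavior the comment above is describing.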
| bityard wrote: | I know Postgres and SQLite have mostly different purposes but | FWIW, SQLite also has a surprisingly capable full-text search | extension built right in: https://www.sqlite.org/fts5.html | jjice wrote: | It's very impressive, especially considering the SQLite version | you're using probably already has it enabled. I use it | for a small site I run and it works fantastically. It's a little | finicky with deletes and updates due to virtual tables in SQLite, | but definitely impressive and has its uses. | SigmundA wrote: | Keep wondering if RUM indexes [1] will ever get merged for faster | and better ranking (TF/IDF). Really would make PG a much more | complete text search engine. | | [1] https://github.com/postgrespro/rum ___________________________________________________________________ (page generated 2021-07-27 23:00 UTC)