[HN Gopher] An automatic indexing system for Postgres
___________________________________________________________________

An automatic indexing system for Postgres

Author : ako
Score  : 116 points
Date   : 2023-11-17 06:38 UTC (1 day ago)

(HTM) web link (pganalyze.com)
(TXT) w3m dump (pganalyze.com)

| westurner wrote:
| > _A fundamental decision we've made for the pganalyze Indexing
| Engine is that we break down queries into smaller parts we call
| "scans". Scans are always on a single table, and you may be
| familiar with this concept from reading an EXPLAIN plan. For
| example, in an EXPLAIN plan you could see a Sequential Scan or
| Index Scan, both representing a different scan method for the
| same scan on a given table._
|
| Sequential scan == Full table scan:
| https://en.wikipedia.org/wiki/Full_table_scan
|
  | financltravsty wrote:
  | Yes, and a neat thing about indexes: sometimes it's faster to
  | do a sequential scan than to load an index into memory.

| gavinray wrote:
| > This is powered by a modified version of the Postgres planner
| that runs as part of the pganalyze app. This modified planner can
| generate EXPLAIN-like data from just a query and schema
| information - and most importantly, that means we can take query
| statistics data from pg_stat_statements and generate a generic
| query plan from it. You can read more about that in our blog post
| "How we deconstructed the Postgres planner".
|
| Having something like this available as a library inside of
| Postgres seems really beneficial for tool authors. I wonder what
| the odds of getting it upstreamed are?
|
  | lfittl wrote:
  | Author here - what we call "pg_plan" internally is essentially
  | very similar in spirit to what we do with pg_query [0], but the
  | difference is that we pull in a lot more code from Postgres,
  | and have more overrides in places that are not needed for the
  | use case (e.g. MVCC handling, etc).
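[Editor's note: the point above about sequential scans sometimes beating index scans follows from the planner's cost arithmetic. Below is a toy sketch of that trade-off - it is not the real Postgres planner, only simplified arithmetic using the default cost constants from postgresql.conf (`seq_page_cost`, `random_page_cost`, `cpu_tuple_cost`); the table sizes and the crude "one random heap fetch per matching row" assumption are made up for illustration.]

```python
# Toy illustration (NOT the real Postgres planner): why a sequential
# scan can beat an index scan once a query matches many rows.
SEQ_PAGE_COST = 1.0      # cost of a sequentially-read page (PG default)
RANDOM_PAGE_COST = 4.0   # cost of a randomly-read page (PG default)
CPU_TUPLE_COST = 0.01    # per-row CPU processing cost (PG default)

def seq_scan_cost(pages, rows):
    """Read every page in order, process every row."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def index_scan_cost(matching_rows, index_pages):
    """Simplified: read the index pages, then assume each matching
    row triggers one random heap page fetch."""
    return (index_pages + matching_rows) * RANDOM_PAGE_COST \
        + matching_rows * CPU_TUPLE_COST

pages, rows = 1_000, 100_000
for selectivity in (0.001, 0.5):
    matching = int(rows * selectivity)
    seq = seq_scan_cost(pages, rows)
    idx = index_scan_cost(matching, 50)
    # selectivity=0.001: seq=2000 index=601   -> index wins
    # selectivity=0.5:   seq=2000 index=200700 -> seq scan wins
    print(f"selectivity={selectivity}: seq={seq:.0f} index={idx:.0f}")
```

With few matching rows the index wins easily; once half the table matches, the random heap fetches make the index path far more expensive than just reading the table front to back.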
  |
  | My gut feeling tells me that the chances of having an upstream
  | library that contains the parser/parse analysis/planner are
  | slim, mainly because there is a lot of entanglement with
  | reading files on disk, memory management, etc. - I suspect one
  | of the pushbacks would be that it would complicate development
  | work for Postgres itself, to the point it's not worth the
  | benefits.
  |
  | For the pg_query library on the other hand, I have hopes that we
  | can upstream this eventually - there are enough third-party
  | users out there to clearly show the need, and it's much more
  | contained (i.e. raw parser + AST structs). Hopefully something
  | we can spend a bit of time on next year.
  |
  | [0]: https://github.com/pganalyze/libpg_query

| zubairq wrote:
| I might try something like this indexing method

| whalesalad wrote:
| I've often wondered why we don't put a thin proxy in front
| of Postgres (or augment pgbouncer or other proxies) to collect
| telemetry on all real-world queries against the db. Being a
| middleman there would give you lots of visibility into every
| single query being made, as well as how long it takes. I guess the
| modern integrated stats tools help.
|
| I think the real problem is not indexing correctly, but rather
| modeling your problem correctly. You can throw indexes at the
| problem, but that is sometimes just a bandaid over a more integral
| issue, which would be inventing a new schema, new access patterns,
| etc.
|
  | phamilton wrote:
  | We have this with APM monitoring, where every SQL query in the
  | application becomes a span in a trace. It is very useful. We
  | can quickly see the specific parameters that make an otherwise
  | fast query slow.
|
  | ako wrote:
  | Most databases will allow you to collect statistics on every
  | query run by users, e.g., pg_stat_statements:
  | https://www.postgresql.org/docs/current/pgstatstatements.htm...
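[Editor's note: once you have pg_stat_statements data, finding index candidates is mostly a ranking exercise over its rows. A minimal sketch of that idea, assuming the rows have already been fetched from the database - the sample data here is invented, but the field names (`query`, `calls`, `total_exec_time` in milliseconds) match the pg_stat_statements view on Postgres 13+.]

```python
# Sketch: rank rows pulled from pg_stat_statements to spot queries
# that are worth investigating for a missing index. Sample rows are
# made up; field names mirror the pg_stat_statements view (PG 13+).
rows = [
    {"query": "SELECT * FROM orders WHERE customer_id = $1",
     "calls": 120_000, "total_exec_time": 540_000.0},
    {"query": "SELECT * FROM orders WHERE id = $1",
     "calls": 300_000, "total_exec_time": 9_000.0},
    {"query": "UPDATE sessions SET seen_at = $1 WHERE token = $2",
     "calls": 80_000, "total_exec_time": 400_000.0},
]

def candidates(rows, min_mean_ms=1.0):
    """Return (mean runtime, query) pairs for queries that are slow
    on average, slowest first. Fast primary-key lookups drop out."""
    out = []
    for r in rows:
        mean_ms = r["total_exec_time"] / r["calls"]
        if mean_ms >= min_mean_ms:
            out.append((mean_ms, r["query"]))
    return sorted(out, reverse=True)

for mean_ms, query in candidates(rows):
    print(f"{mean_ms:6.2f} ms  {query}")
```

In this invented sample, the primary-key lookup (0.03 ms mean) is filtered out, while the two scans averaging several milliseconds over tens of thousands of calls surface as candidates - exactly the kind of query where a `customer_id` or `token` index would pay off.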
  |
  | A relational database really shines when you have all sorts of
  | queries on tables, so it's hard to optimize your model for
  | every potential future query. Precalculating these and caching
  | the results in separate tables is not very efficient, and
  | requires you to predetermine every query you want to answer.
  |
  | Indexing in this situation is often the better alternative.
  |
    | szundi wrote:
    | Lots of apps however do the same N kinds of queries, but the
    | data under them can change over time without anyone noticing
    |
      | ako wrote:
      | So? Just make sure you regularly refresh the statistics...

  | ellisv wrote:
  | I definitely agree on problems not being modeled correctly,
  | although I also frequently see application developers forgo
  | indexes. I think this is fundamentally a lack of understanding
  | or inexperience.

  | jes5199 wrote:
  | you can get a long way just checking the slow query log every
  | few days

  | bbminner wrote:
  | Exactly, I have always been curious about the same thing - if
  | we have live statistics on queries, then what stops us from
  | cloning the data (to correctly capture statistics such as skew,
  | etc.), running these live queries with a bunch of knobs tuned
  | automatically (which indexes, on which columns, etc.), and
  | reporting the cost-throughput Pareto optimum to the user? Is
  | the configuration space too large to explore randomly? Is its
  | structure too complex to produce automatic suggestions?

  | ahachete wrote:
  | Have a look at the Postgres filter for the Envoy proxy: [1]
  | (blog announcement post).
  |
  | While it doesn't capture query performance as of today, it
  | collects notable telemetry for Postgres in exactly the way you
  | mention: because the traffic flows through it, it can collect
  | data "for free", and definitely without taking any resources
  | from the upstream database.
  |
  | It can also offload SSL from Postgres. The filter could be
  | extended for other use cases.
  |
  | Disclaimer: my company developed this plugin for Envoy with
  | the help of Envoy's awesome community.
  |
  | [1]:
  | https://www.cncf.io/blog/2020/08/13/envoy-1-15-introduces-a-...
|
    | nijave wrote:
    | MaxScale does this for MariaDB

| monlockandkey wrote:
| Any open source alternatives to PGAnalyze?
|
  | brunoqc wrote:
  | https://github.com/ankane/dexter

| VeejayRampay wrote:
| the guys at pganalyze do a fantastic job explaining the inner
| workings of PostgreSQL on their YouTube channel, especially the
| "5 minutes of Postgres" series

| colincooke wrote:
| I had no idea what was going on under the hood, but used
| PGAnalyze for about 6 months after launching a new product. It
| was excellent at suggesting indexes and monitoring unused ones.
|
| However, after a few months the ROI crept down until it wasn't
| worth it anymore (access patterns stabilized). I'm tempted to
| bring it back once in a while, but the price tag keeps me from
| having it always on.
|
  | chatmasta wrote:
  | That makes sense, if the structure of your queries is largely
  | static per commit of your codebase. You'd probably get more
  | benefit out of a tool like this by running it as part of a CI
  | pipeline, since that's the best time to test code changes,
  | including new queries. But then the challenge becomes
  | simulating user activity and traffic levels during the CI
  | pipeline. That's a solvable problem, but it's not as easy as
  | just observing the real traffic in your prod network.

| lfittl wrote:
| OP here (happy to see this on HN!)
| - if you're interested in our
| work in this area, I hosted a webinar earlier this week where we
| walked through a new constraint programming based system
| (utilizing CP-SAT) for Postgres index selection that we're
| starting to roll out in pganalyze:
|
| https://www.youtube.com/watch?v=SlNQTtfjlnI
|
| Compared to the initial version, this updated version is both
| more configurable and has better handling of competing
| objectives in index selection (write overhead vs. query
| performance).
|
| If you have pganalyze set up and want to give this a try, feel
| free to send me a message (email in profile).
___________________________________________________________________
(page generated 2023-11-18 23:00 UTC)