[HN Gopher] An automatic indexing system for Postgres
       ___________________________________________________________________
        
       An automatic indexing system for Postgres
        
       Author : ako
       Score  : 116 points
       Date   : 2023-11-17 06:38 UTC (1 day ago)
        
 (HTM) web link (pganalyze.com)
 (TXT) w3m dump (pganalyze.com)
        
       | westurner wrote:
       | > _A fundamental decision we've made for the pganalyze Indexing
       | Engine is that we break down queries into smaller parts we call
       | "scans". Scans are always on a single table, and you may be
       | familiar with this concept from reading an EXPLAIN plan. For
       | example, in an EXPLAIN plan you could see a Sequential Scan or
       | Index Scan, both representing a different scan method for the
       | same scan on a given table._
       | 
       | Sequential scan == Full table scan:
       | https://en.wikipedia.org/wiki/Full_table_scan
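       | 
       | For example (a hypothetical "orders" table; EXPLAIN output
       | abbreviated), the same scan can show up either way depending
       | on whether a usable index exists:
       | 
       |     EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
       |     -- Seq Scan on orders
       |     --   Filter: (customer_id = 42)
       | 
       |     CREATE INDEX ON orders (customer_id);
       |     EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
       |     -- Index Scan using orders_customer_id_idx on orders
       |     --   Index Cond: (customer_id = 42)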
        
         | financltravsty wrote:
         | Yes, and a neat thing about indexes: sometimes it's faster to
         | do a sequential scan than load an index into memory.
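         | 
         | You can watch the planner make that call (a sketch; the
         | "orders" table and an index on "status" are made up):
         | 
         |     -- if most rows match, the planner picks the seq scan
         |     -- even though an index on "status" exists
         |     EXPLAIN SELECT * FROM orders
         |     WHERE status = 'complete';
         | 
         |     -- discourage seq scans to surface the higher
         |     -- estimated cost of the index-based plan
         |     SET enable_seqscan = off;
         |     EXPLAIN SELECT * FROM orders
         |     WHERE status = 'complete';
         |     RESET enable_seqscan;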
        
       | gavinray wrote:
       | > This is powered by a modified version of the Postgres planner
       | that runs as part of the pganalyze app. This modified planner can
       | generate EXPLAIN-like data from just a query and schema
       | information - and most importantly, that means we can take query
       | statistics data from pg_stat_statements and generate a generic
       | query plan from it. You can read more about that in our blog post
       | "How we deconstructed the Postgres planner".
       | 
       | Having something like this available as a library inside of
       | Postgres seems really beneficial for tool authors. I wonder what
       | the odds of getting it upstreamed are?
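       | 
       | (A small in-core piece of this did land in Postgres 16:
       | EXPLAIN (GENERIC_PLAN) can plan a parameterized query exactly
       | as its text appears in pg_stat_statements. The table below is
       | made up.)
       | 
       |     -- Postgres 16+: plan a query containing $n placeholders,
       |     -- without supplying parameter values
       |     EXPLAIN (GENERIC_PLAN)
       |     SELECT * FROM orders WHERE customer_id = $1;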
        
         | lfittl wrote:
         | Author here - what we call "pg_plan" internally is very
         | similar in spirit to what we do with pg_query [0]. The
         | difference is that we pull in a lot more code from Postgres,
         | and have more overrides in places that aren't needed for the
         | use case (e.g. MVCC handling).
         | 
         | My gut feeling is that the chances of an upstream library
         | containing the parser/parse analysis/planner are slim,
         | mainly because of how entangled they are with reading files
         | on disk, memory management, etc. I suspect one of the
         | pushbacks would be that it would complicate development work
         | for Postgres itself, to the point that it's not worth the
         | benefits.
         | 
         | For the pg_query library, on the other hand, I have hopes
         | that we can upstream it eventually - there are enough
         | third-party users out there to clearly show the need, and
         | it's much more contained (i.e. raw parser + AST structs).
         | Hopefully something we can spend a bit of time on next year.
         | 
         | [0]: https://github.com/pganalyze/libpg_query
        
       | zubairq wrote:
       | I might try something like this indexing method
        
       | whalesalad wrote:
       | I've often wondered why we don't put a thin tiny proxy in front
       | of psql (or augment pgbouncer, other proxies) to collect
       | telemetry on all real world queries against the db. Being a
       | middle man there would give you lots of visibility into every
       | single query being made as well as how long it takes. I guess the
       | modern integrated stats tools help.
       | 
       | I think the real problem is not indexing correctly, but
       | rather modeling your problem correctly. You can throw indexes
       | at the problem, but that is sometimes just a bandaid over a
       | more fundamental issue, which would require a new schema, new
       | access patterns, etc.
        
         | phamilton wrote:
         | We have this with APM monitoring, where every SQL query in the
         | application becomes a span in a trace. It is very useful. We
         | can quickly see the specific parameters that make an otherwise
         | fast query slow.
        
         | ako wrote:
         | Most databases will allow you to collect statistics on every
         | query run by users, e.g., pg_stat_statements:
         | https://www.postgresql.org/docs/current/pgstatstatements.htm...
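         | 
         | For example, to find the statements consuming the most
         | total time (the extension must be listed in
         | shared_preload_libraries first; column names as of
         | Postgres 13+):
         | 
         |     CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
         | 
         |     SELECT calls, mean_exec_time, total_exec_time, query
         |     FROM pg_stat_statements
         |     ORDER BY total_exec_time DESC
         |     LIMIT 10;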
         | 
         | A relational database really shines when you have all sorts
         | of queries on tables, so it's hard to optimize your model
         | for every potential future query. Precalculating these and
         | caching the results in separate tables is not very
         | efficient, and requires you to predetermine every query you
         | want to answer.
         | 
         | Indexing in this situation is often the better alternative.
        
           | szundi wrote:
           | Lots of apps, however, run the same N kinds of queries,
           | but the data under them can change over time without
           | anyone noticing
        
             | ako wrote:
             | So? Just make sure you regularly refresh the statistics...
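             | 
             | (autovacuum's autoanalyze usually covers this, but you
             | can also force it; the table name is hypothetical:)
             | 
             |     ANALYZE orders;  -- refresh planner statistics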
        
         | ellisv wrote:
         | I definitely agree on problems not being modeled correctly,
         | although I also frequently see application developers forego
         | indexes. I think this is fundamentally a lack of understanding
         | or inexperience.
        
         | jes5199 wrote:
         | you can get a long way just checking the slow query log every
         | few days
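         | 
         | (For anyone who hasn't enabled it: a minimal slow query
         | log, capturing statements over 500ms, looks like this:)
         | 
         |     ALTER SYSTEM SET log_min_duration_statement = '500ms';
         |     SELECT pg_reload_conf();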
        
         | bbminner wrote:
         | Exactly, I have always been curious about the same thing -
         | if we have live statistics on queries, then what stops us
         | from cloning the data (to correctly capture statistics such
         | as skew, etc.), running these live queries with a bunch of
         | knobs tuned automatically (which indexes, on which columns,
         | etc.), and reporting the cost-throughput Pareto optimum to
         | the user? Is the configuration space too large to explore
         | randomly? And is its structure too complex to produce
         | automatic suggestions?
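         | 
         | Part of this exists today: the hypopg extension asks the
         | real planner about a candidate index without building it
         | (a sketch; the table and column are made up):
         | 
         |     CREATE EXTENSION hypopg;
         | 
         |     -- register a hypothetical index (nothing is written)
         |     SELECT * FROM hypopg_create_index(
         |       'CREATE INDEX ON orders (customer_id)');
         | 
         |     -- plain EXPLAIN (without ANALYZE) will consider it
         |     EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
         | 
         |     SELECT hypopg_reset();  -- drop hypothetical indexes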
        
         | ahachete wrote:
         | Have a look at the Postgres filter for the Envoy proxy: [1]
         | (blog announcement post).
         | 
         | While it doesn't capture query performance as of today, it
         | collects notable telemetry for Postgres in exactly the way
         | you mention: because the traffic flows through it, it can
         | collect data "for free", without taking any resources from
         | the upstream database.
         | 
         | It can also offload SSL from Postgres. The filter could be
         | extended for other use cases.
         | 
         | Disclaimer: my company developed this plugin for Envoy with the
         | help of Envoy's awesome community.
         | 
         | [1]:
         | https://www.cncf.io/blog/2020/08/13/envoy-1-15-introduces-a-...
        
         | nijave wrote:
         | MaxScale does this for MariaDB
        
       | monlockandkey wrote:
       | Any open source alternatives to PGAnalyze?
        
       | brunoqc wrote:
       | https://github.com/ankane/dexter
        
       | VeejayRampay wrote:
       | the guys at pganalyze do a fantastic job explaining the inner
       | workings of PostgreSQL on their YouTube channel, especially
       | the "5mins of Postgres" series
        
       | colincooke wrote:
       | I had no idea what was going on under the hood but used PGAnalyze
       | for about 6 months after launching a new product. It was
       | excellent at suggesting indexes and monitoring unused ones.
       | 
       | However, after a few months the ROI crept down until it
       | wasn't worth it anymore (access patterns stabilized). I'm
       | tempted to bring it back once in a while, but the price tag
       | keeps me from having it always on.
        
         | chatmasta wrote:
         | That makes sense, if the structure of your queries is largely
         | static per commit of your codebase. You'd probably get more
         | benefit out of a tool like this by running it as part of a CI
         | pipeline, since that's the best time to test code changes
         | including new queries. But then the challenge becomes
         | simulating user activity and traffic levels during the CI
         | pipeline. That's a solvable problem, but it's not as easy as
         | just observing the real traffic in your prod network.
        
       | lfittl wrote:
       | OP here (happy to see this on HN!) - if you're interested in our
       | work in this area, I hosted a webinar earlier this week where we
       | walked through a new constraint-programming-based system
       | (utilizing CP-SAT) for Postgres index selection that we're
       | starting to roll out in pganalyze:
       | 
       | https://www.youtube.com/watch?v=SlNQTtfjlnI
       | 
       | Compared to the initial version, this updated version is both
       | more configurable and better at handling competing objectives
       | in index selection (write overhead vs query performance).
       | 
       | If you want to give this a try - whether you already have
       | pganalyze set up or just want to try it out - feel free to
       | send me a message (email in profile).
        
       ___________________________________________________________________
       (page generated 2023-11-18 23:00 UTC)