[HN Gopher] Scalable PostgreSQL Connection Pooler
       ___________________________________________________________________
        
       Scalable PostgreSQL Connection Pooler
        
       Author : sbuttgereit
       Score  : 144 points
       Date   : 2021-11-12 16:25 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Tostino wrote:
       | I appreciate the work going on to make high quality poolers for
        | Postgres...I just really wish the work was done in-core so we
        | could have built-in connection pooling, or the work done to make
        | grabbing a new connection as cheap as if you were using a pooler.
       | It sucks to have to add complexity to your stack to fix a
       | "deficiency" in Postgres' design.
       | 
        | Still, I am glad there is effort put into the project even if I
       | selfishly wish it was done in-core.
        
         | anarazel wrote:
         | There has been some work towards that - things did get cheaper
         | in 14.
         | 
         | Note that you very well might still want a separately run
         | pooler for some workloads - having local poolers on application
         | servers can be good for latency.
        
           | saurik wrote:
           | Yeah: the latency of establishing a new connection
           | fundamentally requires _at least_ one network round trip, and
           | likely a few (and so I tend to have my servers going through
           | _two_ layers of pgbouncer, one on each side of the network);
           | and using in-process pools couples your pool state to your
           | front end code updates... so, since you are going to want to
           | build this connection pooling mechanism anyway--and then are
            | likely going to want to be able to take advantage of
            | application-specific capability limitations (such as not
            | supporting connection state outside of a transaction) to
            | further improve performance--it doesn't really make much
            | sense to prioritize this inside the database server itself.
        
             | anarazel wrote:
             | > it doesn't really make much sense to prioritize this
             | inside the database server itself.
             | 
              | I wouldn't go that far. There are a lot of things that
              | are hard to do within an external pooler unless the
              | pooler uses a 1:1 connection model, which leads to most
              | setups using 1:1 connection pooling methods. That in turn
              | puts pressure on Postgres to handle large numbers of
              | connections gracefully.
        
         | Znafon wrote:
         | It is being worked on: https://www.postgresql.org/message-
         | id/flat/ac873432-31cf-d5e... !
        
         | riyadparvez wrote:
         | Can someone point to an article that explains the connection
         | issue in detail?
        
           | hn_throwaway_99 wrote:
            | The short of it is that Postgres uses a process per
            | connection, so architectures that spin up and close
            | connections frequently can have serious scalability issues.
           | 
           | Note the landing page for the AWS RDS Proxy,
           | https://aws.amazon.com/rds/proxy/ , is as good a discussion
           | as any as to why you'd want to put a pooling proxy in front
           | of Postgres.
        
           | ddorian43 wrote:
           | https://www.citusdata.com/blog/2020/10/08/analyzing-
           | connecti...
        
         | x4m wrote:
         | There's ongoing work on making Postgres connections more
         | scalable. I really hope that one day we will not need a pooler.
         | https://www.citusdata.com/blog/2020/10/08/analyzing-connecti...
        
       | jordanthoms wrote:
       | We've been using this for a while after we found pgbouncer became
       | a bottleneck - it's so much faster and has been totally reliable.
       | Thanks!
        
       | colesantiago wrote:
        | Has anyone used Yandex Cloud [0] in production? How does it
        | compare to DigitalOcean, AWS, GCP and others?
       | 
       | Also the tech from Yandex such as Clickhouse [1] is really
       | interesting, as they recently spun it out of Yandex.
       | 
       | [0] https://cloud.yandex.com/en/
       | 
       | [1] https://clickhouse.com/
        
         | x4m wrote:
          | We are also developing managed databases for the US/EU market:
         | https://double.cloud/
        
         | zaius wrote:
         | Last time I tried they wouldn't take a US credit card, which
         | made it too big of a challenge to consider
        
         | the-alchemist wrote:
         | Only used Clickhouse, although not through their Cloud
         | offering.
         | 
         | It's an impressive piece of engineering. The best column-
         | oriented DB in the open source space, I would say.
        
         | [deleted]
        
       | derefr wrote:
       | Question for anyone who knows a lot about Postgres connection
       | pooling:
       | 
       | We want to move to using a connection pooling infrastructure
       | component, but so far, we haven't been able to figure out how to
       | do so given our setup.
       | 
       | We've got a wacky data model where we have N Postgres schemas,
       | each of which has the same tables (but with different data,
       | obviously.) They're not tenants; more like separate "datasets."
       | Picture something like: each schema is a digraph, with tables
       | like "vertices", "edges", "edge_labels", etc. (Not what we're
       | doing, but it's close.)
       | 
       | In theory, we could solve our problem by just sticking all the
       | "vertices" tables together as list-partitions of one mega
       | "vertices" table keyed by dataset_id; but 1. it's kind of
       | incoherent -- the data makes no sense when combined together like
       | that, it's only valid when considered in isolation; and 2. it's
       | very operationally convenient for us to manage datasets as
       | separate schemas, re: security, naming, onlining/offlining data,
       | etc.
       | 
       | Also, while each user request is only going to interact with a
       | single dataset (schema), a given user might care about a lot of
       | datasets, and make requests about many of them at arbitrary times
       | --so, as far as I can tell, we don't have the sort of stable load
       | per dataset that would allow us to hold separate connection pools
       | per dataset.
       | 
        | _And_, since there's no way to make a schema name into a bind
        | variable, fully-qualifying our queries means generating dynamic
        | SQL strings, which is both expensive on the business layer and
        | on the SQL query planner, which can't just recognize+unify its
        | plans from the query's history from other schemas.
       | 
        | Right now, the way we query the data is that in our business
        | layer, before each query, we have a middleware that injects a
        | statement like this:
        | 
        |     SET search_path TO 'name_of_dataset', 'public';
       | 
       | We then run a non-fully-schema-qualified query, and it finds
       | whichever tables that have been made visible to it.
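
A minimal sketch of this kind of search_path-injecting middleware, assuming a generic DB-API connection in autocommit mode (the `quote_ident` helper and `DatasetMiddleware` name are illustrative, not a real API):

```python
def quote_ident(name: str) -> str:
    """Quote a schema name as a SQL identifier. Schema names can't be
    bind variables, so they must be spliced in as quoted identifiers."""
    return '"' + name.replace('"', '""') + '"'

class DatasetMiddleware:
    def __init__(self, conn):
        self.conn = conn  # any DB-API connection, autocommit on

    def run_query(self, dataset: str, sql: str, params=()):
        cur = self.conn.cursor()
        # Prime the connection, then run the unqualified query on the
        # SAME connection -- which is why per-statement pooling that
        # routes each statement to a different backend breaks this.
        cur.execute(f"SET search_path TO {quote_ident(dataset)}, public")
        cur.execute(sql, params)
        return cur.fetchall()
```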
       | 
       | Our queries currently don't run in transactions (i.e. we're using
       | JDBC auto-commit on the client side), because they're by-and-
       | large just single (complex) SELECT statements. So the middleware-
       | generated statement above (for now) runs as a separate statement,
       | in a separate single-statement transaction, on the same
        | connection that runs the query. So per-statement connection-
        | pooling would break everything, as our queries would be routed
        | to backend connections "primed" with the wrong search_path.
       | 
        | And I get the feeling that _transaction_-level connection
        | pooling won't work for us either (at least for now), because our
       | whole operational problem _currently_ is that our HTTP requests
       | acquire DB connections and then sit on them while other things
       | are processed, depleting the connection pool; and if we turned
       | off JDBC auto-commit, our search_path-injecting middleware (which
       | injects its statement on connection-pool checkout) would just end
       | up _starting a DB transaction_ at the beginning of each of those
       | HTTP requests, where the business layer would then be sitting
       | around with an _idle in transaction_ DB connection, mapping
       | directly to an open pgBouncer backend connection, completely
        | destroying any wins connection-pooling would gain us. We'd be
       | right back where we started.
       | 
       | So, in other words, we want/need _per-statement_ pooling; but we
       | need it to allow us to also specify the search path _per
       | statement_.
       | 
       | I've seen that at least pgBouncer has an explicit WONTFIX for
       | this requirement, since in their minds it conflicts with their
       | "can't tell it apart from a regular Postgres session" session-
       | state semantics.
       | 
       | Should we bite the bullet and move to putting everything in one
       | schema + list-partitioning + an explicit dataset_id column, so
       | that we _can_ parameterize out the dataset per-query using a bind
       | variable? Or something else?
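
For reference, the list-partitioned alternative described above would look roughly like this (table and column names are illustrative, echoing the digraph example):

```sql
-- Sketch of the single-schema alternative: one partitioned table,
-- dataset selected per-query with an ordinary bind variable.
CREATE TABLE vertices (
    dataset_id text   NOT NULL,
    vertex_id  bigint NOT NULL,
    PRIMARY KEY (dataset_id, vertex_id)
) PARTITION BY LIST (dataset_id);

CREATE TABLE vertices_ds1 PARTITION OF vertices
    FOR VALUES IN ('name_of_dataset');

-- The dataset is now a parameter, so plans can be shared:
--   SELECT * FROM vertices WHERE dataset_id = $1 AND vertex_id = $2;
```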
        
         | chatmasta wrote:
         | We are building a solution for this problem at Splitgraph [0] -
         | it sounds like we could probably help with your use case,
         | assuming this is for analytical (OLAP) data. You can get it to
         | work yourself with our open source code [1], but our (private
         | beta, upcoming public) SaaS will put all your schemas on a more
         | scalable "data delivery network," which incidentally, happens
         | to be implemented with PgBouncer + rewriting + ephemeral
         | instances. We also have private repositories / fine-grained ACL
         | shipping shortly.
         | 
         | In a local engine (just a Postgres DB managed by Splitgraph
         | client to add extra stuff), there is no PgBouncer or batteries-
         | included authN/Z, but we use Foreign Data Wrappers to
         | accomplish the same query resolving. Our goal is for you to
         | have the ability to do everything locally as an individual,
         | with the SaaS becoming useful for "multiplayer" (teams and
         | orgs).
         | 
         | On Splitgraph, every dataset - and every _version_ of every
         | dataset - has an address. Think of it like tagged Docker
         | images. The address either points to an immutable "data image"
         | (in which case we can optionally download objects required to
         | resolve a query on-the-fly, although loading up-front is
         | possible too) or to a live data source (in which case we proxy
         | directly to it via FDW translation). This simple idea of
         | _addressable data products_ goes a long way - for example, it
         | means that computing a diff is now as simple as joining across
         | two tables (one with the previous version, one with the new).
         | 
         | Please excuse the Frankenstein marketing site - we're in the
         | midst of redesign / rework of info architecture while we build
         | out our SaaS product.
         | 
         | Feel free to reach out if you've got questions. And if you have
         | a business case, we have spots available in our private pilot.
         | My email is in my profile - mention HN :)
         | 
         | [0] https://www.splitgraph.com/connect
         | 
         | [1] examples:
         | https://github.com/splitgraph/splitgraph/tree/master/example...
        
           | derefr wrote:
           | We're already sharding our datasets onto separate PG nodes
           | using FDWs (mostly due to us running PG on GCP, and GCE VMs
           | having inherent vertical scaling limitations on node-local
           | storage, which we rely 100% on for our workloads.)
           | 
           | Also, our "datasets" are all live data. They aren't modified
           | by the users querying our API, but they are constantly
           | appended to (and I mean constantly, i.e. every few
           | milliseconds.)
           | 
           | For us, PG is in our stack at this point because of its
           | hybrid-OLAP nature: it can do realtime, row-at-a-time
           | ingestion of data without degradation, like an OLTP store /
           | time-series DB; but it can then perform intensive OLAP
           | workloads against that up-to-the-moment data, involving
           | joins, CTEs, partial computed-expression indices, etc.
           | 
           | (The use-case for this sort of mixed workload? Think
           | "realtime monitoring/alerting on custom-per-user financial
           | Business-Intelligence queries from a common financial
           | transaction stream." We can't pre-denormalize the data,
           | because each client wants something different. Instead, we
           | need a normalized representation with tons of indices that
           | serves all potential queries equally well, in roughly real
           | time.)
           | 
           | For non-realtime analysis of "data at rest" (i.e. data that
           | can be ingested in at-most-hourly batches), we can just use
           | Snowflake. We already do, for logs and other things.
           | 
           | To be honest, our fondest dream would be to have a two-tiered
           | DB setup:
           | 
           | 1. a cold layer, where we get the SQL semantics of Postgres's
           | querying engine, but where the storage engine is similar to
           | those of scale-out DW services like Snowflake, with elastic
           | per-workload compute warehouse nodes fetching compressed-
           | columnar data from object storage into local per-worker
           | caches (it sounds like this is similar to what you're
           | building?)
           | 
           | 2. a hot layer, using traditional Postgres storage, which
           | serves as a sort of writeback cache for the cold layer:
           | 
           | * where all INSERTs hit the hot layer and then get async-
           | batched to the cold layer;
           | 
           | * where the hot layer keeps its own indices to enable fancy
           | OLAP querying of the hot data;
           | 
           | * where those queries see the cold data as "part of" the hot
           | data, in a UNION ALL sense (probably through a FDW);
           | 
            | * where those queries will _constraint-exclude_ the cold data
            | (and thus trigger no workload on the cold layer) if the cold
            | data isn't relevant to resolving the query -- like PG11
            | partition constraint-exclusion.
           | 
           | AFAIK, you can't currently use a foreign table as a partition
           | of a local parent table, so that'd be the first thing needing
           | to be solved there. The next problem after that would be
            | resolving the need for an AccessExclusiveLock to modify the
            | range constraint on each of the partitions, since 1. the
           | split-point between "historical" and "timely" data would be
           | sliding forward every hour-or-so, but 2. the table would be
           | just _saturated_ with long-running SELECTs, so much so that
           | it would never get a moment to lock itself to make a
           | modification like that.
           | 
            | (Really, the magic wand _there_ would be to allow partitions
            | to have multiple parents and for "a partition" -- the
           | metadata that contains the list/range/hash constraint -- to
           | be a separate DB object from the DB table it refers to, where
           | multiple "partitions" can reference one table-with-the-
           | partition's-data-in-it, as long as the data meets the union
           | of all their constraints. With those abstractions, you could
           | build a new parent table that _also_ references the old child
           | tables through new partition metadata, and then just do an
            | atomic swap of the parent tables' _names_ when you're
            | ready, where old queries would go on using the tables they
           | had already dereferenced to their OIDs, and new queries would
           | start using the new tables. Then queue up an effectively-
           | async DROP TABLE on the old parent table, that would resolve
           | when the DB runs out of old queries locking it open.)
        
         | x4m wrote:
          | This may sound naive, but implementing your own pooler is very
          | easy; e.g. in Go, pgproto3 already does all the packet parsing
          | for you. Transaction poolers look at the ReadyForQuery packet
          | and its "in transaction" property, like this [0]. All you need
          | is to pin the server connection on a new ParameterStatus[1]
          | packet for "SET search_path" instead of ReadyForQuery.
         | 
         | [0] https://github.com/pg-
         | sharding/spqr/blob/358f816cd8a964a9c9e... [1]
         | https://www.postgresql.org/docs/10/protocol-flow.html#id-1.1...
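
For reference, the ReadyForQuery message such a pooler watches for is tiny: a 'Z' type byte, a 4-byte length (always 5), and one status byte ('I' idle, 'T' in transaction, 'E' failed transaction). A sketch of the check in Python (function names are illustrative):

```python
import struct

# ReadyForQuery wire format (PostgreSQL v3 protocol):
#   Byte1('Z') | Int32 length (always 5) | Byte1 status
# status: b'I' = idle, b'T' = in transaction, b'E' = failed transaction.

def parse_ready_for_query(msg: bytes):
    """Return the transaction-status byte if msg is ReadyForQuery, else None."""
    if len(msg) < 6 or msg[0:1] != b'Z':
        return None
    (length,) = struct.unpack('!I', msg[1:5])
    if length != 5:
        return None
    return msg[5:6]

def in_transaction(msg: bytes) -> bool:
    """A transaction pooler may only release the server connection
    back to the pool when the session is idle ('I')."""
    return parse_ready_for_query(msg) in (b'T', b'E')
```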
        
           | CyberDem0n wrote:
            | It's not enough to intercept `set search_path`; in fact, any
            | arbitrary function could change it, and Postgres doesn't
            | report it back, unlike some other GUCs.
        
             | x4m wrote:
             | Hmm, yes. ParameterStatus is not sent when search_path is
             | changed. But I think it would be trivial to patch
              | PostgreSQL to send it. I bet one could even push such a
              | patch through a commitfest.
        
         | nezirus wrote:
         | Maybe this can help https://github.com/awslabs/pgbouncer-rr-
         | patch
         | 
         | It is a pgbouncer fork/patch with query rewriting support.
        
       | r1b wrote:
       | Why might I want to use this or pgbouncer instead of, say, the
       | connection pooling built into ORMs like SQLAlchemy?
        
         | PhoenixReborn wrote:
         | The ORM connection pooler is client-side - so each instance of
         | your application using SQLAlchemy will have its own connection
         | pool. This may become unwieldy as you scale up the number of
         | app servers/containers/processes.
         | 
         | In contrast, connection poolers like Odyssey or pgbouncer live
         | external to your applications, meaning that:
         | 
          | 1. You can connect to them from any application written in
          | any language (not just Python).
          | 
          | 2. They provide a global pool for all of the connecting
          | applications and help to not overload Postgres.
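
The client-side flavor described above can be sketched with nothing but the stdlib -- a tiny per-process pool (the `ClientPool` name and `connect` callback are made up for illustration):

```python
import queue

class ClientPool:
    """Minimal client-side connection pool. Each application process
    holds its own fixed set of connections, so N processes mean up to
    N * size server connections -- the scaling problem an external
    pooler like Odyssey or pgbouncer avoids."""

    def __init__(self, connect, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(connect())  # eagerly open `size` connections

    def acquire(self, timeout=None):
        # Blocks (or raises queue.Empty with a timeout) when exhausted.
        return self._q.get(timeout=timeout)

    def release(self, conn):
        self._q.put(conn)
```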
        
           | djbusby wrote:
           | Why not python? Works on my machine.
        
         | x4m wrote:
          | You might want to have many app backends, each with its own
          | Alchemy pool. This setup would create a lot of Postgres
          | connections. When the network flaps you need to re-establish
          | lots of TLS connections, maybe 100ms of CPU per handshake. Also
         | each connection may cost you a fork with lots of following CoW,
         | also up to 100ms of cpu. And also Postgres gives you maximum
         | TPS throughput when you run some hundreds of connections, not
         | thousands. And you might want to isolate one microservice from
         | another (in terms of DB throughput utilization). Poolers can
         | help here too.
        
         | maxpert wrote:
          | You would only see the real value of these poolers in dire
          | times. It's really easy for a buggy app or deploy to leak
          | connections, and if that happens you will bring down the
          | whole DB. With these poolers you can not only limit that, but
          | also enforce per-user limits (per-service, if every service
          | has its own creds). These tools are useful beyond a certain
          | scale, and you will only realize their true value once you
          | are working in a distributed microservice world, especially
          | when many services connect to the same DB.
        
       | whitepoplar wrote:
       | Everything that comes out of Yandex is seemingly great. It feels
       | like a lighter, leaner FAANG and as a result, it feels like their
       | software is more usable by small teams (as compared to what the
       | actual FAANGs put out).
        
       | x4m wrote:
       | FWIW I maintain Odyssey and will be happy to answer any question
       | regarding it. Actually there was just a CVE-triggered release...
       | I really hope to release more often with more new functionality.
       | 
        | If you are interested in connection pooling, you might find SPQR
        | interesting too: https://github.com/pg-sharding/spqr Currently
        | it's our experiment to build pooling-based PostgreSQL sharding.
        
         | jerrysievert wrote:
         | Any thoughts on an arm64 version?
        
         | hn_throwaway_99 wrote:
         | Thanks very much for posting this. Odyssey is analogous to
         | pgbouncer, correct? I haven't set up a connection pooler yet
         | for our postgres instance but was planning to do it eventually
         | with pgbouncer - are you able to comment how Odyssey compares?
        
           | x4m wrote:
            | A connection pooler lets you get more transactions per
            | second on the same hardware. All it does is pack a lot of
            | connections into a handful of concurrently running PG
            | backends. In theory, all it does is relay bytes between
            | sockets. But there are workloads where that is CPU
            | intensive, for example lots of concurrent TLS handshakes.
            | And pgbouncer is single-threaded! Odyssey is more scalable.
           | 
            | PgBouncer is great software, actually. We built Odyssey
            | merely to create a competitor for PgBouncer, but then started
            | to add functionality that we needed for cloud installations.
            | E.g. Odyssey computes transaction time quantiles, whereas
            | PgBouncer computes the average transaction time.
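
The difference matters because averages hide tail latency. A toy illustration with Python's stdlib (the numbers are made up; this is not how Odyssey computes quantiles internally):

```python
import statistics

# Toy transaction times in ms: mostly fast, one slow outlier.
times = [1.0] * 98 + [1.2, 200.0]

# What average-based reporting shows vs. what quantiles reveal.
avg = statistics.fmean(times)
q = statistics.quantiles(times, n=100)  # 99 percentile cut points
p50, p99 = q[49], q[98]
# avg is ~3ms and looks harmless; p99 exposes the ~200ms outlier.
```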
        
             | whitepoplar wrote:
             | How does Odyssey compare to other high-performance poolers
             | like pgagroal?
        
               | x4m wrote:
                | Depends on how you measure performance :) pgagroal
                | uses a lot of Linux processes to track each
                | connection's state; we use coroutines packed into a
                | small number of processes. Actually, if you have a
                | bottleneck in a multithreaded pooler, something has
                | really gone wrong - the database should consume most of
                | the resources. I'd be happy to help tune Odyssey for
                | anyone who will benchmark both poolers (in fact there's
                | only one number - the number of worker processes...
                | well, maybe pool_size too).
               | 
               | pgagroal claims performance superiority over all poolers
               | [0]. I doubt that Odyssey was used in transaction pooling
               | mode in those experiments.
               | 
               | [0] https://github.com/agroal/pgagroal/blob/master/doc/PE
               | RFORMAN...
        
               | whitepoplar wrote:
                | Thanks! For a Postgres n00b, can Odyssey work alongside
                | an HA solution like pg_auto_failover running on the same
                | nodes?
        
               | x4m wrote:
               | Sure. We use Odyssey as a component of HA cluster in
               | Yandex.Cloud. Odyssey runs on each node of a Patroni-like
                | system. One day we want to implement load-balancing
                | functionality in Odyssey, so that your queries go to
                | another node if the local one is overloaded or lagging
                | far behind the primary.
        
               | tsenart wrote:
                | This, please! Native support for read-replicas would be
                | awesome. Ideally it would know if a query is read-only
                | or not without application changes.
        
       | sudhirj wrote:
       | For those wondering why use a connection pooler at all with
       | Postgres, I wrote a (really long) post about it:
       | https://sudhir.io/understanding-connections-pools/
        
         | SOLAR_FIELDS wrote:
         | This is a great write up that really helped me understand the
         | issue in depth. Thanks for sharing.
        
       | barefeg wrote:
       | What is the use case for this?
        
         | hn_throwaway_99 wrote:
         | Probably the biggest downside in my mind with postgres (which
         | is otherwise an amazing product!) is that it doesn't scale
         | connections well, as there is 1 process per connection. So
          | teams often put a connection pooler _in front_ of postgres -
          | end clients connect to that connection pooler, which forwards
          | requests on to postgres. AFAIK pgbouncer is the 'de facto
          | standard' for postgres connection pooling, so I asked the
          | author how Odyssey compares.
        
       | ggregoire wrote:
        | Just to be sure, Odyssey (or pgBouncer) needs to run on the
        | same box as Postgres, right? So if you are on AWS, you would
        | have to migrate from RDS to running Postgres on an EC2 instance
        | and install the pooler on the same EC2?
       | 
       | Which comes with some big drawbacks (you obviously lose all the
       | benefits of using RDS) but also some benefits (can install every
       | pg extension that you need, have access to every pg settings,
       | etc)
        
         | speedyapoc wrote:
         | I run pgBouncer on an EC2 instance, connecting to my database
         | on RDS. Does not have to be on the same box.
        
           | ggregoire wrote:
           | Oh nice. Thanks for the quick answer!
        
         | x4m wrote:
          | I'd certainly recommend running the pooler on the same node to
          | avoid double encryption. Re-establishing 2x connections and
          | dealing with 2x restarts of the VM are downsides too. However
          | it's not strictly necessary; Odyssey can run even in another
          | AZ :)
        
       | claytonjy wrote:
       | Does anyone know if you can specify parent roles (group-style
       | non-login roles)? That's something I've wanted from pgbouncer, so
       | instead of making settings at the user level (applications) I
       | could make them at the group level (e.g. apiv2-readonly).
        
         | x4m wrote:
          | We are going to implement config inheritance. This should
          | bring somewhat similar functionality.
        
       | sandGorgon wrote:
        | Is the Dockerfile not production-ready?
        | https://github.com/yandex/odyssey/blob/master/docker/Dockerf...
        | 
        | I mean, I see valgrind, vim, etc. in there. This would be a very
        | fat image.
       | 
       | It seems that way -
       | https://github.com/yandex/odyssey/issues/29#issuecomment-764...
       | 
        | > _But it's hard to "bless" some image as "official": no one
        | from active contributors uses Odyssey in Docker to maintain it
        | thoroughly._
       | 
       | OTOH pgbouncer docker images are rock-solid in production.
       | 
       | Very quickly updated to track upstream.
       | 
       | e.g. the Bitnami ones -
       | https://hub.docker.com/r/bitnami/pgbouncer/ which also have CVE
       | security scans
       | https://quay.io/repository/bitnami/pgbouncer?tab=tags
       | 
       | Microsoft releases an officially supported k8s sidecar of
       | pgbouncer - https://hub.docker.com/_/microsoft-azure-oss-db-
       | tools-pgboun...
        
         | x4m wrote:
          | Yes, that's a problem. We use Docker only for development, and
          | mostly because Odyssey is based on epoll and just will not
          | compile on macOS.
        
           | sandGorgon wrote:
            | So I'm not sure what you meant by that - macOS
            | compatibility is generally unrelated to production (which is
            | almost always on Linux).
           | 
           | Am I understanding this wrong ? do you have production
           | workloads that use Odyssey on MacOS ?
        
             | x4m wrote:
              | No. It's just about developer experience. PostgreSQL core
              | development under macOS is just so great: you can develop
              | PG patches completely without an internet connection on a
              | very old MacBook Air, and everything will be blazingly
              | fast. For Odyssey development, a macOS user needs a VM, a
              | Linux box, or Docker.
        
       ___________________________________________________________________
       (page generated 2021-11-12 23:00 UTC)