[HN Gopher] Scalable PostgreSQL Connection Pooler ___________________________________________________________________ Scalable PostgreSQL Connection Pooler Author : sbuttgereit Score : 144 points Date : 2021-11-12 16:25 UTC (6 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com)
| Tostino wrote: | I appreciate the work going on to make high-quality poolers for | Postgres... I just really wish the work were done in-core so we | could have built-in connection pooling, or the work done to make | grabbing a new connection as cheap as if you were using a pooler. | It sucks to have to add complexity to your stack to fix a | "deficiency" in Postgres' design. | | Still, I am glad there is effort put into the project even if I | selfishly wish it were done in-core.
| anarazel wrote: | There has been some work towards that - things did get cheaper | in 14. | | Note that you very well might still want a separately run | pooler for some workloads - having local poolers on application | servers can be good for latency.
| saurik wrote: | Yeah: the latency of establishing a new connection | fundamentally requires _at least_ one network round trip, and | likely a few (and so I tend to have my servers going through | _two_ layers of pgbouncer, one on each side of the network); | and using in-process pools couples your pool state to your | front-end code updates... so, since you are going to want to | build this connection pooling mechanism anyway -- and then are | likely going to want to be able to take advantage of | application-specific capability limitations (such as not | supporting connection state outside of a transaction) to | further improve performance -- it doesn't really make much | sense to prioritize this inside the database server itself.
| anarazel wrote: | > it doesn't really make much sense to prioritize this | inside the database server itself. | | I wouldn't go that far. There are a lot of things that are | hard to do within an external pooler, unless the pooler | uses a 1:1 connection model. That leads to most setups using | 1:1 connection pooling methods, which in turn puts pressure | on Postgres to handle large numbers of connections | gracefully.
| Znafon wrote: | It is being worked on: https://www.postgresql.org/message-id/flat/ac873432-31cf-d5e... !
| riyadparvez wrote: | Can someone point to an article that explains the connection | issue in detail?
| hn_throwaway_99 wrote: | The short of it is that Postgres uses a process per connection, | so architectures that spin up and close connections | frequently can have serious scalability issues. | | Note the landing page for the AWS RDS Proxy, | https://aws.amazon.com/rds/proxy/ , is as good a discussion | as any as to why you'd want to put a pooling proxy in front | of Postgres.
| ddorian43 wrote: | https://www.citusdata.com/blog/2020/10/08/analyzing-connecti...
| x4m wrote: | There's ongoing work on making Postgres connections more | scalable. I really hope that one day we will not need a pooler. | https://www.citusdata.com/blog/2020/10/08/analyzing-connecti...
| jordanthoms wrote: | We've been using this for a while after we found pgbouncer became | a bottleneck - it's so much faster and has been totally reliable. | Thanks!
| colesantiago wrote: | Has anyone used Yandex Cloud [0] in production? How does it | compare to DigitalOcean, AWS, GCP and others? | | Also the tech from Yandex such as Clickhouse [1] is really | interesting, as they recently spun it out of Yandex.
| | [0] https://cloud.yandex.com/en/ | | [1] https://clickhouse.com/
| x4m wrote: | We are also developing managed databases for the US/EU market: | https://double.cloud/
| zaius wrote: | Last time I tried, they wouldn't take a US credit card, which | made it too big of a challenge to consider.
| the-alchemist wrote: | Only used Clickhouse, although not through their Cloud | offering. | | It's an impressive piece of engineering. The best column-oriented DB in the open source space, I would say.
| [deleted]
| derefr wrote: | Question for anyone who knows a lot about Postgres connection | pooling: | | We want to move to using a connection pooling infrastructure | component, but so far, we haven't been able to figure out how to | do so given our setup. | | We've got a wacky data model where we have N Postgres schemas, | each of which has the same tables (but with different data, | obviously). They're not tenants; more like separate "datasets." | Picture something like: each schema is a digraph, with tables | like "vertices", "edges", "edge_labels", etc. (Not what we're | doing, but it's close.) | | In theory, we could solve our problem by just sticking all the | "vertices" tables together as list-partitions of one mega | "vertices" table keyed by dataset_id; but 1. it's kind of | incoherent -- the data makes no sense when combined like that; | it's only valid when considered in isolation; and 2. it's | very operationally convenient for us to manage datasets as | separate schemas, re: security, naming, onlining/offlining data, | etc. | | Also, while each user request is only going to interact with a | single dataset (schema), a given user might care about a lot of | datasets, and make requests about many of them at arbitrary times | -- so, as far as I can tell, we don't have the sort of stable load | per dataset that would allow us to hold separate connection pools | per dataset. | | _And_, since there's no way to make a schema name into a bind | variable, fully qualifying our queries means generating dynamic | SQL strings, which is expensive both in the business layer and | in the SQL query planner, which can't just recognize and reuse | plans from the same query's history against other schemas. | | Right now, the way we query the data is that in our business | layer, before each query, we have a middleware that injects a | statement like this: SET search_path TO | 'name_of_dataset', 'public'; | | We then run a non-fully-schema-qualified query, and it finds | whichever tables have been made visible to it. | | Our queries currently don't run in transactions (i.e. we're using | JDBC auto-commit on the client side), because they're by-and-large | just single (complex) SELECT statements. So the middleware-generated | statement above (for now) runs as a separate statement, | in a separate single-statement transaction, on the same | connection that runs the query. So per-statement connection-pooling | would break everything, as our queries would be routed | to backend connections "primed" with the wrong | search_path.
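| | (For concreteness, the pattern above amounts to something like | the following -- a minimal sketch in Go with github.com/lib/pq, | standing in for our actual JDBC stack; all names are | illustrative. The point is that both statements must land on the | same backend connection, which is exactly what per-statement | pooling doesn't guarantee.)
|
|       package main
|
|       import (
|           "context"
|           "database/sql"
|           "fmt"
|           "log"
|
|           "github.com/lib/pq" // Postgres driver; also provides QuoteIdentifier
|       )
|
|       // countVertices pins one pooled connection, sets the search_path on
|       // it, and then runs an unqualified query on that same connection.
|       // With a per-statement pooling proxy in between, the two statements
|       // below could be routed to different server backends.
|       func countVertices(ctx context.Context, db *sql.DB, dataset string) (int, error) {
|           conn, err := db.Conn(ctx) // reserve one client-side connection
|           if err != nil {
|               return 0, err
|           }
|           defer conn.Close() // give it back to the pool
|
|           // A schema name can't be a bind variable, so it is quoted and
|           // interpolated into the statement text.
|           _, err = conn.ExecContext(ctx, fmt.Sprintf(
|               "SET search_path TO %s, public", pq.QuoteIdentifier(dataset)))
|           if err != nil {
|               return 0, err
|           }
|
|           var n int // "vertices" resolves through the search_path set above
|           err = conn.QueryRowContext(ctx, "SELECT count(*) FROM vertices").Scan(&n)
|           return n, err
|       }
|
|       func main() {
|           db, err := sql.Open("postgres", "dbname=graphs sslmode=disable")
|           if err != nil {
|               log.Fatal(err)
|           }
|           defer db.Close()
|
|           n, err := countVertices(context.Background(), db, "name_of_dataset")
|           if err != nil {
|               log.Fatal(err)
|           }
|           log.Println(n)
|       }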
| | And I get the feeling that _transaction_-level connection | pooling won't work for us either (at least for now), because our | whole operational problem _currently_ is that our HTTP requests | acquire DB connections and then sit on them while other things | are processed, depleting the connection pool; and if we turned | off JDBC auto-commit, our search_path-injecting middleware (which | injects its statement on connection-pool checkout) would just end | up _starting a DB transaction_ at the beginning of each of those | HTTP requests, where the business layer would then be sitting | around with an _idle in transaction_ DB connection, mapping | directly to an open pgBouncer backend connection, completely | destroying any wins connection-pooling would gain us. We'd be | right back where we started. | | So, in other words, we want/need _per-statement_ pooling; but we | need it to allow us to also specify the search path _per | statement_. | | I've seen that at least pgBouncer has an explicit WONTFIX for | this requirement, since in their minds it conflicts with their | "can't tell it apart from a regular Postgres session" session-state | semantics. | | Should we bite the bullet and move to putting everything in one | schema + list-partitioning + an explicit dataset_id column, so | that we _can_ parameterize out the dataset per-query using a bind | variable? Or something else?
| chatmasta wrote: | We are building a solution for this problem at Splitgraph [0] - | it sounds like we could probably help with your use case, | assuming this is for analytical (OLAP) data. You can get it to | work yourself with our open source code [1], but our (private | beta, upcoming public) SaaS will put all your schemas on a more | scalable "data delivery network," which, incidentally, happens | to be implemented with PgBouncer + rewriting + ephemeral | instances. We also have private repositories / fine-grained ACLs | shipping shortly. | | In a local engine (just a Postgres DB managed by the Splitgraph | client to add extra stuff), there is no PgBouncer or batteries-included | authN/Z, but we use Foreign Data Wrappers to | accomplish the same query resolving. Our goal is for you to | have the ability to do everything locally as an individual, | with the SaaS becoming useful for "multiplayer" (teams and | orgs). | | On Splitgraph, every dataset - and every _version_ of every | dataset - has an address. Think of it like tagged Docker | images. The address either points to an immutable "data image" | (in which case we can optionally download the objects required to | resolve a query on the fly, although loading up-front is | possible too) or to a live data source (in which case we proxy | directly to it via FDW translation). This simple idea of | _addressable data products_ goes a long way - for example, it | means that computing a diff is now as simple as joining across | two tables (one with the previous version, one with the new). | | Please excuse the Frankenstein marketing site - we're in the | midst of a redesign / rework of the info architecture while we build | out our SaaS product. | | Feel free to reach out if you've got questions. And if you have | a business case, we have spots available in our private pilot. | My email is in my profile - mention HN :) | | [0] https://www.splitgraph.com/connect | | [1] examples: | https://github.com/splitgraph/splitgraph/tree/master/example...
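| | (As a point of reference for the "bite the bullet" option derefr | asks about above: the single-schema, list-partitioned layout could | look roughly like the sketch below -- hypothetical names | throughout, with the DDL carried in a Go string. The payoff is | that the dataset becomes an ordinary bind variable, so one cached | plan can serve every dataset, and PostgreSQL 11+ can prune | irrelevant partitions at execution time.)
|
|       package main
|
|       import (
|           "context"
|           "database/sql"
|           "log"
|
|           _ "github.com/lib/pq" // Postgres driver
|       )
|
|       // One schema: each dataset becomes a list partition, and dataset_id
|       // becomes an ordinary column -- and therefore a bind variable.
|       const ddl = `
|       CREATE TABLE vertices (
|           dataset_id text   NOT NULL,
|           vertex_id  bigint NOT NULL,
|           label      text,
|           PRIMARY KEY (dataset_id, vertex_id)
|       ) PARTITION BY LIST (dataset_id);
|
|       -- One partition per dataset; ATTACH/DETACH PARTITION stands in for
|       -- the per-schema onlining/offlining workflow.
|       CREATE TABLE vertices_dataset_a
|           PARTITION OF vertices FOR VALUES IN ('dataset_a');
|       `
|
|       func main() {
|           db, err := sql.Open("postgres", "dbname=graphs sslmode=disable")
|           if err != nil {
|               log.Fatal(err)
|           }
|           defer db.Close()
|
|           if _, err := db.Exec(ddl); err != nil {
|               log.Fatal(err)
|           }
|
|           // One prepared statement (and one cached generic plan) now
|           // serves every dataset.
|           var n int
|           err = db.QueryRowContext(context.Background(),
|               "SELECT count(*) FROM vertices WHERE dataset_id = $1",
|               "dataset_a").Scan(&n)
|           if err != nil {
|               log.Fatal(err)
|           }
|           log.Printf("dataset_a: %d vertices", n)
|       }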
| derefr wrote: | We're already sharding our datasets onto separate PG nodes | using FDWs (mostly due to us running PG on GCP, and GCE VMs | having inherent vertical scaling limitations on node-local | storage, which we rely 100% on for our workloads). | | Also, our "datasets" are all live data. They aren't modified | by the users querying our API, but they are constantly | appended to (and I mean constantly, i.e. every few | milliseconds). | | For us, PG is in our stack at this point because of its | hybrid-OLAP nature: it can do realtime, row-at-a-time | ingestion of data without degradation, like an OLTP store / | time-series DB; but it can then perform intensive OLAP | workloads against that up-to-the-moment data, involving | joins, CTEs, partial computed-expression indices, etc. | | (The use-case for this sort of mixed workload? Think | "realtime monitoring/alerting on custom-per-user financial | Business-Intelligence queries from a common financial | transaction stream." We can't pre-denormalize the data, | because each client wants something different. Instead, we | need a normalized representation with tons of indices that | serves all potential queries equally well, in roughly real | time.) | | For non-realtime analysis of "data at rest" (i.e. data that | can be ingested in at-most-hourly batches), we can just use | Snowflake. We already do, for logs and other things. | | To be honest, our fondest dream would be to have a two-tiered | DB setup: | | 1. a cold layer, where we get the SQL semantics of Postgres's | querying engine, but where the storage engine is similar to | those of scale-out DW services like Snowflake, with elastic | per-workload compute warehouse nodes fetching compressed-columnar | data from object storage into local per-worker | caches (it sounds like this is similar to what you're | building?) | | 2. a hot layer, using traditional Postgres storage, which | serves as a sort of writeback cache for the cold layer: | | * where all INSERTs hit the hot layer and then get async-batched | to the cold layer; | | * where the hot layer keeps its own indices to enable fancy | OLAP querying of the hot data; | | * where those queries see the cold data as "part of" the hot | data, in a UNION ALL sense (probably through an FDW); | | * where those queries will _constraint-exclude_ the cold data | (and thus trigger no workload on the cold layer) if the cold | data isn't relevant to resolving the query -- like PG11 | partition constraint-exclusion. | | AFAIK, you can't currently use a foreign table as a partition | of a local parent table, so that'd be the first thing needing | to be solved there. The next problem after that would be | resolving the need for an AccessExclusiveLock to modify the | range constraint on each of the partitions, since 1. the | split-point between "historical" and "timely" data would be | sliding forward every hour-or-so, but 2. the table would be | just _saturated_ with long-running SELECTs, so much so that | it would never get a moment to lock itself to make a | modification like that. | | (Really, the magic wand _there_ would be to allow partitions | to have multiple parents, and for "a partition" -- the | metadata that contains the list/range/hash constraint -- to | be a separate DB object from the DB table it refers to, where | multiple "partitions" can reference one table-with-the-partition's-data-in-it, | as long as the data meets the union | of all their constraints.
With those abstractions, you could | build a new parent table that _also_ references the old child | tables through new partition metadata, and then just do an | atomic swap of the parent tables' _names_ when you're | ready, where old queries would go on using the tables they | had already dereferenced to their OIDs, and new queries would | start using the new tables. Then queue up an effectively-async | DROP TABLE on the old parent table, that would resolve | when the DB runs out of old queries locking it open.)
| x4m wrote: | This may sound naive, but implementing your own pooler is very | easy, e.g. in Go: pgproto3 already does all the packet parsing for | you. Transaction poolers look at the ReadyForQuery packet and | its "in transaction" property, like this [0]. All you need is | to stick the server connection to the client on a new | ParameterStatus[1] packet for "SET search_path", instead of | releasing it on ReadyForQuery. | | [0] https://github.com/pg-sharding/spqr/blob/358f816cd8a964a9c9e... [1] | https://www.postgresql.org/docs/10/protocol-flow.html#id-1.1...
| CyberDem0n wrote: | It's not enough to intercept `set search_path`; in fact, any | arbitrary function could change it, but Postgres doesn't | report it back, unlike some other GUCs.
| x4m wrote: | Hmm, yes. ParameterStatus is not sent when search_path is | changed. But I think it would be trivial to patch | PostgreSQL to send it. I bet one could even push such a patch | through a commitfest.
| nezirus wrote: | Maybe this can help: https://github.com/awslabs/pgbouncer-rr-patch | | It is a pgbouncer fork/patch with query rewriting support.
| r1b wrote: | Why might I want to use this or pgbouncer instead of, say, the | connection pooling built into ORMs like SQLAlchemy?
| PhoenixReborn wrote: | The ORM connection pooler is client-side - so each instance of | your application using SQLAlchemy will have its own connection | pool. This may become unwieldy as you scale up the number of | app servers/containers/processes. | | In contrast, connection poolers like Odyssey or pgbouncer live | external to your applications, meaning that: | | 1. You can connect to them from any application written in any | language (not just Python); 2. They provide a global pool for | all of the connecting applications and help avoid overloading | Postgres.
| djbusby wrote: | Why not python? Works on my machine.
| x4m wrote: | You might want to have many app backends, each with its own | Alchemy pool. This setup would create a lot of Postgres | connections. When the network flaps, you need to re-establish lots | of TLS connections - maybe 100ms of CPU for each handshake. Also, | each connection may cost you a fork with lots of copy-on-write | afterwards, also up to 100ms of CPU. And Postgres gives you maximum | TPS throughput when you run some hundreds of connections, not | thousands. And you might want to isolate one microservice from | another (in terms of DB throughput utilization); poolers can | help here too.
| maxpert wrote: | You would only see the real value of these poolers in dire | times. Screwing up apps or deploys that leak connections is | really easy, and if that happens you will bring down the whole DB. | With these poolers you will not only be able to limit that but | also enforce per-user limits (service-based, if every service | has its own creds). These tools are useful beyond a certain | scale, and you will only realize their true value once you are | working in a distributed microservice world, especially when the | services are connecting to the same DB.
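| | (Connecting this back to x4m's pgproto3 comment above: the core of | the transaction-pooling decision is a loop like the one below -- a | bare sketch assuming the jackc/pgproto3 v2 API, not SPQR's or | Odyssey's actual code; the SPQR source linked as [0] above is the | real version of this check.)
|
|       package pooler
|
|       import (
|           "net"
|
|           "github.com/jackc/pgproto3/v2"
|       )
|
|       // relayUntilIdle copies server messages back to the client until the
|       // server reports it is idle (outside any transaction) -- the point at
|       // which a transaction pooler may detach the server connection and
|       // hand it to another client.
|       func relayUntilIdle(server, client net.Conn) error {
|           frontend := pgproto3.NewFrontend(pgproto3.NewChunkReader(server), server)
|           backend := pgproto3.NewBackend(pgproto3.NewChunkReader(client), client)
|
|           for {
|               msg, err := frontend.Receive() // next message from the server
|               if err != nil {
|                   return err
|               }
|               if err := backend.Send(msg); err != nil { // relay to the client
|                   return err
|               }
|               switch m := msg.(type) {
|               case *pgproto3.ParameterStatus:
|                   // A pooler doing what x4m suggests would pin the server
|                   // connection here when m.Name == "search_path" -- though,
|                   // as noted in the replies, stock Postgres doesn't report
|                   // search_path changes this way; hence the patch idea.
|               case *pgproto3.ReadyForQuery:
|                   // TxStatus: 'I' = idle, 'T' = in a transaction,
|                   // 'E' = in a failed transaction.
|                   if m.TxStatus == 'I' {
|                       return nil // safe to unpin the server connection
|                   }
|               }
|           }
|       }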
| whitepoplar wrote: | Everything that comes out of Yandex is seemingly great. It feels | like a lighter, leaner FAANG, and as a result it feels like their | software is more usable by small teams (as compared to what the | actual FAANGs put out).
| x4m wrote: | FWIW I maintain Odyssey and will be happy to answer any questions | regarding it. Actually, there was just a CVE-triggered release... | I really hope to release more often, with more new functionality. | | If you are interested in connection pooling, maybe you will find | SPQR interesting too: https://github.com/pg-sharding/spqr Currently | it's our experiment to build pooling-based PostgreSQL sharding.
| jerrysievert wrote: | Any thoughts on an arm64 version?
| hn_throwaway_99 wrote: | Thanks very much for posting this. Odyssey is analogous to | pgbouncer, correct? I haven't set up a connection pooler yet | for our postgres instance but was planning to do it eventually | with pgbouncer - are you able to comment on how Odyssey compares?
| x4m wrote: | A connection pooler allows you to get more transactions per | second on the same hardware. All it does is pack a lot of | connections into a handful of concurrently running PG backends. | In theory, all it does is relay bytes between sockets. But | there are workloads where it's CPU-intensive, for example lots | of concurrent TLS handshakes. And pgbouncer is single-threaded! | Odyssey is more scalable. | | PgBouncer is great software, actually. We built Odyssey | merely to create a competitor to PgBouncer, but then started | to add functionality that we needed for cloud installations. | E.g. Odyssey computes transaction time quantiles, while | PgBouncer computes the average transaction time.
| whitepoplar wrote: | How does Odyssey compare to other high-performance poolers | like pgagroal?
| x4m wrote: | Depends on how you measure performance :) pgagroal | utilizes a lot of Linux processes to track each | connection's state; we use coroutines packed into a small | number of processes. Actually, if you have a bottleneck in | a multithreaded pooler, something has really gone wrong: the | database should consume most of the resources. I'd be happy | to help tune Odyssey for someone who will benchmark both | poolers (in fact, there's only one number - the number of | worker processes... well, maybe pool_size too). | | pgagroal claims performance superiority over all poolers | [0]. I doubt that Odyssey was used in transaction pooling | mode in those experiments. | | [0] https://github.com/agroal/pgagroal/blob/master/doc/PERFORMAN...
| whitepoplar wrote: | Thanks! For a Postgres n00b: can Odyssey work alongside an | HA solution like pg_auto_failover running on the same | nodes?
| x4m wrote: | Sure. We use Odyssey as a component of the HA cluster in | Yandex.Cloud. Odyssey runs on each node of a Patroni-like | system. One day we want to implement load-balancing | functionality in Odyssey, so that your queries go to | another node if the local one is overloaded or lagging far | behind the primary.
| tsenart wrote: | This, please! Native support for read replicas would be | awesome. Ideally it would know if a query is read-only or | not without application changes.
| sudhirj wrote: | For those wondering why use a connection pooler at all with | Postgres, I wrote a (really long) post about it: | https://sudhir.io/understanding-connections-pools/
| SOLAR_FIELDS wrote: | This is a great write-up that really helped me understand the | issue in depth. Thanks for sharing.
| barefeg wrote: | What is the use case for this?
| hn_throwaway_99 wrote: | Probably the biggest downside in my mind with postgres (which | is otherwise an amazing product!) is that it doesn't scale | connections well, as there is 1 process per connection. So | teams often put a connection pooler _in front_ of postgres - | end clients connect to that connection pooler, which forwards | requests on to postgres. AFAIK pgbouncer is the 'de facto | standard' for postgres connection pooling, so I asked the | author how Odyssey compares.
| ggregoire wrote: | Just to be sure: Odyssey (or pgBouncer) requires running on the | same box as Postgres, right? So if you are on AWS, you would | have to migrate from RDS to running Postgres on an EC2 instance | and install the pooler on the same EC2? | | Which comes with some big drawbacks (you obviously lose all the | benefits of using RDS) but also some benefits (you can install | every pg extension that you need, have access to every pg | setting, etc.)
| speedyapoc wrote: | I run pgBouncer on an EC2 instance, connecting to my database | on RDS. It does not have to be on the same box.
| ggregoire wrote: | Oh nice. Thanks for the quick answer!
| x4m wrote: | I'd certainly recommend running the pooler on the same node to | avoid double encryption. Re-establishing 2x connections and | dealing with 2x VM restarts are kind of a downside too. However, | it's not strictly necessary; Odyssey can run even in another AZ :)
| claytonjy wrote: | Does anyone know if you can specify parent roles (group-style | non-login roles)? That's something I've wanted from pgbouncer, so | instead of making settings at the user level (applications) I | could make them at the group level (e.g. apiv2-readonly).
| x4m wrote: | We are going to implement config inheritance. This should bring | somewhat similar functionality.
| sandGorgon wrote: | Is the Dockerfile not production-ready? | https://github.com/yandex/odyssey/blob/master/docker/Dockerf... | | I mean, I see valgrind, vim, etc. in there. This would be a very | fat Dockerfile. | | It seems that way - | https://github.com/yandex/odyssey/issues/29#issuecomment-764... | | > _But it's hard to "bless" some image as "official": no one | of the active contributors uses Odyssey in Docker to maintain it | thoroughly._ | | OTOH pgbouncer docker images are rock-solid in production, and | very quickly updated to track upstream. | | E.g. the Bitnami ones - | https://hub.docker.com/r/bitnami/pgbouncer/ - which also have CVE | security scans: | https://quay.io/repository/bitnami/pgbouncer?tab=tags | | Microsoft releases an officially supported k8s sidecar of | pgbouncer - https://hub.docker.com/_/microsoft-azure-oss-db-tools-pgboun...
| x4m wrote: | Yes, that's a problem. We use Docker only for development, and | mostly because Odyssey is based on epoll and just will not | compile on macOS.
| sandGorgon wrote: | So I'm not sure what you meant by that - because macOS | compatibility is generally unrelated to production (which is | almost always on Linux). | | Am I understanding this wrong? Do you have production | workloads that use Odyssey on macOS?
| x4m wrote: | No. It's just about developer experience. PostgreSQL core | development under macOS is just so great: you can develop | PG patches completely without an internet connection on a very | old MacBook Air, and everything will be so blazingly fast. For | Odyssey development, a macOS user needs a VM, a Linux box, or | Docker. ___________________________________________________________________ (page generated 2021-11-12 23:00 UTC)