[HN Gopher] Running Postgres in Kubernetes [pdf]
___________________________________________________________________

Running Postgres in Kubernetes [pdf]

Author : craigkerstiens
Score  : 98 points
Date   : 2020-06-29 20:29 UTC (2 hours ago)

(HTM) web link (static.sched.com)
(TXT) w3m dump (static.sched.com)

| caiobegotti wrote:
| Please don't. Just because it's possible doesn't mean it's a good idea. The PDF itself clearly shows how quickly this can get complex. The great majority of people won't ever be able to do this properly, securely, and with decent reliability. Of course, I may have to swallow my words in the future if a job requires it, but unless you REALLY REALLY REALLY need PostgreSQL inside Kubernetes, IMHO you should just stick with private RDS or Cloud SQL and point your Kubernetes workloads to it inside your VPCs, all peered, etc. Your SRE mental health, your managers, and your company's costs will thank you.

| deathanatos wrote:
| I've done MySQL RDS, and I've seen k8s database setups (but not with PG).
| RDS is okay, but I would not dismiss the maintenance work required; RDS puts you at the mercy of AWS when things go wrong. We had a fair bit of trouble with failovers taking 10x+ longer than they should. We also set up encryption, and _that_ was also a PITA: we'd consistently get nodes with incorrect subjectAltNames. (Also, at the time, the certs were either for a short key or signed by a short key, I forget which. It was not acceptable at that time, either; this was only 1-2 years ago, and I'm guessing it hasn't been fixed.) Getting AWS to actually investigate, instead of "have you tried upgrading" (and there's _always_ an upgrade, it felt like), was a struggle. RDS MySQL's (maybe Aurora's? I don't recall fully) first implementation of spatial indexes was flat-out _broken_, and that was another lengthy support ticket. The point is that bugs will happen, and cloud platform support channels are _terrible_ at getting an engineer in contact with an engineer who can actually do something about the problem.

| renewiltord wrote:
| I much prefer just using RDS Aurora. Far fewer headaches. If I don't need low latency I'd use RDS Aurora no matter which cloud I'm hosted on. Otherwise I'll use hosted SQL.
| The reason I mention this is that Kubernetes requires a lot of management to run, so the best solution is to use GKE or the like. And if you're using managed k8s, there's little reason not to use managed SQL.
| The advantages of k8s are not that valuable for a SQL server cluster. You don't even really get colocation of data, because you're realistically going to use a GCE Persistent Disk or EBS volume, and those are network-attached anyway.

| sspies wrote:
| I have been running my own Postgres helm chart with read replication and pgpool2 for three years and never had major trouble. If you're interested, check out https://github.com/sspies8684/helm-repo

| blyry wrote:
| Ooh! I've been running the Zalando operator in production on Azure for about a year now, nothing crazy but a couple thousand qps and a TB of data spread across several clusters. It's been a little rough since it was designed for AWS, but pretty fun. At this point I'm 50/50: our team is small, and I'm not sure that the extra complexity added by k8s solved any problems that Azure's managed Postgres product doesn't also solve.
| We weren't sure we were going to stay on Azure at the time we made the decision, as well -- if I were running in a hybrid cloud environment I would 100% choose Postgres on k8s.
| The operator let us ramp up really quickly with Postgres as a POC and gave us mature clustering and point-in-time restoration, and the value is 100% there for dev/test/uat instances, but depending on our team growth it might be worth it to switch to managed for some subset of those clusters once "Logical Decoding" goes GA on the Azure side. Their Hyperscale option looks pretty fun as well; hopefully some day I'll have that much data to play with.
| I can also say that the Zalando crew has been crazy responsive on their GitHub; it's an extremely well managed open source project!

| GordonS wrote:
| So, a bit OT, but I'm looking for some advice on building a Postgres cluster, and I'm pretty sure k8s is going to add a lot of complexity with no benefit.
| I'm a Postgres fan and use it a lot, but I've never actually used it in a clustered setup.
| What I'm looking at clustering for is not really scalability (we're still at the stage where we can scale vertically), but high availability and backup -- if one node is down for an update, or crashes, the other node can take over -- and I'd also ideally like point-in-time restore.
| There seems to be a plethora of OSS projects claiming to help with this, so it looks like there isn't "one true way". I'd love to hear how people are actually setting up their Postgres clusters in practice.

| johncolanduoni wrote:
| Kubernetes can't change any database's HA or durability features; there's no magic k8s can apply to make a database that does e.g. asynchronous replication behave like one that does synchronous replication. You'll never gain any properties your underlying database is incapable of providing.
| However, if I had to run Postgres as part of something I deployed on k8s AND for some reason couldn't use my cloud provider's built-in solution (AWS RDS, Cloud SQL, etc.), I would probably go with using or writing a k8s operator. The big advantage of this route is that it gives you a good framework for coordinating the operational changes you need to handle to actually get failover and self-healing from a Postgres cluster, in a self-contained and testable part of your infrastructure (a sketch of what that declaration looks like follows the list of questions below).
| When setting up a few Postgres nodes with your chosen HA configuration, you'll quickly run into a few problems you have to solve:
| * I lose connectivity to an instance. Is it ever coming back? How do I signal to the system that it's dead and buried, so it knows to spin up a fresh replica, in the cases where this cannot be automatically detected?
| * How do I safely follow the process I need when upgrading a component (Postgres, HAProxy, PgBouncer, etc.)? How do I test this procedure, in particular the not-so-happy paths (e.g. a node decides to die while upgrading)?
| * How do I make sure that whatever daemon watches for whether I need to make a state change to the cluster (due to a failure or a requested change) can be deployed in an HA manner AND doesn't have to contend with multiple instances of itself issuing conflicting commands?
| * How do I verify that my application can actually handle failover in the way I expect? If I test this manually, how confident am I that it will continue to handle it gracefully when I next need it?
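| For illustration, declaring such a cluster to an operator is typically just a small custom resource. This is roughly what it looks like with the Zalando operator mentioned elsewhere in this thread -- written from memory of its example manifests, so treat the exact field names as assumptions that may differ between operator versions:
|
|     apiVersion: "acid.zalan.do/v1"
|     kind: postgresql
|     metadata:
|       name: acid-minimal-cluster
|     spec:
|       teamId: "acid"
|       numberOfInstances: 2      # one primary plus one replica; the
|                                 # operator handles failover between them
|       volume:
|         size: 10Gi              # PVC size per instance
|       postgresql:
|         version: "12"
|       databases:
|         app: app_owner          # database name -> owning role
|       users:
|         app_owner: []           # roles the operator should create
|
| The operator watches resources like this and reconciles the StatefulSet, services, and replication configuration to match: you handle the declaration, it handles the choreography.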
| A k8s operator is a nice way to crystallize these kinds of state management issues on top of a consistent and easily observable state store (namely the etcd instance behind the k8s API). Operators also provide a great way to run continuous integration tests, so you can actually throw the situations you're trying to prepare for at the implementation of the failover logic (and at your application code), and get some confidence that your HA setup deserves the name.
| But again, I wouldn't bite this off if you can use a managed service for the database. Pay someone else to handle that part, and focus on making your app actually not shit the bed if a Postgres failover happens. The vast majority of applications I've worked on that were pointed at an HA instance would have broken down (and in some cases did break down) during a failover, due to things like expecting durability but using asynchronous replication. You don't get points for "one of the two things that needed to work to have let us avoid that incident worked".

| GordonS wrote:
| > AND for some reason couldn't use my cloud provider's built in solution (AWS RDS, Cloud SQL, etc.)
| Money.
| > Pay someone else to handle that part
| Aside from the money (managed Postgres is _expensive_), I'd actually like to understand what good, high-availability Postgres solutions look like.

| random3 wrote:
| The main advantage of Kubernetes (especially in low-ops environments like GKE) is not scalability, but availability and ease of development (spinning things up and down is super easy). The learning curve to stand something up is not very high and pays off over time compared to SSH-ing into VMs.

| GordonS wrote:
| I'm very comfortable with containers (less so specifically with k8s), but generally for stateless or stateless-ish services. What are the advantages of k8s specifically for a database?

| lhenk wrote:
| Patroni might be interesting: https://github.com/zalando/patroni

| penagwin wrote:
| Compared to many databases, Postgres HA is a mess. It has built-in streaming replication, but no failover of any kind; all of that has to be managed by another application.
| We've had the best luck with Patroni, but even then you'll find the documentation confusing, have weird issues, etc. You'll also need to set up etcd or Consul to use it. That's right: you need a second database cluster to set up your database cluster... Great...
| I have no clue how such a community-favorite database has no clear solution to basic HA.

| peterwwillis wrote:
| Google Cloud blog, gently dissuading you from running a traditional DB in K8s: https://cloud.google.com/blog/products/databases/to-run-or-n...
| K8s docs explaining how to run MySQL: https://kubernetes.io/docs/tasks/run-application/run-replica...
| You could also run it with Nomad, and skip a few layers of complexity: https://learn.hashicorp.com/nomad/stateful-workloads/host-vo... / https://mysqlrelease.com/2017/12/hashicorp-nomad-and-app-dep...
| One of the big problems of K8s is that it's a monolith. It's designed for a very specific kind of org running microservices. Anything else and you're looking at an uphill battle to shim something into it.
| You can also skip all the automatic scheduling fanciness and just build system images with Packer, and deploy them however you like.
| If you're on a cloud provider, you can choose how many instances of what kind (manager, read replica) to deploy, using the storage of your choice, networking of your choice, etc. Later you can add cluster scheduling and other features as needed. This gradual approach to DevOps lets you get something up and running using best practices, but without immediately incurring the significant maintenance, integration, and performance/availability costs of full-fledged K8s.

| nightowl_games wrote:
| I think CockroachDB is designed for this.

| tyingq wrote:
| They've thought about the use case. But it still ends up being a cluster inside a cluster, which sounds potentially pretty bad to me. Clusters of different types, mostly unaware of each other. Schema changes and database version upgrades would be complicated.

| rafiss wrote:
| There certainly are pain points. I don't work on this myself, but one of our other engineers wrote this blog post [0] that discusses the experience of running CockroachDB in k8s and why we chose to use it for our hosted cloud product. Another complication mentioned in there is how to deal with the multi-region case.
| [0] https://www.cockroachlabs.com/blog/managed-cockroachdb-on-ku...

| zelly wrote:
| Why? So you pay more money to AWS? Deploying databases is a solved problem. What's the point of the overhead?

| IanGabes wrote:
| In my opinion, there are three database types.
| 'Small' databases are the first, and are easy to dump into Kubernetes. Any DB with a total storage requirement of 100GB or less (if I lick my finger and try to measure the wind) can be easily containerized and dumped into Kubernetes, and you will be a happy camper, because it makes prod/dev testing easy and you don't really need to think too much here.
| 'Large' databases are too big to seriously put into a container. You will run into storage and networking limits for cloud providers. Good luck transferring all that data off bare metal! Your tables will more than likely need to be sharded to even start thinking about gaining any benefit from Kubernetes. From my own rubric, my team runs a "large" MySQL database with large sets of archived data that uses more storage than managed cloud SQL solutions can provide. It would take us months to redesign it to take advantage of the MySQL clustering mechanisms, along with climbing the learning curve that comes with them.
| 'Massive' databases need to be planned and designed from the ground up to live in multiple regions and leverage the respective clustering technologies. Your tables are sharded, replicated, and backed up, and you are running in different DCs attempting to serve edge traffic. Kubernetes wins here as well but, as the OP suggests, not without high effort. K8s gives you the scaling and operational interface to manage hundreds of database nodes.
| It seems weird to me that Vitess and the OP belabour their monitoring, pooling, and backup story, when I think the #1 reason you reach for an orchestrator in these problem spaces is scaling.
| All that being said, my main point here is that orchestration technologies are tools, and picking the right one is hard, but can be important :) Databases can go into k8s! Make it easy on yourself and choose the right databases to put there.

| aprdm wrote:
| That looks very interesting and super complex.
| I wonder how many companies really need this complexity. I bet 99.99% of companies could get away with vertically scaling the writes and horizontally scaling the read-only replicas, which would reduce the number of moving parts a lot.
| I have yet to play much with Kubernetes, but when I see those diagrams it just baffles me how people are OK with running so much complexity in their technical stack.

| cryptonector wrote:
| "But it has to run in Kubernetes!"

| mamon wrote:
| It actually is not that complex. I'm using the Crunchy Postgres Operator at my current employer. You get an Ansible playbook to install the operator inside Kubernetes, and after that you get a command-line administration tool that lets you create a cluster with a simple
|
|     pgo create cluster <cluster_name>
|
| command.
| Most administrative tasks, like creating or restoring backups (which can be automatically pushed to S3), are just one or two pgo commands.
| The linked PDF looks complex because it:
| a. compares 3 different operators
| b. goes into implementation details that most users are shielded from.
| And I'm actually not sure which one of the three operators the author is recommending :)

| dewey wrote:
| Usually things become complex once something isn't going as planned. Like when your database slows down because the pods get scheduled on a weird node with some noisy neighbour, your backups fail because the node went down, or other more hidden issues that take a lot longer to debug compared to some Postgres running on a normal compute instance somewhere.
| It's just additional layers to dig through if something goes wrong. If everything works, even the most complex systems are nice to operate, so I wouldn't call it less complex just because someone wrote a nice wrapper for the happy path.

| jashmatthews wrote:
| Crunchy PGO is super cool, but I'm not sure how we got to the idea that it's not that complex compared to a managed service like RDS.

| craigkerstiens wrote:
| Coming from someone at Crunchy, I don't disagree with the notion of a managed service being easier than running/managing it yourself inside Kubernetes. Clicking a button and having things taken care of for you is great.
| Though personally I do feel that many of the managed services have not evolved/changed/improved since their inception many years ago. There is definitely some opportunity to innovate here, though that's probably not actually coupled with running it in K8s itself.

| qeternity wrote:
| I don't think anyone would argue that RDS isn't vastly simpler. If it weren't, there'd be no reason to pay such a premium for it.

| Townley wrote:
| I agree that the k8s ecosystem isn't quite as complex as it seems at first, but specifically running stateful apps does come pretty close to earning the bad reputation.
| (Disclaimer: I've tried and failed several times to get pgsql up and running in k8s with and without operators, so that either makes me unqualified to discuss this, or perfectly qualified to discuss this.)
| If the operator were simple enough to be installed/uninstalled via a helm chart that Just Worked, I'd feel better about the complexity. But running a complicated, non-deterministic Ansible playbook scares me. The other options (installing a pgo installer, or installing an installer to your cluster) are no better.
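| For reference, the "Just Works" experience I have in mind is something like the stock Bitnami chart -- values keys change between chart versions, so treat these flags as illustrative rather than exact:
|
|     helm repo add bitnami https://charts.bitnami.com/bitnami
|     helm install my-postgres bitnami/postgresql \
|       --set replication.enabled=true \
|       --set persistence.size=20Gi
|
| That gets you a primary plus streaming replicas in a couple of commands, but not the automated failover that an operator is supposed to add on top.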
| Also, configuring the operator is more complicated than it should be. Devs and sysadmins alike are used to `brew install postgresql-server` or `apt install postgresql-server` working just fine for 99% of use cases. I'll grant that it's not apples-to-apples, since HA pgsql has never been easy, but if the sales pitch is that any superpower-less k8s admin can now run Postgres, I think the manual should be shorter.

| craigkerstiens wrote:
| Not the author of the slides, but I know him well. A number of things to chime in on. First, thanks for the kind words about the Crunchy operator.
| Second, on the earlier question higher in the thread about why you would choose to run a database in K8s: in my experience, and from what I've observed, it's not so much that you explicitly choose to run a database in K8s. Instead, you've decided on K8s as your orchestration layer for a lot of workloads, and it's become your standardized mechanism for deploying and managing apps. In some sense it's more that it's the standard deployment mechanism than anything else.
| If you're running and managing a single Postgres database and don't have any K8s set up anywhere, I can't say I'd recommend going all in on K8s just for that. That said, if you are using it, then going with one of the existing operators is going to save you a lot.

| perrygeo wrote:
| Building systems on top of complexity doesn't shield anyone from it. The author acknowledges this explicitly:
| > High Effort - Running anything in Kubernetes is complex, and databases are worse
| By definition, it's more stuff you need to know.
| Even if the K8s operator saves time for 95% of the use cases, the last 5% is still required. For instance, how do these operators handle upgrading extensions that require on-disk changes? Can you upgrade them concurrently with major-version PG upgrades? When the operator doesn't provide a command-line admin tool that fits your needs, how do you proceed?

| wiml wrote:
| I've come to the conclusion that, much like how purchasing decisions seem irrational until you realize that different kinds of purchases come out of different budgets, there are different "complexity budgets" or "ongoing operational maintenance burden" budgets in an organization, and some are tighter than others.

| merb wrote:
| Btw, the Zalando operator is rougher but still pretty easy to use. The Crunchy operator does not work in every environment but is extremely simple (btw, the Crunchy operator uses the building blocks of the Zalando one). I've used the Zalando operator since k8s 1.4: no data loss, everything just works. OK, major upgrades are rough, but they are rough even without the Zalando operator.

| adamcharnock wrote:
| I generally work with smaller companies, but early on (Kubernetes 1.4-ish) I found that hosting mission-critical stateful services inside Kubernetes was more trouble than it was worth. I now run stand-alone Postgres instances in which each service has its own DB. I've found this very reliable.
| That being said, I think Kubernetes now has much better support for this kind of thing. But given my method has been so stable, I just keep going with it.

| LoSboccacc wrote:
| > stateful services
| Yeah, either these services natively support partitioning, failover, and self-recovery, or you have to be extremely careful never to cause any eviction or agent crash.
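| (The usual guard against the voluntary kind of eviction -- node drains, rolling upgrades -- is a PodDisruptionBudget, so the cluster is never allowed to take out the last healthy instance. A minimal sketch, with illustrative names:
|
|     apiVersion: policy/v1beta1     # policy/v1 on newer clusters
|     kind: PodDisruptionBudget
|     metadata:
|       name: postgres-pdb
|     spec:
|       minAvailable: 1              # keep at least one pod untouched
|       selector:
|         matchLabels:
|           app: postgres
|
| It does nothing for involuntary failures like an agent crash, which is why the failover logic still has to live somewhere.)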
| Even something born for the cloud, like CockroachDB, can fail in interesting ways if the load order varies. You can't just autoscale it, because every new node has to be nudged into action with a manual cluster init, and draining nodes after the peak means manually telling the cluster not to wait for each node to come alive ever again, waiting for the repartitioning, and then repeating for as many nodes as you need to scale back.

| jeremychone wrote:
| Looks interesting, but it's difficult to get the details from just the slides.
| Also, not sure why Azure Arc still gets mentioned. I would have expected something more cloud-independent.
| Our approach, for now, is to use Kubernetes Postgres for dev, test, and even stage, but cloud Postgres for prod. We have one db.yaml that in production just becomes an endpoint, so that none of the services even have to know whether it is an internal or external Postgres.
| Another interesting use of Kubernetes Postgres would be for some transient but bigger-than-memory store that needs to be queryable for a certain amount of time. It's probably a very niche use case, but the deployment could be dramatically more straightforward since HA is not performance-bound.

| sasavilic wrote:
| Unless you have really good shared storage, I don't see any advantage to running Postgres in Kubernetes. Everything is more complicated without any real benefit. You can't scale it up, you can't move the pod. If PG fails to start for some reason, good luck jumping into the container to inspect/debug stuff. I am neither going to upgrade PG every 2 weeks, nor is it my fresh new microservice that needs to be restarted when it crashes or scaled up when I need more performance. And PG has a high-availability solution which is kind of orthogonal to what k8s offers.
| One could argue that for the sake of consistency you could run PG in K8s, but that is just a hammer & nail argument for me.
| But if you have really good shared storage, then it is worth considering. Still, I don't know if any network-attached storage can beat locally attached RAID of solid state disks in terms of performance and/or latency. And there is/was the fsync bug, which is terrible in combination with somewhat unreliable network storage.
| For me, I see any database the same way I see etcd and the other components of the k8s masters: they are the backbone. And inside the cluster I run my apps/microservices. These apps are subject to frequent change and upgrades and thus profit most from having automatic recovery, failover, (auto)scaling, etc.

| jonfw wrote:
| The nice thing about running a DB inside a cluster is running your entire application, end to end, through one unified declarative model. It's really, really easy to spin up a brand new dev or staging environment.
| Generally, though, in production you're not going to be taking down DBs on purpose. If it's not supposed to be ephemeral, it doesn't fit the model.

| ci5er wrote:
| My biases are probably 5+ years old, but it used to be that running Postgres on anything other than bare metal (for example, running it in Docker) was fraught with all sorts of data/file/IO sync issues that you might not be able to recover from. So I just got used to running the databases on metal (graph, Postgres, columnar) with whatever reliability schemes, and leaving Docker and k8s out of that.
| Has that changed? (It may well have, but once burned, twice shy and all that.)
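| (Side note: when the database stays on metal or in a managed service while the apps run in k8s, the usual glue -- essentially the db.yaml trick jeremychone describes above -- is an ExternalName or selector-less Service, so in-cluster apps always resolve the same DNS name regardless of where Postgres actually lives. A rough sketch; the hostname is illustrative:
|
|     apiVersion: v1
|     kind: Service
|     metadata:
|       name: postgres             # the name applications connect to
|     spec:
|       type: ExternalName
|       externalName: mydb.example.rds.amazonaws.com   # illustrative
|
| In dev/test the same `postgres` name can instead be backed by a normal selector-based Service pointing at an in-cluster pod, so application config never changes.)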
| random3 wrote:
| GCP volumes are over the network already. You can deploy stateful workloads using StatefulSets. We've run HBase workloads for development purposes (about 20-30x cheaper than BigTable) and it worked great (no issues for over 12 months). While Postgres is hardly a distributed database, there may be some advantages to ensuring availability, and perhaps even more in a replicated setup.

| georgebarnett wrote:
| Why are you comparing HBase with Postgres? They are very different technologies with completely different architectural constraints.

| qeternity wrote:
| You don't run shared/network storage. You run PVCs on local storage and you run an HA setup. You ship log files every 10s/30s/1m to object storage. You test your backups regularly, which k8s is great for.
| All of this means that I don't worry about most of the things you mention. PG upgrade? Fail over and upgrade the pod. Upgrade fails? wal-g clone from object storage and rejoin the cluster. Scaling up? Adjust the resource claims. If the resource claims necessitate node migration, see the failover scenario. It's so resilient. And this is all with RAID 10 NVMe direct-attached storage, just as fast as any other setup.
| You mention etcd, but people don't run etcd the way you're describing Postgres. You run a redundant cluster that can achieve quorum and tolerate losses. If you follow that paradigm, you end up with Postgres on k8s.

| dillonmckay wrote:
| How big are your DBs in terms of size?

| qeternity wrote:
| 50GB - multi-TB

| dilyevsky wrote:
| IME most PG deployments don't need insanely high IOPS and become CPU-bound much quicker. So running EBS, GCP PD-SSD, or even Ceph PD is usually enough.

| dijit wrote:
| This is very contrary to my own experience: we're IOPS-bound far more than we're CPU-bound.
| This is true in video games (current job) and e-commerce (what became part of Oracle NetSuite).

| ianhowson wrote:
| IOPS limitations and network latency are _the_ reason I want to run my Postgres replicas in k8s. Every machine gets NVMe and a replica instance. They're still cattle, but they can serve a _lot_ of requests.
| Database CPU usage is negligible.
| It was easier to put a replica on every host than to try to rearchitect things to tolerate single-digit-millisecond RTT to a cloud database.

| kgraves wrote:
| What's the use case for running databases in k8s? Is this a widely accepted best practice?

| ghshephard wrote:
| I guess I look at it the opposite way, which is: why _wouldn't_ you run everything in k8s once you have the basic investment in it? It lets you spin up new environments, vertical scaling becomes trivial, and disaster recovery / business continuity is automatic, along with everything else in your k8s environment.

| etxm wrote:
| Losing your data.
| JK, sort of.
| My first go-to is something like RDS, but I've run Postgres in k8s for pretty much one use case: everything else is already in k8s _and_ I need a PG extension or functionality not present in RDS.

| dkhenry wrote:
| I don't think it's a widely accepted best practice yet, mainly because it's hard to do well, and by itself it's hard to take advantage of the benefits of using k8s. The company I work for has been building out the tools required to run databases well in k8s (fully automated, fully managed, survivable, and scalable), and we are seeing people come around to it.
| Once you have all the tools in place, you can have a system that scales right alongside your applications on heterogeneous hardware, isn't dependent on any single server, can be deployed and managed exactly like your applications, and can be transported anywhere. If you want to take a look, check out planetscale.com

| caniszczyk wrote:
| For the MySQL folks, see Vitess as an example of how to run MySQL on Kubernetes: https://vitess.io

___________________________________________________________________
(page generated 2020-06-29 23:00 UTC)