[HN Gopher] How we upgraded our 4TB Postgres database
___________________________________________________________________

How we upgraded our 4TB Postgres database

Author : mrbabbage
Score  : 251 points
Date   : 2022-04-19 15:07 UTC (7 hours ago)

(HTM) web link (retool.com)
(TXT) w3m dump (retool.com)

| giovannibonetti wrote:
| By the way, Google Cloud recently launched in-place upgrade of Postgres instances. A few days ago we used it to upgrade our multi-TB database at my company as well.
|
| https://cloud.google.com/blog/products/databases/cloud-sql-l...
| davidkuennen wrote:
| Woah, this is great. Been waiting for this, since Cloud SQL has been very reliable in the past few years I've been using it, but upgrading was always a pain.
| aeyes wrote:
| What is "in-place" about this? According to the docs you'll have ~10min of downtime and you'll lose table statistics, which is exactly what happens when you run pg_upgrade manually.
|
| The biggest problem with all the cloud providers is that you won't know exactly when this 10 minute downtime window will start.
|
| I guess the only advantage here is that you don't have to do 9->10->11->12->13->14 like in the past, and maybe that was one of the blockers Azure has. AWS allows you to skip some major versions, but 9->14 is not possible.
| Teletio wrote:
| The first upgrade strategy is not the normal or easy one on anything production, btw.
|
| Very small companies might be able to do this on a 'no load day', but from a pure business perspective, running your db twice is easier and way less risky.
|
| You could have done this even without downtime by letting your connection proxy handle the switch.
| gamegod wrote:
| If you're running a 4 TB Postgres database but you still have to worry about this level of maintenance, what's the value proposition of using a hosted service? There's usually an insane markup on any hosted Postgres instance.
|
| If you want to pay multiple thousands of dollars a month for a database server, it's WAY cheaper just to slap a server with a ton of drives in colocation.
| orangepurple wrote:
| Might be an accounting scam^H^H^H^H trick to book the costs as an operating expense vs a capital expenditure. In general capital expenditures have to be planned carefully, held on the books for years, and show a return on investment. Perhaps an accountant can provide more detail.
| ipaddr wrote:
| This is a big reason why yearly pricing vs buying hardware is popular (one of many reasons).
|
| An expense can be used that year, whereas capital expenses vest over years, so only a portion can be applied each year.
| mywittyname wrote:
| I feel like the explanation is much simpler: it's easier to use Azure/AWS/GCP for everything.
|
| Yeah, this migration might have been somewhat painful, but it looks like it's a situation the company experiences every few years.
| Spivak wrote:
| It's crazy how many apps opt for an RDBMS for append-only data like audit events. It's so tantalizing at the beginning but turns into a nightmare as time marches forward.
|
|     audit events -> queue -> elastic -> blob storage
|
| is so easy to maintain, and we save TBs from living in the DB.
| luhn wrote:
| Mind expanding on what the "nightmare" is?
| mrbabbage wrote:
| We (Retool) are going to be doing this very soon and cut our database size by 50%+. And you're exactly right: it's so easy to get started by sticking audits (or other append-only data schema) in an RDBMS, but it quickly becomes a headache and bottleneck.
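
For readers wondering what the interim fix looks like while the data still lives in Postgres: a minimal sketch, assuming Postgres 11+ declarative partitioning, psql, and a hypothetical audit_events table reachable via $DATABASE_URL (all names here are illustrative, not Retool's schema).

      psql "$DATABASE_URL" <<'SQL'
      CREATE TABLE audit_events (
          id         bigserial,
          created_at timestamptz NOT NULL DEFAULT now(),
          payload    jsonb
      ) PARTITION BY RANGE (created_at);

      CREATE TABLE audit_events_2022_04 PARTITION OF audit_events
          FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');

      -- BRIN stays tiny even on TB-scale tables, since audit rows
      -- arrive in rough timestamp order.
      CREATE INDEX ON audit_events USING brin (created_at);
      SQL

      # Retiring a month: detach, archive to cheap storage, drop.
      psql "$DATABASE_URL" -c "ALTER TABLE audit_events DETACH PARTITION audit_events_2022_04"
      psql "$DATABASE_URL" -c "COPY audit_events_2022_04 TO STDOUT" | gzip > audit_2022_04.tsv.gz
      psql "$DATABASE_URL" -c "DROP TABLE audit_events_2022_04"

Detaching a partition is a metadata operation, so retiring old audit data avoids the multi-hour DELETE (and the resulting bloat) that makes append-only tables a headache in an RDBMS.
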
| abrkn wrote:
| What will you use as a queue and storage?
| mrbabbage wrote:
| To be determined!
| baq wrote:
| Up until some size, the headache from maintaining a separate datastore is bigger. Everything should be in the RDBMS until proven otherwise, for the sake of simplicity. It's actually amazing how much you can squeeze out of 'old school' databases.
|
| e.g. https://docs.microsoft.com/en-us/azure/azure-sql/database/le...
| nrmitchi wrote:
| One of the biggest things that keeping audit records in your DB gives you is transactionality around your audit logs. Sending audit events to an external system (quite often) loses this, and the resources needed to address this before you have to are way larger than a slightly larger AWS/GCP/Azure/<insert-computer-provider-here> bill.
| magicalhippo wrote:
| We're implementing something similar to what OP describes, but we'll keep the "queue" in the DB in order to insert the application audit event in the same transaction as the data change. A background process then uploads to secondary storage.
|
| We won't have billions of rows though, so once uploaded to secondary storage we'll just clear the blob field and set a "processed" flag.
|
| This way we can find all the relevant keys for a given order, invoice etc. quickly, based on a partial key search in the database, and transparently fetch from either the db directly or secondary storage as needed.
| onphonenow wrote:
| Actually, I've seen more problems with folks mixing lots of different tools up than I have from folks doing an append-only audit event in an RDBMS.
|
| When your audit trail is in the DB, you can pretty easily surface audit events to your customers. Who changed what when is just another feature. Capturing audit events is also usually pretty smooth.
|
| The folks doing the blob storage route, you would not BELIEVE the complexity they have to spin up to expose very simple histories etc. This matters a LOT in some spaces (financial etc), less so in others.
|
| In my RDBMS model, who changed this field when, from what to what, is a basic select. You can even shard by recordID or similar if you want to reduce table scans; a good selection of indexes can be a huge help as well. In most cases users don't mind a bit of latency on these queries.
| jeffbee wrote:
| My only experience in the financial sector indicates the opposite. The firm held its trading history for tens of thousands of accounts going back 60 years in a SQL Server. Anyone who wanted a question answered had to submit it for overnight analysis and get the answer the next day. But in an optimal non-RDBMS representation, said trading history could be condensed to a single 200MB flat file that could be queried interactively, in microseconds. Dumping the RDBMS for most use cases pretty much revolutionized the daily experience for the people at that firm.
| icedchai wrote:
| Were they running this on a 386? Why didn't they optimize their database?
| jeffbee wrote:
| It was "optimized" for the aesthetic concerns of RDBMS purists and for ease of implementation of certain other systems. In other words, it was optimal for the developers and severely sub-optimal for the users, which is a problem endemic to RDBMS fandom.
| icedchai wrote:
| So it was basically "write optimized", fully normalized, and required at least a half dozen joins to get anything out of it?
| onphonenow wrote:
| This seems borderline impossible? I'd be curious if there were missing indexes, bad index selection, or efforts to do things like joins? Normally for very large data sets you can shard by account if needed, if that's the common filter and there is some insane table scan. Audit stuff tends to be "sparse", so if you can get an index which just tells you which pages have results, that is usually a 100X speedup with a pretty small index size.
|
| But agreed, a daily dump to something can go a long way to unlocking other tools - in govt this is especially true, not because the DBMS is hard to use, but because with so many layers of consultants and others in the way it's not usable.
| baq wrote:
| SQL Server can easily handle such datasets if columnstore is employed. I wouldn't be surprised if a single weekend of building a proper index (...being very generous here) wouldn't make their DB go literally 100x faster.
| jeffbee wrote:
| The problem at that firm wasn't the SQL Server, it was the RDBMS mindset, which led them to do joins between tables of journaled trades and much larger tables of corporate actions going back decades, instead of materializing a more useful intermediate result. This is my main beef with RDBMSs: they lead to cognitive hazards of this type. It's not that the databases are themselves naturally bad systems.
| more_corn wrote:
| Set site read only. Snapshot master. Create new db from snapshot. Point site to new db?
| mrbabbage wrote:
| Hey folks--I wrote the post! This was my biggest Postgres project to date, and it proved quite tricky since I didn't rehearse with a test database of the same size. I learned a bunch about Postgres, not least the incredibly powerful NOT VALID option for safely and quickly adding constraints (sketched further down the thread).
|
| Happy to stick around and answer any questions you have.
| palijer wrote:
| I'm curious why a test run on a properly sized replica database wasn't in the testing plans. That is something I've been ensuring happens for a while now for similar projects.
| wrs wrote:
| Wondering why you couldn't use an in-place upgrade (pg_upgrade [0]). Not supported in Azure?
|
| [0] https://www.postgresql.org/docs/current/pgupgrade.html
| majewsky wrote:
| I'm guessing this would still take more than one hour since it needs to rewrite all (or most) of the 4TB?
| aidos wrote:
| Not in my experience. You can use the --link flag to get it to use hard links so it doesn't need to move the data at all. Have been through this process myself a few times and it only took seconds on a 50-100GB db. I'm always a little surprised with how well it works.
| aoms wrote:
| This instance must cost you a ton
| mrbabbage wrote:
| More than I care to admit!
|
| As mentioned downthread, we're doing some work soon to remove the audit table from the database, which will cut storage usage by over 50%.
| tmikaeld wrote:
| Seems to be around $1,784/month according to the pricing calculator
| loopdoend wrote:
| Let me save you a bunch of money: https://adamj.eu/tech/2021/04/13/reindexing-all-tables-after...
| karmelapple wrote:
| This technique saved us from seriously increasing the cost of our Heroku Postgres instance. Thank goodness it exists and works so well. Multiple 80+ GB indexes shrank down to less than 10GB after just a couple of hours.
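
The trick behind the link loopdoend and karmelapple are discussing is index bloat: a rebuilt index is far more compact than one that grew through years of churn. A minimal sketch, assuming Postgres 12+ (for REINDEX ... CONCURRENTLY) and the hypothetical audit_events table from earlier; the rebuild happens side by side and swaps in without blocking reads or writes.

      psql "$DATABASE_URL" -c "SELECT pg_size_pretty(pg_indexes_size('audit_events'))"
      psql "$DATABASE_URL" -c "REINDEX TABLE CONCURRENTLY audit_events"
      psql "$DATABASE_URL" -c "SELECT pg_size_pretty(pg_indexes_size('audit_events'))"
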
| Teletio wrote:
| What's your strategy?
|
| Using compression? A second instance for old data? Creating in-between states?
|
| Just curious :)
| clessg wrote:
| > we're doing some work soon to remove the audit table from the database
|
| Out of pure curiosity, what are you replacing it with (if anything)? Just a simple rotating log file?
| icheishvili wrote:
| You can partition your audit/event table by time period and archive old events [1], or you can avoid the records hitting the database in the first place by generating the events elsewhere to begin with [2].
|
| [1] https://github.com/icheishvili/audit-trigger
| [2] https://github.com/eulerto/wal2json
| mrbabbage wrote:
| Exact strategy to be determined--we're looking at various data layers at the moment. I wish we could do something simple like a rotating log file, but we want to be able to query it in the app (for instance, to show recent logins).
| jamesrr39 wrote:
| Have you considered an OLAP database like Clickhouse or QuestDB? An OLAP database would be a much better fit for audit tables given the read and append-only writing requirements, would compress better, and you might be able to fit it in directly, without changing any app code, with a Postgresql foreign data wrapper.
| carlhjerpe wrote:
| I was thinking TimescaleDB
| clessg wrote:
| Very interesting! Looking forward to the next blog post on that ;)
| abrkn wrote:
| Great read! Thanks
| tmikaeld wrote:
| How come you didn't consider a distributed database like CockroachDB, instead of Postgres?
| nrmitchi wrote:
| I'm not OP, but they were upgrading from Postgres 9.6, which at least _implies_ that this initial db was from ~2017.
|
| This is barely past the initial release of Cockroach. It would have been kind of crazy for the Retool team to use an experimental db with a lack of history when building up their product (which was not dependent on experimental new features).
| tmikaeld wrote:
| Baidu migrated from an old MySQL cluster to CockroachDB for their 40TBs of databases; it now runs on 20 clusters.
|
| When does it stop being experimental to you?
|
| https://www.cockroachlabs.com/customers/
| flyinknockin wrote:
| I don't think they're calling Cockroach experimental right now - rather that it was experimental in 2017.
| burai wrote:
| I used to work at a company that had MongoDB as the main database. Leaving a lot of criticism aside, the replica set model for Mongo made upgrades much easier than the ones in other types of databases.
| stingraycharles wrote:
| While that's true, managed services on e.g. AWS provide hot replicas as well, which you can use to upgrade the database and do a failover to the new version.
|
| We actually migrated from vanilla Postgres to Aurora that way with minimal risk / downtime; it was a really smooth process.
| sscarduzio wrote:
| Hey @mrbabbage! Retool customer here! Great service :)
|
| A non-distributed RDBMS is a great (yet underrated) choice. Thank you for the good writeup.
|
| I was thinking you could have a much less delicate migration experience next time (some years from now), so you can go for a quicker parallel "dump and restore" migration.
|
| For example:
|
| - Client-side sharding in the application layer: you could shard your customers' data across N smaller DB instances (consistent hashing on customer ID)
|
| - Moving the append-only data somewhere else than Postgres prior to the upgrade. You don't need RDBMS capabilities for that stuff anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to time series data.
|
| WDYT?
| mrbabbage wrote:
| The second bullet point is underway! Getting audit events out of the main database will be a major headache saver.
|
| The first bullet point is on our radar for the near term. We have a very natural shard key in our schema (the customer ID), with AFAIK no relationships across that shard key. And once we start horizontally sharding, we can do cool things like putting your data in a shard geographically close to you, which will greatly increase app performance for our non-US customers. Exciting stuff coming down the pike!
| sscarduzio wrote:
| Oh cool! Yes, locality is very important! Great idea :)
| georgewfraser wrote:
| It is amazing how many large-scale applications run on a single or a few large RDBMS. It seems like a bad idea at first: surely a single point of failure must be bad for availability and scalability? But it turns out you can achieve excellent availability using simple replication and failover, and you can get _huge_ database instances from the cloud providers. You can basically serve the entire world with a single supercomputer running Postgres and a small army of stateless app servers talking to it.
| lazide wrote:
| Well, it's always been that way.
|
| The big names started using no-sql type stuff because their instances got 2-3 orders of magnitude larger, and that didn't work. Doing all the denormalization adds a lot of other overhead and problems though, but if you literally have multi-PB metadata stores, it's not like you have a choice.
|
| Then everyone started copying them without knowing why... and then everyone forgot how much you can actually do with a normal database.
|
| And hardware has been getting better and cheaper, which makes it only more so.
|
| Still not a good idea to store multi-PB metadata stores in a single DB though.
| more_corn wrote:
| Ideally you offer developers both a relational data store and a fast key-value data store. Train your developers to understand the pros and cons and then step back.
|
| There's nothing inherently wrong with a big db instance. The cloud providers have fantastic automation around multi-AZ masters, read replicas and failover. They even do cross-region or cross-account replication.
|
| That being said, multi-PB would raise an eyebrow.
| seunosewa wrote:
| Large solid state drives really changed the game; your working set doesn't have to fit in RAM for decent performance anymore.
| openthc wrote:
| With Postgres you'll want to tune those cost parameters however. E.g. lowering the random page cost will change how the planner does things on some queries. But don't just blindly change it -- modify the value and run the benchmark again (sketched after this subthread). The point is that the SSD is not 10x the cost of RAM (0.1 vs 1.0). In our case the planner moved a few queries to what I always assumed was the slower sequential scan -- but it's only slower depending on your table size (how tall and how wide). I mean, PG works awesome w/o tweaking that stuff, but if you've got a few days to play with these values it's quite educational.
| jghn wrote:
| > Then everyone started copying them without knowing why
|
| People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.
|
| Having gone downward in scale over the last few years of my career, it has been eye-opening how many people tell me X won't work due to "our scale", and I point out that I have _already_ used X in prior jobs at a scale that's much larger than what we have.
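
The benchmark-first loop openthc describes above, as a minimal sketch. The query, table name, and the 1.1 setting are all illustrative, not recommendations; on managed services this setting usually goes through the provider's configuration flags instead of ALTER SYSTEM.

      # Baseline plan and timing for a representative query.
      psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42"

      # Tell the planner random I/O on SSD is nearly as cheap as
      # sequential I/O, then re-run the same benchmark and compare.
      psql "$DATABASE_URL" -c "ALTER SYSTEM SET random_page_cost = 1.1"
      psql "$DATABASE_URL" -c "SELECT pg_reload_conf()"
      psql "$DATABASE_URL" -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42"
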
So they hear "Use X instead of Y | when operating at scale", and all of a sudden we have people | implementing distributed datastore for a few MB of data. | | Having gone downward in scale over the last few years of my | career it has been eye opening how many people tell me X | won't work due to "our scale", and I point out I have | _already_ used X in prior jobs for scale that 's much larger | than what we have. | dzhiurgis wrote: | Place I've recently left had 10M record MongoDB table | without indexes which would take tens of seconds to query. | Celery was running in cron mode every 2 second or so | meaning jobs would just pile up and redis eventually ran | out of memory. No one understood why this was happening so | just restart everything after pagerduty alert... | lazide wrote: | Yikes. Don't get me wrong, it's always been this way to | some extent - not enough people who can look into a | problem and understand what is happening to make many | things actually work correctly, so iterate with new shiny | thing. | | It seems like the last 4-5 years though have really made | it super common again. Bubble maybe? | | Huge horde of newbs? | | Maybe I'm getting crustier. | | I remember it was SUPER bad before the dot-com crash, all | the fake it 'til you make it too. I even had someone | claim 10 years of Java experience who couldn't write out | a basic class on a whiteboard at all, and tons of folks | starting that literally couldn't write a hello world in | the language they claimed experience in, and this was | before decent GUI IDEs. | dzhiurgis wrote: | Technically we were a tech startup with 10+ "senior" | engineers which scrape entire web ;D | [deleted] | lazide wrote: | 100% agree. I've also run across many cases where no-one | bothered to even attempt any benchmarks or load tests on | anything (either old or new solutions), compared latency, | optimize anything, etc. | | Sometimes making 10+ million dollar decisions off that gut | feel with literally zero data on what is actually going on. | | It rarely works out well, but hey, have to leave that | opening for competition somehow I guess? | | And I'm not talking about 'why didn't they spend 6 months | optimizing that one call which would save them $50 type | stuff'. I mean literally zero idea what is going on, what | actual performance issues are, etc. | cortesoft wrote: | This is funny, because I suffer from the opposite issue... | every time I try to bring up scaling issues on forums like | HN, everyone says I don't actually need to worry because it | can scale up to size X... but my current work is with | systems at 100X size. | | I feel like sometimes the pendulum has swung too far the | other way, where people deny that there ARE people dealing | with actual scale problems. | dmd wrote: | Yep. I've personally been in the situation where I had to | show someone that I could do their analysis in a few | seconds using the proverbial awk-on-a-laptop when they were | planning on building a hadoop cluster in the cloud because | "BIG DATA". (Their Big Data was 50 gigabytes.) | strictfp wrote: | I agree in principle. But one major headache for us has been | upgrading the database software without downtime. Is there any | solution that does this without major headaches? I would love | some out-of-the-box solution. 
| AtlasBarfed wrote:
| Cassandra can do it, since it has cell-level timestamps: you can mirror online writes and clone existing data to the new database, and there's no danger of newer mutations being overwritten by the bulk-restored data.
|
| Doing an active no-downtime database migration basically involves having a coherent row-level merge policy (assuming you AT LEAST have a per-row last-updated column), or other tricks. Or maybe you temporarily write cell-level timestamps and then drop them later.
|
| Or if you have data that expires on a window, you just do double-writes for that period and then switch over.
| simonw wrote:
| The best trick I know of for zero-downtime upgrades is to have a read-only mode.
|
| Sure, that's not the same thing as pure zero-downtime, but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.
|
| While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.
|
| I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.
|
| A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.
|
| Depends how much you can get done in 5 seconds though!
| kgeist wrote:
| > The best trick I know of for zero-downtime upgrades is to have a read-only mode.
|
| I've done something similar, although it wasn't about upgrading the database. We needed to not only migrate data between different DB instances, but also between completely different data models (as part of refactoring). We had several options, such as proper replication + schema migration in the target DB, or making the app itself write to two models at the same time (which would require a multi-stage release). It all sounded overly complex to me and prone to error, due to a lot of asynchronous code/queues running in parallel. I should also mention that our DB is sharded per tenant (i.e. per organization). What I came up with was much simpler: I wrote a simple script which marked a shard read-only (for this feature), transformed and copied data via a simple HTTP interface, then marked it read-write again, and proceeded to the next shard. All other shards were read-write at any given moment. Since the migration window only affected a single shard at any given moment, no one noticed anything: for a tenant, it translated to 1-2 seconds of not being able to save. In case of problems it would also be easier to revert a few shards than the entire database.
| nuclearnice3 wrote:
| I like this approach.
|
| I'm picturing your migration script looping over shards. It flips one to read-only, migrates, then flips it back to read-write.
|
| How did the app handle having some shards in read-write mode pre-migration and other shards in read-write mode post-migration simultaneously?
| kgeist wrote:
| Yes, it simply looped over shards; we already had a tool to do that. The app handled it by proxying calls to the new implementation if the shard was marked as "post-migration"; the API stayed the same. If it was "in migration", all write operations returned an error. If the state was "pre-migration", it worked as before.
|
| I don't remember the details anymore, but it was something about the event queue or the notification queue which made me prefer this approach over the others. When a shard was in migration, queue processing was also temporarily halted.
|
| Knowing that a shard is completely "frozen" during migration made it much easier to reason about the whole process.
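
kgeist's loop, reduced to a sketch. The shard list and the three helper commands are hypothetical stand-ins for the real tooling (which, per the comment, also halted queue processing per shard):

      for shard in shard_001 shard_002 shard_003; do
          mark_shard_readonly "$shard"     # hypothetical helper: writes to this shard now fail fast
          migrate_shard "$shard"           # hypothetical helper: transform + copy via the HTTP interface
          mark_shard_readwrite "$shard"    # flip to "post-migration"; the app proxies to the new model
      done

Only one shard is ever frozen, so the blast radius of a failure (and of a rollback) stays at one tenant group at a time.
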
| ww520 wrote:
| The system needs to be architected in a certain way to allow upgrades without downtime. Something like Command and Query Responsibility Segregation (CQRS) would work. An update queue serves as the explicit transaction log, keeping track of the updates from the frontend applications, while the database at the end of the queue applies updates and serves as the querying service. Upgrading the live database then just means having a standby database with the new version of the software replay all the changes from the queue to catch up to the latest changes, pausing the live database from taking new changes from the queue when the new db has caught up, switching all client connections to the new db, and shutting down the old db.
| jtc331 wrote:
| Preconditions:
|
| 1. Route all traffic through pgbouncer in transaction pooling mode.
|
| 2. Logically replicate from old to new.
|
| For failover:
|
| 1. Ensure replication is not far behind.
|
| 2. Issue a PAUSE on pgbouncer.
|
| 3. Wait for replication to be fully caught up.
|
| 4. Update the pgbouncer config to point to the new database.
|
| 5. Issue a RELOAD on pgbouncer.
|
| 6. Issue a RESUME on pgbouncer.
|
| Zero downtime; < 2s additional latency for in-flight queries at time of op is possible (and I've done it at scale). (Sketched as a script below.)
| georgewfraser wrote:
| This is such a huge problem. It's even worse than it looks: because users are slow to upgrade, changes to the database system take years to percolate down to the 99th-percentile user. This decreases the incentive to do certain kinds of innovation. My opinion is that we need to fundamentally change how DBMS are engineered and deployed to support silent in-the-background minor version upgrades, and probably stop doing major version bumps that incorporate breaking changes.
| brentjanderson wrote:
| Depends on the database - I know that CockroachDB supports rolling upgrades with zero downtime, as it is built with a multi-primary architecture.
|
| For PostgreSQL or MySQL/MariaDB, your options are more limited. Here are two that come to mind; there may be more:
|
| # The "Dual Writer" approach
|
| 1. Spin up a new database cluster on the new version.
| 2. Get all your data into it (including dual writes to both the old and new version).
| 3. Once you're confident that the new version is 100% up to date, switch to using it as your primary database.
| 4. Shut down the old cluster.
|
| # The eventually consistent approach
|
| 1. Put a queue in front of each service for writes, where each service of your system has its own database.
| 2. When you need to upgrade the database, stop consuming from the queue, upgrade in place (bringing the DB down temporarily) and resume consumption once things are back online.
| 3. No service can directly read from another service's database. Eventually consistent caches/projections serve reads during normal service operation and during the upgrade.
|
| A system like this is more flexible, but suffers from stale reads or temporary service degradation.
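
jtc331's failover recipe above, condensed into a script. It assumes pgbouncer's admin console is reachable on port 6432 as the special "pgbouncer" database, a pool named mydb, a config at the usual path, and a new primary already caught up via logical replication; all of those names are illustrative.

      PGB='psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer'

      $PGB -c "PAUSE mydb"      # hold new queries; in-flight transactions drain
      # ...verify replication is fully caught up before proceeding...
      sed -i 's/host=old-db/host=new-db/' /etc/pgbouncer/pgbouncer.ini
      $PGB -c "RELOAD"          # pick up the new backend address
      $PGB -c "RESUME mydb"     # held clients now land on the new primary
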
| jtc331 wrote:
| Dual writing has huge downsides: namely, you're now moving consistency into the application, and it's almost guaranteed that the databases won't match in any interesting application.
| msh wrote:
| Migrate the data to a new host having the new version.
| karmakaze wrote:
| The way I've done it with MySQL since 5.7 is to use multiple writers, of which only one is actively used by clients. Take one out, upgrade it, put it back into replication but not serving requests until caught up. Switch the clients to writing to the upgraded one, then upgrade the others.
| dijit wrote:
| Even when it's not a cloud provider; in fact, especially when it's not a cloud provider: you can achieve insane scale from single instances.
|
| Of course these systems have warm standbys and dedicated backup infrastructure, and so it's not really a "single machine"; but I've seen 80TiB Postgres instances back in 2011.
| cogman10 wrote:
| We are currently pushing close to 80TB MSSQL on-prem instances.
|
| The biggest issue we have with these giant dbs is that they require pretty massive amounts of RAM. That's currently our main bottleneck.
|
| But I agree. While our design is pretty bad in a few ways, the amount of data that we are able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with something like 300 apps. A handful of them hit dedicated dbs.
|
| Were I to redesign the system, I'd have more tiny dedicated dbs per app to avoid a lot of the noisy neighbor/scaling problems we've had. But at the same time, it's impressive how far this design has gotten us, and it appears to have a lot more legs on it.
| Nathanba wrote:
| Can I ask you how large tables can generally get before querying becomes slower? I just can't intuitively wrap my head around how tables can grow from 10gb to 100gb and why this wouldn't worsen query performance by 10x. Surely you do table partitions or cycle data out into archive tables to keep up the query performance of the more recent table data, correct?
| WJW wrote:
| At my old workplace we had a few multi-TB tables with several billion rows in a vanilla RDS MySQL 5.7 instance (although it was obviously a sizable instance type); simple single-row SELECT queries on an indexed column (i.e. SELECT * FROM table WHERE external_id = 123;) would be low single-digit milliseconds.
|
| Proper indexing is key of course, and metrics to find bottlenecks.
| cogman10 wrote:
| > I just can't intuitively wrap my head around how tables can grow from 10gb to 100gb and why this wouldn't worsen query performance by 10x
|
| SQL Server data is stored as a BTree structure, so a 10gb -> 100gb growth ends up being roughly a 1/2 query performance slowdown (since it grows by a factor of log n), assuming good indexes are in place.
|
| Filtered indexes can work pretty well for improving query performance. But ultimately we do have some tables which are either archived if we can, or partitioned if we can't. SQL Server native partitioning is rough if the query patterns are all over the board.
|
| The other thing that has helped is we've done a bit of application data shuffling, moving heavy hitters onto new database servers that aren't as highly utilized.
|
| We are currently in the process of getting read-only replicas (Always On) set up and configured in our applications. That will allow for a lot more load distribution.
| AtlasBarfed wrote:
| The issue with b-tree scaling isn't really lookup performance, it is index update time, which is why log-structured merge trees were created.
|
| EVENTUALLY, yes, even read query performance would degrade, but typically the insert/update load on a typical index is the first limiter.
| abraxas wrote:
| If there is a natural key and updates are infrequent then table partitioning can help extend the capacity of a table almost indefinitely. There are limitations of course, but even for non-insane time series workloads, Postgres with partitioned tables will work just fine.
| cmckn wrote:
| Well, any hot table should be indexed (with regards to your access patterns) and, thankfully, the data structures used to implement tables and indexes don't behave linearly :)
|
| Of course, if your application rarely makes use of older rows, it could still make sense to offload them to some kind of colder, cheaper storage.
| noselasd wrote:
| Think of finding a record amongst many as e.g. a binary search. It doesn't take 10 times as many tries to find a thing (row/record) amongst 1000 as it does amongst 100.
| jlokier wrote:
| A lot depends on the type of queries. You could have tables the size of the entire internet and every disk drive ever made, and they'd still be reasonably fast for queries that just look up a single value by an indexed key.
|
| The trick is to have the right indexes (which includes the interior structures of the storage data structure) so that queries jump quickly to the relevant data and ignore the rest. Like opening a book at the right page because the page number is known. Sometimes a close guess is good enough.
|
| In addition, small indexes and tree interior nodes should stay hot in RAM between queries.
|
| When the indexes are too large to fit in RAM, those get queried from storage as well, and at a low level it's analogous to the system finding the right page in an "index book", using an "index index" to get that page number. As many levels deep as you need. The number of levels is generally small.
|
| For example, the following is something I worked on recently. It's a custom database (written by me), not Postgres, so the performance is higher, but the table scaling principles are similar. The thing has 200GB tables at the moment, and when it's warmed up, querying a single value takes just one 4k read from disk, a single large sector, because the tree index fits comfortably in RAM.
|
| It runs at approximately 1.1 million random-access queries/second from a single SSD on my machine, which is just a $110/month x86 server. The CPU has to work quite hard to keep up because the data is compressed, albeit with special query-friendly compression.
|
| If there was very little RAM, so nothing could be kept in it, the speed would drop by a factor of about 5, to 0.2 million queries/second. That shows you don't need a lot of RAM; it just helps.
|
| Keeping the RAM and increasing the table size to roughly 10TB, the speed would drop by half to 0.5 million queries/second. In principle, with the same storage algorithms, a table size of roughly 1000TB (1PB) would drop it to 0.3 million queries/second, and roughly 50,000TB (50PB) would drop it to 0.2 million. (But of course those sizes won't fit on a single SSD. A real system of that size would have more parallel components, and could have higher query performance.) You can grow to very large tables without much slowdown.
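
The rough arithmetic behind noselasd's and jlokier's point: a B-tree page that holds ~500 keys covers 500^2 = 250,000 rows at depth 2, 500^3 = 125 million rows at depth 3, and 500^4 ≈ 62 billion rows at depth 4. Growing a table 10x therefore adds at most one tree level, i.e. one extra page read per lookup, which is why a 10gb table and a 100gb table feel nearly identical for single-key queries.
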
| the8472 wrote:
| > scalability
|
| You can scale quite far vertically and avoid all the clustering headaches for a long time these days. With EPYCs you can get 128C/256T, 128 PCIe lanes (= 32 4x NVMes = ~half a petabyte of low-latency storage, minus whatever you need for your network cards), and 4TB of RAM in a single machine. Of course that'll cost you an arm and a leg and maybe a kidney too, but so would renting the equivalent in the cloud.
| Vladimof wrote:
| I don't think that it's amazing... I think that the new and shiny databases tried to make you think that it was not possible...
| eastbound wrote:
| React makes us believe everything must have a 1-2s response to clicks and the maximum table size is 10 rows.
|
| When I come back to web 1.0 apps, I'm often surprised that it does a round-trip to the server in less than 200ms and reloads the page seamlessly, including a full 5ms SQL query for 5k rows, returning them in the page (= a full 1MB of data, with basically no JS).
| icedchai wrote:
| Caches also help a ton (redis, memcache...)
| nesarkvechnep wrote:
| Also HTTP caching. It's always funny to me why people, not you in particular, reach for Redis when they don't even use HTTP caching.
| markandrewj wrote:
| Scaling databases vertically, like Oracle DB, was the norm in the past. It is possible to serve a large number of users, and a lot of data, from a single instance. There are some things worth considering though. First of all, no matter how reliable your database is, you will have to take it down eventually to do things like upgrades.
|
| The other consideration that isn't initially obvious is how you may hit an upper bound for resources in most modern environments. If your database is sitting on top of a virtual or containerized environment, your single-instance database will be limited in resources (CPU/memory/network) to a single node of the cluster. You could also eventually hit the same problem on bare metal.
|
| That said, there are some very high density systems available. You may also not need the ability to scale as large as I am talking about, or you may choose to shard and scale your database horizontally at a later time.
|
| If your project gets big enough you might also start wanting to replicate your data to localize it closer to the user. Another strategy might be to cache the data locally to the user.
|
| There are positives and negatives with a single node or a cluster. If Retool's database was clustered, they would have been able to do a rolling upgrade though.
| booleanbetrayal wrote:
| Low / zero downtime is totally achievable with pglogical, and it really boils down to whether or not you want to try to adopt bi-directional writes (and conflict management / integrity issues), or if you're willing to just have a brief session termination event and swapover. To me, the latter has generally been preferable, as conflict management systems tend to be more complicated in reality (based on business logic / state) than what pglogical provides. Interested if people here have had success with bi-directional writes though.
| bogomipz wrote:
| > "Last fall, we migrated this database from Postgres version 9.6 to version 13 with minimal downtime."
|
| I thought it was interesting that they upgraded 4 major version numbers in one go. I kept expecting to read something about version compatibility and configuration but was surprised there was none. Are major upgrades like this just less of an issue with Postgres in general?
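
They are, at least mechanically: pg_upgrade reads the old cluster's catalogs and data files directly, so skipping intermediate major versions is supported and 9.6 -> 13 is a single hop. A minimal sketch for a self-managed cluster (managed platforms like Azure didn't expose this at the time); the paths are illustrative for a Debian-style layout.

      pg_upgrade \
        --old-bindir=/usr/lib/postgresql/9.6/bin \
        --new-bindir=/usr/lib/postgresql/13/bin \
        --old-datadir=/var/lib/postgresql/9.6/main \
        --new-datadir=/var/lib/postgresql/13/main \
        --link --check    # --check only does a dry run; drop it for the real upgrade

      # Optimizer statistics do not survive the upgrade; rebuild them in
      # stages before taking full traffic.
      vacuumdb --all --analyze-in-stages

With --link, data files are hard-linked rather than copied, which is how even multi-TB clusters finish in minutes.
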
| rkwasny wrote:
| "However, on our 4 TB production database, the initial dump and restore never completed: DMS encountered an error but failed to report the error to us."
|
| THERE IS NO CLOUD: It's just someone else's computer
| forinti wrote:
| I recently upgraded a 9.x database to 14 using pg_upgrade and it was just simple and fast. No issues whatsoever.
|
| I only wish my Oracle updates were so simple and bug-free.
| VincentEvans wrote:
| Why is SQL Server able to back up and restore databases 100s of gigabytes in size in single-digit minutes, while Postgres is at least 10x slower?
| dboreham wrote:
| Perhaps time to re-architect to avoid having a huge monolith database, particularly since the SaaS product is inherently multi-tenant?
| wenbin wrote:
| For listennotes.com, we did a Postgres 9.6 => 11 upgrade (in 2019), and an 11 => 13 upgrade (in 2021). ~0 downtime for read ops, and ~45 seconds of downtime for write ops.
|
| Our database is less than 1TB. One master (for writes + some reads) + multiple slaves (read-only).
|
| Here's what we did -
|
| 1. Launched a new read-only db with pg9.6; let's call it DB_A.
|
| 2. Stopped all offline tasks, and only maintained a minimal fleet of online servers (e.g., web, api...).
|
| 3. Changed all db hosts (no matter master or slave) in /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to use the old read-only db with pg9.6; let's call it DB_B. From this point on, all write ops should fail.
|
| 4. Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and promoted it to be a master db.
|
| 5. Changed /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to use DB_A for all db hosts. By this point, DB_A is a master db. And write ops should be good now.
|
| 6. Changed /etc/hosts for all other servers and brought back all services.
|
| Step 4 is the most critical. If it failed or ran too long (e.g., more than 10 minutes), we'd have had to roll back by changing /etc/hosts on those online servers.
|
| We carefully rehearsed these steps for an entire week, and timed each step. By the time we did it on production, we knew how many seconds/minutes each step would take. And we tried to automate as many things as possible in bash scripts.
| aolle wrote:
| wow
| aolle wrote:
|
| karmelapple wrote:
| We did something similar recently, jumping from 10 to 13. We took measurements, did some dry runs, and came up with strategies to ensure our read-only followers would work fine and we'd have minimum downtime for writes.
|
| We missed one or two pieces of reconnecting things afterwards, and some of that seems to be down to limitations of Heroku Postgres that we couldn't change. Hopefully those keep improving.
| hankman86 wrote:
| Having successfully built (and sold!) a technology startup myself, I would always, always opt for a managed database service. Yes, it's more expensive on paper and you want to run the numbers and choose the right offering. But nothing beats the peace of mind of storing your customers' data on a system that others (Google Cloud in our case) look after. Not to mention that you're better off focussing on your core value proposition and spending your scarce resources there than having a database administrator on your payroll.
| pojzon wrote:
| This only works up to some scale. The startup from my previous contract reached the point where "use a bigger instance" would be losing them money.
|
| On the other hand, self-built databases maintained by professionals were A LOT cheaper.
|
| We are talking here about bills of millions of dollars per month just for databases.
|
| The self-hosted solution cost around 300k per month.
|
| This included precompiled kernel patches, tweaks to the postgres engine, etc.
|
| Overall the investment of a year of work by dbadmins will probably return itself ten times over, if not more.
| lijogdfljk wrote:
| How does MySQL fare in this type of setup? Do the DBs differ greatly?
| thatwasunusual wrote:
| I worked for a company that did a similar "upgrade by replication", but with MySQL. It's quite a few years ago, so I don't remember the versions involved, but it was quite straightforward once we had done _weeks_ of test runs in a dev environment.
|
| One invaluable thing, though: our application was from the beginning designed to do 100% of all the reads from a read-only slave _if the slave was up to sync_ (which it was 95% of the time). We could also identify testers/developers in the application itself, so we had them using the upgraded slave for two weeks before the actual upgrade.
|
| This made it possible for us to filter out problems in the application/DB layer, which were few, which means that we probably did a minor version upgrade.
|
| But upgrading by replication is something I can recommend.
| evanelias wrote:
| MySQL's built-in replication has always been logical replication, and it officially supports replicating from an older-version primary to newer-version replicas. So it's a similar concept to what's described here, but with a much simpler upgrade process.
|
| Generally you just upgrade the replicas; then promote a replica to be the new primary; then upgrade the old primary and turn it into a replica.
|
| The actual "upgrade" step is quite fast, since it doesn't actually need to iterate over your tables' row data.
|
| At large scale, the painful part of major-version MySQL upgrades tends to be performance testing, but that's performed separately and prior to the actual upgrade process. Third-party tools (pt-upgrade, proxysql mirroring, etc.) help a lot with this.
| cube00 wrote:
| _> To resolve this, we ended up choosing to leave foreign key constraints unenforced on a few large tables._
|
| _> We reasoned this was likely safe, as Retool's product logic performs its own consistency checks, and also doesn't delete from the referenced tables, meaning it was unlikely we'd be left with a dangling reference._
|
| I was holding my breath here and I'm glad these were eventually turned back on.
|
| Nobody should ever rely on their own product logic to ensure consistency of the database.
|
| The database has features (constraints, transactions, etc.) for this purpose which are guaranteed to work correctly and atomically in all situations, such as database-initiated rollbacks that your application will never have control over.
| phphphphp wrote:
| Does that pattern you describe require any considerations when writing code? I'm thinking of applications I've worked on where events are triggered by change, and so a database rolling back independently of my application would be a nightmare. I treat the database as a place to store data, not an authority: the application is the authority. Do you approach it differently? Thanks!
| kijin wrote:
| ON DELETE CASCADE can be dangerous when used with applications that expect to be notified of deletions, like in your case.
|
| Ideally, everything that needs to change when a row is deleted would be changed automatically and atomically using database-side constraints and triggers. In practice, applications often need to sync state with external services that the database knows nothing about, so I understand your concerns.
|
| ON DELETE RESTRICT, on the other hand, will result in errors just like any other query error that you can handle in your application. Nothing happened, so there's nothing to be notified of.
| brightball wrote:
| The database is the only place that can be the authority, because the application can have race conditions. It's the only way to guarantee data integrity.
| grogers wrote:
| There's no way to specify every single application-specific constraint directly in the database. Race conditions are not present when using locking reads (SELECT ... FOR UPDATE, or DB-specific shared locking selects) or serializable isolation level, which are the typical way of enforcing application-level constraints.
| icedchai wrote:
| You'd be surprised. I used to work on a product where the lead developer made sure foreign keys were NOT enabled on production. They were only "allowed" in dev. Some teams have a strict "no foreign keys" rule.
| oppositelock wrote:
| It's difficult to make a blanket statement like this.
|
| I've built some very high throughput Postgres-backed systems in my years, and doing application-side foreign key constraints (FKC) does have its benefits. Doing this client-side will result in constraints that are usually, but not always, in sync with the data. However, this kind of almost-consistency lets you do much higher throughput queries. An FKC is a read on every write, for example, and does limit write throughput. Of course, this isn't OK for some workloads, and you do proper FKC in the DB, but if you don't need absolute consistency, you can make writes far cheaper.
| javajosh wrote:
| The trade-offs between foreign key constraints vs none are almost identical to the trade-offs between static typing vs dynamic typing. Nowadays people realize that when they turn off these features, they'll eventually have to re-implement them later.
| WJW wrote:
| You make this claim as if this happens to every company sooner or later, but if a company the size of GitHub can still do without (https://github.com/github/gh-ost/issues/331#issuecomment-266...) it does become a little bit of a "you do not have google problems" type discussion.
|
| (Perhaps you do have such problems, I don't know where you work! But 99%+ of companies don't have such problems and never will.)
| [deleted]
| jasfi wrote:
| True, but DB constraints often aren't enough. I like to have a verify program to check that the DB is as expected.
| weird-eye-issue wrote:
| You don't trust DB constraints?
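
The NOT VALID option mrbabbage mentions upthread is the standard way out of the "adding a foreign key locks the table for hours" problem, and is how constraints like the ones in this discussion can be turned back on safely. A minimal sketch with hypothetical table names: NOT VALID enforces the constraint for all new writes immediately without scanning existing rows, and VALIDATE later checks the old rows under a much weaker lock.

      psql "$DATABASE_URL" -c "ALTER TABLE orders
          ADD CONSTRAINT orders_customer_id_fkey
          FOREIGN KEY (customer_id) REFERENCES customers (id)
          NOT VALID"

      # Scans existing rows without blocking concurrent reads and writes.
      psql "$DATABASE_URL" -c "ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey"
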
| [deleted]
| deathanatos wrote:
| Depends on the database, and sometimes the database config, as to whether they'll actually be enforced or not, or in what situations data might evade enforcement of the constraints...
|
| Applies to vendors, too. Had some data in Rackspace Files where "list files" would say X existed, but "GET X" got you a 404. Had an AWS RDS instance where a query on it returned no results; adding the "don't use the index" index hint caused it to return data. (Allegedly, this bug was fixed, but we had migrated off by that point, so I never got to confirm it.)
|
| Conversely, I do like DB constraints, because if the DB constraint _doesn't_ exist, then I guarantee you the production DB has a row that is a counter-example to whatever constraint you think the data _should_ obey...
| majewsky wrote:
| > Had some data in Rackspace Files where "list files" would say X existed, but "GET X" got you a 404.
|
| Well yes, Rackspace Files (aka OpenStack Swift) is eventually consistent. It says so literally in the _first sentence_ of the documentation [1]. But this discussion is about relational databases with ACID guarantees, where the C is literally "consistent".
|
| [1] https://docs.openstack.org/swift/latest/
| rattray wrote:
| They're limited in what they can express; your application often has invariants you want to enforce/maintain that can't be (performantly) expressed with DB constraints, and must be validated another way.
|
| As great as it can be to enforce rules within the database, a lot of them usually end up needing to be enforced at the application layer instead. Especially when performance at scale comes into play.
| brightball wrote:
| I think it's a balance. Transactions + constraints can enforce most things, but there will certainly be things that can only be verified in the app.
|
| My goal is always to verify what I can in the database to minimize potential data cleanup. In my experience, app-only verification always leads to future time investments to clean up the mess.
| jasfi wrote:
| Make no mistake, I think DB constraints are a best practice.
| WJW wrote:
| They have their place, but also their limits.
| simonw wrote:
| I found the script they used for copying data really interesting: https://gist.github.com/peterwj/0614bf6b6fe339a3cbd42eb93dc5...
|
| It's written in Python, spins up a queue.Queue object, populates it with ranges of rows that need to be copied (based on min < ID < max ranges), starts up a bunch of Python threads, and then each of those threads uses os.system() to run this:
|
|     psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
|       | psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
|
| This feels really smart to me. The Python GIL won't be a factor here.
| teej wrote:
| For ETL out of Postgres, it is very hard to beat psql. Something as simple as this will happily saturate all your available network, CPU, and disk write. Wrapping it in Python helps you batch it out cleanly.
|
|     psql -c "..." | pigz -c > file.tsv.gz
| mrbabbage wrote:
| Thanks Simon! I can indeed confirm that this script managed to saturate the database's hardware capacity (I recall CPU being the bottleneck, and I had to dial down the parallelism to leave some CPU for actual application queries).
| SahAssar wrote:
| Sounds to me like this is the exact thing that the normal parallel command was made for; not sure Python is needed here if the end result is shelling out to os.system anyway.
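
SahAssar's suggestion as a sketch: the same keyset batching as the gist, with GNU parallel doing the fan-out instead of Python threads. big_table, the dense integer id column, the 1M-row batch size, and the exported $SOURCE_URL/$DEST_URL are all hypothetical.

      seq 0 99 | parallel -j8 '
        lo=$(( {} * 1000000 )); hi=$(( lo + 1000000 ))
        psql "$SOURCE_URL" -c "COPY (SELECT * FROM big_table WHERE id >= $lo AND id < $hi) TO STDOUT" \
          | psql "$DEST_URL" -c "COPY big_table FROM STDIN"'

The -j8 flag caps concurrency, which matters here for the same reason mrbabbage mentions: the copy will happily eat all the CPU the database has.
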
| valzam wrote:
| I am generally a fan of using as few moving parts as possible, but if > 60-70% (2TB + a "few hundred GB") of your prod database is an append-only audit log, surely it would make sense to split that part onto a separate DB server? Especially when you are using a hosted service. It sounds like both the uptime and consistency requirements are very different between these two parts of the production data.
| ranyefet wrote:
| My thoughts exactly.
|
| Does Postgres still make sense for append-only tables, or would Elastic or some other kind of database be more suitable?
| abraxas wrote:
| Postgres makes a lot of sense with append-only tables. You can easily partition them by time (usually) and thus have an easy way to break up the index trees, as well as using a cheap indexing scheme like BRIN and being able to just drop old chunks as they become irrelevant.
| sega_sai wrote:
| Using pg_upgrade I recently updated this 100TB sized DB (from PG12 to PG13) in ~10 min of downtime.
|
|     => select pg_size_pretty(pg_database_size ('**'));
|      pg_size_pretty
|     ----------------
|      99 TB
|     (1 row)
|
| (The re-analyze of tables took a day or so though)
| riku_iki wrote:
| Curious what your storage story is. Where is the DB actually stored? Some NAS?
| sega_sai wrote:
| It's DAS. Mostly it's in one box of RAIDed 32x5.5TB drives with a couple of tablespaces/WAL elsewhere. The DB is mostly read-only and doesn't have many concurrent users, so that's probably not the most typical case.
| comboy wrote:
| Yeah, my petty 6TB also went fine with pg_upgrade and practically no downtime. Upgrade the slave, promote it to master, upgrade the old master and then promote it back. It's a marvelous piece of technology.
|
| It's really just a handful of core people who did most of the work, crafting it so thoughtfully over the years, and it has such a huge impact on the world.
|
| Doing a huge part of it before PostgreSQL was as popular as it is today, spending countless hours on making some great design choices and implementing them carefully.
|
| It seems unlikely any of them will read this, but I'm so deeply grateful to these people. It allowed so many things to flourish on top, thanks to it being open source and free.
| jeffrallen wrote:
| Nifty. But I can't help thinking this was harder than it needed to be in the cloud. Because frankly, 4 TB is not big: my home Synology backup server is 4 TB. Making a pair of standalone Linux servers to rehearse this locally, with full control of which Postgres modules and other software to use, would have made things easier, it seems.
|
| Anyway, thanks for food for thought.
| TruthWillHurt wrote:
| Wow, this takes me back 20 years, to when we did this kind of partial-dump+sync migration of MySQL.
|
| Then the cloud and DBaaS were invented.
| electroly wrote:
| In SQL Server you just... do the upgrade. You install the upgrade on your nodes starting with the passive nodes, and it will automatically fail over from the old version to the new version once half the nodes have been upgraded. No downtime, but your redundancy drops when some nodes have been upgraded but the cluster hasn't fully been upgraded yet. You certainly don't have to dump and restore your database. Without giving private numbers, our database is much bigger than OP's 4TB; dump and restore would be wildly unacceptable.
|
| The idea that you don't get a seamless upgrade of the database itself with PostgreSQL is absurd to me. The part about "maximizing the amount of time this upgrade buys us" is only necessary because of how difficult PostgreSQL makes upgrades. We upgrade to every new version of SQL Server. It's not that big of a deal.
|
| With every PostgreSQL blog article I read, I become more and more of an SQL Server fanboy. At this point it's full-blown. So many "serious business" PostgreSQL ops posts are just nothingburgers in the SQL Server world.
| calpaterson wrote:
| PG is far behind SQL Server on ease of upgrade, but the method described in this post is not the best practice right now, which I think is:
|
| - physical restore to a new cluster
|
| - pg_upgrade the new cluster
|
| - catch up on logical wal logs from the old cluster
|
| - failover to the new cluster
|
| - STONITH
|
| I think the above was not open to them because of the limitations of their managed PG instance. I haven't used Azure, but GCP managed SQL has loads of limitations. It seems very common, and I think it is a major (and undiscussed) drawback of these managed instances.
|
| But the truth is that very few of the people who use PG want to hear that things are better in the MS SQL community, for reasons of prejudice, and as a result you're being downvoted unfairly for pointing out PG's relative backwardness here.
| Teletio wrote:
| Any real facts you can show, besides your sentiment on blog posts?
| electroly wrote:
| What facts are you looking for? I just described the steps from this document: https://docs.microsoft.com/en-us/sql/sql-server/failover-clu... -- specifically, the "Perform a rolling upgrade or update" section. There's nothing else to my post other than contrasting the SQL Server upgrade process to the one described in the article, and musing about my growing appreciation for SQL Server; I apologize if it seemed like it was going to be deeper than that.
|
| EDIT: I realized you're looking for the other PostgreSQL blog posts. Here's an example of two recent HN posts about PostgreSQL issues that I pulled out of my comment history. Both of these blog posts exist because PostgreSQL doesn't have query hints. SQL Server has them; I've dealt with issues like these blog posts describe, but they have trivial fixes in the SQL Server world. Nothing to write a blog post about. I don't have a link handy regarding PostgreSQL's txn id wraparound problem, but SQL Server doesn't have that problem, either.
|
| - https://news.ycombinator.com/item?id=30296490
|
| - https://news.ycombinator.com/item?id=29981737
| aidos wrote:
| It's been a long time since I used SQL Server, so I don't know that upgrade process well (I'm willing to believe it's smoother though, especially wrt replication / failover).
|
| Keep in mind that they're upgrading from a database version that's almost 6 years old. Postgres has improved a lot in the 5 major versions since then.
|
| Another thing here is that I'm pretty sure they could have just done the in-place upgrade and it would have been fine. I've run pg_upgrade myself for a bunch of major versions now and it's easy and doesn't require dumping / restoring anything. Maybe there's something else going on that I'm missing though.
|
| What setup are you running with SQL Server to have it automatically fail over? Is it a multi-master configuration, or are the additional nodes just read replicas?
|
| These days Postgres actually allows logical replication, so your servers can be running different versions at the same time, which allows for much smoother upgrades (haven't tried that myself yet, don't quote me on it!)
| Symbiote wrote:
| They did not dump and restore.
|
| They created a replica database running the new version, then switched over to it. Not too dissimilar to what you describe, although more work, since they started out with only a single instance without replication support.
| electroly wrote:
| They _ultimately_ didn't dump and restore, but it was the first thing they tried. It didn't work; it actually failed catastrophically for them. They describe this under the "Implementing logical replication" section. Their ultimate solution is what they tried after the dump-and-restore based DMS method failed and they took an unplanned outage due to yet more PostgreSQL-specific issues (unvacuumed tuples).
|
| All of this is exactly what I'm talking about. This blog post describes kind of a nightmare process for something that is trivial in SQL Server. They actually needed a third-party product from Citus just to successfully upgrade PostgreSQL! Stunning.
| VincentEvans wrote:
| Postgres dump and restore tooling is very poor performance-wise, easily 10x slower compared to SQL Server. I love Postgres dearly and prefer to use it despite that, but I wish the Postgres devs renewed their interest in improving the neglected dump/restore tooling.
___________________________________________________________________
(page generated 2022-04-19 23:00 UTC)