[HN Gopher] How we upgraded our 4TB Postgres database
       ___________________________________________________________________
        
       How we upgraded our 4TB Postgres database
        
       Author : mrbabbage
       Score  : 251 points
       Date   : 2022-04-19 15:07 UTC (7 hours ago)
        
 (HTM) web link (retool.com)
 (TXT) w3m dump (retool.com)
        
       | giovannibonetti wrote:
       | By the way, Google Cloud recently launched in-place upgrade of
        | Postgres instances. A few days ago we used it to upgrade our
        | multi-TB database at my company as well.
       | 
       | https://cloud.google.com/blog/products/databases/cloud-sql-l...
        
         | davidkuennen wrote:
         | Woah, this is great. Been waiting for this, since Cloud SQL has
         | been very reliable in the past few years I've been using it,
         | but upgrading was always a pain.
        
         | aeyes wrote:
          | What is "in-place" about this? According to the docs you'll
          | have ~10 minutes of downtime and you'll lose table statistics,
          | which is exactly what happens when you run pg_upgrade manually.
         | 
          | The biggest problem with all the cloud providers is that you
          | won't know exactly when this 10-minute downtime window will
          | start.
         | 
          | I guess the only advantage here is that you don't have to go
          | 9->10->11->12->13->14 like in the past, and maybe that was one
          | of the blockers for Azure. AWS allows skipping some major
          | versions, but 9->14 is not possible.
        
       | Teletio wrote:
        | The first upgrade strategy is not the normal or easy one for
        | anything in production, btw.
       | 
        | Very small companies might be able to do this on a 'no load' day,
        | but from a pure business perspective, running your db twice is
        | easier and way less risky.
       | 
        | You could have done this even without downtime by letting your
        | connection proxy handle the switch.
        
       | gamegod wrote:
       | If you're running a 4 TB Postgres database, but you still have to
       | worry about this level of maintenance, what's the value
       | proposition of using a hosted service? There's usually insane
       | markup on any hosted Postgres instance.
       | 
        | If you want to pay multiple thousands of dollars a month for a
        | database server, it's WAY cheaper just to slap a server with a
        | ton of drives in colocation.
        
         | orangepurple wrote:
         | Might be an accounting scam^H^H^H^H trick to book the costs as
         | an operating expense vs a capital expenditure. In general
         | capital expenditures have to be planned carefully, held on the
         | books for years, and show a return on investment. Perhaps an
         | accountant can provide more detail.
        
           | ipaddr wrote:
           | This is a big reason why yearly pricing vs buying hardware is
           | popular (one of many reasons).
           | 
            | An expense can be applied that year, whereas capital expenses
            | vest over years, so only a portion can be applied each year.
        
           | mywittyname wrote:
            | I feel like the explanation is much simpler: it's easier to
            | use Azure/AWS/GCP for everything.
           | 
           | Yeah, this migration might have been somewhat painful, but it
           | looks like it's a situation the company experiences every few
           | years.
        
       | Spivak wrote:
        | It's crazy how many apps opt for an RDBMS for append-only data
        | like audit events. It's so tantalizing at the beginning but turns
        | into a nightmare as time marches forward.
        | 
        |     audit events -> queue -> elastic -> blob storage
        | 
        | is so easy to maintain, and we save TBs from living in the DB.
        
         | luhn wrote:
         | Mind expanding on what the "nightmare" is?
        
         | mrbabbage wrote:
         | We (Retool) are going to be doing this very soon and cut our
         | database size by 50%+. And you're exactly right: it's so easy
         | to get started by sticking audits (or other append-only data
         | schema) in an RDBMS, but it quickly becomes a headache and
         | bottleneck.
        
           | abrkn wrote:
           | What will you use as a queue and storage?
        
             | mrbabbage wrote:
             | To be determined!
        
           | baq wrote:
           | up until some size the headache from maintaining a separate
           | datastore is bigger. everything should be in the RDBMS until
           | proven otherwise for the sake of simplicity. it's actually
           | amazing how much you can squeeze out of 'old school'
           | databases.
           | 
           | e.g. https://docs.microsoft.com/en-us/azure/azure-
           | sql/database/le...
        
         | nrmitchi wrote:
          | One of the biggest things that keeping audit records in your DB
          | gives you is transactionality around your audit logs. Sending
          | audit events to an external system (quite often) loses this,
          | and the resources needed to address that before you have to are
          | way larger than a slightly larger AWS/GCP/Azure/<insert-cloud-
          | provider-here> bill.
        
           | magicalhippo wrote:
           | We're implementing something similar to what OP describes,
           | but we'll keep the "queue" in the DB in order to insert the
           | application audit event in the same transaction as the data
           | change. A background process then uploads to secondary
           | storage.
           | 
           | We won't have billions of rows though, so once uploaded to
           | secondary storage we'll just clear the blob field and set a
           | "processed" flag.
           | 
           | This way we can find all the relevant keys for a given order,
           | invoice etc quickly based on a partial key search in the
           | database, and transparently fetch from either db directly or
           | secondary storage as needed.
        
         | onphonenow wrote:
          | Actually, I've seen more problems from folks mixing up lots of
          | different tools than I have from folks doing an append-only
          | audit event table in an RDBMS.
         | 
         | When your audit trail is in DB, you can pretty easily surface
         | audit events to your customers. Who changed what when is just
         | another feature. Capturing audit events is also usually pretty
         | smooth.
         | 
         | The folks doing the blob storage route, you would not BELIEVE
         | the complexity they have to spin up to expose very simple
         | histories etc. This matters a LOT in some spaces (financial
         | etc), less so in others.
         | 
          | In my RDBMS model, "who changed this field, when, from what to
          | what" is a basic select. You can even shard by recordID or
          | similar if you want to reduce table scans; a good selection of
          | indexes etc. can be a huge help as well. In most cases users
          | don't mind a bit of latency on these queries.
        
           | jeffbee wrote:
           | My only experience in the financial sector indicates the
           | opposite. The firm held its trading history for tens of
           | thousands of accounts going back 60 years in a SQL Server.
           | Anyone who wanted a question answered had to submit it for
           | overnight analysis and get the answer the next day. But in an
            | optimal non-RDBMS representation, said trading history could
           | be condensed to a single 200MB flat file that could be
           | queried interactively, in microseconds. Dumping the RDBMS for
           | most use cases pretty much revolutionized the daily
           | experience for the people at that firm.
        
             | icedchai wrote:
             | Were they running this on a 386? Why didn't they optimize
             | their database?
        
               | jeffbee wrote:
               | It was "optimized" for the aesthetic concerns of RDBMS
               | purists and for ease of implementation of certain other
               | systems, in other words it was optimal for the developers
               | and severely sub-optimal for the users, which is a
               | problem endemic to RDBMS fandom.
        
               | icedchai wrote:
               | So it was basically "write optimized", fully normalized,
               | required at least a half dozen joins to get anything out
               | of it?
        
             | onphonenow wrote:
             | This seems borderline impossible? I'd be curious if there
             | were missing indexes or bad index selection or efforts to
             | do things like joins? Normally for very large data sets you
             | can shard by account if needed if that's the common filter
             | if there is some insane table scan. Audit stuff tends to be
             | "sparse" so if you can get an index which just tells you
             | which pages have results, that is usually a 100X speedup
             | with a pretty small index size.
             | 
              | But agreed, a daily dump to something can go a long way
              | toward unlocking other tools - in govt this is especially
              | true, not because the DBMS is hard to use, but because with
              | so many layers of consultants and others in the way, it's
              | not usable.
        
             | baq wrote:
             | SQL Server can easily handle such datasets if columnstore
              | is employed. I wouldn't be surprised if a single weekend of
              | building a proper index (...being very generous here) made
              | their DB go literally 100x faster.
        
               | jeffbee wrote:
               | The problem at that firm wasn't the SQL Server it was the
               | RDBMS mindset which led them to do joins between tables
               | of journaled trades and much larger tables of corporate
               | actions going back decades, instead of materializing a
               | more useful intermediate result. This is my main beef
               | with RDBMSs: they lead to cognitive hazards of this type.
               | It's not that the databases are themselves naturally bad
               | systems.
        
       | more_corn wrote:
       | Set site read only. Snapshot master. Create new db from snapshot.
       | Point site to new db?
        
       | mrbabbage wrote:
       | Hey folks--I wrote the post! This was my biggest Postgres project
       | to date, and it proved quite tricky since I didn't rehearse with
       | a test database of the same size. I learned a bunch about
       | Postgres, not least the incredibly powerful NOT VALID option for
       | safely and quickly adding constraints.
       | 
       | Happy to stick around and answer any questions you have.
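        | 
        | For anyone curious, a minimal sketch of the NOT VALID pattern
        | (table and constraint names here are made up, not our real
        | schema):
        | 
        |     ALTER TABLE audit_events
        |       ADD CONSTRAINT audit_events_org_id_fkey
        |       FOREIGN KEY (organization_id)
        |       REFERENCES organizations (id)
        |       NOT VALID;
        | 
        |     -- Later, validate existing rows. This scans the table
        |     -- but takes only a SHARE UPDATE EXCLUSIVE lock, so
        |     -- normal writes keep flowing:
        |     ALTER TABLE audit_events
        |       VALIDATE CONSTRAINT audit_events_org_id_fkey;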
        
         | palijer wrote:
          | I'm curious why a test run on a properly sized replica database
          | wasn't in the testing plans. That is something I've been
          | ensuring happens for a while now on similar projects.
        
         | wrs wrote:
         | Wondering why you couldn't use in-place upgrade (pg_upgrade
         | [0]). Not supported in Azure?
         | 
         | [0] https://www.postgresql.org/docs/current/pgupgrade.html
        
           | majewsky wrote:
           | I'm guessing this would still take more than one hour since
           | it needs to rewrite all (or most) of the 4TB?
        
             | aidos wrote:
              | Not in my experience. You can use the --link flag to get
              | it to use hard links so it doesn't need to move the data at
              | all. I've been through this process myself a few times and
              | it only took seconds on a 50-100GB db. I'm always a little
              | surprised at how well it works.
        
         | aoms wrote:
         | This instance must cost you a ton
        
           | mrbabbage wrote:
           | More than I care to admit!
           | 
           | As mentioned downthread, we're doing some work soon to remove
           | the audit table from the database, which will cut storage
           | usage by over 50%.
        
             | tmikaeld wrote:
              | Seems to be around $1,784/month according to the pricing
              | calculator.
        
             | loopdoend wrote:
             | Let me save you a bunch of money:
             | https://adamj.eu/tech/2021/04/13/reindexing-all-tables-
             | after...
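              | 
              | In case it saves a click: the gist is rebuilding indexes
              | after the upgrade so they pick up Postgres 13's B-tree
              | deduplication. Roughly (index/db names are placeholders;
              | the CONCURRENTLY forms need Postgres 12+):
              | 
              |     -- one bloated index, without blocking writes:
              |     REINDEX INDEX CONCURRENTLY
              |         audit_events_created_at_idx;
              | 
              |     -- or everything in the current database:
              |     REINDEX DATABASE CONCURRENTLY mydb;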
        
               | karmelapple wrote:
               | This technique saved us from seriously increasing the
               | cost of our Heroku Postgres instance. Thank goodness it
                | exists and works so well. Multiple 80+ GB indexes shrank
                | down to less than 10GB after just a couple of hours.
        
             | Teletio wrote:
             | What's your strategy?
             | 
             | Using compression? Second instance for old data? Creating
             | in-between states?
             | 
             | Just curious :)
        
             | clessg wrote:
             | > we're doing some work soon to remove the audit table from
             | the database
             | 
             | Out of pure curiosity, what are you replacing it with (if
             | anything)? Just a simple rotating log file?
        
               | icheishvili wrote:
               | You can partition your audit/event table by time period
               | and archive old events [1] or you can avoid the records
               | hitting the database in the first place by generating the
               | events elsewhere to begin with [2].
               | 
               | [1] https://github.com/icheishvili/audit-trigger [2]
               | https://github.com/eulerto/wal2json
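                | 
                | For the partition-by-time-period approach, a rough
                | sketch of declarative partitioning (Postgres 10+; not
                | tied to the linked tooling, and table/column names are
                | hypothetical):
                | 
                |     CREATE TABLE audit_events (
                |         id         bigint,
                |         created_at timestamptz NOT NULL,
                |         payload    jsonb
                |     ) PARTITION BY RANGE (created_at);
                | 
                |     CREATE TABLE audit_events_2022_04
                |         PARTITION OF audit_events
                |         FOR VALUES FROM ('2022-04-01')
                |                      TO ('2022-05-01');
                | 
                |     -- archiving a month is then cheap:
                |     ALTER TABLE audit_events
                |         DETACH PARTITION audit_events_2022_04;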
        
               | mrbabbage wrote:
               | Exact strategy to be determined--we're looking at various
               | data layers at the moment. I wish we could do something
               | simple like a rotating log file, but we want to be able
               | to query it in the app (for instance, to show recent
               | logins).
        
               | jamesrr39 wrote:
               | Have you considered an OLAP database like Clickhouse or
               | QuestDB? An OLAP database would be a much better fit for
               | audit tables given the read and append-only writing
                | requirements, would compress better, and you might be
                | able to slot it in directly without changing any app code
                | via a PostgreSQL foreign data wrapper.
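                | 
                | The FDW route looks roughly like this -- shown with
                | postgres_fdw purely to illustrate the pattern; a
                | ClickHouse/QuestDB wrapper has its own extension name
                | and options, and all names below are made up:
                | 
                |     CREATE EXTENSION postgres_fdw;
                | 
                |     CREATE SERVER audit_store
                |         FOREIGN DATA WRAPPER postgres_fdw
                |         OPTIONS (host 'audit-db.internal',
                |                  dbname 'audit');
                | 
                |     CREATE USER MAPPING FOR CURRENT_USER
                |         SERVER audit_store
                |         OPTIONS (user 'app', password 'secret');
                | 
                |     -- the app keeps querying "audit_events" as if
                |     -- it were a local table:
                |     CREATE FOREIGN TABLE audit_events (
                |         id         bigint,
                |         created_at timestamptz,
                |         payload    jsonb
                |     ) SERVER audit_store
                |       OPTIONS (table_name 'audit_events');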
        
               | carlhjerpe wrote:
               | I was thinking TimescaleDB
        
               | clessg wrote:
               | Very interesting! Looking forward to the next blog post
               | on that ;)
        
         | abrkn wrote:
         | Great read! Thanks
        
         | tmikaeld wrote:
         | How come you didn't consider a distributed database like
          | CockroachDB, instead of Postgres?
        
           | nrmitchi wrote:
           | I'm not OP, but they were upgrading from Postgres 9.6, which
           | at least _implies_ that this initial db was from ~2017.
           | 
            | That is barely past the initial release of Cockroach. It
            | would have been kind of crazy for the Retool team to use an
            | experimental db with little track record when building up
            | their product (which didn't depend on experimental new
            | features).
        
             | tmikaeld wrote:
             | Baidu migrated from an old MySQL cluster to CockroachDB for
             | their 40TBs of databases, now runs on 20 clusters.
             | 
             | When does it stop being experimental to you?
             | 
             | https://www.cockroachlabs.com/customers/
        
               | flyinknockin wrote:
               | I don't think they're calling Cockroach experimental
               | right now - rather that it was experimental in 2017.
        
       | burai wrote:
        | I used to work at a company that had MongoDB as the main
        | database. Leaving a lot of criticism aside, the replica set model
        | for Mongo made upgrades much easier than in other types of
        | databases.
        
         | stingraycharles wrote:
          | While that's true, managed services on e.g. AWS provide hot
          | replicas as well, which you can use to upgrade the database
          | and do a failover to the new version.
         | 
         | We actually migrated from vanilla Postgres to Aurora that way
         | with minimal risk / downtime, it was a really smooth process.
        
       | sscarduzio wrote:
       | Hey @mrbabbage! Retool customer here! Great service :)
       | 
       | Non distributed RDBMS is a great (yet underrated) choice. Thank
       | you for the good writeup.
       | 
       | I was thinking you could have a much less delicate migration
       | experience next time (some years from now). So you can go for a
       | quicker parallel "dump and restore" migration.
       | 
       | For example:
       | 
       | - Client side sharding in the application layer: you could shard
       | your customers' data across N smaller DB instances (consistent
       | hashing on customer ID)
       | 
       | - Moving the append-only data somewhere else than postgres prior
       | to the upgrade. You don't need RDBMS capabilities for that stuff
       | anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to
       | time series data.
       | 
       | WDYT?
        
         | mrbabbage wrote:
         | The second bullet point is underway! Getting audit events out
         | of the main database will be a major headache saver.
         | 
         | The first bullet point is on our radar for the near term. We
         | have a very natural shard key in our schema (the customer ID),
         | with AFAIK no relationships across that shard key. And once we
         | start horizontally sharding, we can do cool things like putting
         | your data in a shard geographically close to you, which will
          | greatly increase app performance for our non-US customers.
         | Exciting stuff coming down the pike!
        
           | sscarduzio wrote:
           | Oh cool! Yes locality is very important! Great idea :)
        
       | georgewfraser wrote:
       | It is amazing how many large-scale applications run on a single
       | or a few large RDBMS. It seems like a bad idea at first: surely a
       | single point of failure must be bad for availability and
       | scalability? But it turns out you can achieve excellent
       | availability using simple replication and failover, and you can
       | get _huge_ database instances from the cloud providers. You can
       | basically serve the entire world with a single supercomputer
       | running Postgres and a small army of stateless app servers
       | talking to it.
        
         | lazide wrote:
         | Well, it's always been that way.
         | 
         | The big names started using no-sql type stuff because their
         | instances got 2-3 orders of magnitude larger, and that didn't
         | work. It adds a lot of other overhead and problems doing all
         | the denormalization though, but if you literally have multi-PB
         | metadata stores, not like you have a choice.
         | 
         | Then everyone started copying them without knowing why.... and
         | then everyone forgot how much you can actually do with a normal
         | database.
         | 
         | And hardware has been getting better and cheaper, which makes
         | it only more so.
         | 
         | Still not a good idea to store multi-PB metadata stores in a
         | single DB though.
        
           | more_corn wrote:
           | Ideally you offer developers both a relational data store and
           | a fast key-value data store. Train your developers to
           | understand the pros and cons and then step back.
           | 
           | There's nothing inherently wrong with a big db instance. The
           | cloud providers have fantastic automation around multi-az
           | masters, read replicas and failover. They even do cross
           | region or cross account replication.
           | 
           | That being said. Multi PB would raise an eyebrow.
        
           | seunosewa wrote:
           | Large solid state drives really changed the game; your
           | working set doesn't have to fit in RAM for decent performance
           | anymore.
        
             | openthc wrote:
             | With postgres you'll want to tune those cost paramters
             | however. Eg: lowering the random page cost will change how
             | the planner does things on some queries. But don't just
             | blindly change it -- like modify the value and run the
             | benchmark again. The point is that the SSD is not 10x the
             | cost of RAM (0.1 vs 1.0). In our example a few queries the
             | planner move to, what I always assumed was the slower
             | sequential scan -- but it's only slower depending on your
             | table size (how tall and how wide). I mean, PG works
             | awesome w/o tweaking that stuff but if you've got a few
             | days to play with these values it's quite educational.
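              | 
              | For reference, the knob being discussed (the 1.1 below is
              | just a common SSD-era starting point, and the query/table
              | are hypothetical -- benchmark your own workload):
              | 
              |     -- try it per-session first and compare plans:
              |     SET random_page_cost = 1.1;   -- default is 4.0
              |     EXPLAIN (ANALYZE, BUFFERS)
              |         SELECT * FROM audit_events
              |         WHERE organization_id = 42;
              | 
              |     -- if the benchmarks hold up, persist it:
              |     ALTER SYSTEM SET random_page_cost = 1.1;
              |     SELECT pg_reload_conf();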
        
           | jghn wrote:
           | > Then everyone started copying them without knowing why
           | 
           | People tend to have a very bad sense of what constitutes
           | large scale. It usually maps to "larger than the largest
           | thing I've personally seen". So they hear "Use X instead of Y
           | when operating at scale", and all of a sudden we have people
           | implementing distributed datastore for a few MB of data.
           | 
           | Having gone downward in scale over the last few years of my
           | career it has been eye opening how many people tell me X
           | won't work due to "our scale", and I point out I have
            | _already_ used X in prior jobs for scale that's much larger
           | than what we have.
        
             | dzhiurgis wrote:
              | A place I recently left had a 10M-record MongoDB table
              | without indexes which would take tens of seconds to query.
              | Celery was running in cron mode every 2 seconds or so,
              | meaning jobs would just pile up and redis eventually ran
              | out of memory. No one understood why this was happening,
              | so they'd just restart everything after a pagerduty
              | alert...
        
               | lazide wrote:
               | Yikes. Don't get me wrong, it's always been this way to
               | some extent - not enough people who can look into a
               | problem and understand what is happening to make many
               | things actually work correctly, so iterate with new shiny
               | thing.
               | 
               | It seems like the last 4-5 years though have really made
               | it super common again. Bubble maybe?
               | 
               | Huge horde of newbs?
               | 
               | Maybe I'm getting crustier.
               | 
               | I remember it was SUPER bad before the dot-com crash, all
               | the fake it 'til you make it too. I even had someone
               | claim 10 years of Java experience who couldn't write out
               | a basic class on a whiteboard at all, and tons of folks
               | starting that literally couldn't write a hello world in
               | the language they claimed experience in, and this was
               | before decent GUI IDEs.
        
               | dzhiurgis wrote:
               | Technically we were a tech startup with 10+ "senior"
                | engineers who scraped the entire web ;D
        
             | [deleted]
        
             | lazide wrote:
             | 100% agree. I've also run across many cases where no-one
             | bothered to even attempt any benchmarks or load tests on
             | anything (either old or new solutions), compared latency,
              | optimized anything, etc.
             | 
             | Sometimes making 10+ million dollar decisions off that gut
             | feel with literally zero data on what is actually going on.
             | 
             | It rarely works out well, but hey, have to leave that
             | opening for competition somehow I guess?
             | 
             | And I'm not talking about 'why didn't they spend 6 months
             | optimizing that one call which would save them $50 type
             | stuff'. I mean literally zero idea what is going on, what
             | actual performance issues are, etc.
        
             | cortesoft wrote:
             | This is funny, because I suffer from the opposite issue...
             | every time I try to bring up scaling issues on forums like
             | HN, everyone says I don't actually need to worry because it
             | can scale up to size X... but my current work is with
             | systems at 100X size.
             | 
             | I feel like sometimes the pendulum has swung too far the
             | other way, where people deny that there ARE people dealing
             | with actual scale problems.
        
             | dmd wrote:
             | Yep. I've personally been in the situation where I had to
             | show someone that I could do their analysis in a few
             | seconds using the proverbial awk-on-a-laptop when they were
             | planning on building a hadoop cluster in the cloud because
             | "BIG DATA". (Their Big Data was 50 gigabytes.)
        
         | strictfp wrote:
         | I agree in principle. But one major headache for us has been
         | upgrading the database software without downtime. Is there any
         | solution that does this without major headaches? I would love
         | some out-of-the-box solution.
        
           | AtlasBarfed wrote:
           | Cassandra can do it since it has cell level timestamps, so
           | you can mirror online writes and clone existing data to the
           | new database, and there's no danger of newer mutations being
           | overwritten by the bulk restored data.
           | 
           | Doing an active no-downtime database migration basically
           | involves having a coherent row-level merge policy (assuming
           | you AT LEAST have a per-row last updated column), or other
           | tricks. Or maybe you temporarily write cell-level timestamps
           | and then drop it later.
           | 
           | Or if you have data that expires on a window, you just do
           | double-writes for that period and then switch over.
        
           | simonw wrote:
           | The best trick I know of for zero-downtime upgrades is to
           | have a read-only mode.
           | 
           | Sure, that's not the same thing as pure zero-downtime but for
           | many applications it's OK to put the entire thing into read-
           | only mode for a few minutes at a well selected time of day.
           | 
           | While it's in read-only mode (so no writes are being
           | accepted) you can spin up a brand new DB server, upgrade it,
           | finish copying data across - do all kinds of big changes.
           | Then you switch read-only mode back off again when you're
           | finished.
           | 
            | I've even worked with a team that used this trick to migrate
           | between two data centers without visible end-user downtime.
           | 
           | A trick I've always wanted to try for smaller changes is the
           | ability to "pause" traffic at a load balancer - effectively
           | to have a 5 second period where each incoming HTTP request
           | appears to take 5 seconds longer to return, but actually it's
           | being held by the load balancer until some underlying upgrade
           | has completed.
           | 
           | Depends how much you can get done in 5 seconds though!
        
             | kgeist wrote:
             | >The best trick I know of for zero-downtime upgrades is to
             | have a read-only mode.
             | 
             | I've done something similar, although it wasn't about
             | upgrading the database. We needed to not only migrate data
             | between different DB instances, but also between completely
             | different data models (as part of refactoring). We had
             | several options, such as proper replication + schema
             | migration in the target DB, or by making the app itself
             | write to two models at the same time (which would require a
             | multi-stage release). It all sounded overly complex to me
             | and prone to error, due to a lot of asynchronous
             | code/queues running in parallel. I should also mention that
              | our DB is sharded per tenant (i.e. per organization).
             | What I came up with was much simpler: I wrote a simple
             | script which simply marked a shard read-only (for this
             | feature), transformed and copied data via a simple HTTP
             | interface, then marked it read-write again, and proceeded
             | to the next shard. All other shards were read-write at a
             | given moment. Since the migration window only affected a
             | single shard at any given moment, no one noticed anything:
             | for a tenant, it translated to 1-2 seconds of not being
             | able to save. In case of problems it would also be easier
             | to revert a few shards than the entire database.
        
               | nuclearnice3 wrote:
               | I like this approach.
               | 
               | I'm picturing your migration script looping over shards.
               | It flips it to read-only, migrates, then flips back to
               | read-write.
               | 
               | How did the app handle having some shards in read-write
               | mode pre-migration and other shards in read-write post-
                | migration simultaneously?
        
               | kgeist wrote:
               | Yes, it simply looped over shards, we already had a tool
               | to do that.
               | 
               | The app handled it by proxying calls to the new
               | implementation if the shard was marked as "post-
               | migration", the API stayed the same. If it was "in
               | migration", all write operations returned an error. If
               | the state was "pre-migration", it worked as before.
               | 
                | I don't remember the details anymore, but it was something
               | about the event queue or the notification queue which
               | made me prefer this approach over the others. When a
               | shard was in migration, queue processing was also
               | temporarily halted.
               | 
               | Knowing that a shard is completely "frozen" during
               | migration made it much easier to reason about the whole
               | process.
        
           | ww520 wrote:
            | The system needs to be architected in a certain way to allow
            | upgrades without downtime. Something like Command and Query
            | Responsibility Segregation (CQRS) would work. An update
            | queue serves as the explicit transaction log keeping track of
            | the updates from the frontend applications, while the
            | databases at the end of the queue apply updates and serve
            | as the querying service. Upgrading the live database just
            | means having a standby database running the new version
            | replay all the changes from the queue to catch up to the
            | latest changes, pausing the live database from taking new
            | changes from the queue once the new db has caught up,
            | switching all client connections to the new db, and shutting
            | down the old db.
        
           | jtc331 wrote:
           | Preconditions:
           | 
           | 1. Route all traffic through pgbouncer in transaction pooling
           | mode.
           | 
           | 2. Logically replicate from old to new.
           | 
           | For failover:
           | 
           | 1. Ensure replication is not far behind.
           | 
           | 2. Issue a PAUSE on pgbouncer.
           | 
           | 3. Wait for replication to be fully caught up.
           | 
           | 4. Update pgbouncer config to point to the new database.
           | 
           | 5. Issue a RELOAD on pgbouncer.
           | 
           | 6. Issue a RESUME on pgbouncer.
           | 
           | Zero downtime; < 2s additional latency for in-flight queries
           | at time of op is possible (and I've done it at scale).
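            | 
            | The switch itself is just a few commands on pgbouncer's
            | admin console (connection details and timing below are
            | illustrative):
            | 
            |     -- psql -p 6432 -U pgbouncer pgbouncer
            |     PAUSE;   -- hold new queries, drain in-flight ones
            | 
            |     -- wait for replication to be fully caught up, then
            |     -- edit pgbouncer.ini to point at the new database
            | 
            |     RELOAD;  -- pick up the new [databases] config
            |     RESUME;  -- release held clients onto the new primary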
        
           | georgewfraser wrote:
           | This is such a huge problem. It's even worse than it looks:
           | because users are slow to upgrade, changes to the database
           | system take years to percolate down to the 99th percentile
            | user. This decreases the incentive to do certain kinds of
           | innovation. My opinion is that we need to fundamentally
           | change how DBMS are engineered and deployed to support silent
           | in-the-background minor version upgrades, and probably stop
           | doing major version bumps that incorporate breaking changes.
        
           | brentjanderson wrote:
           | Depends on the database - I know that CockroachDB supports
           | rolling upgrades with zero downtime, as it is built with a
           | multi-primary architecture.
           | 
            | For PostgreSQL or MySQL/MariaDB, your options are more
            | limited. Here are two that come to mind; there may be more:
           | 
           | # The "Dual Writer" approach
           | 
            | 1. Spin up a new database cluster on the new version.
            | 
            | 2. Get all your data into it (including dual writes to both
            | the old and new version).
            | 
            | 3. Once you're confident that the new version is 100% up to
            | date, switch to using it as your primary database.
            | 
            | 4. Shut down the old cluster.
           | 
           | # The eventually consistent approach
           | 
            | 1. Put a queue in front of each service for writes, where
            | each service of your system has its own database.
            | 
            | 2. When you need to upgrade the database, stop consuming from
            | the queue, upgrade in place (bringing the DB down temporarily)
            | and resume consumption once things are back online.
            | 
            | 3. No service can directly read from another service's
            | database. Eventually consistent caches/projections service
            | reads during normal service operation and during the upgrade.
           | 
           | A system like this is more flexible, but suffers from stale
           | reads or temporary service degradation.
        
             | jtc331 wrote:
             | Dual writing has huge downsides: namely you're now moving
             | consistency into the application, and it's almost
             | guaranteed that the databases won't match in any
             | interesting application.
        
           | msh wrote:
           | Migrate the data to a new host having the new version.
        
           | karmakaze wrote:
           | The way I've done it with MySQL since 5.7 is to use multiple
           | writers of which only one is actively used by clients. Take
           | one out, upgrade it, put it back into replication but not
           | serving requests until caught up. Switch the clients to
           | writing to the upgraded one then upgrade the others.
        
         | dijit wrote:
         | Even when it's not a cloud provider, in fact, especially when
         | it's not a cloud provider: you can achieve insane scale from
         | single instances.
         | 
         | Of course these systems have warm standbys, dedicated backup
         | infrastructure and so it's not really a "single machine"; but
         | I've seen 80TiB Postgres instances back in 2011.
        
           | cogman10 wrote:
            | We are currently pushing close to 80TB MSSQL on-prem
            | instances.
           | 
           | The biggest issue we have with these giant dbs is they
           | require pretty massive amounts of RAM. That's currently our
            | main bottleneck.
           | 
           | But I agree. While our design is pretty bad in a few ways,
           | the amount of data that we are able to serve from these big
           | DBs is impressive. We have something like 6 dedicated servers
            | for a company with something like 300 apps. A handful of
           | them hit dedicated dbs.
           | 
           | Were I to redesign the system, I'd have more tiny dedicated
           | dbs per app to avoid a lot of the noisy neighbor/scaling
            | problems we've had. But at the same time, it's impressive how
            | far this design has gotten us, and it appears to have a lot
            | more legs on it.
        
             | Nathanba wrote:
             | Can I ask you how large tables can generally get before
             | querying becomes slower? I just can't intuitively wrap my
             | head around how tables can grow from 10gb to 100gb and why
              | this wouldn't worsen query performance by 10x. Surely you do
             | table partitions or cycle data out into archive tables to
             | keep up the query performance of the more recent table
             | data, correct?
        
               | WJW wrote:
               | At my old workplace we had a few multi-TB tables with
               | several billion rows in a vanilla RDS MySql 5.7 instance
               | (although it was obviously a sizable instance type),
               | simple single-row SELECT queries on an indexed column (ie
               | SELECT * FROM table WHERE external_id = 123;) would be
               | low single-digit milliseconds.
               | 
               | Proper indexing is key of course, and metrics to find
               | bottlenecks.
        
               | cogman10 wrote:
               | > I just can't intuitively wrap my head around how tables
               | can grow from 10gb to 100gb and why this wouldnt worsen
               | query performance by x10
               | 
               | Sql server data is stored as a BTree structure. So a 10
               | -> 100gb growth ends up being roughly a 1/2 query
               | performance slowdown (since it grows by a factor of log
               | n) assuming good indexes are in place.
               | 
               | Filtered indexes can work pretty well for improving query
               | performance. But ultimately we do have some tables which
               | are either archived if we can or partitioned if we can't.
               | SQL Server native partitioning is rough if the query
               | patterns are all over the board.
               | 
               | The other thing that has helped is we've done a bit of
               | application data shuffling. Moving heavy hitters onto new
               | database servers that aren't as highly utilized.
               | 
               | We are currently in the process of getting read only
               | replicas (always on) setup and configured in our
               | applications. That will allow for a lot more load
               | distribution.
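                | 
                | For the filtered-index bit above, a minimal example
                | (hypothetical table; SQL Server calls these filtered
                | indexes, Postgres calls them partial indexes):
                | 
                |     -- index only the small "interesting" slice of
                |     -- a huge table:
                |     CREATE INDEX ix_orders_open
                |         ON orders (customer_id, created_at)
                |         WHERE status = 'open';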
        
               | AtlasBarfed wrote:
               | The issue with b-tree scaling isn't really the lookup
               | performance issues, it is the index update time issues,
               | which is why log structured merge trees were created.
               | 
               | EVENTUALLY, yes even read query performance also would
               | degrade, but typically the insert / update load on a
               | typical index is the first limiter.
        
               | abraxas wrote:
               | If there is a natural key and updates are infrequent then
               | table partitioning can help extend the capacity of a
               | table almost indefinitely. There are limitations of
               | course but even for non-insane time series workloads,
               | Postgres with partitioned tables will work just fine.
        
               | cmckn wrote:
               | Well, any hot table should be indexed (with regards to
               | your access patterns) and, thankfully, the data
               | structures used to implement tables and indexes don't
               | behave linearly :)
               | 
               | Of course, if your application rarely makes use of older
               | rows, it could still make sense to offload them to some
               | kind of colder, cheaper storage.
        
               | noselasd wrote:
                | Think of finding a record amongst many as e.g. a binary
                | search. It doesn't take 10 times as many tries to find a
                | thing (row/record) amongst 1000 as it does amongst 100.
        
               | jlokier wrote:
               | A lot depends on the type of queries. You could have
               | tables the size of the entire internet and every disk
               | drive ever made, and they'd still be reasonably fast for
               | queries that just look up a single value by an indexed
               | key.
               | 
               | The trick is to have the right indexes (which includes
               | the interior structures of the storage data structure) so
               | that queries jump quickly to the relevant data and ignore
               | the rest. Like opening a book at the right page because
               | the page number is known. Sometimes a close guess is good
               | enough.
               | 
               | In addition, small indexes and tree interior nodes should
               | stay hot in RAM between queries.
               | 
               | When the indexes are too large to fit in RAM, those get
               | queried from storage as well, and at a low level it's
               | analogous to the system finding the right page in an
               | "index book", using an "index index" to get that page
               | number. As many levels deep as you need. The number of
               | levels is generally small.
               | 
               | For example, the following is something I worked on
               | recently. It's a custom database (written by me) not
               | Postgres, so the performance is higher but the table
               | scaling principles are similar. The thing has 200GB
               | tables at the moment, and when it's warmed up, querying a
               | single value takes just one 4k read from disk, a single
               | large sector, because the tree index fits comfortably in
               | RAM.
               | 
               | It runs at approximately 1.1 million random-access
               | queries/second from a single SSD on my machine, which is
               | just a $110/month x86 server. The CPU has to work quite
               | hard to keep up because the data is compressed, albeit
               | with special query-friendly compression.
               | 
               | If there was very little RAM so nothing could be kept in
               | it, the speed would drop by a factor of about 5, to 0.2
               | million queries/second. That shows you don't need a lot
               | of RAM, it just helps.
               | 
               | Keeping the RAM and increasing table size to roughly 10TB
               | the speed would drop by half to 0.5 million
               | queries/second. In principle, with the same storage
               | algorithms a table size of roughly 1000TB (1PB) would
               | drop it to 0.3 million queries/second, and roughly
               | 50,000TB (50PB) would drop it to 0.2 million. (But of
               | course those sizes won't fit on a single SSD. A real
               | system of that size would have more parallel components,
               | and could have higher query performance.) You can grow to
               | very large tables without much slowdown.
        
         | the8472 wrote:
         | > scalability
         | 
         | You can scale quite far vertically and avoid all the clustering
         | headaches for a long time these days. With EPYCs you can get
         | 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of
         | low-latency storage, minus whatever you need for your network
         | cards), 4TB of RAM in a single machine. Of course that'll cost
         | you an arm and a leg and maybe a kidney too, but so would
         | renting the equivalent in the cloud.
        
         | Vladimof wrote:
         | I don't think that it's amazing.... I think that the new and
          | shiny databases tried to make you think that it was not
         | possible...
        
           | eastbound wrote:
           | React makes us believe everything must have 1-2s response to
           | clicks and the maximum table size is 10 rows.
           | 
           | When I come back to web 1.0 apps, I'm often surprised that it
           | does a round-trip to the server in less than 200ms, and
           | reloads the page seamlessly, including a full 5ms SQL query
            | for 5k rows and returns them in the page (=a full 1MB of
           | data, with basically no JS).
        
         | icedchai wrote:
         | Caches also help a ton (redis, memcache...)
        
           | nesarkvechnep wrote:
           | Also HTTP caching. It's always funny to me why people, not
           | you in particular, reach for Redis when they don't even use
           | HTTP caching.
        
         | markandrewj wrote:
         | Scaling databases vertically, like Oracle DB, in the past was
         | the norm. It is possible to serve a large number of users, and
         | data, from a single instance. There are some things worth
         | considering though. First of all, no matter how reliable your
         | database is, you will have to take it down eventually to do
         | things like upgrades.
         | 
         | The other consideration that isn't initially obvious, is how
         | you may hit an upper bound for resources in most modern
         | environments. If your database is sitting on top of a virtual
         | or containerized environment, your single instance database
         | will be limited in resources (CPU/memory/network) to a single
         | node of the cluster. You could also eventually hit the same
         | problem on bare metal.
         | 
         | That said there are some very high density systems available.
         | You may also not need the ability to scale as large as I am
          | talking about, or you may choose to shard and scale your
          | database horizontally at a later time.
         | 
         | If your project gets big enough you might also start wanting to
         | replicate your data to localize it closer to the user. Another
         | strategy might be to cache the data locally to the user.
         | 
          | There are positives and negatives to a single node vs. a
          | cluster. If Retool's database were clustered, they would have
          | been able to do a rolling upgrade, though.
        
       | booleanbetrayal wrote:
       | Low / zero downtime is totally achievable with pg_logical and
       | really boils down to whether or not you want to try to adopt bi-
       | directional writes (and conflict management / integrity issues),
       | or if you're willing to just have a brief session termination
       | event and swapover. To me, the latter has generally been
       | preferable as conflict management systems tend to be more
       | complicated in reality (based on business logic / state) than
        | what pg_logical provides. Interested if people here have had
       | success with bi-directional writes though.
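        | 
        | For reference, the built-in flavor of this (Postgres 10+; a
        | 9.6 source needs the pglogical extension instead, and the
        | names below are placeholders) boils down to:
        | 
        |     -- on the old (source) cluster:
        |     CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
        | 
        |     -- on the new (target) cluster, after loading the schema:
        |     CREATE SUBSCRIPTION upgrade_sub
        |         CONNECTION 'host=old-db dbname=app user=repl'
        |         PUBLICATION upgrade_pub;
        | 
        |     -- once lag is ~0: stop writes, flip the app over, then
        |     -- drop the subscription.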
        
       | bogomipz wrote:
       | >"Last fall, we migrated this database from Postgres version 9.6
       | to version 13 with minimal downtime."
       | 
       | I thought it was interesting that they upgraded 4 major version
       | numbers in one go. I kept expecting to read something about
       | version compatibility and configuration but was surprised there
       | was none. Are major upgrades like this just less of an issue with
       | Postgres in general?
        
       | rkwasny wrote:
       | "However, on our 4 TB production database, the initial dump and
       | restore never completed: DMS encountered an error but failed to
       | report the error to us."
       | 
       | THERE IS NO CLOUD: It's just someone else's computer
        
       | forinti wrote:
       | I recently upgraded a 9.x database to 14 using pg_upgrade and it
       | was just simple and fast. No issues whatsoever.
       | 
       | I only wish my Oracle updates were so simple and bug-free.
        
       | VincentEvans wrote:
       | Why is SQL Server able to backup and restore databases 100s of
       | gigabytes in size in single-digit minutes, while Postgres is at
       | least 10x slower?
        
       | dboreham wrote:
       | Perhaps time to re-architect to avoid having a huge monolith
       | database, particularly since the SaaS product is inherently
       | multi-tenant?
        
       | wenbin wrote:
       | For listennotes.com, we did a postgres 9.6 => 11 upgrade (in
       | 2019), and 11 => 13 upgrade (in 2021). ~0 downtime for read ops,
       | and ~45 seconds downtime for write ops.
       | 
       | Our database is less than 1TB. One master (for writes + some
       | reads) + multiple slaves (read-only).
       | 
       | Here's what we did -
       | 
       | 1, Launched a new read-only db with pg9.6, let's call it DB_A.
       | 
       | 2, Stopped all offline tasks, and only maintained a minimal fleet
       | of online servers (e.g., web, api...).
       | 
       | 3, Changed all db hosts (no matter master or slave) in /etc/hosts
       | on the minimal fleet of online servers (e.g., web, api...) to use
       | old read-only db with pg9.6, let's call it DB_B. From this point
       | on, all write ops should fail.
       | 
       | 4, Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and
       | promoted it to be a master db.
       | 
       | 5, Changed /etc/hosts on the minimal fleet of online servers
       | (e.g., web, api...) to use DB_A for all db hosts. By this point,
       | DB_A is a master db. And write ops should be good now.
       | 
       | 6, Changed /etc/hosts for all other servers and brought back all
       | services.
       | 
       | Step 4 is the most critical. If it fails or runs too long (e.g.,
       | more than 10 minutes), then we had to rollback by changing
       | /etc/hosts on those online servers.
       | 
       | We carefully rehearsed these steps for an entire week, and timed
       | each step. By the time we did it on production, we knew how many
       | seconds/minutes each step would take. And we tried to automate as
       | many things as possible in bash scripts.
        
         | aolle wrote:
         | wow
        
           | aolle wrote:
        
         | karmelapple wrote:
         | We did something similar recently jumping from 10 to 13. We
         | took measurements, did some dry runs, and came up with
         | strategies to ensure our read-only followers would work fine
         | and we'd have a minimum downtime for writes.
         | 
         | We missed one or two pieces of reconnecting things afterwards,
         | and some of that seems to be limitations of Heroku Postgres
         | that we couldn't change. Hopefully those keep improving.
        
       | hankman86 wrote:
       | Having successfully built (and sold!) a technology startup
       | myself, I would always, always opt for a managed database
       | service. Yes, it's more expensive on paper and you want to run
       | the numbers and choose the right offering. But nothing beats the
       | peace of mind of storing your customers' data on a system that
       | others (Google Cloud in our case) look after. Not to mention that
       | you're better off focussing on your core value proposition and
       | spending your scarce resources there than to have a database
       | administrator on your payroll.
        
         | pojzon wrote:
         | This only works to some scale. Startup from my previous
         | contract reached the point where "using bigger instance" would
         | be losing them money.
         | 
         | On the other hand, self built databases maintained by
          | professionals were a LOT cheaper.
         | 
          | We are talking here about millions of dollars in bills per
          | month just for databases.
         | 
         | Self hosted solution did cost around 300k per month.
         | 
         | This included precompiled kernel patches, tweaks to postgres
         | engine etc.
         | 
          | Overall, the investment of a year of DB admin work will
          | probably pay for itself ten times over, if not more.
        
       | lijogdfljk wrote:
        | How does MySQL fare in this type of setup? Do the DBs differ
       | greatly?
        
         | thatwasunusual wrote:
         | I worked for a company that did a similar "upgrade by
         | replication", but with MySQL. It's quite a few years ago, so I
         | don't remember the versions involved, but it was quite
         | straight-forward once we had done _weeks_ of test runs on a dev
         | environment.
         | 
         | One invaluable thing, though: our application was from the
         | beginning designed to do 100% of all the reads from a read-only
         | slave _if the slave was up to sync_ (which it was 95% of the
         | time). We could also identify testers/developers in the
         | application itself, so we had them using the upgraded slave for
         | two weeks before the actual upgrade.
         | 
         | This made it possible for us to filter out problems in the
         | application/DB-layer, which were few, which means that we
         | probably did a minor version upgrade.
         | 
         | But upgrading by replication is something I can recommend.
        
         | evanelias wrote:
         | MySQL's built-in replication has always been logical
         | replication, and it officially supports replicating from an
         | older-version primary to newer-version replicas. So similar
         | concept to what's described here, but much simpler upgrade
         | process.
         | 
         | Generally you just upgrade the replicas; then promote a replica
         | to be the new primary; then upgrade the old primary and turn it
         | into a replica.
         | 
         | The actual "upgrade" step is quite fast, since it doesn't
         | actually need to iterate over your tables' row data.
         | 
         | At large scale, the painful part of major-version MySQL
         | upgrades tends to be performance testing, but that's performed
         | separately and prior to the actual upgrade process. Third-party
         | tools (pt-upgrade, proxysql mirroring, etc) help a lot with
         | this.
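          | 
          | The moving parts on the MySQL side are roughly this (8.0
          | syntax; older versions spell it CHANGE MASTER TO / START
          | SLAVE, and the host below is made up):
          | 
          |     -- on the upgraded replica being promoted:
          |     STOP REPLICA;
          |     -- verify it has applied everything, then point the
          |     -- app's writes at it
          | 
          |     -- on the old primary, once it too is upgraded:
          |     CHANGE REPLICATION SOURCE TO
          |         SOURCE_HOST = 'new-primary.internal',
          |         SOURCE_AUTO_POSITION = 1;
          |     START REPLICA;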
        
       | cube00 wrote:
       | _> To resolve this, we ended up choosing to leave foreign key
       | constraints unenforced on a few large tables.
       | 
       | > We reasoned this was likely safe, as Retool's product logic
       | performs its own consistency checks, and also doesn't delete from
       | the referenced tables, meaning it was unlikely we'd be left with
       | a dangling reference._
       | 
       | I was holding my breath here and I'm glad these were eventually
       | turned back on.
       | 
       | Nobody should ever rely on their own product logic to ensure
       | consistency of the database.
       | 
       | The database has features (constraints, transactions, etc.) for
       | this purpose which are guaranteed to work correctly and
       | atomically in all situations, including database-initiated
       | rollbacks that your application will never have control over.
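       | 
       | For what it's worth, Postgres lets you add such a constraint back
       | without a long blocking validation scan, by declaring it NOT
       | VALID first and validating it afterwards. A minimal sketch, with
       | invented table and column names:
       | ALTER TABLE app_events
       |   ADD CONSTRAINT app_events_user_fk
       |   FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;
       | -- new writes are checked immediately; existing rows are checked
       | -- by the validation pass, which doesn't block normal traffic:
       | ALTER TABLE app_events VALIDATE CONSTRAINT app_events_user_fk;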
        
         | phphphphp wrote:
         | Does that pattern you describe require any considerations when
         | writing code? I'm thinking of applications I've worked on where
         | events are triggered by changes, and so a database rolling back
         | independently of my application would be a nightmare. I treat the
         | database as a place to store data, not an authority: the
         | application is the authority. Do you approach it differently?
         | Thanks!
        
           | kijin wrote:
           | ON DELETE CASCADE can be dangerous when used with
           | applications that expect to be notified of deletions, like in
           | your case.
           | 
           | Ideally, everything that needs to change when a row is
           | deleted would be changed automatically and atomically using
           | database-side constraints and triggers. In practice,
           | applications often need to sync state with external services
           | that the database knows nothing about, so I understand your
           | concerns.
           | 
           | ON DELETE RESTRICT, on the other hand, will result in errors
           | just like any other query error that you can handle in your
           | application. Nothing happened, so there's nothing to be
           | notified of.
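           | 
           | To make the two behaviours concrete (a sketch; table names
           | invented, and you'd pick one or the other):
           | -- CASCADE: deleting a user silently deletes their orders
           | ALTER TABLE orders ADD CONSTRAINT orders_user_fk
           |   FOREIGN KEY (user_id) REFERENCES users (id)
           |   ON DELETE CASCADE;
           | -- RESTRICT: the DELETE on users fails with an error that the
           | -- application can catch, notify about, and handle
           | ALTER TABLE orders ADD CONSTRAINT orders_user_fk
           |   FOREIGN KEY (user_id) REFERENCES users (id)
           |   ON DELETE RESTRICT;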
        
           | brightball wrote:
           | The database is the only place that can be the authority
           | because the application can have race conditions. It's the
           | only way to guarantee data integrity.
        
             | grogers wrote:
              | There's no way to specify every single application-specific
              | constraint directly in the database. Race conditions are
              | not present when using locking reads (select ... for
              | update, or DB-specific shared locking selects) or the
              | serializable isolation level, which are the typical ways
              | of enforcing application-level constraints.
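              | 
              | For example, a rule like "a user may have at most five
              | projects" can be enforced race-free with a locking read; a
              | minimal sketch, with invented table names:
              | BEGIN;
              | SELECT id FROM users WHERE id = 42 FOR UPDATE;
              | -- concurrent transactions doing the same locking read on
              | -- this user now block until we commit, so the count below
              | -- can't be stale
              | SELECT count(*) FROM projects WHERE user_id = 42;
              | -- the application checks count < 5, then:
              | INSERT INTO projects (user_id, name) VALUES (42, 'demo');
              | COMMIT;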
        
         | icedchai wrote:
         | You'd be surprised. I used to work on a product where the lead
         | developer made sure foreign keys were NOT enabled on
         | production. They were only "allowed" in dev. Some teams have a
         | strict "no foreign keys" rule.
        
         | oppositelock wrote:
         | It's difficult to make a blanket statement like this.
         | 
         | I've built some very high-throughput Postgres-backed systems
         | over the years, and doing application-side foreign key
         | constraints (FKC) does have its benefits. Doing this client
         | side will result in constraints that are usually, but not
         | always, in sync with the data. However, this kind of
         | almost-consistency lets you do much higher-throughput queries.
         | An FKC is a read on every write, for example, and does limit
         | write throughput. Of course, this isn't OK for some workloads,
         | where you do proper FKC in the DB, but if you don't need
         | absolute consistency, you can make writes far cheaper.
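         | 
         | Concretely, the cost being avoided is the lookup that a
         | declared constraint forces on every write; a sketch, with
         | invented names:
         | -- enforced in the DB: every INSERT/UPDATE of events.user_id
         | -- also has to look up (and briefly lock) the row in users
         | ALTER TABLE events ADD CONSTRAINT events_user_fk
         |   FOREIGN KEY (user_id) REFERENCES users (id);
         | -- omitting it skips that per-write check and leaves the
         | -- application responsible for keeping the two tables in sync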
        
           | javajosh wrote:
           | The trade-offs between foreign key constraints vs none are
           | almost identical to the trade-offs between static typing vs
           | dynamic typing. Nowadays people realize that when they turn
           | off these features, they'll eventually have to re-implement
           | them later.
        
             | WJW wrote:
             | You make this claim as if this happens to every company
             | sooner or later, but if a company the size of GitHub can
             | still do without ( https://github.com/github/gh-
             | ost/issues/331#issuecomment-266...) it does become a little
             | bit of a "you do not have google problems" type discussion.
             | 
             | (Perhaps you do have such problems, I don't know where you
             | work! But 99%+ of companies don't have such problems and
             | never will.)
        
               | [deleted]
        
         | jasfi wrote:
         | True, but DB constraints often aren't enough. I like to have a
         | verification program that checks the DB is in the expected
         | state.
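         | 
         | For foreign-key-style expectations, that verification can be as
         | simple as an anti-join; a sketch, with invented names:
         | SELECT e.id
         | FROM events e
         | LEFT JOIN users u ON u.id = e.user_id
         | WHERE u.id IS NULL;  -- dangling references; should be empty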
        
           | weird-eye-issue wrote:
           | You don't trust DB constraints?
        
             | [deleted]
        
             | deathanatos wrote:
              | It depends on the database, and sometimes on the database
              | config, whether they'll actually be enforced or not, or in
              | what situations data might evade enforcement of the
              | constraints...
             | 
             | Applies to vendors, too. Had some data in Rackspace Files
             | where "list files" would say X existed, but "GET X" got you
             | a 404. Had an AWS RDS instance; query on it returned no
             | results. Adding the "don't use the index" index hint caused
             | it to return data. (Allegedly, this bug was fixed, but we
             | had migrated off by that point, so I never got to confirm
             | it.)
             | 
             | Conversely, I do like DB constraints, because if the DB
              | constraint _doesn't_ exist, then I guarantee you the
             | production DB has a row that is a counter-example to
             | whatever constraint you think the data _should_ obey...
        
               | majewsky wrote:
               | > Had some data in Rackspace Files where "list files"
               | would say X existed, but "GET X" got you a 404.
               | 
               | Well yes, Rackspace Files (aka OpenStack Swift) is
               | eventually consistent. It says so literally in the _first
               | sentence_ of the documentation [1]. But this discussion
               | is about relational databases with ACID guarantees, where
               | the C is literally  "consistent".
               | 
               | [1] https://docs.openstack.org/swift/latest/
        
             | rattray wrote:
             | They're limited in what they can express; your application
             | often has invariants you want to enforce/maintain that
             | can't be (performantly) expressed with DB constraints, and
             | must be validated another way.
             | 
             | As great as it can be to enforce rules within the database,
             | a lot of them usually end up needing to be enforced at the
             | application layer instead. Especially when performance at
             | scale comes into play.
        
               | brightball wrote:
               | I think it's a balance. Transactions + Constraints can
               | enforce most things but there will certainly be things
               | that can only be verified in the app.
               | 
               | My goal is always to verify what I can in the database to
                | minimize potential data cleanup. In my experience,
                | app-only verification always leads to future time
                | investments to clean up the mess.
        
               | jasfi wrote:
               | Make no mistake, I think DB constraints are a best
               | practice.
        
               | WJW wrote:
               | They have their place, but also their limits.
        
       | simonw wrote:
       | I found the script they used for copying data really interesting:
       | https://gist.github.com/peterwj/0614bf6b6fe339a3cbd42eb93dc5...
       | 
       | It's written in Python, spins up a queue.Queue object, populates
       | it with ranges of rows that need to be copied (based on min < ID
       | < max ranges), starts up a bunch of Python threads and then each
       | of those threads uses os.system() to run this:
       | psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
       | | psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
       | 
       | This feels really smart to me. The Python GIL won't be a factor
       | here.
        
         | teej wrote:
         | For ETL out of Postgres, it is very hard to beat psql.
         | Something as simple as this will happily saturate all your
         | available network, CPU, and disk write. Wrapping it in Python
         | helps you batch it out cleanly.
         | psql -c "..." | pigz -c > file.tsv.gz
        
         | mrbabbage wrote:
         | Thanks Simon! I can indeed confirm that this script managed to
         | saturate the database's hardware capacity (I recall CPU being
         | the bottleneck, and I had to dial down the parallelism to leave
         | some CPU for actual application queries).
        
         | SahAssar wrote:
         | Sounds to me like this is the exact thing that the normal
         | parallel command was made for; not sure Python is needed here
         | if the end result is shelling out to os.system anyway.
        
       | valzam wrote:
       | I am generally a fan of using as few moving parts as possible but
       | if > 60-70% (2TB + a "few hundred GB") of your prod database are
       | an append only audit log surely if would make sense to split that
       | part into a separate DB server? Especially when you are using a
       | hosted service. It sounds like both uptime and consistency
       | requirements are very different between these two parts of the
       | production data.
        
         | ranyefet wrote:
         | My thoughts exactly.
         | 
         | Does Postgres still make sense for append-only tables, or would
         | Elastic or some other kind of database be more suitable?
        
           | abraxas wrote:
           | Postgres makes a lot of sense for append-only tables. You can
           | usually partition them by time, which gives you an easy way
           | to break up the index trees, lets you use a cheap indexing
           | scheme like BRIN, and lets you simply drop old chunks as they
           | become irrelevant.
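           | 
           | A minimal sketch of that setup (table and partition names
           | invented):
           | CREATE TABLE audit_log (
           |   id bigint GENERATED ALWAYS AS IDENTITY,
           |   created_at timestamptz NOT NULL,
           |   payload jsonb
           | ) PARTITION BY RANGE (created_at);
           | CREATE TABLE audit_log_2022_04 PARTITION OF audit_log
           |   FOR VALUES FROM ('2022-04-01') TO ('2022-05-01');
           | CREATE INDEX ON audit_log_2022_04 USING brin (created_at);
           | -- retiring old data is just dropping (or detaching) a chunk:
           | DROP TABLE audit_log_2022_01;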
        
       | sega_sai wrote:
       | Using pg_upgrade, I recently upgraded this 100 TB DB (from PG12
       | to PG13) in ~10 min of downtime.
       | 
       | => select pg_size_pretty(pg_database_size ('**'));
       |  pg_size_pretty
       | ----------------
       |  99 TB
       | (1 row)
       | 
       | (The re-analyze of tables took a day or so though)
        
         | riku_iki wrote:
         | Curious what your storage story is. Where is the DB actually
         | stored? Some NAS?
        
           | sega_sai wrote:
           | It's DAS. Mostly it's in one box of RAIDed 32x5.5Tb drives
           | with a couple of tablespaces/WAL elsewhere. The DB is mostly
           | read-only with not many concurrent users, so that's probably
           | not the most typical case.
        
         | comboy wrote:
         | Yeah, my paltry 6 TB also went fine with pg_upgrade and
         | practically no downtime. Upgrade the slave, promote it to
         | master, upgrade the old master and then promote it back. It's a
         | marvelous piece of technology.
         | 
         | It's really just a handful of core people who did most of the
         | work, crafting it so thoughtfully over the years, and it has
         | had such a huge impact on the world.
         | 
         | They did a huge part of it before PostgreSQL was as popular as
         | it is today, spending countless hours making great design
         | choices and implementing them carefully.
         | 
         | It seems unlikely any of them will read this, but I'm deeply
         | grateful to these people. Being open source and free, it has
         | allowed so many things to flourish on top of it.
        
       | jeffrallen wrote:
       | Nifty. But I can't help thinking this was harder than it needed
       | to be in the cloud. Because frankly, 4 TB is not big: my home
       | Synology backup server is 4 TB. Setting up a pair of standalone
       | Linux servers to rehearse this locally, with full control over
       | which Postgres modules and other software to use, would have made
       | things easier, it seems.
       | 
       | Anyway, thanks for the food for thought.
        
       | TruthWillHurt wrote:
       | Wow, this takes me back 20 years, to when we did this kind of
       | partial-dump+sync migration of MySQL.
       | 
       | Then the cloud and DBaaS were invented.
        
       | electroly wrote:
       | In SQL Server you just... do the upgrade. You install the upgrade
       | on your nodes starting with the passive nodes, and it will
       | automatically failover from the old version to the new version
       | once half the nodes have been upgraded. No downtime, but your
       | redundancy drops when some nodes have been upgraded but the
       | cluster hasn't fully been upgraded yet. You certainly don't have
       | to dump and restore your database. Without giving private
       | numbers, our database is much bigger than OP's 4TB; dump and
       | restore would be wildly unacceptable.
       | 
       | The idea that you don't get a seamless upgrade of the database
       | itself with PostgreSQL is absurd to me. The part about
       | "maximizing the amount of time this upgrade buys us" is only
       | necessary because of how difficult PostgreSQL makes upgrades. We
       | upgrade to every new version of SQL Server. It's not that big of
       | a deal.
       | 
       | With every PostgreSQL blog article I read, I become more and more
       | of an SQL Server fanboy. At this point it's full-blown. So many
       | "serious business" PostgreSQL ops posts are just nothingburgers
       | in the SQL Server world.
        
         | calpaterson wrote:
         | PG is far behind SQL Server on ease of upgrade but the method
         | described in this post is not the best practice right now,
         | which I think is:
         | 
         | - physical restore to new cluster
         | 
         | - pg_upgrade the new cluster
         | 
         | - catch up on logical WAL from the old cluster
         | 
         | - failover to new cluster
         | 
         | - STONITH
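         | 
         | (The catch-up step uses native logical replication; very
         | roughly, glossing over getting the replication slot and LSN
         | positioning right, and with invented names:)
         | -- on the old cluster:
         | CREATE PUBLICATION upgrade_pub FOR ALL TABLES;
         | -- on the new, already-pg_upgraded cluster:
         | CREATE SUBSCRIPTION upgrade_sub
         |   CONNECTION 'host=old-primary dbname=app'
         |   PUBLICATION upgrade_pub
         |   WITH (copy_data = false);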
         | 
         | I think the above was not open to them because of the
         | limitations of their managed PG instance. I haven't used Azure,
         | but GCP managed SQL has loads of limitations. It seems very
         | common, and I think it is a major (and undiscussed) drawback of
         | these managed instances.
         | 
         | But the truth is that, for reasons of prejudice, very few of
         | the people who use PG want to hear that things are better in
         | the MS SQL community, and as a result you're being downvoted
         | unfairly for pointing out PG's relative backwardness here.
        
         | Teletio wrote:
         | Any real facts you can show besides your sentiment on blog
         | posts?
        
           | electroly wrote:
           | What facts are you looking for? I just described the steps
           | from this document: https://docs.microsoft.com/en-us/sql/sql-
           | server/failover-clu... -- specifically, the "Perform a
           | rolling upgrade or update" section. There's nothing else to
           | my post other than contrasting the SQL Server upgrade process
           | to the one described in the article, and musing about my
           | growing appreciation for SQL Server; I apologize if it seemed
           | like it was going to be deeper than that.
           | 
           | EDIT: I realized you're looking for the other PostgreSQL blog
           | posts. Here's an example of two recent HN posts about
           | PostgreSQL issues that I pulled out of my comment history.
           | Both of these blog posts exist because PostgreSQL doesn't
           | have query hints. SQL Server has them; I've dealt with issues
           | like these blog posts describe but they have trivial fixes in
           | the SQL Server world. Nothing to write a blog post about. I
           | don't have a link handy regarding PostgreSQL's txn id
           | wraparound problem, but SQL Server doesn't have that problem,
           | either.
           | 
           | - https://news.ycombinator.com/item?id=30296490
           | 
           | - https://news.ycombinator.com/item?id=29981737
        
         | aidos wrote:
         | It's been a long time since I used SQL Server so I don't know
         | that upgrade process well (I'm willing to believe it's smoother
         | though, especially wrt replication / failover).
         | 
         | Keep in mind that they're upgrading from a database version
         | that's almost 6 years old. Postgres has improved a lot in the
         | last 5 major versions since then.
         | 
         | Another thing here is that I'm pretty sure they could have just
         | done the in-place upgrade and it would have been fine. I've run
         | pg_upgrade myself for a bunch of major versions now and it's
         | easy and doesn't require dumping / restoring anything. Maybe
         | there's something else going on that I'm missing though.
         | 
         | What setup are you running with SQL Server to have it
         | automatically failover? Is it a multi master configuration or
         | are the additional nodes just read replicas?
         | 
         | These days Postgres actually allows logical replication so your
         | servers can be running different versions at the same time,
         | which allows for much smoother upgrades (haven't tried that
         | myself yet, don't quote me on it!)
        
         | Symbiote wrote:
         | They did not dump and restore.
         | 
         | They created a replica database running the new version, then
         | switched over to it. Not too dissimilar to what you describe,
         | although more work since they started out with only a single
         | instance without replication support.
        
           | electroly wrote:
           | They _ultimately_ didn't dump and restore, but it was the
           | first thing they tried. It didn't work; it actually failed
           | catastrophically for them. They describe this under the
           | "Implementing logical replication" section. Their ultimate
           | solution is what they tried after the dump-and-restore based
           | DMS method failed and they took an unplanned outage due to
           | yet more PostgreSQL-specific issues (unvacuumed tuples).
           | 
           | All of this is exactly what I'm talking about. This blog post
           | describes kind of a nightmare process for something that is
           | trivial in SQL Server. They actually needed a third party
           | product from Citus just to successfully upgrade PostgreSQL!
           | Stunning.
        
             | VincentEvans wrote:
              | Postgres dump and restore tooling is very poor performance-
              | wise, easily 10x slower compared to SQL Server. I love
              | Postgres dearly and prefer to use it despite that, but I
              | wish the Postgres devs would renew their interest in
              | improving the neglected dump/restore tooling.
        
       ___________________________________________________________________
       (page generated 2022-04-19 23:00 UTC)