[HN Gopher] Zero downtime Postgres migration, done right
       ___________________________________________________________________
        
       Zero downtime Postgres migration, done right
        
       Author : rigaspapas
       Score  : 281 points
       Date   : 2021-06-11 14:25 UTC (8 hours ago)
        
 (HTM) web link (engineering.theblueground.com)
 (TXT) w3m dump (engineering.theblueground.com)
        
       | ckboii89 wrote:
        | Does anyone know if this works when the target database is a
        | replica/standby? The downside of using pg_dump is that it
        | acquires a lock on the table it's dumping, and doing this on
        | production may cause some slowness.
        
       | jeffbee wrote:
       | Bucardo has no performance impact, it just adds a trigger to
       | every mutation. Negligible!
       | 
        | I really think articles of this kind are not useless, but they
        | need to explicitly narrow their audience to set expectations at
        | the start. This particular topic is PG replication for users who
        | aren't very sensitive to write latency.
        
       | Nextgrid wrote:
       | Internet Archive link in case the original is overloaded:
       | https://web.archive.org/web/20210611143214/https://engineeri...
        
       | jlmorton wrote:
       | If you are in AWS, or have connectivity available, AWS Database
       | Migration Service makes this relatively trivial.
       | 
       | DMS for Postgres is based on Postgres Logical Replication, which
       | is built-in to Postgres, and the same thing Bucardo is using
       | behind the scenes. But AWS DMS is very nearly point-and-click to
       | do this sort of migration.
        
         | chousuke wrote:
          | The key here is Postgres 9.5. AWS DMS does not support it,
          | because DMS requires logical replication support.
         | 
         | A few years back I migrated a PostgreSQL 9.2 database to AWS
         | and wasn't able to use RDS because logical replication was not
         | available.
         | 
          | I did try to use Bucardo, but ultimately didn't trust myself
          | to configure it such that it wouldn't lose data (my first
          | attempt left nearly all BLOBs unreplicated, because the data
          | isn't actually in the tables you set triggers on).
         | 
          | Physical replication to a self-built instance was easy, I was
          | 100% confident it wouldn't be missing data, and the downtime
          | from cutover was about 15 minutes (it involved restarting
          | piles of slow Java applications).
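          | 
          | A rough sketch of that physical path (host and paths made up;
          | exact flags vary by Postgres version):
          | 
          |     pg_basebackup -h old-primary -U replicator \
          |       -D /var/lib/pgsql/data -X stream
          |     # configure the standby, let it stream until cutover, then:
          |     pg_ctl promote -D /var/lib/pgsql/data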
        
         | booleanbetrayal wrote:
         | I think DMS generally lags RDS releases. Last I checked, there
         | still wasn't a way to replicate from Postgres RDS 12 -> 13 with
         | DMS.
        
       | endisneigh wrote:
        | I wonder how much easier software engineering would be if there
        | were a period where things are simply not available.
        | 
        | Which problems that are currently very difficult would become
        | trivial if 6 hours of downtime every Sunday (10PM-4AM EST) were
        | acceptable?
        
         | legohead wrote:
          | Easier, sure, but zero-downtime migration is a fun engineering
          | problem (albeit a stressful one). You just need an employer
          | who is willing to pay for it.
        
         | yupper32 wrote:
          | I find it very hard to come up with a use case where six
          | hours of weekly downtime at night EST would be acceptable.
         | 
         | US government websites already often do this, and I find that
         | completely unacceptable since these services need to be
         | available to anyone, regardless of work and life schedules.
         | 
         | Any website that's international would also suffer greatly for
         | an EST centralized scheduled downtime.
         | 
         | Maybe a very localized website that doesn't have much impact on
         | real life?
        
           | endisneigh wrote:
           | Why wouldn't it be acceptable? People aren't available all
           | the time either, nor are stores. In fact software is unique
           | in its availability.
        
             | yupper32 wrote:
             | Frankly, because it can be. Weekly scheduled downtime is
             | arbitrary, manufactured, and lazy.
             | 
             | Of course, having to schedule the occasional downtime for a
             | database migration is fine. There's probably a few times a
             | year that you'd need to do it if you don't have the
             | bandwidth to do fancy zero-downtime solutions. It's the
             | weekly arbitrary downtime that I'm firmly against.
        
               | endisneigh wrote:
               | Why is it lazy? When things are up people have to work.
               | Do you believe people should be working all of the time?
               | 
                | I think regular downtime is only natural. If you had to
                | choose between 95% availability and 100% availability
                | other than the aforementioned downtime, which would you
                | choose?
        
               | yupper32 wrote:
               | Sorry I'm a bit confused by your argument.
               | 
               | > When things are up people have to work.
               | 
               | That's not true. Monitoring software and on-call
               | rotations are well established things, and wouldn't even
               | go away in your world. Believe it or not, people who run
               | websites already sleep at night.
               | 
               | > Do you believe people should be working all of the
               | time?
               | 
                | I'm not sure where you got that. Besides, you're having
                | people in the office in the middle of the night every
                | Sunday already. Or are you saying that the downtime is
                | when your team gets to sleep?
               | 
               | > If you had to choose between 95% availability or 100%
               | availability other than the before mentioned downtime
               | which would you choose?
               | 
               | Again, not sure what you're getting at. There's only a
               | few reasons why you'd need to schedule a downtime.
               | Database migrations are a good one. All of the reasons
               | should only happen a few times a year, likely. Way better
               | than 52 times a year.
        
               | endisneigh wrote:
               | I'm not making a particular argument. I'm just wondering
               | if having regular downtime is better or worse in terms of
               | reducing engineering complexity.
        
         | kabdib wrote:
         | What part of the world?
         | 
         | One service I worked on (with customers around the globe) had
         | about 10 minutes of scheduled downtime a week. Peak user load
         | was, interestingly enough, during our core hours, when there
         | are a lot more hands on deck to handle any issues. So that's
         | the time we chose.
         | 
         | Naturally people still complained bitterly about that ten
         | minutes.
         | 
         | Getting to zero didn't seem worth the significant engineering
         | effort required. "You're not Amazon, and you're not 911"
         | [emergency number in the US]
        
         | doctor_eval wrote:
         | Setting aside the other comments, I think it depends on your
         | definition of "easier".
         | 
         | Yes - it is easier in terms of up-front investment to have a
         | scheduled (daily, weekly) downtime. But that often means the
         | downtime is not automated - it's someone doing the work, often
          | out of hours when they're not fresh, things can go wrong, etc.
         | And it also means that the duration of the work is limited to
         | the scheduled downtime. Some large upgrades just won't fit in
         | the allotted time, and you need to schedule an outage.
         | 
         | On the other hand, creating a system designed for zero downtime
         | (ie, in-place upgrades, removal of big batch jobs etc) is a lot
         | more of an investment up front, and makes upgrade planning more
         | complex, but ultimately it saves time and effort because you
         | can do all the changes during working hours, and "upgrades"
         | become just how you deploy code - part of the core skill set of
         | your team - and you do them 10 times a day.
         | 
         | The main difference that I've seen is that the scheduled
         | downtime approach tends to create a much more conservative
         | upgrade cycle, much bigger upgrades, and more risk of an
         | upgrade failing - which means that a pile of features don't get
         | released. Also, certain customers will push back and demand a
         | ton of documentation and detailed release notes and advanced
         | warning of the upgrades etc - this creates loads of
         | unproductive work for a small company. A continuous deployment
         | model tends to reduce these problems because you're not
         | externalising your upgrade schedule and you don't have a single
         | big upgrade that can go bang, so maybe one feature doesn't get
         | released but everything else does.
         | 
         | Having taken both approaches (and even moved a company from
         | scheduled to continuous) I would never go back to a scheduled
         | downtime approach.
        
         | maccard wrote:
         | Your team is now working 10pm to 4am est every Sunday, and
         | anyone who uses your services in any time zone around the world
         | is without your service. 10pm est is lunch time in Australia.
         | 4am est is working hours in many parts of western Europe.
         | 
         | You don't have to be huge to have customers all around the
         | world.
        
           | yifanl wrote:
           | I think the idea is that this world is one where the customer
           | accepts that there may be an outage every Sunday because
           | feature work can be done far more cheaply.
        
           | endisneigh wrote:
           | You're correct, but if there happened to be an outage during
           | that time now you'd have to have your team working then
           | anyway.
           | 
           | The existence of a planned downtime doesn't necessarily mean
           | deployments, or work, have to happen during that time.
        
             | yupper32 wrote:
             | > but if there happened to be an outage during that time
             | now you'd have to have your team working then anyway.
             | 
             | So you're having them work late 52 times a year AND when
             | there's an outage vs only when there's an outage. They're
             | working late WAY more in your plan.
        
               | endisneigh wrote:
                | The existence of planned downtime doesn't mean they have
                | to work during said time. I'm not following your
                | reasoning.
        
               | kabdib wrote:
               | Er, you bring the service down so you can do work on it.
               | _Somebody_ is doing that work.
               | 
               | I guarantee that you'll have trouble hiring qualified
               | people for a role that includes core hours of "2AM-5AM
               | Sunday mornings" (or whatever other oddball time you've
               | decided). Updating systems is precisely when you may need
               | to have all-hands-on-deck responses. Been there, multiple
               | times.
        
               | sigstoat wrote:
               | they might not have to work during every planned
               | downtime, but they're going to have to work during some
               | of them, or what was the point of the downtime?
        
               | endisneigh wrote:
                | The point is to create a predictable period during
                | which any downtime would occur.
        
               | xmprt wrote:
               | This predictability only matters for frequent, repeat
               | customers. But those are the exact customers that you
               | could just email saying "the website is going to be down
               | on Sunday for 1 hour for planned maintenance" a few times
               | a year when it's actually required.
               | 
               | If you run a business (eg. HFT where you run certain
               | cleanup scripts after trading hours) where regular
               | planned downtime is fine then go for it but most of the
               | time, you don't need a downtime schedule every single
               | week because you're not doing planned maintenance work
               | every single week.
        
               | endisneigh wrote:
               | I think people are getting too caught up in my particular
               | example. It could be a month, biweekly, etc.
        
         | i_have_an_idea wrote:
         | Interestingly, this is unacceptable in the US and probably most
         | of Western Europe. However, I know of more than one bank in
         | Eastern Europe where online banking simply doesn't work after
         | midnight, until about 6am.
        
       | ivan888 wrote:
        | I didn't read this article, but I really hate the tagline "done
        | right". It expresses such a poor sense of humility, which is one
        | of the most important traits in the world of software, perhaps
        | _the_ most important.
        
         | doctor_eval wrote:
         | Yeah I got the same feeling but couldn't quite put my finger on
         | it.
         | 
         | "Here's how we did this. Feedback welcome" would seem more
         | appropriate.
         | 
          | Perhaps it doesn't fit the hyper-aggressive mould of the
          | would-be titan of industry.
          | 
          | That said, their engineering blog seems to be down, so
          | ¯\_(ツ)_/¯
        
       | dkhenry wrote:
        | This is one of the areas where Postgres is so far behind MySQL
        | that it's embarrassing. Zero-downtime migrations in MySQL have
        | been a common method for over 10 years. This solution is far
        | from ideal due to its use of triggers, which can greatly
        | increase the load on the database and slow down transactions. If
        | you don't have a lot of load on your DB that's fine, but if you
        | are pushing your DB this will bring it down.
        | 
        | In MySQL they started with triggers with pt-osc, but now there
        | is gh-ost, which does it with replication. You can do something
        | like this with Postgres by using logical replication, but it
        | still requires hand-holding, and to my knowledge there is no way
        | to "throttle" the migration like you can with gh-ost. Where I
        | work now we are building all this out so we can have first-class
        | online migrations, but the chasm is still pretty big.
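        | 
        | For reference, gh-ost's throttling is driven from the command
        | line; a rough sketch (host, schema, and table names made up):
        | 
        |     gh-ost \
        |       --host=replica.example.com \
        |       --database=app --table=users \
        |       --alter='ADD COLUMN flags INT NOT NULL DEFAULT 0' \
        |       --max-lag-millis=1500 \
        |       --throttle-flag-file=/tmp/gh-ost.throttle \
        |       --execute
        | 
        | Touching the throttle file pauses the row copy without aborting
        | the migration; removing it resumes.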
        
         | merb wrote:
          | > but it still requires hand-holding, and to my knowledge
          | there is no way to "throttle" the migration like you can with
          | gh-ost
          | 
          | Eh? You only do a base backup and then you can begin the
          | logical replication. At some point you then do a failover?
          | 
          | chtitux basically described the process, which is extremely
          | simple.
        
       | booleanbetrayal wrote:
        | Has anyone here leveraged pglogical for this before? I'm looking
        | at a similar migration, and pglogical has native extension
        | support in AWS RDS. Would love to hear any success / horror
        | stories!
        
       | coolspot wrote:
        | I just don't get all the "just shut down the db for an hour,
        | you're not Netflix" comments.
        | 
        | If you can do things properly, as an engineer, you absolutely
        | should, even if your company serves "just" hundreds of thousands
        | of users instead of hundreds of millions.
       | 
       | It is not like they wrote their own database for that, they just
       | used an open source tool.
        
         | mswtk wrote:
          | My understanding of doing things "properly" as an engineer is
          | picking the solution with the right tradeoffs for my use case.
         | If the cost to the business of having some amount of scheduled
         | downtime occasionally is significantly less than the
         | engineering cost of maintaining several 9s worth of
         | availability over major migrations, then I consider the former
         | to be "done right".
        
         | perrygeo wrote:
         | If you do things properly as an engineer, you've already
         | negotiated and committed to service-level agreements with
         | specific downtime objectives, right? Right?
         | 
         | If you've got say 5 minutes of downtime budget per month, do
         | you really think it's a good investment to spend a million
         | dollars of engineering effort and opportunity costs to get
         | database downtime to zero seconds? Or you could use off-the-
         | shelf techniques for orders of magnitude less investment and
         | suffer a few minutes of downtime at worst, well within your
         | SLO. It's an economic decision, not a technical one.
         | 
         | If zero downtime at any cost is non-negotiable, well you'd
         | better hire, plan and budget accordingly. And hope those few
         | extra minutes per year are worth the cost.
        
           | coolspot wrote:
            | I agree that it is an economic decision. In my understanding
            | it didn't cost them millions of dollars to do it with zero
            | downtime. Maybe $10k-$50k in development/admin/test hours to
            | make it so.
           | 
           | Also engineers get better (and happier) when they do
           | challenging tasks!
        
       | holoduke wrote:
        | Looking for something similar for a MariaDB setup. Does anyone
        | know of some resources?
        
         | nick__m wrote:
          | You can do multi-master replication with conflict detection
          | using SymmetricDS. The migration would be similar to the
          | procedure described in the article.
        
       | mmanulis wrote:
       | Can this be done using existing PostgreSQL functionality around
       | replicas? I think there's a plugin for PostgreSQL that supports
       | master-master replication as well.
        
       | luhn wrote:
       | Braintree (IIRC) had a really clever migration strategy, although
       | I can't seem to find the blog post now. They paused all traffic
       | at the load balancer, cut over to the new DB, and then resumed
       | traffic. No requests failed, just a slight bump in latency while
       | the LBs were paused.
       | 
        | This app apparently had robust enough retry mechanisms that they
        | were able to just eat the errors and not have customer
        | issues--color me impressed! I'm not sure how many teams can make
        | that claim; that's a hard thing to nail down.
        
         | scrollaway wrote:
         | I remember reading something like that about Adyen. That might
         | be why you're unable to find it.
        
           | nezirus wrote:
            | pgbouncer has a PAUSE command, which can be used for
            | seamless restarts/failovers or similar.
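            | 
            | A rough sketch from pgbouncer's admin console (pool name
            | and admin user made up):
            | 
            |     psql -p 6432 -U pgbouncer pgbouncer -c 'PAUSE app;'
            |     # repoint the backend in pgbouncer.ini, then:
            |     psql -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'
            |     psql -p 6432 -U pgbouncer pgbouncer -c 'RESUME app;'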
        
           | rattray wrote:
           | I wonder if that was it (Adyen and Braintree are
           | competitors).
           | 
           | Here's the post:
           | https://www.adyen.com/blog/updating-a-50-terabyte-
           | postgresql...
           | 
           | Discussed here: https://news.ycombinator.com/item?id=26535357
           | 
            | They architected their application to be able to tolerate
            | 15-30 minutes of Postgres downtime.
        
         | simonw wrote:
         | I think that's covered in this talk (I've not watched the video
         | though): https://www.braintreepayments.com/blog/ruby-conf-
         | australia-h...
        
           | pgr0ss wrote:
           | It was also covered in:
           | https://www.braintreepayments.com/blog/switching-
           | datacenters...
        
             | luhn wrote:
             | Aha, that's the post I was looking for!
        
         | xyst wrote:
          | Probably had a good circuit-breaker design. Would love to read
          | the article, though.
        
       | silviogutierrez wrote:
       | Very interesting article. But I have to ask: would taking down
       | the system for a couple of hours be _that_ bad?
       | 
       | I looked at the company, and while they seem rather large,
       | they're not Netflix or AWS.
       | 
        | I imagine they need to be up for people to be able to check in,
        | etc. But they could just block out planned maintenance windows
        | around check-in times far in advance. I'm sure there are a
        | million other edge cases, but those can be thought out and
        | weighed against the engineering effort.
       | 
       | Don't get me wrong, this is very cool.
       | 
       | But I wonder what the engineering cost was. I'd think easily in
       | the hundreds of thousands of dollars.
        
         | jeffbee wrote:
         | The problem happens when you estimate it will only take a few
         | hours and at the end of the first hour it's 1% done, or it
         | nears completion and crashes, in a loop. Now what?
        
         | [deleted]
        
         | maccard wrote:
         | I think it's always worth questioning both sides. Why is
         | downtime acceptable? People on this site routinely complain
          | about Windows needing a restart for system updates while
          | boasting about their Linux servers' uptime.
         | 
         | People talk about how kubernetes is overkill for many people,
         | but it gives you rolling deployments for your applications out
         | of the box.
         | 
          | There's also the "slippery slope" argument. A zero-downtime
          | migration means zero downtime. A 5-minute migration
          | occasionally creeps up to a 30-minute migration, and then
          | regression to the mean happens, causing standard migrations
          | to be 30 minutes.
        
           | Daishiman wrote:
           | A lot of 0-downtime migrations end up turning into a months-
           | long ordeal and can still fail.
           | 
            | Allowing maintenance windows means you can do things in a
            | much simpler manner. Need to take a snapshot of a DB without
            | running into issues with a production system adding data?
            | Sure, go ahead, you just saved two months.
        
             | hinkley wrote:
             | It's the old can/should argument.
             | 
             | The principle of something like zero downtime or switching
             | a vendor out to save 20% may not stack up against the
             | reality in opportunity costs, employee burnout, and
             | customer turnover.
             | 
             | My company is reorganizing and so everyone is trying to
             | settle old philosophical debates during The Churn. But
             | everyone is overextended and a number of us have started
             | pushing back. Just the meetings to discuss changing vendors
             | might push the break-even point out 6 months, before you
             | even get into implementation costs. The more time we spend
             | working on infrastructure instead of customer-facing
             | features and bugs, the more likely those customers are to
             | wander off.
        
           | xmprt wrote:
            | I think the slippery slope argument is almost always a
            | fallacy. In practice you can say "we accept 99.99%
            | availability" (0.01% of 8,760 hours is about 53 minutes a
            | year) and budget for that. Your service might go down for
            | unplanned reasons such as an outage, but it could also go
            | down for planned maintenance, and as long as you're within
            | that budget it seems ok.
        
           | munk-a wrote:
           | > Why is downtime acceptable?
           | 
            | At least where I'm working, downtime is acceptable because
            | our customer base is composed of professionals in North
            | America who work 9-5; thus, taking the DB offline for five
            | minutes at midnight PST has essentially no cost to our
            | client base. To contrast that, spending three months
            | developing a rollout plan costs our company dearly in
            | competitive advantage.
           | 
           | I agree that it's always worth examining from both sides but
           | I also think that 0 downtime migration is both a shiny
           | interesting problem and a point of pride for developers that
           | can cause a company to vastly over invest in a complex
           | technical solution to a problem they never had.
        
         | fullstop wrote:
         | I work for a small company and it would be devastating if our
         | database was down for a few hours.
        
           | silviogutierrez wrote:
            | Unplanned? Probably. But maybe it's not as dire as you
            | think? Azure Active Directory (auth as a service) went down
            | for a _while_, globally, and life went on. Same with GCP
            | and occasionally AWS[1].
            | 
            | I'm not saying there's no downside; I'm weighing the
            | downside _against_ the engineering cost. And that itself
            | carries a risk of failure when you go live. It's not
            | guaranteed.
            | 
            | [1] AWS is _so_ ubiquitous, though, that half the internet
            | would be affected, which makes it less individually
            | noticeable.
        
             | dividedbyzero wrote:
             | So don't host on AWS, wait out their next big outage and
             | then take down your app for that big migration, so you can
             | hide among all the other dead apps. Half joking, of course.
        
               | silviogutierrez wrote:
                | That's brilliant. Have the PR ready to go and
                | merge/deploy only when that happens. Just host on
                | us-east-1 and it'll happen eventually.
        
           | jbverschoor wrote:
            | I doubt that... AWS is down sometimes too. Card payments are
            | sometimes down; even 112 (911) is sometimes down.
            | 
            | I hope you have everything redundant, including the CEO.
        
             | fullstop wrote:
             | The database has a hot standby and two warm standby
             | instances, redundant hardware, network infrastructure, all
             | of that.
        
             | xcambar wrote:
             | > I hope you have everything redundant, including the CEO
             | 
             | Made me laugh, I'll sure reuse it!
        
           | hobs wrote:
           | OTOH I worked for several small companies that had "business
           | hours" software - if you had to take downtime on the weekend
           | or after hours you'd be looking at impacting <1% of users.
        
           | willcipriano wrote:
           | In my view that's just saying at some point in the future the
           | company will be devastated. Nothing has 100% uptime.
        
           | chefkoch wrote:
           | Then hopefully everything is HA in your company.
        
             | fullstop wrote:
             | It absolutely is, BGP with geographically diverse paths,
             | databases, app servers, etc, are all redundant. It's hosted
             | in-house, so there is a cold standby database in AWS which
             | would only be used if, say, an aircraft crashed into our
             | server rooms.
        
               | ikiris wrote:
               | It always amazes me when someone says they can't ever be
               | down, and then says they only serve from one physical
               | location actively.
               | 
               | One of these things is not like the other.
        
               | fullstop wrote:
               | We have everything in place to run from AWS if needed but
               | do not operate from there because of cost.
               | 
               | This goes both ways -- there have been many AWS outages
               | which have not affected us. I hear what you're saying,
               | but we've had only one instance of extended (hours)
               | downtime in the last 20 years.
        
               | xcambar wrote:
               | The more experienced I become, the more such down to
               | earth solutions seem OMG SO MUCH MORE reasonable than all
               | the bells and whistles of "modern" engineering practices.
        
               | jnpatel wrote:
               | Could you share what domain your company is in? Just
               | curious what has such HA requirements
        
         | sharadov wrote:
          | pg_upgrade with the --link (hard links) option will upgrade
          | the db in minutes; I've done a 2TB db in less than a minute.
          | Yes, there is no rollback, but if you do enough testing prior
          | it can be really smooth. I'm in the middle of doing this
          | across an environment where we have tens of hosts, from a few
          | GB to a couple of TB.
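          | 
          | A rough sketch of the invocation (paths and versions are
          | illustrative):
          | 
          |     pg_upgrade \
          |       --old-bindir=/usr/lib/postgresql/11/bin \
          |       --new-bindir=/usr/lib/postgresql/13/bin \
          |       --old-datadir=/var/lib/postgresql/11/main \
          |       --new-datadir=/var/lib/postgresql/13/main \
          |       --link
          | 
          | With --link the new cluster hard-links the old data files
          | instead of copying them, which is why it's so fast -- and why
          | there's no rollback once the new cluster has been started.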
        
           | zozbot234 wrote:
           | You could take a filesystem-level snapshot for rollback.
        
       | data_ders wrote:
        | Way cool!
        | 
        | FYI, the hyperlink "found here" to setup_new_database.template
        | is broken.
        
       | andrewmcwatters wrote:
       | What is the equivalent for MySQL?
        
       | tbrock wrote:
        | It's insane that it has to be this complex and require
        | third-party software to accomplish...
        | 
        | Most modern rdbms/nosql database vendors allow a rolling upgrade
        | where you roll in new servers and roll out the old ones
        | seamlessly.
        | 
        | Also, the fact that AWS RDS doesn't do this with zero downtime
        | by default, by automating it this way, is also crazy. Why pay
        | for hosted when the upgrade story is incomplete? Take downtime
        | to upgrade a DB in 2021? Everyone must be joking.
        
         | Pxtl wrote:
         | Well, Postgres is F/OSS, so I expect the solution to problems
         | to be "lots of small tools"... but I see the same kind of
         | herculean battle-plans for MS SQL Server work, and I get
         | shocked. That is a paid product, what on Earth are we paying
         | for?
        
           | marcosdumay wrote:
           | > what on Earth are we paying for?
           | 
            | The answer is the same for any relational DBMS: lock-in.
        
           | worewood wrote:
           | You're paying for an "if something goes wrong and it's the
           | software's fault you can sue us" license
        
             | senko wrote:
              | You're not. Approximately 25% of the SQL Server EULA text
              | deals with ways in which the warranty is limited. The best
              | the license gives you is your money back, if you prove
              | that it was the software's fault.
             | 
             | Of course, you can always sue. A $2T company. Good luck
             | with that.
        
               | doctor_eval wrote:
               | Went to post the same thing but HN had a little downtime
               | burp.
               | 
               | Would only add that the main reason people buy these
               | licenses, apart from familiarity, is so they can shift
                | the blame if something goes wrong. It's all about ass
                | covering.
        
               | salmo wrote:
                | Well, this and:
                | 
                | 1) the illusion of "support"
                | 
                | 2) OSS-fearing companies' love of indemnification
        
         | 0xbadcafebee wrote:
         | > Most modern rdbms/nosql database vendors
         | 
         | "Modern" = "not a traditional RDBMS". Expecting your 1983
         | Toyota Corolla to _fly_ , regardless of the fact that it's now
         | 2021, is unrealistic. But personal helicopters have been around
         | for a while.
        
         | defaultname wrote:
         | This is a complex task, fraught on _any_ platform. Most systems
         | base such clustering on a shared filesystem, which is obviously
          | orthogonal to the shared-nothing designs most try to pursue.
        
       | rhacker wrote:
       | "the zero downtime" + this site is definitely down is making me
       | laugh.
        
         | zild3d wrote:
         | next post "zero downtime static site"
        
       | chtitux wrote:
        | If you can afford a one-off second of latency for your SQL
        | queries, then using logical replication with pgbouncer seems way
        | easier:
       | 
       | - setup logical replication between the old and the new server
       | (limitations exist on what is replicated, read the docs)
       | 
       | - PAUSE the pgbouncer (virtual) database. Your app will hang, but
       | not disconnect from pgbouncer
       | 
       | - Copy the sequences from the old to new server. Sequences are
       | not replicated with logical replication
       | 
       | - RESUME the pgbouncer virtual database.
       | 
        | You're done. If everything is automated, your apps will see a
        | temporary increase in SQL latency, but they will keep their TCP
        | connections, so there is virtually no outage.
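        | 
        | A minimal sketch of the above, assuming built-in logical
        | replication (Postgres 10+) and a pgbouncer database entry named
        | "app" (all names illustrative):
        | 
        |     -- on the old server (needs wal_level = logical):
        |     CREATE PUBLICATION migration FOR ALL TABLES;
        | 
        |     -- on the new server, after restoring the schema:
        |     CREATE SUBSCRIPTION migration
        |       CONNECTION 'host=old-db dbname=app user=replicator'
        |       PUBLICATION migration;
        | 
        |     -- at cutover, from pgbouncer's admin console:
        |     PAUSE app;
        |     -- copy sequence positions by hand, per sequence, e.g.
        |     --   SELECT setval('users_id_seq', <value from old server>);
        |     -- repoint pgbouncer at the new server, then:
        |     RELOAD;
        |     RESUME app;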
        
         | nhumrich wrote:
          | I use pgbouncer and had no idea it supported logical
          | replication. I can't find anything about it in the docs. Do
          | you have something you can link me to for further reading?
        
           | aidos wrote:
           | I'm assuming they mean to use Postgres logical replication.
        
             | nhumrich wrote:
              | Which is only possible if you are using a version of
              | Postgres that is new enough and isn't restricted, as some
              | versions of RDS are. Which explains the whole original
              | post.
        
         | booleanbetrayal wrote:
          | This is precisely the migration I'm planning on doing in the
          | next few weeks, with pglogical under the hood for replication.
          | An atomic switch seems much easier than any sort of problem
          | that could stem from conflict or data-duplication errors in a
          | bi-directional replication strategy.
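          | 
          | For anyone curious, the pglogical setup looks roughly like
          | this (node names and DSNs made up):
          | 
          |     -- on the provider (old server):
          |     CREATE EXTENSION pglogical;
          |     SELECT pglogical.create_node(
          |       node_name := 'provider',
          |       dsn := 'host=old-db dbname=app');
          |     SELECT pglogical.replication_set_add_all_tables(
          |       'default', ARRAY['public']);
          | 
          |     -- on the subscriber (new server):
          |     CREATE EXTENSION pglogical;
          |     SELECT pglogical.create_node(
          |       node_name := 'subscriber',
          |       dsn := 'host=new-db dbname=app');
          |     SELECT pglogical.create_subscription(
          |       subscription_name := 'migration',
          |       provider_dsn := 'host=old-db dbname=app');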
        
         | foobarbazetc wrote:
         | That's how you do it.
        
         | stingraycharles wrote:
         | Exactly this. The OP's approach reminded me so much of the days
         | of Slony, and I wondered why a simpler approach with logical
         | replication would not just suffice.
         | 
         | Rather than pgbouncer, I did this in the actual application
         | code once (write to both databases at the same time, once
         | everything is in sync and you're confident the new server works
         | well, fail over to the new one only), but it depends upon how
         | much control you can exercise over the application code.
         | 
         | Any approach that is based on triggers makes me shiver,
         | however.
        
         | asenchi wrote:
          | Yep, you can also prepare the new database by using a snapshot
          | of the primary's volume, and use pg_rewind to get them in
          | sync. Altogether the tooling can make migrations super easy
          | with minimal downtime.
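          | 
          | A rough sketch of the pg_rewind step (paths and connection
          | info made up; the target needs wal_log_hints = on or data
          | checksums enabled):
          | 
          |     pg_rewind \
          |       --target-pgdata=/var/lib/postgresql/13/main \
          |       --source-server='host=primary user=postgres'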
        
         | fullstop wrote:
         | This works wonderfully. If you have any long running queries,
         | though, the PAUSE won't pause until they have finished.
         | 
         | I love pgbouncer, it is such a great tool.
        
         | twunde wrote:
          | To be fair to this page, it was used to migrate versions of
          | Postgres from __prior__ to the introduction of logical
          | replication. Logical replication makes this significantly
          | easier (i.e., you no longer need the triggers).
        
       | jbverschoor wrote:
       | > Blueground is a real estate tech company offering flexible and
       | move-in ready furnished apartments across three continents and 12
       | of the world's top cities. We search high and low for the best
       | properties in the best cities, then our in-house design team
       | transforms these spaces into turnkey spaces for 30 days or
       | longer.
       | 
        | Seriously, how big can that db be, and how bad would an hour of
        | reduced availability / downtime be?
        | 
        | Seems like a lot of wasted engineering effort. "You are not
        | Google."
        
         | cryptoz wrote:
          | You cut out the part of their About page where they hint at
          | why it matters here: "But we go a step further, merging
          | incredible on-the-ground support with app-based services for a
          | seamless experience for all our guests."
         | 
          | It sounds like, across multiple time zones, with a tech stack
          | that backs the business in a specific way, downtime could be a
          | problem: it would undercut their "seamless" offering and their
          | reliability for moving.
         | 
          | If Twitter goes down and you can't tweet, really only Twitter
          | loses money, and you can try again later. But if you're moving
          | into an apartment, you don't want to be standing outside
          | trying to get the keys while the system is down. Edit: And you
          | also don't want the service to tell you that you can't move in
          | from 11am-1pm because of 'maintenance'.
        
         | svaha1728 wrote:
         | Agree that it's a lot of wasted engineering effort for
         | companies that can have planned downtimes.
         | 
          | My guess is the conversation was 'To be agile, we need to be
          | 100% CI/CD'. Next thing you know, everything needs to get
          | pushed straight to prod continuously, with no downtime.
        
         | majormajor wrote:
         | You can often get away with some downtime, but that's not the
         | same as not spending any engineering effort.
         | 
         | What kills you is when you take an hour scheduled downtime,
         | communicate this out, get everyone on board... and then are
          | down for a day. If you don't have a good, well-rehearsed plan,
         | you might be unexpectedly screwed even then. As a rule of
         | thumb... something usually doesn't quite go how you expect it
         | to!
        
         | Hallucinaut wrote:
          | I was trying to get on Zoopla (and, I think, Rightmove) two
          | nights ago, for over an hour after midnight, and the site kept
          | telling me it was down for maintenance and to try again once
          | the non-night-owls were up in the morning. Pretty sure their
          | market cap is ten, if not eleven, figures.
        
       ___________________________________________________________________
       (page generated 2021-06-11 23:00 UTC)