[HN Gopher] Improving how we deploy GitHub
       ___________________________________________________________________
        
       Improving how we deploy GitHub
        
       Author : todsacerdoti
       Score  : 95 points
       Date   : 2021-01-25 18:06 UTC (4 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | zoobab wrote:
       | Didn't GitHub's source code leak recently?
        
       | aszen wrote:
       | Kind of sad to see GitHub doesn't use GitHub itself to deploy and
       | monitor their releases.
        
         | WJW wrote:
         | That seems like an extremely good idea actually, since if you
         | dogfood your own releasing service then you can't fix it
         | anymore if you accidentally bring down the service.
        
           | notwhereyouare wrote:
           | I did a short stint at Wayfair and about 1-2 months in, there
           | was a deploy that somehow got past the test flow and, when
           | deployed, took down their entire site. So badly that they
           | couldn't even deploy the fix.
        
           | xxpor wrote:
           | That's usually solved with a parallel stack deployment, use
           | the other stack if something is broken
        
             | paxys wrote:
             | If the "other stack" isn't regularly used then you can
             | assume it will be broken when needed
        
               | cpascal wrote:
               | You just run the previous version of the production stack
               | in your "dogfood/operations" stack. Once you've fully
               | rolled out production and have vetted it, you can upgrade
               | the other one to match production.
        
           | Xorlev wrote:
           | That also means when it does go wrong, it takes much longer
           | to fix. Good operational practice is to decrease MTTR, not
           | make it worse.
        
         | illnewsthat wrote:
         | I was surprised to read that they are using Slack since it is
         | such a competitor to Microsoft's Teams (parent company).
        
           | kuschkufan wrote:
           | Are you expecting them to use Windows everywhere as well?
        
             | dubcanada wrote:
             | No but why would you use a product that is $7 or whatever
             | times the number of employees (so let's say 200, so $1400 a
             | month) when you can use a free one?
        
               | maccard wrote:
               | Speaking from experience, just because you work for a
               | company doesn't mean you can use all of their products
               | (or that you'll even get favorable pricing on them).
        
               | lostapathy wrote:
               | I'd love to hear this story! Seems crazy ... but we live
               | in a crazy world.
        
               | scott_w wrote:
               | Unrelated to software but the company my dad works for
               | (motor repair) has to buy all its parts from its own
               | distribution arm, at the marked up price. He then has to
               | turn a profit on those parts as well as pricing the
               | labour.
               | 
               | If cost price is £5 and the markup is 20%, he has to pay
               | £6 to get the part, then charge £7.20 on the invoice to
               | the customer. I'll let you guess what that does to tender
               | bids ;-)
        
               | names_are_hard wrote:
               | At Microsoft if you build a product using Azure (and if
               | you want to use the cloud you MUST use Azure, you're not
               | going to get approval to write a check to AWS) the costs
               | come out of your budget. And it's taken seriously, to the
               | point where teams will very much emphasize managing costs
               | (what will this new feature cost on our Azure bill? Can
               | we build it more efficiently? Oh wow, that refactor saved
               | us 100k/month in cloud costs, don't forget that when we
               | start talking about promotions...)
        
               | lostapathy wrote:
               | That makes sense since the amount you could use is
               | variable. I was thinking more like somebody couldn't get
               | a free Word license at a MS subsidiary or something.
        
               | vulcan01 wrote:
               | When I worked at MS Azure, we had to pay for Azure
               | servers! (I believe our team had a $5k/month Azure bill.)
               | It's part of internal budgeting, so that people within MS
               | don't splurge on expensive things (because it does cost
               | MS money for each person on Teams).
        
               | names_are_hard wrote:
               | Did you drop a k? What can you do with 50 dollars?
        
               | vulcan01 wrote:
               | Yes, thank you, it should be $5k. Edited.
        
               | josephg wrote:
               | My uncle used to work at Compaq (back before they got
               | bought by HP). When their computers broke, his team had
               | to pay their support staff to get them fixed. (Via
               | internal budgeting). But the support team knew internal
               | customers would call them anyway and it was still
               | Compaq's money, so they charged several times more for
               | internal support calls than normal support calls.
               | 
               | My uncle's team was having none of that, so they paid an
               | external computer repair service to fix their computers.
               | The external repair service subcontracted to Compaq's
               | internal people anyway, so when their computers broke
               | they called up (and paid) external consultants. Who in
               | turn called Compaq's internal support team, who came
               | downstairs and fixed their computers at a competitive
               | price.
        
               | theshrike79 wrote:
               | On the other hand sometimes it means you MUST use the
               | company products.
               | 
               | Consulted for a sub-sub-sub-subsidiary of Toshiba. All
               | computer equipment _had_ to be from Toshiba - the closest
               | place to get Toshiba laptops was two COUNTRIES over.
               | 
               | They even had to tape over non-Toshiba branding from
               | external displays that would be visible.
        
               | paxys wrote:
               | $1400 a month is less than a rounding error for a company
               | that size. If you can get even the tiniest bit of extra
               | developer productivity from the software then it is worth
               | it.
               | 
               | And Github will definitely still have to "pay" for Teams,
               | whether that is internal accounting or actual money being
               | exchanged.
        
             | names_are_hard wrote:
             | My understanding of Microsoft policy is that it's easier to
             | buy MacBooks for your developers than it is to buy Slack.
             | Which makes sense, because they're currently going head to
             | head with Slack for market share, while a few MacBooks
             | don't threaten their credibility when selling Windows.
             | 
             | My guess is that GitHub was using Slack before they were
             | bought and inertia is a thing. I'm sure there are people
             | within the parent company that would like to see them
             | transition, but I'm sure there's a ton of resistance,
             | especially "on the ground" at GitHub. Buyouts are a
             | delicate thing, they don't want to ruin GitHub by trying to
             | force it to change too quickly.
        
           | dubcanada wrote:
           | Probably because Teams is the worst.
           | 
           | More than likely it's because that's what they used before
           | they got bought and haven't been forced to migrate over yet,
           | they also seem to have bots, which are not really a direct
           | copy and paste into MS Teams, and likely them converting over
           | isn't a high priority.
        
             | jen20 wrote:
             | IIRC GitHub used to use Campfire and it took a long time to
             | switch to Slack - a switch to Teams would no doubt take a
             | long time too!
        
           | paxys wrote:
           | Easy to switch a chat application, hard to switch your entire
           | chatops ecosystem. This blog post shows the perfect example
           | of that.
        
         | jules2689 wrote:
         | There is some GitHub used, but as others stated we don't want
         | to create a circular dependency on ourselves in case we deploy
         | something that is broken.
        
       | hoprocker wrote:
       | This is generally a good flow, but something that absolutely
       | baffles me is that GitHub changes the commit SHAs when branches
       | are rebase-merged from PRs[0]. This totally breaks a fundamental
       | notion in Git that the same work, based on the same commits, has
       | the same hash. It also makes it incredibly difficult to determine
       | which PR branches have been merged into master.
       | 
       | [0] https://docs.github.com/en/github/collaborating-with-
       | issues-...
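
A minimal sketch of why a rebase-merge can produce a new SHA even when the diff, parent, and author are unchanged. It assumes the standard Git commit object layout (a SHA-1 over a `commit <size>` header plus the commit body) and that the merge rewrites the committer field; the names, emails, timestamps, and parent hash below are made up for illustration.

```python
import hashlib


def commit_sha(tree, parent, author, committer, message):
    """Hash a commit the way Git does: SHA-1 over a header plus the commit body."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {author}\n"
        f"committer {committer}\n"
        f"\n{message}\n"
    ).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical values: only the committer line differs between the two commits.
TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
PARENT = "0" * 40
AUTHOR = "Alice <alice@example.com> 1611597960 +0000"

original = commit_sha(TREE, PARENT, AUTHOR, AUTHOR, "Fix login")
rebased = commit_sha(
    TREE, PARENT, AUTHOR,
    "Merge Bot <bot@example.com> 1611601560 +0000",  # committer rewritten on merge
    "Fix login",
)

print(original)
print(rebased)
print(original == rebased)
```

Because the committer line is part of the hashed bytes, any tool that rewrites it yields a different commit ID, so matching PR branches to master by SHA alone stops working.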
        
       | KinesisMagic wrote:
       | Can anyone explain why they might go with a slack based
       | deployment system as opposed to something more robust like
       | CircleCI or Jenkins? Is it mainly about the simplicity of it?
        
         | jules2689 wrote:
         | It's mainly the simplicity of the deployment system as it's
         | inline and visible, coupled with habit. In all actuality, that
         | is just what _can_ trigger the deploy; the actual deploy is
         | based on an internal deploy application, and deploys can be
         | triggered from there as well.
        
         | mrdonbrown wrote:
         | My team recently put in automation so that we use CircleCI for
         | the staging deployment, have it wait for manual approval, then
         | deploy to production. However, we can also give the Slack
         | staging deployment message a +1 reaction, which will
         | automatically approve the production deployment for CircleCI.
         | This way, we get an easy dev UX but all the CI features of
         | CircleCI.
        
         | pronoiac wrote:
         | There's easy transparency amongst multiple teams, without
         | having accounts for the other teams on CircleCI or Jenkins.
         | This is while the deploy is in flight, and it can provide
         | timestamped logs if there's an incident, and it could be useful
         | for tracking history. It's also clear who kicked off the
         | deploy.
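
A rough sketch of the transparency point above: a chat command is parsed, recorded with who sent it and when, and only then handed to whatever actually deploys. The `.deploy <repo>/<branch> to <env>` syntax, the `DeployRequest` type, and the `trigger` callback are all hypothetical, not GitHub's or Hubot's real interface.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical command syntax: ".deploy <repo>/<branch> to <environment>"
DEPLOY_RE = re.compile(
    r"^\.deploy\s+(?P<repo>[\w.-]+)/(?P<branch>[\w./-]+)\s+to\s+(?P<env>[\w-]+)$"
)


@dataclass(frozen=True)
class DeployRequest:
    user: str
    repo: str
    branch: str
    env: str
    requested_at: datetime


def parse_deploy(user, text, now=None):
    """Return a DeployRequest for a deploy command, or None for ordinary chat."""
    match = DEPLOY_RE.match(text.strip())
    if match is None:
        return None
    return DeployRequest(
        user=user,
        repo=match["repo"],
        branch=match["branch"],
        env=match["env"],
        requested_at=now or datetime.now(timezone.utc),
    )


def handle_message(user, text, trigger, audit_log, now=None):
    """Record the request first, then call the (hypothetical) deploy trigger."""
    request = parse_deploy(user, text, now)
    if request is None:
        return False
    audit_log.append(request)  # the log doubles as a timestamped deploy history
    trigger(request)
    return True
```

In a real setup `trigger` would call the CI system or an internal deploy application; here it is just a callable so the sketch stays self-contained.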
        
         | zug_zug wrote:
         | As a devops person myself, I am super skeptical that there is
         | any good reason to do a chatops deploy. My guess is "new toys
         | are cool" / "Want this on my resume"
         | 
         | To be clear, it's hopefully just some connector that does slack
         | message -> triggers jenkins job.
         | 
         | But from a security, compliance, reliability, debuggability,
         | auditability perspective I think it's inferior. Not to mention
         | an inferior interface.
        
           | swagonomixxx wrote:
           | chatops deploys aren't really new toys, a place I was at was
           | doing them around 2013/14.
           | 
           | We liked it because the chat history you see is essentially a
           | deploy history, no need to log in to some other website to
           | check some obscure logs page to see who did what. We did end
           | up having to debug the service that processed the chat
           | messages maybe once, but never ran into an issue when we had
           | to deploy a hotfix.
        
       | alexchamberlain wrote:
       | That's pretty awesome to go from nothing to full production in 15
       | minutes. I would like to encourage others to bear in mind that
       | simply adding more time wouldn't significantly decrease the risk
       | of things going wrong.
        
       | cytzol wrote:
       | Something I found surprising is that a change to the GitHub
       | codebase will be run in canary, get deployed to production, and
       | _then_ merged. I would have expected the PR to be merged first
       | before it gets served to the public, so even if you have to `git
       | revert` and undeploy it, you still have a record of every version
       | that was seen by actual users, even momentarily.
       | 
       | Does anyone know the pros and cons of GitHub's approach?
        
         | halukakin wrote:
         | I think this method is getting more popular by the day. IMHO,
         | previously you merged into master before the deploy process.
         | But today this is reversed.
         | 
         | The main benefit is, other developers can rely on the master
         | branch even more. They will know there will not be a revert on
         | the master branch they just pulled one hour ago and already
         | started coding on.
        
           | Kwpolska wrote:
           | A `git revert` creates a new commit. To a developer, a revert
           | commit appearing on master has the same effect as a pull
           | request (or ten) being merged into it. If the revert affects
           | code you're working on, you will need to resolve conflicts,
           | just like you would need to if a merged PR affected the same
           | code.
        
         | bswinnerton wrote:
         | This is known as "GitHub Flow"
         | (https://guides.github.com/introduction/flow/). I was pretty
         | surprised by it when I first joined GitHub but I've grown to
         | love it. It makes rolling back changes much faster than having
         | to open up a revert branch, get it approved, and deploy it.
         | When something goes sideways, just deploy master / main, which
         | is meant to always be in a safe state.
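
A toy model of the deploy-before-merge ordering described above, assuming a single production target. The `Service` class and its method names are invented for illustration; the point is only that rollback needs no revert commit when master has never seen the bad change.

```python
class Service:
    """Tracks what master points at and what production is serving."""

    def __init__(self, main_sha):
        self.main = main_sha      # last commit known to be safe in production
        self.deployed = main_sha  # commit currently serving traffic

    def deploy_branch(self, branch_sha):
        # The PR branch goes to production before it is merged.
        self.deployed = branch_sha

    def merge(self):
        # Merge only after the branch has been vetted in production.
        self.main = self.deployed

    def rollback(self):
        # No revert branch or approval: redeploy master, which never changed.
        self.deployed = self.main


service = Service("aaa111")
service.deploy_branch("bbb222")  # something goes sideways
service.rollback()
print(service.deployed)
```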
        
       | sandGorgon wrote:
       | > _GitHub.com is deployed primarily through chatops_
       | 
       | What is the best chatops tool right now? I don't see a lot of
       | popularity around chatops. It's most usually some version of
       | GitHub-based triggers.
       | 
       | It's funny that GitHub themselves use chatops. I think that's a
       | very nice take - especially for early stage startups. Anyone else
       | use anything like it?
        
         | paxys wrote:
         | I'm guessing they are using Hubot (https://hubot.github.com/)
        
           | swagonomixxx wrote:
           | A place I was at used Hubot as well. It gets the job done, we
           | never really ran into a fuss. Easily extensible as well.
        
           | jules2689 wrote:
           | This is correct :)
        
         | icey wrote:
         | We're just starting beta, but my friend Phil and I both worked
         | together at GitHub and are building what we hope to be a better
         | Hubot at https://ab.bot right now.
         | 
         | It's missing some of the chatops stuff that is mentioned in the
         | blog post but since we support a lot more languages than Hubot
         | we're hoping it's a matter of time before someone in our
         | community builds a better replacement deployment script (or
         | we'll do it while building out sample scripts :))
         | 
         | (Also, hi GitHub friends!)
        
       | Xorlev wrote:
       | I was surprised to see their canary stages are just 5 minutes.
       | Many problems take longer to manifest. That seems like a fairly
       | risky release process.
        
         | jules2689 wrote:
         | It's actually longer than 5 minutes. There is the duration of
         | the 2% canary deploy where we start to see pick up of traffic,
         | a 5 minute wait, then a 20% "deploy", and a 5 minute wait. All
         | in all this comes out to around 10-15ish minutes in canary.
         | This is a stage where we can almost instantly shut off traffic
         | to the canary deploy.
         | 
         | Could we reduce risk by lengthening the process? Maybe, but you
         | also make deploys longer which means less stuff can get through
         | in a day. This makes devs respond with larger PRs, for example,
         | which increases the risk profile.
         | 
         | So we need to balance time and duration. Typically large
         | problems will manifest quickly, or take a lot longer to detect
         | (and thus are generally more minor problems) when you have our
         | scale of a user base in my experience.
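
A minimal sketch of the staged canary described above. The 2% and 20% steps and the 5 minute waits come from the comment; the `set_traffic`, `is_healthy`, and `sleep` callables are hypothetical stand-ins for the load balancer, the monitoring checks, and the clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    traffic_percent: int
    wait_seconds: int


# Percentages and waits as described in the thread; the rest is assumed.
STAGES = [Stage(2, 300), Stage(20, 300)]


def run_canary(stages, set_traffic, is_healthy, sleep):
    """Step traffic up stage by stage; cut the canary to 0% on the first bad check."""
    for stage in stages:
        set_traffic(stage.traffic_percent)
        sleep(stage.wait_seconds)
        if not is_healthy():
            set_traffic(0)  # the near-instant shutoff mentioned above
            return False
    return True
```

Passing `time.sleep` as `sleep` gives real waits. The two waits alone add up to 10 minutes, which fits the quoted 10-15 minutes once deploy time is included.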
        
         | wdb wrote:
         | Yeah, wouldn't you need some sort of minimum amount of traffic
         | to be able to use canary deployment?
        
         | paxys wrote:
         | The problems that don't immediately manifest could very well
         | take hours or days or longer. There has to be a limit, and 5
         | minutes is as good as any.
        
           | closeparen wrote:
           | A lot of alerts use moving averages or sustain times to
           | squelch transient noise. You have to wait for the max sustain
           | time to pass before you can conclude that lack of alert =
           | lack of problem.
           | 
           | That time could very well be 5 minutes but the two need to be
           | coordinated.
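
One way to sketch the coordination described above: derive the shortest useful canary wait from the alerts themselves. The alert names and sustain times are invented, and `evaluation_interval` is an assumed extra allowance for how often alerts are evaluated.

```python
def min_canary_wait(sustain_seconds, evaluation_interval=0):
    """Shortest wait after which 'no alert fired' says something about every alert."""
    if not sustain_seconds:
        return evaluation_interval
    return max(sustain_seconds) + evaluation_interval


def wait_is_sufficient(wait_seconds, sustain_seconds, evaluation_interval=0):
    """True if the canary wait covers the longest alert sustain time."""
    return wait_seconds >= min_canary_wait(sustain_seconds, evaluation_interval)


# Hypothetical alerts and their sustain times in seconds.
ALERTS = {"error_rate": 120, "latency_p99": 300, "queue_depth": 600}

print(min_canary_wait(ALERTS.values()))
print(wait_is_sufficient(300, ALERTS.values()))
```

With these made-up numbers a 5 minute wait covers the first two alerts but not the third, so a quiet dashboard at the 5 minute mark would not rule out a queue problem.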
        
       | bomdo wrote:
       | I'd love to learn more about their canary rollouts. Is there any
       | more info from either them or similar large sites about this?
       | 
       | For example, what usually has to happen for a dev to trigger a
       | rollback? Or how do they handle stateful changes such as database
       | schema changes?
        
         | t3rabytes wrote:
         | Re db migrations: they've built their own DB management tooling
         | (https://github.com/openark/orchestrator) and online migration
         | tooling (https://github.com/github/gh-ost)
        
         | jules2689 wrote:
         | We monitor Datadog dashboards, exceptions, and other metrics
         | mainly, as well as smoke testing the application
        
       ___________________________________________________________________
       (page generated 2021-01-25 23:01 UTC)