[HN Gopher] Improving how we deploy GitHub
___________________________________________________________________
Author : todsacerdoti
Score  : 95 points
Date   : 2021-01-25 18:06 UTC (4 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| zoobab wrote:
| Didn't GitHub's source code leak recently?
|
| aszen wrote:
| Kind of sad to see GitHub doesn't use GitHub itself to deploy and
| monitor their releases.
|
| WJW wrote:
| That seems like an extremely good idea actually, since if you
| dogfood your own releasing service then you can't fix it
| anymore if you accidentally bring down the service.
|
| notwhereyouare wrote:
| I did a short stint at Wayfair and about 1-2 months in, there
| was a deploy that somehow got past the test flow and when
| deployed took down their entire site. So badly that they
| couldn't even deploy the fix
|
| xxpor wrote:
| That's usually solved with a parallel stack deployment, use
| the other stack if something is broken
|
| paxys wrote:
| If the "other stack" isn't regularly used then you can
| assume it will be broken when needed
|
| cpascal wrote:
| You just run the previous version of the production stack
| in your "dogfood/operations" stack. Once you've fully
| rolled out production and have vetted it, you can upgrade
| the other one to match production.
|
| Xorlev wrote:
| That also means when it does go wrong, it takes much longer
| to fix. Good operational practice is to decrease MTTR, not
| make it worse.
|
| illnewsthat wrote:
| I was surprised to read that they are using Slack since it is
| such a competitor to Microsoft's Teams (parent company).
|
| kuschkufan wrote:
| Are you expecting them to use Windows everywhere as well?
|
| dubcanada wrote:
| No but why would you use a product that is $7 or whatever
| times the number of employees (so let's say 200, so $1400 a
| month) when you can use a free one.
|
| maccard wrote:
| Speaking from experience, just because you work for a
| company doesn't mean you can use all of their products
| (or that you'll even get favorable pricing on them).
|
| lostapathy wrote:
| I'd love to hear this story! Seems crazy ... but we live
| in a crazy world.
|
| scott_w wrote:
| Unrelated to software but the company my dad works for
| (motor repair) has to buy all its parts from its own
| distribution arm, at the marked up price. He then has to
| turn a profit on those parts as well as pricing the
| labour.
|
| If cost price is £5 and the markup is 20%, he has to pay
| £6 to get the part, then charge £7.20 on the invoice to
| the customer. I'll let you guess what that does to tender
| bids ;-)
|
| names_are_hard wrote:
| At Microsoft if you build a product using Azure (and if
| you want to use the cloud you MUST use Azure, you're not
| going to get approval to write a check to AWS) the costs
| come out of your budget. And it's taken seriously, to the
| point where teams will very much emphasize managing costs
| (what will this new feature cost on our Azure bill? Can
| we build it more efficiently? Oh wow, that refactor saved
| us 100k/month in cloud costs, don't forget that when we
| start talking about promotions...)
|
| lostapathy wrote:
| That makes sense since the amount you could use is
| variable. I was thinking more like somebody couldn't get
| a free Word license at a MS subsidiary or something.
|
| vulcan01 wrote:
| When I worked at MS Azure, we had to pay for Azure
| servers! (I believe our team had a $5k/month Azure bill.)
| It's part of internal budgeting, so that people within MS
| don't splurge on expensive things (because it does cost
| MS money for each person on Teams).
|
| names_are_hard wrote:
| Did you drop a k? What can you do with 50 dollars?
|
| vulcan01 wrote:
| Yes, thank you, it should be $5k. Edited.
|
| josephg wrote:
| My uncle used to work at Compaq (back before they got
| bought by HP).
| When their computers broke, his team had
| to pay their support staff to get them fixed. (Via
| internal budgeting). But the support team knew internal
| customers would call them anyway and it was still
| Compaq's money, so they charged several times more for
| internal support calls than normal support calls.
|
| My uncle's team was having none of that, so they paid an
| external computer repair service to fix their computers.
| The external repair service subcontracted to Compaq's
| internal people anyway, so when their computers broke
| they called up (and paid) external consultants. Who in
| turn called Compaq's internal support team, who came
| downstairs and fixed their computers at a competitive
| price.
|
| theshrike79 wrote:
| On the other hand sometimes it means you MUST use the
| company products.
|
| Consulted for a sub-sub-sub-subsidiary of Toshiba. All
| computer equipment _had_ to be from Toshiba - the closest
| place to get Toshiba laptops was two COUNTRIES over.
|
| They even had to tape over non-Toshiba branding from
| external displays that would be visible.
|
| paxys wrote:
| $1400 a month is less than a rounding error for a company
| that size. If you can get even the tiniest bit of extra
| developer productivity from the software then it is worth
| it.
|
| And GitHub will definitely still have to "pay" for Teams,
| whether that is internal accounting or actual money being
| exchanged.
|
| names_are_hard wrote:
| My understanding of Microsoft policy is that it's easier to
| buy MacBooks for your developers than it is to buy Slack.
| Which makes sense, because they're currently going head to
| head with Slack for market share, while a few MacBooks
| don't threaten their credibility when selling Windows.
|
| My guess is that GitHub was using Slack before they were
| bought and inertia is a thing.
| I'm sure there are people
| within the parent company that would like to see them
| transition, but I'm sure there's a ton of resistance,
| especially "on the ground" at GitHub. Buyouts are a
| delicate thing, they don't want to ruin GitHub by trying to
| force it to change too quickly.
|
| dubcanada wrote:
| Probably because Teams is the worst.
|
| More than likely it's because that's what they used before
| they got bought and haven't been forced to migrate over yet,
| they also seem to have bots, which are not really a direct
| copy and paste into MS Teams, and likely them converting over
| isn't a high priority.
|
| jen20 wrote:
| IIRC GitHub used to use Campfire and it took a long time to
| switch to Slack - a switch to Teams would no doubt take a
| long time too!
|
| paxys wrote:
| Easy to switch a chat application, hard to switch your entire
| chatops ecosystem. This blog post shows the perfect example
| of that.
|
| jules2689 wrote:
| There is some GitHub used, but as others stated we don't want
| to create a circular dependency on ourselves in case we deploy
| something that is broken.
|
| hoprocker wrote:
| This is generally a good flow, but something that absolutely
| baffles me is that GitHub changes the commit SHAs when branches
| are rebase-merged from PRs[0]. This totally breaks a fundamental
| notion in Git that the same work, based on the same commits, has
| the same hash. It also makes it incredibly difficult to determine
| which PR branches have been merged into master.
|
| [0] https://docs.github.com/en/github/collaborating-with-
| issues-...
|
| KinesisMagic wrote:
| Can anyone explain why they might go with a Slack-based
| deployment system as opposed to something more robust like
| CircleCI or Jenkins? Is it mainly about the simplicity of it?
|
| jules2689 wrote:
| It's mainly the simplicity of the deployment system as it's
| inline and visible, coupled with habit.
| In all actuality, that
| is just what _can_ trigger the deploy; the actual deploy is
| based on an internal deploy application and deploys can be
| triggered from there as well.
|
| mrdonbrown wrote:
| My team recently put in automation so that we use CircleCI for
| the staging deployment, have it wait for manual approval, then
| deploy to production. However, we can also give the Slack
| staging deployment message a +1 reaction, which will
| automatically approve the production deployment for CircleCI.
| This way, we get an easy dev UX but all the CI features of
| CircleCI.
|
| pronoiac wrote:
| There's easy transparency amongst multiple teams, without
| having accounts for the other teams on CircleCI or Jenkins.
| This is while the deploy is in flight, and it can provide
| timestamped logs if there's an incident, and it could be useful
| for tracking history. It's also clear who kicked off the
| deploy.
|
| zug_zug wrote:
| As a devops person myself, I am super skeptical that there is
| any good reason to do a chatops deploy. My guess is "new toys
| are cool" / "Want this on my resume"
|
| To be clear, it's hopefully just some connector that does Slack
| message -> triggers Jenkins job.
|
| But from a security, compliance, reliability, debuggability,
| auditability perspective I think it's inferior. Not to mention
| an inferior interface.
|
| swagonomixxx wrote:
| chatops deploys aren't really new toys, a place I was at was
| doing them around 2013/14.
|
| We liked it because the chat history you see is essentially a
| deploy history, no need to log in to some other website to
| check some obscure logs page to see who did what. We did end
| up having to debug the service that processed the chat
| messages maybe once, but never ran into an issue when we had
| to deploy a hotfix.
|
| alexchamberlain wrote:
| That's pretty awesome to go from nothing to full production in 15
| minutes.
| I would like to encourage others to bear in mind that
| simply adding more time wouldn't significantly decrease the risk
| of things going wrong.
|
| cytzol wrote:
| Something I found surprising is that a change to the GitHub
| codebase will be run in canary, get deployed to production, and
| _then_ merged. I would have expected the PR to be merged first
| before it gets served to the public, so even if you have to
| `git revert` and undeploy it, you still have a record of every
| version that was seen by actual users, even momentarily.
|
| Does anyone know the pros and cons of GitHub's approach?
|
| halukakin wrote:
| I think this method seems to get more popular by the day. IMHO,
| previously master was the branch you merge before the deploy
| process. But today this is reversed.
|
| The main benefit is, other developers can rely on the master
| branch even more. They will know there will not be a revert on
| the master branch they just pulled one hour ago and already
| started coding on.
|
| Kwpolska wrote:
| A `git revert` creates a new commit. To a developer, a revert
| commit appearing on master has the same effect as a pull
| request (or ten) being merged into it. If the revert affects
| code you're working on, you will need to resolve conflicts,
| just like you would need to if a merged PR affected the same
| code.
|
| bswinnerton wrote:
| This is known as "GitHub Flow"
| (https://guides.github.com/introduction/flow/). I was pretty
| surprised by it when I first joined GitHub but I've grown to
| love it. It makes rolling back changes much faster than having
| to open up a revert branch, get it approved, and deploy it.
| When something goes sideways, just deploy master / main, which
| is meant to always be in a safe state.
|
| sandGorgon wrote:
| > _GitHub.com is deployed primarily through chatops_
|
| What is the best chatops right now? I don't see a lot of
| popularity around chatops. It's most usually some version of
| GitHub-based triggers.
|
| It's funny that GitHub themselves use chatops. I think that's a
| very nice take - especially for early stage startups. Anyone else
| use anything like it?
|
| paxys wrote:
| I'm guessing they are using Hubot (https://hubot.github.com/)
|
| swagonomixxx wrote:
| A place I was at used Hubot as well. It gets the job done, we
| never really ran into a fuss. Easily extensible as well.
|
| jules2689 wrote:
| This is correct :)
|
| icey wrote:
| We're just starting beta, but my friend Phil and I both worked
| together at GitHub and are building what we hope to be a better
| Hubot at https://ab.bot right now.
|
| It's missing some of the chatops stuff that is mentioned in the
| blog post but since we support a lot more languages than Hubot
| we're hoping it's a matter of time before someone in our
| community builds a better replacement deployment script (or
| we'll do it while building out sample scripts :))
|
| (Also, hi GitHub friends!)
|
| Xorlev wrote:
| I was surprised to see their canary stages are just 5 minutes.
| Many problems take longer to manifest. That seems like a fairly
| risky release process.
|
| jules2689 wrote:
| It's actually longer than 5 minutes. There is the duration of
| the 2% canary deploy where we start to see pick up of traffic,
| a 5 minute wait, then a 20% "deploy", and a 5 minute wait. All
| in all this comes out to around 10-15ish minutes in canary.
| This is a stage where we can almost instantly shut off traffic
| to the canary deploy.
|
| Could we reduce risk by lengthening the process? Maybe, but you
| also make deploys longer which means less stuff can get through
| in a day. This makes devs respond with larger PRs, for example,
| which increases the risk profile.
|
| So we need to balance time and duration. Typically large
| problems will manifest quickly, or take a lot longer to detect
| (and thus are generally more minor problems) when you have our
| scale of a user base in my experience.
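The staged canary described in the comment above (2% of traffic, a 5 minute wait, 20%, another 5 minute wait, then the full deploy) can be sketched roughly as follows. This is only an illustration under stated assumptions, not GitHub's actual tooling: `CanaryStage`, `run_canary`, `set_traffic`, `is_healthy`, and the 30 second polling interval are all hypothetical, and the health check stands in for whatever dashboards and exception monitoring a team really watches. Only the stage percentages and wait times come from the thread.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class CanaryStage:
    percent: int        # share of traffic routed to the new build
    wait_seconds: int   # how long to observe before moving on


# Percentages and waits are from the comment above; the rest is assumed.
STAGES = (CanaryStage(2, 300), CanaryStage(20, 300))


def run_canary(
    stages: Sequence[CanaryStage],
    set_traffic: Callable[[int], None],   # hypothetical traffic switch
    is_healthy: Callable[[], bool],       # hypothetical health check
    sleep: Callable[[float], None],
    poll_seconds: int = 30,               # assumed polling interval
) -> bool:
    """Walk through the stages; return True if a full deploy may proceed."""
    for stage in stages:
        set_traffic(stage.percent)
        waited = 0
        while waited < stage.wait_seconds:
            step = min(poll_seconds, stage.wait_seconds - waited)
            sleep(step)
            waited += step
            if not is_healthy():
                set_traffic(0)  # shut off traffic to the canary
                return False
    return True
```

In real use `sleep` would be `time.sleep`; passing it in as a parameter keeps the sketch easy to exercise without actually waiting ten minutes. The early `set_traffic(0)` mirrors the point that canary traffic can be shut off almost instantly.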
|
| wdb wrote:
| Yeah, wouldn't you need some sort of minimum amount of traffic
| to be able to use canary deployment?
|
| paxys wrote:
| The problems that don't immediately manifest could very well
| take hours or days or longer. There has to be a limit, and 5
| minutes is as good as any.
|
| closeparen wrote:
| A lot of alerts use moving averages or sustain times to
| squelch transient noise. You have to wait for the max sustain
| time to pass before you can conclude that lack of alert =
| lack of problem.
|
| That time could very well be 5 minutes but the two need to be
| coordinated.
|
| bomdo wrote:
| I'd love to learn more about their canary rollouts. Is there any
| more info from either them or similar large sites about this?
|
| For example, what usually has to happen for a dev to trigger a
| rollback? Or how do they handle stateful changes such as database
| schema changes?
|
| t3rabytes wrote:
| Re db migrations: they've built their own DB management tooling
| (https://github.com/openark/orchestrator) and online migration
| tooling (https://github.com/github/gh-ost)
|
| jules2689 wrote:
| We monitor Datadog dashboards, exceptions, and other metrics
| mainly, as well as smoke testing the application
___________________________________________________________________
(page generated 2021-01-25 23:01 UTC)