[HN Gopher] Deploys at Slack
___________________________________________________________________
Deploys at Slack

Author : michaeldeng18
Score  : 60 points
Date   : 2020-04-08 20:20 UTC (2 hours ago)

(HTM) web link (slack.engineering)
(TXT) w3m dump (slack.engineering)

| aledalgrande wrote:
| A few questions I have left unanswered:
|
| - does the deploy commander create the hotfixes, or do the
| engineers who authored the commits?
|
| - it seems that the deployment is fully automated, but engineers
| still have to be available in case of problems; does that impact
| productivity?
|
| - "Once we are confident that core functionality is unchanged",
| is there a particular metric to assert that?
|
| - how long does deployment take currently?
|
| - switching directories doesn't seem like a fully atomic
| operation yet; isn't there a delay from loading the files, and
| wouldn't that generate 502s from the service? Maybe it's better
| to create new instances with the new files and then change the
| router to use those (blue-green)?

| brycethornton wrote:
| It's always nice to see how other teams do it. Nothing too
| groundbreaking here but that's a good thing.
|
| I did notice the screenshot of "Checkpoint", their deployment
| tracking UI. Are there solid open source or SaaS tools doing
| something similar? I've seen various companies build similar
| tools but most deployment processes are consistent enough to have
| a 3rd-party tool that would be useful for most teams.

| sahillavingia wrote:
| We (Gumroad) open sourced ours:
| http://github.com/gumroad/wilfred
|
| Here's what it looks like:
| https://twitter.com/shl/status/1128039742308737024/photo/2

| jjeaff wrote:
| GitLab's pipelines and issues/merges UI are similar and open
| source.

| thinkingkong wrote:
| I've built that tool 2-3 times now. The issue is really the
| deploy function and what controls it.
| It's always a one-off, or
| so tightly integrated into the hosting environment, that
| reaching in with a SaaS product is somewhat difficult. That
| being said, the new lowest-common-denominator standards like
| K8s make it way easier. If anyone is interested in using a tool
| just leave a comment and I'll reach out.

| bsima wrote:
| interested

| mrdonbrown wrote:
| Sleuth is a SaaS deployment tracker that pulls deployments from
| source repositories, feature flags, and other sources, in
| addition to pushes via curl. You can see Sleuth used to, well,
| track Sleuth at https://app.sleuth.io/sleuth
|
| [Disclaimer: am a Sleuth co-founder]

| ivanfon wrote:
| Is it possible to view the page you linked without creating
| an account? It redirects me to your landing page.

| paxys wrote:
| > most deployment processes are consistent enough
|
| Definitely disagree with this. I have never worked at two
| places with a similar enough deploy process that would benefit
| from a generic tool.

| brycethornton wrote:
| Sure, I see your point. I'd just like to see a pattern that
| works for most that could gain some traction. At the end of
| the day we're all trying to do the same thing (deploy high
| quality software), just in different ways. Deployment
| strategy shouldn't need to be a main competency of most
| teams.

| stopachka wrote:
| This is very similar to the process fb had for years. With some
| caveats (prod deploys once a week, handled by a central team)
|
| I think this kind of process can last a company well into the
| thousands of engineers.
|
| Great work

| 7ewis wrote:
| This link has now been reposted 6 times in the past two weeks:
|
| https://news.ycombinator.com/item?id=22816645
|
| https://news.ycombinator.com/item?id=22729766
|
| https://news.ycombinator.com/item?id=22801191
|
| https://news.ycombinator.com/item?id=22784712
|
| https://news.ycombinator.com/item?id=22720028
|
| https://news.ycombinator.com/item?id=22806810

| a13n wrote:
| So?
| Guidelines don't explicitly say this behavior is disallowed.
| https://news.ycombinator.com/newsguidelines.html
|
| Sometimes posts that deserve to be on the front page don't make
| it. Seems fine to repost periodically as long as you aren't
| spamming many times per day.

| dang wrote:
| That's an indicator of interest. I actually emailed one of the
| submitters to repost the article for that reason. (Yes, we're
| thinking about software to detect cases like this.)
|
| On HN, a submission doesn't count as a dupe unless it has had
| significant attention. This is in the FAQ:
| https://news.ycombinator.com/newsfaq.html.

| paxys wrote:
| And yet it has only reached the front page once. Every article
| you see on top has been posted multiple times in order to get
| there. That is how online voting/aggregation systems work.

| kick wrote:
| _Every article you see on top has been posted multiple times
| in order to get there._
|
| This isn't true at all, for the record.

| rattray wrote:
| None of them garnered any traction/comments, in case others
| were looking for that.

| MuffinFlavored wrote:
| Do they use Kubernetes at Slack?

| moondev wrote:
| Doesn't seem like it based on
|
| > Instead of pushing the new build to our servers using a sync
| script, each server pulls the build concurrently when signaled
| by a Consul key change.

| MuffinFlavored wrote:
| Does that mean they are not even using containers?

| nathankunicki wrote:
| Fun to read, but there's a lack of detail here that I'd like to
| see. For example, this talks purely about code changes. However,
| sometimes a code change requires a database schema change (as
| mentioned above), different APIs to be used, etc. In the
| percentage-based rollout where multiple versions are in use at
| once, how are these differences handled?

| yoloClin wrote:
| I'm more curious about how DB rollbacks occur in situations
| where a PR changes the DB and is then reverted.

| tantalor wrote:
| Easy: don't do that.
|
| Always make your code compatible with the old and new schema.
| Migrate the database separately. Then after the migration,
| remove the code that supports the old schema.

| navaati wrote:
| For database schema changes, here is the standard practice:
|
| - You have version 1 of the software, supporting schema A.
|
| - You deploy a version 2 supporting both schema A and new
| schema B. Both versions coexist until the deployment is
| complete and all version 1 instances are stopped. During all
| this time the database is still on schema A; this is fine
| because your instances, both version 1 and 2, support schema A.
|
| - Now you do the schema upgrade. This is fine because your
| instances, now all running version 2, support schema B.
|
| - At last, if you wish, you can now deploy a version 3, dropping
| the support for schema A.

| walrus01 wrote:
| <sarcasm>
|
| How nice of them to volunteer 2% of their paid customer base as
| "canary" without them specifically opting in to it, or perhaps
| even being aware.
|
| </sarcasm>
|
| Or perhaps they do it exclusively with the free service tier,
| which is much more understandable.

| deadbunny wrote:
| Anecdotally I usually see Slack changes in my free tier
| channels a good week before paid tier ones so it wouldn't
| surprise me.

| friend-monoid wrote:
| Seems reasonable to me? Better to deploy gradually in case the
| deploy is bad, right?

| dingaling wrote:
| Tangential, but why do companies continually misuse verbs as
| nouns?
|
| Nothing is gained by saying 'deploys' instead of
| 'deployments' but instead confusion can be introduced.
|
| See also 'what is the ask' and 'minimum spend'.

| vpzom wrote:
| The gain is 1 syllable

| toomuchtodo wrote:
| If the users are aware and consent to being beta testers,
| versus what's already likely stable (caveat being when you're
| rapidly pushing out a hotfix because your last deploy broke
| something).

| paxys wrote:
| They aren't beta testers.
| They are still getting the real
| production build, just in the first step of a phased
| rollout. Beta, pre-release etc. have a very different
| meaning.

| friend-monoid wrote:
| Link doesn't work for me right now so I haven't read the
| article, but usually beta testing precedes canary deploys.
| Maybe this is different.

| toomuchtodo wrote:
| If it's canary, you don't trust it fully, no? Tests can
| pass and you still end up munging data or the user
| experience.

| copecopecope wrote:
| At some point a new build needs to roll out to production.
| There's always going to be some risk that something goes
| wrong, so better to test with 2% of the population
| initially rather than 100%. By then, the build has already
| gone through integration tests/dog-fooding, so if something
| goes wrong in the canary phase, it's generally due to some
| production environment configuration issue.

| [deleted]

| toomuchtodo wrote:
| Not disagreeing, simply stating users should be aware and
| get a say (an option would be fine to opt in to early
| release access), especially if they're a paying customer.

| copecopecope wrote:
| I hear where you're coming from, but from my experience,
| the canary phase usually lasts less than an hour. And the
| traffic is usually split randomly, so the same 2% of
| users aren't at elevated risk for every deployment. I
| don't know how Slack does it, though.

| nuclearnice1 wrote:
| 2% chance of being canary
| 20% chance it breaks
| Expected 15 minutes to roll back
|
| Expect 3.6 seconds of outage per user per release
| (0.02 x 0.20 x 900 seconds).
|
| It's fine.
|
| What I'd like you to get behind is disabling Windows Update.
| THAT thing is a menace.

| shusson wrote:
| What happens after hot-fixing the release branch? Does the
| release branch get merged back into master?

| tantalor wrote:
| Usually you fix master first then cherrypick the fix to the
| release branch.

| capableweb wrote:
| No mention of feature toggles whatsoever.
| I guess that's why it
| took them a long time to fix the thing with the new WYSIWYG
| editor, where after 2 weeks or something, they offered a toggle
| for people to change back.
|
| Does anyone know their reasoning behind not employing feature
| toggles? I would feel very slowed down if I didn't have the
| guarantee and confidence I could quickly roll back in the event
| of errors.

| derision wrote:
| They had an undocumented feature toggle for that since day 1. A
| JavaScript snippet was posted on a thread here that
| reverted it to the old functionality. So they are using them
| but not always surfacing them.

| draw_down wrote:
| Who said they don't use feature toggles? That is a separate
| concern from deployment. As far as I can tell you got mad about
| a feature in their UI and decided that implies something about
| their infrastructure with no actual evidence.

| darkwater wrote:
| I wonder why they didn't evaluate at some point using an
| immutable infrastructure approach leveraging tools like Spinnaker
| to manage the deploy? They sure have the muscle and numbers to
| use it and even contribute to it actively, no? I mean, I know
| that deploying your software is usually something pretty tied to
| a specific engineering team but I really like the immutable
| approach and I was wondering why a company the size of Slack,
| born and grown in the "right" time, did not consider it.

| wrkronmiller wrote:
| Nice write-up! It would be interesting, however, to get more
| details on what types of errors were caught in dogfooding, which
| made it to production, what kind of hotfixes have had to be made
| in the past, etc...
|
| It's nice to know what Slack does to mitigate bugs in releases,
| but it would also be useful to know what kinds of bugs each step
| catches and what bugs still slip through.
___________________________________________________________________
(page generated 2020-04-08 23:00 UTC)
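
Editorial sketch: the schema-migration practice that tantalor and navaati describe in the thread (ship code that understands both the old and the new schema, migrate the data separately, then drop support for the old schema) might look roughly like the following. Nothing here comes from Slack's article. The column names `name`, `first_name`, and `last_name`, and the functions `display_name` and `migrate_row`, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Mapping


def display_name(row: Mapping[str, str]) -> str:
    """'Version 2' reader: accepts rows in either schema.

    Hypothetical schema A stores a single `name` column.
    Hypothetical schema B stores `first_name` and `last_name`.
    """
    if "first_name" in row and "last_name" in row:
        # Row has already been migrated to schema B.
        return f"{row['first_name']} {row['last_name']}"
    # Row is still in schema A.
    return row["name"]


def migrate_row(row: Mapping[str, str]) -> Dict[str, str]:
    """Schema A -> schema B for one row.

    Run as a separate step, only after every running instance
    uses a reader like display_name that understands both schemas.
    """
    first, _, last = row["name"].partition(" ")
    return {"first_name": first, "last_name": last}


if __name__ == "__main__":
    old_row = {"name": "Ada Lovelace"}
    new_row = migrate_row(old_row)
    # The same reader gives the same answer before and after migration.
    print(display_name(old_row))
    print(display_name(new_row))
```

Because the reader gives the same result for a row before and after migration, the migration can run while old and new rows coexist. A later "version 3" would delete the schema A branch in `display_name`.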