[HN Gopher] Deploys at Slack
       ___________________________________________________________________
        
       Deploys at Slack
        
       Author : michaeldeng18
       Score  : 60 points
       Date   : 2020-04-08 20:20 UTC (2 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | aledalgrande wrote:
       | A few questions I have left unanswered:
       | 
       | - does the deploy commander create the hotfixes or the engineers
       | who authored the commits?
       | 
       | - it seems that the deployment is fully automated, but engineers
       | still have to be available in case of problems, does that impact
       | productivity?
       | 
       | - "Once we are confident that core functionality is unchanged",
       | is there a particular metric to assert that?
       | 
       | - how long does deployment take currently?
       | 
       | - switching directories doesn't seem like a fully atomic
       | operation yet, isn't there a delay from loading the files and
       | wouldn't that generate 502s from the service? Maybe it's better
       | to create new instances with the new files and then change the
       | router to use those (blue-green)?
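The directory switch asked about above is commonly made atomic at the filesystem level by building a new symlink under a temporary name and renaming it over the live one. A minimal sketch, assuming a POSIX filesystem; the function name `atomic_switch` and the `current` link layout are hypothetical and not taken from the article:

```python
import os


def atomic_switch(current_link: str, new_release_dir: str) -> None:
    """Point `current_link` at `new_release_dir` with a single rename."""
    link_dir = os.path.dirname(os.path.abspath(current_link))
    # Create the new symlink under a temporary name in the same directory,
    # so the final rename never crosses a filesystem boundary.
    tmp_link = os.path.join(
        link_dir, f".{os.path.basename(current_link)}.tmp.{os.getpid()}"
    )
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)
    os.symlink(new_release_dir, tmp_link)
    # On POSIX, os.replace is a rename(2) call: readers see either the old
    # link or the new one, never a missing path.
    os.replace(tmp_link, current_link)
```

This only makes the pointer flip atomic. Processes that already loaded files from the old directory keep them until they reload, which is the gap a blue-green switch at the router is meant to close.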
        
       | brycethornton wrote:
       | It's always nice to see how other teams do it. Nothing too
       | groundbreaking here but that's a good thing.
       | 
       | I did notice the screenshot of "Checkpoint", their deployment
       | tracking UI. Are there solid open source or SaaS tools doing
       | something similar? I've seen various companies build similar
       | tools, but most deployment processes are consistent enough for a
       | 3rd-party tool to be useful for most teams.
        
         | sahillavingia wrote:
         | We (Gumroad) open sourced ours:
         | http://github.com/gumroad/wilfred
         | 
         | Here's what it looks like:
         | https://twitter.com/shl/status/1128039742308737024/photo/2
        
         | jjeaff wrote:
         | GitLab's pipelines and issues/merges UI are similar and open
         | source.
        
         | thinkingkong wrote:
         | I've built that tool 2-3 times now. The issue is really the
         | deploy function and what controls it. It's always a one-off, or
         | so tightly integrated into the hosting environment, that
         | reaching in with a SaaS product is somewhat difficult. That
         | being said, the new lowest-common-denominator standards like
         | K8s make it way easier. If anyone is interested in using a tool
         | just leave a comment and I'll reach out.
        
           | bsima wrote:
           | interested
        
         | mrdonbrown wrote:
         | Sleuth is a SaaS deployment tracker that pulls deployments from
         | source repositories, feature flags, and other sources, in
         | addition to pushes via curl. You can see Sleuth used to, well,
         | track Sleuth at https://app.sleuth.io/sleuth
         | 
         | [Disclaimer: am a Sleuth co-founder]
        
           | ivanfon wrote:
           | Is it possible to view the page you linked without creating
           | an account? It redirects me to your landing page.
        
         | paxys wrote:
         | > most deployment processes are consistent enough
         | 
         | Definitely disagree with this. I have never worked at two
         | places with a similar enough deploy process that would benefit
         | from a generic tool.
        
           | brycethornton wrote:
           | Sure, I see your point. I'd just like to see a pattern that
           | works for most that could gain some traction. At the end of
           | the day we're all trying to do the same thing (deploy high
           | quality software), just in different ways. Deployment
           | strategy shouldn't need to be a main competency of most
           | teams.
        
       | stopachka wrote:
       | This is very similar to the process fb had for years. With some
       | caveats (prod deploys once a week, handled by a central team)
       | 
       | I think this kind of process can last a company well into the
       | thousands of engineers.
       | 
       | Great work
        
       | 7ewis wrote:
       | This link has now been reposted 6 times in the past two weeks:
       | 
       | https://news.ycombinator.com/item?id=22816645
       | 
       | https://news.ycombinator.com/item?id=22729766
       | 
       | https://news.ycombinator.com/item?id=22801191
       | 
       | https://news.ycombinator.com/item?id=22784712
       | 
       | https://news.ycombinator.com/item?id=22720028
       | 
       | https://news.ycombinator.com/item?id=22806810
        
         | a13n wrote:
         | So? Guidelines don't explicitly say this behavior is unallowed.
         | https://news.ycombinator.com/newsguidelines.html
         | 
         | Sometimes posts that deserve to be on the front page don't make
         | it. Seems fine to repost periodically as long as you aren't
         | spamming many times per day.
        
         | dang wrote:
         | That's an indicator of interest. I actually emailed one of the
         | submitters to repost the article for that reason. (Yes, we're
         | thinking about software to detect cases like this.)
         | 
         | On HN, a submission doesn't count as a dupe unless it has had
         | significant attention. This is in the FAQ:
         | https://news.ycombinator.com/newsfaq.html.
        
         | paxys wrote:
         | And yet it has only reached the front page once. Every article
         | you see on top has been posted multiple times in order to get
         | there. That is how online voting/aggregation systems work.
        
           | kick wrote:
           | _Every article you see on top has been posted multiple times
           | in order to get there._
           | 
           | This isn't true at all, for the record.
        
         | rattray wrote:
         | None of them garnered any traction/comments, in case others
         | were looking for that.
        
       | MuffinFlavored wrote:
       | Do they use Kubernetes at Slack?
        
         | moondev wrote:
         | Doesn't seem like it based on
         | 
         | > Instead of pushing the new build to our servers using a sync
         | script, each server pulls the build concurrently when signaled
         | by a Consul key change.
        
           | MuffinFlavored wrote:
           | does that mean they are not even using containers?
        
       | nathankunicki wrote:
       | Fun to read, but there's a lack of detail here that I'd like to
       | see. For example, this talks purely about code changes. However,
       | at times a code change requires a database schema change (as
       | mentioned above), different APIs to be used, etc. In the
       | percentage based rollout where multiple versions are in use at
       | once, how are these differences handled?
        
         | yoloClin wrote:
         | I'm more curious about how DB rollbacks occur in situations
         | where a PR changes DB and is then reverted.
        
         | tantalor wrote:
         | Easy: don't do that.
         | 
         | Always make your code compatible with the old and new schema.
         | Migrate the database separately. Then after the migration,
         | remove the code that supports the old schema.
        
         | navaati wrote:
         | For database schema changes, here is the standard practice:
         | 
         | - You have version 1 of the software, supporting schema A.
         | 
         | - You deploy a version 2 supporting both schema A and new schema
         | B. Both versions coexist until the deployment is complete and
         | all version 1 instances are stopped. During all this time the
         | database is still on schema A; this is fine because your
         | instances, both version 1 and 2, support schema A.
         | 
         | - Now you do the schema upgrade. This is fine because your
         | instances, now all running version 2, support schema B.
         | 
         | - At last, if you wish, you can now deploy a version 3, dropping
         | the support for schema A.
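The "version 2" step above, code that works against both schemas, can be sketched roughly as follows. This is only an illustration using SQLite; the `users` table, the `name` and `display_name` columns, and the helper names are all hypothetical:

```python
import sqlite3


def columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def get_display_name(conn, user_id):
    """Version 2 read path: works on schema A and on schema B."""
    if "display_name" in columns(conn, "users"):
        # Schema B: prefer the new column, fall back to the old one
        # for rows that were not backfilled.
        row = conn.execute(
            "SELECT COALESCE(display_name, name) FROM users WHERE id = ?",
            (user_id,),
        ).fetchone()
    else:
        # Schema A: only the old column exists.
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
    return row[0] if row else None


def migrate_a_to_b(conn):
    """The separate schema upgrade, run once every instance is on version 2."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = name")
    conn.commit()
```

A later version 3 would drop the schema A branch and, eventually, the old column.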
        
       | walrus01 wrote:
       | <sarcasm>
       | 
       | How nice of them to volunteer 2% of their paid customer base as
       | "canary" without them specifically opting in to it, or perhaps
       | even being aware.
       | 
       | </sarcasm>
       | 
       | Or perhaps they do it exclusively with the free service tier,
       | which is much more understandable.
        
         | deadbunny wrote:
         | Anecdotally I usually see slack changes in my free tier
         | channels a good week before paid tier ones so it wouldn't
         | surprise me.
        
         | friend-monoid wrote:
         | Seems reasonable to me? Better to deploy gradually in case the
         | deploy is bad, right?
        
           | dingaling wrote:
           | Tangential, but why do companies continually misuse verbs as
           | nouns?
           | 
           | Nothing is gained by saying 'deploys' instead of
           | 'deployments' but instead confusion can be introduced.
           | 
           | See also 'what is the ask' and 'minimum spend'.
        
             | vpzom wrote:
             | The gain is 1 syllable
        
           | toomuchtodo wrote:
           | If the users are aware and consent to being beta testers,
           | versus what's already likely stable (caveat being when you're
           | rapidly pushing out a hotfix because your last deploy broke
           | something).
        
             | paxys wrote:
             | They aren't beta testers. They are still getting the real
             | production build, just in the first step of a phased
             | rollout. Beta, pre-release etc. have a very different
             | meaning.
        
             | friend-monoid wrote:
             | Link doesn't work for me right now so I haven't read the
             | article, but usually beta testing precedes canary deploys.
             | Maybe this is different.
        
               | toomuchtodo wrote:
               | If it's canary, you don't trust it fully, no? Tests can
               | pass and you still end up munging data or the user
               | experience.
        
             | copecopecope wrote:
             | At some point a new build needs to roll out to production.
             | There's always going to be some risk that something goes
             | wrong, so better to test with 2% of the population
             | initially rather than 100%. By then, the build has already
             | gone through integration tests/dog-fooding, so if something
             | goes wrong in the canary phase, it's generally due to some
             | production environment configuration issue.
        
               | [deleted]
        
               | toomuchtodo wrote:
               | Not disagreeing, simply stating users should be aware and
               | get a say (an option would be fine to opt in to early
               | release access), especially if they're a paying customer.
        
               | copecopecope wrote:
               | I hear where you're coming from, but from my experience,
               | the canary phase usually lasts less than an hour. And the
               | traffic is usually split randomly, so the same 2% of
               | users aren't at elevated risk for every deployment. I
               | don't know how Slack does it, though.
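One way to get the property described above, a small canary slice that is not the same users on every deployment, is to hash the user together with the build identifier. This is a sketch of that general idea, not a description of Slack's system; `in_canary`, `user_id`, and `build_id` are hypothetical names:

```python
import hashlib


def in_canary(user_id: str, build_id: str, percent: float = 2.0) -> bool:
    """Place roughly `percent`% of users in the canary for a given build."""
    # Including build_id in the hash input reshuffles the slice per build.
    digest = hashlib.sha256(f"{build_id}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

The assignment is stable within one build, so a user does not flip between versions mid-rollout, but changes from build to build.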
        
         | nuclearnice1 wrote:
         | - 2% chance of being canary
         | 
         | - 20% chance it breaks
         | 
         | - Expected 15 minutes to roll back
         | 
         | Expect 3.6 seconds of outage per user per release.
         | 
         | It's fine.
         | 
         | What I'd like you to get behind is disabling Windows Update.
         | THAT thing is a menace.
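The 3.6-second figure in the comment above is the product of the three numbers it lists. A small check, assuming (as the comment implicitly does) that a broken release is a full outage for canary users for the entire rollback window; the function and parameter names are hypothetical:

```python
def expected_outage_seconds(p_canary: float, p_break: float, rollback_seconds: float) -> float:
    """Expected outage per user per release under the stated assumption."""
    return p_canary * p_break * rollback_seconds


# 2% canary, 20% chance the release breaks, 15 minutes to roll back.
estimate = expected_outage_seconds(0.02, 0.20, 15 * 60)
print(f"{estimate:.1f} seconds")
```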
        
       | shusson wrote:
       | What happens after hot-fixing the release branch? Does the
       | release branch get merged back into master?
        
         | tantalor wrote:
         | Usually you fix master first then cherrypick the fix to the
         | release branch.
        
       | capableweb wrote:
       | No mention of feature toggles whatsoever. I guess that's why it
       | took them a long time to fix the thing with the new WYSIWYG
       | editor, where after 2 weeks or something, they offered a toggle
       | for people to change back.
       | 
       | Does anyone know their reasoning behind not employing feature
       | toggles? I would feel very slowed down if I didn't have the
       | guarantee and confidence I could quickly roll back in the event
       | of errors.
        
         | derision wrote:
         | They had an undocumented feature toggle for that since day 1. A
         | JavaScript snippet was posted on a thread here that reverted it
         | to the old functionality. So they are using them, but not
         | always surfacing them.
        
         | draw_down wrote:
         | Who said they don't use feature toggles? That is a separate
         | concern from deployment. As far as I can tell you got mad about
         | a feature in their UI and decided that implies something about
         | their infrastructure with no actual evidence.
        
       | darkwater wrote:
       | I wonder why they didn't evaluate at some point using an
       | immutable infrastructure approach leveraging tools like Spinnaker
       | to manage the deploy? They sure have the muscle and numbers to
       | use it and even contribute to it actively, no? I mean, I know
       | that deploying your software is usually something pretty tied to
       | a specific engineering team but I really like the immutable
       | approach and I was wondering why a company the size of Slack,
       | born and grown in the "right" time, did not consider it.
        
       | wrkronmiller wrote:
       | Nice write-up! It would be interesting, however, to get more
       | details on what types of errors were caught in dogfooding, which
       | made it to production, what kind of hotfixes have had to be made
       | in the past, etc...
       | 
       | It's nice to know what Slack does to mitigate bugs in releases,
       | but it would also be useful to know what kinds of bugs each step
       | catches and what bugs still slip through.
        
       ___________________________________________________________________
       (page generated 2020-04-08 23:00 UTC)