[HN Gopher] Slack's Incident on 2-22-22
       ___________________________________________________________________
        
       Slack's Incident on 2-22-22
        
       Author : alphabettsy
       Score  : 122 points
       Date   : 2022-04-26 17:26 UTC (5 hours ago)
        
 (HTM) web link (slack.engineering)
 (TXT) w3m dump (slack.engineering)
        
       | scrollaway wrote:
       | A post-mortem with non-ISO dates? Even on THAT date?! :)
        
         | smegsicle wrote:
         | looks unambiguous to me lol
        
           | mulmen wrote:
           | Ok, what is the format string?
           | 
           | a) m-dd-yy
           | b) m-yy-dd
           | 
           | I can't tell. How do you disambiguate?
        
             | rjh29 wrote:
             | It can't be either, 22 is not a valid month.
             | 
             | I agree with you though, the point of a date like yyyy-mm-
             | dd is to avoid working out stuff like this. You don't pick
             | a date format based on whether the current date is
             | ambiguous or not.
        
               | mulmen wrote:
               | Good catch. I updated my post. The question remains, how
               | can this format be disambiguated?
               | 
               | Agreed, this is why ISO8601 exists.
        
       | 58x14 wrote:
       | Just my personal take, I think this is a really well-written
       | incident postmortem. It's specific, extensive, candid, and dare I
       | say, entertaining?
       | 
       | Many incident reports are fully lacking in any meaningful detail,
       | or wholly unapologetic. I actually enjoyed learning tidbits about
       | the author, in particular their mention of
       | https://how.complexsystems.fail/.
       | 
       | Reading this boosted my confidence in Slack's teams, which should
       | ultimately be the objective of a release like this. It's not pure
       | PR nor a gruff legally-obligated disclosure.
       | 
       | It helps that I wasn't really affected by this incident.
        
         | vxNsr wrote:
         | The fact that the current top comment thread is quibbling about
         | the date format in the title seems to agree with this
         | assessment, if there was anything real to complain about that's
         | what we'd be seeing, instead we get bikeshedding on the date
         | format in the title of a post.
        
           | dijonman2 wrote:
           | Which is aligned with US date formatting.
        
         | ttul wrote:
         | On today's internet, a frank post mortem delivers value to
         | customers and PR gold too.
        
       | _justinfunk wrote:
       | I'm looking forward to http://thisincidentreportdoesnotexist.com
       | launching sometime later this year.
        
       | notacoward wrote:
       | I love the diagrams of the cache<->DB cycle in normal vs.
       | degenerate states. Those illustrate the problem very clearly and
       | succinctly, and I hope they make it into a textbook some day.
       | Kudos.
        
       | godmode2019 wrote:
       | Tidbit:
       | 
       | 2-22-22 was also when Russia invaded Ukraine
       | 
       | And Joe Biden's statement about the invasion was on 2-22-22 2:22pm
       | on the dot.
       | 
       | I could not figure out the significance of this more than 11:11
       | was when WW1 ended, but it's probably something else.
        
         | Kwpolska wrote:
         | Nope, you're off by two days:
         | https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukrai...
        
       | oxfordmale wrote:
       | Which major outage? According to the Slack uptime, there was
       | barely 1.5 hours of outage :-)
       | 
       | P.S.
       | 
       | Yes I know the uptime is decided by committee, and doesn't reflect
       | reality. I am just being cynical.
        
       | olliej wrote:
       | I find reading about these incidents super interesting, and I
       | generally find the work performed by the folk keeping these
       | services running (and dealing with the inevitable falling over of
       | any computer system).
       | 
       | At the same time it seems like a horrifying job I would never
       | ever want :D
        
       | diarrhea wrote:
       | That date format is actually the worst I have ever encountered.
       | m-d-y, with year in 2 digits, numbers not zero-padded, US "order"
       | yet using dashes. It's like a moderator of /r/ISO8601 came up
       | with the worst possible format _on purpose_. Am I missing
       | something?
        
         | HiJon89 wrote:
         | I expected the top comment on hackernews to be something this
         | pedantic and irrelevant to the content, and I was not
         | disappointed
        
           | elpakal wrote:
           | I had to read the parent comment twice to understand that it
           | was talking about the date in the title of the post and not
           | anything relevant whatsoever.
        
         | missedthecue wrote:
         | I mean, there is literally no way to confuse it with another
         | date, unless you go back 100 years, when Slack didn't exist.
         | 
         | There is no 22nd month, so we know the 22s are the day and the
         | year, leaving only the 2 to be the month. Is it really that
         | difficult to parse?
        
           | johannes1234321 wrote:
           | > leaving only the 2 to be the day
           | 
           | I think you meant "to be the month" there. qed#
        
             | diarrhea wrote:
             | Beautiful.
        
           | bamboozled wrote:
           | So we could either:
           | 
           | a) Use a far superior date format which nearly the entire
           | world uses by default and is better and simpler in many ways.
           | 
           | b) Do logic when we see dates to try to work out what format the
           | date is in.
           | 
           | Going with _a_ seems like a no brainer...
        
             | verve_rat wrote:
             | Especially in this scenario where you are communicating to
             | an international audience.
        
           | burnte wrote:
           | Yeah, but what about that party that I threw on 10/11/12. Did
           | I set it for November 12, 2010, or October 11 in 2012? Or
           | somewhen else?
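A minimal sketch of the ambiguity being argued about here, assuming a
hypothetical helper `interpretations` and assuming two-digit years belong to
the 2000s (neither comes from the thread). It tries every assignment of the
three numeric fields to year, month, and day and keeps the ones that are real
calendar dates:

```python
from datetime import date
from itertools import permutations


def interpretations(text, sep="-"):
    """Return every calendar date a three-field numeric string could mean.

    Assumption: two-digit years are in the 2000s. This is an arbitrary
    choice made for the illustration.
    """
    fields = [int(part) for part in text.split(sep)]
    found = set()
    # Try each assignment of the three fields to (year, month, day).
    for year, month, day in permutations(fields):
        try:
            found.add(date(2000 + year, month, day))
        except ValueError:
            pass  # not a real date, e.g. there is no month 22
    return sorted(found)


print(interpretations("2-22-22"))            # a single valid reading
print(interpretations("10/11/12", sep="/"))  # several valid readings
```

Under those assumptions the title's date has exactly one valid reading, while
10/11/12 has six, which is the case where a reader has to fall back on context.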
        
           | tjoff wrote:
           | I'm so tired of having to do that game every time I see a
           | date. It is not hard, but it is quite annoying. Especially
           | since it isn't solvable in a lot of cases, so you try to
           | reason your way to the most realistic interpretation.
           | 
           | It shouldn't be this hard.
        
             | true_religion wrote:
             | It's not really that hard. Like the imperial system,
             | Americans just memorize how it works as children and don't
             | think about it anymore.
             | 
             | Think about it like speaking a different language, except
             | with numbers and not words.
        
               | colejohnson66 wrote:
               | Really HN? We're seriously downvoting this comment to
               | oblivion? I get non-Americans get passionate in their
               | anger at imperial units, but this person is just
               | explaining why it's natural to us.
        
               | dllthomas wrote:
               | The complaint isn't about the particular other order, but
               | the fact that the order is ambiguous. In this case that
               | doesn't matter, but often it does.
               | 
               | Americans memorize inches and yards, and often also
               | memorize centimeters and meters, and working with
               | _either_ is fine, but we're not so often faced with
               | numbers where it might be inches _or_ centimeters and we
               | have to figure out which (and when we are, it's
               | sometimes a pain - certainly a bigger pain than working
               | with known units).
               | 
               | Or, working with your language analogy, please go fetch
               | me some "pasta" without knowing whether I'm speaking
               | Italian or Polish.
        
               | xeromal wrote:
               | Context matters in your pasta scenario.
        
               | dllthomas wrote:
               | Since the text itself doesn't clarify, context is the
               | only way of resolving any of the scenarios. In each case
               | it's usually sufficient and often not all that hard. But
               | it's always harder than if the system in use was made
               | explicit, and I understand the complaint (even if my
               | annoyance at the ambiguity is quite significantly below
               | the level where I would have complained myself,
               | particularly in this case).
        
               | erpellan wrote:
               | Context also matters in the date parsing scenario.
               | 11/12/22 could be several different dates depending on
               | the context.
        
               | xeromal wrote:
               | Yeah, and in this post, it's clear what it is.
        
               | ldh wrote:
               | It may not have happened to you yet, but someday you'll
               | see a date somewhere other than this post.
        
               | xeromal wrote:
               | This honestly has me laughing.
        
               | ascar wrote:
               | > Think about it like speaking a different language
               | 
               | The correct analogy is I don't know which language is
               | spoken and the same words get used in multiple languages
               | with different meaning. Now I can apply heuristics to
               | figure it out or in some cases I can only guess.
        
           | huhtenberg wrote:
           | I just thought it was a typo.
        
           | noselasd wrote:
           | That particular date is possible to understand, but the date
           | format is not. (Is it really that fun to try to figure out
           | what 12-11-21 means ?)
        
             | CPLX wrote:
             | November 21, 1812
             | 
             | What do I win?
        
         | jonpurdy wrote:
         | Came here to complain specifically about this. 2022-02-22 is
         | unambiguous, big endian, and sorts nicely. IDK why society
         | still uses any other date formats considering how international
         | everything is.
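A small illustration of the "sorts nicely" point, using made-up example dates
(the variable names and the two extra dates are assumptions for the sketch).
Because yyyy-mm-dd is zero-padded and ordered from most to least significant
field, plain string sorting matches chronological order; the m-d-yy form used
in the title does not have that property:

```python
from datetime import date

# Example dates chosen only for the illustration.
incidents = [date(2022, 2, 22), date(2021, 12, 1), date(2022, 11, 3)]

# ISO 8601 calendar dates: yyyy-mm-dd, zero-padded.
iso = [d.isoformat() for d in incidents]

# The style used in the post title: m-d-yy, no padding.
us = [f"{d.month}-{d.day}-{d.year % 100}" for d in incidents]

print(sorted(iso))  # string order equals chronological order
print(sorted(us))   # string order is unrelated to chronological order
```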
        
           | [deleted]
        
           | gleenn wrote:
           | It's because people for hundreds of years have been saying
           | "March second, nineteen sixty two" which they then write out
           | in that order. As a programmer, peoples' frustrations are
           | understandable, but you're a bit naive if you think even a
           | percentage point of the speaking population of the world
           | knows or is concerned with big endian-ness or sortability.
           | However they speak English, at least in America, in that
           | order, and that's the way they write it. Europeans only got
           | it a little better.
        
             | [deleted]
        
             | theamk wrote:
             | We do say "five dollars" while writing "$5", so saying and
             | writing different things is not unheard of.
             | 
             | And endianness / sorting comes up in real life pretty often
             | - scanning for large numbers in the price list, or finding
             | stuff in the sorted list.
             | 
             | I think if history turned differently, we could have had
             | sane time format in the US.
        
             | ascar wrote:
             | > Europeans only got it a little better.
             | 
             | There is a reasonable argument for little endian dates (as
             | in the least significant information is usually the most
             | relevant as it changes most often), but apart from the "it
             | has been like this forever" I don't see any reasonable
             | argument for middle endian date formats. Then again, the US
             | is notoriously resistant to the metric system too.
        
               | mumblemumble wrote:
               | Your error is expecting reasonableness. All linguistic
               | conventions are either arbitrary or lost to time, and
               | mostly only exist for tradition's sake.
        
               | lilyball wrote:
               | It's because it matches the way we speak dates aloud.
               | When intended for human consumption, sortability and big-
               | endianness doesn't matter, but matching the way we speak
               | does. Maybe other cultures actually speak dates
               | differently, I don't know, but I have never seen a native
               | English speaker habitually speak dates any differently
               | than "January 1st, 2001".
               | 
               | All that said, I definitely agree with the original
               | complaint, m-dd-yy is an atrocious format. If you're
               | going to use dashes, stick with yyyy-mm-dd. Replacing the
               | dashes with slashes, as in 2/22/22, would have been fine.
        
               | rmccue wrote:
               | "the twenty-sixth of April" would be the way I say
               | today's date and anecdotally is in common usage in both
               | countries I've lived in (the UK and Australia, both using
               | d/m/y). I'd say it's about as frequent as "April the
               | twenty-sixth" by itself, and definitely more common if
               | you include the day ("Tuesday, the twenty-sixth of
               | April").
        
               | ChrisKnott wrote:
               | In the UK I think "1st of January" is probably slightly
               | more common than "January the 1st" although you hear
               | both. "January 1st" (no "the") sounds American.
        
               | verve_rat wrote:
               | I'm from NZ and it is 100% normal to switch back and
               | forth between "The second of March" and "March the 23rd".
               | 
               | People I have met from Australia, South Africa, the UK,
               | all have the same flexibility.
        
               | [deleted]
        
               | jc_811 wrote:
               | Oh this is a great point! I'd never realized that. I know
               | that in Spanish (and I assume many of the romance
               | languages) we always say the day first, eg dos de febrero
               | (2nd of February). In American English even though the
               | day first technically is grammatically correct, we pretty
               | much never say it in that order (February 2nd instead of
               | the 2nd of February)
        
             | ksdnjweusdnkl21 wrote:
             | Have they not been doing the same with "fourth of July"? Or
             | is this an exception?
        
               | gleenn wrote:
               | Our Independence Day is probably a special case. Clearly
               | language is flexible enough to say all the formats, but
               | the date format we write matches the most common
               | verbalization.
        
           | Hjfrf wrote:
           | The one exception I can think of is a bug in the mssql
           | datetime type (but not date or datetime2) where strings in
           | that format are assumed to be yyyy-dd-mm if the locale
           | dateformat is dmy (e.g. British English).
        
         | charcircuit wrote:
         | It's a shortened version of "February 22, 2022"
         | 
         | It doesn't seem that bad to turn it into 2-22-22.
        
           | lucideer wrote:
           | It's also a shortened version of "22nd of February, 2022".
        
           | smcl wrote:
           | The issue is a great deal of the rest of the world don't do
           | this, so you need to decide whether to apply best-guess
           | heuristics to parse it or decide that it's a typo ("ah
           | there's not 22 months, so maybe it's the 22nd of February or
           | someone fat-fingered the 2nd of February...?").
           | 
           | In this case you can lookup Slack outages to disambiguate it,
           | but the frustration here - and I share it - is directed at
           | the stubborn refusal to use a standard format that the rest
           | of the world has agreed upon.
        
         | Invictus0 wrote:
         | It's a quirky nod to the fact that all the digits were 2 on
         | that day in this format.
        
         | vxNsr wrote:
         | > _Am I missing something?_
         | 
         | Yes, the numbers are all the same, and the author is based in
         | the US, and thus is using the default format in the US. So odd
         | that this is the top comment.
        
         | samstave wrote:
         | OMG - I thought I clicked on the tablet thread regarding
         | Sumerian OOOs -- and I thought you were sarcastically making
         | fun of the way the Sumerians captured dates on limestone
         | tablets ~4,000 years ago...
         | 
         | (I had scrolled immediately down, so the thread title wasn't
         | visible when I was reading your comment)
         | 
         | haha
        
         | adamomada wrote:
         | This is what you sometimes see for best-before dates in Canada.
         | Even better, because our dates are "supposed to" be like 22/2
         | but I don't think anyone here does that, except Quebec perhaps.
         | Sometimes you just have no clue
        
         | rcthompson wrote:
         | The point is to describe the date using only the number 2.
        
       | neerajk wrote:
       | "Mcrib is objectively a better system for generating memcached
       | configurations -- but its efficiency made the broader system
       | behave in a less safe way." Be good but not _that_ good :)
        
       | pierrebai wrote:
       | Sometimes a new roll-out causes an outage; sometimes roll-outs are
       | delayed due to the overall system architecture. Reading the
       | post-mortem, I could not help but be reminded of this issue as
       | described here: https://www.youtube.com/watch?v=y8OnoxKotPQ
        
       | epmatsw wrote:
       | McRib is a hilarious name for a service
        
         | whoopdedo wrote:
         | I certainly wouldn't trust its availability.
         | 
         | (The original McDonald's McRib sandwich is well known for only
         | being sold a limited time.)
         | 
         | So "Mcrouter" comes from Memcache-Router, then the obvious
         | McDonalds jokes are made and someone cleverly suggests "Mcrib"
         | for the next service. But I can't think what the backronym
         | would be for it. Memcache Ring Buffer maybe. Or Broker.
        
           | phan wrote:
           | memcache router interface broker
        
           | bee_rider wrote:
           | Apparently it generates configurations for Mcrouter. Could be
           | MemCache-Router Instance Borker.
        
             | true_religion wrote:
             | I think you meant Broker, but the misspelling is an act of
             | genius since we are talking about downtime caused by an
             | infrastructure failure.
        
               | qubyte wrote:
               | And I misread it as McBorker and now I can't stop
               | chuckling.
        
               | erichurkman wrote:
               | McBorker, Chaos Monkey's cousin.
        
               | bee_rider wrote:
               | Does it make it less of an act of genius if it was
               | intentional?
        
           | jonah-archive wrote:
           | RIB is a common term in networking for "Routing Information
           | Base" (being the set of all routes which could be chosen to
           | be installed in the routing table (or FIB -- "Forwarding") by
           | the control plane). I don't know that this is the actual
           | etymology but it's not implausible.
        
       | mescaline wrote:
       | An over communication platform should have scheduled outages like
       | this regularly!
        
         | alex-zierhut wrote:
         | What motivation would someone have to run a scheduled outage? I
         | can't think of any.
        
         | AndrewUnmuted wrote:
         | Something about all of this feels like a scheduled outage, to
         | me.
         | 
         | I am suspicious, though cannot back this up at all, that they
         | were ready for this incident and may have even planned for it.
        
           | jdlshore wrote:
           | This sort of handwavy conspiracy thinking is distressingly
           | common. What basis do you have for your suspicion? Is it just
           | "big company bad"?
        
             | AndrewUnmuted wrote:
             | > distressingly
             | 
             | You're distressed by my thinking? That's odd.
             | 
             | Slack is a terrible product that engulfs the worker inside
             | a dead-eyed grunt culture, featuring an endless spree of
             | work-life balance destroyers. It might be great for people
             | who ask for things from others, but for the people who have
             | to actually do the thing being asked of them, Slack is a
             | nightmare world.
             | 
             | Anyone with the psychology to make a product like Slack, is
             | likely to engage in handwavy conspiracy thinking
             | themselves.
        
               | alphabettsy wrote:
               | > engulfs the worker inside a dead-eyed grunt culture,
               | featuring an endless spree of work-life balance
               | destroyers. It might be great for people who ask for
               | things from others, but for the people who have to
               | actually do the thing being asked of them, Slack is a
               | nightmare world.
               | 
               | I think it depends on the organization and how you use
               | it. In a previous role I would've agreed with you. People
               | expected you to reply at all hours, where I am now that
               | isn't the case.
               | 
               | Tools do not create toxic culture or destroy work-life
               | balance. Organizations do that.
        
             | mulmen wrote:
             | I choose to believe it is not common thinking but instead
             | commonly verbalized among the minority with such thoughts.
        
       | encryptluks2 wrote:
       | > including the author -- which certainly made my role as
       | Incident Commander more challenging!
       | 
       | As if no other way to communicate exists?
       | 
       | I remember using Slack, feeling fed up with emails, until I
       | realized that if I wanted to sync Slack messages offline and have
       | a standard way to view these messages that I was SOL. I am so
       | glad that I've returned to email and optimized my workflow to use
       | email effectively and efficiently. The best part is no more
       | vendor lock-in.
        
       | orf wrote:
       | *22/2/22
        
         | adamomada wrote:
         | You know how the date looked strange to you? It's the same for
         | your correction, but for other people
        
           | orf wrote:
           | For a statistically insignificant portion of people, sure. It
           | doesn't make it any less correct.
        
             | 4ggr0 wrote:
             | For the whole of Europe it would be 22.02.2022, how is all
             | of Europe statistically insignificant?
        
               | orf wrote:
               | The official EU rules say 22.02.2022, but nobody in
               | Europe would have trouble parsing 22/2/22 or any
               | variation thereof. And the / (or -) separator is indeed
               | used in parts of the EU.
               | 
               | It's the ordering that's significant, not the separator.
        
       | dormando wrote:
       | Hi! I'd like to offer some hopefully useful information if any
       | Slack folks end up reading this, or anyone else with a similar
       | infrastructure. I'll start with some tech and make a separate
       | philosophical comment.
       | 
       | Also caveat: I have no deep view into Slack's infrastructure so
       | anything I say here may not even be relevant. YMMV.
       | 
       | First some self promotion:
       | https://github.com/memcached/memcached/wiki/Proxy memcached
       | itself is shipping router/proxy software. Mcrouter is difficult
       | to manage and unsupported. This proxy is community developed,
       | more flexible, likely faster, and will support more native
       | features of memcached. We're currently in a stabilization round
       | ensuring it won't eat pets but all of the basic features have
       | been in for a while. Documentation and example libraries are
       | still needed but community feedback helps speed those up
       | tremendously (or any kind of question/help request).
       | 
       | It's not clear to me why memcached is being managed like this;
       | mcrouter seems to only be used to abstract the configuration from
       | the clients. It has a lot of features for redundant pools and so
       | on. Especially with what sounds like globally immutable data and
       | the threat of cascading failures during rolling upgrades it
       | sounds like it would be very helpful here.
       | 
       | If cost or pool sizes are the main reasons why the structure is
       | flat, using Extstore
       | (https://github.com/memcached/memcached/wiki/Extstore) can likely
       | help. Even if object value sizes are in the realm of 500 bytes,
       | using flash storage can still greatly reduce the amount of RAM
       | necessary or reduce the pool size (granted the network can still
       | keep up) with nearly identical performance. Extstore takes a lot
       | of tradeoffs (i.e., keeping keys in RAM) to ensure most operations
       | don't actually write to flash or double-read. Extstore's in use
       | in tons of places and everyone's immediately addicted.
       | 
       | Finally, the Meta Protocol
       | (https://github.com/memcached/memcached/wiki/MetaCommands) can
       | help with stampeding herds to help keep DB load from exploding
       | without adding excess network roundtrips under normal conditions.
       | I've seen lots of workarounds people build but this protocol
       | extension gives a lot of flexibility you can use to help survive
       | degraded states: anti-stampeding herd, serve-stale, better
       | counter semantics, and so on.
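To make the "anti-stampeding herd" and "serve-stale" ideas concrete, here is a
generic in-process sketch. It is not memcached's Meta Protocol and does not use
any memcached client; `SingleFlightCache` and its parameters are hypothetical
names invented for the illustration. The idea it shows: when an entry expires,
one caller is elected to reload it from the database while everyone else keeps
getting the stale value, so the database sees one read per key instead of one
per client.

```python
import threading
import time


class SingleFlightCache:
    """In-process cache where only one caller recomputes an expired key.

    Hypothetical helper for illustration; not memcached's API.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._lock = threading.Lock()
        self._entries = {}        # key -> (value, expires_at)
        self._refreshing = set()  # keys currently being reloaded

    def get(self, key, load, now=time.monotonic):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[1] > now():
                return entry[0]  # fresh hit
            if entry is not None and key in self._refreshing:
                return entry[0]  # expired, but another caller is reloading
            # Either a cold miss or this caller won the right to reload.
            # Limitation: on a cold miss there is no stale value to serve,
            # so concurrent callers would each call load().
            self._refreshing.add(key)
        try:
            value = load(key)  # the expensive backing-store read
            with self._lock:
                self._entries[key] = (value, now() + self.ttl)
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

A caller would use it as `cache.get("user:42", load_from_db)`, where
`load_from_db` is whatever function reads the backing store. A shared cache
needs the same election to happen on the cache server rather than in one
process, which is the role the comment above assigns to the protocol extension.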
        
       | dormando wrote:
       | Now a more philosoraptor style comment: I see Mcrib is a service
       | built to quickly detect and replace memcached's. I treat
       | memcached in infrastructure as a very stable service. Meaning it
       | is infrequently necessary to upgrade it, and it will generally
       | not fail on its own. If it does it will be highly infrequent
       | compared to services with higher churn or more
       | complexity/dependencies. This means if they're failing often
       | enough that you need to rapidly detect and replace them you have
       | a more fundamental problem.
       | 
       | From a structural standpoint I think my technical comment can be
       | useful. If things really are failing this much A) you should
       | figure out why and slow that down. B) if you have a generally
       | stable system and understand the typical rate of failure, you can
       | add tripwires into Mcrib to avoid over-culling services and
       | loudly raise alarms. Then C) you can improve technical
       | reliability with redundancy/extstore/etc.
       | 
       | I've also seen plenty of times where folks have a dependency of a
       | service determine if that service is usable, which I disagree
       | with quite strongly. Consul being down on a node should trigger
       | something to consider if the service is dead. It's important both
       | for reliability (don't kill perfectly working things because you
       | end up having to design around it), and for maintainability as
       | you've now made people afraid of upgrading Consul or other co-
       | dependent services. Other similar failures are single-point-of-
       | testing availability checking where instead you probably want two
       | points of truth before shooting a service.
       | 
       | Now you risk people being afraid of upgrading probably anything,
       | which means they will work around it, abstract it, or needlessly
       | replace it with something they feel safer managing. The latter is
       | at best a waste of time, at worst a time bomb until you find out
       | what conditions this new thing breaks under.
       | 
       | This isn't advocating that you design without assuming anything
       | can fail anywhere at any time; just pointing out that how often a
       | service _should_ fail is extremely useful information when
       | designing systems and designing fail safes, alerts, monitoring,
       | etc.
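The two safeguards suggested above (a tripwire based on the expected failure
rate, and requiring two points of truth before shooting a service) can be
sketched roughly as follows. Everything here is an assumption made for
illustration: the class and function names, the thresholds, and the shape of
the health observations. None of it reflects how Mcrib actually works.

```python
from collections import deque


class ReplacementTripwire:
    """Refuse automatic replacements once they exceed an expected rate.

    Hypothetical sketch; the limits are assumptions, not Mcrib's.
    """

    def __init__(self, max_replacements, window_seconds):
        self.max_replacements = max_replacements
        self.window = window_seconds
        self._recent = deque()  # timestamps of replacements in the window

    def allow(self, now):
        # Forget replacements that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window:
            self._recent.popleft()
        if len(self._recent) >= self.max_replacements:
            return False  # tripped: stop culling and alert a human instead
        self._recent.append(now)
        return True


def should_replace(node, observations, tripwire, now):
    """Replace a node only if two independent checks say it is dead
    and the tripwire has not been exceeded."""
    dead_votes = sum(1 for healthy in observations[node].values() if not healthy)
    return dead_votes >= 2 and tripwire.allow(now)


# Usage: one failing signal (say, a local agent being down) is not enough.
tripwire = ReplacementTripwire(max_replacements=2, window_seconds=60)
observations = {
    "mc1": {"agent_check": False, "direct_probe": True},
    "mc2": {"agent_check": False, "direct_probe": False},
}
print(should_replace("mc1", observations, tripwire, now=0))
print(should_replace("mc2", observations, tripwire, now=0))
```

The design choice is that a tripped wire fails closed: the automation stops
replacing nodes and leaves the decision to an operator, on the assumption that
a burst of "failures" far above the normal rate more likely points at the
detector than at the cache nodes.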
        
         | bognition wrote:
         | It's likely that the memcached install is so large that the
         | underlying instances themselves are failing. When you have
         | hundreds or thousands of instances, failures in the instances
         | themselves become pretty regular.
        
       | belter wrote:
       | A Date that is both a Palindrome and an Ambigram:
       | 
       | https://www.jagranjosh.com/general-knowledge/22-02-2022-is-b...
        
         | sva_ wrote:
         | Well, not the way they format it.
        
       ___________________________________________________________________
       (page generated 2022-04-26 23:00 UTC)