[HN Gopher] Slack's Incident on 2-22-22 ___________________________________________________________________ Slack's Incident on 2-22-22 Author : alphabettsy Score : 122 points Date : 2022-04-26 17:26 UTC (5 hours ago) (HTM) web link (slack.engineering) (TXT) w3m dump (slack.engineering) | scrollaway wrote: | A post-mortem with non-ISO dates? Even on THAT date?! :) | smegsicle wrote: | looks unambiguous to me lol | mulmen wrote: | Ok, what is the format string? a) m-dd-yy | b) m-yy-dd | | I can't tell. How do you disambiguate? | rjh29 wrote: | It can't be either, 22 is not a valid month. | | I agree with you though, the point of a date like yyyy-mm- | dd is to avoid working out stuff like this. You don't pick | a date format based on whether the current date is | ambiguous or not. | mulmen wrote: | Good catch. I updated my post. The question remains, how | can this format be disambiguated? | | Agreed, this is why ISO8601 exists. | 58x14 wrote: | Just my personal take, I think this is a really well-written | incident postmortem. It's specific, extensive, candid, and dare I | say, entertaining? | | Many incident reports are fully lacking in any meaningful detail, | or wholly unapologetic. I actually enjoyed learning tidbits about | the author, in particular their mention of | https://how.complexsystems.fail/. | | Reading this boosted my confidence in Slack's teams, which should | ultimately be the objective of a release like this. It's not pure | PR nor a gruff legally-obligated disclosure. | | It helps that I wasn't really affected by this incident. | vxNsr wrote: | The fact that the current top comment thread is quibbling about | the date format in the title seems to agree with this | assessment, if there was anything real to complain about that's | what we'd be seeing, instead we get bikeshedding on the date | format in the title of a post. | dijonman2 wrote: | Which is aligned with US date formatting. 
| ttul wrote: | On today's internet, a frank post mortem delivers value to | customers and PR gold too. | _justinfunk wrote: | I'm looking forward to http://thisincidentreportdoesnotexist.com | launching sometime later this year. | notacoward wrote: | I love the diagrams of the cache<->DB cycle in normal vs. | degenerate states. Those illustrate the problem very clearly and | succinctly, and I hope they make it into a textbook some day. | Kudos. | godmode2019 wrote: | Tidbit: | | 2-22-22 was also when Russia invaded Ukraine | | And Joe Biden's statement about the invasion was on 2-22-22 2:22pm | on the dot. | | I could not figure out the significance of this more than 11:11 | was when WW1 ended, but it's probably something else. | Kwpolska wrote: | Nope, you're off by two days: | https://en.wikipedia.org/wiki/2022_Russian_invasion_of_Ukrai... | oxfordmale wrote: | Which major outage? According to the Slack uptime, there was | barely 1.5 hours of outage :-) | | P.S. | | Yes I know the uptime is decided by committee, and doesn't reflect | reality. I am just being cynical. | olliej wrote: | I find reading about these incidents super interesting, and I | generally find the work performed by the folk keeping these | services running (and dealing with the inevitable falling over of | any computer system). | | At the same time it seems like a horrifying job I would never | ever want :D | diarrhea wrote: | That date format is actually the worst I have ever encountered. | m-d-y, with year in 2 digits, numbers not zero-padded, US "order" | yet using dashes. It's like a moderator of /r/ISO8601 came up | with the worst possible format _on purpose_. Am I missing | something? 
| HiJon89 wrote: | I expected the top comment on hackernews to be something this | pedantic and irrelevant to the content, and I was not | disappointed | elpakal wrote: | I had to read the parent comment twice to understand that it | was talking about the date in the title of the post and not | anything relevant whatsoever. | missedthecue wrote: | I mean, there is literally no way to confuse it with another | date, unless you go back 100 years, when Slack didn't exist. | | There is no 22nd month, so we know the 22s are the day and the | year, leaving only the 2 to be the month. Is it really that | difficult to parse? | johannes1234321 wrote: | > leaving only the 2 to be the day | | I think you meant "to be the month" there. qed# | diarrhea wrote: | Beautiful. | bamboozled wrote: | So we could either: | | a) Use a far superior date format which nearly the entire | world uses by default and is better and simpler in many ways. | | b) Do logic when we see dates to try to work out what format the | date is in. | | Going with _a_ seems like a no brainer... | verve_rat wrote: | Especially in this scenario where you are communicating to | an international audience. | burnte wrote: | Yeah, but what about that party that I threw on 10/11/12. Did | I set it for November 12, 2010, or October 11 in 2012? Or | somewhen else? | tjoff wrote: | I'm so tired of having to do that game every time I see a | date. It is not hard, but it is quite annoying. Especially | since it isn't solvable in a lot of cases, so you try to | reason your way to the most realistic interpretation. | | It shouldn't be this hard. | true_religion wrote: | It's not really that hard. Like the imperial system, | Americans just memorize how it works as children and don't | think about it anymore. | | Think about it like speaking a different language, except | with numbers and not words. | colejohnson66 wrote: | Really HN? We're seriously downvoting this comment to | oblivion? 
I get non-Americans get passionate in their | anger at imperial units, but this person is just | explaining why it's natural to us. | dllthomas wrote: | The complaint isn't about the particular other order, but | the fact that the order is ambiguous. In this case that | doesn't matter, but often it does. | | Americans memorize inches and yards, and often also | memorize centimeters and meters, and working with | _either_ is fine, but we're not so often faced with | numbers where it might be inches _or_ centimeters and we | have to figure out which (and when we are, it's | sometimes a pain - certainly a bigger pain than working | with known units). | | Or, working with your language analogy, please go fetch | me some "pasta" without knowing whether I'm speaking | Italian or Polish. | xeromal wrote: | Context matters in your pasta scenario. | dllthomas wrote: | Since the text itself doesn't clarify, context is the | only way of resolving any of the scenarios. In each case | it's usually sufficient and often not all that hard. But | it's always harder than if the system in use was made | explicit, and I understand the complaint (even if my | annoyance at the ambiguity is quite significantly below | the level where I would have complained myself, | particularly in this case). | erpellan wrote: | Context also matters in the date parsing scenario. | 11/12/22 could be several different dates depending on | the context. | xeromal wrote: | Yeah, and in this post, it's clear what it is. | ldh wrote: | It may not have happened to you yet, but someday you'll | see a date somewhere other than this post. | xeromal wrote: | This honestly has me laughing. | ascar wrote: | > Think about it like speaking a different language | | The correct analogy is I don't know which language is | spoken and the same words get used in multiple languages | with different meaning. Now I can apply heuristics to | figure it out or in some cases I can only guess. 
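The disambiguation argument made upthread (no 22nd month, so only one reading of 2-22-22 survives, while 10/11/12 stays open) can be made mechanical. What follows is a minimal sketch in Python using only the standard library. The function name `candidate_dates` and the `century` parameter are hypothetical, chosen for illustration, and two-digit years are assumed to belong to the 2000s.

```python
from datetime import date
from itertools import permutations


def candidate_dates(text, century=2000):
    """Return every valid calendar date that three numeric fields could denote.

    Hypothetical helper: it tries each field as year, month and day in turn
    and keeps only the combinations that form a real date.
    """
    # Normalise the common separators so no format is assumed up front.
    for sep in "-/.":
        text = text.replace(sep, " ")
    fields = [int(part) for part in text.split()]
    if len(fields) != 3:
        raise ValueError("expected exactly three numeric fields")

    found = set()
    for year, month, day in permutations(fields):
        # Assumption: a two-digit year falls in the given century.
        full_year = year + century if year < 100 else year
        try:
            found.add(date(full_year, month, day))
        except ValueError:
            continue  # this ordering is not a real calendar date
    return sorted(found)


if __name__ == "__main__":
    print(candidate_dates("2-22-22"))    # a single surviving reading
    print(candidate_dates("10/11/12"))   # several equally valid readings
    print(date.fromisoformat("2022-02-22"))  # ISO 8601 needs no guessing
```

With these assumptions the title's date has exactly one valid reading, while 10/11/12 has six, which is the point both sides of the argument are making: the heuristic works until it doesn't.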
| huhtenberg wrote: | I just thought it was a typo. | noselasd wrote: | That particular date is possible to understand, but the date | format is not. (Is it really that fun to try to figure out | what 12-11-21 means ?) | CPLX wrote: | November 21, 1812 | | What do I win? | jonpurdy wrote: | Came here to complain specifically about this. 2022-02-22 is | unambiguous, big endian, and sorts nicely. IDK why society | still uses any other date formats considering how international | everything is. | [deleted] | gleenn wrote: | It's because people for hundreds of years have been saying | "March second, nineteen sixty two" which they then write out | in that order. As a programmer, people's frustrations are | understandable, but you're a bit naive if you think even a | percentage point of the speaking population of the world | knows or is concerned with big endian-ness or sortability. | However they speak English, at least in America, in that | order, and that's the way they write it. Europeans only got | it a little better. | [deleted] | theamk wrote: | We do say "five dollars" while writing "$5", so saying and | writing different things is not unheard of. | | And endianness / sorting comes up in real life pretty often | - scanning for large numbers in the price list, or finding | stuff in the sorted list. | | I think if history turned differently, we could have had | sane time format in the US. | ascar wrote: | > Europeans only got it a little better. | | There is a reasonable argument for little endian dates (as | in the least significant information is usually the most | relevant as it changes most often), but apart from the "it | has been like this forever" I don't see any reasonable | argument for middle endian date formats. Then again, the US | is notoriously resistant to the metric system too. | mumblemumble wrote: | Your error is expecting reasonableness. All linguistic | conventions are either arbitrary or lost to time, and | mostly only exist for tradition's sake. 
| lilyball wrote: | It's because it matches the way we speak dates aloud. | When intended for human consumption, sortability and big- | endianness doesn't matter, but matching the way we speak | does. Maybe other cultures actually speak dates | differently, I don't know, but I have never seen a native | English speaker habitually speak dates any differently | than "January 1st, 2001". | | All that said, I definitely agree with the original | complaint, m-dd-yy is an atrocious format. If you're | going to use dashes, stick with yyyy-mm-dd. Replacing the | dashes with slashes, as in 2/22/22, would have been fine. | rmccue wrote: | "the twenty-sixth of April" would be the way I say | today's date and anecdotally is in common usage in both | countries I've lived in (the UK and Australia, both using | d/m/y). I'd say it's about as frequent as "April the | twenty-sixth" by itself, and definitely more common if | you include the day ("Tuesday, the twenty-sixth of | April"). | ChrisKnott wrote: | In the UK I think "1st of January" is probably slightly | more common than "January the 1st" although you hear | both. "January 1st" (no "the") sounds American. | verve_rat wrote: | I'm from NZ and it is 100% normal to switch back and | forth between "The second of March" and "March the 23rd". | | People I have met from Australia, South Africa, the UK, | all have the same flexibility. | [deleted] | jc_811 wrote: | Oh this is a great point! I'd never realized that. I know | that in Spanish (and I assume many of the romance | languages) we always say the day first, eg dos de febrero | (2nd of February). In American English even though the | day first technically is grammatically correct, we pretty | much never say it in that order (February 2nd instead of | the 2nd of February) | ksdnjweusdnkl21 wrote: | Have they not been doing the same with "fourth of July"? Or | is this an exception? | gleenn wrote: | Our Independence Day is probably a special case. 
Clearly | language is flexible enough to say all the formats, but | the date format we write matches the most common | verbalization. | Hjfrf wrote: | The one exception I can think of is a bug in the mssql | datetime type (but not date or datetime2) where strings in | that format are assumed to be yyyy-dd-mm if the locale | dateformat is dmy (e.g. British English). | charcircuit wrote: | It's a shortened version of "February 22, 2022" | | It doesn't seem that bad to turn it into 2-22-22. | lucideer wrote: | It's also a shortened version of "22nd of February, 2022". | smcl wrote: | The issue is a great deal of the rest of the world don't do | this, so you need to decide whether to apply best-guess | heuristics to parse it or decide that it's a typo ("ah | there's not 22 months, so maybe it's the 22nd of February or | someone fat-fingered the 2nd of February...?"). | | In this case you can lookup Slack outages to disambiguate it, | but the frustration here - and I share it - is directed at | the stubborn refusal to use a standard format that the rest | of the world has agreed upon. | Invictus0 wrote: | It's a quirky nod to the fact that all the digits were 2 on | that day in this format. | vxNsr wrote: | > _Am I missing something?_ | | Yes, the numbers are all the same, and the author is based in | the US, and thus is using the default format in the US. So odd | that this is the top comment. | samstave wrote: | OMG - I thought I clicked on the tablet thread regarding | Sumerian OOOs -- and I thought you were sarcastically making | fun of the way the Sumerians captured dates on limestone | tablets ~4,000 years ago... | | (I had scrolled immediately down, so the thread title wasn't | visible when I was reading your comment) | | haha | adamomada wrote: | This is what you sometimes see for best-before dates in Canada. | Even better, because our dates are "supposed to" be like 22/2 | but I don't think anyone here does that, except Quebec perhaps. 
| Sometimes you just have no clue | rcthompson wrote: | The point is to describe the date using only the number 2. | neerajk wrote: | "Mcrib is objectively a better system for generating memcached | configurations -- but its efficiency made the broader system | behave in a less safe way." Be good but not _that_ good :) | pierrebai wrote: | Sometimes a new roll-out causes an outage; sometimes roll-outs are | delayed due to the overall system architecture. Reading the post- | mortem, I could not help but be reminded of this issue as | described here: https://www.youtube.com/watch?v=y8OnoxKotPQ | epmatsw wrote: | McRib is a hilarious name for a service | whoopdedo wrote: | I certainly wouldn't trust its availability. | | (The original McDonald's McRib sandwich is well known for only | being sold a limited time.) | | So "Mcrouter" comes from Memcache-Router, then the obvious | McDonalds jokes are made and someone cleverly suggests "Mcrib" | for the next service. But I can't think what the backronym | would be for it. Memcache Ring Buffer maybe. Or Broker. | phan wrote: | memcache router interface broker | bee_rider wrote: | Apparently it generates configurations for Mcrouter. Could be | MemCache-Router Instance Borker. | true_religion wrote: | I think you meant Broker, but the misspelling is an act of | genius since we are talking about downtime caused by an | infrastructure failure. | qubyte wrote: | And I misread it as McBorker and now I can't stop | chuckling. | erichurkman wrote: | McBorker, Chaos Monkey's cousin. | bee_rider wrote: | Does it make it less of an act of genius if it was | intentional? | jonah-archive wrote: | RIB is a common term in networking for "Routing Information | Base" (being the set of all routes which could be chosen to | be installed in the routing table (or FIB -- "Forwarding") by | the control plane). I don't know that this is the actual | etymology but it's not implausible. 
| mescaline wrote: | An over communication platform should have scheduled outages like | this regularly! | alex-zierhut wrote: | What motivation would someone have to run a scheduled outage? I | can't think of any. | AndrewUnmuted wrote: | Something about all of this feels like a scheduled outage, to | me. | | I am suspicious, though cannot back this up at all, that they | were ready for this incident and may have even planned for it. | jdlshore wrote: | This sort of handwavy conspiracy thinking is distressingly | common. What basis do you have for your suspicion? Is it just | "big company bad"? | AndrewUnmuted wrote: | > distressingly | | You're distressed by my thinking? That's odd. | | Slack is a terrible product that engulfs the worker inside | a dead-eyed grunt culture, featuring an endless spree of | work-life balance destroyers. It might be great for people | who ask for things from others, but for the people who have | to actually do the thing being asked of them, Slack is a | nightmare world. | | Anyone with the psychology to make a product like Slack, is | likely to engage in handwavy conspiracy thinking | themselves. | alphabettsy wrote: | > engulfs the worker inside a dead-eyed grunt culture, | featuring an endless spree of work-life balance | destroyers. It might be great for people who ask for | things from others, but for the people who have to | actually do the thing being asked of them, Slack is a | nightmare world. | | I think it depends on the organization and how you use | it. In a previous role I would've agreed with you. People | expected you to reply at all hours, where I am now that | isn't the case. | | Tools do not create toxic culture or destroy work-life | balance. Organizations do that. | mulmen wrote: | I choose to believe it is not common thinking but instead | commonly verbalized among the minority with such thoughts. 
| encryptluks2 wrote: | > including the author -- which certainly made my role as | Incident Commander more challenging! | | As if no other way to communicate exists? | | I remember using Slack, feeling fed up with emails, until I | realized that if I wanted to sync Slack messages offline and have | a standard way to view these messages that I was SOL. I am so | glad that I've returned to email and optimized my workflow to use | email effectively and efficiently. The best part is no more | vendor lock-in. | orf wrote: | *22/2/22 | adamomada wrote: | You know how the date looked strange to you? It's the same for | your correction, but for other people | orf wrote: | For a statistically insignificant portion of people, sure. It | doesn't make it any less correct. | 4ggr0 wrote: | For the whole of Europe it would be 22.02.2022, how is all | of Europe statistically insignificant? | orf wrote: | The official EU rules say 22.02.2022, but nobody in | Europe would have trouble parsing 22/2/22 or any | variation thereof. And the / (or -) separator is indeed | used in parts of the EU. | | It's the ordering that's significant, not the separator. | dormando wrote: | Hi! I'd like to offer some hopefully useful information if any | Slack folks end up reading this, or anyone else with a similar | infrastructure. I'll start with some tech and make a separate | philosophical comment. | | Also caveat: I have no deep view into Slack's infrastructure so | anything I say here may not even be relevant. YMMV. | | First some self promotion: | https://github.com/memcached/memcached/wiki/Proxy memcached | itself is shipping router/proxy software. Mcrouter is difficult | to manage and unsupported. This proxy is community developed, | more flexible, likely faster, and will support more native | features of memcached. We're currently in a stabilization round | ensuring it won't eat pets but all of the basic features have | been in for a while. 
Documentation and example libraries are | still needed but community feedback helps speed those up | tremendously (or any kind of question/help request). | | It's not clear to me why memcached is being managed like this; | mcrouter seems to only be used to abstract the configuration from | the clients. It has a lot of features for redundant pools and so | on. Especially with what sounds like globally immutable data and | the threat of cascading failures during rolling upgrades it | sounds like it would be very helpful here. | | If cost or pool sizes are the main reasons why the structure is | flat, using Extstore | (https://github.com/memcached/memcached/wiki/Extstore) can likely | help. Even if object value sizes are in the realm of 500 bytes, | using flash storage can still greatly reduce the amount of RAM | necessary or reduce the pool size (granted the network can still | keep up) with nearly identical performance. Extstore takes a lot | of tradeoffs (i.e., keeping keys in RAM) to ensure most operations | don't actually write to flash or double-read. Extstore's in use | in tons of places and everyone's immediately addicted. | | Finally, the Meta Protocol | (https://github.com/memcached/memcached/wiki/MetaCommands) can | help with stampeding herds to help keep DB load from exploding | without adding excess network roundtrips under normal conditions. | I've seen lots of workarounds people build but this protocol | extension gives a lot of flexibility you can use to help survive | degraded states: anti-stampeding herd, serve-stale, better | counter semantics, and so on. | dormando wrote: | Now a more philosoraptor style comment: I see Mcrib is a service | built to quickly detect and replace memcached instances. I treat | memcached in infrastructure as a very stable service. Meaning it | is infrequently necessary to upgrade it, and it will generally | not fail on its own. 
If it does it will be highly infrequent | compared to services with higher churn or more | complexity/dependencies. This means if they're failing often | enough that you need to rapidly detect and replace them you have | a more fundamental problem. | | From a structural standpoint I think my technical comment can be | useful. If things really are failing this much A) you should | figure out why and slow that down. B) if you have a generally | stable system and understand the typical rate of failure, you can | add tripwires into Mcrib to avoid over-culling services and | loudly raise alarms. Then C) you can improve technical | reliability with redundancy/extstore/etc. | | I've also seen plenty of times where folks have a dependency of a | service determine if that service is usable, which I disagree | with quite strongly. Consul being down on a node should trigger | something to consider if the service is dead. It's important both | for reliability (don't kill perfectly working things because you | end up having to design around it), and for maintainability as | you've now made people afraid of upgrading Consul or other co- | dependent services. Other similar failures are single-point-of- | testing availability checking where instead you probably want two | points of truth before shooting a service. | | Now you risk people being afraid of upgrading probably anything, | which means they will work around it, abstract it, or needlessly | replace it with something they feel safer managing. The latter is | at best a waste of time, at worst a time bomb until you find out | what conditions this new thing breaks under. | | This isn't advocating that you design without assuming anything | can fail anywhere at any time; just pointing out that how often a | service _should_ fail is extremely useful information when | designing systems and designing fail safes, alerts, monitoring, | etc. 
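The tripwire idea from point B in the comment above could look roughly like the following. This is a minimal sketch in Python and not a description of how Mcrib actually works: `RemovalTripwire`, its parameters, and the thresholds in the usage lines are all hypothetical names and values chosen for illustration. The assumption is that the operator knows roughly how many instance failures per time window are normal, and anything above that should stop automated culling and page a human instead.

```python
import time
from collections import deque


class RemovalTripwire:
    """Hypothetical guard: allow at most `max_removals` per `window_seconds`.

    The caller asks before culling an instance. A False answer means the
    removal rate exceeds what is considered normal, so the caller should
    leave the instance in place and raise an alarm instead.
    """

    def __init__(self, max_removals, window_seconds, clock=time.monotonic):
        self.max_removals = max_removals
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._removals = deque()    # timestamps of recently allowed removals

    def allow_removal(self):
        now = self.clock()
        # Forget removals that have aged out of the sliding window.
        while self._removals and now - self._removals[0] >= self.window_seconds:
            self._removals.popleft()
        if len(self._removals) >= self.max_removals:
            return False            # tripwire hit: stop culling, alert loudly
        self._removals.append(now)
        return True


if __name__ == "__main__":
    # Example values only: at most 2 removals in any 60 second window.
    tripwire = RemovalTripwire(max_removals=2, window_seconds=60)
    for host in ["cache-1", "cache-2", "cache-3"]:
        if tripwire.allow_removal():
            print("removing", host)
        else:
            print("refusing to remove", host, "and raising an alarm")
```

The limit itself is the interesting design input: it encodes how often the service _should_ fail, which is exactly the information the comment argues is worth knowing.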
| bognition wrote: | It's likely that the memcached install is so large that the | underlying instances themselves are failing. When you have | hundreds or thousands of instances, failures in the instances | themselves become pretty regular. | belter wrote: | A Date that is both a Palindrome and an Ambigram: | | https://www.jagranjosh.com/general-knowledge/22-02-2022-is-b... | sva_ wrote: | Well, not the way they format it. ___________________________________________________________________ (page generated 2022-04-26 23:00 UTC)