[HN Gopher] An update on recent service disruptions
       ___________________________________________________________________
        
       An update on recent service disruptions
        
       Author : todsacerdoti
       Score  : 91 points
       Date   : 2022-03-23 20:39 UTC (2 hours ago)
        
 (HTM) web link (github.blog)
 (TXT) w3m dump (github.blog)
        
       | renewiltord wrote:
       | Huh, one guy on HN _did_ say it was the DB that was the problem
       | earlier on. Neat!
        
       | speedgoose wrote:
       | It's interesting to read that so many systems and activities are
        | dependent on a single point of failure: the main primary MySQL
       | node at GitHub.
        
         | eatonphil wrote:
         | I can imagine what they're in a rush to refactor.
        
         | cyberpunk wrote:
          | I mean, they have $MEGABUCKS; they could probably get 1/2 the
          | team that maintains mariadb to come in and work for them if
          | they wanted, and yet they still have a giant single db node
          | doing writes and struggle to fail it over.
         | 
         | We're doomed >_<
         | 
         | You would think it wouldn't be _THAT_ hard to shard something
         | like GitHub effectively.
         | 
         | I mean, all user accounts/repos starting with the letter 'a' go
         | to the 'a' cluster and so on seems not exactly science-fiction
         | levels of technology.
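          | 
          | A toy sketch of that kind of prefix routing (names made up,
          | purely illustrative; real setups usually hash the key, since
          | letter prefixes give very uneven shard sizes):
          | 
          |     import string
          | 
          |     # one cluster per leading letter of the user/repo name
          |     SHARDS = {c: f"db-{c}" for c in string.ascii_lowercase}
          | 
          |     def shard_for(name: str) -> str:
          |         return SHARDS.get(name[0].lower(), "db-other")
          | 
          |     print(shard_for("rails"))  # -> db-r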
        
           | [deleted]
        
           | throwusawayus wrote:
           | it's profoundly strange that github has not properly sharded
            | yet. essentially all large social networks use or have used
            | sharded mysql successfully; this is not rocket science
           | 
           | livejournal, facebook, twitter, linkedin, tumblr, pinterest
            | all use (or formerly used) sharded mysql, and most of these
            | are at a larger db size than github
           | 
           | i will also repeat my comment from another recent thread: i
           | just cannot understand how 20+ former github db and infra
           | people recently left to join a db sharding company. this
           | makes no sense whatsoever in light of github's lack of
           | successful sharding. wtf is going on in the tech world these
           | days
        
             | samlambert wrote:
              | This is just a gross simplification of the situation. You're
              | commenting from the outside without much context. GitHub is
              | a 14-year-old Rails app that is very complex; doing large
              | migrations of database platforms can be very difficult and
              | take a long time.
        
               | throwusawayus wrote:
               | You seem totally fine with making gross simplifications
               | about your competitors at aws:
               | https://news.ycombinator.com/item?id=30459692
        
               | ketzo wrote:
               | This is a super, super annoying type of comment, and
               | against HN guidelines. It adds absolutely nothing to the
               | conversation. Don't go digging through people's comment
               | history to throw mud.
        
             | drewbug01 wrote:
             | > i just cannot understand how 20+ former github db and
             | infra people recently left to join a db sharding company.
             | this makes no sense whatsoever in light of github's lack of
             | successful sharding.
             | 
             | I believe you have the chain of causality backwards here.
              | In fact, I think it suggests that the talent that went to
              | PlanetScale is perhaps not the issue.
        
               | throwusawayus wrote:
               | i'm not suggesting that these departures are the cause of
               | github's issue. rather, i'm saying i don't understand why
               | such a large group from github was hired by planetscale
               | if they did not have experience successfully sharding or
               | successfully leveraging vitess
               | 
                | it's like building a high-rise condo: would you hire
                | the architects or management company from the building
                | that collapsed in surfside florida? sure, they know
                | what NOT to do next time, but that doesn't mean they
                | know what TO do
        
               | samlambert wrote:
               | Hi, former GitHubber here and CEO of PlanetScale. I came
               | to PlanetScale after seeing the incredible impact that
               | Vitess had at GitHub. The team we have hired here
                | successfully sharded parts of GitHub's very large
               | platform. We continue to shard very large customers at
               | PlanetScale.
               | 
                | Platform migrations take a very long time and they're
                | very complicated, especially with decade-old codebases. I
                | will say the current team at GitHub is nothing but
               | outstanding people and engineers with a difficult task of
               | managing a very large deployment.
        
               | drewbug01 wrote:
               | I understand what you're getting at... but why is it you
               | presume that the employees now at planetscale are the
               | reason GH couldn't shard out their databases?
               | 
               | Like, there's another angle here: management, yeah?
               | 
               | Another way of reframing it is "maybe the folks hiring at
               | planetscale know the inside baseball about GH
               | infrastructure". For example:
               | https://www.linkedin.com/in/isamlambert
        
               | throwusawayus wrote:
               | this is my exact point: they hired the former head of GH
               | infrastructure - _literally the person directly
               | responsible for all this at github for years_ - and made
               | him their ceo
               | 
                | github should have sharded _years ago_, every other
               | large mysql user did so much earlier in their growth
               | trajectory
        
               | drewbug01 wrote:
               | Ah, that's a more specific point than what you seemed to
               | have been making before.
        
           | sillysaurusx wrote:
           | > they could probably get 1/2 the team who maintains mariadb
           | to come in and work for them if they wanted
           | 
           |  _The Mythical Man Month_ has a few things to say about that.
           | 
           | (It's tempting to feel that the information is outdated, but
           | in my experience it still seems true.)
        
             | cyberpunk wrote:
              | My point was that the reason they've not done that is
              | because it wouldn't help.
              | 
              | This is an architectural problem; even if they had the
              | massive expensive brains behind something like mysql on
              | their team, they couldn't fix it.
             | 
             | (at least, I'm guessing, I think this kinda architecture
             | doesn't scale even if they could kick the can down the road
             | a few times..)
        
               | sillysaurusx wrote:
               | Oh, my apologies! Yes, that makes sense.
        
         | prepend wrote:
         | My fear is that this seems like a cover excuse for moving off
         | MySQL. The bug will be too hard and they'll move off. They will
         | choose SQLServer and take a lot of time to convert and then
         | have even more outages.
        
           | drewbug01 wrote:
           | At GitHub's scale, you don't just "move off" a database. At
           | best it would be a gradual project that would take _years_
           | for the company to complete, and likely trigger additional
           | incidents along the way.
        
             | samlambert wrote:
             | Correct.
        
           | nimbius wrote:
           | 100% agreed. a lift-and-shift migration to Galera and modern
            | MariaDB wouldn't be hard, but knowing MS there are middle
            | managers waiting in the wings to swoop in and drive this into
           | the ground with azure/sqlserver, the former of which posted 8
           | outages in the past 90 days alone.
           | 
           | this is classic Microsoft. spend a ton of money for something
           | very valuable -- in this case virtually all developer
            | marketshare -- and then casually run it into the ground
            | while you lie about the KPIs to C-levels (IIS marketshare on
           | netcraft as a function of parked websites at GoDaddy to
           | dominate over Apache) and keep it on life support with other
           | revenue streams (XBox) for the next 16 quarters until it
           | becomes a repulsive enough carbuncle to shareholders that it
           | gets the axe (Microsoft phone.) then in a year, limp into the
           | barn with another product nobody else but you could afford to
           | buy (minecraft) and slowly turn it into a KPI farm for
           | Microsoft account metrics to drive some other failing product
           | (Azure) and keep the C level happy while you alienate
           | virtually every player with mechanics or requirements they
           | hate.
        
             | throwusawayus wrote:
             | galera is not a solution for scaling out writes, full stop
             | 
             | galera has lower max writes/sec than a traditional async
              | single master because it's a synchronous cluster: the
              | other members need to ack every write, and all members are
             | doing all the writes, so adding machines does not increase
             | your max writes
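              | 
              | back-of-envelope sketch (numbers completely made up, just
              | to show the shape of it): with galera-style replication,
              | capacity stays flat as you add nodes, while sharding
              | splits the write load across shards:
              | 
              |     PER_NODE = 10_000  # hypothetical max writes/sec
              | 
              |     def galera_max(nodes, overhead=0.15):
              |         # every node applies every write, so extra
              |         # nodes add no write capacity
              |         return PER_NODE * (1 - overhead)
              | 
              |     def sharded_max(shards):
              |         # each shard takes a disjoint slice of writes
              |         return PER_NODE * shards
              | 
              |     for n in (1, 3, 5):
              |         print(n, galera_max(n), sharded_max(n))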
        
             | prepend wrote:
             | It's funny you mention Minecraft. My kids recently said "I
             | hate Microsoft. All they did is ruin Mojang."
             | 
             | They don't know Microsoft for anything other than ruining
             | Minecraft. They didn't know Microsoft made the Xbox or even
             | Windows.
             | 
             | They made this statement after Microsoft forced them to
              | migrate the account they'd had for 5 years to a
             | Microsoft account. That broke their computer for a few days
             | and reset their games. For no useful reason.
        
           | speedgoose wrote:
           | I bet on Azure Cosmos DB.
        
           | protomyth wrote:
           | Why would Microsoft not migrate to SQL Server? MySQL is owned
           | by Oracle. Microsoft cannot be happy about using a product
           | from Oracle. SQL Server is a pretty good product and the
           | conversion will give them even more tools and expertise for
           | their consulting wing to do it for other companies.
        
             | Ygg2 wrote:
             | They do seem to be acting like they are cool and OSS
             | friendly.
        
         | wincent wrote:
         | Linked in the article is this other one, "Partitioning GitHub's
         | relational databases to handle scale"
         | (https://github.blog/2021-09-27-partitioning-githubs-
         | relation...). That describes how there isn't just one "main
         | primary" node; there are multiple clusters, of which `mysql1`
         | is just one (the original one -- since then, many others have
         | been partitioned off).
        
           | throwusawayus wrote:
           | from that article it sounds like they are mostly doing
           | "functional partitioning" (moving tables off to other db
           | primary/replica clusters) rather than true sharding
           | (splitting up tables by ranges of data)
           | 
           | functional partitioning is a band-aid. you do it when your
           | main cluster is exploding but you need to buy time. it
           | ultimately is a very bad thing, because generally your whole
            | site is dependent on every single functional partition being
           | up. it moves you from 1 single point of failure to N single
           | points of failure!
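            | 
            | crude sketch of the difference (all names made up):
            | functional partitioning routes by table, so a typical
            | request still touches several clusters and breaks if any
            | one of them is down; true sharding routes rows of one
            | table by key, so an outage only hits that slice of data:
            | 
            |     # functional partitioning: whole tables move to
            |     # separate clusters
            |     TABLE_TO_CLUSTER = {
            |         "repositories": "mysql1",
            |         "notifications": "mysql2",
            |         "ci_jobs": "mysql3",
            |     }
            | 
            |     # horizontal sharding: one table's rows split by key
            |     def repo_shard(repo_id, shards=4):
            |         return f"repos-{repo_id % shards}"
            | 
            |     print(repo_shard(1234))  # -> repos-2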
        
             | wincent wrote:
             | Towards the end:
             | 
             | > In addition to vertical partitioning to move database
             | tables, we also use horizontal partitioning (aka sharding).
             | This allows us to split database tables across multiple
             | clusters, enabling more sustainable growth.
        
               | throwusawayus wrote:
               | yes i know, the fact that this is a small blurb at the
               | bottom of the article (which is largely about functional
               | partitioning) exactly proves my point
        
       | egberts1 wrote:
        | The hardest part is drafting a series of questions for the end-
        | user to understand and answer before we get that "MAGIC-PRESTO-
        | BLAM-WHIZ" configuration file that just works.
        | 
        | I blame the program providers.
        | 
        | Some Debian maintainers are trying to do this simple querying of
        | complex configurations (`dpkg-reconfigure <package-name>`). And I
        | applaud their limited inroad efforts there because no one else
        | seems to have bothered.
        
       | paxys wrote:
       | Outage checklist
       | 
       | - Was it DNS?
       | 
       | - Was it a bad config update?
       | 
       | - Was it an overloaded single point of failure?
       | 
       | There's rarely a #4
        
         | doublerabbit wrote:
         | #5 Is it plugged in?
         | 
          | Build and configure the server, drive it a few hours down to
          | the DC, rack the server. Get back home, try to access it. No
          | luck. Turned out I had forgotten to connect power and turn it
          | on.
        
         | HL33tibCe7 wrote:
         | A proposed fourth: was it BGP?
         | 
            | Although most BGP incidents fall under "bad config update"
            | (and likewise that applies to DNS).
        
           | jenny91 wrote:
           | When has BGP caused a serious outage at a website?
        
             | karlding wrote:
             | The big Facebook outage from last fall [0]?
             | 
             | [0] https://blog.cloudflare.com/october-2021-facebook-
             | outage/
        
               | jenny91 wrote:
                | The cause wasn't BGP though; it just got caught in the
                | middle of a big mess and made it very hard to undo a
                | botched config change.
        
             | [deleted]
        
         | karmakaze wrote:
         | Or an actual DDoS, not self-inflicted.
        
       | longcommonname wrote:
       | Our internal monitoring has seen more outages than they listed
        | here. There have been 4 full days where GitHub Actions was
        | somewhere between completely broken and degraded status.
       | 
       | It's nice to finally get some comms, but this is incredibly late
       | and incomplete.
        
       | calcifer wrote:
        | TLDR: They still don't know why this is happening.
        
         | rvz wrote:
          | Oh dear. So other than those double outages, I should expect,
          | at the very least, that GitHub will have one outage every
          | month, not 3 or 5 a month?
         | 
         | I don't think we are going to see GitHub be up for a full month
         | without an incident anytime soon.
         | 
          | I guess my entire comment chain [0] on GitHub's situation
          | has aged well for two straight years then, especially
          | yesterday's one: [1].
         | 
         | [0] https://news.ycombinator.com/item?id=30779275
         | 
         | [1] https://news.ycombinator.com/item?id=30767821
        
       ___________________________________________________________________
       (page generated 2022-03-23 23:00 UTC)