[HN Gopher] An update on recent service disruptions
___________________________________________________________________
An update on recent service disruptions

Author : todsacerdoti
Score  : 91 points
Date   : 2022-03-23 20:39 UTC (2 hours ago)

(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)

| renewiltord wrote:
| Huh, one guy on HN _did_ say it was the DB that was the problem
| earlier on. Neat!
| speedgoose wrote:
| It's interesting to read that so many systems and activities are
| dependent on a single point of failure: the main primary MySQL
| node at GitHub.
| eatonphil wrote:
| I can imagine what they're in a rush to refactor.
| cyberpunk wrote:
| I mean, they have $MEGABUCKS, they could probably get 1/2 the
| team who maintains mariadb to come in and work for them if they
| wanted, and they still have a giant single db node doing writes
| and struggle to fail it over.
|
| We're doomed >_<
|
| You would think it wouldn't be _THAT_ hard to shard something
| like GitHub effectively.
|
| I mean, all user accounts/repos starting with the letter 'a' go
| to the 'a' cluster and so on seems not exactly science-fiction
| levels of technology.
| [deleted]
| throwusawayus wrote:
| it's profoundly strange that github has not properly sharded
| yet. essentially all large social networks are or have used
| sharded mysql successfully, this is not rocket science
|
| livejournal, facebook, twitter, linkedin, tumblr, pinterest
| all use (or formerly used) sharded mysql and most of these
| are at larger db size than github
|
| i will also repeat my comment from another recent thread: i
| just cannot understand how 20+ former github db and infra
| people recently left to join a db sharding company. this
| makes no sense whatsoever in light of github's lack of
| successful sharding. wtf is going on in the tech world these
| days
| samlambert wrote:
| This is just a gross simplification of the situation. You're
| commenting from the outside without much context.
GitHub is a
| 14-year-old Rails app that is very complex; doing large
| migrations of database platforms can be very difficult and
| take a long time.
| throwusawayus wrote:
| You seem totally fine with making gross simplifications
| about your competitors at aws:
| https://news.ycombinator.com/item?id=30459692
| ketzo wrote:
| This is a super, super annoying type of comment, and
| against HN guidelines. It adds absolutely nothing to the
| conversation. Don't go digging through people's comment
| history to throw mud.
| drewbug01 wrote:
| > i just cannot understand how 20+ former github db and
| infra people recently left to join a db sharding company.
| this makes no sense whatsoever in light of github's lack of
| successful sharding.
|
| I believe you have the chain of causality backwards here.
| In fact, I think it suggests that the talent that went to
| PlanetScale is perhaps not the issue.
| throwusawayus wrote:
| i'm not suggesting that these departures are the cause of
| github's issue. rather, i'm saying i don't understand why
| such a large group from github was hired by planetscale
| if they did not have experience successfully sharding or
| successfully leveraging vitess
|
| this is like if you were building a high-rise condo:
| would you hire the architects or management company from
| the building that collapsed in surfside florida? sure,
| they know what NOT to do next time, but that doesn't mean
| they do know what TO do
| samlambert wrote:
| Hi, former GitHubber here and CEO of PlanetScale. I came
| to PlanetScale after seeing the incredible impact that
| Vitess had at GitHub. The team we have hired here
| successfully sharded parts of GitHub's very large
| platform. We continue to shard very large customers at
| PlanetScale.
|
| Platform migrations take a very long time and are very
| complicated, especially with decade-old codebases.
I will
| say that the current team at GitHub are nothing but
| outstanding people and engineers, with the difficult task of
| managing a very large deployment.
| drewbug01 wrote:
| I understand what you're getting at... but why is it you
| presume that the employees now at PlanetScale are the
| reason GH couldn't shard out their databases?
|
| Like, there's another angle here: management, yeah?
|
| Another way of reframing it is "maybe the folks hiring at
| planetscale know the inside baseball about GH
| infrastructure". For example:
| https://www.linkedin.com/in/isamlambert
| throwusawayus wrote:
| this is my exact point: they hired the former head of GH
| infrastructure - _literally the person directly
| responsible for all this at github for years_ - and made
| him their ceo
|
| github should have sharded _years ago_, every other
| large mysql user did so much earlier in their growth
| trajectory
| drewbug01 wrote:
| Ah, that's a more specific point than what you seemed to
| have been making before.
| sillysaurusx wrote:
| > they could probably get 1/2 the team who maintains mariadb
| to come in and work for them if they wanted
|
| _The Mythical Man-Month_ has a few things to say about that.
|
| (It's tempting to feel that the information is outdated, but
| in my experience it still seems true.)
| cyberpunk wrote:
| My point was that the reason they've not done that is that it
| wouldn't help.
|
| This is an architectural problem; even if they had the
| massive expensive brains behind something like mysql on
| their team, they couldn't fix it.
|
| (at least, I'm guessing; I think this kinda architecture
| doesn't scale even if they could kick the can down the road
| a few times..)
| sillysaurusx wrote:
| Oh, my apologies! Yes, that makes sense.
| prepend wrote:
| My fear is that this seems like a cover excuse for moving off
| MySQL. The bug will be too hard and they'll move off.
They will
| choose SQL Server and take a lot of time to convert, and then
| have even more outages.
| drewbug01 wrote:
| At GitHub's scale, you don't just "move off" a database. At
| best it would be a gradual project that would take _years_
| for the company to complete, and likely trigger additional
| incidents along the way.
| samlambert wrote:
| Correct.
| nimbius wrote:
| 100% agreed. a lift-and-shift migration to Galera and modern
| MariaDB wouldn't be hard, but knowing MS there are mid-level
| managers waiting in the wings to swoop in and drive this into
| the ground with azure/sqlserver, the former of which posted 8
| outages in the past 90 days alone.
|
| this is classic Microsoft. spend a ton of money for something
| very valuable -- in this case virtually all developer
| marketshare -- and then casually pedal it into the ground
| while you lie about the KPIs to C-levels (IIS marketshare on
| netcraft as a function of parked websites at GoDaddy to
| dominate over Apache) and keep it on life support with other
| revenue streams (Xbox) for the next 16 quarters until it
| becomes a repulsive enough carbuncle to shareholders that it
| gets the axe (Microsoft phone). then in a year, limp into the
| barn with another product nobody else but you could afford to
| buy (minecraft) and slowly turn it into a KPI farm for
| Microsoft account metrics to drive some other failing product
| (Azure) and keep the C-level happy while you alienate
| virtually every player with mechanics or requirements they
| hate.
| throwusawayus wrote:
| galera is not a solution for scaling out writes, full stop
|
| galera has lower max writes/sec than a traditional async
| single master because it's a cluster. the other members of
| the cluster need to ack the writes, and all members are
| doing all the writes, so adding machines does not increase
| your max writes
| prepend wrote:
| It's funny you mention Minecraft. My kids recently said "I
| hate Microsoft. All they did is ruin Mojang."
|
| They don't know Microsoft for anything other than ruining
| Minecraft. They didn't know Microsoft made the Xbox or even
| Windows.
|
| They made this statement after Microsoft forced them to
| migrate the account they'd had for 5 years to a Microsoft
| account. That broke their computer for a few days and reset
| their games. For no useful reason.
| speedgoose wrote:
| I bet on Azure Cosmos DB.
| protomyth wrote:
| Why would Microsoft not migrate to SQL Server? MySQL is owned
| by Oracle, and Microsoft cannot be happy about using a product
| from Oracle. SQL Server is a pretty good product, and the
| conversion will give them even more tools and expertise for
| their consulting wing to do it for other companies.
| Ygg2 wrote:
| They do seem to be acting like they are cool and OSS
| friendly.
| wincent wrote:
| Linked in the article is this other one, "Partitioning GitHub's
| relational databases to handle scale"
| (https://github.blog/2021-09-27-partitioning-githubs-
| relation...). That describes how there isn't just one "main
| primary" node; there are multiple clusters, of which `mysql1`
| is just one (the original one -- since then, many others have
| been partitioned off).
| throwusawayus wrote:
| from that article it sounds like they are mostly doing
| "functional partitioning" (moving tables off to other db
| primary/replica clusters) rather than true sharding
| (splitting up tables by ranges of data)
|
| functional partitioning is a band-aid. you do it when your
| main cluster is exploding but you need to buy time. it
| ultimately is a very bad thing, because generally your whole
| site is dependent on every single functional partition being
| up. it moves you from 1 single point of failure to N single
| points of failure!
| wincent wrote:
| Towards the end:
|
| > In addition to vertical partitioning to move database
| tables, we also use horizontal partitioning (aka sharding).
| This allows us to split database tables across multiple
| clusters, enabling more sustainable growth.
| throwusawayus wrote:
| yes, i know. the fact that this is a small blurb at the
| bottom of the article (which is largely about functional
| partitioning) exactly proves my point
| egberts1 wrote:
| The hardest part is drafting a series of questions for the end-
| user to understand and answer before we get that "MAGIC-PRESTO-
| BLAM-WHIZ" configuration file that just works.
|
| I blame the program providers.
|
| Some Debian maintainers are trying to do this simple querying of
| complex configurations (`dpkg-reconfigure <package-name>`), and I
| applaud their limited inroads there, because no one else seems to
| have bothered.
| paxys wrote:
| Outage checklist
|
| - Was it DNS?
|
| - Was it a bad config update?
|
| - Was it an overloaded single point of failure?
|
| There's rarely a #4
| doublerabbit wrote:
| #5: Is it plugged in?
|
| Build & configure the server, drive it a few hours down to the
| DC, rack the server. Get back home, try to access it. No luck.
| Turned out I had forgotten to connect power and turn it on.
| HL33tibCe7 wrote:
| A proposed fourth: was it BGP?
|
| Although most of those fall under "bad config update" (and
| likewise that applies to DNS).
| jenny91 wrote:
| When has BGP caused a serious outage at a website?
| karlding wrote:
| The big Facebook outage from last fall [0]?
|
| [0] https://blog.cloudflare.com/october-2021-facebook-
| outage/
| jenny91 wrote:
| The cause wasn't BGP though; it was just caught in the
| middle of a big mess and made it very hard to undo a
| botched config change.
| [deleted]
| karmakaze wrote:
| Or an actual DDoS, not self-inflicted.
| longcommonname wrote:
| Our internal monitoring has seen more outages than they listed
| here. There have been 4 full days where GitHub Actions has been
| mixed between completely broken and degraded status.
|
| It's nice to finally get some comms, but this is incredibly late
| and incomplete.
| calcifer wrote:
| TLDR: They still don't know why this is happening.
| rvz wrote:
| Oh dear. So other than those double outages, I should expect at
| the very least that GitHub will have one outage every month,
| if not 3 or 5 a month?
|
| I don't think we are going to see GitHub stay up for a full
| month without an incident anytime soon.
|
| I guess my entire comment chain [0] on GitHub's situation has
| aged well for two straight years in a row then, especially
| yesterday's one: [1].
|
| [0] https://news.ycombinator.com/item?id=30779275
|
| [1] https://news.ycombinator.com/item?id=30767821
___________________________________________________________________
(page generated 2022-03-23 23:00 UTC)