[HN Gopher] MySQL Sharding at Quora ___________________________________________________________________ MySQL Sharding at Quora Author : jinqueeny Score : 74 points Date : 2020-04-26 14:08 UTC (8 hours ago) (HTM) web link (www.quora.com) (TXT) w3m dump (www.quora.com) | troughway wrote: | While this is a great read, I'm excited to find out more about | the tech challenges behind their login wall. | fovc wrote: | > The result is that we do very few joins in MySQL | | I realize depending on your data model some of this is | unavoidable once you shard, but it sounds like they did this even | before sharding. | | What's the reason to do that? Seems like you either end up with | long transactions or doing MVCC in app code | tobyjsullivan wrote: | The preceding sentence is | | > Joins inside MySQL were strongly discouraged in the codebase | so that we would have more freedom in choosing which tables to | move for scale out. The result is that we do very few joins in | MySQL. | | And that comes at the end of the section describing their | "vertical scaling" strategy and its motivations. | | As for your hypothesis about long transactions, this is not | actually a problem in my experience (on larger scale CRUDy | apps, specifically). In this scenario, you're choosing to trade | consistency for the ability to break out arbitrary tables to | discrete databases. You're going to lose that consistency after | splitting data between multiple databases anyway (we'll gloss | over distributed transactions here) so why not be proactive | about it? And on a "simple" platform like Quora, data is going | to have a predominantly hierarchical relationship (users have | posts which have comments which have likes, etc.) so internal | consistency between content is rarely a significant concern. | toast0 wrote: | As a sibling has noted, joins across database hosts with SQL is | somewhere between hard and impossible. So it made sense to | discourage and remove them on that basis, given that the | datasize was growing/has grown beyond what can be managed | reasonably on a single host. | | Another reason to avoid joins is that they it can be difficult | to consistently write joins that execute quickly. Often, a | poorly performing join can be written as a sequence of queries | which each perform well, and joined with reasonable performance | on the client side. Usually, it's easier to scale the client | (webserver) tier than the database tier, so moving work to the | client makes sense for that reason too. | | Of course, this has to be considered in the context of the | service. These days it's pretty easy to get a database server | with 64 cores and a couple terrabytes of ram, and several | terrabytes of fast SSD storage. You can do a lot with that, and | you may never need to shard. | gokaygurcan wrote: | https://www.raphkoster.com/2009/01/08/database-sharding-came... | | Nice article from Raph Koster. Especially if you played Ultima | Online by any chance. | awinter-py wrote: | 'shard user scale data, centralize or mirror small scale data' | (the A1 A2 B C D E diagram) is a useful point | | 'obvious in retrospect' I suspect to a lot of teams that add | sharding to their system | JohnBooty wrote: | A couple of years ago, our infrastructure team | implemented the ability to move a table from one MySQL | host to another. This allowed us to have multiple | masters with different tables on different masters, | which we refer to as "vertical sharding". Each one | of these masters, along with its slaves, is called as a | "partition" at Quora. | | I nearly stopped reading here. | | Naming things is hard, but why would they choose a term | ("partition") that has already existed for many decades in the | same context? | | That's nonsensical. | Ozzie_osman wrote: | What's nonsensical about it? A partition is just a logical | split of your data. You can do horizontal partitioning (aka | "horizontal sharding"), like splitting the same table to | different databases by key. Or you can do vertical | partitioning, where you put different tables in different | databases. | jeffdavis wrote: | I always thought vertical partitioning meant to split by | columns (that is, have fewer columns per table, and you need | to join to get the original back). | zerr wrote: | Comment from the representative (probably) of the "Chinese Quora" | is interesting as well :) | jinqueeny wrote: | Thanks! Posting the case study here for more information: | https://pingcap.com/success-stories/lesson-learned-from-quer... | roskilli wrote: | Since it's range based sharding, was Vitess ever considered (a | popular CNCF MySQL range sharding cluster approach that came out | of YouTube)? | sethammons wrote: | Different company. We evaluated Vitess in comparison to | ProxySQL. While Vitess was very promising, it was a large | refactor and an all or nothing proposition to switch to it. | ProxySQL allowed us to move to it piecemeal and adopt a crawl, | walk, run solution in adopting it. | RoboTeddy wrote: | Would Google Cloud Spanner work well for a workload like this | one? Why or why not? | winrid wrote: | It'd probably work. They cite cost as a reason for not going | third party in the article. | axegon_ wrote: | By all means yes, I'd argue spanner would be the ideal | candidate for the task. It scales beyond anything else I've | come across. Mind you, so do the costs of running it. I've only | used it for training exercises and numbers add up like nothing | else. IIRC running a production instance at the bare minimum | would set you back around 9k/month. | makz wrote: | This is going to be very useful for one of the projects I'm | working on, thanks for sharing | TekMol wrote: | Interesting that they still use MySql and did not switch to | MariaDB. | | What is HNs opinion on MySql vs MariaDB? | brobinson wrote: | Why do you find it interesting that they didn't switch? What is | the compelling reason to switch if you don't care about Oracle | owning MySQL? | | The only difference I remember from having been forced to use | MariaDB in the past was that one of the 5.x releases didn't | support NOT NULL virtual columns when the equivalent 5.x MySQL | release did support them... to me, MariaDB is just a strange, | slightly incompatible fork, but maybe someone else here has | more experience with both. (I prefer postgres) | Mailtemi wrote: | Their site(use case) looks almost similar to stackoverflow.com. | The stackoverflow.com has few MS SQL servers with a Memcached | as cache. I guess if reads are the heavy load it makes perfect | sense just to continue with your current SQL DB + sharding. ___________________________________________________________________ (page generated 2020-04-26 23:00 UTC)