[HN Gopher] Scaling Databases at Activision [pdf] ___________________________________________________________________ Scaling Databases at Activision [pdf] Author : ksec Score : 158 points Date : 2023-04-21 16:09 UTC (6 hours ago) (HTM) web link (static.sched.com) (TXT) w3m dump (static.sched.com) | erksch wrote: | It would have been interesting to know what alternatives they | considered. Perhaps this would have been easy to solve using | Nomad and TiDB. | | So it's a story of people who used Kubernetes and MySQL and | continued to use Kubernetes and MySQL. The end. | kokizzu2 wrote: | yup lots of companies do agree https://github.com/kokizzu/list- | of-tech-migrations | bickett wrote: | This is an interesting discussion on database performance and | scaling. Thanks for sharing! | achanda358 wrote: | What kind of queries per second did you see while running on | kubernetes, compared to VMs? | rektide wrote: | 3 months from start to roll out for tbeir large scale deploy is | hella impressive. | | Just trying to do basic scaling or other simple tasks sounded | like an utter nightmare on their old system. Having autonomic | computing at their back seems like an obvious win. Tell it your | desired state & let the controller do the job. | | The new paradigm for computer operations is so wonderful. | endur wrote: | The timing slide caught my attention as well. 3 months for | anything is fast... | cpressland wrote: | Ahh, I wanted to go see this talk at KubeCon yesterday but the | room was completely full. Glad to see the finer points got posted | here. | brycethornton wrote: | They have a PlanetScale logo on the last slide. Is this just to | say "Thanks for your work on Vitess!" or are they using the | PlanetScale service for some of this? | jtchang wrote: | I wonder what game this was for. | | To me this is validation that going the SQL route in almost 99% | of new apps is the right way to go. It will be a rare case that | you won't be able to scale out given how mature some of these | technologies are. | chillydawg wrote: | Demonware handle the online services for the CoD franchise. | VirusNewbie wrote: | Github had DAYS of an outage because they unsuccessfully tried | to scale out SQL rather than dealing with NoSql. | | There are very few things that structured NoSql can't do if you | ignore reporting. Once you scale out traditional SQL, you | aren't going to be using it for reporting either way though. | mh- wrote: | This is a gross mischaracterization of what caused that | outage, even just based on the public information from the | (excellent) blog post [0] about it. | | Automating _master failover_ on MySQL without a human in the | loop - _in the topology GitHub used back then_ - is risky. | | 0: https://github.blog/2018-10-30-oct21-post-incident- | analysis/ | LunaSea wrote: | Anyone have more blog posts about video game backend tech like | this one? | geenat wrote: | These numbers make more sense on hardware from 5 years ago (to be | fair, what they were dealing with), but postgres, on modern | consumer hardware (Yes, consumer! Ex: KC3000 + Ryzen 7950x), | already does over 50k QPS without being overburdened. You could | probably squeeze out 100k QPS with faster NVME's. | | That's like 10 tall servers for their peak QPS? (500k QPS on | slide 19) | | Lots of natural sharding points here too: company, game, usage of | data, etc. | | This is assuming you avoid or go light on expensive features like | foreign keys (as Vitess already does). | | The "scale magic" in Spanner (Big Table) inspired DB's are just | hidden automation of traditional sharding- | | CockroachDB: Sharded indexes, then runs map reduce for you. | | Scylla/Cassanda: Sharded indexes again, but more limitations for | speedups: Eventually consistent. You don't have fast delete- only | update (discord uses tombstones). JOIN's are "in app" only. | | Vitess: Proxy that dismantles/routes your query to the correct | server. This scales, but is eventually consistent. JOIN's on co- | located data, or "in app".... in simple terms, Vitess is like an | externally managed "in app" query parser/router. | winrid wrote: | Mongo's sharding also uses routers like Vitess but can be | strongly consistent depending on connection and write settings. | | Data is divided into chunks and the routers route queries based | on where your query lands on chunk ranges. | aeyes wrote: | I see 500k qps with 30TB data doubling every 2 years on slide | 19. | | In my experience me this isn't close to being possible with a | single Postgres box. There is no way around sharding this | workload. | | Vitess has different consistency models, it's not as eventually | consistent as you describe. | geenat wrote: | You'd certainly want to shard from the start in any case, but | it's an order of magnitude less postgres boxes for the QPS. | leetrout wrote: | Wonder what the tradeoff on connection overhead would be? | I've never been around a connection pooler that would be | anywhere near this load. | leetrout wrote: | When you give up one of the major components of relational | databases like foreign keys you pick up lots of performance | opportunity. | | It is amazing what Vitess / PS will help you accomplish but they | don't talk a lot about tradeoffs you make to get there (which are | similar to what you face with sharding MySQL without additional | tooling). | berkle4455 wrote: | Foreign keys are so overrated. Write good code and you don't | have to rely on the DB to enforce referential integrity. It's | the equivalent of running test cases with every db write. | AdrianB1 wrote: | From experience you are very right. Also from experience, | most teams are not good enough to write the code with that | quality that you can give up to referential integrity. | vecter wrote: | I'm curious, how would you obviate the need for foreign keys | with "good" code? Can you provide a toy example or a | reference to an article so I can understand better? I've used | NoSQL databases a long time ago and currently rely on on good | ole PostgreSQL, but I'm having a hard time understanding how | "good" code can be a better solution for managing | relationships between data than a foreign key. | Yoric wrote: | I guess you can use indices instead of foreign keys? And | somehow implement all the `ON DELETE CASCADE` manually | within any transaction that removes the original row? Not | sure how it's "good" code but it could be faster. | DasIch wrote: | A foreign key doesn't necessarily imply an index. If you | are using postgres, you would have to add an index in | addition to the foreign key, if you want one. | | On delete cascade, depending on how many rows it cascades | to, can be problematic because it's a very long running | blocking operation. That's something one might want to do | as a background operation and in batches. Although that | won't make it faster. | sroussey wrote: | Faster in aggregate throughput can often be different | from faster for a specific operation. | | Personally, I find delete on cascade dangerous. I mean, | lots of fun for a pen tester, sure... | berkle4455 wrote: | > for managing relationships between data than a foreign | key. | | You're conflating the concept of a normalized database with | insanely slow DB-enforced referential integrity/foreign | keys. | | Toy example? Sure. Take a well-formed 3NF schema and | disable foreign key constraints. | AlisdairO wrote: | How does this stay correct in the presence of concurrent | activity? | berkle4455 wrote: | Using transactions or UUID/ULIDs though maybe I'm | misunderstandingyour question. How do foreign key | constraints help with concurrency? | AlisdairO wrote: | Table User: userid, etc | | Table Resources: resourceid, userid, etc | | If I want to restrict deletion of a user to only be | possible after all the resources are deleted, I'm forced | into using higher-than-default isolation levels in most | DBs. This has significant performance implications. It's | also much easier to make a mistake - for example, if when | creating a resource I check that the user exists prior to | starting the transaction, then start the tran, then do | the work, it will allow insertion of data into a | nonexistent user. | vecter wrote: | Sorry, assume I'm dense. | | > Take a well-formed 3NF schema and disable foreign key | constraints. | | I'm familiar with 3NF, but can you expand on how 3NF | enables you to remove foreign keys? Or feel free to point | me to an article/blog, I don't want to waste your time if | it's too much to explain. I did some googling but wasn't | sure where to proceed from your post. | Philip-J-Fry wrote: | They're not saying 3NF enables you to remove foreign | keys. They are talking about removing foreign key | constraints from your RDBMS of choice. | | Something like SQL Server can enforce foreign key | constraints if you explicitly tell it your relationships | between tables. The downside is that having this | referential integrity costs you performance as the | database has to check your relations when | inserting/updating/deleting rows. E.g. checking that a | foreign key is pointing at a valid primary key, checking | that you aren't leaving invalid foreign keys when | deleting a primary key, etc. This is to prevent you | inserting bad data into the database. | | You can delete these constraints and still have the exact | same behavior _so long as your code is correct_. It just | means that the database isn 't going to stop you writing | bad data. | vecter wrote: | That makes sense. For workloads where write performance | isn't very important but read performance is, this sounds | like it may still be a worthwhile tradeoff to have that | extra level of data integrity. | | My sense is that many typical CRUD apps aren't writing | gargantuan volumes of data or making very complex edits, | and if they do, it's ok if it takes a second longer. | Usually read speed is more of a bottleneck for user- | facing applications, but I'm sure there are probably some | examples where this tradeoff is worth it. | berkle4455 wrote: | Say you have a super basic setup: user = | {user_id, email} order = {order_id, user_id} | | order.user_id is a foreign key to user.user_id. That's a | perfectly valid and reasonable way to organize things. | | Enabling RDBMS-enforced foreign key constraints is the | issue. It slows everything down dramatically. | vecter wrote: | I see, thanks. Basically just store the foreign keys | yourself as columns in relevant tables and perform the | joins in SQL without having the DB enforce FK integrity | with every insertion/update/delete. Would it be fair to | say that read speeds are unaffected by this? | berkle4455 wrote: | Yes. Quite literally the only difference is not enabling | foreign key constraints. Read speeds unaffected correct. | kgeist wrote: | "Write good code and you won't have bugs" :) | | I think it's good practice to enforce consistency rules both | in the DB and in code. If you make a mistake in your code, | the DB won't allow it, and vice versa. | vp8989 wrote: | In theory it is good practice, but in practice enforcing | referential integrity is not free and it makes satisfying | other non-functional requirements harder. Engineering is | about tradeoffs. Foreign keys are overkill in modern | architectures where DBs/tables are typically only written | to by a single application. They solve a problem that lots | of people don't really have anymore. | kgeist wrote: | In the legacy project our team inherited, there's a lot | of data consistency issues stemming from the lack of | foreign keys (we're adding them back now). Data | consistency is important because we deal with financial | data. Typically, code and data change more often than DB | schemas, so every release has a chance that someone will | forget to update all relations manually in code, | introducing dangling references, for example. Foreign | keys help as an additional safety measure, because they | don't change as often. I think you should care about the | performance of foreign keys only when speed is more | important than data consistency. It's a good default. | sroussey wrote: | Personally, I thinks foreign keys are great in dev and | bad at any scale that you have many db servers. | | And as you point out, there are exceptions, like | financial data. But not marketing funnels where you might | throw everything away. | | You should probably have different types of engineers | working on such different projects as well. | baronvonsp wrote: | The exact analogy that came to mind for me. | | Plus, relational databases don't just sit under a single | application. There's usually multiple applications/services | talking to them. Worse, humans connect to them an do all | sorts of things they shouldn't do. That's the whole point | of managing referential integrity in the DBMS, since you | can only control "just write good code" across so many | application domains. | | Of course whether the performance tradeoff is worth it is a | complicated decision for many of the reasons people have | mentioned. But in 20 years of working with relational | databases at big companies, I've seen few examples where | the performance win exceeded the business risk. | candiddevmike wrote: | Things can be really fast when you throw out referential | integrity | _a_a_a_ wrote: | Databases are my interest, so if you don't mind... How would | removal of foreign keys speed up read access? | | I'm also going to disagree with you a little, again on the | issue of reading, if you have foreign keys then you can do | optimisations like cutting out chunks of joins. Adding | constraints means you know more which you can feed to the | optimiser which means often you can get better performance | (read performance, that is). | robmccoll wrote: | Not OP, but they said that you pick up performance, not | specifically in reading. The performance you gain is larger | aggregate write throughput (and read generally - although | reads that require spreading queries across shards and | aggregating results will likely perform a little worse). You | also gain scale in that you have turned something that was | limited by scale up into something limited by scale out | (while giving up some data guarantees and performance on some | queries). | | As to why you can't really have foreign keys, basically, | sharding schemes like this sit at an intermediate layer | between the application and the various DB clusters that are | the shards. You CAN have strong data consistency (and useful | foreign keys) within a shard because it's all inside a single | database; however across shards, you CAN NOT. The sharding | layer doesn't perform checks for you, so if you have two | logical tables that are partitioned differently across the | shards, you can't have a foreign key that will be enforced | correctly as the foreign key of a given row in one table may | live on a different shard in the other table. The local | database within the shard would reject the insert. | Transactions across shards can also be tricky to impossible. | _a_a_a_ wrote: | In my experience most databases are read-heavy so that's | what I was asking about. That said, thanks for a clear | explanation. | samlambert wrote: | We have foreign keys coming soon! | angarg12 wrote: | Interesting slide deck, but I feel we are missing some of the | most interesting details out. Particularly, slides 24 and 25 | could be an entire article on their own. | | * How did you tune Vitess for resiliency? What were the tradeoffs | and how was the performance? | | * How did you migrate from shared in app config to single DB | endpoint? (this is an issue we are facing right now) | | * What do you mean by some queries being too shard aware? How did | you optimize queries for efficient routing with Vitess? | super256 wrote: | I assume Activision _actually_ knows how to scale databases, but | has "join queues" for hype? I think everyone knows the infamous | Blizzard queues in release week, where you have to wait an hour | with only 5000 people in front of you. Happened recently with the | OW2 release. | mh- wrote: | Databases are not the only contended-for resource in online | games. | | Those queues are usually the result of intentional capacity | planning. Overprovisioning is expensive. | super256 wrote: | > Those queues are usually the result of intentional capacity | planning. Overprovisioning is expensive. | | I'd imagine they could spin up new servers on demand. | Arrath wrote: | Could, but after near 20 years of WoW and however many | expansion launches with big ole wait queues each time, they | really don't seem to bother. | rejectfinite wrote: | Gaming is insane that it even works online in near realtime. Even | moreso before. You can buy old wow servers. That was a world. | alecco wrote: | > * SQL query compatibility > * Minimal changes to the | application > * Runs MySQL in backend > * | Kubernetes native > * Provides Kubernetes operator | > We evaluated multiple candidates and chose Vitess | | https://github.com/vitessio/vitess | | But it's very light on numbers and doesn't show trade-offs or | anything. | edf13 wrote: | > Vitess has been a core component of YouTube's database | infrastructure since 2011, and has grown to encompass tens of | thousands of MySQL nodes. | | Would love to hear more about this implementation | tbarn wrote: | KubeCon talks should be out in the next month or so and it has | more details. | Thaxll wrote: | Vitess was used for all of Youtube, they moved to Spanner | couple of years ago, Vitess is proven and scalable tech. ___________________________________________________________________ (page generated 2023-04-21 23:00 UTC)