[HN Gopher] Cassandra at Apple: 1000s of Clusters, 300k Nodes, 1... ___________________________________________________________________ Cassandra at Apple: 1000s of Clusters, 300k Nodes, 100 PB Author : mfiguiere Score : 72 points Date : 2022-10-07 17:46 UTC (1 days ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | notacoward wrote: | The operational complexity of managing thousands of clusters must | be mind boggling. I've been on two projects managing dozens of | storage clusters, the second with more data than this in some | individual clusters and adding up almost this many total nodes. | There were technical problems with scaling up to 10K nodes per | cluster, but the _operational_ issues mostly scaled according to | number of clusters. For example, how many alerts per hour /day | can you stand? Too many and you're overwhelmed; too few and you | miss stuff. Walking that fine line became successively more | difficult as clusters were added. Same thing with graphs and | dashboards. Also, when your storage and IOPS are siloed this much | you have no elasticity, so you're going to be chasing capacity or | load problems much more often. On the plus side, this probably | means each tenant has their own cluster, so you don't have so | many worries about them affecting each other. | | The big question I'd have is: how many _people_ (including on | client teams) does it take to manage this much sprawl? | jhgg wrote: | At this scale you automate away 99.9% of the things you respond | to. | | We are not even near this scale yet at my place of work, and we | are moving towards this methodology of strong automation to | orchestrate a cluster. We hire software engineers to operate | our database clusters, and the expectation is to somewhat be | selfishly motivated to write programs to remediate issues so | you don't get paged constantly. We do not expect to grow our | headcount proportionally to the number of clusters or nodes we | operate. 
| | You must treat your nodes like cattle not pets. If a node | fails, automation kicks in and re-bootstraps it. It is not | worth figuring out how to nurse it back to health. When you are | performing rolling or scale up operations on the cluster you | are just invoking automation to do everything for you. | redanddead wrote: | They for sure have in-house software to handle these nodes. The | software may be pretty good at what it does considering they | know what their own DBs are susceptible to, that might reduce a | whole bunch of the human management and the team they need may | be not that big but likely highly skilled. | magnawave wrote: | One of the hard parts of quantifying that, is you have people | who wear many hats. So sure you have Cassandra gurus, and | probably a decent number of them. But this is in the league | where really hardcore automation kicks in to keep staffing | sane, and operations possible. But outside that, how do you | count datacenter folks, client folks, networking folks, etc who | only spend a little fraction of time on the database parts. | | But I think I can say with sufficient knowledge, all things | considered still, "way fewer than you might think". | iampims wrote: | I'd be curious to know what the orchestration platform looks like | for running 300,000 nodes and 1,000s of clusters. | jackblemming wrote: | Unrelated but does anyone else get tired of the fetishization of | BIG NUMBER? I don't care if Facebook has billions of users if | it's a hot pile of garbage. I don't care if some game has | millions of players if it's bad. When did BIG NUMBER overtake | quality and can we go back? | eurleif wrote: | Quantity of users is social proof (more accurately, social | evidence) of quality. The claim "we have high quality" is | cheap: anyone can make it, and it's subjective enough that it | can't be disproven in any absolute sense. 
But if you lie about | having billions of users, you can be called out on that pretty | easily; and if you say it honestly, it implies that billions of | people like the quality of your product or service enough to | use it. Billions of people can be wrong, but saying you have | billions of users is still much better evidence of quality than | saying you have quality. | api wrote: | > Quantity of users is social proof (more accurately, social | evidence) of quality. | | It can also be evidence of founder effects (JavaScript), lock | in (Windows), network effects (most social media), etc. | osigurdson wrote: | It is a bit like music. One could state that all of the popular | music is garbage (I might even state that myself some days), | but in the end, the music purchasing public have spoken. | Maursault wrote: | > I don't care if Facebook has billions of users if it's a hot | pile of garbage. | | Shockingly, Apache Cassandra was initially developed at | Facebook, so at least a hot pile of garbage is good for | something. Plus, it'll keep you warm, so that's two things. | chomp wrote: | BIG NUMBER implies a dedication to the care and feeding. It's a | nudge and a wink for "come work for us" | | Number of users advertisement is more for investors, and maybe | a broadcast for FOMO. | | There is, however, a tendency in our field to look for systems | design patterns that handle big N numbers, and apply those to | little N platforms, but I believe deep down that this is | business and management dysfunction, as systems refactoring is | tolerated even less than code refactoring. | latchkey wrote: | ¯\_(ツ)_/¯, I ran over 100k GPUs... it was fun. You sound | jelly. | redanddead wrote: | Sounds fun, do tell | faizshah wrote: | Interesting, I thought they were trying to switch over to | FoundationDB but looks like their Cassandra usage keeps growing. | onesociety2022 wrote: | Apple acquired FoundationDB but from what I have heard it's not | really used much. 
The FoundationDB founders have left Apple and | are working on other things. Cassandra is the main datastore | for iCloud data. | api wrote: | I keep hearing people talk about how hard Cassandra is to run | properly and how many people get it wrong etc. Is there anything | to this or is it just FUD and people who genuinely don't know | what they are doing? | [deleted] | jhgg wrote: | Our biggest issue with running Cassandra was related to | pathological read / write patterns by some tenants on our | system causing outsized availability impact due to triggering | garbage collection pressure that would cause whole node GC STW | pauses and severe tail latency / query degradation. | | We have solved these issues in a few ways, mainly: - working | with the relevant product teams to implement appropriate rate | limiting or improving data modeling. | | - introducing our own query layer, written in Rust that sits in | front of Cassandra that uses a form of micro-caching called | read coalescing, and also other forms of query throttling/load | shedding to reduce work the database must do for hot | keys/pathological patterns of access. We expose a GRPC | interface from this - and this lets us centralize control of | the client driver and tune it appropriately, while also getting | to leverage the ever growing open source grpc traffic routing | solutions (envoy, etc...) | | and ultimately, | | - switching to ScyllaDB, a C++ rewrite of Cassandra which is of | course void of any garbage collection issues, and features | faster overall performance and lower latencies. | | Scylla, however, is not without its own set of issues - and | somewhat strict hardware requirements[0] thanks to the seastar | engine it is built on top of. Their team however has been | delightful to work with, and our platform is markedly more | stable in current year than it was in years past thanks to the | above factors. 
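A minimal sketch of the read coalescing idea described above, written in Python rather than Rust and not taken from the commenter's actual query layer. It assumes the simplest form of the technique: while one read for a key is in flight, every other caller asking for the same key waits on that same read instead of sending its own query to the database. `ReadCoalescer`, `slow_fetch`, and `demo` are hypothetical names introduced only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class ReadCoalescer:
    """Share one in-flight backend read among concurrent callers of the same key."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch  # hypothetical backend read, e.g. a database query
        self._in_flight: Dict[str, "asyncio.Task[str]"] = {}

    async def get(self, key: str) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # The first caller for this key starts the real read.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Drop the entry once the read finishes so later reads hit the backend again.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # shield() keeps one cancelled caller from cancelling the shared read.
        return await asyncio.shield(task)


async def demo() -> Tuple[int, List[str]]:
    calls = 0

    async def slow_fetch(key: str) -> str:
        # Stand-in for a database round trip; counts how often it is invoked.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"value-for-{key}"

    coalescer = ReadCoalescer(slow_fetch)
    # 100 concurrent reads of the same hot key.
    results = await asyncio.gather(*(coalescer.get("hot") for _ in range(100)))
    return calls, list(results)


if __name__ == "__main__":
    backend_calls, values = asyncio.run(demo())
    print(backend_calls, len(values))
```

In this sketch the 100 concurrent reads of one hot key reach the backend once. A real service would presumably add timeouts, error handling, and the throttling or load shedding the comment mentions, none of which is modelled here.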
| | Operationally, however, Scylla and Cassandra are quite easy to | run; the trickiest part is repairs. Operations such as | cluster expansion or node replacement are so common that | they are at this point mundane. Be wary however | about read/write amplification issues inherent to LSMT | databases; choosing the correct compaction strategy and tuning | it appropriately can be quite key. Additionally tombstones can | be quite bad for performance. | | In current day we offer a new more generic solution that sits | on top of scylla (it would work with Cassandra too) that | provides a simple interface to query KKV based data, without | having to worry too much about problems like large partitions, | hot keys, or tombstones! With a design like this, the | underlying cluster thus far has been issue free and very easy | to operate. | | [0]: https://discord.com/blog/how-discord-supercharges-network- | di... | achillean wrote: | We store a few PB of data in Cassandra, have used it for nearly | 10 years and in my opinion it's not that hard. Operationally | it's way easier to manage than Elastic and most other databases | (e.g. PostgreSQL, MongoDB) plus there's a ton of documentation | available to help you debug/benchmark your cluster. Note that | even though CQL looks similar to SQL it's important to | understand the differences but as with any new technology | there's a learning curve. I would strongly recommend checking | out C* if you need a database with high write throughput and | that needs to scale out. | pmcf wrote: | It's a distributed system and if you have been a DBA for a | single system like Oracle or MySQL there are a lot of new | competencies to learn. That being said, completely doable and | it's typical to see small teams running massive amounts of | Cassandra. At the same conference, Bloomberg talked about their | large Cassandra footprint with only 4 people. 
If you want to | run Cassandra in K8s there is the K8ssandra project that | automates a lot. It's a fast growing project as a result. | (http://k8ssandra.io) If you want to use Cassandra and not run | it, http://astra.datastax.com. One click and a few seconds, you | get a completely serverless version of Cassandra that you only | pay for what you use. I'm sure we will hear a lot more of these | stories at Cassandra Summit in March | (http://cassandrasummit.org) | [deleted] | [deleted] | rektide wrote: | I'd love info on how much Apple contributed to Cassandra! | _benedict wrote: | I'm sure Scott's talk went into detail about this, but I can | safely say that his team contributes a great deal to Cassandra | tpmx wrote: | 300k Cassandra nodes seems a bit over the top even for a company | with as many active devices as Apple. | | https://www.theverge.com/2022/1/28/22906071/apple-1-8-billio... | | 1.8B active devices / 300k nodes = (just) 6k devices per | Cassandra node | daniel-grigg wrote: | Or it tells us something of how much data is being scooped up | per device. Certainly when I look through the raw health data | collected it's quite alarming and I'm sure that's just a drop | in the ocean. | ezfe wrote: | Well, Health data can be uploaded to iCloud (CloudKit), but | it's End-to-End encrypted so not really a concern. | | Unlike other data in iCloud, if you lose your devices you | lose your HealthKit data. This is not true for photos or | emails, for example - which you keep if you lose your | devices. | mwint wrote: | Why do you think the raw health data is getting sucked off | your device? That would be totally off brand for them. | | Apple does have a separate opt-in "Research" program to | facilitate this kind of thing. | faeriechangling wrote: | Regardless of their current brand, Apple is the next big | advertising giant and no amount of brand purity is going to | change this. The data of Apple's users is simply of too | high value for Apple to ignore forever. 
| tpmx wrote: | Makes me think of that first decade (98-08) when Google | actually wasn't being evil. Yeah, it's inevitable that | Apple will turn to this when they can't grow any more | simply by raising the prices of their devices. Perhaps | they have reached that point about now... | smoldesu wrote: | It's also off-brand for Apple to join PRISM and comply with | thousands of annual requests for supposedly-inaccessible | iCloud data. Neither of you will ever be proven right until | we look inside those servers though, so making _any_ | conclusive statements is a mistake. Apple designed | Schrodinger's datacenter. | prange wrote: | threeseed wrote: | Apple has a lot more data than just a list of devices. | | There is everything from Weather to Siri to Store Purchases | etc. | | And companies will syndicate data sets to different teams for | performance and security reasons, i.e. lots of duplication. | tpmx wrote: | > Apple has a lot more data than just a list of devices. | [...] | | Of course. That is not the point here. | echelon wrote: | Perhaps you'd be better convinced with a service breakdown. | | Breaking monoliths into service boundaries yields easier | ownership, maintenance, migration, and resilience. | | One "tiny" company with a few verticals can be comprised of | thousands of microservices, each handling their own | dedicated objective. Authentication, reverse proxy, API | gateway, SMS, email, customer list, marketing email | gateway, CMS for marketers on product X, feature flags, | transaction histories, GDPR compliance handling, billing | intelligence, various risk models, offline ML risk | enrichment, etc. etc. Each will have its own data needs and | replication / availability needs. | | This Apple number might seem crazy, but I'm not fazed by | it. I can picture it. | tpmx wrote: | I can also picture it, but not really in the way you're | outlining it. | | It's a sad and very inefficient picture though. 
Apple | does not _need_ this much data processing. It's a | grotesque amount per device. Or maybe they're just | wasting insane amounts of energy doing lots and lots | of stupid analytics... | echelon wrote: | Sometimes things have to be built as layered abstractions | in order for humans to reason about them at scale. | | See also the natural stochastic gradient ascent that | produced our crazy complicated metabolic pathways (and | all of biology). | [deleted] | Luker88 wrote: | Couple of things as always: | | Cassandra works really badly with fat nodes (lots of data on one | node), and much much better with a lot of small nodes, and 100PB | with 300K nodes confirms this. Scylla scales better vertically, | but don't know how much. | | Some comments are already comparing this to pgsql/mysql/whatever. | Please don't. You can't make the same queries even though the | language seems to support it. | | Cassandra is good at ingesting data, bad at deleting, really | really bad at anything remotely relational. Errors are almost | pointless. | | I'm going to point at an older comment of mine on cassandra: | https://news.ycombinator.com/item?id=20430925#20432564 | | The takeaway should be: Yes, cassandra/scylla can be really fast | and scale a lot. But it is also very probably unusable for your | use case. Don't trust what the CQL language says you can do. | Don't get me started on how bad the CQL language is, either. | bluedino wrote: | Over 2PB per cluster, thousands of clusters, but only 100's of PB | of data. | | What do they use this for? iCloud storage related stuff? | candiddevmike wrote: | I thought Cassandra was bad at storing big files? | riku_iki wrote: | you can easily chunk big files. | magnawave wrote: | Correct, that's what object stores are for. However, metadata | on said files is probably very handy to have in a database. | | I'm quite sure not all this Cassandra capacity is just | file/photo metadata storage either. 
| xvector wrote: | They use it for storing iCloud Photos without E2EE while | heavily marketing privacy | cassonmars wrote: | They were moving towards E2EE when everyone freaked out about | the on-device perceptual hashing trade off. | smoldesu wrote: | They should have wheeled out a better marketing spiel than | "trust us ;)" then. | AmericanChopper wrote: | Well... you don't actually need to make a computing device | automatically report its owner to the authorities for a | serious crime based on a provably flawed automated process, | prior to implementing E2EE for a cloud storage | service. That was simply the strategy that Apple chose to | pursue. Blaming the users for reacting poorly to this | strictly anti-user approach is very backwards. | sneak wrote: | Or they could just deploy e2e without turning our devices | into things that spy on us. It's a false dichotomy. | MBCook wrote: | Yup. The knee-jerk privacy reaction _cost us_ privacy. | Gigachad wrote: | I don't think it's fair to say we need to accept either | option. Yes the crime they are trying to stop is | horrific and something must be done, but that doesn't | justify unlimited technological spyware. | | And the scope for abuse is so large. People in the UK are | getting arrested for retweeting mean memes, it's pretty | easy to imagine Google and Apple adding offensive images | to their scanning and you getting arrested for saving | something that goes against the current political agenda. | | As well as the case where google locked the account of a | parent who had taken photos to send to a medical expert. | redanddead wrote: | I'm under the belief that Redis is much faster than Cassandra, am | I crazy to think that Apple or any company really should have a | transition plan? Why isn't redis used more? | mplewis wrote: | Cassandra solves different problems than Redis is typically | used for. 
| zeristor wrote: | How comes iTunes movies takes minutes to display a list on Apple | TV, feels like minutes all the time? | jleahy wrote: | Normally you post a question and someone posts an answer. Here | someone has posted the answer and you have posted the question. | Are we playing Jeopardy? | tmpz22 wrote: | Or to just say it - because the system has absurd complexity | as proven by the hardware needed to run it. | ezfe wrote: | While it doesn't take that long on my device, it probably is | related to the fact that the iTunes store is over 20 years old | and has the tech debt to prove it. | reaperducer wrote: | _How comes iTunes movies takes minutes to display a list on | Apple TV, feels like minutes all the time?_ | | I just fired up the iTunes Movies app on my AppleTV for the | first time so no cache (I only watch my DVD/Blu-Ray rips), and | the app started and loaded a full list of movies in a little | under 3.5 seconds. | | If it takes you minutes, it sounds like PEBKAC. | rkwasny wrote: | Can be replaced with 300 servers with ScyllaDB :-) | leetrout wrote: | Because of no JVM? Or because of its different architecture | (different caching and such)? | | I would expect it to still require more for high availability | but from what I have heard around ScyllaDB it does seem there | is a benefit to it over cassandra. | neeh0 wrote: | Relevant post: https://instagram-engineering.com/open- | sourcing-a-10x-reduct... | riku_iki wrote: | will ScyllaDB shrink data 1k times using some magic? | [deleted] | pclmulqdq wrote: | You might need about 1500. I think 64 TB of flash is the | standard for a 2U database server these days. ___________________________________________________________________ (page generated 2022-10-08 23:00 UTC)