[HN Gopher] Cassandra at Apple: 1000s of Clusters, 300k Nodes, 1...
       ___________________________________________________________________
        
       Cassandra at Apple: 1000s of Clusters, 300k Nodes, 100 PB
        
       Author : mfiguiere
       Score  : 72 points
       Date   : 2022-10-07 17:46 UTC (1 days ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | notacoward wrote:
       | The operational complexity of managing thousands of clusters must
       | be mind boggling. I've been on two projects managing dozens of
       | storage clusters, the second with more data than this in some
       | individual clusters and adding up almost this many total nodes.
       | There were technical problems with scaling up to 10K nodes per
       | cluster, but the _operational_ issues mostly scaled according to
       | number of clusters. For example, how many alerts per hour/day
       | can you stand? Too many and you're overwhelmed; too few and you
       | miss stuff. Walking that fine line became successively more
       | difficult as clusters were added. Same thing with graphs and
       | dashboards. Also, when your storage and IOPS are siloed this much
       | you have no elasticity, so you're going to be chasing capacity or
       | load problems much more often. On the plus side, this probably
       | means each tenant has their own cluster, so you don't have so
       | many worries about them affecting each other.
       | 
       | The big question I'd have is: how many _people_ (including on
       | client teams) does it take to manage this much sprawl?
        
         | jhgg wrote:
         | At this scale you automate away 99.9% of the things you respond
         | to.
         | 
         | We are not even near this scale yet at my place of work, and we
         | are moving towards this methodology of strong automation to
         | orchestrate a cluster. We hire software engineers to operate
         | our database clusters, and the expectation is to somewhat be
         | selfishly motivated to write programs to remediate issues so
         | you don't get paged constantly. We do not expect to grow our
         | headcount proportionally to the number of clusters or nodes we
         | operate.
         | 
         | You must treat your nodes like cattle not pets. If a node
         | fails, automation kicks in and re-bootstraps it. It is not
         | worth figuring out how to nurse it back to health. When you are
         | performing rolling or scale up operations on the cluster you
         | are just invoking automation to do everything for you.
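A minimal sketch of the "cattle, not pets" loop described above. The `Node`, `Cluster`, `rebootstrap`, and `remediate` names are hypothetical and stand in for whatever inventory and provisioning hooks a real setup would have; nothing here comes from actual Cassandra tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    generation: int = 0  # how many times this slot has been rebuilt


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)


def rebootstrap(node: Node) -> Node:
    """Hypothetical hook: discard the failed node and bring up a fresh one."""
    return Node(name=node.name, healthy=True, generation=node.generation + 1)


def remediate(cluster: Cluster) -> list:
    """Replace every unhealthy node and return the names that were rebuilt."""
    rebuilt = []
    for i, node in enumerate(cluster.nodes):
        if not node.healthy:
            # No attempt to diagnose or repair in place: just replace.
            cluster.nodes[i] = rebootstrap(node)
            rebuilt.append(node.name)
    return rebuilt


cluster = Cluster([Node("n1"), Node("n2", healthy=False), Node("n3")])
rebuilt = remediate(cluster)
```

The point of the design is that the failure path never asks why the node is unhealthy; a human only gets involved if replacement itself keeps failing.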
        
         | redanddead wrote:
          | They for sure have in-house software to handle these nodes. The
          | software may be pretty good at what it does, considering they
          | know what their own DBs are susceptible to. That might reduce a
          | whole bunch of the human management, so the team they need may
          | not be that big, but is likely highly skilled.
        
         | magnawave wrote:
          | One of the hard parts of quantifying that is that you have
          | people who wear many hats. So sure, you have Cassandra gurus,
          | and probably a decent number of them. But this is in the league
          | where really hardcore automation kicks in to keep staffing
          | sane and operations possible. Outside that, how do you count
          | datacenter folks, client folks, networking folks, etc. who
          | only spend a small fraction of their time on the database parts?
         | 
         | But I think I can say with sufficient knowledge, all things
         | considered still, "way fewer than you might think".
        
       | iampims wrote:
       | I'd be curious to know what the orchestration platform looks like
       | for running 300,000 nodes and 1,000s of clusters.
        
       | jackblemming wrote:
       | Unrelated but does anyone else get tired of the fetishization of
       | BIG NUMBER? I don't care if Facebook has billions of users if
       | it's a hot pile of garbage. I don't care if some game has
       | millions of players if it's bad. When did BIG NUMBER overtake
       | quality and can we go back?
        
         | eurleif wrote:
         | Quantity of users is social proof (more accurately, social
         | evidence) of quality. The claim "we have high quality" is
         | cheap: anyone can make it, and it's subjective enough that it
         | can't be disproven in any absolute sense. But if you lie about
         | having billions of users, you can be called out on that pretty
         | easily; and if you say it honestly, it implies that billions of
         | people like the quality of your product or service enough to
         | use it. Billions of people can be wrong, but saying you have
         | billions of users is still much better evidence of quality than
         | saying you have quality.
        
           | api wrote:
           | > Quantity of users is social proof (more accurately, social
           | evidence) of quality.
           | 
           | It can also be evidence of founder effects (JavaScript), lock
           | in (Windows), network effects (most social media), etc.
        
         | osigurdson wrote:
         | It is a bit like music. One could state that all of the popular
         | music is garbage (I might even state that myself some days),
         | but in the end, the music purchasing public have spoken.
        
         | Maursault wrote:
         | > I don't care if Facebook has billions of users if it's a hot
         | pile of garbage.
         | 
         | Shockingly, Apache Cassandra was initially developed at
         | Facebook, so at least a hot pile of garbage is good for
         | something. Plus, it'll keep you warm, so that's two things.
        
         | chomp wrote:
         | BIG NUMBER implies a dedication to the care and feeding. It's a
         | nudge and a wink for "come work for us"
         | 
         | Number of users advertisement is more for investors, and maybe
         | a broadcast for FOMO.
         | 
         | There is, however, a tendency in our field to look for systems
         | design patterns that handle big N numbers, and apply those to
         | little N platforms, but I believe deep down that this is
         | business and management dysfunction, as systems refactoring is
         | tolerated even less than code refactoring.
        
         | latchkey wrote:
          | ¯\_(ツ)_/¯, I ran over 100k GPUs... it was fun. You sound
         | jelly.
        
           | redanddead wrote:
           | Sounds fun, do tell
        
       | faizshah wrote:
       | Interesting, I thought they were trying to switch over to
       | FoundationDB but looks like their Cassandra usage keeps growing.
        
         | onesociety2022 wrote:
         | Apple acquired FoundationDB but from what I have heard it's not
         | really used much. The FoundationDB founders have left Apple and
         | are working on other things. Cassandra is the main datastore
         | for iCloud data.
        
       | api wrote:
       | I keep hearing people talk about how hard Cassandra is to run
       | properly and how many people get it wrong etc. Is there anything
       | to this or is it just FUD and people who genuinely don't know
       | what they are doing?
        
         | [deleted]
        
         | jhgg wrote:
         | Our biggest issue with running Cassandra was related to
         | pathological read / write patterns by some tenants on our
         | system causing outsized availability impact due to triggering
         | garbage collection pressure that would cause whole node GC STW
         | pauses and severe tail latency / query degradation.
         | 
          | We have solved these issues in a few ways, mainly:
          | 
          | - working with the relevant product teams to implement
          | appropriate rate limiting or improving data modeling.
         | 
         | - introducing our own query layer, written in Rust that sits in
         | front of Cassandra that uses a form of micro-caching called
         | read coalescing, and also other forms of query throttling/load
         | shedding to reduce work the database must do for hot
          | keys/pathological patterns of access. We expose a gRPC
          | interface from this, which lets us centralize control of
          | the client driver and tune it appropriately, while also getting
          | to leverage the ever-growing open source gRPC traffic routing
          | solutions (Envoy, etc.)
         | 
         | and ultimately,
         | 
         | - switching to ScyllaDB, a C++ rewrite of Cassandra which is of
         | course void of any garbage collection issues, and features
         | faster overall performance and lower latencies.
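A rough illustration of the read coalescing idea mentioned in the list above: concurrent reads for the same key share one in-flight backend request instead of each hitting the database. The real layer described in the comment is written in Rust; this is only a Python sketch, and `ReadCoalescer`, `slow_fetch`, and the key name are made up for the example.

```python
import asyncio


class ReadCoalescer:
    """Share one in-flight backend read among concurrent requests for a key."""

    def __init__(self, fetch):
        self._fetch = fetch     # async callable: key -> value (hypothetical backend)
        self._in_flight = {}    # key -> asyncio task currently loading that key

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # The first caller for this key starts the real read.
            task = asyncio.ensure_future(self._load(key))
            self._in_flight[key] = task
        # Every caller, first or not, waits on the same task.
        return await task

    async def _load(self, key):
        try:
            return await self._fetch(key)
        finally:
            # Drop the entry once the read finishes, so nothing is cached
            # beyond the lifetime of the in-flight request.
            self._in_flight.pop(key, None)


async def demo():
    calls = 0

    async def slow_fetch(key):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for a database round trip
        return f"value-for-{key}"

    coalescer = ReadCoalescer(slow_fetch)
    results = await asyncio.gather(
        *(coalescer.get("hot-key") for _ in range(100))
    )
    return calls, results


calls, results = asyncio.run(demo())
```

Here 100 concurrent reads of one hot key reach the backend once. Because the entry is removed as soon as the read completes, this trades a tiny window of staleness for a large cut in duplicate work, which is presumably why the comment calls it micro-caching.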
         | 
         | Scylla, however, is not without its own set of issues - and
         | somewhat strict hardware requirements[0] thanks to the seastar
         | engine it is built on top of. Their team however has been
         | delightful to work with, and our platform is markedly more
         | stable in current year than it was in years past thanks to the
         | above factors.
         | 
          | Operationally, however, Scylla and Cassandra are quite easy to
          | run; the trickiest part is repairs. Operations such as cluster
          | expansion or node replacement are so common that they are at
          | this point mundane. Be wary, however, of the read/write
          | amplification issues inherent to LSMT databases: choosing the
          | correct compaction strategy and tuning it appropriately can
          | be quite key. Additionally, tombstones can be quite bad for
          | performance.
         | 
         | In current day we offer a new more generic solution that sits
         | on top of scylla (it would work with Cassandra too) that
         | provides a simple interface to query KKV based data, without
         | having to worry too much about problems like large partitions,
         | hot keys, or tombstones! With a design like this, the
         | underlying cluster thus far has been issue free and very easy
         | to operate.
         | 
         | [0]: https://discord.com/blog/how-discord-supercharges-network-
         | di...
        
         | achillean wrote:
         | We store a few PB of data in Cassandra, have used it for nearly
         | 10 years and in my opinion it's not that hard. Operationally
         | it's way easier to manage than Elastic and most other databases
         | (ex. PostgreSQL, MongoDB) plus there's a ton of documentation
          | available to help you debug/benchmark your cluster. Note that
         | even though CQL looks similar to SQL it's important to
         | understand the differences but as with any new technology
         | there's a learning curve. I would strongly recommend checking
         | out C* if you need a database with high write throughput and
         | that needs to scale out.
        
         | pmcf wrote:
         | It's a distributed system and if you have been a DBA for a
          | single system like Oracle or MySQL there are a lot of new
         | competencies to learn. That being said, completely doable and
         | it's typical to see small teams running massive amounts of
         | Cassandra. At the same conference, Bloomberg talked about their
         | large Cassandra footprint with only 4 people. If you want to
         | run Cassandra in K8s there is the K8ssandra project that
         | automates a lot. It's a fast growing project as a result.
         | (http://k8ssandra.io) If you want to use Cassandra and not run
         | it, http://astra.datastax.com. One click and a few seconds, you
         | get a completely serverless version of Cassandra that you only
         | pay for what you use. I'm sure we will hear a lot more of these
         | stories at Cassandra Summit in March
         | (http://cassandrasummit.org)
        
         | [deleted]
        
           | [deleted]
        
       | rektide wrote:
       | I'd love info on how much Apple contributed to Cassandra!
        
         | _benedict wrote:
         | I'm sure Scott's talk went into detail about this, but I can
         | safely say that his team contributes a great deal to Cassandra
        
       | tpmx wrote:
       | 300k Cassandra nodes seems a bit over the top even for a company
       | with as many active devices as Apple.
       | 
       | https://www.theverge.com/2022/1/28/22906071/apple-1-8-billio...
       | 
       | 1.8B active devices / 300k nodes = (just) 6k devices per
       | Cassandra node
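The division above, spelled out. Both inputs are the figures quoted in the thread (1.8B active devices from the linked article, 300k nodes from the headline); the variable names are just for illustration.

```python
active_devices = 1_800_000_000  # 1.8B active devices, per the linked article
cassandra_nodes = 300_000       # 300k nodes, per the headline

devices_per_node = active_devices / cassandra_nodes
print(f"{devices_per_node:,.0f} devices per node")
```

This is only a ratio of two headline numbers; it says nothing about how many of those nodes actually serve per-device data.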
        
         | daniel-grigg wrote:
         | Or it tells us something of how much data is being scooped up
         | per device. Certainly when I look through the raw health data
         | collected it's quite alarming and I'm sure that's just a drop
         | in the ocean.
        
           | ezfe wrote:
           | Well, Health data can be uploaded to iCloud (CloudKit), but
           | it's End-to-End encrypted so not really a concern.
           | 
           | Unlike other data in iCloud, if you lose your devices you
           | lose your HealthKit data. This is not true for photos or
           | emails, for example - which you keep if you lose your
           | devices.
        
           | mwint wrote:
           | Why do you think the raw health data is getting sucked off
           | your device? That would be totally off brand for them.
           | 
           | Apple does have a separate opt-in "Research" program to
           | facilitate this kind of thing.
        
             | faeriechangling wrote:
             | Regardless of their current brand, Apple is the next big
             | advertising giant and no amount of brand purity is going to
             | change this. The data of Apple's users is simply of too
             | high value for Apple to ignore forever.
        
               | tpmx wrote:
               | Makes me think of that first decade (98-08) when Google
               | actually wasn't being evil. Yeah, it's inevitable that
               | Apple will turn to this when they can't grow any more
               | simply by raising the prices of their devices. Perhaps
               | they have reached that point about now...
        
             | smoldesu wrote:
             | It's also off-brand for Apple to join PRISM and comply with
             | thousands of annual requests for supposedly-inaccessible
             | iCloud data. Neither of you will ever be proven right until
             | we look inside those servers though, so making _any_
              | conclusive statements is a mistake. Apple designed
              | Schrodinger's datacenter.
        
               | prange wrote:
        
         | threeseed wrote:
         | Apple has a lot more data than just a list of devices.
         | 
         | There is everything from Weather to Siri to Store Purchases
         | etc.
         | 
          | And companies will syndicate data sets to different teams for
          | performance and security reasons, i.e. lots of duplication.
        
           | tpmx wrote:
           | > Apple has a lot more data than just a list of devices.
           | [...]
           | 
           | Of course. That is not the point here.
        
             | echelon wrote:
             | Perhaps you'd be better convinced with a service breakdown.
             | 
             | Breaking monoliths into service boundaries yields easier
             | ownership, maintenance, migration, and resilience.
             | 
             | One "tiny" company with a few verticals can be comprised of
             | thousands of microservices, each handling their own
             | dedicated objective. Authentication, reverse proxy, API
             | gateway, SMS, email, customer list, marketing email
             | gateway, CMS for marketers on product X, feature flags,
             | transaction histories, GDPR compliance handling, billing
             | intelligence, various risk models, offline ML risk
             | enrichment, etc. etc. Each will have its own data needs and
             | replication / availability needs.
             | 
              | This Apple number might seem crazy, but I'm not fazed by
             | it. I can picture it.
        
               | tpmx wrote:
               | I can also picture it, but not really in the way you're
               | outlining it.
               | 
                | It's a sad and very inefficient picture though. Apple
                | does not _need_ this much data processing. It's a
                | grotesque amount per device. Or maybe they're just
                | wasting insane amounts of energy doing lots and lots
                | of stupid analytics...
        
               | echelon wrote:
               | Sometimes things have to be built as layered abstractions
               | in order for humans to reason about them at scale.
               | 
               | See also the natural stochastic gradient ascent that
               | produced our crazy complicated metabolic pathways (and
               | all of biology).
        
             | [deleted]
        
       | Luker88 wrote:
       | Couple of things as always:
       | 
       | Cassandra works really badly with fat nodes (lots of data on one
       | node), and much much better with a lot of small nodes, and 100PB
       | with 300K nodes confirms this. Scylla scales better vertically,
       | but I don't know how much.
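For a sense of how small those nodes are on average, here is the headline arithmetic. It assumes decimal units (1 PB = 10^15 bytes) and an even spread across nodes; the headline states neither, and it is not clear whether the 100 PB figure counts replicas.

```python
PB = 10**15  # decimal petabyte (assumption; the headline does not say)
GB = 10**9

total_bytes = 100 * PB  # 100 PB, per the headline
nodes = 300_000         # 300k nodes, per the headline

gb_per_node = total_bytes / nodes / GB
print(f"about {gb_per_node:.0f} GB per node on average")
```

That works out to roughly a third of a terabyte per node, which fits the "lots of small nodes" reading, though real clusters will vary widely around the average.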
       | 
       | Some comments are already comparing this to pgsql/mysql/whatever.
       | Please don't. You can't make the same queries even though the
       | language seems to support it.
       | 
       | Cassandra is good at ingesting data, bad at deleting, really
       | really bad at anything remotely relational. Errors are almost
       | pointless.
       | 
       | I'm going to point at an older comment of mine on cassandra:
       | https://news.ycombinator.com/item?id=20430925#20432564
       | 
       | The takeaway should be: Yes, cassandra/scylla can be really fast
       | and scale a lot. But it is also very probably unusable for your
       | use case. Don't trust what the CQL language says you can do.
       | Don't get me started on how bad the CQL language is, either.
        
       | bluedino wrote:
       | Over 2PB per cluster, thousands of clusters, but only 100's of PB
       | of data.
       | 
       | What do they use this for? iCloud storage related stuff?
        
         | candiddevmike wrote:
         | I thought Cassandra was bad at storing big files?
        
           | riku_iki wrote:
           | you can easily chunk big files.
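A sketch of what chunking could look like, assuming rows keyed by a file ID plus a chunk index (for instance as partition key and clustering column). The schema, the tiny chunk size, and the function names are all hypothetical, and no Cassandra driver is involved; the rows are plain tuples.

```python
CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


def chunk_rows(file_id, data, chunk_size=CHUNK_SIZE):
    """Split data into (file_id, chunk_index, payload) rows."""
    return [
        (file_id, index, data[offset:offset + chunk_size])
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(rows):
    """Rebuild the original bytes by concatenating payloads in chunk order."""
    ordered = sorted(rows, key=lambda row: row[1])
    return b"".join(payload for _, _, payload in ordered)


rows = chunk_rows("photo-1", b"0123456789")
restored = reassemble(rows)
```

Keeping every chunk of a file under one partition key would make the file itself a large partition, so a real design might also bucket chunks across several partitions.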
        
           | magnawave wrote:
           | Correct, that's what object stores are for. However, metadata
           | on said files, is probably very handy to have in a database.
           | 
           | I'm quite sure not all this Cassandra capacity is just
           | file/photo metadata storage either.
        
         | xvector wrote:
         | They use it for storing iCloud Photos without E2EE while
         | heavily marketing privacy
        
           | cassonmars wrote:
           | They were moving towards E2EE when everyone freaked out about
           | the on-device perceptual hashing trade off.
        
             | smoldesu wrote:
             | They should have wheeled out a better marketing spiel than
             | "trust us ;)" then.
        
             | AmericanChopper wrote:
             | Well... you don't actually need to make a computing device
             | automatically report its owner to the authorities for a
             | serious crime based on a provably flawed automated process,
              | prior to implementing E2EE encryption for a cloud storage
             | service. That was simply the strategy that Apple chose to
             | pursue. Blaming the users for reacting poorly to this
             | strictly anti-user approach is very backwards.
        
             | sneak wrote:
             | Or they could just deploy e2e without turning our devices
             | into things that spy on us. It's a false dichotomy.
        
             | MBCook wrote:
             | Yup. The knee-jerk privacy reaction _cost us_ privacy.
        
               | Gigachad wrote:
                | I don't think it's fair to say we need to accept either
                | option. Yes, the crime they are trying to stop is
               | horrific and something must be done, but that doesn't
               | justify unlimited technological spyware.
               | 
               | And the scope for abuse is so large. People in the UK are
               | getting arrested for retweeting mean memes, it's pretty
                | easy to imagine Google and Apple adding offensive images
                | to their scanning and you getting arrested for saving
               | something that goes against the current political agenda.
               | 
                | There is also the case where Google locked the account of a
               | parent who had taken photos to send to a medical expert.
        
       | redanddead wrote:
       | I'm under the belief that Redis is much faster than Cassandra, am
       | I crazy to think that Apple or any company really should have a
       | transition plan? Why isn't redis used more?
        
         | mplewis wrote:
         | Cassandra solves different problems than Redis is typically
         | used for.
        
       | zeristor wrote:
       | How comes iTunes movies takes minutes to display a list on Apple
       | TV, feels like minutes all the time?
        
         | jleahy wrote:
         | Normally you post a question and someone posts an answer. Here
         | someone has posted the answer and you have posted the question.
         | Are we playing Jeopardy?
        
           | tmpz22 wrote:
           | Or to just say it - because the system has absurd complexity
           | as proven by the hardware needed to run it.
        
         | ezfe wrote:
         | While it doesn't take that long on my device, it probably is
         | related to the fact that the iTunes store is over 20 years old
         | and has the tech debt to prove it.
        
         | reaperducer wrote:
         | _How comes iTunes movies takes minutes to display a list on
         | Apple TV, feels like minutes all the time?_
         | 
         | I just fired up the iTunes Movies app on my AppleTV for the
         | first time so no cache (I only watch my DVD/Blu-Ray rips), and
         | the app started and loaded a full list of movies in a little
         | under 3.5 seconds.
         | 
         | If it takes you minutes, it sounds like PEBKAC.
        
       | rkwasny wrote:
       | Can be replaced with 300 servers with ScyllaDB :-)
        
         | leetrout wrote:
         | Because of no JVM? Or because of its different architecture
         | (different caching and such)?
         | 
         | I would expect it to still require more for high availability
         | but from what I have heard around ScyllaDB it does seem there
         | is a benefit to it over cassandra.
        
           | neeh0 wrote:
           | Relevant post: https://instagram-engineering.com/open-
           | sourcing-a-10x-reduct...
        
         | riku_iki wrote:
         | will ScyllaDB shrink data 1k times using some magic?
        
         | [deleted]
        
         | pclmulqdq wrote:
         | You might need about 1500. I think 64 TB of flash is the
         | standard for a 2U database server these days.
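The estimate above as a quick calculation. It assumes decimal units and takes the 100 PB headline figure and the 64 TB per server figure at face value, with no allowance for replication factor, free-space headroom for compaction, or throughput limits.

```python
import math

PB = 10**15  # decimal units (assumption)
TB = 10**12

total_bytes = 100 * PB        # 100 PB, per the headline
per_server_bytes = 64 * TB    # 64 TB of flash per 2U server, per the comment

servers = math.ceil(total_bytes / per_server_bytes)
print(f"{servers} servers needed on raw capacity alone")
```

Raw capacity lands a little above 1500 servers, so the "about 1500" figure is a floor rather than a plan.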
        
       ___________________________________________________________________
       (page generated 2022-10-08 23:00 UTC)