[HN Gopher] How Discord supercharges network disks for extreme l...
       ___________________________________________________________________
        
       How Discord supercharges network disks for extreme low latency
        
       Author : techinvalley
       Score  : 177 points
       Date   : 2022-08-15 19:24 UTC (3 hours ago)
        
 (HTM) web link (discord.com)
 (TXT) w3m dump (discord.com)
        
       | throwdbaaway wrote:
       | That's a very smart solution indeed. I wonder if it is also
       | possible to throw more memory at the problem? Managing instances
       | with locally attached disks on the cloud is a bit of a pain, as
       | you don't have the capability to stop/start the instances
       | anymore.
        
       | skyde wrote:
       | The database should be able to be configured with a cache device.
       | 
       | Using a RAID manager or a filesystem for this does not seem
       | optimal.
        
         | skyde wrote:
          | Actually, something like ReadySet[1], but using a swap drive
          | instead of memory to store the cached rows, would work very
          | well.
         | 
         | [1] https://readyset.io/blog/introducing-readyset
        
       | JanMa wrote:
        | How does this setup handle maintenance of the underlying
       | hypervisor host? As far as I know the VM will be migrated to a
       | new hypervisor and all data on the local SSDs is lost. Can the
       | custom RAID0 array of local SSDs handle this or does it have to
       | be manually rebuilt on every maintenance?
        
         | ahepp wrote:
         | From the article:
         | 
         | > GCP provides an interesting "guarantee" around the failure of
         | Local SSDs: If any Local SSD fails, the entire server is
         | migrated to a different set of hardware, essentially erasing
         | all Local SSD data for that server.
         | 
          | I wonder how md handles reads during the rebuild, and how long
          | it takes to replicate the persistent store back onto the raid0
          | side of the mirror.
        
         | jhgg wrote:
         | On GCP, live migration moves the data on the local-ssd to the
         | new host as well.
        
       | merb wrote:
        | would've really liked to know how much cpu this solution costs.
        | most of the time mdraid adds additional cpu time (not that it
        | matters to them that much)
        
         | idorosen wrote:
          | If it's just raid0 and raid1, then there's likely no
          | significant CPU overhead involved. Most
         | southbridges or I/O controllers support mirroring directly, and
         | md knows how to use them. Most virtualized disk controllers do
         | this in hardware as well.
         | 
         | CPU overhead comes into play when you're doing parity on a
         | software raid setup (like md or zfs) such as in md raid5 or
         | raid6.
         | 
         | If they needed data scrubbing at a single host level like zfs
         | offers, then probably CPU would be a factor, but I'm assuming
         | they achieve data integrity at a higher level/across hosts,
         | such as in their distributed DB.
        
       | nvarsj wrote:
       | As an aside, I'm always impressed by Discord's engineering
       | articles. They are incredibly pragmatic - typically using
        | commonly available OSS to solve big problems. If this were another
        | unicorn company, they would have instead written a custom disk
        | controller in Rust, given it a Greek name, and done several major
        | conference talks on their unique innovation.
        
         | jfim wrote:
         | There are some times when writing a custom solution does make
         | sense though.
         | 
         | In their case, I'm wondering why the host failure isn't handled
         | at a higher level already. A node failure causing all data to
         | be lost on that host should be handled gracefully through
         | replication and another replica brought up transparently.
         | 
          | In any case, their usage of local storage as a write-through
          | cache through md is pretty interesting. I wonder if it would
          | work the other way around for reading.
        
           | mikesun wrote:
           | Scylla (and Cassandra) provides cluster-level replication.
           | Even with only local NVMes, a single node failure with loss
           | of data would be tolerated. But relying on "ephemeral local
           | SSDs" that nodes can lose if any VM is power-cycled adds
           | additional risk that some incident could cause multiple
           | replicas to lose their data.
        
         | stevenpetryk wrote:
         | That's a common theme here. We try to avoid making tools into
         | projects.
        
         | Pepe1vo wrote:
         | Discord has done its fair share of RIIR though[0][1] ;)
         | 
         | [0] https://discord.com/blog/why-discord-is-switching-from-go-
         | to...
         | 
         | [1] https://discord.com/blog/using-rust-to-scale-elixir-
         | for-11-m...
        
         | whizzter wrote:
         | They've probably subscribed to Taco Bell Programming.
         | 
         | http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
        
           | MonkeyMalarky wrote:
           | The article is from 2010 and uses the term DevOps! Just how
           | long has that meme been around?
        
           | Sohcahtoa82 wrote:
           | Oooh, I like this. I gotta remember this term.
        
           | tablespoon wrote:
           | >> http://widgetsandshit.com/teddziuba/2010/10/taco-bell-
           | progra...
           | 
           | > Here's a concrete example: suppose you have millions of web
           | pages that you want to download and save to disk for later
           | processing. How do you do it? The cool-kids answer is to
           | write a distributed crawler in Clojure and run it on EC2,
           | handing out jobs with a message queue like SQS or ZeroMQ.
           | 
           | > The Taco Bell answer? xargs and wget. In the rare case that
           | you saturate the network connection, add some split and
           | rsync. A "distributed crawler" is really only like 10 lines
           | of shell script.
           | 
           | I generally agree, but it's probably only 10 lines if you
           | assume you never have to deal with any errors.
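            | 
            | To be fair, the error-free version really is tiny --
            | something like this (an untested sketch, assuming a
            | hypothetical urls.txt with one URL per line):
            | 
            |   # fetch every URL, 16 at a time, mirroring the URL
            |   # structure under pages/
            |   xargs -P 16 -n 1 wget -q -x -P pages/ < urls.txt
            | 
            | Everything past that one line is error handling.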
        
             | derefr wrote:
             | That can be solved at the design level: write your get step
             | as an idempotent "only do it if it isn't already done"
             | creation operation for a given output file -- like a make
             | target, but no need to actually use Make (just a `test -f
             | || ...`.)
             | 
             | Then run your little pipeline _in a loop_ until it stops
             | making progress (`find | wc` doesn't increase.) Either it
             | finished, or everything that's left as input represents one
             | or more classes of errors. Debug them, and then start it
             | looping again :)
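              | 
              | A rough sketch of that shape (untested bash; names are
              | hypothetical):
              | 
              |   # idempotent get step: skip URLs we already fetched
              |   fetch() {
              |     out="pages/$(printf '%s' "$1" | md5sum | cut -d' ' -f1)"
              |     # caveat: a failed fetch can leave an empty file behind
              |     [ -f "$out" ] || wget -q -O "$out" "$1"
              |   }
              |   export -f fetch
              | 
              |   # loop until a full pass stops producing new files
              |   mkdir -p pages
              |   prev=-1
              |   while [ "$(find pages -type f | wc -l)" -gt "$prev" ]; do
              |     prev=$(find pages -type f | wc -l)
              |     xargs -P 8 -I{} bash -c 'fetch "$1"' _ {} < urls.txt
              |   done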
        
               | dekhn wrote:
                | Not redoing steps that appear to be already done has its
                | own challenges -- for example, a transfer that broke
                | halfway through might leave a destination file, but not
                | represent a completion (typically dealt with by writing
                | to a temp file and renaming).
               | 
               | The issue here is that your code has no real-time
               | adaptability. Many backends will scale with load up to a
               | point then start returning "make fewer requests".
               | Normally, you implement some internal logic such as
               | randomized exponential backoff retries (amazingly, this
               | is a remarkably effective way to automatically find the
               | saturation point of the cluster), although I have also
               | seen some large clients that coordinate their fetches
               | centrally using tokens.
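                | 
                | (Both fixes are only a few lines each -- a rough,
                | untested sketch with hypothetical names:)
                | 
                |   # atomic completion: write to a temp file, rename on
                |   # success; retry with randomized exponential backoff
                |   get() {
                |     url=$1; out=$2; delay=1
                |     for attempt in 1 2 3 4 5; do
                |       if curl -fsS "$url" -o "$out.tmp"; then
                |         mv "$out.tmp" "$out" && return 0
                |       fi
                |       sleep $(( delay + RANDOM % delay ))
                |       delay=$(( delay * 2 ))
                |     done
                |     return 1
                |   }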
        
               | derefr wrote:
               | Having that logic in the same place as the work of
               | actually driving the fetch/crawl, though, is a violation
               | of Unix "small components, each doing one thing"
               | thinking.
               | 
               | You know how you can rate-limit your requests? A forward
               | proxy daemon that rate-limits upstream connections by
               | holding them open but not serving them until the timeout
               | has elapsed. (I.e. Nginx with five lines of config.) As
               | long as your fetcher has a concurrency limit, stalling
               | some of those connections will lead to decreased
               | attempted throughput.
               | 
               | (This isn't just for scripting, either; it's also a near-
               | optimal way to implement global per-domain upstream-API
               | rate-limiting in a production system that has multiple
               | shared-nothing backends. It's Istio/Envoy "in the
               | small.")
        
               | dekhn wrote:
                | Setting up the nginx server means one more server to
                | manage (and nginx isn't particularly a small component
                | doing one thing).
               | 
               | Having built several large distributed computing systems,
               | I've found that the inner client always needs to have a
               | fair amount of intelligence when talking to the server.
               | That means responding to errors in a way that doesn't
               | lead to thundering herds. The nice thing about this is
               | that, like modern TCP, it auto-tunes to the capacity of
               | the system, while also handling outages well.
        
             | hn_version_0023 wrote:
              | I'd add GNU _parallel_ to the tools used; I've written this
              | exact crawler that way, saving screenshots using
              | _ghostscript_, IIRC
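              | 
              | Something like this, from memory (treat it as a sketch;
              | urls.txt is hypothetical):
              | 
              |   # fetch every URL in urls.txt, 8 downloads at a time
              |   parallel -j8 wget -q -x -P pages/ {} :::: urls.txt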
        
             | Sohcahtoa82 wrote:
              | Fair, but a simple Python script could probably handle it.
             | Still don't need a message queue.
        
             | skrtskrt wrote:
             | Errors, retries, consistency, observability. For actual
             | software running in production, shelling out is way too
             | flaky and unobservable
        
               | bbarnett wrote:
               | It's not flaky at all, it is merely that most people
               | don't code bash/etc to catch errors, retry on failure,
               | etc, etc.
               | 
                | I will 100% agree that it has disadvantages, but it's
                | unfair to level the above at shell scripts, for most of
                | your complaint is about poorly coded shell scripts.
               | 
               | An example? sysvinit is a few C programs, and all of it
               | wrapped in bash or sh. It's far more reliable than
               | systemd ever has been, with far better error checking.
               | 
                | Part of this is simplicity. 100 lines of code is better
                | than 10k lines. The value of having the "whole scope" on
                | one page can't be overstated for debugging and
                | comprehension, which also makes error checking easier.
        
               | skrtskrt wrote:
               | Can I, with off the shelf OSS tooling, _easily_ trace
               | that code that's "just wget and xargs", emit metrics and
               | traces to collectors, differentiate between all the
               | possible network and http failure errors, retry
               | individual requests with backoff and jitter, allow
                | individual threads to fail and retry those without
                | borking the root program, and write the results to a
                | datastore in an idempotent way, and allow a junior
                | developer to contribute to it with little ramp-up?
               | 
               | It's not about "can bash do it" it's about "is there a
               | huge ecosystem of tools, which we are probably already
               | using in our organization, that thoroughly cover all
               | these issues".
        
           | lordpankake wrote:
           | Awesome article!
        
       | [deleted]
        
       | shrubble wrote:
       | Key sentence and a half:
       | 
       | 'Discord runs most of its hardware in Google Cloud and they
       | provide ready access to "Local SSDs" -- NVMe based instance
       | storage, which do have incredibly fast latency profiles.
       | Unfortunately, in our testing, we ran into enough reliability
       | issues'
        
         | why_only_15 wrote:
         | Don't all physical SSDs have reliability issues? There's a good
         | reason we replicate data across devices.
        
       | legulere wrote:
        | Basically they needed to solve local SSDs not having all the
        | needed features, and persistent disks having too-high latency,
        | by:
       | 
       | > essentially a write-through cache, with GCP's Local SSDs as the
       | cache and Persistent Disks as the storage layer.
        
         | [deleted]
        
         | ahepp wrote:
          | I found it worth noting that the cache is implemented through
          | Linux's built-in software RAID system, md: SSDs in raid0
          | (stripe), then that stripe plus the persistent disk in raid1
          | (mirror).
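          | 
          | In mdadm terms the shape would be something like this (device
          | names are hypothetical, and if I'm reading the article right
          | the persistent disk is marked write-mostly so that reads
          | prefer the fast stripe):
          | 
          |   # stripe the local NVMe SSDs into one fast device
          |   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
          |       /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
          | 
          |   # mirror the stripe with the persistent disk; md avoids
          |   # reading from a write-mostly member while the stripe is up
          |   mdadm --create /dev/md1 --level=1 --raid-devices=2 \
          |       /dev/md0 --write-mostly /dev/sdb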
        
       | hardware2win wrote:
       | It sounds like intel optane persistent memory could work here
        
         | wmf wrote:
         | Sadly Optane cost a fortune, was never available in most
         | clouds, and has now been canceled.
        
       | davidw wrote:
       | Regarding ScyllaDB:
       | 
       | > while achieving significantly higher throughputs and lower
       | latencies [compared to Cassandra]
       | 
       | Do they really get all that just because it's in C++? Anyone
       | familiar with both of them?
        
       | doubledad222 wrote:
       | This was a very interesting read. The detail on the exploratory
       | process was perfect. Can't wait for part two.
        
       | javier_e06 wrote:
       | Yes, RAID is all that. It's interesting to see such established
       | technology shine and shine again.
        
       | madars wrote:
       | Q about a related use case: Can I use tiered storage (e.g., SSD
       | cache in front of a HDD with, say, dm-cache), or an md-based
       | approach like Discord's, to successfully sync an Ethereum node?
       | Everyone says "you should get a 2TB SSD" but I'm wondering if I
       | can be more future-proof with say 512 GB SSD cache + much larger
       | HDD.
        
         | pas wrote:
         | Sure, you might want to look into bcache or bcachefs.
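          | 
          | e.g., something like this (untested; device names are
          | hypothetical):
          | 
          |   # HDD as the backing device, an SSD partition as the cache
          |   make-bcache -B /dev/sdb -C /dev/nvme0n1p1
          |   # writeback caching gives the best write latency
          |   echo writeback > /sys/block/bcache0/bcache/cache_mode
          |   mkfs.ext4 /dev/bcache0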
        
         | Nextgrid wrote:
         | I'm not familiar with what Ethereum requires for its syncing
         | operation.
         | 
         | If it's a sequential write (by downloading the entire
         | blockchain), you will still be bottlenecked by the throughput
         | of the underlying disk.
         | 
         | If it's sequential reads (in between writes), the reads can be
         | handled by the cache if the location is local enough to the
         | previous write operation that it hasn't been evicted yet.
         | 
         | If it's random unpredictable reads, it's unlikely a cache will
         | help unless the cache is big enough to fit the entire working
         | dataset (otherwise you'll get a terrible cache hit rate as most
         | of what you need would've been evicted by then) but then you're
         | back at your original problem of needing a huge SSD.
        
       | alberth wrote:
       | TL;DR - NAS is slow. RAID 0 is fast.
        
         | wmf wrote:
         | This is not an accurate summary of the article.
        
           | [deleted]
        
       | PeterWhittaker wrote:
       | I feel like I am missing a step. Do they write to md1 and read
       | from md0? Or do they read from md1, and under the covers the read
       | is likely fulfilled by md0?
        
       | ahepp wrote:
       | Is it necessary to fully replicate the persistent store onto the
       | striped SSD array? I admire such a simple solution, but I wonder
       | if something like an LRU cache would achieve similar speedups
       | while using fewer resources. On the other hand, it could be a
       | small cost to pay for a more consistent and predictable workload.
       | 
        | How does md handle a synchronous write in a heterogeneous
        | mirror? Does it wait for both devices to be written?
       | 
        | I'm also curious how this solution compares to allocating more
        | RAM to the servers, and either letting the database software use
        | this for caching, or even creating a ramdisk and putting that in
        | raid1 with the persistent storage, since the SSDs are being
        | treated as volatile anyway. I assume it would be prohibitively
        | expensive to replicate the entire persistent store into main
        | memory.
       | 
       | I'd also be interested to know how this compares with replacing
        | the entire persistent disk / SSD system with zfs over a few SSDs
        | (which would also allow snapshotting). Of course it is probably a
       | huge feature to be able to have snapshots be integrated into your
       | cloud...
        
         | mikesun wrote:
         | > Is it necessary to fully replicate the persistent store onto
         | the striped SSD array? I admire such a simple solution, but I
         | wonder if something like an LRU cache would achieve similar
         | speedups while using fewer resources. On the other hand, it
         | could be a small cost to pay for a more consistent and
         | predictable workload.
         | 
          | One of the reasons an LRU cache like dm-cache wasn't feasible
          | was that we had a higher-than-acceptable bad sector read rate,
          | which would cause a cache like dm-cache to bubble a block
          | device error up to the database. The database would then shut
          | itself down when it encountered a disk-level error.
         | 
          | > How does md handle a synchronous write in a heterogeneous
          | mirror? Does it wait for both devices to be written?
          | 
          | Yes, md waits for both mirrors to be written.
          | 
          | > I'm also curious how this solution compares to allocating
          | more RAM to the servers, and either letting the database
          | software use this for caching, or even creating a ramdisk and
          | putting that in raid1 with the persistent storage, since the
          | SSDs are being treated as volatile anyway. I assume it would
          | be prohibitively expensive to replicate the entire persistent
          | store into main memory.
          | 
          | Yeah, we're talking many terabytes.
          | 
          | > I'd also be interested to know how this compares with
          | replacing the entire persistent disk / SSD system with zfs
          | over a few SSDs (which would also allow snapshotting). Of
          | course it is probably a huge feature to be able to have
          | snapshots be integrated into your cloud...
          | 
          | Would love if we could've used ZFS, but Scylla requires XFS.
        
       | cperciva wrote:
       | _4 billion messages sent through the platform by millions of
       | people per day_
       | 
       | I wish companies would stop inflating their numbers by citing
       | "per day" statistics. 4 billion messages per day is less than
       | 50k/second; that sort of transaction volume is well within the
       | capabilities of pgsql running on midrange hardware.
        
         | [deleted]
        
         | combyn8tor wrote:
         | Is there a template for replying to hackernews posts linking to
         | engineering articles?
         | 
          | 1. Cherry-pick a piece of info
          | 
          | 2. Claim it's not that hard/impressive/large
          | 
          | 3. Claim it can be done much simpler with <insert
          | database/language/hardware>
        
         | zorkian wrote:
         | (I work at Discord.)
         | 
         | It's not even the most interesting metric about our systems
         | anyway. If we're really going to look at the tech, the
         | inflation of those metrics to deliver the service is where the
         | work generally is in the system --
         | 
          | * 50k+ QPS (average) for new message inserts
          | 
          | * 500k+ QPS when you factor in deletes, updates, etc.
          | 
          | * 3M+ QPS looking at db reads
          | 
          | * 30M+ QPS looking at the gateway websockets (fanout of things
          | happening to online users)
         | 
         | But I hear you, we're conflating some marketing metrics with
         | technical metrics, we'll take that feedback for next time.
        
           | cperciva wrote:
           | Ideally I'd like to hear about messages per second at the
           | 99.99th percentile or something similar. That number says far
           | more about how hard it is to service the load than a per-day
           | value ever will.
        
         | pixl97 wrote:
          | 50k/s doesn't tell us about the burstiness of the messages.
          | Most places in the US don't have much traffic at 2AM.
        
         | bearjaws wrote:
          | Absolutely not. This is an average; I bet their peak could be
          | in the mid hundreds of thousands -- imagine a hype moment in
          | an LCS game.
         | 
         | It can in _theory_ work, but the real world would make this the
         | most unstable platform of all the messaging platforms.
         | 
         | Just one vacuum would bring this system down, even if it wasn't
         | an exclusive lock... Also I would be curious how you would
         | implement similar functionality to the URL deep linking and
         | image posting / hosting.
         | 
         | Mind you the answers to these will probably increase average
         | message size. Which means more write bandwidth.
         | 
         | Some bar napkin math shows this would be around 180GiB per
         | hour, 24/7, 4.3TiB per day.
         | 
          | Unless all messages disappeared within 20 days, you would
          | exceed pretty much any reasonable single-server NVMe setup.
          | Also have fun with trim and optimizing NVMe write performance,
          | which is
         | also going to diminish as all the drives fail due to write
         | wear...
        
           | cperciva wrote:
            | _Absolutely not. This is an average; I bet their peak could
            | be in the mid hundreds of thousands_
           | 
           | That's my point. Per-second numbers are far more useful than
           | per-day numbers.
        
         | marcinzm wrote:
          | I mean, they literally list the per-second numbers a couple of
          | paragraphs down:
          | 
          | > Our databases were serving around 2 million requests per
          | second (in this screenshot.)
         | 
         | I doubt pgsql will have fun on mid-range hardware with 50k
         | writes/sec, ~2 million reads/sec and 4 billion additional rows
         | per day with few deletions.
        
           | merb wrote:
            | actually 'few deletions' and 'few updates' is basically the
            | happy case for postgres. MVCC in an append-only system is
            | where it shines (because you generate way fewer dead tuples,
            | thus vacuum is not that big of a problem).
        
         | dist1ll wrote:
         | That's only messages _sent_. The article cites 2 million
         | queries per second hitting their cluster, which are not served
          | by a CDN. Considering latency requirements and burst traffic,
          | you're looking at a pretty massive load.
        
         | mbesto wrote:
         | What a myopic comment.
         | 
         | > 50k/second
         | 
          | Yes, 50k/second for every second of the day, 24/7/365. Very
          | few companies can quote that.
         | 
         | Not to mention:
         | 
         | - Has complex threaded messages
         | 
         | - Geo-redundancies
         | 
          | - Those messages are real-time
         | 
         | - Global user base
         | 
          | - An untold number of features related to messaging (bot
          | reactions, ACLs, permissions, privacy, formatting, etc.)
         | 
         | - No/limited downtime, live updates
         | 
         | Discord is technically impressive, not sure why you felt you
         | had to diminish that.
        
           | sammy2255 wrote:
            | Geo-redundancies? Discord is all in us-east1 on GCP. Unless
            | you meant AZ redundancy?
        
             | mbesto wrote:
             | > We are running more than 850 voice servers in 13 regions
             | (hosted in more than 30 data centers) all over the world.
             | This provisioning includes lots of redundancy to handle
             | data center failures and DDoS attacks. We use a handful of
             | providers and use physical servers in their data centers.
             | We just recently added a South Africa region. Thanks to all
             | our engineering efforts on both the client and server
             | architecture, we are able to serve more than 2.6 million
             | concurrent voice users with egress traffic of more than 220
             | Gbps (bits-per-second) and 120 Mpps (packets-per-second).
             | 
             | I don't see anything on their messaging specifically, just
             | assuming they would have something similar.
             | 
             | https://discord.com/blog/how-discord-handles-two-and-half-
             | mi...
        
               | zorkian wrote:
               | Our messaging stack is not currently multi-regional,
               | unfortunately. This is in the works, though, but it's a
               | fairly significant architectural evolution from where we
               | are today.
               | 
               | Data storage is going to be multi-regional soon, but
               | that's just from a redundancy/"data is safe in case of
               | us-east1 failure" scenario -- we're not yet going to be
               | actively serving live user traffic from outside of us-
               | east1.
        
       | t0mas88 wrote:
       | Clever trick. Having dealt with very similar things using
        | Cassandra, I'm curious how this setup will react to a failure of
        | a local NVMe disk.
       | 
       | They say that GCP will kill the whole node, which is probably a
       | good thing if you can be sure it does that quickly and
       | consistently.
       | 
       | If it doesn't (or not fast enough) you'll have a slow node
       | amongst faster ones, creating a big hotspot in your database.
       | Cassandra doesn't work very well if that happens and in early
       | versions I remember some cascading effects when a few nodes had
       | slowdowns.
        
         | TheGuyWhoCodes wrote:
          | that's always a risk when using local drives and needing to
          | rebuild when a node dies, but I guess they can over-provision
          | in case of a single node failure in the cluster until the
          | cache is warmed up.
          | 
          | Edit: Just wanted to add that because they are using Persistent
          | Disks as the source of truth, and depending on the network
          | bandwidth, it might not be that big of a problem to restore a
          | node to a working state if it's using a quorum for reads and RF
          | >= 3.
          | 
          | Restoring a node from zero in case of disk failure will always
          | be bad.
          | 
          | They could also have another caching layer on top of the
          | cluster to further mitigate the latency issue until the node
          | gets back to health and finishes all the hinted handoffs.
        
           | jhgg wrote:
           | We have two ways of re-building a node under this setup.
           | 
           | We can either re-build the node by simply wiping its disks,
           | and letting it stream in data from other replicas, or we can
           | re-build by simply re-syncing the pd-ssd to the nvme.
           | 
            | Node failure is a regular occurrence, not a "bad" thing, and
            | something we intend to fully automate. A node should be able
            | to fail and recover without anyone noticing.
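            | 
            | The second re-build path is roughly just letting md resync
            | the mirror (a sketch with hypothetical device names, not our
            | actual tooling):
            | 
            |   # re-attach the wiped stripe; md copies back from the
            |   # persistent disk member
            |   mdadm /dev/md1 --add /dev/md0
            |   cat /proc/mdstat   # watch the rebuild progress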
        
         | mikesun wrote:
          | That's a good observation. We've spent a lot of time on our
          | control plane, which handles the various RAID1 failure modes;
          | e.g., when a RAID1 degrades due to a failed local SSD, we
          | force-stop the node so that it doesn't continue to operate as
          | a slow node. Wait for part 2! :)
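          | 
          | (Not our actual control plane logic, but to give a flavor of
          | how cheap the detection part is -- a crude sketch, with an
          | illustrative service name:)
          | 
          |   # a degraded two-member RAID1 shows [U_] or [_U] in mdstat
          |   if grep -Eq '\[U_\]|\[_U\]' /proc/mdstat; then
          |     systemctl stop scylla-server   # stop before we run slow
          |   fi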
        
       ___________________________________________________________________
       (page generated 2022-08-15 23:00 UTC)