[HN Gopher] How Discord supercharges network disks for extreme l...
___________________________________________________________________
How Discord supercharges network disks for extreme low latency
Author : techinvalley
Score : 177 points
Date : 2022-08-15 19:24 UTC (3 hours ago)
(HTM) web link (discord.com)
(TXT) w3m dump (discord.com)

| throwdbaaway wrote:
| That's a very smart solution indeed. I wonder if it is also
| possible to throw more memory at the problem? Managing instances
| with locally attached disks on the cloud is a bit of a pain, as
| you lose the ability to stop/start the instances.
| skyde wrote:
| The database should be able to be configured with a cache device.
|
| Using a RAID manager or a filesystem for this does not seem
| optimal.
| skyde wrote:
| Actually, using something like ReadySet[1] but with a swap
| drive instead of memory to store the cached rows would work very
| well.
|
| [1] https://readyset.io/blog/introducing-readyset
| JanMa wrote:
| How does this setup handle maintenance of the underlying
| hypervisor host? As far as I know, the VM will be migrated to a
| new hypervisor and all data on the local SSDs is lost. Can the
| custom RAID0 array of local SSDs handle this, or does it have to
| be manually rebuilt on every maintenance?
| ahepp wrote:
| From the article:
|
| > GCP provides an interesting "guarantee" around the failure of
| Local SSDs: If any Local SSD fails, the entire server is
| migrated to a different set of hardware, essentially erasing
| all Local SSD data for that server.
|
| I wonder how md handles reads during the rebuild, and how long
| it takes to replicate the persistent store back onto the raid0
| side of the mirror.
| jhgg wrote:
| On GCP, live migration moves the data on the local-ssd to the
| new host as well.
| merb wrote:
| would've really liked to know how much cpu this solution costs.
| most of the time mdraid adds additional cpu time (not that it
| matters to them that much)
| idorosen wrote:
| If it's just raid0 and raid1, then there's likely no
| significant CPU time or overhead involved. Most
| southbridges or I/O controllers support mirroring directly, and
| md knows how to use them. Most virtualized disk controllers do
| this in hardware as well.
|
| CPU overhead comes into play when you're doing parity on a
| software raid setup (like md or zfs), such as in md raid5 or
| raid6.
|
| If they needed data scrubbing at the single-host level like zfs
| offers, then CPU would probably be a factor, but I'm assuming
| they achieve data integrity at a higher level/across hosts,
| such as in their distributed DB.
| nvarsj wrote:
| As an aside, I'm always impressed by Discord's engineering
| articles. They are incredibly pragmatic - typically using
| commonly available OSS to solve big problems. If this were
| another unicorn company, they would have instead written a
| custom disk controller in Rust, called it a Greek name, and
| given several major conference talks on their unique innovation.
| jfim wrote:
| There are some times when writing a custom solution does make
| sense though.
|
| In their case, I'm wondering why the host failure isn't handled
| at a higher level already. A node failure causing all data to
| be lost on that host should be handled gracefully through
| replication, with another replica brought up transparently.
|
| In any case, their usage of local storage as a write-through
| cache through md is pretty interesting. I wonder if it would
| work the other way around for reading.
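On the rebuild question above: md keeps serving reads from the
surviving mirror member while a resync runs, and stock mdadm exposes
the progress directly. A minimal sketch, assuming the hypothetical
device names /dev/md0 for the local-SSD raid0 stripe and /dev/md1 for
the raid1 mirror over it and the persistent disk:

    # Inspect mirror health and resync progress (device names are
    # hypothetical placeholders, not from the article).
    cat /proc/mdstat
    mdadm --detail /dev/md1

    # After a host migration wipes the local SSDs, re-add the
    # rebuilt raid0 stripe to the mirror; md copies the data back
    # from the surviving persistent-disk member and keeps serving
    # reads from that member in the meantime.
    mdadm /dev/md1 --add /dev/md0

How fast the resync completes is then bounded by the persistent
disk's read throughput, which determines how long a node stays
"slow" after a migration.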
| mikesun wrote:
| Scylla (and Cassandra) provides cluster-level replication.
| Even with only local NVMes, a single node failure with loss
| of data would be tolerated. But relying on "ephemeral local
| SSDs" that nodes can lose if any VM is power-cycled adds
| additional risk that some incident could cause multiple
| replicas to lose their data.
| stevenpetryk wrote:
| That's a common theme here. We try to avoid making tools into
| projects.
| Pepe1vo wrote:
| Discord has done its fair share of RIIR though[0][1] ;)
|
| [0] https://discord.com/blog/why-discord-is-switching-from-go-to...
|
| [1] https://discord.com/blog/using-rust-to-scale-elixir-for-11-m...
| whizzter wrote:
| They've probably subscribed to Taco Bell Programming.
|
| http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
| MonkeyMalarky wrote:
| The article is from 2010 and uses the term DevOps! Just how
| long has that meme been around?
| Sohcahtoa82 wrote:
| Oooh, I like this. I gotta remember this term.
| tablespoon wrote:
| >> http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
|
| > Here's a concrete example: suppose you have millions of web
| pages that you want to download and save to disk for later
| processing. How do you do it? The cool-kids answer is to
| write a distributed crawler in Clojure and run it on EC2,
| handing out jobs with a message queue like SQS or ZeroMQ.
|
| > The Taco Bell answer? xargs and wget. In the rare case that
| you saturate the network connection, add some split and
| rsync. A "distributed crawler" is really only like 10 lines
| of shell script.
|
| I generally agree, but it's probably only 10 lines if you
| assume you never have to deal with any errors.
| derefr wrote:
| That can be solved at the design level: write your get step
| as an idempotent "only do it if it isn't already done"
| creation operation for a given output file -- like a make
| target, but no need to actually use Make (just a
| `test -f || ...`.)
|
| Then run your little pipeline _in a loop_ until it stops
| making progress (`find | wc` doesn't increase.) Either it
| finished, or everything that's left as input represents one
| or more classes of errors. Debug them, and then start it
| looping again :)
| dekhn wrote:
| Not redoing steps that appear to be already done has its
| own challenges - for example, a transfer that broke
| halfway through might leave a destination file but not
| represent a completion (typically dealt with by writing
| to a temp file and renaming).
|
| The issue here is that your code has no real-time
| adaptability. Many backends will scale with load up to a
| point, then start returning "make fewer requests".
| Normally, you implement some internal logic such as
| randomized exponential backoff retries (amazingly, this
| is a remarkably effective way to automatically find the
| saturation point of the cluster), although I have also
| seen some large clients that coordinate their fetches
| centrally using tokens.
| derefr wrote:
| Having that logic in the same place as the work of
| actually driving the fetch/crawl, though, is a violation
| of Unix "small components, each doing one thing"
| thinking.
|
| You know how you can rate-limit your requests? A forward
| proxy daemon that rate-limits upstream connections by
| holding them open but not serving them until the timeout
| has elapsed. (I.e. Nginx with five lines of config.) As
| long as your fetcher has a concurrency limit, stalling
| some of those connections will lead to decreased
| attempted throughput.
|
| (This isn't just for scripting, either; it's also a
| near-optimal way to implement global per-domain upstream-API
| rate-limiting in a production system that has multiple
| shared-nothing backends. It's Istio/Envoy "in the small.")
| dekhn wrote:
| Setting up the nginx server means one more server to manage
| (and nginx isn't particularly a small component doing one
| thing).
|
| Having built several large distributed computing systems,
| I've found that the inner client always needs to have a
| fair amount of intelligence when talking to the server.
| That means responding to errors in a way that doesn't
| lead to thundering herds. The nice thing about this is
| that, like modern TCP, it auto-tunes to the capacity of
| the system, while also handling outages well.
| hn_version_0023 wrote:
| I'd add GNU _parallel_ to the tools used; I've written this
| exact crawler that way, saving screenshots using
| _ghostscript_, IIRC
| Sohcahtoa82 wrote:
| Fair, but a simple Python script could probably handle it.
| Still don't need a message queue.
| skrtskrt wrote:
| Errors, retries, consistency, observability. For actual
| software running in production, shelling out is way too
| flaky and unobservable.
| bbarnett wrote:
| It's not flaky at all; it is merely that most people
| don't code bash/etc to catch errors, retry on failure,
| and so on.
|
| I will 100% agree that it has disadvantages, but it's
| unfair to level the above at shell scripts, for most of
| your complaint is about poorly coded shell scripts.
|
| An example? sysvinit is a few C programs, all of it
| wrapped in bash or sh. It's far more reliable than
| systemd ever has been, with far better error checking.
|
| Part of this is simplicity. 100 lines of code is better
| than 10k lines. The value of "whole scope" on one page
| can't be overestimated for debugging and comprehension,
| which also makes error checking easier.
| skrtskrt wrote:
| Can I, with off-the-shelf OSS tooling, _easily_ trace
| that code that's "just wget and xargs", emit metrics and
| traces to collectors, differentiate between all the
| possible network and HTTP failure errors, retry
| individual requests with backoff and jitter, allow
| individual threads to fail and retry those without
| borking the root program, write the results to a
| datastore in an idempotent way, and allow a junior
| developer to contribute to it with little ramp-up?
|
| It's not about "can bash do it", it's about "is there a
| huge ecosystem of tools, which we are probably already
| using in our organization, that thoroughly covers all
| these issues".
| lordpankake wrote:
| Awesome article!
| [deleted]
| shrubble wrote:
| Key sentence and a half:
|
| 'Discord runs most of its hardware in Google Cloud and they
| provide ready access to "Local SSDs" -- NVMe based instance
| storage, which do have incredibly fast latency profiles.
| Unfortunately, in our testing, we ran into enough reliability
| issues'
| why_only_15 wrote:
| Don't all physical SSDs have reliability issues? There's a good
| reason we replicate data across devices.
| legulere wrote:
| Basically they had to solve local SSDs not having all the needed
| features, and persistent disks having too-high latency, by
| building:
|
| > essentially a write-through cache, with GCP's Local SSDs as the
| cache and Persistent Disks as the storage layer.
| [deleted]
| ahepp wrote:
| I found it worth noting that the cache is implemented through
| Linux's built-in software RAID system, md: SSDs in raid0
| (stripe), persistent disk in raid1 (mirror).
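Putting ahepp's summary together: a minimal sketch of that layout
with stock mdadm, assuming four hypothetical local SSDs and a single
persistent disk. The thread doesn't quote Discord's exact commands;
marking the slow member --write-mostly is one plausible way to steer
reads to the stripe:

    # Hypothetical devices: /dev/nvme0n1..nvme3n1 are local SSDs,
    # /dev/sdb is the network-attached persistent disk.

    # 1. Stripe the local SSDs into one fast raid0 device.
    mdadm --create /dev/md0 --level=0 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

    # 2. Mirror that stripe with the persistent disk. The
    #    --write-mostly flag marks the slow member so md prefers
    #    the raid0 side for reads; every write still lands on
    #    both, which is what makes this a write-through cache.
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/md0 --write-mostly /dev/sdb

    # 3. Per mikesun downthread, Scylla requires XFS.
    mkfs.xfs /dev/md1

The database then mounts and touches only /dev/md1; md fans writes
out to both members and answers reads from the stripe.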
| hardware2win wrote:
| It sounds like Intel Optane persistent memory could work here
| wmf wrote:
| Sadly, Optane cost a fortune, was never available in most
| clouds, and has now been canceled.
| davidw wrote:
| Regarding ScyllaDB:
|
| > while achieving significantly higher throughputs and lower
| latencies [compared to Cassandra]
|
| Do they really get all that just because it's in C++? Anyone
| familiar with both of them?
| doubledad222 wrote:
| This was a very interesting read. The detail on the exploratory
| process was perfect. Can't wait for part two.
| javier_e06 wrote:
| Yes, RAID is all that. It's interesting to see such established
| technology shine and shine again.
| madars wrote:
| Q about a related use case: Can I use tiered storage (e.g., an SSD
| cache in front of an HDD with, say, dm-cache), or an md-based
| approach like Discord's, to successfully sync an Ethereum node?
| Everyone says "you should get a 2TB SSD" but I'm wondering if I
| can be more future-proof with, say, a 512 GB SSD cache + a much
| larger HDD.
| pas wrote:
| Sure, you might want to look into bcache or bcachefs.
| Nextgrid wrote:
| I'm not familiar with what Ethereum requires for its syncing
| operation.
|
| If it's a sequential write (downloading the entire blockchain),
| you will still be bottlenecked by the throughput of the
| underlying disk.
|
| If it's sequential reads (in between writes), the reads can be
| handled by the cache if the location is close enough to the
| previous write operation that it hasn't been evicted yet.
|
| If it's random, unpredictable reads, it's unlikely a cache will
| help unless the cache is big enough to fit the entire working
| dataset (otherwise you'll get a terrible cache hit rate, as most
| of what you need will have been evicted by then), but then you're
| back at your original problem of needing a huge SSD.
| alberth wrote:
| TL;DR - NAS is slow. RAID 0 is fast.
| wmf wrote:
| This is not an accurate summary of the article.
| [deleted]
| PeterWhittaker wrote:
| I feel like I am missing a step. Do they write to md1 and read
| from md0? Or do they read from md1, and under the covers the read
| is likely fulfilled by md0?
| ahepp wrote:
| Is it necessary to fully replicate the persistent store onto the
| striped SSD array? I admire such a simple solution, but I wonder
| if something like an LRU cache would achieve similar speedups
| while using fewer resources. On the other hand, it could be a
| small cost to pay for a more consistent and predictable workload.
|
| How does md handle a synchronous write in a heterogeneous mirror?
| Does it wait for both devices to be written?
|
| I'm also curious how this solution compares to allocating more
| ram to the servers, and either letting the database software use
| this for caching, or even creating a ramdisk and putting that in
| raid1 with the persistent storage, since the SSDs are being
| treated as volatile anyways. I assume it would be prohibitively
| expensive to replicate the entire persistent store into main
| memory.
|
| I'd also be interested to know how this compares with replacing
| the entire persistent disk / SSD system with zfs over a few SSDs
| (which would also allow snapshotting). Of course, it is probably
| a huge feature to have snapshots integrated into your cloud...
| mikesun wrote:
| > Is it necessary to fully replicate the persistent store onto
| the striped SSD array?
| I admire such a simple solution, but I wonder if something like
| an LRU cache would achieve similar speedups while using fewer
| resources. On the other hand, it could be a small cost to pay
| for a more consistent and predictable workload.
|
| One of the reasons an LRU cache like dm-cache wasn't feasible
| was because we had a higher than acceptable bad sector read
| rate, which would cause a cache like dm-cache to bubble a
| block-device error up to the database. The database would then
| shut itself down when it encountered a disk-level error.
|
| > How does md handle a synchronous write in a heterogeneous
| mirror? Does it wait for both devices to be written?
|
| Yes, md waits for both mirrors to be written.
|
| > I'm also curious how this solution compares to allocating
| more ram to the servers, and either letting the database
| software use this for caching, or even creating a ramdisk and
| putting that in raid1 with the persistent storage, since the
| SSDs are being treated as volatile anyways. I assume it would
| be prohibitively expensive to replicate the entire persistent
| store into main memory.
|
| Yeah, we're talking many terabytes.
|
| > I'd also be interested to know how this compares with
| replacing the entire persistent disk / SSD system with zfs over
| a few SSDs (which would also allow snapshotting). Of course, it
| is probably a huge feature to have snapshots integrated into
| your cloud...
|
| Would love it if we could've used ZFS, but Scylla requires XFS.
| cperciva wrote:
| _4 billion messages sent through the platform by millions of
| people per day_
|
| I wish companies would stop inflating their numbers by citing
| "per day" statistics. 4 billion messages per day is less than
| 50k/second; that sort of transaction volume is well within the
| capabilities of pgsql running on midrange hardware.
| [deleted]
| combyn8tor wrote:
| Is there a template for replying to hackernews posts linking to
| engineering articles?
|
| 1. Cherry-pick a piece of info
| 2. Claim it's not that hard/impressive/large
| 3. Claim it can be done much simpler with <insert
| database/language/hardware>
| zorkian wrote:
| (I work at Discord.)
|
| It's not even the most interesting metric about our systems
| anyway. If we're really going to look at the tech, the
| inflation of those metrics to deliver the service is where the
| work generally is in the system --
|
| * 50k+ QPS (average) for new message inserts
| * 500k+ QPS when you factor in deletes, updates, etc.
| * 3M+ QPS looking at db reads
| * 30M+ QPS looking at the gateway websockets (fanout of
| things happening to online users)
|
| But I hear you, we're conflating some marketing metrics with
| technical metrics; we'll take that feedback for next time.
| cperciva wrote:
| Ideally I'd like to hear about messages per second at the
| 99.99th percentile or something similar. That number says far
| more about how hard it is to service the load than a per-day
| value ever will.
| pixl97 wrote:
| 50k/s doesn't tell us about the burstiness of the messages.
| Most places in the US don't have much traffic at 2AM.
| bearjaws wrote:
| Absolutely not. This is an average; I bet their peak could be
| in the mid hundreds of thousands. Imagine a hype moment in an
| LCS game.
|
| It can in _theory_ work, but the real world would make this the
| most unstable platform of all the messaging platforms.
|
| Just one vacuum would bring this system down, even if it wasn't
| an exclusive lock...
| Also I would be curious how you would implement similar
| functionality to the URL deep linking and image posting/hosting.
|
| Mind you, the answers to these will probably increase average
| message size, which means more write bandwidth.
|
| Some bar-napkin math shows this would be around 180 GiB per
| hour, 24/7, or 4.3 TiB per day.
|
| Unless all messages disappeared within 20 days, you would exceed
| pretty much any reasonable single-server NVMe setup. Also, have
| fun with trim and optimizing NVMe write performance, which is
| also going to diminish as all the drives fail due to write
| wear...
| cperciva wrote:
| _Absolutely not. This is an average, I bet their peak could
| be in the mid hundreds of thousands_
|
| That's my point. Per-second numbers are far more useful than
| per-day numbers.
| marcinzm wrote:
| I mean, they literally list the per-second numbers a couple of
| paragraphs down:
|
| > Our databases were serving around 2 million requests per
| second (in this screenshot.)
|
| I doubt pgsql will have fun on mid-range hardware with 50k
| writes/sec, ~2 million reads/sec, and 4 billion additional rows
| per day with few deletions.
| merb wrote:
| actually 'few deletions' and 'few updates' is basically the
| happy case for postgres. MVCC in an append-only system is
| where it shines. (because you generate way fewer dead tuples,
| thus vacuum is not that big of a problem)
| dist1ll wrote:
| That's only messages _sent_. The article cites 2 million
| queries per second hitting their cluster, which are not served
| by a CDN. Considering latency requirements and burst traffic,
| you're looking at a pretty massive load.
| mbesto wrote:
| What a myopic comment.
|
| > 50k/second
|
| Yes, 50k/second for every second of the day, 24/7/365. Very few
| companies can quote that.
|
| Not to mention:
|
| - Has complex threaded messages
|
| - Geo-redundancies
|
| - Those messages are real-time
|
| - Global user base
|
| - Untold number of features related to messaging (bot
| reactions, ACLs, permissions, privacy, formatting, etc.)
|
| - No/limited downtime, live updates
|
| Discord is technically impressive; not sure why you felt you
| had to diminish that.
| sammy2255 wrote:
| Geo-redundancies? Discord is all in us-east1 on GCP. Unless
| you meant AZ redundancy?
| mbesto wrote:
| > We are running more than 850 voice servers in 13 regions
| (hosted in more than 30 data centers) all over the world.
| This provisioning includes lots of redundancy to handle
| data center failures and DDoS attacks. We use a handful of
| providers and use physical servers in their data centers.
| We just recently added a South Africa region. Thanks to all
| our engineering efforts on both the client and server
| architecture, we are able to serve more than 2.6 million
| concurrent voice users with egress traffic of more than 220
| Gbps (bits-per-second) and 120 Mpps (packets-per-second).
|
| I don't see anything on their messaging specifically; just
| assuming they would have something similar.
|
| https://discord.com/blog/how-discord-handles-two-and-half-mi...
| zorkian wrote:
| Our messaging stack is not currently multi-regional,
| unfortunately. This is in the works, though, but it's a
| fairly significant architectural evolution from where we
| are today.
|
| Data storage is going to be multi-regional soon, but
| that's just from a redundancy/"data is safe in case of
| us-east1 failure" scenario -- we're not yet going to be
| actively serving live user traffic from outside of
| us-east1.
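Checking the bar-napkin math upthread, as a sketch; the ~1 KiB
average stored size per message is an assumption, not a figure from
the article:

    # 4 billion messages/day is from the article; ~1 KiB/message
    # is a guess for the average stored size.
    echo '4 * 10^9 / 86400' | bc                 # ~46296 messages/s average
    echo '4 * 10^9 * 1024 / 2^40' | bc -l        # ~3.7 TiB/day of message bytes
    echo '4 * 10^9 * 1024 / 2^30 / 24' | bc -l   # ~159 GiB/hour

That lands a bit under the 180 GiB/hour figure above, which is
consistent with an average message slightly over 1 KiB.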
| t0mas88 wrote:
| Clever trick. Having dealt with very similar things using
| Cassandra, I'm curious how this setup will react to a failure of
| a local NVMe disk.
|
| They say that GCP will kill the whole node, which is probably a
| good thing if you can be sure it does that quickly and
| consistently.
|
| If it doesn't (or not fast enough), you'll have a slow node
| amongst faster ones, creating a big hotspot in your database.
| Cassandra doesn't work very well if that happens, and in early
| versions I remember some cascading effects when a few nodes had
| slowdowns.
| TheGuyWhoCodes wrote:
| That's always a risk when using local drives and needing to
| rebuild when a node dies, but I guess they can over-provision
| in case of a single node failure in the cluster until the cache
| is warmed up.
|
| Edit: Just wanted to add that because they are using Persistent
| Disks as the source of truth, and depending on the network
| bandwidth, it might not be that big of a problem to restore a
| node to a working state if it's using a quorum for reads and RF
| >= 3.
|
| Restoring a node from zero in case of disk failure will always
| be bad.
|
| They could also have another caching layer on top of the
| cluster to further mitigate the latency issue until the node
| gets back to health and finishes all the hinted handoffs.
| jhgg wrote:
| We have two ways of re-building a node under this setup.
|
| We can either re-build the node by simply wiping its disks
| and letting it stream in data from other replicas, or we can
| re-build by simply re-syncing the pd-ssd to the nvme.
|
| Node failure is a regular occurrence; it isn't a "bad" thing,
| and it's something we intend to fully automate. Nodes should
| be able to fail and recover without anyone noticing.
| mikesun wrote:
| That's a good observation. We've spent a lot of time on our
| control plane, which handles the various RAID1 failure modes;
| e.g., when a RAID1 degrades due to a failed local SSD, we
| force-stop the node so that it doesn't continue to operate as
| a slow node. Wait for part 2! :)
___________________________________________________________________
(page generated 2022-08-15 23:00 UTC)