[HN Gopher] Thread-Per-Core Buffer Management for a modern stora...
___________________________________________________________________
 
Thread-Per-Core Buffer Management for a modern storage system
 
Author : arjunnarayan
Score  : 61 points
Date   : 2020-12-12 18:34 UTC (4 hours ago)
 
(HTM) web link (vectorized.io)
(TXT) w3m dump (vectorized.io)
 
  | dotnwat wrote:
  | Noah here, developer at Vectorized. Happy to answer any
  | questions.
  | monstrado wrote:
  | What kind of gaps are there currently between Kafka and
  | RedPanda, particularly in the "Enterprise" world (e.g.,
  | security)?
  | dotnwat wrote:
  | In progress right now are transactions and ACLs. A
  | preliminary form of ACLs with SCRAM will be available later
  | this month. Transactions will come early next year. Those are
  | probably the most visible differences.
  | monstrado wrote:
  | That's great! Would it be correct to assume that KSQL works?
  | dotnwat wrote:
  | Not having tried it, I would expect KSQL not to work until
  | transaction support lands. That said, perhaps there are some
  | ways to configure it to avoid dependencies on those
  | underlying APIs.
  | carllerche wrote:
  | The article called out 500usec as an upper bound for compute.
  | How do you handle heavier compute operations (TLS, encoding /
  | decoding, ...)?
  | agallego wrote:
  | On seastar, you yield a lot. So loops go from
  | 
  |     for (... : collection) {}
  | 
  | to
  | 
  |     return seastar::do_for_each(collection, callback);
  | mirekrusin wrote:
  | What is the point of talking performance-by-thread-per-core
  | if raft sits in front of it, i.e., only one will do the work
  | at any time anyway?
  | agallego wrote:
  | So redpanda partitions 'raft' groups per kafka partition. So
  | in the `topic/partition` model every partition is its own
  | raft group (similar to multi-raft in cockroachdb). So it is
  | in fact even more important due to the replication cost and
  | therefore the additional work of checksumming, compression,
  | etc.
  | 
  | Last, a coordinator core for one of the TCP connections from
  | a client will likely make requests to remote cores (say you
  | receive a request on core 44, but the destination is core
  | 66), so having a thread per core with explicit message
  | passing is pretty fundamental. Here is some code that shows
  | the importance of _accounting the x-core comms explicitly_:
  | 
  |     ss::future<std::vector<append_entries_reply>>
  |     dispatch_hbeats_to_core(ss::shard_id shard,
  |                             hbeats_ptr requests) {
  |         return with_scheduling_group(
  |           get_scheduling_group(),
  |           [this, shard, r = std::move(requests)]() mutable {
  |               return _group_manager.invoke_on(
  |                 shard,
  |                 get_smp_service_group(),
  |                 [this, r = std::move(r)](
  |                   ConsensusManager& m) mutable {
  |                     return dispatch_hbeats_to_groups(
  |                       m, std::move(r));
  |                 });
  |           });
  |     }
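  | 
  | If it helps, the bare-bones shape of that pattern - an
  | illustrative sketch, not our production code; `request`,
  | `reply`, and `process_on_owner` are made-up names:
  | 
  |     #include <functional>
  |     #include <seastar/core/future.hh>
  |     #include <seastar/core/smp.hh>
  |     
  |     struct reply { /* response payload */ };
  |     struct request { int partition; /* request payload */ };
  |     
  |     // runs on the core that owns the partition
  |     seastar::future<reply> process_on_owner(request r);
  |     
  |     // Hash the partition id to its owning shard, then send
  |     // the work there as an explicit message. No shared
  |     // state, no locks: the lambda runs on the owner core.
  |     seastar::future<reply> dispatch(request r) {
  |         const unsigned owner =
  |           std::hash<int>{}(r.partition) % seastar::smp::count;
  |         return seastar::smp::submit_to(
  |           owner, [r = std::move(r)]() mutable {
  |               return process_on_owner(std::move(r));
  |           });
  |     }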
  | mirekrusin wrote:
  | Ok, thanks. Does redpanda do some kind of auto anti-affinity
  | on hosts for partition groups to spread across remote cores?
  | 
  | ps. the redpanda link from the article is broken, it goes to
  | https://vectorized.io/blog/tpc-buffers/vectorized.io/redpand...
  | which 404s.
  | agallego wrote:
  | Oh shoot! Thank you... fixing the link, give me 5 mins.
  | 
  | So currently the partition allocator -
  | https://github.com/vectorizedio/redpanda/blob/dev/src/v/clus...
  | - is primitive.
  | 
  | But we have a working-but-not-yet-exposed HTTP admin API on
  | the controller that allows for out-of-band placement.
  | 
  | So the mechanics are there, but not yet integrated w/ the
  | partition allocator.
  | 
  | Thinking that we integrate w/ k8s more deeply next year. The
  | thinking at least is that at install we generate some machine
  | labels, say in /etc/redpanda/labels.json or something like
  | that, and then the partition allocator can take simple
  | constraints.
  | 
  | I worked on a few schedulers for www.concord.io with Fenzo on
  | top of mesos 6 years ago and this worked nicely for
  | 'affinity', 'soft affinity', and anti-affinity constraints.
  | 
  | Do you have any thoughts on how you'd like this exposed?
  | zinclozenge wrote:
  | If anybody's interested, there's a Seastar-inspired library
  | for Rust being developed:
  | https://github.com/DataDog/glommio
  | bob1029 wrote:
  | More threads (i.e., shared state) are a huge mistake if you
  | are trying to maintain a storage subsystem with synchronous
  | access semantics.
  | 
  | I am starting to think you can handle all storage requests
  | for a single logical node on just one core/thread. I have
  | been pushing 5~10 million JSON-serialized entities to disk
  | per second with a single managed thread in .NET Core (using a
  | Samsung 970 Pro for testing). This _includes_ indexing and
  | sequential integer key assignment. This testing will
  | completely saturate the drive (over 1 gigabyte per second
  | steady-state). Just incrementing a 64-bit integer over a
  | million times per second across an arbitrary number of
  | threads is a big ask. This is the difference you can see when
  | you double down on single-threaded ideology for this type of
  | problem domain.
  | 
  | The technical trick to my success is to run all of the
  | database operations in micro-batches (10~1000 microseconds
  | each). I use the LMAX Disruptor, so the batches are formed
  | naturally based on throughput conditions. Selecting data
  | structures and algorithms that work well in this type of
  | setup is critical. Append-only is a must with flash and makes
  | orders of magnitude difference in performance. Everything
  | else (b-tree algorithms, etc.) follows from this realization.
  | 
  | Put another way, if you find yourself using Task or
  | async/await primitives when trying to talk to something as
  | fast as NVMe flash, you need to rethink your approach. The
  | overhead of multiple threads, task-parallel abstractions, et
  | al. is going to cripple any notion of high throughput in a
  | synchronous storage domain.
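  | 
  | In rough C++-ish pseudocode, the shape of it is below - a
  | deliberately simplified sketch where a mutex-guarded queue
  | stands in for the Disruptor ring buffer, and Entity,
  | assign_keys, append_to_log, and update_indexes are
  | placeholders:
  | 
  |     #include <condition_variable>
  |     #include <mutex>
  |     #include <vector>
  |     
  |     struct Entity { /* your serialized record */ };
  |     
  |     void assign_keys(std::vector<Entity>&);
  |     void append_to_log(const std::vector<Entity>&);
  |     void update_indexes(const std::vector<Entity>&);
  |     
  |     struct SingleWriterEngine {
  |         std::mutex m;
  |         std::condition_variable cv;
  |         std::vector<Entity> pending;
  |     
  |         // Any thread may submit; contention is a brief
  |         // enqueue, never an fsync or an index update.
  |         void submit(Entity e) {
  |             {
  |                 std::lock_guard<std::mutex> g(m);
  |                 pending.push_back(std::move(e));
  |             }
  |             cv.notify_one();
  |         }
  |     
  |         // The single writer thread drains whatever has
  |         // accumulated, so batch size adapts to load.
  |         void writer_loop() {
  |             std::vector<Entity> batch;
  |             for (;;) {
  |                 {
  |                     std::unique_lock<std::mutex> g(m);
  |                     cv.wait(g, [&] { return !pending.empty(); });
  |                     batch.swap(pending);
  |                 }
  |                 assign_keys(batch);    // plain ++, no CAS
  |                 append_to_log(batch);  // one write per batch
  |                 update_indexes(batch);
  |                 batch.clear();
  |             }
  |         }
  |     };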
  | wmf wrote:
  | They're using shared-nothing between threads, so shared state
  | isn't a problem. It sounds like the Redpanda architecture is
  | almost the same as what you're talking about.
  | bob1029 wrote:
  | Yes, I am seeing some similarities in the threading model for
  | sure.
  | 
  | That said, there are a lot of other simultaneous
  | considerations at play when we are talking about punching
  | through business-entity storage rates that exceed the rated
  | IOPS capacity of the underlying storage medium. My NVMe test
  | drive can only push ~500k write IOPS in the most ideal case,
  | but I am able to write several million logical entities
  | across those operations due to batching/single-writer
  | effects.
  | agallego wrote:
  | Depending on the disks, we found that different IO sizes
  | yield different optimal settings for saturating them. As an
  | anecdote, in clouds IOPS is the key constraint, and often you
  | can drive higher throughput depending on your DMA block size
  | (i.e., 128KB vs 16KB)... obviously a tradeoff against memory
  | pressure, but you can test them all easily w/ `fio`.
  | wskinner wrote:
  | I learned the same thing while writing a log-structured merge
  | tree. Single-threaded writes are a must - not only for
  | performance but also for simplicity of implementation.
  | 
  | I'm curious what about your use case required implementing
  | your own storage subsystem rather than using an embedded
  | key-value store like RocksDB.
  | jandrewrogers wrote:
  | RocksDB has two big limitations that preclude its use for
  | many types of high-performance data infrastructure (which it
  | sounds like the OP's use case was). First, its throughput
  | performance is _much_ worse (by an integer factor) than what
  | can be achieved with a different design for some
  | applications. Second, it isn't designed to work well for very
  | large storage volumes. Again, easy to remedy if you design
  | your own storage engine or use an alternative one. There are
  | storage engines that will happily drive a petabyte of storage
  | across a large array of NVMe devices at the theoretical
  | limits of the hardware, though not so much in open source.
  | 
  | Another thing to consider is that you lose significant
  | performance in a few different dimensions if your storage I/O
  | scheduler design is not tightly coupled to your execution
  | scheduler design. While it requires writing more code, it
  | also eliminates a bunch of rough edges. This alone is the
  | reason many database-y applications write their own storage
  | engines. For people that do it for a living, writing an
  | excellent custom storage engine isn't that onerous.
  | 
  | RocksDB is a fine choice for applications where performance
  | and scale are not paramount or your hardware is limited. On
  | large servers with hefty workloads, you'll probably want to
  | use something else.
  | agallego wrote:
  | Indeed. I think you have different saturation points the
  | wider the set of use cases you hit. One example w/ a single
  | core (which, btw, I agree with wholeheartedly for IO) is
  | checksumming + decoding.
  | 
  | For kafka, we have multiple indexes - a time index and an
  | offset index, which are simple metadata. The trouble becomes
  | how you handle decompression+checksumming+compression for
  | supporting compacted topics. (
  | https://github.com/vectorizedio/redpanda/blob/dev/src/v/stor...
  | )
  | 
  | So a single core starts to get saturated while doing both
  | foreground and background requests.
  | 
  | .....
  | 
  | Now assume that you handle that with correct priorities for
  | IO and CPU scheduling... the next bottleneck will be keeping
  | up w/ background tasks.
  | 
  | So then you start to add more threads. But as you mentioned,
  | and as I tried to highlight in the article, the cost of
  | implicit or simple synchronization is very high (as your
  | intuition suggests).
  | 
  | The thread-per-core buffer management with deferred
  | destructors is really handy at doing 3 things explicitly:
  | 
  | 1. your cross-core communication is _explicit_ - that is, you
  | give it shares as part of a quota, so you understand how your
  | system priorities are working across the system for any kind
  | of workload. This is helpful for prioritizing foreground and
  | background work.
  | 
  | 2. there is effectively a const memory address once you parse
  | it - so you treat it as largely immutable and you can add
  | hooks (say, crash if it is modified on a remote core).
  | 
  | 3. it makes memory accounting fast, i.e., instead of pushing
  | a global barrier for the allocator you simply send a message
  | back to the originating core for allocator accounting. This
  | becomes hugely important as you start to increase the number
  | of cores.
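  | 
  | For a flavor of (2) and (3): seastar's `foreign_ptr` frees
  | the pointee on the core that allocated it. A bare-bones
  | sketch - illustrative only, `handle` is a made-up function:
  | 
  |     #include <memory>
  |     #include <seastar/core/sharded.hh>
  |     #include <seastar/core/smp.hh>
  |     #include <seastar/core/temporary_buffer.hh>
  |     
  |     // placeholder for whatever work the remote core does
  |     seastar::future<> handle(const char* data, std::size_t n);
  |     
  |     using buf_ptr = seastar::foreign_ptr<
  |       std::unique_ptr<seastar::temporary_buffer<char>>>;
  |     
  |     // Hand a parsed buffer to a remote core read-only. When
  |     // buf is destroyed on the remote core, foreign_ptr sends
  |     // the deallocation back to the originating core instead
  |     // of taking a global allocator lock.
  |     seastar::future<> process_on(unsigned dest, buf_ptr buf) {
  |         return seastar::smp::submit_to(
  |           dest, [buf = std::move(buf)]() mutable {
  |               // treat *buf as immutable on this core
  |               return handle(buf->get(), buf->size());
  |           });
  |     }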
  | user5994461 wrote:
  | >>> the trouble becomes how you handle
  | decompression+checksumming+compression
  | 
  | gzip will cap at 1 MB/s with the strongest compression
  | setting and 50 MB/s with the fastest setting, which is really
  | slow.
  | 
  | The first step to improve kafka is for kafka to adopt zstd
  | compression.
  | 
  | Another thing that really hurts is SSL. A desktop CPU with
  | AES instructions can push 1 GB/s, so it's not too bad, but
  | that may not be the CPU you have or the default algorithm
  | used by the software.
  | agallego wrote:
  | Kafka _has_ `zstd` encoding.
  | 
  | Here is our version of the streaming decoder I wrote a while
  | ago:
  | https://github.com/vectorizedio/redpanda/blob/dev/src/v/comp...
  | 
  | That's our default for our internal RPC as well.
  | 
  | In fact the kafka protocol supports lz4, zstd, snappy, and
  | gzip - all of them - and you can change them per batch.
  | Compression support is good w/ kafka.
  | loeg wrote:
  | lz4 is a good option for really high-performance compression
  | as well. (Zstd is my general recommendation, and both beat
  | the pants off of gzip, but for very high throughput
  | applications lz4 still beats zstd. Both are designs from Yann
  | Collet.)
  | agallego wrote:
  | Indeed. Though the recent zstd changes w/ different levels of
  | compression sort of close the gap in perf that lz4 had over
  | zstd. (If interested in this kind of detail for a new
  | streaming storage engine, I gave a talk last week at the
  | Facebook Performance Summit -
  | https://twitter.com/perfsummit1/status/1337603028677902336)
___________________________________________________________________
(page generated 2020-12-12 23:00 UTC)