[HN Gopher] Thread-Per-Core Buffer Management for a modern stora...
       ___________________________________________________________________
        
       Thread-Per-Core Buffer Management for a modern storage system
        
       Author : arjunnarayan
       Score  : 61 points
       Date   : 2020-12-12 18:34 UTC (4 hours ago)
        
 (HTM) web link (vectorized.io)
 (TXT) w3m dump (vectorized.io)
        
       | dotnwat wrote:
       | Noah here, developer at Vectorized. Happy to answer any
       | questions.
        
         | monstrado wrote:
         | What kind of gaps are there currently between Kafka and
         | RedPanda? Particularly in the "Enterprise" world (e.g.
         | security, etc).
        
           | dotnwat wrote:
           | In progress right now are transactions and ACLs. A
           | preliminary form of ACLs with SCRAM will be available later
           | this month. Transactions will come early next year. Those are
           | probably the most visible differences.
        
             | monstrado wrote:
             | That's great! Would it be correct to assume that KSQL
             | works?
        
               | dotnwat wrote:
                | Not having tried it, I would expect ksql not to work
                | until transaction support lands. That said, perhaps
                | there are some ways to configure it to avoid
                | dependencies on those underlying APIs.
        
         | carllerche wrote:
         | The article called out 500usec as an upper bound for compute.
          | How do you handle heavier compute operations (TLS, encoding /
          | decoding, ...)?
        
           | agallego wrote:
            | On Seastar, you yield a lot, so loops go from
            | 
            |     for (auto& x : collection) { ... }
            | 
            | to
            | 
            |     return seastar::do_for_each(collection, callback);
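            | 
            | As a minimal sketch (assuming Seastar's future-util header
            | and a hypothetical compute_crc helper), a long checksum
            | loop would become:
            | 
            |     #include <seastar/core/future.hh>
            |     #include <seastar/core/future-util.hh>  // seastar::do_for_each
            |     #include <cstdint>
            |     #include <vector>
            | 
            |     struct buffer { uint32_t crc = 0; /* payload... */ };
            |     uint32_t compute_crc(const buffer&);  // hypothetical helper
            | 
            |     // Each iteration stays well under the 500usec budget;
            |     // the reactor may preempt between elements instead of
            |     // blocking the core for the whole collection.
            |     seastar::future<> checksum_all(std::vector<buffer>& batch) {
            |         return seastar::do_for_each(batch, [](buffer& b) {
            |             b.crc = compute_crc(b);
            |             return seastar::make_ready_future<>();
            |         });
            |     }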
        
       | mirekrusin wrote:
        | What is the point of talking about performance via
        | thread-per-core if raft sits in front of it, i.e. only one
        | node will do the work at any time anyway?
        
         | agallego wrote:
          | So redpanda partitions 'raft' groups per kafka partition: in
          | the `topic/partition` model every partition is its own raft
          | group (similar to multi-raft in CockroachDB). So it is in
          | fact even more important due to the replication cost and
          | therefore the additional work of checksumming, compression,
          | etc.
          | 
          | Last, a coordinator core for one of the TCP connections from
          | a client will likely make requests to remote cores (say you
          | receive a request on core 44, but the destination is core
          | 66), so having a thread per core with explicit message
          | passing is pretty fundamental.
          |     ss::future<std::vector<append_entries_reply>>
          |     dispatch_hbeats_to_core(ss::shard_id shard, hbeats_ptr requests) {
          |         return with_scheduling_group(
          |           get_scheduling_group(),
          |           [this, shard, r = std::move(requests)]() mutable {
          |               return _group_manager.invoke_on(
          |                 shard,
          |                 get_smp_service_group(),
          |                 [this, r = std::move(r)](ConsensusManager& m) mutable {
          |                     return dispatch_hbeats_to_groups(m, std::move(r));
          |                 });
          |           });
          |     }
         | 
          | The code above shows the importance of _accounting for the
          | x-core comms explicitly_.
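          | 
          | For readers unfamiliar with Seastar: the primitive under
          | invoke_on is smp::submit_to, which turns a cross-core call
          | into an explicit message plus a future. A minimal sketch,
          | with a hypothetical per-core counter rather than Redpanda
          | code:
          | 
          |     #include <seastar/core/smp.hh>     // seastar::smp::submit_to
          |     #include <seastar/core/future.hh>
          |     #include <cstdint>
          | 
          |     // Hypothetical per-core state; each shard owns its own copy.
          |     static thread_local uint64_t local_ops = 0;
          | 
          |     // Instead of reading remote memory, send the work to the
          |     // owning core and get the answer back as a future.
          |     seastar::future<uint64_t> read_ops_on(seastar::shard_id target) {
          |         return seastar::smp::submit_to(target, [] {
          |             return local_ops;  // runs on the target core
          |         });
          |     }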
        
           | mirekrusin wrote:
            | Ok, thanks. Does redpanda do some kind of auto
            | anti-affinity on hosts for partition groups, to spread
            | them across remote cores?
           | 
           | ps. redpanda link from article is broken, goes to
           | https://vectorized.io/blog/tpc-
           | buffers/vectorized.io/redpand... 404
        
             | agallego wrote:
              | Oh shoot! Thank you... fixing the link, give me 5 mins.
             | 
             | So currently the partition allocator - https://github.com/v
             | ectorizedio/redpanda/blob/dev/src/v/clus... - is primitive.
             | 
             | But we have a working-not-yet-exposed HTTP admin api on the
             | controller that allows for Out Of Band placement.
             | 
             | so the mechanics are there, but not yet integrated w/ the
             | partition allocator.
             | 
             | Thinking that we integrate w/ k8s more deeply next year.
             | 
              | The thinking, at least, is that at install we generate
              | some machine labels, say in /etc/redpanda/labels.json or
              | something like that, and then the partition allocator
              | can take simple constraints.
             | 
              | I worked on a few schedulers for www.concord.io with
              | Fenzo on top of Mesos 6 years ago, and this worked
              | nicely for 'affinity', 'soft affinity', and
              | 'anti-affinity' constraints.
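              | 
              | To illustrate the kind of simple constraint I mean (a
              | hypothetical sketch, not the real allocator), an
              | anti-affinity check over those machine labels could be
              | as small as:
              | 
              |     #include <string>
              |     #include <unordered_set>
              | 
              |     // Hypothetical: labels loaded from /etc/redpanda/labels.json
              |     struct node {
              |         std::string id;
              |         std::unordered_set<std::string> labels;
              |     };
              | 
              |     // Anti-affinity: reject a node whose labels overlap the
              |     // labels already used by the partition's replicas.
              |     bool violates_anti_affinity(
              |       const node& n, const std::unordered_set<std::string>& used) {
              |         for (const auto& l : n.labels) {
              |             if (used.count(l)) return true;
              |         }
              |         return false;
              |     }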
             | 
             | Do you have any thoughts on how you'd like this exposed?
        
       | zinclozenge wrote:
        | If anybody's interested, there's a Seastar-inspired library
        | for Rust being developed: https://github.com/DataDog/glommio
        
       | bob1029 wrote:
        | Adding threads (i.e. shared state) is a huge mistake if you
        | are trying to maintain a storage subsystem with synchronous
        | access semantics.
       | 
       | I am starting to think you can handle all storage requests for a
       | single logical node on just one core/thread. I have been pushing
       | 5~10 million JSON-serialized entities to disk per second with a
       | single managed thread in .NET Core (using a Samsung 970 Pro for
       | testing). This _includes_ indexing and sequential integer key
       | assignment. This testing will completely saturate the drive (over
       | 1 gigabyte per second steady-state). Just getting an increment of
       | a 64 bit integer over a million times per second across an
       | arbitrary number of threads is a big ask. This is the difference
       | you can see when you double down on single threaded ideology for
       | this type of problem domain.
       | 
       | The technical trick to my success is to run all of the database
       | operations in micro batches (10~1000 microseconds per). I use
       | LMAX Disruptor, so the batches are formed naturally based on
       | throughput conditions. Selecting data structures and algorithms
       | that work well in this type of setup is critical. Append-only is
       | a must with flash and makes orders of magnitude difference in
       | performance. Everything else (b-tree algorithms, etc) follows
       | from this realization.
       | 
        | Put another way, if you find yourself using Task or
        | async/await primitives when trying to talk to something as
        | fast as NVMe flash, you need to rethink your approach. The
        | overhead of multiple threads, task-parallel abstractions, et
        | al. is going to cripple any notion of high throughput in a
        | synchronous storage domain.
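        | 
        | A minimal C++ sketch of the micro-batching idea (a mutex and
        | condvar stand in for the Disruptor's lock-free ring, and the
        | .NET specifics are omitted):
        | 
        |     #include <condition_variable>
        |     #include <mutex>
        |     #include <queue>
        |     #include <string>
        |     #include <vector>
        | 
        |     std::mutex mu;
        |     std::condition_variable cv;
        |     std::queue<std::string> pending;  // serialized entities
        | 
        |     // Hypothetical: appends the whole batch with one write call.
        |     void append_batch(const std::vector<std::string>& batch);
        | 
        |     // Single writer thread: drain whatever has accumulated,
        |     // persist it as one append-only write, repeat. Batch size
        |     // adapts naturally to throughput.
        |     void writer_loop() {
        |         for (;;) {
        |             std::vector<std::string> batch;
        |             {
        |                 std::unique_lock<std::mutex> lk(mu);
        |                 cv.wait(lk, [] { return !pending.empty(); });
        |                 while (!pending.empty()) {
        |                     batch.push_back(std::move(pending.front()));
        |                     pending.pop();
        |                 }
        |             }
        |             append_batch(batch);  // many entities, one IO
        |         }
        |     }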
        
         | wmf wrote:
         | They're using shared-nothing between threads so shared state
         | isn't a problem. It sounds like the Redpanda architecture is
         | almost the same as what you're talking about.
        
           | bob1029 wrote:
           | Yes I am seeing some similarities in threading model for
           | sure.
           | 
           | That said, there are a lot of other simultaneous
           | considerations at play when we are talking about punching
           | through business entity storage rates that exceed the rated
           | IOPS capacity of the underlying storage medium. My NVMe test
           | drive can only push ~500k write IOPS in the most ideal case,
           | but I am able to write several million logical entities
           | across those operations due to batching/single-writer
           | effects.
        
             | agallego wrote:
              | So depending on the disks, we found that different IO
              | sizes will yield optimal saturation. As an anecdote, in
              | clouds, IOPS is the key constraint, and often you can
              | drive higher throughput depending on your DMA block size
              | (i.e.: 128KB vs 16KB etc)... obviously a tradeoff on the
              | memory pressure. But you can test them all easily w/
              | `fio`.
        
         | wskinner wrote:
         | I learned the same thing while writing a log structured merge
         | tree. Single threaded writes are a must - not only for
         | performance but also simplicity of implementation.
         | 
          | I'm curious what about your use case required implementing
          | your own storage subsystem rather than using an embedded
          | key-value store like RocksDB.
        
           | jandrewrogers wrote:
           | RocksDB has two big limitations that preclude its use for
           | many types of high-performance data infrastructure (which it
           | sounds like the OP's use case was). First, its throughput
           | performance is _much_ worse (integer factor) than what can be
           | achieved with a different design for some applications.
            | Second, it isn't designed to work well for very large
           | storage volumes. Again, easy to remedy if you design your own
           | storage engine or use an alternative one. There are storage
           | engines that will happily drive a petabyte of storage across
           | a large array of NVMe devices at the theoretical limits of
           | the hardware, though not so much in open source.
           | 
           | Another thing to consider is that you lose significant
           | performance in a few different dimensions if your storage I/O
           | scheduler design is not tightly coupled to your execution
           | scheduler design. While it requires writing more code it also
           | eliminates a bunch of rough edges. This alone is the reason
           | many database-y applications write their own storage engines.
           | For people that do it for a living, writing an excellent
           | custom storage engine isn't that onerous.
           | 
           | RocksDB is a fine choice for applications where performance
           | and scale are not paramount or your hardware is limited. On
           | large servers with hefty workloads, you'll probably want to
           | use something else.
        
         | agallego wrote:
          | Indeed. I think you have different saturation points the
          | wider the use cases you hit. One example w/ a single core
          | (which, btw, I agree with wholeheartedly for IO) is
          | checksumming + decoding.
         | 
          | For Kafka, we have multiple indexes - a time index and an
          | offset index - which are simple metadata. The trouble
          | becomes how you handle
          | decompression+checksumming+compression for supporting
          | compacted topics. (
          | https://github.com/vectorizedio/redpanda/blob/dev/src/v/stor...
          | )
         | 
          | So a single core starts to get saturated while doing both
          | foreground and background requests.
         | 
         | .....
         | 
          | Now assume that you handle that with correct priorities for
          | IO and CPU scheduling... the next bottleneck will be keeping
          | up w/ background tasks.
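          | 
          | For the Seastar-curious, a minimal sketch of what those
          | priorities look like as scheduling groups with shares (the
          | group names and share values here are made up):
          | 
          |     #include <seastar/core/future.hh>
          |     #include <seastar/core/scheduling.hh>   // create_scheduling_group
          |     #include <seastar/core/future-util.hh>  // with_scheduling_group
          | 
          |     // Foreground gets ~4x the shares of background compaction,
          |     // so clients stay responsive while housekeeping still runs.
          |     seastar::future<> setup_groups() {
          |         return seastar::create_scheduling_group("produce", 800)
          |           .then([](seastar::scheduling_group produce_sg) {
          |               return seastar::create_scheduling_group("compaction", 200)
          |                 .then([](seastar::scheduling_group compact_sg) {
          |                     // later: with_scheduling_group(compact_sg, ...)
          |                     return seastar::make_ready_future<>();
          |                 });
          |           });
          |     }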
         | 
          | So then you start to add more threads. But as you mentioned,
          | and as I tried to highlight in the article, the cost of
          | implicit or simple synchronization is very high (as your
          | intuition suggests).
         | 
          | The thread-per-core buffer management with deferred
          | destructors is really handy; it does 3 things explicitly:
         | 
          | 1. your cross-core communication is _explicit_ - that is,
          | you give it shares as part of a quota so that you understand
          | how your priorities are working across the system for any
          | kind of workload. This is helpful to prioritize foreground
          | and background work.
         | 
          | 2. there is effectively a const memory address once you
          | parse it - so you treat it as largely immutable and you can
          | add hooks (say, crash if modified on a remote core)
         | 
          | 3. it makes memory accounting fast, i.e.: instead of pushing
          | a global barrier for the allocator you simply send a message
          | back to the originating core for allocator accounting. This
          | becomes hugely important as you start to increase the number
          | of cores. (See the sketch below.)
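          | 
          | A sketch of point 3 - this is the idea behind Seastar's
          | foreign_ptr, simplified here with a custom deleter (and
          | assuming seastar::this_shard_id()):
          | 
          |     #include <seastar/core/smp.hh>  // smp::submit_to, this_shard_id
          |     #include <cstddef>
          |     #include <memory>
          | 
          |     // Free the buffer on the core that allocated it, so the
          |     // allocator's accounting stays core-local - no global barrier.
          |     struct return_to_owner {
          |         seastar::shard_id owner;
          |         void operator()(char* p) const {
          |             // real code would track the returned future
          |             (void)seastar::smp::submit_to(owner, [p] { delete[] p; });
          |         }
          |     };
          | 
          |     using xcore_buf = std::unique_ptr<char[], return_to_owner>;
          | 
          |     xcore_buf make_buf(size_t n) {
          |         return xcore_buf(new char[n],
          |                          return_to_owner{seastar::this_shard_id()});
          |     }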
        
           | user5994461 wrote:
           | >>> the trouble becomes on how you handle
           | decompression+checksumming+compression
           | 
            | gzip caps at around 1 MB/s with the strongest compression
            | setting and 50 MB/s with the fastest setting, which is
            | really slow.
           | 
           | The first step to improve kafka is for kafka to adopt zstd
           | compression.
           | 
            | Another thing that really hurts is SSL. A desktop CPU with
            | AES instructions can push 1 GB/s, so it's not too bad, but
            | that may not be the CPU you have or the default algorithm
            | used by the software.
        
             | agallego wrote:
             | Kafka _has_ `zstd` encoding.
             | 
              | Here is our version of the streaming decoder I wrote a
              | while ago: https://github.com/vectorizedio/redpanda/blob/dev
              | /src/v/comp...
             | 
             | that's our default for our internal RPC as well.
             | 
              | In fact the Kafka protocol supports lz4, zstd, snappy,
              | and gzip - all of them - and you can change them per
              | batch. Compression is good w/ Kafka.
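              | 
              | Not our decoder, but for reference, a streaming decode
              | with plain libzstd has this shape (error handling
              | trimmed):
              | 
              |     #include <zstd.h>
              |     #include <vector>
              | 
              |     // Feed compressed bytes through a ZSTD_DStream chunk by
              |     // chunk, without materializing the whole payload at once.
              |     std::vector<char> decompress_stream(const char* src, size_t len) {
              |         std::vector<char> out;
              |         ZSTD_DStream* ds = ZSTD_createDStream();
              |         ZSTD_initDStream(ds);
              |         ZSTD_inBuffer in{src, len, 0};
              |         std::vector<char> chunk(ZSTD_DStreamOutSize());
              |         while (in.pos < in.size) {
              |             ZSTD_outBuffer ob{chunk.data(), chunk.size(), 0};
              |             size_t rc = ZSTD_decompressStream(ds, &ob, &in);
              |             if (ZSTD_isError(rc)) break;  // real code: report error
              |             out.insert(out.end(), chunk.data(), chunk.data() + ob.pos);
              |         }
              |         ZSTD_freeDStream(ds);
              |         return out;
              |     }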
        
             | loeg wrote:
              | lz4 is a good option for really high-performance
              | compression as well. (Zstd is my general recommendation,
              | and both beat the pants off of gzip, but for very high
              | throughput applications lz4 still beats zstd. Both are
              | designs from Yann Collet.)
        
               | agallego wrote:
                | Indeed. Though the recent zstd changes w/ different
                | levels of compression sort of close the gap in perf
                | that lz4 had over zstd. (If interested in this kind of
                | detail for a new streaming storage engine, I gave a
                | talk last week at the Facebook Performance Summit -
                | https://twitter.com/perfsummit1/status/1337603028677902336)
        
       ___________________________________________________________________
       (page generated 2020-12-12 23:00 UTC)