[HN Gopher] Scaling Kafka at Honeycomb
       ___________________________________________________________________
        
       Scaling Kafka at Honeycomb
        
       Author : i0exception
       Score  : 72 points
       Date   : 2021-11-30 19:25 UTC (3 hours ago)
        
 (HTM) web link (www.honeycomb.io)
 (TXT) w3m dump (www.honeycomb.io)
        
       | rubiquity wrote:
        | I've never used Kafka, but this post is yet another hard-earned
        | lesson that in log replication systems, storage tiering belongs
        | much higher on the hierarchy of needs than horizontal scaling of
        | individual logs/topics/streams. In my experience, by the time you
        | need storage tiering, something awful is already happening.
       | 
       | During network partitions or other scenarios where your disks are
       | filling up quickly it's much easier to reason about how to get
       | your log healthy by aggressively offloading to tiered storage and
       | trimming than it is to re-partition (read: reconfigure), which
       | often requires writes to some consensus-backed metadata store,
       | which is also likely experiencing its own issues at that time.
       | 
       | Another great benefit of storage tiering is that you can
       | externally communicate a shorter data retention period than you
       | actually have in practice, while you really put your recovery and
       | replay systems through their paces to get the confidence you
       | need. Tiered storage can also be a great place to bootstrap new
       | nodes from.
        
       | StreamBright wrote:
       | >> Historically, our business requirements have meant keeping a
       | buffer of 24 to 48 hours of data to guard against the risk of a
       | bug in retriever corrupting customer data.
       | 
        | I have used much larger buffers before. Some bugs can lurk
        | around for a while before being noticed. The lack of something,
        | for example, is much harder to notice.
        
         | lizthegrey wrote:
          | Yes -- now that we're just paying the cost to store once on S3
          | rather than 3x on NVMe, we can plausibly extend the window to
          | 72 hours or longer! Before, it was a pragmatic, constraint-
          | driven compromise.
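          | 
          | (Back-of-envelope sketch of that tradeoff, with made-up ingest
          | and price numbers; real AWS pricing varies by region and over
          | time:)
          | 
          |   retention_hours = 72
          |   ingest_gb_per_hour = 100.0      # hypothetical ingest rate
          |   replication_factor = 3          # Kafka replicas on local disk
          |   local_price_gb_month = 0.10     # assumed NVMe/EBS $/GB-month
          |   s3_price_gb_month = 0.023       # assumed S3 standard $/GB-month
          | 
          |   buffered_gb = retention_hours * ingest_gb_per_hour
          |   local_cost = buffered_gb * replication_factor * local_price_gb_month
          |   tiered_cost = buffered_gb * s3_price_gb_month
          |   print(f"local 3x: ${local_cost:,.0f}/mo vs S3 1x: ${tiered_cost:,.0f}/mo")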
        
       | EdwardDiego wrote:
       | > July 2019 we did a rolling restart to convert from self-
       | packaged Kafka 0.10.0
       | 
       | Ouch, that's a lot of fixed bugs you weren't reaping the benefits
       | of >_< What was the reason to stick on 0.10.0 for so long?
       | 
       | After we hit a few bad ones that finally convinced our sysop team
       | to move past 0.11.x, life was far better - especially recovery
       | speed after an unclean shutdown. Used to take two hours, dropped
       | to like 10 minutes.
       | 
       | There was a particular bug I can't find for the life of me that
       | we hit about four times in one year where the replicas would get
       | confused about where the high watermark was, and refuse to fetch
       | from the leader. Although to be fair to Kafka 0.10.x, I think
       | that was a bug introduced in 0.11.0. Which is where I developed
       | my personal philosophy of "never upgrade to a x.x.0 Kafka release
       | if it can be avoided."
       | 
       | > The toil of handling reassigning partitions during broker
       | replacement by hand every time one of the instances was
       | terminated by AWS began to grate upon us
       | 
       | I see you like Cruise Control in the Confluent Platform, did you
       | try it earlier?
       | 
       | > In October 2020, Confluent announced Confluent Platform 6.0
       | with Tiered Storage support
       | 
       | Tiered storage is slowly coming to FOSS Kafka, hopefully in
        | 3.2.0, thanks to some very nice developers from AirBnB. Credit to
        | the StreamNative team that FOSS Pulsar has tiered storage (and
        | schema registry) built in.
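        | 
        | For anyone curious what turning it on looks like, the broker side
        | of Confluent's version is roughly this in server.properties (from
        | memory of the CP 6.x docs, so treat the exact property names and
        | values as approximate):
        | 
        |   confluent.tier.feature=true
        |   confluent.tier.enable=true
        |   confluent.tier.backend=S3
        |   # placeholder bucket/region
        |   confluent.tier.s3.bucket=my-kafka-tier-bucket
        |   confluent.tier.s3.region=us-east-1
        |   # keep only the most recent couple of hours hot on local disk
        |   confluent.tier.local.hotset.ms=7200000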
        
         | lizthegrey wrote:
         | > What was the reason to stick on 0.10.0 for so long?
         | 
          | The aforementioned self-packaging: we were mangling the .tar.gz
          | files into .debs, and we had to remember to update the debs and
          | then push them out onto our systems, instead of just using Apt.
          | That's why Confluent's prebuilt distro helped a lot! But also
          | the team was just _afraid_ of Kafka and didn't want to touch it
          | unnecessarily.
         | 
         | > I see you like Cruise Control in the Confluent Platform, did
         | you try it earlier?
         | 
         | We definitely should have. We tried Datadog's Kafka-kit but
         | found adapting it to use Wavefront or Honeycomb Metrics
         | products was more problematic than it needed to be.
         | 
         | > Tiered storage is slowly coming to FOSS Kafka, hopefully in
         | 3.2.0, thanks to some very nice developers from AirBnB. Credit
         | to the StreamNative team, that FOSS Pulsar has tiered storage
         | built-in.
         | 
         | Yeah, we're glad the rest of the world gets to have it, and
         | also glad we paid upfront for Confluent's enterprise feature
         | version to get us out of the immediate bind we had in 2020.
         | Those EBS/instance storage bills were adding up fast.
        
           | EdwardDiego wrote:
           | > we paid upfront for Confluent's enterprise feature version
           | to get us out of the immediate bind we had in 2020.
           | 
           | Definitely agree it's an essential feature for large datasets
           | - in the past I've used Kafka Connect to stream data to S3
           | for longer term retention, but it's something else to manage,
           | and getting data back into a topic if needed can be a bit
           | painful.
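            | 
            | (For reference, the S3 sink is just a connector config posted
            | to the Connect REST API. A minimal sketch; the host, bucket
            | and topic names are placeholders, and the format/flush
            | settings depend heavily on your data:)
            | 
            |   import json
            |   import requests  # third-party HTTP client
            | 
            |   connector = {
            |       "name": "s3-archive-sink",
            |       "config": {
            |           "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            |           "tasks.max": "4",
            |           "topics": "events",  # placeholder topic
            |           "s3.bucket.name": "my-kafka-archive",
            |           "s3.region": "us-east-1",
            |           "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            |           "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            |           "flush.size": "10000",
            |       },
            |   }
            | 
            |   # Connect creates and starts the sink from this payload.
            |   resp = requests.post("http://connect:8083/connectors",
            |                        headers={"Content-Type": "application/json"},
            |                        data=json.dumps(connector))
            |   resp.raise_for_status()
            | 
            | Going the other direction (getting the data from S3 back into
            | a topic) is the part that, as above, can be a bit painful.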
        
             | lizthegrey wrote:
             | getting to just use the same consistent API without
             | rewriting clients was AMAZING.
        
           | ewhauser421 wrote:
           | The irony of Honeycomb using an open source tool from Datadog
           | is not lost on me =)
        
             | lizthegrey wrote:
             | stand on the shoulders of giants!
        
       | mherdeg wrote:
        | It's funny how my bugbears from interacting with distributed
        | async messaging (Kafka) are like 90 degrees orthogonal to the
        | things described here:
       | 
        | (1) Occasionally have wanted to know _what the actual traffic
        | is_. This takes extra software work (writing some kind of
        | inspector tool to consume a sample message and produce a human-
        | readable version of what's inside it).
       | 
       | (2) Sometimes see problems which happen at the broker-partition
       | or partition-consumer assignment level, and tools for visualizing
       | this are really messy.
       | 
        | For example, you have 200 partitions and 198 consumer threads --
        | this means that because of the pigeonhole principle there are 2
        | threads which each own 2 partitions. Randomly, 1% of your data
        | processing will take twice as long, which can be very hard to
        | visualize.
       | 
        | Or, for example, 10 of your 200 partitions are managed by broker
        | B which, for some reason, is mishandling messages -- so 5% of
        | messages are being handled poorly, which may not emerge in your
        | metrics the way you expect. Viewing slowness by partition, by
        | owning consumer, and by managing broker can be tricky to
        | remember to do when operating the system.
       | 
       | (3) Provisioning capacity to have n-k availability (so that
       | availability-zone-wide outages as well as deployments/upgrades
       | don't hurt processing) can be tricky.
       | 
       | How many messages per second are arriving? What is the mean
       | processing time per message? How many processors (partitions) do
       | you need to keep up? How much _slack_ do you have -- how much
       | excess capacity is there above the typical message arrival rate,
       | so that you can model how long it will take the cluster to
       | process a backlog after an outage?
       | 
        | (4) Remembering how to scale up when the message arrival rate
        | grows feels like a bit of a chore. You have to increase the
        | number of partitions to be able to handle the new messages ...
        | but then you also have to remember to scale up every consumer.
        | You did remember that, right? And you know you can't ever reduce
        | the partition count, right?
       | 
        | (5) I often end up wondering what the processing latency is. You
        | can approximate this by dividing the total backlog of unprocessed
        | messages for an entire consumer group (unit "messages") by the
        | message arrival rate (unit "arriving messages per second"), which
        | gets you something that has dimensionality of "seconds" and
        | represents a quasi processing lag (there's a sketch of this
        | arithmetic at the end of this comment). But the lag is often
        | different per-partition.
       | 
       | Better is to teach the application-level consumer library to emit
       | a metric about how long processing took and how old the message
       | it evaluated was - then, as long as processing is still
        | happening, you can measure delays. Both are messy metrics that
        | need you to get and remain hands-on with the data to understand
        | them.
       | 
       | (6) There's a complicated relationship between "processing time
       | per message" and effective capacity -- any application changes
       | which make a Kafka consumer slower may not have immediate effects
       | on end-to-end lag SLIs, but they may increase the amount of
       | parallelism needed to handle peak traffic, and this can be tough
       | to reason about.
       | 
       | (7) Planning only ex post facto for processing outages is always
       | a pain. More than once I've heard teams say "this outage would be
       | a lot shorter if we had built in a way to process newly arrived
       | messages first", and I've even seen folks jury-rig LIFO by e.g.
       | changing the topic name for newly arrived messages and using the
       | previous queue as a backlog only.
       | 
       | I wonder if my clusters have just been too small? The stuff here
       | ("how can we afford to operate this at scale?") is super
       | interesting, just not the reliability stuff I've worried about
       | day-to-day.
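        | 
        | Since (5) keeps coming up, here's a rough sketch of the
        | arithmetic (it also covers the slack and drain-time questions
        | from (3) and (6)). All the inputs are made-up numbers; in
        | practice they'd come from your lag exporter / metrics store:
        | 
        |   # made-up inputs; substitute real measurements
        |   backlog_messages = 1_200_000  # unconsumed messages, whole group
        |   arrival_rate = 20_000         # messages/sec arriving
        |   per_message_seconds = 0.002   # mean processing time per message
        |   consumers = 50                # consumer threads running
        | 
        |   # (5) quasi processing lag: backlog divided by arrival rate
        |   lag_seconds = backlog_messages / arrival_rate
        | 
        |   # (3)/(6) capacity: each thread does 1/per_message_seconds msg/s
        |   throughput = consumers / per_message_seconds
        |   slack = throughput - arrival_rate  # headroom over steady state
        | 
        |   # time to drain the backlog while new messages keep arriving
        |   drain_seconds = (float("inf") if slack <= 0
        |                    else backlog_messages / slack)
        | 
        |   print(f"lag ~{lag_seconds:.0f}s, headroom {slack:.0f} msg/s, "
        |         f"drain ~{drain_seconds / 60:.1f} min")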
        
       | sealjam wrote:
       | > ...RedPanda, a scratch backend rewrite in Rust that is client
       | API compatible
       | 
       | I thought RedPanda was mostly C++?
        
         | zellyn wrote:
          | The RedPanda website says it's written in C++, and their open
          | source GitHub repo agrees.
        
           | lizthegrey wrote:
           | thanks for the correction! knew an error would slip in there
           | somewhere! apparently they have considered rust though!
           | https://news.ycombinator.com/item?id=25112601
           | 
           | it's fixed now.
        
       ___________________________________________________________________
       (page generated 2021-11-30 23:00 UTC)