[HN Gopher] Scaling Kafka at Honeycomb
___________________________________________________________________

Scaling Kafka at Honeycomb

Author : i0exception
Score  : 72 points
Date   : 2021-11-30 19:25 UTC (3 hours ago)

(HTM) web link (www.honeycomb.io)

| rubiquity wrote:
| I've never used Kafka, but this post is yet another hard-earned lesson in log replication systems: storage tiering should be much higher on the hierarchy of needs than horizontal scaling of individual logs/topics/streams. In my experience, by the time you need storage tiering, something awful is already happening.
|
| During network partitions, or in other scenarios where your disks are filling up quickly, it's much easier to reason about how to get your log healthy by aggressively offloading to tiered storage and trimming than it is to re-partition (read: reconfigure), which often requires writes to some consensus-backed metadata store that is likely experiencing its own issues at that time.
|
| Another great benefit of storage tiering is that you can externally communicate a shorter data retention period than you actually have in practice, while you really put your recovery and replay systems through their paces to get the confidence you need. Tiered storage can also be a great place to bootstrap new nodes from.
| StreamBright wrote:
| >> Historically, our business requirements have meant keeping a buffer of 24 to 48 hours of data to guard against the risk of a bug in retriever corrupting customer data.
|
| I have used much larger buffers before. Some bugs can lurk around for a while before being noticed. For example, the lack of something is much harder to notice.
| lizthegrey wrote:
| Yes -- now that we're just paying the cost to store once on S3 rather than 3x on NVMe, we can plausibly extend the window to 72 hours or longer! Before, it was a pragmatic, constraint-driven compromise.
| EdwardDiego wrote:
| > July 2019 we did a rolling restart to convert from self-packaged Kafka 0.10.0
|
| Ouch, that's a lot of fixed bugs you weren't reaping the benefits of >_< What was the reason to stick on 0.10.0 for so long?
|
| After we hit a few bad ones that finally convinced our sysop team to move past 0.11.x, life was far better - especially recovery speed after an unclean shutdown. It used to take two hours; it dropped to about 10 minutes.
|
| There was a particular bug, which I can't find for the life of me, that we hit about four times in one year, where the replicas would get confused about where the high watermark was and refuse to fetch from the leader. Although, to be fair to Kafka 0.10.x, I think that was a bug introduced in 0.11.0. Which is where I developed my personal philosophy of "never upgrade to a x.x.0 Kafka release if it can be avoided."
|
| > The toil of handling reassigning partitions during broker replacement by hand every time one of the instances was terminated by AWS began to grate upon us
|
| I see you like Cruise Control in the Confluent Platform; did you try it earlier?
|
| > In October 2020, Confluent announced Confluent Platform 6.0 with Tiered Storage support
|
| Tiered storage is slowly coming to FOSS Kafka, hopefully in 3.2.0, thanks to some very nice developers from AirBnB. Credit to the StreamNative team that FOSS Pulsar has tiered storage (and schema registry) built in.
| lizthegrey wrote:
| > What was the reason to stick on 0.10.0 for so long?
|
| The aforementioned self-packaging: we were mangling the .tar.gz files into .debs, and we had to remember to update the debs and then push them out onto our systems instead of just using apt. That's why Confluent's prebuilt distro helped a lot! But also, the team was just _afraid_ of Kafka and didn't want to touch it unnecessarily.
|
| > I see you like Cruise Control in the Confluent Platform, did you try it earlier?
|
| We definitely should have. We tried Datadog's Kafka-kit but found that adapting it to use the Wavefront or Honeycomb Metrics products was more problematic than it needed to be.
|
| > Tiered storage is slowly coming to FOSS Kafka, hopefully in 3.2.0, thanks to some very nice developers from AirBnB. Credit to the StreamNative team, that FOSS Pulsar has tiered storage built-in.
|
| Yeah, we're glad the rest of the world gets to have it, and also glad we paid upfront for Confluent's enterprise feature version to get us out of the immediate bind we had in 2020. Those EBS/instance storage bills were adding up fast.
| EdwardDiego wrote:
| > we paid upfront for Confluent's enterprise feature version to get us out of the immediate bind we had in 2020.
|
| Definitely agree it's an essential feature for large datasets - in the past I've used Kafka Connect to stream data to S3 for longer-term retention, but it's something else to manage, and getting data back into a topic if needed can be a bit painful.
| lizthegrey wrote:
| getting to just use the same consistent API without rewriting clients was AMAZING.
| ewhauser421 wrote:
| The irony of Honeycomb using an open source tool from Datadog is not lost on me =)
| lizthegrey wrote:
| stand on the shoulders of giants!
| mherdeg wrote:
| It's funny how my bugbears from interacting with distributed async messaging (Kafka) are like 90 degrees orthogonal to the things described here:
|
| (1) Occasionally I've wanted to know _what the actual traffic is_. This takes extra software work (writing some kind of inspector tool to consume a sample message and produce a human-readable version of what's inside it).
|
| (2) Sometimes I see problems which happen at the broker-partition or partition-consumer assignment level, and the tools for visualizing this are really messy.
|
| For example, you have 200 partitions and 198 consumer threads -- by the pigeonhole principle, 2 of those threads each own 2 partitions. Randomly, 1% of your data processing takes twice as long, which can be very hard to visualize.
|
| Or, for example, 10 of your 200 partitions are managed by broker B which, for some reason, is mishandling messages -- so 5% of messages are being handled poorly, which may not emerge in your metrics the way you expect. Viewing slowness by partition, by owning consumer, and by managing broker can be tricky to remember to do when operating the system.
|
| (3) Provisioning capacity to have n-k availability (so that availability-zone-wide outages as well as deployments/upgrades don't hurt processing) can be tricky.
|
| How many messages per second are arriving? What is the mean processing time per message? How many processors (partitions) do you need to keep up? How much _slack_ do you have -- how much excess capacity is there above the typical message arrival rate, so that you can model how long it will take the cluster to process a backlog after an outage? (A rough version of this arithmetic is sketched just below.)
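Not from the thread itself -- just a back-of-the-envelope sketch, in Python, of the slack and backlog arithmetic that point (3) is asking about. The numbers are invented for illustration, and it assumes processing parallelism equals the number of consumers owning partitions and that arrivals stay constant while you catch up.

def consumer_group_capacity(arrival_rate, mean_processing_secs, num_consumers):
    """Rough model of a consumer group's headroom.

    arrival_rate         -- messages arriving per second, cluster-wide
    mean_processing_secs -- average seconds to handle one message
    num_consumers        -- consumer threads actually owning partitions
    """
    # Each consumer can handle at most 1/mean_processing_secs messages per
    # second, so the whole group tops out at:
    max_throughput = num_consumers / mean_processing_secs
    utilization = arrival_rate / max_throughput   # want this comfortably < 1.0
    drain_rate = max_throughput - arrival_rate    # backlog messages/sec you can burn down
    return utilization, drain_rate

def backlog_drain_seconds(backlog, arrival_rate, mean_processing_secs, num_consumers):
    """How long a backlog takes to clear, assuming arrivals stay constant."""
    _, drain_rate = consumer_group_capacity(arrival_rate, mean_processing_secs, num_consumers)
    return float("inf") if drain_rate <= 0 else backlog / drain_rate

# Invented example: 50,000 msgs/sec arriving, 2 ms per message, 150 consumers,
# and a backlog equal to 30 minutes of missed traffic.
util, _ = consumer_group_capacity(50_000, 0.002, 150)
print(f"utilization: {util:.0%}")    # 67%
drain = backlog_drain_seconds(50_000 * 1800, 50_000, 0.002, 150)
print(f"time to catch up: {drain / 60:.0f} minutes")    # 60 minutes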
|
| (4) Remembering how to scale up when the message arrival rate grows feels like a bit of a chore. You have to increase the number of partitions to be able to handle the new messages ... but then you also have to remember to scale up every consumer. You did remember that, right? And you know you can't ever reduce the partition count, right?
|
| (5) I often end up wondering what the processing latency is. You can approximate it by dividing the total backlog of unprocessed messages for an entire consumer group (unit: "messages") by the message arrival rate (unit: "arriving messages per second"), which gets you something with the dimensionality of "seconds" that represents a quasi processing lag. But the lag is often different per partition.
|
| Better is to teach the application-level consumer library to emit a metric about how long processing took and how old the message it evaluated was - then, as long as processing is still happening, you can measure delays. (A rough sketch of that consumer-side metric is at the bottom of this page.) Both are messy metrics that require you to get, and remain, hands-on with the data to understand them.
|
| (6) There's a complicated relationship between "processing time per message" and effective capacity -- application changes that make a Kafka consumer slower may not have an immediate effect on end-to-end lag SLIs, but they may increase the amount of parallelism needed to handle peak traffic, and this can be tough to reason about.
|
| (7) Planning for processing outages only ex post facto is always a pain. More than once I've heard teams say "this outage would be a lot shorter if we had built in a way to process newly arrived messages first", and I've even seen folks jury-rig LIFO by e.g. changing the topic name for newly arrived messages and using the previous queue as a backlog only.
|
| I wonder if my clusters have just been too small? The stuff here ("how can we afford to operate this at scale?") is super interesting, just not the reliability stuff I've worried about day-to-day.
| sealjam wrote:
| > ...RedPanda, a scratch backend rewrite in Rust that is client API compatible
|
| I thought RedPanda was mostly C++?
| zellyn wrote:
| The RedPanda website says it's written in C++, and their open source GitHub repo agrees.
| lizthegrey wrote:
| thanks for the correction! knew an error would slip in there somewhere! apparently they have considered Rust though! https://news.ycombinator.com/item?id=25112601
|
| it's fixed now.
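One possible shape for the consumer-side metric described in point (5) above, sketched with the confluent-kafka Python client. This is not from the thread or the article: the broker address, topic, group, and metric names are placeholders, and handle() / emit_metric() are stand-ins for real processing code and a real metrics client.

import time
from confluent_kafka import Consumer

def handle(msg):
    pass  # stand-in for real message processing

def emit_metric(name, value):
    print(name, value)  # stand-in for whatever metrics client is in use

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "lag-demo",                 # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])              # hypothetical topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        started = time.time()
        handle(msg)
        # How long our own processing took:
        emit_metric("kafka.consume.processing_secs", time.time() - started)
        # msg.timestamp() returns (timestamp_type, epoch milliseconds); the
        # produce timestamp tells us how stale the message was when we saw it.
        ts_type, ts_ms = msg.timestamp()
        if ts_ms > 0:  # guard against TIMESTAMP_NOT_AVAILABLE
            emit_metric("kafka.consume.message_age_secs", time.time() - ts_ms / 1000.0)
finally:
    consumer.close()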