[HN Gopher] OpenTelemetry at Scale: Using Kafka to handle bursty...
___________________________________________________________________
OpenTelemetry at Scale: Using Kafka to handle bursty traffic
Author : pranay01
Score  : 138 points
Date   : 2023-10-22 18:58 UTC (14 hours ago)
(HTM) web link (signoz.io)
(TXT) w3m dump (signoz.io)

| francoismassot wrote:
| I have heard several times that Kafka was put in front of Elasticsearch clusters to handle traffic bursts. You can also use Redpanda, Pulsar, NATS and other distributed queues.
|
| One thing that is also very interesting with Kafka is that you can achieve exactly-once semantics without too much effort: by keeping track of the partition positions in your own database and carefully acknowledging them only when you are sure the data is safely stored in your db. That's what we did with our engine Quickwit; so far it's the most efficient way to index data into it.
|
| One obvious drawback with Kafka is that it's one more piece to maintain... and it's not a small one.
| pranay01 wrote:
| Have you done/seen any benchmarks between Redpanda/NATS and Kafka for this use case?
|
| Some folks in the SigNoz community have also suggested NATS for this, but I have not dug deep into benchmarks/features yet.
| francoismassot wrote:
| Unfortunately no :/
| radicality wrote:
| Exactly-once semantics of what specifically? Or do you mean at-least-once?
| francoismassot wrote:
| Exactly-once semantics between Kafka and the observability engine.
| richieartoul wrote:
| You have to do a bit more than that if you want exactly-once end-to-end (i.e. if Kafka itself can contain duplicates). One of my former colleagues did a good write-up on how Husky does it: https://www.datadoghq.com/blog/engineering/husky-deep-dive/
| francoismassot wrote:
| Yeah, I was only talking about exactly-once semantics between Kafka and Quickwit.
| viraptor wrote:
| > exactly-once semantics without too much effort: by keeping track of the partition positions in your own database and carefully acknowledging them only when you are sure the data is safely stored in your db
|
| That's not really "exactly once". What happens when your system dies after it made sure the data is safely stored in the db and before ack-ing?
| Svenskunganka wrote:
| Depending on how you use the database, it is. If you write the data as well as the offset to the DB in the same transaction, you can then seek to the offset stored in the DB after application restart and continue from there.
| viraptor wrote:
| > after application restart and continue from there.
|
| What if the application doesn't restart before the queue decides the message was lost and resends it?
| hashhar wrote:
| In Kafka the "queue" is dumb: it doesn't lose messages (it's an append-only durable log), nor does it resend anything unless the consumer requests it.
| mirekrusin wrote:
| You should drop the "(...) and carefully acknowledging them only when you are sure the data is safely stored in your db (...)" part then, because it means it's not necessary; you don't rely on it.
|
| At-least-once semantics + local deduplication gives exactly-once semantics.
|
| In this case you're optimising local deduplication with a strictly monotonic index.
|
| One downside is that you leak internals of the other system (partitions).
|
| The other is that it implies serialised processing - you can't process anything in parallel, as you have a single index threshold that defines what has and what has not yet been processed.
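A minimal sketch of the offset-in-the-database pattern being discussed in this thread: write the consumed records and the partition position in the same database transaction, then resume from the stored position on startup. The topic name, table layout and SQLite "metastore" are illustrative assumptions, not Quickwit's actual implementation; the confluent-kafka Python client is assumed.

    # Data and offset are committed atomically in one DB transaction;
    # a crash before the commit simply replays the same records.
    import sqlite3
    from confluent_kafka import Consumer, TopicPartition

    TOPIC = "otel-spans"   # hypothetical topic
    PART = 0               # single partition for brevity

    db = sqlite3.connect("metastore.db")
    db.execute("CREATE TABLE IF NOT EXISTS spans (payload TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS offsets (part INTEGER PRIMARY KEY, pos INTEGER)")

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "indexer",
        "enable.auto.commit": False,  # offsets live in our DB, not in Kafka
    })

    # Resume from the offset recorded in the same DB as the data.
    row = db.execute("SELECT pos FROM offsets WHERE part = ?", (PART,)).fetchone()
    consumer.assign([TopicPartition(TOPIC, PART, row[0] + 1 if row else 0)])

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        with db:  # both INSERTs commit or roll back together
            db.execute("INSERT INTO spans VALUES (?)", (msg.value().decode(),))
            db.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                       (msg.partition(), msg.offset()))

On restart the consumer seeks to the stored position + 1, so each record lands in the database at most once even though Kafka itself only guarantees at-least-once delivery.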
| Svenskunganka wrote:
| I'm not the one who wrote the original comment, so I can't modify it. But one should still commit offsets, because it is the happy path: DB transaction successful? Commit the offset. If the offset commit fails due to e.g. an application crash and you seek at startup to the partition offset stored in the DB + 1, you get exactly-once semantics. There are some more details, e.g. you'd have to do the same during a consumer group rebalance, and topic configuration also plays a role: for example whether the topic is compacted or not, whether you write tombstones, and what its retention policy is.
|
| edit: You added some more to your comment after I posted this one, so I'll try to cover those points as well:
|
| > One downside is that you leak internals of the other system (partitions).
|
| Yeah, sure.
|
| > The other is that it implies serialised processing - you can't process anything in parallel, as you have a single index threshold that defines what has and what has not yet been processed.
|
| It doesn't imply serialised processing; it depends on the use case. If each record in a topic has to be processed serially, you can't parallelize, full stop; the number of partitions equals 1. But if each record can be processed individually, you get parallelism equal to the number of partitions the topic is configured with. You also achieve parallelism in the same way if only some records in a topic need to be processed serially, at which point you use the same key for the records that need serial processing and they will end up in the same partition. For example, recording the coordinates of a plane: each plane can be processed in parallel, but an individual plane's coordinates need to be processed serially - just use the plane's unique identifier as the key and the coordinates for the same plane will be appended to the log of the same partition.
| mirekrusin wrote:
| Yes, it's a good option, but it requires serialised processing within a partition's scope, which may or may not be desirable.
|
| If exactly-once semantics are needed and processing should be parallel, other methods have to be used.
| francoismassot wrote:
| Good point. First, you're right: we do the ack on Kafka but it's not necessary. Second, this is not what I wanted to stress... and I should not have used the verb "acknowledge". What we do is upload the data to S3, then commit partitions + positions in what we call the metastore. I can't edit my comment, unfortunately.
|
| > One downside is that you leak internals of the other system (partitions).
|
| True, but we generalized the concept of partitions to other data sources; it's pretty convenient to use it for distributing indexing tasks.
| fulmicoton wrote:
| > The other is that it implies serialised processing - you can't process anything in parallel, as you have a single index threshold that defines what has and what has not yet been processed.
|
| Fortunately Kafka is partitioned. You can work in parallel across partitions.
|
| Also, you can pipeline your processing. If you are running your data through operations (A, B, C), then (C on batch N) can run at the same time as (B on batch N+1) and (A on batch N+2).
|
| We do both at Quickwit.
| foota wrote:
| Isn't exactly-once delivery the kind of problem, like the CAP theorem, where it's not possible?
|
| You can make the downstream idempotent with respect to what the queue is delivering, but the queue might still redeliver things.
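A minimal sketch of that "at-least-once delivery plus idempotent downstream" approach: the store deduplicates on a unique event id, so a redelivered message is a no-op. The table layout and the "id" field in the payload are illustrative assumptions.

    # Idempotent writes: duplicates from queue redelivery are dropped
    # on the event id, so the end result behaves like exactly-once.
    import json
    import sqlite3

    db = sqlite3.connect("spans.db")
    db.execute("CREATE TABLE IF NOT EXISTS spans ("
               " event_id TEXT PRIMARY KEY,"  # unique id carried in each message
               " payload  TEXT)")

    def store(message_bytes: bytes) -> None:
        """Write one queue message; safe to call again on redelivery."""
        event = json.loads(message_bytes)
        with db:
            # INSERT OR IGNORE makes the write idempotent: a second
            # delivery with the same event_id is silently skipped.
            db.execute("INSERT OR IGNORE INTO spans (event_id, payload) VALUES (?, ?)",
                       (event["id"], json.dumps(event)))

    # A redelivered message changes nothing:
    store(b'{"id": "span-42", "duration_ms": 13}')
    store(b'{"id": "span-42", "duration_ms": 13}')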
| ankitnayan wrote:
| https://www.confluent.io/blog/exactly-once-semantics-are-pos...
| bushbaba wrote:
| Seems like overkill, no? OTel collectors are fairly cheap, so why add expensive Kafka into the mix? If you need to buffer, why not just dump to S3 or a similar data store as temporary storage?
| francoismassot wrote:
| I really like this idea. And there is an OTel exporter to AWS S3, still in alpha, but I'm gonna test it soon: https://github.com/open-telemetry/opentelemetry-collector-co...
| prpl wrote:
| Why not both: dump to S3 and write pointers to Kafka for portable event-based ingestion (since everybody does messages a bit differently)?
| bushbaba wrote:
| No need, as S3 objects are your dead-letter queue, and the system should be designed anyway to cope with multiple writes of the same event.
|
| The point is to only use S3 etc. in the event of system instability, not as a primary data transfer means.
| lmm wrote:
| > If you need to buffer, why not just dump to S3 or a similar data store as temporary storage?
|
| At that point it's very easy to sleepwalk into implementing your own database on top of S3, which is very hard to get good semantics out of - e.g. it offers essentially no ordering guarantees, and forget about atomicity. For telemetry you might well be OK with fuzzy data, but if you want exact traces every time, then Kafka could make sense.
| dikei wrote:
| Yeah, and to use S3 efficiently you also need to batch your messages into large blobs of at least tens of MB, which further complicates the matter, especially if you don't want to lose those message buffers.
| bushbaba wrote:
| That's only if your OTel collector is being overwhelmed. In such cases you have a lot of backlogged data that can't be ingested, so you dead-letter-queue it to S3 to free up buffers.
|
| The approach here is to only send data to S3 as a last resort.
| ankitnayan wrote:
| It's very hard to see S3 working as a buffer. Every datastore can cover almost any storage use case - buffer/queue/db - when the scale is low, but buffers and queues were designed to work at scale.
| richieartoul wrote:
| (WarpStream founder)
|
| This is more or less exactly what WarpStream is: https://www.warpstream.com/blog/minimizing-s3-api-costs-with...
|
| Kafka API, S3 costs and ease of use.
| daurnimator wrote:
| I expect it would be far cheaper to scale up Tempo/Loki than it would be to even run an idle Kafka cluster. This feels like spending thousands of dollars to save tens of dollars.
| neetle wrote:
| Tempo can still buckle under huge bursts of traffic, and you don't need the retention to be in the hours.
| pranay01 wrote:
| Querying in Tempo/Loki doesn't seem to scale particularly well, and Loki has known issues with high-cardinality data, so...
| ankitnayan wrote:
| When handling surges on the order of 10x, it's much more difficult to scale the different components of Loki than to write the data to Kafka/Redpanda first and consume it at a consistent rate.
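A minimal sketch of that buffering pattern: producers write to Kafka as fast as the burst arrives, while a consumer drains the topic at a bounded rate into the backend. The topic name, the per-second budget and write_to_backend() are illustrative assumptions; the confluent-kafka Python client is assumed again.

    # The burst accumulates in Kafka; the backend only ever sees a
    # steady stream bounded by MAX_RECORDS_PER_SEC.
    import time
    from confluent_kafka import Consumer

    MAX_RECORDS_PER_SEC = 500  # hypothetical backend write budget

    def write_to_backend(batch):
        """Placeholder for the slow/expensive write to the storage backend."""
        pass

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "otel-drain",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    consumer.subscribe(["otel-spans"])

    while True:
        start = time.monotonic()
        batch = consumer.consume(num_messages=MAX_RECORDS_PER_SEC, timeout=1.0)
        records = [m.value() for m in batch if not m.error()]
        if records:
            write_to_backend(records)
            consumer.commit(asynchronous=False)
        # Take at least one second per batch so ingest spikes never
        # reach the backend faster than the configured budget.
        time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))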
| blinded wrote:
| This architecture is how the big players do it at scale (i.e. Datadog, New Relic - the second data passes their edge it lands in a Kafka cluster). Also, OTel components lack rate limiting (1), meaning it's super easy to overload your backend storage (S3).
|
| Grafana has some posts on how they softened the S3 blow with memcached (2, 3).
|
| 1. https://github.com/open-telemetry/opentelemetry-collector-co...
| 2. https://grafana.com/docs/loki/latest/operations/caching/
| 3. https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...
|
| I know the post is about telemetry data and my comments on Grafana are about logs, but the architecture bits still apply.
| ankitnayan wrote:
| Caching is to improve read performance, whereas Kafka is used to handle ingest volume. I couldn't see how the Grafana articles shared relate to this.
| wardb wrote:
| Grafana Labs employee here => On the linked articles: I'm not aware of any caching being used in the write-data-to-S3 part of the pipeline, other than some time-based/volume-based buffering at the ingester microservices before writing the chunks of data to object storage.
|
| The linked Loki caching docs/articles are for optimising the read access patterns against S3/object storage, not writes.
| chris_armstrong wrote:
| A similar idea [^1] has cropped up in the serverless OpenTelemetry world: collate OpenTelemetry spans in a Kinesis stream before forwarding them to a third-party service for analysis, obviating the need for a separate collector, reducing forwarding latency and removing the cold-start overhead of the AWS Distribution for OpenTelemetry Lambda Layer.
|
| [^1] https://x.com/donkersgood/status/1662074303456636929?s=20
| Joel_Mckay wrote:
| If you have distributed concurrent data streams that exhibit coherent temporal events, then at some point you pretty much have to implement a queuing balancer.
|
| One simply trades latency for capacity and eventual coherent data locality.
|
| It's almost an arbitrary detail whether you use Kafka, RabbitMQ, or Erlang channels. If you can add smart client application-layer predictive load-balancing, then it is possible to cut burst traffic loads by an order of magnitude or two. Cost-optimized dynamic host scaling is not always a solution that solves every problem.
|
| Good luck out there =)
| nicognaw wrote:
| SigNoz is too good at SEO.
|
| In the early days, when I looked up OTel and observability stuff, I always saw SigNoz articles on the first screen.
___________________________________________________________________
(page generated 2023-10-23 09:00 UTC)