[HN Gopher] Migrating to OpenTelemetry
___________________________________________________________________

Migrating to OpenTelemetry

Author : kkoppenhaver
Score  : 127 points
Date   : 2023-11-16 17:29 UTC (5 hours ago)

(HTM) web link (www.airplane.dev)
(TXT) w3m dump (www.airplane.dev)

| caust1c wrote:
| Curious about the code implemented for logs! Hopefully that's something that can be shared at some point. Also curious if it integrates with `log/slog` :-)
|
| Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.

| bhyolken wrote:
| Thanks! For logs, we actually use github.com/segmentio/events and just implemented a handler for that library that batches logs and periodically flushes them out to our collector using the underlying protocol buffer interface. We plan on migrating to log/slog soon, and once we do that we'll adapt our handler and can share the code.

| caust1c wrote:
| Awesome! Great work and thanks for sharing your experience!
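A minimal sketch of the batch-and-flush pattern bhyolken describes above, with the segmentio/events and protobuf/OTLP specifics reduced to placeholders. The type names, limits, and the flush callback are invented for illustration; this is not Airplane's actual handler:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // entry is a placeholder for whatever the real handler pulls out of a
    // log event (message, severity, attributes, ...).
    type entry struct {
        ts  time.Time
        msg string
    }

    // batchingHandler buffers entries and flushes them when the buffer
    // fills or a timer fires. In the real system the flush callback would
    // be the protocol-buffer export to the collector.
    type batchingHandler struct {
        mu    sync.Mutex
        buf   []entry
        limit int
        flush func([]entry)
    }

    func newBatchingHandler(limit int, interval time.Duration, flush func([]entry)) *batchingHandler {
        h := &batchingHandler{limit: limit, flush: flush}
        go func() {
            ticker := time.NewTicker(interval)
            defer ticker.Stop()
            for range ticker.C {
                h.flushNow()
            }
        }()
        return h
    }

    // Handle is where a segmentio/events (or log/slog) adapter would hand
    // off each record.
    func (h *batchingHandler) Handle(ts time.Time, msg string) {
        h.mu.Lock()
        h.buf = append(h.buf, entry{ts: ts, msg: msg})
        full := len(h.buf) >= h.limit
        h.mu.Unlock()
        if full {
            h.flushNow()
        }
    }

    func (h *batchingHandler) flushNow() {
        h.mu.Lock()
        batch := h.buf
        h.buf = nil
        h.mu.Unlock()
        if len(batch) > 0 {
            h.flush(batch)
        }
    }

    func main() {
        h := newBatchingHandler(100, 5*time.Second, func(batch []entry) {
            // Stand-in for the OTLP/protobuf call to the collector.
            fmt.Printf("exporting %d log records\n", len(batch))
        })
        h.Handle(time.Now(), "hello")
        h.flushNow()
    }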
| MajimasEyepatch wrote:
| It's interesting that you're using both Honeycomb and Datadog. With everything migrated to OTel, would there be advantages to consolidating on just Honeycomb (or Datadog)? Have you found they're useful for different things, or is there enough overlap that you could use just one or the other?

| bhyolken wrote:
| Author here, thanks for the question! The current split developed from the personal preferences of the engineers who initially set up our observability systems, based on what they had used (and liked) at previous jobs.
|
| We're definitely open to doing more consolidation in the future, especially if we can save money by doing that, but from a usability standpoint we've been pretty happy with Honeycomb for traces and Datadog for everything else so far. And that seems to be aligned with what each vendor is currently best at.

| MuffinFlavored wrote:
| > from the personal preferences of the engineers
|
| https://www.honeycomb.io/pricing
|
| https://www.datadoghq.com/pricing/
|
| Am I wrong to say... having 2 is "expensive"? Maybe not if 50% of your stuff is going to Honeycomb and 50% is going to Datadog. Could you save money/complexity (fewer places to look for things) by having just Datadog or just Honeycomb?

| bhyolken wrote:
| Right now, there isn't much duplication in what we're sending to each vendor, so I don't think we'd save a ton by consolidating, at least based on list prices. We could maybe negotiate better prices based on higher volumes, but I'm not sure Airplane is spending enough at this point to get massive discounts there.
|
| Another potential benefit would definitely be reduced complexity and better integration for the engineering team. So, for instance, you could look at a log and then more easily navigate to the UI for the associated trace. Currently, we do this by putting Honeycomb URLs in our Datadog log events, which works but isn't quite as seamless. But, given that our team is pretty small at this point and that we're not spending a ton of our time on performance optimizations, we don't feel an urgent need to consolidate (yet).

| MuffinFlavored wrote:
| When you say Datadog for everything else (as in not traces), besides logs, what else do you mean?

| claytonjy wrote:
| Metrics, probably? The article calls out logs, metrics, and traces as the 3 pillars of observability.

| bhyolken wrote:
| Yeah, metrics and logs, plus a few other things that depend on these (alerts, SLOs, metric-based dashboards, etc.).

| tapoxi wrote:
| I made this switch very recently. For our Java apps it was as simple as loading the otel agent in place of the Datadog SDK, basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in our args.
|
| The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP, which is the fancy combined metrics/traces/logs protocol the OTel SDKs/agents use, but it also speaks Prometheus, Zipkin, etc. to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal of gradually migrating off Datadog.
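tapoxi's Java services need only the -javaagent flag, but for SDK-instrumented code (such as the Go services discussed elsewhere in this thread) pointing an app at the collector looks roughly like the sketch below. The in-cluster endpoint name is an assumption, and a real setup would also attach a resource and configure propagators:

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // Ship OTLP over gRPC to the in-cluster collector. The
        // "otel-collector:4317" service name is a placeholder that would
        // come from chart configuration (or OTEL_EXPORTER_OTLP_ENDPOINT).
        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("otel-collector:4317"),
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            log.Fatal(err)
        }

        // Batch spans in-process and hand them to the exporter.
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        defer func() { _ = tp.Shutdown(ctx) }()
        otel.SetTracerProvider(tp)

        // ... run the app; telemetry now flows to the collector, which
        // fans out to Datadog, Honeycomb, an internal service, etc.
    }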
| andrewstuart2 wrote:
| We tried this about a year and a half ago and ended up going somewhat backwards into DD entrenchment, because they've decided that anything that's not an official DD metric (that is, typically collected by their agent) is custom and then becomes substantially more expensive. We wanted a nice migration path from any vendor to any other vendor, but they have a fairly effective strategy for making gradual migrations more expensive for heavy telemetry users. At least our instrumentation these days is otel, but the metrics we expected to just scrape from prometheus are the ones we had to dial back, switching to official DD agent metrics and configs to get them, lest our bill balloon by 10x. It's a frustrating place to be, especially since it's still not remotely cheap; it just could be way worse.
|
| I know this isn't a Datadog post, and I'm a bit off topic, but I try to do my best to warn against DD these days.

| shawnb576 wrote:
| This has been a concern for me too. But the agent is just a statsd receiver with some extra magic, so this seems like a thing that could be solved by the collector sending traffic to an agent rather than to the HTTP APIs?
|
| I looked at the OTel DD stuff and did not see any support for this, fwiw; maybe it doesn't work b/c the agent expects more context from the pod (e.g. app and label)?

| andrewstuart2 wrote:
| Yeah, the DD agent and the otel-collector DD exporter actually use the same code paths for the most part. The relevant difference tends to be in metrics, where the official path involves the DD agent doing collection directly, for example collecting redis metrics when you give the agent your redis database hostname and creds. It can then pack those into the specific shape that DD knows about, and they get sent with the right names, values, etc., so that DD calls them regular metrics.
|
| If you instead go the more flexible route of using the de-facto standard prometheus exporters (like the one for redis), or built-in prometheus metrics from something like istio, and forward those to your agent or configure your agent to poll them, it won't do any reshaping (which I can see the arguments for, kinda, knowing a bit about their backend). They just end up in the DD backend as custom metrics, charged at $0.10/mo per 100 time series. If you've used prometheus before for any realistic deployment with enrichment etc., you can probably see this gets expensive ridiculously fast.
|
| What I wish they'd do instead is have some form of adapter from those de-facto standards, so I could still collect metrics 99% my own way, in a portable fashion, and then add DD as my backend without everything ending up as custom metrics and costing significantly more.

| xyst wrote:
| > somewhat backwards into DD entrenchment, because they've decided that anything not an official DD metric (that is, collected by their agent typically) is custom and then becomes substantially more expensive.
|
| If a vendor pulled shit like this on me, that's when I would cancel them. Of course, most big orgs would rather not do the legwork to actually become portable and migrate off the vendor, so they just pay the bill.
|
| Vendors love the custom shit they build because they know that once it's infiltrated the stack, it's basically like gangrene (you have to cut off the appendage to save the host).

| k__ wrote:
| I had the impression that logs and metrics were a pre-observability thing.

| SteveNuts wrote:
| I've never heard the term "pre-observability", what does that mean?

| renegade-otter wrote:
| The era when "debugging in production" wasn't standard.

| marcosdumay wrote:
| Observability is about logs and metrics, and pre-observability (I guess you mean the high-level-only records simpler environments keep) is also about logs and metrics.
|
| Anything you register to keep track of your environment has the form of either logs or metrics. The difference is in the contents of those logs and metrics.

| tsamba wrote:
| Interesting read. What did you find easier about using GCP's log tooling for your internal system logs, rather than the OTel collector?

| roskilli wrote:
| > Moreover, we encountered some rough edges in the metrics-related functionality of the Go SDK referenced above. Ultimately, we had to write a conversion layer on top of the OTel metrics API that allowed for simple, Prometheus-like counters, gauges, and histograms.
|
| Have encountered this a lot from teams attempting to use the metrics SDK.
|
| Are you open to commenting on specifics here, and on what kind of shim you had to put in front of the SDK? It would be great to keep gathering feedback so that we as a community have a good idea of what remains before it's possible to use the SDK for real-world production use cases in anger. Just wiring up the setup in your app used to be fairly painful, but that has gotten somewhat better over the last 12-24 months. I'd also love to hear what is currently causing compatibility issues with the metric types themselves, which of them requires a shim, and what the shim is doing to achieve compatibility.

| bhyolken wrote:
| Sure, happy to provide more specifics!
|
| Our main issue was the lack of a synchronous gauge. The officially supported asynchronous API of registering a callback function to report a gauge metric is very different from how we were doing things before, and would have required lots of refactoring of our code. Instead, we wrote a wrapper that exposes a synchronous-like API: https://gist.github.com/yolken-airplane/027867b753840f7d15d6...
|
| It seems like this is a common feature request across many of the SDKs, and it's in the process of being fixed in some of them (https://github.com/open-telemetry/opentelemetry-specificatio...). I'm not sure what the plans are for the golang SDK specifically.
|
| Another, more minor issue is the lack of support for "constant" attributes that are applied to all observations of a metric. We use these to identify the app, among other use cases, so we added wrappers around the various "Add", "Record", "Observe", etc. calls that add them automatically. (It's totally possible that this is supported and I missed it, in which case please let me know.)
|
| Overall, the SDK was generally well-written and well-documented; we just needed some extra work to make the interfaces more similar to the ones we were using before.
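A minimal sketch of the kind of wrapper bhyolken describes: a synchronous Set() layered over the SDK's asynchronous observable-gauge API, whose callback simply reports the last value written. The Gauge and NewGauge names are invented for illustration; the real wrapper is in the gist linked above and may differ:

    package telemetry

    import (
        "context"
        "sync"

        "go.opentelemetry.io/otel/metric"
    )

    // Gauge remembers the last value passed to Set.
    type Gauge struct {
        mu  sync.Mutex
        val float64
    }

    // Set updates the gauge synchronously, Prometheus-style.
    func (g *Gauge) Set(v float64) {
        g.mu.Lock()
        g.val = v
        g.mu.Unlock()
    }

    // NewGauge registers an observable gauge whose callback reports
    // whatever value Set last stored.
    func NewGauge(meter metric.Meter, name string) (*Gauge, error) {
        g := &Gauge{}
        obs, err := meter.Float64ObservableGauge(name)
        if err != nil {
            return nil, err
        }
        _, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
            g.mu.Lock()
            v := g.val
            g.mu.Unlock()
            o.ObserveFloat64(obs, v)
            return nil
        }, obs)
        return g, err
    }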
| arccy wrote:
| the official SDKs will only support an api once there's a spec that allows it.
|
| for const attributes, generally these should be defined at the resource / provider level: https://pkg.go.dev/go.opentelemetry.io/otel/sdk/metric#WithR...
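Concretely, arccy's suggestion looks roughly like this in the Go SDK (the attribute names and values here are invented for illustration). Note that resource attributes are fixed per provider, so unlike bhyolken's call-site wrappers they can't vary from one metric to another:

    package main

    import (
        "go.opentelemetry.io/otel/attribute"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
        "go.opentelemetry.io/otel/sdk/resource"
    )

    func main() {
        // Attach app-identifying attributes once, at the provider level,
        // instead of on every Add/Record/Observe call.
        res, err := resource.Merge(
            resource.Default(),
            resource.NewWithAttributes("", // empty schema URL
                attribute.String("service.name", "api"),
                attribute.String("deployment.environment", "prod"),
            ),
        )
        if err != nil {
            panic(err)
        }

        mp := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
        _ = mp // install with otel.SetMeterProvider(mp) in real code
    }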
| CSMastermind wrote:
| > The data collected from these streams is sent to several vendors including Datadog (for application logs and metrics), Honeycomb (for traces), and Google Cloud Logging (for infrastructure logs).
|
| It sounds like they were in a place that a lot of companies are in, where they don't have a single pane of glass for observability. One of the main benefits (if not the main benefit) I've gotten out of Datadog is having everything in Datadog, so that it's all connected and I can easily jump from a trace to logs, for instance.
|
| One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own personal preference for tooling, and ultimately the collective experience is significantly worse than the sum of its parts.

| devin wrote:
| Eh, personally I view honeycomb and datadog as different enough offerings that I can see why you'd choose to have both.

| dexterdog wrote:
| Depending on your usage it can be prohibitively expensive to use datadog for everything like that. We have it for just our prod env, because it's just not worth what it brings to the table to put all of our logs into it.

| maccard wrote:
| I've spent a small amount of time in datadog, lots in grafana, and somewhere in between in honeycomb. Our applications are designed to emit traces, and comparing honeycomb with tracing to a traditional app with metrics and logs, I would choose tracing every time.
|
| It annoys me that logs are overlooked in honeycomb (and metrics are... fine). But given the choice between a single pane of glass in grafana, or doing logs (and sometimes metrics) in cloudwatch but spending 95% of my time in honeycomb, I'd pick honeycomb every time.

| mdtusz wrote:
| Agreed. Honeycomb has been a boon, but some improvements to metric displays and the ability to set the default "board" used on the home page would be very welcome. I'd also be pretty happy if there were a way to drop events on the honeycomb side, as a way to dynamically filter, e.g. "don't even bother storing this trace if it has a http.status_code < 400". This is surprisingly painful to implement on the application side (at least in rust).
|
| Hopefully someone who works there is reading this.

| masterj wrote:
| It sounds like you should look into their tail-sampling Refinery tool: https://docs.honeycomb.io/manage-data-volume/refinery/

| viraptor wrote:
| Have you tried the traces in grafana/tempo yet? https://grafana.com/docs/grafana/latest/panels-visualization...
|
| It seems to miss some aggregation stuff, but it's also improving every time I check. I wonder if anyone's used it in anger yet, and how far it is from replacing datadog or honeycomb.

| arccy wrote:
| tempo still feels very much: look at a trace that you found from elsewhere (like logs)

| rewmie wrote:
| > It sounds like they were in a place that a lot of companies are in where they don't have a single pane of glass for observability.
|
| One of the biggest features of AWS, one that is very easy to take for granted and let go unnoticed, is Amazon CloudWatch. It supports metrics, logging, alarms, metrics from alarms, alarms from alarms, querying historical logs, triggering actions, etc. etc., and it covers each and every service provided by AWS, including metaservices like AWS Config and CloudTrail.
|
| And you barely notice it. It's just there, and you can see everything.
|
| > One of the terrible mistakes I see companies make with this tooling is fragmenting like this.
|
| So much this. It's not fun at all to have to go through logs and metrics on any application, and much less so if for some reason their maintainers scattered their metrics emission to the four winds. With AWS, however, all roads lead to CloudWatch, and everything is so much better.

| badloginagain wrote:
| I feel we hold up a single observability solution as the Holy Grail, and I can see the argument for it: one place to understand the health of your services.
|
| But I've also been in terrible vendor lock-in situations, being bent over the barrel because switching to a better solution is so damn expensive.
|
| At least now with OTel you have an open standard that lets you switch more easily, but even then I'd rather have 2 solutions that meet my exact observability requirements than a single solution that does everything OKish.

| nevon wrote:
| I would love to save a few hundred thousand a year by running the OTel collector instead of Datadog agents, on the cost-per-host alone. Unfortunately that would also mean giving up Datadog APM and NPM, as far as I can tell, which have been really valuable. Going back to just metrics and traces would feel like quite the step backwards and be a hard sell.

| arccy wrote:
| you can submit opentelemetry traces to datadog, which should be the equivalent of apm/npm, though maybe with a less polished integration.

| throwaway084t95 wrote:
| What is the "first principles" argument that observability decomposes into logs, metrics, and tracing? I see this dogma accepted everywhere, but I'm curious about it.

| yannyu wrote:
| First you had logs. Everyone uses logs because they're easy. Logs are great, but suddenly you're spending a crapton of time or money maintaining terabytes or petabytes of log storage and ingest. And even worse, in many cases you don't actually care about 99% of the log line and simply want a single number, such as CPU utilization, the value of the shopping cart, or latency.
|
| So someone says, "let's make something smaller and more portable than logs. We need to track numerical data over time more easily, so that we can see pretty charts of when these values are outside of where they should be." This ends up being metrics and the time-series database (TSDB), built not to handle arbitrary lines of text but to parse out metadata and append numerical data to existing time series based on that metadata.
|
| Between metrics and logs, you end up with a good idea of what's going on with your infrastructure, but logs are still too verbose to understand what's happening with your applications past a certain point. If you have an application crashing repeatedly, or applications running slowly, metrics and logs can't really help you there. So companies built out Application Performance Monitoring (APM), meant to tap directly into the processes running on the box and spit out all sorts of interesting runtime metrics and events about not just the applications, but the specific methods and calls those applications are using within their code.
|
| This worked great as long as you were running APM tools on a single box within a monolithic stack, but as the world moved toward cloud service providers and containerized/ephemeral infrastructure, APM stopped being as effective. When a transaction goes through multiple machines and microservices, APM deployed on those boxes individually can't give you the context of how the disparate calls relate to one holistic transaction.
|
| So someone says, "hey, what if we include transaction IDs in these service calls, so that we can post-hoc stitch together these individual transaction lines into a whole transaction, end-to-end?" Which is how you end up with the concept of spans and traces, taking what worked well with APM and generalizing it out to the microservices architectures that are more common today.
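yannyu's "transaction ID" is what OpenTelemetry calls a trace ID: it is minted by the root span, carried in a context, and shared by every child span so a backend can stitch the transaction back together. A minimal Go sketch with invented span names; across process boundaries the same ID would travel in a W3C traceparent header rather than an in-process context:

    package main

    import (
        "context"
        "fmt"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/trace"
    )

    // chargeCard stands in for a downstream call; the trace ID rides along
    // in ctx, so its span joins the same trace as the caller's.
    func chargeCard(ctx context.Context, tracer trace.Tracer) {
        _, span := tracer.Start(ctx, "payments.ChargeCard")
        defer span.End()
        // ... call the payment provider ...
    }

    func main() {
        // No-op unless a real TracerProvider is installed (see the
        // collector setup sketch earlier in the thread).
        tracer := otel.Tracer("example")

        // The root span mints the trace ID for the whole transaction.
        ctx, root := tracer.Start(context.Background(), "HTTP POST /checkout")
        defer root.End()

        chargeCard(ctx, tracer)

        // Every span started from ctx shares this ID, which is what lets a
        // backend reassemble the end-to-end transaction.
        fmt.Println(root.SpanContext().TraceID())
    }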
| shoelessone wrote:
| I really, really want to use OTel for a small project, but I've always had a tough time finding a path that is cheap or free for a personal project.
|
| In theory you can send telemetry data with OTel to CloudWatch, but I've struggled to connect the dots with the front-end application (e.g. React/Next.js).

| arccy wrote:
| grafana cloud, honeycomb, etc. have free tiers, though you'll have to watch how much data you send them. or you can self-host something like signoz or the elastic stack. the frontend will typically go to an instance of the opentelemetry collector to filter/convert to the protocol for the storage backend.
___________________________________________________________________
(page generated 2023-11-16 23:00 UTC)