[HN Gopher] Migrating to OpenTelemetry
       ___________________________________________________________________
        
       Migrating to OpenTelemetry
        
       Author : kkoppenhaver
       Score  : 127 points
       Date   : 2023-11-16 17:29 UTC (5 hours ago)
        
 (HTM) web link (www.airplane.dev)
 (TXT) w3m dump (www.airplane.dev)
        
       | caust1c wrote:
       | Curious about the code implemented for logs! Hopefully that's
       | something that can be shared at some point. Also curious if it
       | integrates with `log/slog` :-)
       | 
       | Congrats too! As I understand it from stories I've heard from
       | others, migrating to OTel is no easy undertaking.
        
         | bhyolken wrote:
         | Thanks! For logs, we actually use github.com/segmentio/events
         | and just implemented a handler for that library that batches
         | logs and periodically flushes them out to our collector using
         | the underlying protocol buffer interface. We plan on migrating
         | to log/slog soon, and once we do that we'll adapt our handler
         | and can share the code.
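          | 
          | In the meantime, here's the rough shape it would take with
          | log/slog (a simplified sketch, not our actual handler; the
          | batch size, interval, and the sendBatch export step are
          | placeholders):
          | 
          |   package main
          | 
          |   import (
          |     "context"
          |     "log/slog"
          |     "sync"
          |     "time"
          |   )
          | 
          |   // BatchHandler is a sketch of a slog.Handler that buffers
          |   // records and flushes them in batches, either when the
          |   // buffer fills up or on a timer.
          |   type BatchHandler struct {
          |     mu    sync.Mutex
          |     buf   []slog.Record
          |     limit int
          |   }
          | 
          |   func NewBatchHandler(limit int, every time.Duration) *BatchHandler {
          |     h := &BatchHandler{limit: limit}
          |     go func() {
          |       for range time.Tick(every) {
          |         h.flush()
          |       }
          |     }()
          |     return h
          |   }
          | 
          |   func (h *BatchHandler) Enabled(context.Context, slog.Level) bool {
          |     return true
          |   }
          | 
          |   func (h *BatchHandler) Handle(_ context.Context, r slog.Record) error {
          |     h.mu.Lock()
          |     h.buf = append(h.buf, r.Clone()) // Clone: record outlives Handle
          |     full := len(h.buf) >= h.limit
          |     h.mu.Unlock()
          |     if full {
          |       h.flush()
          |     }
          |     return nil
          |   }
          | 
          |   // Simplified; a real handler would track attrs and groups.
          |   func (h *BatchHandler) WithAttrs([]slog.Attr) slog.Handler { return h }
          |   func (h *BatchHandler) WithGroup(string) slog.Handler      { return h }
          | 
          |   func (h *BatchHandler) flush() {
          |     h.mu.Lock()
          |     batch := h.buf
          |     h.buf = nil
          |     h.mu.Unlock()
          |     sendBatch(batch)
          |   }
          | 
          |   // sendBatch is a placeholder for converting the records to
          |   // OTLP and sending them to the collector.
          |   func sendBatch(batch []slog.Record) {}
          | 
          |   func main() {
          |     slog.SetDefault(slog.New(NewBatchHandler(100, 5*time.Second)))
          |     slog.Info("hello from the batch handler")
          |   }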
        
           | caust1c wrote:
           | Awesome! Great work and thanks for sharing your experience!
        
       | MajimasEyepatch wrote:
       | It's interesting that you're using both Honeycomb and Datadog.
       | With everything migrated to OTel, would there be advantages to
       | consolidating on just Honeycomb (or Datadog)? Have you found
       | they're useful for different things, or is there enough overlap
       | that you could use just one or the other?
        
         | bhyolken wrote:
         | Author here, thanks for the question! The current split
         | developed from the personal preferences of the engineers who
         | initially set up our observability systems, based on what they
         | had used (and liked) at previous jobs.
         | 
         | We're definitely open to doing more consolidation in the
         | future, especially if we can save money by doing that, but from
         | a usability standpoint we've been pretty happy with Honeycomb
          | for traces and Datadog for everything else so far. And that
          | seems to be aligned with what each vendor is best at right
          | now.
        
           | MuffinFlavored wrote:
           | > from the personal preferences of the engineers
           | 
           | https://www.honeycomb.io/pricing
           | 
           | https://www.datadoghq.com/pricing/
           | 
            | Am I wrong to say... having 2 is "expensive"? Maybe not if
            | 50% of your stuff is going to Honeycomb and 50% going to
            | DataDog. Could you save money/complexity (fewer places to
            | look for things) by having just DataDog or just Honeycomb?
        
             | bhyolken wrote:
             | Right now, there isn't much duplication of what we're
             | sending to each vendor, so I don't think we'd save a ton by
             | consolidating, at least based on list prices. We could
             | maybe negotiate better prices based on higher volumes, but
             | I'm not sure if Airplane is spending enough at this point
             | to get massive discounts there.
             | 
             | Another potential benefit would definitely be reduced
             | complexity and better integration for the engineering team.
             | So, for instance, you could look at a log and then more
             | easily navigate to the UI for the associated trace.
             | Currently, we do this by putting Honeycomb URLs in our
             | Datadog log events, which works but isn't quite as
             | seamless. But, given that our team is pretty small at this
             | point and that we're not spending a ton of our time on
             | performance optimizations, we don't feel an urgent need to
             | consolidate (yet).
        
               | MuffinFlavored wrote:
               | When you say DataDog for everything else (as in not
               | traces), besides logs, what else do you mean?
        
               | claytonjy wrote:
               | Metrics, probably? The article calls out logs, metrics,
               | and traces as the 3 pillars of observability.
        
               | bhyolken wrote:
               | Yeah, metrics and logs, plus a few other things that
               | depend on these (alerts, SLOs, metric-based dashboards,
               | etc.).
        
       | tapoxi wrote:
       | I made this switch very recently. For our Java apps it was as
       | simple as loading the otel agent in place of the Datadog SDK,
       | basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in
       | our args.
       | 
        | The collector (which processes and ships telemetry) can be
       | installed in K8S through Helm or an operator, and we just added a
       | variable to our charts so the agent can be pointed at the
       | collector. The collector speaks OTLP which is the fancy combined
       | metrics/traces/logs protocol the OTEL SDKs/agents use, but it
       | also speaks Prometheus, Zipkin, etc to give you an easy migration
       | path. We currently ship to Datadog as well as an internal
       | service, with the end goal being migrating off of Datadog
       | gradually.
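        | 
        | For a Go service (as in the linked article), the SDK-side
        | equivalent is pointing an OTLP exporter at the collector; a
        | minimal sketch, with an assumed in-cluster endpoint (the
        | exporter also honors OTEL_EXPORTER_OTLP_ENDPOINT, which is what
        | a chart-injected variable would typically set):
        | 
        |   package main
        | 
        |   import (
        |     "context"
        |     "log"
        | 
        |     "go.opentelemetry.io/otel"
        |     "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        |     sdktrace "go.opentelemetry.io/otel/sdk/trace"
        |   )
        | 
        |   func main() {
        |     ctx := context.Background()
        | 
        |     // Endpoint value is illustrative.
        |     exp, err := otlptracegrpc.New(ctx,
        |       otlptracegrpc.WithEndpoint("otel-collector:4317"),
        |       otlptracegrpc.WithInsecure(),
        |     )
        |     if err != nil {
        |       log.Fatal(err)
        |     }
        | 
        |     tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        |     defer tp.Shutdown(ctx)
        |     otel.SetTracerProvider(tp)
        |   }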
        
         | andrewstuart2 wrote:
          | We tried this about a year and a half ago and ended up going
          | somewhat backwards into DD entrenchment, because they've
          | decided that anything that isn't an official DD metric (that
          | is, typically one collected by their agent) is custom and
          | becomes substantially more expensive. We wanted a nice
          | migration path from any vendor to any other vendor, but they
          | have a fairly effective strategy for making gradual migrations
          | more expensive for heavy telemetry users. At least our
          | instrumentation these days is otel, but the metrics we
          | expected to just scrape from prometheus are the ones we had to
          | dial back, switching to the official DD agent metrics and
          | configs instead, lest our bill balloon by 10x. It's a
          | frustrating place to be, especially since it's still not
          | remotely cheap; it just could be way worse.
         | 
         | I know this isn't a DataDog post, and I'm a bit off topic, but
         | I try to do my best to warn against DD these days.
        
           | shawnb576 wrote:
           | This has been a concern for me too. But the agent is just a
           | statsd receiver with some extra magic, so this seems like a
           | thing that could be solved with the collector sending traffic
           | to an agent rather than the HTTP APIs?
           | 
           | I looked at the OTel DD stuff and did not see any support for
           | this, fwiw, maybe it doesn't work b/c the agent expects more
           | context from the pod (e.g. app and label?)
        
             | andrewstuart2 wrote:
             | Yeah, the DD agent and the otel-collector DD exporter
             | actually use the same code paths for the most part. The
             | relevant difference tends to be in metrics, where the
             | official path involves the DD agent doing collection
             | directly, for example, collecting redis metrics by giving
             | the agent your redis database hostname and creds. It can
             | then pack those into the specific shape that DD knows about
             | and they get sent with the right name, values, etc so that
             | DD calls them regular metrics.
             | 
              | If you instead go the more flexible route of using the de
              | facto standard prometheus exporters, like the one for
              | redis, or built-in prometheus metrics from something like
              | istio, and forward those to your agent or configure the
              | agent to poll them, it won't do any reshaping (which I can
              | see the arguments for, kinda, knowing a bit about their
              | backend). They just end up in the DD backend as custom
              | metrics, billed at $0.10/mo per 100 time series. If you've
              | used prometheus before for any realistic deployment with
              | enrichment etc, you can probably see this gets expensive
              | ridiculously fast.
             | 
             | What I wish they'd do instead is have some form of adapter
             | from those de facto standards, so I can still collect
             | metrics 99% my own way, in a portable fashion, and then add
             | DD as my backend without ending up as custom everything,
             | costing significantly more.
        
           | xyst wrote:
           | > somewhat backwards into DD entrenchment, because they've
           | decided that anything not an official DD metric (that is,
           | collected by their agent typically) is custom and then
           | becomes substantially more expensive.
           | 
            | If a vendor pulled shit like this on me, that's when I would
            | cancel them. Of course, most big orgs would rather not do
            | the legwork to actually become portable and migrate off the
            | vendor, so of course they will just pay the bill.
            | 
            | Vendors love the custom shit they build because they know
            | that once it's infiltrated the stack, it's basically like
            | gangrene: you have to cut off the appendage to save the
            | host.
        
       | k__ wrote:
        | I had the impression that logs and metrics are a pre-
        | observability thing.
        
         | SteveNuts wrote:
         | I've never heard the term "pre-observability", what does that
         | mean?
        
           | renegade-otter wrote:
           | The era when "debugging in production" wasn't standard.
        
         | marcosdumay wrote:
         | Observability is about logs and metrics, and pre-observability
         | (I guess you mean the high-level-only records simpler
         | environments keep) is also about logs and metrics.
         | 
         | Anything you register to keep track of your environment has the
         | form of either logs or metrics. The difference is about the
         | contents of such logs and metrics.
        
       | tsamba wrote:
       | Interesting read. What did you find easier about using GCP's log
       | tooling for your internal system logs, rather than the OTel
       | collector?
        
       | roskilli wrote:
       | > Moreover, we encountered some rough edges in the metrics-
       | related functionality of the Go SDK referenced above. Ultimately,
       | we had to write a conversion layer on top of the OTel metrics API
       | that allowed for simple, Prometheus-like counters, gauges, and
       | histograms.
       | 
       | Have encountered this a lot from teams attempting to use the
       | metrics SDK.
       | 
        | Are you open to commenting on the specifics here, and on what
        | kind of shim you had to put in front of the SDK? It would be
        | great to keep gathering feedback so that we as a community have
        | a good idea of what remains before it's possible to use the SDK
        | in anger for real-world production use cases. Just wiring up the
        | setup in your app used to be fairly painful, but that has gotten
        | somewhat better over the last 12-24 months. I'd also love to
        | hear what compatibility issues with the metric types themselves
        | currently require a shim, and what the shim is doing to achieve
        | compatibility.
        
         | bhyolken wrote:
         | Sure, happy to provide more specifics!
         | 
         | Our main issue was the lack of a synchronous gauge. The
         | officially supported asynchronous API of registering a callback
         | function to report a gauge metric is very different from how we
         | were doing things before, and would have required lots of
         | refactoring of our code. Instead, we wrote a wrapper that
         | exposes a synchronous-like API: https://gist.github.com/yolken-
         | airplane/027867b753840f7d15d6....
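          | 
          | Roughly, the idea is to hold the latest value in the wrapper
          | and report it from the registered callback at collection time.
          | A simplified sketch (not the exact code in the gist):
          | 
          |   import (
          |     "context"
          |     "sync"
          | 
          |     "go.opentelemetry.io/otel/metric"
          |   )
          | 
          |   // SyncGauge exposes a Prometheus-style Set() on top of the
          |   // asynchronous OTel gauge API.
          |   type SyncGauge struct {
          |     mu  sync.Mutex
          |     val float64
          |   }
          | 
          |   func NewSyncGauge(meter metric.Meter, name string) (*SyncGauge, error) {
          |     g := &SyncGauge{}
          |     _, err := meter.Float64ObservableGauge(name,
          |       metric.WithFloat64Callback(
          |         func(_ context.Context, o metric.Float64Observer) error {
          |           g.mu.Lock()
          |           defer g.mu.Unlock()
          |           o.Observe(g.val)
          |           return nil
          |         }),
          |     )
          |     return g, err
          |   }
          | 
          |   func (g *SyncGauge) Set(v float64) {
          |     g.mu.Lock()
          |     g.val = v
          |     g.mu.Unlock()
          |   }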
         | 
         | It seems like this is a common feature request across many of
         | the SDKs, and it's in the process of being fixed in some of
         | them (https://github.com/open-telemetry/opentelemetry-
         | specificatio...)? I'm not sure what the plans are for the
         | golang SDK specifically.
         | 
          | Another, more minor issue is the lack of support for
         | "constant" attributes that are applied to all observations of a
         | metric. We use these to identify the app, among other use
         | cases, so we added wrappers around the various "Add", "Record",
         | "Observe", etc. calls that automatically add these. (It's
         | totally possible that this is supported and I missed it, in
         | which case please let me know.)
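          | 
          | Those wrappers are pretty thin; roughly this shape (a
          | simplified sketch, not our exact code):
          | 
          |   import (
          |     "context"
          | 
          |     "go.opentelemetry.io/otel/attribute"
          |     "go.opentelemetry.io/otel/metric"
          |   )
          | 
          |   // AppCounter stamps a fixed attribute set (e.g. the app
          |   // name) onto every Add call.
          |   type AppCounter struct {
          |     counter metric.Int64Counter
          |     attrs   []attribute.KeyValue
          |   }
          | 
          |   func (c AppCounter) Add(
          |     ctx context.Context, n int64, extra ...attribute.KeyValue,
          |   ) {
          |     all := append(append([]attribute.KeyValue{}, c.attrs...), extra...)
          |     c.counter.Add(ctx, n, metric.WithAttributes(all...))
          |   }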
         | 
          | Overall, the SDK was generally well-written and well-
          | documented; we just needed some extra work to make the
          | interfaces more similar to the ones we were using before.
        
           | arccy wrote:
           | the official SDKs will only support an api once there's a
           | spec that allows it.
           | 
           | for const attributes, generally these should be defined at
           | the resource / provider level: https://pkg.go.dev/go.opentele
           | metry.io/otel/sdk/metric#WithR...
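            | 
            | e.g. something roughly like this (names and values are
            | illustrative):
            | 
            |   import (
            |     "go.opentelemetry.io/otel/attribute"
            |     sdkmetric "go.opentelemetry.io/otel/sdk/metric"
            |     "go.opentelemetry.io/otel/sdk/resource"
            |     semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
            |   )
            | 
            |   func newMeterProvider() *sdkmetric.MeterProvider {
            |     // attributes on the resource get attached to every
            |     // metric the provider produces
            |     res, _ := resource.Merge(resource.Default(),
            |       resource.NewWithAttributes(semconv.SchemaURL,
            |         semconv.ServiceName("my-app"),
            |         attribute.String("deployment.environment", "prod"),
            |       ))
            |     return sdkmetric.NewMeterProvider(
            |       sdkmetric.WithResource(res),
            |       // plus sdkmetric.WithReader(...) for an exporter
            |     )
            |   }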
        
       | CSMastermind wrote:
       | > The data collected from these streams is sent to several
       | vendors including Datadog (for application logs and metrics),
       | Honeycomb (for traces), and Google Cloud Logging (for
       | infrastructure logs).
       | 
       | It sounds like they were in a place that a lot of companies are
       | in where they don't have a single pane of glass for
        | observability. One of the main benefits I've gotten out of
        | Datadog, if not the main one, is having everything in Datadog so
        | that it's all connected and I can easily jump from a trace to
        | its logs, for instance.
       | 
       | One of the terrible mistakes I see companies make with this
        | tooling is fragmenting like this. Everyone has their own personal
        | tool preference, and ultimately the collective experience ends up
        | significantly worse than the sum of its parts.
        
         | devin wrote:
         | Eh, personally I view honeycomb and datadog as different enough
         | offerings that I can see why you'd choose to have both.
        
         | dexterdog wrote:
         | Depending on your usage it can be prohibitively expensive to
         | use datadog for everything like that. We have it for just our
         | prod env because it's just not worth what it brings to the
         | table to put all of our logs into it.
        
         | maccard wrote:
         | I've spent a small amount of time in datadog, lots in grafana,
        | and somewhere in between in honeycomb. Our applications are
         | designed to emit traces, and comparing honeycomb with tracing
         | to a traditional app with metrics and logs, I would choose
         | tracing every time.
         | 
        | It annoys me that logs are overlooked in honeycomb (and
        | metrics are... fine). But given the choice between a single
        | pane of glass in grafana, or having to do logs (and sometimes
        | metrics) in cloudwatch while spending 95% of my time in
        | honeycomb, I'd pick honeycomb every time.
        
           | mdtusz wrote:
            | Agreed. Honeycomb has been a boon; however, some improvements
            | to metric displays and the ability to set the default "board"
            | used on the home page would be very welcome. I'd also be
            | pretty happy if there were a way to drop events on the
            | honeycomb side to filter dynamically, e.g. "don't even bother
            | storing this trace if it has an http.status_code < 400". This
            | is surprisingly painful to implement on the application side
            | (at least in rust).
           | 
           | Hopefully someone that works there is reading this.
        
             | masterj wrote:
             | It sounds like you should look into their tail-sampling
             | Refinery tool https://docs.honeycomb.io/manage-data-
             | volume/refinery/
        
           | viraptor wrote:
           | Have you tried the traces in grafana/tempo yet?
           | https://grafana.com/docs/grafana/latest/panels-
           | visualization...
           | 
            | It seems to be missing some aggregation stuff, but it's also
            | improving every time I check. I wonder if anyone's used it in
            | anger yet and how far it is from replacing datadog or
            | honeycomb.
        
             | arccy wrote:
             | tempo still feels very much: look at a trace that you found
             | from elsewhere (like logs)
        
         | rewmie wrote:
         | > It sounds like they were in a place that a lot of companies
         | are in where they don't have a single pane of glass for
         | observability.
         | 
         | One of the biggest features of AWS which is very easy to take
          | for granted and go unnoticed is Amazon CloudWatch. It supports
          | metrics, logging, alarms, metrics from alarms, alarms from
          | alarms, querying historical logs, triggering actions, etc.,
          | and it covers each and every service provided by AWS,
          | including metaservices like AWS Config and CloudTrail.
         | 
         | And you barely notice it. It's just there, and you can see
         | everything.
         | 
         | > One of the terrible mistakes I see companies make with this
         | tooling is fragmenting like this.
         | 
          | So much this. It's not fun at all to have to go through logs
          | and metrics on any application, and much less so if for some
          | reason their maintainers scattered their metrics emission to
          | the four winds. However, with AWS all roads lead to CloudWatch,
          | and everything is so much better.
        
         | badloginagain wrote:
          | I feel we hold up a single observability solution as the Holy
          | Grail, and I can see the argument for it: one place to
          | understand the health of your services.
          | 
          | But I've also been in terrible vendor lock-in situations, bent
          | over a barrel because switching to a better solution is so
          | damn expensive.
          | 
          | At least now with OTel you have an open standard that lets you
          | switch more easily, but even then I'd rather have 2 solutions
          | that meet my exact observability requirements than a single
          | solution that does everything OKish.
        
       | nevon wrote:
        | I would love to save a few hundred thousand a year by running
        | the OTel collector over Datadog agents, just on the cost-per-host
        | alone. Unfortunately that would also mean giving up Datadog APM
        | and NPM, as far as I can tell, which have been really valuable.
        | Going back to just metrics and traces would feel like quite the
        | step backwards and be a hard sell.
        
         | arccy wrote:
         | you can submit opentelemetry traces to datadog which should be
         | the equivalent of apm/npm, though maybe with a less polished
         | integration.
        
       | throwaway084t95 wrote:
       | What is the "first principles" argument that observability
        | decomposes into logs, metrics, and tracing? I see this dogma
        | accepted everywhere, but I'm curious about it.
        
         | yannyu wrote:
         | First you had logs. Everyone uses logs because it's easy. Logs
         | are great, but suddenly you're spending a crapton of time or
         | money maintaining terabytes or petabytes of storage and ingest
         | of logs. And even worse, in some cases for these logs, you
         | don't actually care about 99% of the log line and simply want a
         | single number, such as CPU utilization or the value of the
         | shopping cart or latency.
         | 
         | So, someone says, "let's make something smaller and more
         | portable than logs. We need to track numerical data over time
         | more easily, so that we can see pretty charts of when these
          | values are outside of where they should be." This ends up being
          | metrics and a time-series database (TSDB), built not to handle
          | arbitrary lines of text but to parse out metadata and append
          | numerical data points to existing time series keyed on that
          | metadata.
         | 
         | Between metrics and logs, you end up with a good idea of what's
         | going on with your infrastructure, but logs are still too
         | verbose to understand what's happening with your applications
         | past a certain point. If you have an application crashing
         | repeatedly, or if you've got applications running slowly,
         | metrics and logs can't really help you there. So companies
         | built out Application Performance Monitoring, meant to tap
         | directly into the processes running on the box and spit out all
         | sorts of interesting runtime metrics and events about not just
         | the applications, but the specific methods and calls those
         | applications are utilizing within their stack/code.
         | 
         | Initially, this works great if you're running these APM tools
         | on a single box within monolithic stacks, but as the world
         | moved toward Cloud Service Providers and
         | containerized/ephemeral infrastructure, APM stopped being as
         | effective. When a transaction starts to go through multiple
         | machines and microservices, APM deployed on those boxes
         | individually can't give you the context of how these disparate
         | calls relate to a holistic transaction.
         | 
         | So someone says, "hey, what if we include transaction IDs in
         | these service calls, so that we can post-hoc stitch together
         | these individual transaction lines into a whole transaction,
         | end-to-end?" Which is how you end up with the concept of spans
         | and traces, taking what worked well with Application
         | Performance Monitoring and generalizing that out into the
         | modern microservices architectures that are more common today.
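          | 
          | In OpenTelemetry terms, that shared transaction ID is the
          | trace ID: child spans started from a parent's context carry
          | the same trace ID, and it crosses process boundaries in
          | headers like W3C traceparent. A tiny illustrative sketch in
          | Go:
          | 
          |   package main
          | 
          |   import (
          |     "context"
          |     "fmt"
          | 
          |     sdktrace "go.opentelemetry.io/otel/sdk/trace"
          |   )
          | 
          |   func main() {
          |     tracer := sdktrace.NewTracerProvider().Tracer("example")
          | 
          |     // The child, started from the parent's context, carries
          |     // the same trace ID; that ID is what lets a backend
          |     // stitch the spans back into one end-to-end transaction.
          |     ctx, parent := tracer.Start(context.Background(), "checkout")
          |     _, child := tracer.Start(ctx, "charge-card")
          |     defer parent.End()
          |     defer child.End()
          | 
          |     fmt.Println(parent.SpanContext().TraceID() ==
          |       child.SpanContext().TraceID()) // true
          |   }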
        
       | shoelessone wrote:
       | I really really want to use OTel for a small project but have
       | always had a really tough time finding a path that is cheap or
       | free for a personal project.
       | 
        | In theory you can send telemetry data with OTel to CloudWatch,
        | but I've struggled to connect the dots with the frontend
        | application (e.g. React/Next.js).
        
         | arccy wrote:
         | grafana cloud, honeycomb, etc have free tiers, though you'll
         | have to watch how much data you send them. or you can self host
         | something like signoz or the elastic stack. frontend will
         | typically go to an instance of opentelemetry collector to
         | filter/convert to the protocol for the storage backend.
        
       ___________________________________________________________________
       (page generated 2023-11-16 23:00 UTC)