[HN Gopher] Tracing: Structured logging, but better
       ___________________________________________________________________
        
       Tracing: Structured logging, but better
        
       Author : pondidum
       Score  : 188 points
       Date   : 2023-09-18 21:52 UTC (2 days ago)
        
 (HTM) web link (andydote.co.uk)
 (TXT) w3m dump (andydote.co.uk)
        
       | crabbone wrote:
       | > The second problem with writing logs to stdout
       | 
        | Who on Earth does that? Logs are almost always written to
        | stderr... in part to prevent the other problems the author is
        | talking about (e.g. mixing with the output generated by the
        | application).
       | 
        | I don't understand why this has to be either/or... If you store
        | the trace output somewhere, you get a log (let's call it an "un-
        | annotated" log, since a trace won't have the human-readable
        | message part). A trace is great when examining the application
        | interactively, but if you use the exact same tool and save the
        | results for later, you get logs, with all the same problems the
        | author ascribes to logs.
        
         | FridgeSeal wrote:
         | I do, as does everyone at my work? Along with basically
         | everyone I've ever worked with, ever?
         | 
         | Like, I develop cli apps, so like, what else would go to stdout
         | that you suppose will interfere?
        
         | dalyons wrote:
          | Been doing it for a decade+, ever since the 12-factor app
          | concept became popular. It's way more common IMHO for web apps
          | than stderr logging.
        
         | OJFord wrote:
          | Loads of people, and it drives me around the twist too
          | (especially when there's inevitably custom parsing to separate
          | the log messages from the output), but it happens. It's
          | probably well correlated with people who use more GUI tools.
          | Not that there's anything wrong with that, just that I think
          | the more you use a CLI, the more you're probably aware of this
          | being an issue, or of other lesser best practices that might
          | make life easier, like newline and tab separation.
        
       | alkonaut wrote:
        | I like a log to read like a book if it's the result of a task
        | taking a finite time, such as an installation, a compilation,
        | the loading of a browser page, or similar. Users are going to
        | look into it for clues about what happened, and they a) aren't
        | always connected to those who wrote the tools and b) don't have
        | access to the source code or any special log analytics/querying
        | tools.
       | 
       | That's when you want a _log_ and that's what the big traditional
       | log frameworks were designed to handle.
       | 
       | A web backend/service is basically the opposite. End users don't
       | have access to the log, those who analyze it can cross reference
       | with system internals like source code or db state and the log is
       | basically infinite. In that situation a structured log and
       | querying obviously wins.
       | 
       | It's honestly not even clear that these systems are that closely
       | related.
        
         | WatchDog wrote:
          | It's a good distinction to make: logging for client-based
          | systems is essentially UI design.
          | 
          | For a web app serving lots of concurrent users, the logs are
          | essentially unreadable without tools, so you may as well
          | optimise them for tool-based consumption.
        
       | layer8 wrote:
       | > Log Levels are meaningless. Is a log line debug, info, warning,
       | error, fatal, or some other shade in between?
       | 
       | I partly agree and disagree. In terms of severity, there are only
       | three levels:
       | 
       | - info: not a problem
       | 
       | - warning: potential problem
       | 
       | - error: actual problem (operational failure)
       | 
       | Other levels like "debug" are not about severity, but about level
       | of detail.
       | 
       | In addition, something that is an error in a subcomponent may
       | only be a warning or even just an info on the level of the
       | superordinate component. Thus the severity has to be interpreted
       | relative to the source component.
       | 
       | The latter can be an issue if the severity is only interpreted
       | globally. Either it will be wrong for the global level, or
       | subcomponents have to know the global context they are running in
       | to use the severity appropriate for that context. The latter
       | causes undesirable dependencies on a global context. Meaning, the
       | developer of a lower-level subcomponent would have to know the
        | exact context in which that component is used, in order to choose
       | the appropriate log level. And what if the component is used in
       | different contexts entailing different severities?
       | 
       | So one might conclude that the severity indication is useless
       | after all, but IMO one should rather conclude that severity needs
       | to be interpreted relative to the component. This also means that
       | a lower-level error may have to be logged again in the higher-
       | level context if it's still an error there, so that it doesn't
       | get ignored if e.g. monitoring only looks at errors on the
       | higher-level context.
       | 
       | Differences between "fatal" and "error" are really nesting
       | differences between components/contexts. An error is always fatal
       | on the level where it originates.
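        | 
        | A minimal Go sketch of what "severity is relative to the
        | component" can look like in practice (names are made up for
        | illustration):
        | 
        |     package main
        |     
        |     import (
        |         "fmt"
        |         "log/slog"
        |     )
        |     
        |     // The subcomponent can only report "this failed"; it does
        |     // not know whether the caller considers that fatal.
        |     func fetchFromCache(key string) (string, error) {
        |         return "", fmt.Errorf("cache miss for %q", key)
        |     }
        |     
        |     func loadFromDB(key string) string { return "value-from-db" }
        |     
        |     func handleRequest(logger *slog.Logger, key string) {
        |         val, err := fetchFromCache(key)
        |         if err != nil {
        |             // In *this* context the failure is routine: log it
        |             // at info (or warning) and take the fallback path.
        |             // Only if the fallback also failed would the
        |             // handler log an error.
        |             logger.Info("cache miss, falling back to db",
        |                 "key", key, "cause", err)
        |             val = loadFromDB(key)
        |         }
        |         _ = val
        |     }
        |     
        |     func main() {
        |         handleRequest(slog.Default(), "user:42")
        |     }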
        
         | abraae wrote:
         | > In addition, something that is an error in a subcomponent may
         | only be a warning or even just an info on the level of the
         | superordinate component.
         | 
         | Or, keep it simple.
         | 
         | - error means someone is alerted urgently to look at the
         | problem
         | 
         | - warning means someone should be looking into it eventually,
         | with a view to reclassifying as info/debug or resolving it.
         | 
         | IMO many people don't care much about their logs, until the
         | shit hits the fan. Only then, in production, do they realise
         | just how much harder their overly verbose (or inadequate)
         | logging is making things.
         | 
         | The simple filter of "all errors send an alert" can go a long
         | way to encouraging a bit of ownership and correctness on
         | logging.
        
           | layer8 wrote:
           | > - error means someone is alerted urgently to look at the
           | problem
           | 
           | The issue is that the code that encounters the problem may
           | not have the knowledge/context to decide whether it warrants
           | alerting. The code higher up that does have the knowledge, on
           | the other hand, often doesn't have the lower-level
           | information that is useful to have in the log for analyzing
           | the failure. So how do you link the two? When you write
           | modular code that minimizes assumptions about its context,
           | that situation is a common occurrence.
        
             | abraae wrote:
             | If the code detecting the error is a library/subordinate
             | service then the same rule can be followed - should this be
             | immediately brought to a human's attention?
             | 
             | The answer for a library will often be no, since the
             | library doesn't "have the knowledge/context to decide
             | whether it warrants alerting".
             | 
             | So in that case the library can log as info, and leave it
             | to the caller to log as error if warranted (after learning
             | about the error from return code/http status etc.).
             | 
             | When investigating the error, the human has access to the
             | info details from the subordinate service.
        
         | SkyPuncher wrote:
         | I agree with your premise, but do consider debug to be a fourth
         | level.
         | 
         | Info is things like "processing X"
         | 
         | Debug is things like "variable is Y" or "made it to this point"
        
         | Hermitian909 wrote:
         | The OP is wrong, log levels are very valuable if you leverage
         | them.
         | 
         | Here's a classic problem as an illustration: The storage cost
         | of your logs is really prohibitive. You would like to cut out
         | some of your logs from storage but cannot lower retention below
          | some threshold (say 2 weeks, maybe). For this example, assume
          | that tracing is also enabled and every log has a traceId.
          | 
          | A good answer is to run a compaction job that inspects each
          | trace: if it contains an error, preserve it; remove X% of all
          | other traces.
         | 
         | Log levels make the ergonomics for this excellent and it can
         | save millions of dollars a year at sufficient scale.
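          | 
          | A minimal sketch of such a compaction pass (Go; it assumes the
          | log lines are already grouped by traceId, and the names are
          | hypothetical):
          | 
          |     package compaction
          |     
          |     import "math/rand"
          |     
          |     // LogLine is a simplified stand-in for a stored record.
          |     type LogLine struct {
          |         TraceID string
          |         Level   string // "debug", "info", "warn", "error"
          |         Message string
          |     }
          |     
          |     // Compact keeps every trace containing at least one error
          |     // and a random keepFraction of the remaining traces.
          |     func Compact(byTrace map[string][]LogLine,
          |         keepFraction float64) map[string][]LogLine {
          |         kept := make(map[string][]LogLine, len(byTrace))
          |         for id, lines := range byTrace {
          |             hasError := false
          |             for _, l := range lines {
          |                 if l.Level == "error" {
          |                     hasError = true
          |                     break
          |                 }
          |             }
          |             if hasError || rand.Float64() < keepFraction {
          |                 kept[id] = lines
          |             }
          |         }
          |         return kept
          |     }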
        
         | BillinghamJ wrote:
         | I tend to think of "warning" as - "something unexpected
         | happened, but it was handled safely"
         | 
         | And then "error" as - "things are not okay, a developer is
         | going to need to intervene"
         | 
         | And errors then split roughly between "must be fixed sometime",
         | and "must be fixed now/ASAP"
        
           | layer8 wrote:
           | > I tend to think of "warning" as - "something unexpected
           | happened, but it was handled safely"
           | 
           | It was handled safely at the level where it occurred, but
            | because it was unusual/unexpected, the underlying cause may
            | lead to issues later on or higher up.
           | 
           | If one were sure it would 100% not indicate any issue, one
           | wouldn't need to warn about it.
        
       | waffletower wrote:
       | There are logging libraries that include syntactically scoped
       | timers, such as mulog (https://github.com/BrunoBonacci/mulog).
       | While a great library, we preferred timbre
       | (https://github.com/taoensso/timbre) and rolled our own logging
       | timer macro that interoperates with it. More convenient to have
        | such niceties in a Lisp, of course. Since we also have
        | OpenTelemetry available, it would be easy to wrap traces around
        | code-form boundaries as well. Thanks OP for the idea!
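        | 
        | For non-Lisp readers, a rough Go analogue of a scoped logging
        | timer might look like the sketch below (nowhere near as pleasant
        | as a macro; names are made up):
        | 
        |     package timing
        |     
        |     import (
        |         "log/slog"
        |         "time"
        |     )
        |     
        |     // Timed returns a function that logs how long the enclosing
        |     // scope took. Typical use:
        |     //
        |     //     defer Timed("load-config")()
        |     func Timed(name string) func() {
        |         start := time.Now()
        |         return func() {
        |             slog.Info("timer", "name", name,
        |                 "elapsed", time.Since(start))
        |         }
        |     }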
        
       | mrkeen wrote:
       | > If you're writing log statements, you're doing it wrong.
       | 
       | I too use this bait statement.
       | 
       | Then I follow it up with (the short version):
       | 
       | 1) Rewrite your log statements so that they're machine readable
       | 
       | 2) Prove they're machine-readable by having the down-stream
       | services read them instead of the REST call you would have
       | otherwise sent.
       | 
       | 3) Switch out log4j for Kafka, which will handle the persistence
       | & multiplexing for you.
       | 
       | Voila, you got yourself a reactive, event-driven system with
       | accurate "logs".
       | 
       | If you're like me and you read the article thinking "I like the
       | result but I hate polluting my business code with all that
       | tracing code", well now you can create an _independent_ reader of
       | your kafka events which just focuses on turning events into
       | traces.
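        | 
        | Roughly the shape of it, as a Go sketch (the Sink interface is a
        | hypothetical stand-in for whatever carries the events, e.g. a
        | Kafka producer):
        | 
        |     package events
        |     
        |     import (
        |         "encoding/json"
        |         "time"
        |     )
        |     
        |     // Event is the machine-readable "log line": downstream
        |     // consumers read a stream of these instead of being called
        |     // over a bespoke REST endpoint.
        |     type Event struct {
        |         TraceID   string            `json:"trace_id"`
        |         Name      string            `json:"name"`
        |         Timestamp time.Time         `json:"ts"`
        |         Attrs     map[string]string `json:"attrs,omitempty"`
        |     }
        |     
        |     // Sink is whatever transports the events: stdout, a file,
        |     // or a Kafka topic.
        |     type Sink interface {
        |         Send(topic string, payload []byte) error
        |     }
        |     
        |     // Publish serialises the event and hands it to the sink. An
        |     // independent consumer of the same topic can later turn
        |     // these events into traces.
        |     func Publish(s Sink, topic string, e Event) error {
        |         b, err := json.Marshal(e)
        |         if err != nil {
        |             return err
        |         }
        |         return s.Send(topic, b)
        |     }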
        
         | rewmie wrote:
         | > 3) Switch out log4j for Kafka, which will handle the
         | persistence & multiplexing for you.
         | 
         | I don't think this is a reasonable statement. There are already
         | a few logging agents that support structured logging without
         | dragging in heavyweight dependencies such as Kafka. Bringing up
         | Kafka sounds like a case of a solution looking for a problem.
        
           | ahoka wrote:
           | I think OP meant event sourcing.
        
             | rewmie wrote:
             | > I think OP meant event sourcing.
             | 
              | That is really beside the point. Logging and tracing have
              | always been fundamentally event sourcing, but that never
              | forced anyone to onboard onto freaking Kafka of all event
              | streaming/messaging platforms.
              | 
              | This kind of suggestion sounds an awful lot like
              | resume-driven development instead of actually putting
              | together a logging service.
        
           | mrkeen wrote:
           | > There are already a few logging agents that support
           | structured logging without dragging in heavyweight
           | dependencies such as Kafka.
           | 
           | What are they? Because admittedly I've lost a little love for
           | the operational side of Kafka, and I wish the client-side
            | were a little "dumber", so I could match it better to my use
            | cases.
        
         | bowsamic wrote:
         | How to get me to leave your company 101
        
           | mrkeen wrote:
           | I did write a pretty glib description of what to do ;)
           | 
           | That said, I've had conflicts with a previous team-mate about
           | this. He couldn't wrap his head around Kafka being a source
           | of truth. But when I asked him whether he'd trust our Kafka
           | or our Postgres if they disagreed, he conceded that he'd
           | believe Kafka's side of things.
        
       | amelius wrote:
       | This is stuff that a debugger is supposed to do for you, for
       | free.
       | 
       | This should not require code at the application level, but it
       | should be implemented at the tooling level.
        
       | goalieca wrote:
       | Logging is essential for security. I think tracing is wonderful
       | and so are metrics. I see these as more of a triad for
       | observability.
        
         | waffletower wrote:
         | Indeed, the three legs (metrics, logs, traces) of
         | OpenTelemetry's telescope. https://opentelemetry.io
        
           | candiddevmike wrote:
           | Something missing from OTel IMO is a standard way of linking
           | all three together. It seems like an exercise left to the
           | reader, but I feel like there should be standard metadata for
           | showing a relationship between traces, metrics, and logs.
           | Right now each of these functions is on an island (same with
           | the tooling and storage of the data, but that's another
           | rant).
        
             | discodachshund wrote:
             | Isn't that the trace ID? For metrics, it's in the form of
             | exemplars, and for logs it is the log context
        
               | candiddevmike wrote:
                | That might be dependent on the library then; there isn't
                | an official OTel Go logging library yet. Seems you have
                | to add the trace IDs and exemplars manually too.
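                | 
                | e.g. a small helper along these lines (a sketch combining
                | slog with the OTel trace API, not an official
                | integration):
                | 
                |     package obslog
                |     
                |     import (
                |         "context"
                |         "log/slog"
                |     
                |         "go.opentelemetry.io/otel/trace"
                |     )
                |     
                |     // logWithTrace attaches the current trace/span IDs
                |     // to a log record so the log line can later be
                |     // joined back to its trace.
                |     func logWithTrace(ctx context.Context,
                |         logger *slog.Logger, msg string, args ...any) {
                |         sc := trace.SpanContextFromContext(ctx)
                |         if sc.IsValid() {
                |             args = append(args,
                |                 "trace_id", sc.TraceID().String(),
                |                 "span_id", sc.SpanID().String())
                |         }
                |         logger.InfoContext(ctx, msg, args...)
                |     }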
        
               | phillipcarter wrote:
                | Go is behind several of the other languages in OTel right
                | now, just a consequence of a very difficult implementation
                | and Go's load-bearing nature as the language (and library)
                | of choice for CNCF infrastructure. If you use Java or
                | .NET, for example, it's quite fleshed out.
        
               | jen20 wrote:
               | One would hope that there will not _be_ an Open Telemetry
               | logging library for Go. Unlike last time there was a
               | thread about this, there is now a standard - `slog` in
               | the stdlib.
        
       | spullara wrote:
       | It drives me insane that the standardized tracing libraries have
       | you only report closed spans. What if it crashes? What if it
       | stalls? Why should I keep open spans in memory when I can just
       | write an end span event?
        
       | jauntywundrkind wrote:
        | What's most incredible to me is how close tracing feels in
        | spirit to event sourcing.
        | 
        | Here's this log of every frame of compute going on, plus data or
        | metadata about each frame... but AFAIK we have yet to start
        | using the same stream of computation for business processes as
        | we do for its excellent observability.
        
         | alexisread wrote:
         | Any of the Clickhouse-based Otel stores can do event sourcing -
         | just set up materialised views on the trace tables. I know the
         | following use CH: https://uptrace.dev/ https://signoz.io/
         | https://github.com/hyperdxio/hyperdx
        
         | juliogreff wrote:
         | As a matter of fact, at a previous job we used traces as a data
         | source for event sourcing. One use case: we tracked usage of
          | certain features in API calls in traces, and a batch job
          | running at whatever frequency aggregated which users were
          | using which features. While it was far from real time because
          | of the sheer amount of data, it was so simple to implement
          | that we had dozens of use cases implemented like that.
        
       | skybrian wrote:
       | How would a hobbyist programmer get started with tracing for a
        | simple web app? Where do the traces end up, and how do I query
        | them? Can tracing be used in a development environment?
       | 
       | Context: the last thing I wrote used Deno and Deno Deploy.
        
         | curioussavage wrote:
         | Just install opentelemetry libs. I found this example with a
         | quick search: https://dev.to/grunet/leveraging-opentelemetry-
         | in-deno-45bj
         | 
          | OpenTelemetry has a collector service you can run that will
          | collect the telemetry data, and you can export it to something
          | like Prometheus, which can store it and let you query it.
          | Example
         | here https://github.com/open-telemetry/opentelemetry-collector-
         | co...
         | 
         | Typically in dev environments trace spans are just emitted to
         | stdout just like logs. I sometimes turn that off too though
         | because it gets noisy.
        
       | andersrs wrote:
       | I have a side project that I run in Kubernetes with a postgres
       | database and a few Go/Nodejs apps. Recommend me a lightweight
       | otel backend that isn't going to blow out my cloud costs.
        
       | perpil wrote:
       | I was recently musing about the 2 different types of logs:
       | 
       | 1. application logs, emitted multiple times per request and serve
       | as breadcrumbs
       | 
       | 2. request logs emitted once per request and include latencies,
       | counters and metadata about the request and response
       | 
        | The application logs were useless to me except during
        | development. However, I could run aggregations on the request
        | logs, which made them far more useful for answering questions.
        | What the author explains very well is that the problem with
        | application logs is that they aren't very human-readable, which
        | is where visualizing a request with tracing shines. If you
        | don't have tracing, creating request logs will get you most of
        | the way there; it's certainly better than application logs.
       | https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging...
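        | 
        | A minimal Go sketch of such a once-per-request "request log",
        | emitted from middleware (field names are made up; capturing the
        | status code would need a ResponseWriter wrapper, omitted here):
        | 
        |     package main
        |     
        |     import (
        |         "log/slog"
        |         "net/http"
        |         "time"
        |     )
        |     
        |     // requestLog emits one wide, structured line per request --
        |     // the kind of record you can aggregate on later (latency
        |     // percentiles, error rates, and so on).
        |     func requestLog(next http.Handler) http.Handler {
        |         return http.HandlerFunc(func(w http.ResponseWriter,
        |             r *http.Request) {
        |             start := time.Now()
        |             next.ServeHTTP(w, r)
        |             slog.Info("request",
        |                 "method", r.Method,
        |                 "path", r.URL.Path,
        |                 "duration_ms", time.Since(start).Milliseconds(),
        |                 "user_agent", r.UserAgent(),
        |             )
        |         })
        |     }
        |     
        |     func main() {
        |         ok := http.HandlerFunc(func(w http.ResponseWriter,
        |             r *http.Request) {
        |             w.Write([]byte("ok"))
        |         })
        |         http.ListenAndServe(":8080", requestLog(ok))
        |     }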
        
       | benreesman wrote:
       | As a historical critic of Rust-mania (and if I'm honest, kind of
       | an asshole about it too many times, fail), I've recently bumped
       | into stuff like tokio-tracing, eyre, tokio-console, and some
       | others.
       | 
       | And while my historical gripes are largely still the status quo:
       | stack traces in multi-threaded, evented/async code that _actually
       | show real line numbers_? Span-based tracing that makes concurrent
       | introspection possible _by default_?
       | 
       | I'm in. I apologize for everything bad I ever said and don't care
       | whatever other annoying thing.
       | 
       | That's the whole show. Unless it deletes my hard drive I don't
       | really care about anything else by comparison.
        
       | zoogeny wrote:
       | One thing about logging and tracing is the inevitable cost (in
       | real money).
       | 
       | I love observability probably more than most. And my initial
       | reaction to this article is the obvious: why not both?
       | 
       | In fact, I tend to think more in terms of "events" when writing
       | both logs and tracing code. How that event is notified, stored,
       | transmitted, etc. is in some ways divorced from the activity. I
       | don't care if it is going to stdout, or over udp to an
       | aggregator, or turning into trace statements, or ending up in
       | Kafka, etc.
       | 
       | But inevitably I bump up against cost. For even medium sized
       | systems, the amount of data I would like to track gets quite
       | expensive. For example, many tracing services charge for the tags
       | you add to traces. So doing `trace.String("key", value)` becomes
       | something I think about from a cost perspective. I worked at a
       | place that had a $250k/year New Relic bill and we were avoiding
       | any kind of custom attributes. Just getting APM metrics for
       | servers and databases was enough to get to that cost.
       | 
       | Logs are cheap, easy, reliable and don't lock me in to an
       | expensive service to start. I mean, maybe you end up integrating
       | splunk or perhaps self-hosting kibana, but you can get 90% of the
       | benefits just by dumping the logs into Cloudwatch or even S3 for
       | a much cheaper price.
        
         | alexisread wrote:
         | Any of the Clickhouse-based Otel stores can dump the traces to
         | s3 for long-term storage, and can be self-hosted. I know the
         | following use CH: https://uptrace.dev/ https://signoz.io/
         | https://github.com/hyperdxio/hyperdx
        
         | phillipcarter wrote:
          | FWIW, part of the reason you're seeing that is that, at least
          | traditionally, APM companies rebranding as observability
          | companies stuffed trace data into metrics data stores, which
          | becomes prohibitively expensive to query with custom
          | tags/attributes/fields. Newer tools/companies have a different
          | approach that makes cost far more predictable and generally
          | lower.
         | 
         | Luckily, some of the larger incumbents are also moving away
         | from this model, especially as OpenTelemetry is making tracing
         | more widespread as a baseline of sorts for data. And you can
         | definitely bet they're hearing about it from their customers
         | right now, and they want to keep their customers.
         | 
         | Cost is still a concern but it's getting addressed as well.
         | Right now every vendor has different approaches (e.g., the one
         | I work for has a robust sampling proxy you can use), but that
         | too is going the way of standardization. OTel is defining how
         | to propagate sampling metadata in signals so that downstream
         | tools can use the metadata about population representativeness
         | to show accurate counts for things and so on.
        
         | thinkharderdev wrote:
         | > I mean, maybe you end up integrating splunk or perhaps self-
         | hosting kibana
         | 
         | I think this is the issue. Both Splunk and OpenSearch (even
          | self-hosted OpenSearch) get really pricey as well, especially
          | with large volumes of log data. CloudWatch can also get
         | ludicrously expensive. They charge something like $0.50 per GB
         | (!) and another $0.03 per GB to store. I've seen situations at
         | a previous employer where someone accidentally deployed a
         | lambda function with debug logging and ran up a few thousand $$
         | in Cloudwatch bills overnight.
         | 
         | You should look at Coralogix (disclaimer: I work there). We've
         | built a platform that allows you to store your observability
         | data in S3 and query it through our infrastructure. It can be
         | dramatically more cost-effective than other providers in this
         | space.
        
         | jameshart wrote:
         | Observability costs feel high when everything's working fine.
         | When something snaps and everything is down and you need to
         | know why in a hurry... those observability premiums you've been
         | paying all along can pay off fast.
        
         | thangalin wrote:
         | > In fact, I tend to think more in terms of "events" when
         | writing both logs and tracing code.
         | 
         | They are events[1]. For my text editor, KeenWrite, events can
         | be logged either to the console when run from the command-line
         | or displayed in a dialog when running in GUI mode. By changing
         | "logger.log()" statements to "event.publish()" statements, a
         | number of practical benefits are realized, including:
         | 
         | * Decoupled logging implementation from the system (swap one
         | line of code to change loggers).
         | 
         | * Publish events on a message bus (e.g., D-Bus) to allow
         | extending system functionality without modifying the existing
         | code base.
         | 
         | * Standard logging format, which can be machine parsed, to help
         | trace in-field production problems.
         | 
         | * Ability to assign unique identifiers to each event, allowing
         | for publication of problem/solution documentation based on
         | those IDs (possibly even seeding LLMs these days).
         | 
         | [1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-
         | smell/
        
           | jameshart wrote:
           | But events that another system relies upon are now an _API_.
           | Be careful not to lock together things that are only
           | superficially similar, as it affects your ability to change
           | them independently.
        
             | thangalin wrote:
             | Architecturally, the decoupling works as follows:
             | Event -> Bus -> UI Subscriber -> Dialog (table)
             | Event -> Bus -> Log Subscriber -> Console (text)
              | Event -> Bus -> D-Bus Subscriber -> Relay -> D-Bus -> Publish (TCP/IP)
             | 
             | With D-Bus, published messages are versioned, allowing for
             | API changes without breaking third-party consumers. The
             | D-Bus Subscriber provides a layer of isolation between the
             | application and the published messages so that the two can
             | vary independently.
        
         | hosh wrote:
          | I have made use of tracing, metrics, and logging all together
          | and find each of them has its own place, as well as synergies
          | from being able to work with all three together.
         | 
         | Cost is a real issue, and not just in terms of how much the
         | vendor costs you. When tracing becomes a noticeable fraction of
         | CPU or memory usage relative to the application, it's time to
         | rethink doing 100% sampling. In practice, if you are sampling
         | thousands of requests per second, you're very unlikely to
         | actually look through each one of those thousands (thousands of
         | req/s may not be a lot for some sites, but it is already
         | exceeding human-scale without tooling). In order to keep
         | accurate, useful statistics with sampling, you end up using
         | metrics to store trace metrics prior to sampling.
        
       | hosh wrote:
       | That's weird. I use both logging and tracing where I can. And
       | metrics.
       | 
       | While there are better tools for alerting, metrics, or
       | aggregations, it helps a lot in debugging and troubleshooting.
        
         | aero142 wrote:
          | I think the author's point is that tracing is a better
          | implementation of both logs and metrics, and I think it's a
          | valid point.
          | 
          | * Metrics are pre-aggregated into timeseries data, which makes
          | cardinality expensive. You could also aggregate a value from a
          | trace statement.
          | 
          | * Logs are hand-crafted and unique, and are usually improved
          | by adding structured attributes. Structured attributes are
          | better as traces because you can have execution context and
          | well-defined attributes that provide better detail.
          | 
          | Traces can be aggregated or sampled to provide all of the
          | information available from logs, but in a more flexible way.
          | 
          | * Certain traces can be retained at 100%. This is equivalent
          | to logs.
          | 
          | * Certain trace attributes can be converted to timeseries
          | data. This is equivalent to metrics.
          | 
          | * Certain traces can be sampled and/or queried with streaming
          | infrastructure. This is a way to observe data with high
          | cardinality without hitting the high cost.
        
           | hosh wrote:
           | There are things you can do with metrics and logging that you
           | cannot do with traces. These usually fall outside of
           | debugging application performance and bottlenecks. So I think
           | what the author says is true if you are only thinking about
           | application, and not for gaining a holistic understanding of
           | the entire system, including infrastructure.
           | 
           | Probably the biggest tradeoff with traces is that, in
           | practice, you are not retaining 100% of all traces. In order
           | to keep accurate statistics, it generally gets ingested as
            | metrics before sampling. The other is that traces are not
            | stored in a way that lets you look at what is happening at a
            | point in time -- which is what logging does well. If I want
            | to ensure I have execution context for
           | logging, I make the effort to add trace and span ids so that
           | traces and logging can be correlated.
           | 
           | To be fair, I live in the devops world more often than not,
           | and my colleagues on the dev teams rarely have to venture
           | outside of traces.
           | 
           | I don't mind the points this author is making. My main
           | criticism is that it is scoped to the world of applications
           | -- which is fine -- but then taken as universal for all of
           | software engineering.
        
       | fnordpiglet wrote:
        | Tracing is poor at very long-lived traces and at stream
        | processing, and most tracing implementations are too heavy to
        | run in computationally bound tasks beyond a very coarse level.
        | Logging is nice in that it has no context and no overhead, is
        | generally very cheap to compose and emit, and, when done in a
        | structured way with a transaction id included, gives you most of
        | what tracing does without all the other baggage.
       | 
       | That said for the spaces where tracing works well, it works
       | unreasonably well.
        
         | cschneid wrote:
          | When I worked at ScoutAPM, that list was basically the exact
          | set of areas we had issues supporting. We didn't do full-on
          | tracing in the OpenTracing kind of way, but the agent was
          | pretty similar, with spans (mostly automatically inserted) and
          | annotations on those spans with timing, parentage, and extra
          | info (like the SQL query a span represented in ActiveRecord).
         | 
          | The really hard things, which we had reasonable answers for
          | but never quite perfect:
          | 
          | * Rails websockets (ActionCable)
          | 
          | * very long-running background jobs (we stopped collecting at
          | some limit, to prevent unbounded memory)
          | 
          | * trying to profile code; we used a modified version of
          | Stackprof to do sampling instead of exact profiling. That
          | worked surprisingly well at finding hotspots, with low
          | overhead.
         | 
         | All sorts of other tricks came along too. I should go look at
         | that codebase again to remind me. That'd be good for my
         | resume.... :)
         | 
         | https://github.com/scoutapp/scout_apm_ruby
        
         | riv991 wrote:
          | I think OpenTelemetry has solved the stream processing problem
          | with span links [1]: treating each unit of work as an
         | individual trace but being able to combine them and see a
         | causal relationship. Slack published a blog about it pretty
         | recently [2]
         | 
         | [1]
         | https://opentelemetry.io/docs/concepts/signals/traces/#span-...
         | 
         | [2] https://slack.engineering/tracing-notifications/
        
         | phillipcarter wrote:
         | Hmmm, for long-lived processes and stream processing we use
          | tracing just fine. What we do is make a cutoff of 60 seconds,
          | where each chunk is its own trace. But our backend queries
         | trace data directly, so we can still analyze the aggregate,
         | long-term behavior and then dig into a particular 60 second
         | chunk if it's problematic.
        
       | ducharmdev wrote:
       | Minor nitpick, but I wish this post started with defining what we
       | mean by logging vs tracing, since some people use these
       | interchangeably. The reader instead has to infer this from the
       | criticisms of logging.
        
         | ryanklee wrote:
         | I've never encountered this confusion anywhere, so I wouldn't
         | ever think to dispel it. Which isn't to say that I disagree
          | with the more general point that defining your terms is a good
          | thing.
         | 
         | In any case, the post itself (which is not long) illustrates
         | and marks out many of the differences.
        
         | jlokier wrote:
         | I agree. I'm working with code that uses 'verbose "message"'
         | for level 1 verbosity logs and 'trace "message"' for level 2
         | verbosity. Makes sense in its world, but it's not the same
         | meaning as how cloud-devops-observability culture uses those
         | words.
        
       | vkoskiv wrote:
       | Nit to the author: 'rapala' seems like a mistranslation. It is a
       | brand name of a company that makes fishing lures, as far as I can
        | tell. It is not the Finnish word for "to bait", and is therefore
        | only used to refer to that particular brand. I'm not sure what
        | the purpose of the text in parentheses is here, but 'houkutella'
       | would be the most apt translation in this case.
        
       | lambda_garden wrote:
       | Couldn't this be injected into the runtime so that no code
       | changes are required?
       | 
       | Perhaps really performance critical stuff could have a "notrace"
       | annotation.
        
         | thinkharderdev wrote:
         | Sure, and a lot of tools will do this in one way or another.
         | Either instrument code directly or provide annotations/macros
         | to trace a specific method (something like tokio-tracing in the
         | Rust ecosystem).
         | 
         | However, tracing literally every method call would probably be
         | prohibitively expensive so typically you have either:
         | 
          | 1. Instrumentation which "understands" common
          | frameworks/libraries and knows what to instrument (e.g. request
         | handlers in web frameworks)
         | 
         | 2. Full opt-in. They make it easy to add a trace for a method
         | invocation with a simple annotation but nothing gets
         | instrumented by default
        
         | austinsharp wrote:
          | Yes, OTel has autoinstrumentation libraries for some languages
          | that can pick up a fair amount by default. Though it's unlikely
          | that that would ever be sufficient, it's a nice start.
         | 
         | For Java:
         | https://opentelemetry.io/docs/instrumentation/java/automatic...
        
         | imiric wrote:
         | There are several projects that leverage eBPF for automatic
         | instrumentation[1].
         | 
         | How accurate and useful these are vs. doing this manually will
         | depend on the use case, but I reckon the automatic approach
         | gets you most of the way there, and you can add the missing
         | traces yourself, so if nothing else it saves a lot of work.
         | 
         | [1]: https://ebpf.io/applications/
        
       | hardwaresofton wrote:
       | I think there's an alternate universe out there where:
       | 
       | - we collectively realized that logs, events, traces, metrics,
       | and errors are actually all just logs
       | 
       | - we agreed on a single format that encapsulated all that
       | information in a structured manner
       | 
       | - we built firehose/stream processing tooling to provide modern
       | o11y creature comforts
       | 
       | I can't tell if that universe is better than this one, or worse.
        
         | phillipcarter wrote:
         | That's more or less the model Honeycomb uses. Every signal type
         | is just a structured event. Reality is a bit messier, though.
         | In particular, metrics are the oddball in this world and
         | required a lot of work to make economical.
        
         | dalyons wrote:
         | Is that really an alternate universe? That's the universe that
         | splunk and friends are selling, everything's a log. It's really
         | expensive.
        
         | andrewstuart2 wrote:
         | Traces are just distributed "logs" (in the data structure
         | sense; data ordered only by its appearance in _something_ )
         | where you also pass around the tiniest bit of correlation
         | context between apps. Traces are structured, timestamped, and
         | can be indexed into much more debug-friendly structures like a
         | call tree. But you could just as easily ignore all the data and
         | print them out in streaming sorted order without any
         | correlation.
         | 
         | Honestly it sounds like you're pitching opentelemetry/otlp but
         | where you only trace and leave all the other bits for later
         | inside your opentelemetry collector, which can turn traces into
         | metrics or traces into logs.
        
       | thegrizzlyking wrote:
       | Logs are mostly "Hi I reached this line of code, here is some
       | metadata"
        
       | jasonjmcghee wrote:
       | I really enjoyed the content- it's a great article.
       | 
       | Note to author: all but the last code block have a very odd
        | mixture of rather large font sizes (at least on mobile) which
        | vary line to line and make them pretty difficult to read.
       | 
       | Also the link to "Observability Driven Development." was a blank
       | slide deck AFAICT
        
       | hello1234567 wrote:
        | The person writing this came to know something that he didn't
        | know earlier and decided to convert his light bulb moment into
        | a blog post. Not bad, but failed to understand that logs are the
        | generalisation of the very thing they are talking about.
        
       | jeffbee wrote:
       | This is a great article because everyone should understand the
       | similarity between logging and tracing. One thing worth pondering
       | though is the differences in cost. If I am not planning to
       | centrally collect and index informational logs, free-form text
       | logging is extremely cheap. Even a complex log line with
       | formatted strings and numbers can be emitted in < 1us on modern
       | machines. If you are handling something like 100s or 1000s of
       | requests per second per core, which is pretty respectable,
       | putting a handful of informational log statements in the critical
       | path won't hurt anyone.
       | 
       | Off-the-shelf tracing libraries on the other hand are pretty
       | expensive. You have one additional mandatory read of the system
       | clock, to establish the span duration, plus you are still paying
       | for a clock read on every span event, if you use span events.
       | Every span has a PRNG call, too. Distributed tracing is worthless
       | if you don't send the spans somewhere, so you have to budget for
       | encoding your span into json, msgpack, protobuf, or whatever.
       | It's a completely different ball game in terms of efficiency.
        
         | nithril wrote:
          | It is actually simpler to conceptualize the difference: one is
          | stateless, the other one is stateful.
          | 
          | In fact, structured logging has existed for years, e.g. in
          | Java: https://github.com/logfellow/logstash-logback-encoder
        
         | xyzzy_plugh wrote:
         | I will agree that conceptually logging can be much cheaper than
         | tracing ever can, but in practice any semi-serious attempt at
         | structured logging ends up looking very, very close to tracing.
         | In fact I'd go so far as to say that the two are effectively
         | interchangeable at a point. What you do with that information,
         | whether you index it or build a graph, is up to you -- and that
         | is where the cost creeps in.
         | 
         | Adding timestamps and UUIDs and an encoding is par for the
         | course in logging these days, I don't think that is the right
         | angle to criticize efficiency.
         | 
         | Tracing can be very cheap if you "simply" (and I'm glossing
         | over a lot here) search for all messages in a liberal window
         | matching each "span start" message and index the result sets.
         | Offering a way to view results as a tree is just a bonus.
         | 
         | Of course, in practice this ends up meaning something
         | completely different, and far costlier. Why that is I cannot
         | fathom.
        
         | hyperpape wrote:
         | I don't generally disagree, but using json for structured logs
         | is a growing thing as well.
        
       | h1fra wrote:
        | Tracing is much more actionable, but barely usable without a
        | platform, which makes local programming dependent on a third
        | party. It also requires passing context around, or having a way
        | to get the context back in every function that needs it, which
        | can be daunting.
        | 
        | On my side I have opted for mixed structured/text: a generic
        | message that can be easily understood while glancing over the
        | logs, and a data object attached for more details.
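        | 
        | For example, with Go's slog that mixed style might look like
        | this (values are made up):
        | 
        |     package main
        |     
        |     import "log/slog"
        |     
        |     func main() {
        |         orderID := "ord_123" // hypothetical values
        |         slog.Info("payment failed", // readable at a glance
        |             slog.Group("data", // structured detail for queries
        |                 "order_id", orderID,
        |                 "provider", "stripe",
        |                 "attempt", 3,
        |             ),
        |         )
        |     }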
        
         | candiddevmike wrote:
         | You can add Jaeger to your local dev containers and run it in
         | memory, it's really lightweight and easy to use.
        
         | hinkley wrote:
         | Someone got me excited about tracing and I started tweaking our
         | stats API to optionally add tracing. Retrofitted it into a
         | mature app, then immediately discovered that all of the data
         | was being dropped because AWS only likes very tiny traces.
         | Depth or fanout or both break it rather quickly.
         | 
         | And OpenTelemetry has a very questionable implementation. For a
          | nested trace, spans are only emitted when they close, meaning
          | that a child references a parent ID before that parent is
          | seen in the stream. That can't be good for processing. It
          | would be better to also have a leading-edge event (which
          | also helps when an error is thrown and the parent is never
          | reported).
         | 
         | Kind of a bummer. Needs work.
        
           | pcthrowaway wrote:
           | > OpenTelemetry has a very questionable implementation
           | 
           | The nice thing about OpenTelemetry is that it's a standard.
           | The questionable implementation you're referencing isn't a
           | source of truth. There isn't some canonical "questionable"
           | implementation.
           | 
           | There are many, slightly different, questionable
           | implementations.
        
             | hinkley wrote:
             | If the wire protocol has a bug, that's not something an
             | implementation can fix.
             | 
             | I'm saying the wire protocol is wrong.
        
       ___________________________________________________________________
       (page generated 2023-09-20 23:00 UTC)