[HN Gopher] Tracing: Structured logging, but better ___________________________________________________________________ Tracing: Structured logging, but better Author : pondidum Score : 188 points Date : 2023-09-18 21:52 UTC (2 days ago) (HTM) web link (andydote.co.uk) (TXT) w3m dump (andydote.co.uk) | crabbone wrote: | > The second problem with writing logs to stdout | | Who on Earth does that? Logs are almost always written to | stderr... In part to prevent other problems the author is talking | about (eg. mixing with the output generated by the application). | | I don't understand why this has to be either or... If you store | the trace output somewhere you get a log... (let's call it "un- | annotated" log, since trace won't have the human-readable message | part). Trace is great when examining the application | interactively, but if you use the same exact tool and save the | results for later you get logs, with all the same problems the | author ascribes to logs. | FridgeSeal wrote: | I do, as does everyone at my work? Along with basically | everyone I've ever worked with, ever? | | Like, I develop cli apps, so like, what else would go to stdout | that you suppose will interfere? | dalyons wrote: | Been doing it for a decade+, ever since the 12 factor app | concept became popular. It's way more common imho for web apps | than stderr logging. | OJFord wrote: | Loads of people, it drives me around the twist too (especially | when there's inevitably custom parsing to separate the log | messages from the output) but it happens, probably well | correlated with people that use more GUI tools, not that | there's anything wrong with that, just I think the more you use | a CLI the more you're probably aware of this being an issue, or | other lesser best practices that might make life easier like | newline and tab separation. 
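For readers weighing the stdout versus stderr argument above, here is a minimal Python sketch of the separation crabbone and OJFord describe: program output goes to stdout, log lines go to stderr, so nothing downstream has to parse the two apart. The names `build_logger` and `run`, the logger name `demo.stderr`, and the `result=42` payload are made up for illustration and do not come from the article or the thread.

```python
import logging
import sys


def build_logger(stream=None):
    """Return a logger that writes to `stream` (stderr when none is given)."""
    logger = logging.getLogger("demo.stderr")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # keep repeated calls from stacking handlers
    logger.propagate = False     # do not also emit through the root logger
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run(out=sys.stdout, err=None):
    """Emit one log line on `err` and one line of real output on `out`."""
    log = build_logger(err)
    log.info("starting conversion")   # diagnostics: stderr
    print("result=42", file=out)      # program output: stdout, safe to pipe
```

Run as a CLI tool, `run()` lets a pipeline such as `tool | jq` see only the result line, while the log line still reaches the terminal or whatever collects stderr. Whether that matters for a web app that produces no stdout output at all is exactly the disagreement in the replies.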
| alkonaut wrote: | I like a log to read like a book if it's the result of a task | taking a finite time, such as for example an installation, a | compilation, a loading of a browser page or similar. Users are | going to look into it for clues about what happened and they a) | aren't always related to those who wrote the tools b) don't have | access to the source code or any special log analytics/querying | tools. | | That's when you want a _log_ and that's what the big traditional | log frameworks were designed to handle. | | A web backend/service is basically the opposite. End users don't | have access to the log, those who analyze it can cross reference | with system internals like source code or db state and the log is | basically infinite. In that situation a structured log and | querying obviously wins. | | It's honestly not even clear that these systems are that closely | related. | WatchDog wrote: | It's a good distinction to make, logging for client based | systems, is essentially UI design. | | For a web app, serving lots of concurrent users, they are | essentially unreadable without tools, so you may as well | optimise the logs for tool based consumption. | layer8 wrote: | > Log Levels are meaningless. Is a log line debug, info, warning, | error, fatal, or some other shade in between? | | I partly agree and disagree. In terms of severity, there are only | three levels: | | - info: not a problem | | - warning: potential problem | | - error: actual problem (operational failure) | | Other levels like "debug" are not about severity, but about level | of detail. | | In addition, something that is an error in a subcomponent may | only be a warning or even just an info on the level of the | superordinate component. Thus the severity has to be interpreted | relative to the source component. | | The latter can be an issue if the severity is only interpreted | globally. 
Either it will be wrong for the global level, or | subcomponents have to know the global context they are running in | to use the severity appropriate for that context. The latter | causes undesirable dependencies on a global context. Meaning, the | developer of a lower-level subcomponent would have to know the | exact context in which that component is used, in order to choose | the appropriate log level. And what if the component is used in | different contexts entailing different severities? | | So one might conclude that the severity indication is useless | after all, but IMO one should rather conclude that severity needs | to be interpreted relative to the component. This also means that | a lower-level error may have to be logged again in the higher- | level context if it's still an error there, so that it doesn't | get ignored if e.g. monitoring only looks at errors on the | higher-level context. | | Differences between "fatal" and "error" are really nesting | differences between components/contexts. An error is always fatal | on the level where it originates. | abraae wrote: | > In addition, something that is an error in a subcomponent may | only be a warning or even just an info on the level of the | superordinate component. | | Or, keep it simple. | | - error means someone is alerted urgently to look at the | problem | | - warning means someone should be looking into it eventually, | with a view to reclassifying as info/debug or resolving it. | | IMO many people don't care much about their logs, until the | shit hits the fan. Only then, in production, do they realise | just how much harder their overly verbose (or inadequate) | logging is making things. | | The simple filter of "all errors send an alert" can go a long | way to encouraging a bit of ownership and correctness on | logging. 
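The two positions above can coexist, and a small Python sketch may make that concrete: the lower-level code logs what it knows without deciding severity, the caller logs again as an error when the failure matters in its context (layer8's point), and a handler set to ERROR stands in for abraae's "all errors send an alert" filter. `AlertHandler`, `fetch_profile`, `handle_checkout`, and the `app.*` logger names are all hypothetical; the handler only collects messages in a list rather than paging anyone.

```python
import logging


class AlertHandler(logging.Handler):
    """Hypothetical alert sink: collects messages instead of paging a human."""

    def __init__(self):
        super().__init__(level=logging.ERROR)  # records below ERROR never reach emit()
        self.alerts = []

    def emit(self, record):
        self.alerts.append(f"{record.name}: {record.getMessage()}")


def fetch_profile(user_id, store):
    """Lower-level code: it cannot know whether a missing profile is serious."""
    log = logging.getLogger("app.lib")
    if user_id not in store:
        log.info("profile lookup failed for user_id=%s", user_id)
        return None
    return store[user_id]


def handle_checkout(user_id, store):
    """Higher-level code: it has the context to call this an error."""
    log = logging.getLogger("app.checkout")
    profile = fetch_profile(user_id, store)
    if profile is None:
        log.error("checkout aborted: no profile for user_id=%s", user_id)
        return False
    return True


def demo():
    """Wire the alert handler onto the shared parent logger and run two requests."""
    parent = logging.getLogger("app")
    parent.setLevel(logging.INFO)
    parent.handlers.clear()
    parent.propagate = False
    alert = AlertHandler()
    parent.addHandler(alert)
    handle_checkout(7, {})          # info in the library, error in the caller
    handle_checkout(1, {1: "ok"})   # success, nothing to alert on
    return alert.alerts
```

Only the caller's error reaches the alert list; the library's info line would still be available to a second handler that keeps everything, which is what makes the low-level detail recoverable during an investigation.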
| layer8 wrote: | > - error means someone is alerted urgently to look at the | problem | | The issue is that the code that encounters the problem may | not have the knowledge/context to decide whether it warrants | alerting. The code higher up that does have the knowledge, on | the other hand, often doesn't have the lower-level | information that is useful to have in the log for analyzing | the failure. So how do you link the two? When you write | modular code that minimizes assumptions about its context, | that situation is a common occurrence. | abraae wrote: | If the code detecting the error is a library/subordinate | service then the same rule can be followed - should this be | immediately brought to a human's attention? | | The answer for a library will often be no, since the | library doesn't "have the knowledge/context to decide | whether it warrants alerting". | | So in that case the library can log as info, and leave it | to the caller to log as error if warranted (after learning | about the error from return code/http status etc.). | | When investigating the error, the human has access to the | info details from the subordinate service. | SkyPuncher wrote: | I agree with your premise, but do consider debug to be a fourth | level. | | Info is things like "processing X" | | Debug is things like "variable is Y" or "made it to this point" | Hermitian909 wrote: | The OP is wrong, log levels are very valuable if you leverage | them. | | Here's a classic problem as an illustration: The storage cost | of your logs is really prohibitive. You would like to cut out | some of your logs from storage but cannot lower retention below | some threshold (say 2 weeks maybe). For this example, assume | that tracing is also enabled and every log has a traceId | | A good answer is to run a compaction job that inspects each | trace. If it contains an error preserve it. Remove X% of all | other traces. 
| | Log levels make the ergonomics for this excellent and it can | save millions of dollars a year at sufficient scale. | BillinghamJ wrote: | I tend to think of "warning" as - "something unexpected | happened, but it was handled safely" | | And then "error" as - "things are not okay, a developer is | going to need to intervene" | | And errors then split roughly between "must be fixed sometime", | and "must be fixed now/ASAP" | layer8 wrote: | > I tend to think of "warning" as - "something unexpected | happened, but it was handled safely" | | It was handled safely at the level where it occurred, but | because it was unusual/unexpected, the underlying cause may | cause issues later on or higher up. | | If one were sure it would 100% not indicate any issue, one | wouldn't need to warn about it. | waffletower wrote: | There are logging libraries that include syntactically scoped | timers, such as mulog (https://github.com/BrunoBonacci/mulog). | While a great library, we preferred timbre | (https://github.com/taoensso/timbre) and rolled our own logging | timer macro that interoperates with it. More convenient to have | such niceties in a Lisp of course. Since we also have | OpenTelemetry available, it would also be easy to wrap traces | around code form boundaries as well. Thanks OP for the idea! | mrkeen wrote: | > If you're writing log statements, you're doing it wrong. | | I too use this bait statement. | | Then I follow it up with (the short version): | | 1) Rewrite your log statements so that they're machine readable | | 2) Prove they're machine-readable by having the down-stream | services read them instead of the REST call you would have | otherwise sent. | | 3) Switch out log4j for Kafka, which will handle the persistence | & multiplexing for you. | | Voila, you got yourself a reactive, event-driven system with | accurate "logs". 
| | If you're like me and you read the article thinking "I like the | result but I hate polluting my business code with all that | tracing code", well now you can create an _independent_ reader of | your kafka events which just focuses on turning events into | traces. | rewmie wrote: | > 3) Switch out log4j for Kafka, which will handle the | persistence & multiplexing for you. | | I don't think this is a reasonable statement. There are already | a few logging agents that support structured logging without | dragging in heavyweight dependencies such as Kafka. Bringing up | Kafka sounds like a case of a solution looking for a problem. | ahoka wrote: | I think OP meant event sourcing. | rewmie wrote: | > I think OP meant event sourcing. | | That is really beside the point. Logging and tracing have | always been fundamentally event sourcing, but that never | forced anyone ever at all to onboard onto freaking Kafka of | all event streaming/messaging platforms. | | This blend of suggestion sounds an awful lot like resume | driven development instead of actually putting together a | logging service. | mrkeen wrote: | > There are already a few logging agents that support | structured logging without dragging in heavyweight | dependencies such as Kafka. | | What are they? Because admittedly I've lost a little love for | the operational side of Kafka, and I wish the client-side | were a little "dumber", so I could match it better to my use | cases. | bowsamic wrote: | How to get me to leave your company 101 | mrkeen wrote: | I did write a pretty glib description of what to do ;) | | That said, I've had conflicts with a previous team-mate about | this. He couldn't wrap his head around Kafka being a source | of truth. But when I asked him whether he'd trust our Kafka | or our Postgres if they disagreed, he conceded that he'd | believe Kafka's side of things. | amelius wrote: | This is stuff that a debugger is supposed to do for you, for | free. 
| | This should not require code at the application level, but it | should be implemented at the tooling level. | goalieca wrote: | Logging is essential for security. I think tracing is wonderful | and so are metrics. I see these as more of a triad for | observability. | waffletower wrote: | Indeed, the three legs (metrics, logs, traces) of | OpenTelemetry's telescope. https://opentelemetry.io | candiddevmike wrote: | Something missing from OTel IMO is a standard way of linking | all three together. It seems like an exercise left to the | reader, but I feel like there should be standard metadata for | showing a relationship between traces, metrics, and logs. | Right now each of these functions is on an island (same with | the tooling and storage of the data, but that's another | rant). | discodachshund wrote: | Isn't that the trace ID? For metrics, it's in the form of | exemplars, and for logs it is the log context | candiddevmike wrote: | That might be dependent on the library then, there isn't | an official OTel Go logging library yet. Seems you have | to add the trace ID exemplars manually too | phillipcarter wrote: | Go is behind several of the languages in OTel right now. | Just a consequence of a very difficult implementation and | its load-bearing nature as being the language (and | library) of choice for CNCF infrastructure. If you use | Java or .NET, for example, it's quite fleshed out. | jen20 wrote: | One would hope that there will not _be_ an Open Telemetry | logging library for Go. Unlike last time there was a | thread about this, there is now a standard - `slog` in | the stdlib. | spullara wrote: | It drives me insane that the standardized tracing libraries have | you only report closed spans. What if it crashes? What if it | stalls? Why should I keep open spans in memory when I can just | write an end span event? | jauntywundrkind wrote: | What's most incredible to me is how close tracing feels in spirit | to me to event-sourcing. 
| | Here's this log of every frame of compute going on, plus data or | metadata about the frame.... but afaik we have yet to start using | the same stream of computation for business processes as we do | for its excellent observability. | alexisread wrote: | Any of the Clickhouse-based Otel stores can do event sourcing - | just set up materialised views on the trace tables. I know the | following use CH: https://uptrace.dev/ https://signoz.io/ | https://github.com/hyperdxio/hyperdx | juliogreff wrote: | As a matter of fact, at a previous job we used traces as a data | source for event sourcing. One use case: we tracked usage of | certain features in API calls in traces, and some batch job ran | at whatever frequency aggregated which users were using which | features. While it was far from real time because of the sheer | amount of data, it was so simple to implement that we had | dozens of use cases implemented like that. | skybrian wrote: | How would a hobbyist programmer get started with tracing for a | simple web app? Where do the traces end up and how do I query them? | Can tracing be used in a development environment? | | Context: the last thing I wrote used Deno and Deno Deploy. | curioussavage wrote: | Just install opentelemetry libs. I found this example with a | quick search: https://dev.to/grunet/leveraging-opentelemetry- | in-deno-45bj | | opentelemetry has a service you can run that will collect the | telemetry data and you can export it to something like | prometheus which can store it and let you query it. Example | here https://github.com/open-telemetry/opentelemetry-collector- | co... | | Typically in dev environments trace spans are just emitted to | stdout just like logs. I sometimes turn that off too though | because it gets noisy. | andersrs wrote: | I have a side project that I run in Kubernetes with a postgres | database and a few Go/Nodejs apps. Recommend me a lightweight | otel backend that isn't going to blow out my cloud costs. 
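As a rough answer to skybrian's "what is a trace, and where does it go in development" question, the sketch below builds a toy span recorder from the Python standard library only. It is not the OpenTelemetry API: `span`, `print_sink`, and `handle_request` are hypothetical names, and a real setup would use the OpenTelemetry SDK and an exporter instead. What it does show is the data a span typically carries (trace id, span id, parent id, attributes, duration) and the "emit spans to stdout in dev" approach curioussavage mentions.

```python
import contextvars
import json
import time
import uuid
from contextlib import contextmanager

# Tracks the span that is currently open, so nested spans can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name, sink, **attributes):
    """Toy span: records ids, parent, attributes and duration, then calls `sink`."""
    parent = _current_span.get()
    record = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "attributes": dict(attributes),
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        _current_span.reset(token)
        sink(record)  # the span is reported only once it has finished


def print_sink(record):
    """Development sink: one JSON object per finished span on stdout."""
    print(json.dumps(record))


def handle_request(sink=print_sink):
    """Hypothetical request handler with one nested span."""
    with span("GET /items", sink, user_id=42):
        with span("db.query", sink, table="items") as child:
            child["attributes"]["rows"] = 3  # attributes can be added while open
```

Calling `handle_request()` prints two JSON lines, child first. Because this toy reports a span only when it closes, a crash or stall inside `db.query` would leave nothing behind for the outer span, which is the limitation spullara objects to earlier in the thread.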
| perpil wrote: | I was recently musing about the 2 different types of logs: | | 1. application logs, emitted multiple times per request and serve | as breadcrumbs | | 2. request logs emitted once per request and include latencies, | counters and metadata about the request and response | | The application logs were useless to me except during | development. However the request logs I could run aggregations on | which made them far more useful for answering questions. What the | author explains very well is that the problem with application | logs is they aren't very human-readable which is where | visualizing a request with tracing shines. If you don't have | tracing, creating request logs will get you most of the way | there, it's certainly better than application logs. | https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging... | benreesman wrote: | As a historical critic of Rust-mania (and if I'm honest, kind of | an asshole about it too many times, fail), I've recently bumped | into stuff like tokio-tracing, eyre, tokio-console, and some | others. | | And while my historical gripes are largely still the status quo: | stack traces in multi-threaded, evented/async code that _actually | show real line numbers_? Span-based tracing that makes concurrent | introspection possible _by default_? | | I'm in. I apologize for everything bad I ever said and don't care | whatever other annoying thing. | | That's the whole show. Unless it deletes my hard drive I don't | really care about anything else by comparison. | zoogeny wrote: | One thing about logging and tracing is the inevitable cost (in | real money). | | I love observability probably more than most. And my initial | reaction to this article is the obvious: why not both? | | In fact, I tend to think more in terms of "events" when writing | both logs and tracing code. How that event is notified, stored, | transmitted, etc. is in some ways divorced from the activity. 
I | don't care if it is going to stdout, or over udp to an | aggregator, or turning into trace statements, or ending up in | Kafka, etc. | | But inevitably I bump up against cost. For even medium sized | systems, the amount of data I would like to track gets quite | expensive. For example, many tracing services charge for the tags | you add to traces. So doing `trace.String("key", value)` becomes | something I think about from a cost perspective. I worked at a | place that had a $250k/year New Relic bill and we were avoiding | any kind of custom attributes. Just getting APM metrics for | servers and databases was enough to get to that cost. | | Logs are cheap, easy, reliable and don't lock me in to an | expensive service to start. I mean, maybe you end up integrating | splunk or perhaps self-hosting kibana, but you can get 90% of the | benefits just by dumping the logs into Cloudwatch or even S3 for | a much cheaper price. | alexisread wrote: | Any of the Clickhouse-based Otel stores can dump the traces to | s3 for long-term storage, and can be self-hosted. I know the | following use CH: https://uptrace.dev/ https://signoz.io/ | https://github.com/hyperdxio/hyperdx | phillipcarter wrote: | FWIW part of the reason you're seeing that is, at least | traditionally, APM companies rebranding as Observability | companies stuffed trace data into metrics data stores, which | becomes prohibitively expensive to query with custom | tags/attributes/fields. Newer tools/companies have a different | approach that makes cost far more predictable and generally | lower. | | Luckily, some of the larger incumbents are also moving away | from this model, especially as OpenTelemetry is making tracing | more widespread as a baseline of sorts for data. And you can | definitely bet they're hearing about it from their customers | right now, and they want to keep their customers. | | Cost is still a concern but it's getting addressed as well. 
| Right now every vendor has different approaches (e.g., the one | I work for has a robust sampling proxy you can use), but that | too is going the way of standardization. OTel is defining how | to propagate sampling metadata in signals so that downstream | tools can use the metadata about population representativeness | to show accurate counts for things and so on. | thinkharderdev wrote: | > I mean, maybe you end up integrating splunk or perhaps self- | hosting kibana | | I think this is the issue. Both Splunk and OpenSearch (even | self-hosted OpenSearch) get really pricy as well especially | with large volumes of log data. Cloudwatch can also get | ludicrously expensive. They charge something like $0.50 per GB | (!) and another $0.03 per GB to store. I've seen situations at | a previous employer where someone accidentally deployed a | lambda function with debug logging and ran up a few thousand $$ | in Cloudwatch bills overnight. | | You should look at Coralogix (disclaimer: I work there). We've | built a platform that allows you to store your observability | data in S3 and query it through our infrastructure. It can be | dramatically more cost-effective than other providers in this | space. | jameshart wrote: | Observability costs feel high when everything's working fine. | When something snaps and everything is down and you need to | know why in a hurry... those observability premiums you've been | paying all along can pay off fast. | thangalin wrote: | > In fact, I tend to think more in terms of "events" when | writing both logs and tracing code. | | They are events[1]. For my text editor, KeenWrite, events can | be logged either to the console when run from the command-line | or displayed in a dialog when running in GUI mode. By changing | "logger.log()" statements to "event.publish()" statements, a | number of practical benefits are realized, including: | | * Decoupled logging implementation from the system (swap one | line of code to change loggers). 
| | * Publish events on a message bus (e.g., D-Bus) to allow | extending system functionality without modifying the existing | code base. | | * Standard logging format, which can be machine parsed, to help | trace in-field production problems. | | * Ability to assign unique identifiers to each event, allowing | for publication of problem/solution documentation based on | those IDs (possibly even seeding LLMs these days). | | [1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code- | smell/ | jameshart wrote: | But events that another system relies upon are now an _API_. | Be careful not to lock together things that are only | superficially similar, as it affects your ability to change | them independently. | thangalin wrote: | Architecturally, the decoupling works as follows: | Event -> Bus -> UI Subscriber -> Dialog (table) | Event -> Bus -> Log Subscriber -> Console (text) | Event -> Bus -> D-Bus Subscriber -> Relay -> D-Bus -> | Publish (TCP/IP) | | With D-Bus, published messages are versioned, allowing for | API changes without breaking third-party consumers. The | D-Bus Subscriber provides a layer of isolation between the | application and the published messages so that the two can | vary independently. | hosh wrote: | I have made use of tracing, metrics, and logging all together | and find that each of them has its own place, as well as synergies | from being able to work with all three together. | | Cost is a real issue, and not just in terms of how much the | vendor costs you. When tracing becomes a noticeable fraction of | CPU or memory usage relative to the application, it's time to | rethink doing 100% sampling. In practice, if you are sampling | thousands of requests per second, you're very unlikely to | actually look through each one of those thousands (thousands of | req/s may not be a lot for some sites, but it is already | exceeding human-scale without tooling). 
In order to keep | accurate, useful statistics with sampling, you end up using | metrics to store trace metrics prior to sampling. | hosh wrote: | That's weird. I use both logging and tracing where I can. And | metrics. | | While there are better tools for alerting, metrics, or | aggregations, it helps a lot in debugging and troubleshooting. | aero142 wrote: | I think the author's point is that tracing is a better | implementation of both logs and metrics, and I think it's a | valid point. * metrics are pre-aggregated into timeseries data, | which makes cardinality expensive. You could also aggregate a | value from a trace statement. * Logs are hand crafted and | unique, and are usually improved by adding structured | attributes. Structured attributes are better as traces because | you can have execution context and well defined attributes that | provide better detail. | | Traces can be aggregated or sampled to provide all of the | information available from logs, but in a more flexible way. * | Certain traces can be retained at 100%. This is equivalent to | logs. * Certain trace attributes can be converted to timeseries | data. This is equivalent to metrics. * Certain traces can be | sampled and/or queried with streaming infrastructure. This is a | way to observe data with high cardinality without hitting the | high cost. | hosh wrote: | There are things you can do with metrics and logging that you | cannot do with traces. These usually fall outside of | debugging application performance and bottlenecks. So I think | what the author says is true if you are only thinking about | application, and not for gaining a holistic understanding of | the entire system, including infrastructure. | | Probably the biggest tradeoff with traces is that, in | practice, you are not retaining 100% of all traces. In order | to keep accurate statistics, it generally gets ingested as | metrics before sampling. 
The other is that traces are not | stored in such a way where you are looking at what is | happening at a point-in-time -- which is what logging does | well. If I want to ensure I have execution context for | logging, I make the effort to add trace and span ids so that | traces and logging can be correlated. | | To be fair, I live in the devops world more often than not, | and my colleagues on the dev teams rarely have to venture | outside of traces. | | I don't mind the points this author is making. My main | criticism is that it is scoped to the world of applications | -- which is fine -- but then taken as universal for all of | software engineering. | fnordpiglet wrote: | Tracing is poor at very long-lived traces and at stream | processing, and most tracing implementations are too heavy to run | in computationally bound tasks beyond at a very coarse level. | Logging is nice in that it has no context, no overhead, is | generally very cheap to compose and emit, and with including | transaction id and done in a structured way gives you most of | what tracing does without all the other baggage. | | That said for the spaces where tracing works well, it works | unreasonably well. | cschneid wrote: | When I worked at ScoutAPM, that list is basically the exact | set of areas we had trouble supporting. We didn't do full-on | tracing in the OpenTracing kind of way, but the agent was | pretty similar, with spans (mostly automatically inserted), and | annotations on those spans with timing, parentage, and extra | info (like the sql query this represented in Active record). | | The really hard things, which we had reasonable answers for, | but never quite perfect: * Rails websockets (actioncable) * | very long running background jobs (we stopped collecting at | some limit, to prevent unbounded memory) * trying to profile | code, we used a modified version of Stackprof to do sampling | instead of exact profiling. 
That worked surprisingly well at | finding hotspots, with low overhead. | | All sorts of other tricks came along too. I should go look at | that codebase again to remind me. That'd be good for my | resume.... :) | | https://github.com/scoutapp/scout_apm_ruby | riv991 wrote: | I think OpenTelemetry has solved the stream processing | problem with span links[1]. Treating each unit of work as an | individual trace but being able to combine them and see a | causal relationship. Slack published a blog about it pretty | recently [2] | | [1] | https://opentelemetry.io/docs/concepts/signals/traces/#span-... | | [2] https://slack.engineering/tracing-notifications/ | phillipcarter wrote: | Hmmm, for long-lived processes and stream processing we use | tracing just fine. What we do is make a cutoff of 60 seconds, | so that each chunk is its own trace. But our backend queries | trace data directly, so we can still analyze the aggregate, | long-term behavior and then dig into a particular 60 second | chunk if it's problematic. | ducharmdev wrote: | Minor nitpick, but I wish this post started with defining what we | mean by logging vs tracing, since some people use these | interchangeably. The reader instead has to infer this from the | criticisms of logging. | ryanklee wrote: | I've never encountered this confusion anywhere, so I wouldn't | ever think to dispel it. Which isn't to say that I disagree | with the more general point that defining your terms is a good | thing. | | In any case, the post itself (which is not long) illustrates | and marks out many of the differences. | jlokier wrote: | I agree. I'm working with code that uses 'verbose "message"' | for level 1 verbosity logs and 'trace "message"' for level 2 | verbosity. Makes sense in its world, but it's not the same | meaning as how cloud-devops-observability culture uses those | words. | vkoskiv wrote: | Nit to the author: 'rapala' seems like a mistranslation. 
It is a | brand name of a company that makes fishing lures, as far as I can | tell. It is not the Finnish word for "to bait", and is therefore | only used to refer to that particular brand. I'm not sure what | the purpose of the text in parentheses is here, but 'houkutella' | would be the most apt translation in this case. | lambda_garden wrote: | Couldn't this be injected into the runtime so that no code | changes are required? | | Perhaps really performance critical stuff could have a "notrace" | annotation. | thinkharderdev wrote: | Sure, and a lot of tools will do this in one way or another. | Either instrument code directly or provide annotations/macros | to trace a specific method (something like tokio-tracing in the | Rust ecosystem). | | However, tracing literally every method call would probably be | prohibitively expensive so typically you have either: | | 1. Instrumentation which "understands" common | frameworks/libraries and knows what to instrument (eg request | handlers in web frameworks) | | 2. Full opt-in. They make it easy to add a trace for a method | invocation with a simple annotation but nothing gets | instrumented by default | austinsharp wrote: | Yes, OTel has autoinstrumentation libraries for some languages | that can pick up a fair amount by default. Though it's unlikely | that that would ever be sufficient, it's a nice start. | | For Java: | https://opentelemetry.io/docs/instrumentation/java/automatic... | imiric wrote: | There are several projects that leverage eBPF for automatic | instrumentation[1]. | | How accurate and useful these are vs. doing this manually will | depend on the use case, but I reckon the automatic approach | gets you most of the way there, and you can add the missing | traces yourself, so if nothing else it saves a lot of work. 
| | [1]: https://ebpf.io/applications/ | hardwaresofton wrote: | I think there's an alternate universe out there where: | | - we collectively realized that logs, events, traces, metrics, | and errors are actually all just logs | | - we agreed on a single format that encapsulated all that | information in a structured manner | | - we built firehose/stream processing tooling to provide modern | o11y creature comforts | | I can't tell if that universe is better than this one, or worse. | phillipcarter wrote: | That's more or less the model Honeycomb uses. Every signal type | is just a structured event. Reality is a bit messier, though. | In particular, metrics are the oddball in this world and | required a lot of work to make economical. | dalyons wrote: | Is that really an alternate universe? That's the universe that | splunk and friends are selling, everything's a log. It's really | expensive. | andrewstuart2 wrote: | Traces are just distributed "logs" (in the data structure | sense; data ordered only by its appearance in _something_ ) | where you also pass around the tiniest bit of correlation | context between apps. Traces are structured, timestamped, and | can be indexed into much more debug-friendly structures like a | call tree. But you could just as easily ignore all the data and | print them out in streaming sorted order without any | correlation. | | Honestly it sounds like you're pitching opentelemetry/otlp but | where you only trace and leave all the other bits for later | inside your opentelemetry collector, which can turn traces into | metrics or traces into logs. | thegrizzlyking wrote: | Logs are mostly "Hi I reached this line of code, here is some | metadata" | jasonjmcghee wrote: | I really enjoyed the content- it's a great article. | | Note to author: all but the last code block have a very odd | mixture of rather large font sizes (at least on mobile) which | vary line to line that make them pretty difficult to read. 
| | Also the link to "Observability Driven Development." was a blank | slide deck AFAICT | hello1234567 wrote: | person writing this came to know something that he didn't know | earlier and decided to convert his light bulb moment into a blog | post. not bad but failed to understand that logs are the | generalisation of the very thing they are talking about. | jeffbee wrote: | This is a great article because everyone should understand the | similarity between logging and tracing. One thing worth pondering | though is the difference in cost. If I am not planning to | centrally collect and index informational logs, free-form text | logging is extremely cheap. Even a complex log line with | formatted strings and numbers can be emitted in < 1us on modern | machines. If you are handling something like 100s or 1000s of | requests per second per core, which is pretty respectable, | putting a handful of informational log statements in the critical | path won't hurt anyone. | | Off-the-shelf tracing libraries on the other hand are pretty | expensive. You have one additional mandatory read of the system | clock, to establish the span duration, plus you are still paying | for a clock read on every span event, if you use span events. | Every span has a PRNG call, too. Distributed tracing is worthless | if you don't send the spans somewhere, so you have to budget for | encoding your span into json, msgpack, protobuf, or whatever. | It's a completely different ball game in terms of efficiency. | nithril wrote: | It is actually simpler to conceptualize the difference, one is | stateless, the other one is stateful. | | Actually structured logging has existed for years, for example in Java | https://github.com/logfellow/logstash-logback-encoder | xyzzy_plugh wrote: | I will agree that conceptually logging can be much cheaper than | tracing ever can, but in practice any semi-serious attempt at | structured logging ends up looking very, very close to tracing.
| In fact I'd go so far as to say that the two are effectively | interchangeable at a point. What you do with that information, | whether you index it or build a graph, is up to you -- and that | is where the cost creeps in. | | Adding timestamps and UUIDs and an encoding is par for the | course in logging these days, I don't think that is the right | angle to criticize efficiency. | | Tracing can be very cheap if you "simply" (and I'm glossing | over a lot here) search for all messages in a liberal window | matching each "span start" message and index the result sets. | Offering a way to view results as a tree is just a bonus. | | Of course, in practice this ends up meaning something | completely different, and far costlier. Why that is I cannot | fathom. | hyperpape wrote: | I don't generally disagree, but using JSON for structured logs | is a growing thing as well. | h1fra wrote: | Tracing is much more actionable but barely usable without a | platform. Which makes local programming dependent on a third party. | Also it requires passing context or having a way to get back the | context in every function that requires it, which can be | daunting. | | On my side I have opted for mixed structured/text, a generic | message that can be easily understood while glancing over logs, | and a data object attached for more details. | candiddevmike wrote: | You can add Jaeger to your local dev containers and run it in | memory, it's really lightweight and easy to use. | hinkley wrote: | Someone got me excited about tracing and I started tweaking our | stats API to optionally add tracing. Retrofitted it into a | mature app, then immediately discovered that all of the data | was being dropped because AWS only likes very tiny traces. | Depth or fanout or both break it rather quickly. | | And OpenTelemetry has a very questionable implementation. For a | nested trace, events fire when the trace closes, meaning that a | parent ID is reported before it is seen in the stream.
That | can't be good for processing. Would be better to have a leading | edge event (also helps with errors throwing and the parent | never being reported). | | Kind of a bummer. Needs work. | pcthrowaway wrote: | > OpenTelemetry has a very questionable implementation | | The nice thing about OpenTelemetry is that it's a standard. | The questionable implementation you're referencing isn't a | source of truth. There isn't some canonical "questionable" | implementation. | | There are many, slightly different, questionable | implementations. | hinkley wrote: | If the wire protocol has a bug, that's not something an | implementation can fix. | | I'm saying the wire protocol is wrong. ___________________________________________________________________ (page generated 2023-09-20 23:00 UTC)