[HN Gopher] eBPF-based auto-instrumentation outperforms manual instrumentation
___________________________________________________________________
eBPF-based auto-instrumentation outperforms manual instrumentation
Author : edenfed
Score : 155 points
Date : 2023-10-30 14:10 UTC (8 hours ago)
(HTM) web link (odigos.io)
(TXT) w3m dump (odigos.io)
| nevodavid10 wrote:
| This is great. Can you elaborate on how the performance is better?
| Barakikia wrote:
| Our focus was on latency. We were able to cut it down because eBPF-based automatic instrumentation separates the recording of data from its processing.
| grazio wrote:
| How did you actually reduce the latency here?
| RonFeder wrote:
| The main factor in the reduced latency is the separation between recording and processing of data. The eBPF programs are the only latency overhead for the instrumented process; they transfer the collected data to a separate process, which handles all the exporting. This is in contrast to manually adding code to an application, where handling the exported data costs both latency and memory footprint.
| CSDude wrote:
| Somewhat related, I mainly code in Kotlin. Adding OpenTelemetry was just adding the agent to the command-line args (the usual Java/JVM magic most people don't like). Then I had a project in Go and got so tired of all the steps it took (setup and ensuring each context is instrumented) that I just gave up. We still add our manual instrumentation for customization, but auto-instrumentation made adoption much easier on day 0.
| edenfed wrote:
| I think eBPF also has great potential to help JVM-based languages, especially on the performance side, even compared to the current Java agents, which use bytecode manipulation.
| marwis wrote:
| The article mentions avoiding GC pressure and the separation between recording and processing as big performance wins for runtimes like Java, but you could do the same inside Java by using a ring buffer, no?
| edenfed wrote:
| Interesting idea. I think that as long as you are able to do the processing, serializing, and delivery in another process, and spare your application runtime that work, you should see great performance.
| avita1 wrote:
| How do you solve the context propagation issue with eBPF-based instrumentation?
|
| E.g. if an RPC request comes in and you make an RPC request in order to serve it, the traced program needs to track some ID for that request from the time it comes in through to the place where the HTTP request goes out. And then that ID has to get injected into a header on the wire so the next program sees the same request ID.
|
| IME that's where most of the overhead (and value) of a manual tracing library comes from.
| edenfed wrote:
| It depends on the programming language being instrumented. For Go we are assuming the context.Context object is passed around between different functions or goroutines. For Java, we are using a combination of ThreadLocal tracing and Runnable tracing to support use cases like reactive and multithreaded applications.
| camel_gopher wrote:
| That's a very big assumption, at least for Go-based applications.
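For readers less familiar with the Go convention being assumed above, here is a minimal sketch of the pattern (the downstream URL is only a placeholder): the incoming request's context.Context is threaded through to the outgoing call, so anything an instrumentation layer attaches to that context, such as a trace ID, can follow the request.

    // Minimal sketch of the context-propagation pattern the eBPF
    // instrumentation assumes for Go: the handler passes r.Context()
    // on to the outgoing request instead of using context.Background().
    package main

    import "net/http"

    func handler(w http.ResponseWriter, r *http.Request) {
        ctx := r.Context() // request-scoped context; carries trace data

        // Reuse the same context for the downstream call so the two
        // requests can be stitched into one trace.
        req, err := http.NewRequestWithContext(ctx, http.MethodGet,
            "http://downstream.example/api", nil) // placeholder URL
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        defer resp.Body.Close()
        w.WriteHeader(resp.StatusCode)
    }

    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }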
| edenfed wrote:
| We are also thinking about implementing a fallback mechanism to automatically propagate context on the same goroutine if context.Context is not passed.
| nulld3v wrote:
| I don't think it's unreasonable; you need a Context to make a gRPC call and you get one when handling a gRPC call. It usually doesn't get lost in between.
| otterley wrote:
| True for gRPC, but not necessarily for HTTP - the HTTP client and server packages that ship with Go predate the Context package by quite a long while.
| spullara wrote:
| Going to be rough for supporting virtual threads then?
| edenfed wrote:
| We have a solution for virtual threads as well. Currently working on a blog post describing exactly how; will update once it's released.
| marwis wrote:
| ScopedValue solves that problem: https://docs.oracle.com/en/java/javase/21/docs/api/java.base...
| rocmcd wrote:
| 100%. Context propagation is _the_ key to distributed tracing; otherwise you're only seeing one side of every transaction.
|
| I was hoping Odigos was language/runtime-agnostic since it's eBPF-based, but I see it's mentioned in the repo that it only supports:
|
| > Java, Python, .NET, Node.js, and Go
|
| Apart from Go (which is a WIP), these are the languages already supported by OTel's (non-eBPF-based) auto-instrumentation. Apart from a win on latency (which is nice, but could in theory be combated with sampling), why else go this route?
| edenfed wrote:
| eBPF instrumentation does not require code changes, redeployment, or restarting running applications.
|
| We are constantly adding more language support for eBPF instrumentation and are aiming to cover the most popular programming languages soon.
|
| Btw, not sure that sampling is really the solution to combat overhead; after all, you probably do want that data. Trying to fix a production issue when the data you need is missing due to sampling is not fun.
| rocmcd wrote:
| All good points, thank you.
|
| What's the limit on language support? Is it theoretically possible to support any language/runtime? Or does it come down to the protocol (HTTP, gRPC, etc.) being used by the communicating processes?
| edenfed wrote:
| We already solved compiled languages (Go, C, Rust) and JIT languages (Java, C#). Interpreted languages (Python, JS) are the only ones left; hopefully we will solve these soon as well. The big challenge is supporting all the different runtimes; once that is solved, implementing support for different protocols / open-source libraries is not as complicated.
| jetbalsa wrote:
| Got to get PHP on that list :)
| phillipcarter wrote:
| FWIW it's theoretically possible to support any language/runtime, but since eBPF is operating at the level it's at, there's no magic abstraction layer to plug into. Every runtime and/or protocol involves different segments of memory and certain bytes meaning certain things. It's all in service of having no additional requirements for an end user to install, but once you're in eBPF world everything is runtime-and-protocol-and-library-specific.
| RonFeder wrote:
| The eBPF programs handle passing the context through the requests by adding a field to the header, as you mentioned. The injected field follows the W3C Trace Context standard.
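To make that concrete (this is not Odigos' code, just a sketch of the standard format): a W3C Trace Context header is a single traceparent field carrying a version, a 16-byte trace ID, the 8-byte parent span ID, and flags, all hex-encoded. The IDs below are made up.

    // Sketch of injecting a W3C Trace Context "traceparent" header, the
    // same kind of field the eBPF programs write into outgoing requests.
    package main

    import (
        "fmt"
        "net/http"
    )

    func injectTraceparent(req *http.Request, traceID, spanID string) {
        // Format: <version>-<trace-id>-<parent-id>-<trace-flags>
        req.Header.Set("traceparent", fmt.Sprintf("00-%s-%s-01", traceID, spanID))
    }

    func main() {
        req, _ := http.NewRequest(http.MethodGet, "http://downstream.example", nil)
        injectTraceparent(req, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
        fmt.Println(req.Header.Get("traceparent"))
        // 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    }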
| heyeb_25 wrote:
| If I have manually implemented all my logs, what do I need to do to move to Odigos?
| edenfed wrote:
| Nothing special; if you are working on Kubernetes it's as easy as running the `odigos install` CLI and pointing it at your current monitoring system.
| bakery_bake wrote:
| According to what you say, nobody should implement logs manually? I will check out Odigos.
| edenfed wrote:
| Logs are an easy and familiar API for adding additional data to your traces. They still have their place; Odigos just adds much more context.
| jrockway wrote:
| They don't really show any of the settings they used, but for traces, I imagine if you have a reasonable sampling rate, then you aren't going to be running any code for most requests, so it won't increase latency. (Looking at their chart, I guess they are sampling .1% of requests, since 99.9% is where latency starts increasing. I am not sure if I would trace .1% of page loads to google.com, as their table implies. Rather, I'd pick something like 1 request per second, so that latency does not increase as load increases.)
|
| A lot of Go metrics libraries, specifically Prometheus, introduce a lot of lock contention around incrementing metrics. This was unacceptably slow for our use case at work and I ended up writing a metrics system that doesn't take any locks for most cases.
|
| (There is the option to introduce a lock for metrics that are emitted on a timed basis; i.e. emit tx_bytes every 10s or 1MiB instead of at every Write() call. But this lock is not global to the program; it's unique to the metric and the key=value "fields" on the metric. So you can have a lot of metrics around and not contend on locks.)
|
| The metrics are then written to the log, which can be processed in real time to synthesize distributed traces and Prometheus metrics, if you really want them: https://github.com/pachyderm/pachyderm/blob/master/src/inter... (Our software is self-hosted, and people don't have those systems set up, so we mostly consume metrics/traces in log form. When customers have problems, we prepare a debug bundle that is mostly just logs, and then we can further analyze the logs on our side to see event traces, metrics, etc.)
|
| As for eBPF, that's something I've wanted to use to enrich logs with more system-level information, but most customers that run our software in production aren't allowed to run anything as root, and thus eBPF is unavailable to them. People will tolerate it for things like Cilium or whatever, but not for ordinary applications that users buy and ask their production team to install for them. Production Linux at big companies is super locked down, it seems, much to my disappointment. (Personally, my threat model for Linux is that if you are running code on the machine, you probably have root through some yet-undiscovered kernel bug. Historically, I've been right. But that is not the big companies' security teams' mental model, it appears. They aren't paranoid enough to run each k8s pod in a hypervisor, but are paranoid enough to prevent using CAP_SYS_ADMIN or root.)
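A minimal sketch of the lock-free idea described above (this is not the Pachyderm implementation, and the names are invented): hot-path increments go through sync/atomic, and any aggregation or export happens on a separate, slower path, so request handlers never block on a metrics mutex.

    // Lock-free hot-path counter: increments use sync/atomic, and a
    // periodic exporter reads the value, e.g. to emit a log line.
    package main

    import (
        "fmt"
        "sync/atomic"
        "time"
    )

    type counter struct{ v atomic.Int64 }

    func (c *counter) Inc(n int64) { c.v.Add(n) } // safe from any goroutine, no lock

    func main() {
        var txBytes counter

        // Hot path: handlers just bump the counter.
        go func() {
            for {
                txBytes.Inc(1024)
                time.Sleep(time.Millisecond)
            }
        }()

        // Cold path: periodic export of the current value.
        for i := 0; i < 3; i++ {
            time.Sleep(time.Second)
            fmt.Printf("tx_bytes=%d\n", txBytes.v.Load())
        }
    }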
| edenfed wrote:
| Thanks for the valuable feedback! We used a constant throughput of 10,000 rps. The exact testing setup can be found under "how we tested".
|
| I think the lock used by the Prometheus library is a great example of why generating traces/metrics is a great fit for offloading to a different process (an agent).
|
| Pachyderm looks very interesting; however, I am not sure how you can generate distributed traces based on metrics. How do you fill in the missing context propagation?
|
| Our way of dealing with eBPF's root requirements is to be as transparent as possible. This is why we donated the code to the CNCF and are developing it as part of the OpenTelemetry community. We hope that being open will make users trust us. You can see the relevant code here: https://github.com/open-telemetry/opentelemetry-go-instrumen...
| jrockway wrote:
| > I am not sure how you can generate distributed traces based on metrics
|
| Every log line gets an x-request-id field, and then when you combine the logs from the various components, you can see the propagation throughout our system. The request ID is a UUIDv4, but the mandatory version-4 nibble in the UUIDv4 gets replaced with a digit that represents where the request came from: background task, web UI, CLI, etc. I didn't take the approach of creating a separate span ID to show sub-requests. Since you have all the logs, this extra piece of information isn't super necessary, though my coworkers have asked for it a few times because every other system has it.
|
| Since metrics are also log lines, they get the request-id, so you can do really neat things like "show me when this particular download stalled" or "show me how much bandwidth we're using from the upstream S3 server". The aggregations can take place after the fact, since you have all the raw data in the logs.
|
| If we were running this such that we tailed the logs and sent things to Jaeger/Prometheus, a lot of this data would have to go away for cardinality reasons. But squirreling the logs away safely, and then doing analysis after the fact when a problem is suspected, ends up being pretty workable. (We do still have a Prometheus exporter not based on the logs, for customers that do want alerts. For log storage, we bundle Loki.)
| otterley wrote:
| The column in the table claiming the "number of page loads that would experience the 99th %ile" is mathematically suspect. It directly contradicts what a percentile is.
|
| By definition, at the 99th percentile, if I have 100 page loads, the _one_ with the worst latency would be over the 99th percentile. That's not 85.2%, 87.1%, 67.6%, etc. The formula shown in that column makes no sense at all.
| edenfed wrote:
| I recommend watching Gil Tene's talk, I think he explains the math better than I do: https://www.youtube.com/watch?v=lJ8ydIuPFeU
| tpankaj wrote:
| That's not what that column is supposed to mean, afaict. The way I read it, it's showing that if the website requires hundreds of different parallel backend service calls to serve the page load, what's the probability that a page load hits the p99 instrumentation latency?
|
| We have a similar chart at my job to illustrate the point that high p99 latency on a backend service doesn't mean only 1% of end-user page loads are affected.
| otterley wrote:
| Ah, I see. So, for example, if one page request results in 190 different backend requests to fulfill, then the probability that at least one of those subrequests exceeds the 99th percentile would be 85.2%. That makes a lot more sense.
| bjt12345 wrote:
| But what if the 100 page loads are just a sample of the population?
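For readers who want to check the arithmetic behind that subthread: assuming the backend calls are independent, the chance that at least one of N calls lands above the service's own p99 is 1 - 0.99^N, which is where a figure like 85.2% for roughly 190 calls comes from. A quick sketch:

    // Quick check of the "page load hits the p99" math under an
    // independence assumption:
    //   P(at least one of N subrequests exceeds p99) = 1 - 0.99^N
    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        for _, n := range []float64{1, 10, 100, 190} {
            p := 1 - math.Pow(0.99, n)
            fmt.Printf("N = %3.0f backend calls -> %.1f%% of page loads\n", n, p*100)
        }
        // N = 190 gives ~85.2%, matching the figure discussed above.
    }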
| chabad360 wrote:
| How hard is it to use Odigos without k8s? We mainly use docker compose for our deployments (because it's convenient, and we don't need scale), but I'm having trouble finding anything in the documentation that explains the mechanism for hooking into the container (and hence I have no clue how to repurpose it).
| edenfed wrote:
| We currently support only Kubernetes environments. docker-compose, VMs, and serverless are on our roadmap and will be ready soon.
| ranting-moth wrote:
| Website doesn't display correctly on FF on Android. Text bleeds off the left and right sides.
| edenfed wrote:
| Thank you for reporting, will fix ASAP.
| zengid wrote:
| Anyone from the DTrace community want to enlighten a n00b about how eBPF compares to what DTrace does?
| zengid wrote:
| From the hot takes in this post from 2018 [0], I may be asking a contentious question.
|
| [0] https://news.ycombinator.com/item?id=16375938
| edenfed wrote:
| I don't have a lot of experience using DTrace, but AFAIK the big advantage of eBPF over DTrace is that you do not need to instrument your application with static probes during coding.
| tanelpoder wrote:
| DTrace (on Solaris at least) can instrument any userspace symbol or address, no need for static tracepoints in the app.
|
| One problem that DTrace has is that the "pid" provider that you use for userspace app tracing only works on processes that are already running. So, if more processes with the executable of interest launch after you've started DTrace, its pid provider won't catch the new ones. Then you end up doing some tricks like tracking exec-s of the binary and restarting your DTrace script...
| bcantrill wrote:
| That's not exactly correct, and is merely a consequence of the fact that you are trying to use the pid provider. The issue that you're seeing is that pid probes are created on-the-fly -- and if you don't demand that they are created in a new process, they in fact won't be. USDT probes generally don't have this issue (unless they are explicitly lazily created -- and some are). So you don't actually need/want to restart your DTrace script, you just want to force probes to be created in new processes (which will necessitate some tricks, just different ones).
| tanelpoder wrote:
| So how would you demand that they'd be created in a new process? I was already using the pid* provider years ago when I was working on this (and wasn't using static compiled-in tracepoints).
| bcantrill wrote:
| They're really very different -- with very different origins and constraints. If you want to hear about my own experiences with bpftrace, I got into this a bit recently. [0] (And in fact, one of my questions about the article is how they deal with silently dropped data in eBPF -- which I found to be pretty maddening.)
|
| [0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m0s
| edenfed wrote:
| By dropped data, do you mean exceeding the size of the allocated ring buffer/perf buffer? If so, this is configurable by the user, so you can adjust it according to the expected load.
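As a concrete illustration of the buffer-overflow case (a sketch assuming the cilium/ebpf Go library, not the code under discussion; the "events" map is hypothetical and would come from a loaded eBPF collection): a userspace reader of a perf buffer can at least observe how many samples were dropped on overflow via the record's LostSamples count, though, as the reply below notes, other drop conditions are far less visible.

    // Sketch of reading events from an eBPF perf buffer with the
    // cilium/ebpf library. On overflow, Record.LostSamples reports how
    // many samples were dropped since the last read.
    package main

    import (
        "log"
        "os"

        "github.com/cilium/ebpf"
        "github.com/cilium/ebpf/perf"
    )

    func readEvents(events *ebpf.Map) {
        // Larger per-CPU buffers mean fewer drops under load; this is the
        // knob referred to above. 64 pages is an arbitrary example value.
        rd, err := perf.NewReader(events, 64*os.Getpagesize())
        if err != nil {
            log.Fatalf("creating perf reader: %v", err)
        }
        defer rd.Close()

        for {
            rec, err := rd.Read()
            if err != nil {
                log.Printf("read: %v", err)
                return
            }
            if rec.LostSamples > 0 {
                log.Printf("perf buffer overflow: %d samples lost", rec.LostSamples)
            }
            _ = rec.RawSample // bytes written by the eBPF program
        }
    }

    func main() {
        // In a real program the map would come from a loaded eBPF
        // collection (e.g. ebpf.LoadCollection); omitted here for brevity.
    }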
| bcantrill wrote:
| eBPF can drop data silently under quite a few conditions, unfortunately. And -- most frustratingly -- it's silent, so it's not even entirely clear which condition you've fallen into. This alone is a pretty significant difference with respect to DTrace: when/where DTrace drops data, there is _always_ an indicator as to why. And to be clear, this isn't a difference merely of implementation (though that too, certainly), but of principle: DTrace, at root, is a debugger -- and it strives to be as transparent to the user as possible as to the truth of the underlying system.
| zengid wrote:
| I listened to this live! That's probably why I was wondering, because I remember you talking about something you used in Linux that didn't quite live up to your expectations with DTrace, but I didn't catch all of the names. Thanks!
| Thaxll wrote:
| Of course it outperforms it, but it's basic instrumentation; how do you properly select the labels, for example? In your application you will have custom instrumentation for business logic, so what do you do? Now you have two systems instrumenting the same app?
| edenfed wrote:
| You can enrich the spans created by eBPF by using the OpenTelemetry APIs as usual; the eBPF instrumentation is a replacement for the instrumentation SDK. The eBPF program will detect the data recorded via the APIs and add it to the final trace, combining both automatic and manually created data.
___________________________________________________________________
(page generated 2023-10-30 23:00 UTC)