[HN Gopher] eBPF-based auto-instrumentation outperforms manual i...
       ___________________________________________________________________
        
       eBPF-based auto-instrumentation outperforms manual instrumentation
        
       Author : edenfed
       Score  : 155 points
       Date   : 2023-10-30 14:10 UTC (8 hours ago)
        
 (HTM) web link (odigos.io)
 (TXT) w3m dump (odigos.io)
        
       | nevodavid10 wrote:
       | This is great. Can you elaborate on how the performance is
       | better?
        
         | Barakikia wrote:
          | Our focus was on latency. We were able to cut it down
          | because eBPF-based automatic instrumentation separates the
          | recording from the processing.
        
           | grazio wrote:
           | How did you actually reduce the latency here ?
        
             | RonFeder wrote:
              | The main factor in the reduced latency is the separation
              | between recording and processing of data. The eBPF
              | programs are the only latency overhead for the
              | instrumented process; they transfer the collected data to
              | a separate process which handles all the exporting. In
              | contrast, manually adding instrumentation code to an
              | application adds both latency and memory footprint,
              | because the application itself has to handle the exported
              | data.
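              | 
              | For illustration only (not the actual Odigos code), a
              | minimal Go sketch of what that hand-off can look like
              | with cilium/ebpf: the eBPF programs write events into a
              | ring buffer map, and a separate agent process reads and
              | exports them, so the instrumented process never runs any
              | exporting code. The pinned map path is a made-up
              | placeholder.
              | 
              |     package main
              |     
              |     import (
              |         "log"
              |     
              |         "github.com/cilium/ebpf"
              |         "github.com/cilium/ebpf/ringbuf"
              |     )
              |     
              |     func main() {
              |         // Open the ring buffer map the eBPF programs write
              |         // trace events into (path is hypothetical).
              |         events, err := ebpf.LoadPinnedMap("/sys/fs/bpf/example_events", nil)
              |         if err != nil {
              |             log.Fatal(err)
              |         }
              |         defer events.Close()
              |     
              |         rd, err := ringbuf.NewReader(events)
              |         if err != nil {
              |             log.Fatal(err)
              |         }
              |         defer rd.Close()
              |     
              |         for {
              |             rec, err := rd.Read()
              |             if err != nil {
              |                 log.Fatal(err)
              |             }
              |             // Decode rec.RawSample and export it (OTLP, etc.)
              |             // here in the agent, off the application's hot path.
              |             _ = rec.RawSample
              |         }
              |     }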
        
       | CSDude wrote:
        | Somewhat related, I mainly code in Kotlin. Adding OpenTelemetry
        | was just adding an agent to the command-line args (the usual
        | Java/JVM magic most people don't like). Then I had a project in
        | Go, got so tired of all the steps it took (setup and ensuring
        | each context is instrumented), and just gave up. We still add
        | our manual instrumentation for customization, but auto-
        | instrumentation made adoption much easier on day 0.
        
         | edenfed wrote:
          | I think eBPF also has great potential to help JVM-based
          | languages, especially around performance, even compared to
          | the current Java agents which use bytecode manipulation.
        
           | marwis wrote:
            | The article mentions avoiding GC pressure and separation
            | between recording and processing as big performance wins
            | for runtimes like Java, but you could do the same inside
            | Java by using a ring buffer, no?
        
             | edenfed wrote:
              | Interesting idea. I think that as long as you are able to
              | do the processing, serializing, and delivery in another
              | process and keep that work out of your application
              | runtime, you should see great performance.
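              | 
              | (The general idea, sketched in Go rather than Java and
              | purely as an illustration of the recording/processing
              | split, not Odigos code: the hot path only enqueues a
              | small struct into a bounded buffer and drops on overflow,
              | while a background worker does the serialization and
              | export.)
              | 
              |     package main
              |     
              |     import (
              |         "fmt"
              |         "time"
              |     )
              |     
              |     type span struct {
              |         name       string
              |         start, end time.Time
              |     }
              |     
              |     // Bounded queue: recording never blocks the request path.
              |     var queue = make(chan span, 4096)
              |     
              |     func record(s span) {
              |         select {
              |         case queue <- s: // cheap enqueue on the hot path
              |         default: // buffer full: drop instead of adding latency
              |         }
              |     }
              |     
              |     func exporter() {
              |         for s := range queue {
              |             // Serialization and delivery happen here, not in
              |             // the code path that serves requests.
              |             fmt.Printf("export %s took %s\n", s.name, s.end.Sub(s.start))
              |         }
              |     }
              |     
              |     func main() {
              |         go exporter()
              |         start := time.Now()
              |         // ... handle a request ...
              |         record(span{name: "GET /users", start: start, end: time.Now()})
              |         time.Sleep(100 * time.Millisecond) // let the exporter drain
              |     }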
        
       | avita1 wrote:
       | How do you solve the context propagation issue with eBPF based
       | instrumentation?
       | 
       | E.g. if you get a RPC request coming in, and make an RPC request
       | in order to serve the incoming RPC request. The traced program
       | needs to track some ID for that request from the time it comes
        | in, through to the place where the HTTP request comes out.
       | And then that ID has to get injected into a header on the wire so
       | the next program sees the same request ID.
       | 
       | IME that's where most of the overhead (and value) from a manual
       | tracing library comes from.
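        | 
        | For concreteness, roughly what a manual library does on every
        | hop, sketched with the OTel Go propagation API (the downstream
        | URL is made up, and a real setup would also start a span in
        | between the extract and the inject):
        | 
        |     package main
        |     
        |     import (
        |         "net/http"
        |     
        |         "go.opentelemetry.io/otel/propagation"
        |     )
        |     
        |     var prop = propagation.TraceContext{}
        |     
        |     func handler(w http.ResponseWriter, r *http.Request) {
        |         // Pull the incoming traceparent header into the context.
        |         ctx := prop.Extract(r.Context(), propagation.HeaderCarrier(r.Header))
        |     
        |         // ... serve the request, then call the downstream service ...
        |         out, _ := http.NewRequestWithContext(ctx, "GET", "http://downstream/api", nil)
        |     
        |         // Inject the same trace context into the outgoing headers
        |         // so the next service sees the same trace ID.
        |         prop.Inject(ctx, propagation.HeaderCarrier(out.Header))
        |         http.DefaultClient.Do(out)
        |     }
        |     
        |     func main() {
        |         http.ListenAndServe(":8080", http.HandlerFunc(handler))
        |     }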
        
         | edenfed wrote:
         | It depends on the programming language being instrumented. For
         | Go we are assuming the context.Context object is passed around
         | between different functions or goroutines. For Java, we are
         | using a combination of ThreadLocal tracing and Runnable tracing
         | to support use cases like reactive and multithreaded
         | applications.
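          | 
          | (The Go pattern that assumption refers to looks roughly like
          | this; the function names and URL are made up:)
          | 
          |     package main
          |     
          |     import (
          |         "context"
          |         "net/http"
          |     )
          |     
          |     func handle(w http.ResponseWriter, r *http.Request) {
          |         // ctx is passed explicitly down the call chain, so the
          |         // outgoing call can be tied to the incoming request.
          |         ctx := r.Context()
          |         go fetchProfile(ctx) // same ctx even across goroutines
          |         callBackend(ctx)
          |     }
          |     
          |     func callBackend(ctx context.Context) {
          |         req, _ := http.NewRequestWithContext(ctx, "GET", "http://backend/api", nil)
          |         http.DefaultClient.Do(req)
          |     }
          |     
          |     func fetchProfile(ctx context.Context) { /* ... */ }
          |     
          |     func main() {
          |         http.ListenAndServe(":8080", http.HandlerFunc(handle))
          |     }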
        
           | camel_gopher wrote:
           | That's a very big assumption, at least for Go based
           | applications.
        
             | edenfed wrote:
              | We are also thinking of implementing a fallback mechanism
              | to automatically propagate context on the same goroutine
              | if context.Context is not passed.
        
             | nulld3v wrote:
             | I don't think it's unreasonable, you need a Context to make
             | a gRPC call and you get one when handling a gRPC call. It
             | usually doesn't get lost in between.
        
               | otterley wrote:
               | True for gRPC, but not necessarily for HTTP - the HTTP
               | client and server packages that ship with Go predate the
               | Context package by quite a long while.
        
           | spullara wrote:
           | Going to be rough for supporting virtual threads then?
        
             | edenfed wrote:
              | We have a solution for virtual threads as well. Currently
              | working on a blog post describing exactly how; will
              | update once it's released.
        
             | marwis wrote:
             | ScopedValue solves that problem: https://docs.oracle.com/en
             | /java/javase/21/docs/api/java.base...
        
         | rocmcd wrote:
         | 100%. Context propagation is _the_ key to distributed tracing,
         | otherwise you're only seeing one side of every transaction.
         | 
         | I was hoping odigos was language/runtime-agnostic since it's
         | eBPF-based, but I see it's mentioned in the repo that it only
         | supports:
         | 
         | > Java, Python, .NET, Node.js, and Go
         | 
         | Apart from Go (that is a WIP), these are the languages already
         | supported with Otel's (non-eBPF-based) auto-instrumentation.
         | Apart from a win on latency (which is nice, but could in theory
         | be combated with sampling), why else go this route?
        
           | edenfed wrote:
           | eBPF instrumentation does not require code changes,
           | redeployment or restart to running applications.
           | 
           | We are constantly adding more language support for eBPF
           | instrumentation and are aiming to cover the most popular
           | programming languages soon.
           | 
            | Btw, not sure that sampling is really the solution to combat
            | overhead; after all, you probably do want that data. Trying
            | to fix a production issue when the data you need is missing
            | due to sampling is not fun.
        
             | rocmcd wrote:
             | All good points, thank you.
             | 
             | What's the limit on language support? Is it theoretically
             | possible to support any language/runtime? Or does it come
             | down to the protocol (HTTP, gRPC, etc) being used by the
             | communicating processes?
        
               | edenfed wrote:
                | We already solved compiled languages (Go, C, Rust) and
                | JIT languages (Java, C#). Interpreted languages (Python,
                | JS) are the only ones left; hopefully we will solve
                | these soon as well. The big challenge is supporting all
                | the different runtimes; once that is solved,
                | implementing support for different protocols / open-
                | source libraries is not as complicated.
        
               | jetbalsa wrote:
               | Got to get PHP on that list :)
        
               | phillipcarter wrote:
               | FWIW it's theoretically possible to support any
               | language/runtime, but since eBPF is operating at the
               | level it's at, there's no magic abstraction layer to plug
               | into. Every runtime and/or protocol involves different
               | segments of memory and certain bytes meaning certain
               | things. It's all in service towards having no additional
               | requirements for an end-user to install, but once you're
               | in eBPF world everything is runtime-and-protocol-and-
               | library-specific.
        
         | RonFeder wrote:
          | The eBPF programs handle passing the context through the
          | requests by adding a field to the header, as you mentioned.
          | The injected field follows the W3C Trace Context standard.
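          | 
          | (For reference, the injected W3C Trace Context header looks
          | like this; the IDs are just the example values from the
          | spec:
          | 
          |     traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
          | 
          | i.e. version, 16-byte trace ID, 8-byte parent span ID, and
          | flags, all hex-encoded.)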
        
       | heyeb_25 wrote:
        | If I have manually implemented all my logs, what do I need to
        | do to move to Odigos?
        
         | edenfed wrote:
          | Nothing special; if you are working on Kubernetes it's as
          | easy as running the `odigos install` CLI and pointing it to
          | your current monitoring system.
        
       | bakery_bake wrote:
       | According to what you say, nobody should implement logs manually?
       | I will check Odigos.
        
         | edenfed wrote:
          | Logs are an easy and familiar API for adding additional data
          | to your traces. They still have their place; Odigos just adds
          | much more context.
        
       | jrockway wrote:
       | They don't really show any of the settings they used, but for
       | traces, I imagine if you have a reasonable sampling rate, then
       | you aren't going to be running any code for most requests, so it
       | won't increase latency. (Looking at their chart, I guess they are
       | sampling .1% of requests, since 99.9% is where latency starts
        | increasing. I am not sure if I would trace .1% of page loads to
       | google.com, as their table implies. Rather, I'd pick something
       | like 1 request per second, so that latency does not increase as
       | load increases.)
       | 
       | A lot of Go metrics libraries, specifically Prometheus, introduce
       | a lot of lock contention around incrementing metrics. This was
       | unacceptably slow for our use case at work and I ended up writing
       | a metrics system that doesn't take any locks for most cases.
       | 
       | (There is the option to introduce a lock for metrics that are
       | emitted on a timed basis; i.e. emit tx_bytes every 10s or 1MiB
       | instead of at every Write() call. But this lock is not global to
       | the program; it's unique to the metric and key=value "fields" on
       | the metric. So you can have a lot of metrics around and not
        | contend on locks.)
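        | 
        | (A rough sketch of the "no locks on the hot path" idea -- not
        | the actual code -- where each metric owns its own atomic
        | counter, so incrementing never touches a shared lock:)
        | 
        |     package main
        |     
        |     import (
        |         "fmt"
        |         "sync/atomic"
        |     )
        |     
        |     type counter struct {
        |         v atomic.Int64
        |     }
        |     
        |     func (c *counter) Inc(n int64) { c.v.Add(n) }
        |     
        |     var txBytes counter
        |     
        |     func main() {
        |         txBytes.Inc(4096) // hot path: a single atomic add, no lock
        |         fmt.Println("tx_bytes:", txBytes.v.Load())
        |     }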
       | 
       | The metrics are then written to the log, which can be processed
       | in real time to synthesize distributed traces and prometheus
       | metrics, if you really want them:
       | https://github.com/pachyderm/pachyderm/blob/master/src/inter...
       | (Our software is self-hosted, and people don't have those systems
       | set up, so we mostly consume metrics/traces in log form. When
       | customers have problems, we prepare a debug bundle that is mostly
       | just logs, and then we can further analyze the logs on our side
       | to see event traces, metrics, etc.)
       | 
       | As for eBPF, that's something I've wanted to use to enrich logs
       | with more system-level information, but most customers that run
       | our software in production aren't allowed to run anything as
       | root, and thus eBPF is unavailable to them. People will tolerate
       | it for things like Cilium or whatever, but not for ordinary
       | applications that users buy and request that their production
       | team install for them. Production Linux at big companies is super
       | locked down, it seems, much to my disappointment. (Personally, my
       | threat model for Linux is that if you are running code on the
       | machine, you probably have root through some yet-undiscovered
       | kernel bug. Historically, I've been right. But that is not the
       | big companies' security teams' mental model, it appears. They
       | aren't paranoid enough to run each k8s pod in a hypervisor, but
       | are paranoid enough to prevent using CAP_SYS_ADMIN or root.)
        
         | edenfed wrote:
          | Thanks for the valuable feedback! We used a constant
          | throughput of 10,000 rps. The exact testing setup can be
          | found under "how we tested".
         | 
          | I think the example you gave of the lock used by the
          | Prometheus library is a great illustration of why generating
          | traces/metrics is a great fit for offloading to a different
          | process (an agent).
          | 
          | Pachyderm looks very interesting, however I am not sure how
          | you can generate distributed traces based on metrics; how do
          | you fill in the missing context propagation?
          | 
          | Our way to deal with eBPF root requirements is to be as
          | transparent as possible. This is why we donated the code to
          | the CNCF and are developing it as part of the OpenTelemetry
          | community. We
         | hope that being open will make users trust us. You can see the
         | relevant code here: https://github.com/open-
         | telemetry/opentelemetry-go-instrumen...
        
           | jrockway wrote:
           | > I am not sure how you can generate distributed traces based
           | on metrics
           | 
           | Every log line gets an x-request-id field, and then when you
           | combine the logs from the various components, you can see the
            | propagation throughout our system. The request ID is a
            | UUIDv4, but the mandatory "4" version nibble gets replaced
            | with a digit that represents where the request came from;
            | background
           | task, web UI, CLI, etc. I didn't take the approach of
           | creating a separate span ID to show sub-requests. Since you
           | have all the logs, this extra piece of information isn't
           | super necessary though my coworkers have asked for it a few
           | times because every other system has it.
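            | 
            | (Illustrative sketch of that ID scheme, not the actual
            | implementation; the origin codes are invented:)
            | 
            |     package main
            |     
            |     import (
            |         "crypto/rand"
            |         "fmt"
            |     )
            |     
            |     const (
            |         originWeb  = 0x1
            |         originCLI  = 0x2
            |         originTask = 0x3
            |     )
            |     
            |     func newRequestID(origin byte) string {
            |         var b [16]byte
            |         if _, err := rand.Read(b[:]); err != nil {
            |             panic(err)
            |         }
            |         // Overwrite the nibble that would normally say
            |         // "version 4" with a digit identifying the origin.
            |         b[6] = origin<<4 | b[6]&0x0f
            |         return fmt.Sprintf("%x-%x-%x-%x-%x",
            |             b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
            |     }
            |     
            |     func main() {
            |         fmt.Println(newRequestID(originWeb))
            |     }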
           | 
           | Since metrics are also log lines, they get the request-id, so
           | you can do really neat things like "show me when this
           | particular download stalled" or "show me how much bandwidth
           | we're using from the upstream S3 server". The aggregations
           | can take place after the fact, since you have all the raw
           | data in the logs.
           | 
           | If we were running this such that we tailed the logs and sent
           | things to Jaeger/Prometheus, a lot of this data would have to
           | go away for cardinality reasons. But squirreling the logs
           | away safely, and then doing analysis after the fact when a
           | problem is suspected ends up being pretty workable. (We still
           | do have a Prometheus exporter not based on the logs, for
           | customers that do want alerts. For log storage, we bundle
           | Loki.)
        
       | otterley wrote:
       | The column in the table claiming the "number of page loads that
       | would experience the 99th %ile" is mathematically suspect. It
       | directly contradicts what a percentile is.
       | 
       | By definition, at 99th percentile, if I have 100 page loads, the
       | _one_ with the worst latency would be over the 99th percentile.
        | That's not 85.2%, 87.1%, 67.6%, etc. The formula shown in that
       | column makes no sense at all.
        
         | edenfed wrote:
          | I recommend watching Gil Tene's talk; I think he explains the
          | math better than I do:
         | https://www.youtube.com/watch?v=lJ8ydIuPFeU
        
         | tpankaj wrote:
          | That's not what that column is supposed to mean, afaict. The
          | way I read it, it's showing: if the website requires hundreds
          | of different parallel backend service calls to serve a page
          | load, what's the probability that a page load hits the p99
          | instrumentation latency?
         | 
         | We have a similar chart at my job to illustrate the point that
         | high p99 latency on a backend service doesn't mean only 1% of
         | end-user page loads are affected.
        
           | otterley wrote:
           | Ah, I see. So, for example, if one page request would result
           | in 190 different backend requests to fulfill, then the
            | probability that at least one of those subrequests exceeds
           | the 99th percentile would be 85.2%. That makes a lot more
           | sense.
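            | 
            | (Worked out, assuming independent subrequests: P(at least
            | one of N subrequests hits the p99 tail) = 1 - 0.99^N, so
            | for N = 190 that's 1 - 0.99^190 ~= 0.852, i.e. the 85.2%
            | in the table.)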
        
         | bjt12345 wrote:
         | But what if the 100 page loads are just a sample of the
         | population?
        
       | chabad360 wrote:
       | How hard is it to use Odigos without k8s? We mainly use docker
       | compose for our deployments (because it's convenient, and we
       | don't need scale), but I'm having trouble finding anything in the
       | documentation that explains the mechanism for hooking into the
       | container (and hence I have no clue how to repurpose it).
        
         | edenfed wrote:
          | We currently support just Kubernetes environments. docker-
          | compose, VMs, and serverless are on our roadmap and will be
          | ready soon.
        
       | ranting-moth wrote:
        | Website doesn't display correctly on FF on Android. Text
        | bleeds on the left and right sides.
        
         | edenfed wrote:
          | Thank you for reporting, will fix ASAP.
        
       | zengid wrote:
       | Anyone from the dtrace community want to enlighten a n00b about
       | how eBPF compares to what dtrace does?
        
         | zengid wrote:
         | From the hot takes in this post from 2018 [0], I may be asking
         | a contentious question.
         | 
         | [0] https://news.ycombinator.com/item?id=16375938
        
         | edenfed wrote:
         | I don't have a lot of experience using dtrace, but AFAIK the
         | big advantage of eBPF over dtrace is that you do not need to
         | instrument your application with static probes during coding.
        
           | tanelpoder wrote:
           | DTrace (on Solaris at least) can instrument any userspace
           | symbol or address, no need for static tracepoints in the app.
           | 
           | One problem that DTrace has is that the "pid" provider that
           | you use for userspace app tracing only works on processes
           | that are already running. So, if more processes with the
           | executable of interest launch after you've started DTrace,
           | its pid provider won't catch the new ones. Then you end up
           | doing some tricks like tracking exec-s of the binary and
           | restarting your DTrace script...
        
             | bcantrill wrote:
             | That's not exactly correct, and is merely a consequence of
             | the fact that you are trying to use the pid provider. The
             | issue that you're seeing is that pid probes are created on-
             | the-fly -- and if you don't demand that they are created in
             | a new process, they in fact won't be. USDT probes generally
             | don't have this issue (unless they are explicitly lazily
             | created -- and some are). So you don't actually need/want
             | to restart your DTrace script, you just want to force
             | probes to be created in new processes (which will
             | necessitate some tricks, just different ones).
        
               | tanelpoder wrote:
               | So how would you demand that they'd be created in a new
               | process? I was already using pid* provider years ago when
               | I was working on this (and wasn't using static compiled-
               | in tracepoints).
        
         | bcantrill wrote:
         | They're really very different -- with very different origins
         | and constraints. If you want to hear about my own experiences
         | with bpftrace, I got into this a bit recently.[0] (And in fact,
         | one of my questions about the article is how they deal with
         | silently dropped data in eBPF -- which I found to be pretty
         | maddening.)
         | 
         | [0] https://www.youtube.com/watch?v=mqvVmYhclAg#t=12m0s
        
           | edenfed wrote:
           | By dropped data do you mean by exceeding the size of the
           | allocated ring buffer/perf buffer? If so this is configurable
            | by the user, so you can adjust it according to the expected
            | load.
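            | 
            | Illustrative only (not Odigos code): with cilium/ebpf's
            | perf reader, the per-CPU buffer size is a knob the
            | userspace agent can expose to users, and the reader at
            | least reports how many samples were lost on overflow. The
            | pinned map path is a placeholder.
            | 
            |     package main
            |     
            |     import (
            |         "log"
            |         "os"
            |     
            |         "github.com/cilium/ebpf"
            |         "github.com/cilium/ebpf/perf"
            |     )
            |     
            |     func main() {
            |         events, err := ebpf.LoadPinnedMap("/sys/fs/bpf/events", nil)
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         defer events.Close()
            |     
            |         // e.g. 64 pages per CPU; make this configurable for
            |         // high expected load.
            |         rd, err := perf.NewReader(events, 64*os.Getpagesize())
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         defer rd.Close()
            |     
            |         for {
            |             rec, err := rd.Read()
            |             if err != nil {
            |                 log.Fatal(err)
            |             }
            |             if rec.LostSamples > 0 {
            |                 // The reader does tell you *that* samples were
            |                 // dropped, so the agent can surface it.
            |                 log.Printf("dropped %d samples; buffer too small?",
            |                     rec.LostSamples)
            |             }
            |             _ = rec.RawSample
            |         }
            |     }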
        
             | bcantrill wrote:
             | eBPF can drop data silently under quite a few conditions,
             | unfortunately. And -- most frustratingly -- it's silent, so
             | it's not even entirely clear which condition you've fallen
              | into. This alone is a pretty significant difference with
              | respect to DTrace: when/where DTrace drops data, there is
              | _always_ an indicator as to why. And to be clear, this
              | isn't a
             | difference merely of implementation (though that too,
             | certainly), but of principle: DTrace, at root, is a
             | debugger -- and it strives to be as transparent to the user
             | as possible as to the truth of the underlying system.
        
           | zengid wrote:
           | I listened to this live! That's probably why I was wondering,
           | because I remember you talking about something you used in
           | Linux that didn't quite live up to your expectations with
           | DTrace, but I didn't catch all of the names. Thanks!
        
       | Thaxll wrote:
        | Of course it outperforms it, but it's basic instrumentation.
        | How do you properly select the labels, for example? In your
        | application you will have custom instrumentation for business
        | logic, so what do you do? Now you have two systems
        | instrumenting the same app?
        
         | edenfed wrote:
         | You can enrich the spans created by eBPF by using OpenTelemetry
         | APIs as usual, the eBPF instrumentation is a replacement for
         | the instrumentation SDK. The eBPF program will detect the data
          | recorded via the APIs and will add it to the final trace,
          | combining both automatically and manually created data.
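          | 
          | A sketch of that enrichment pattern (attribute names are made
          | up): business code keeps using the standard OTel API, and the
          | recorded attributes end up on the automatically created span.
          | 
          |     package main
          |     
          |     import (
          |         "context"
          |         "net/http"
          |     
          |         "go.opentelemetry.io/otel/attribute"
          |         "go.opentelemetry.io/otel/trace"
          |     )
          |     
          |     func handleOrder(ctx context.Context, w http.ResponseWriter, r *http.Request) {
          |         // Attach business attributes to whatever span is in ctx.
          |         span := trace.SpanFromContext(ctx)
          |         span.SetAttributes(
          |             attribute.String("order.id", r.URL.Query().Get("id")),
          |             attribute.String("customer.tier", "gold"),
          |         )
          |         span.AddEvent("payment authorized")
          |         // ... business logic ...
          |     }
          |     
          |     func main() {
          |         http.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
          |             handleOrder(r.Context(), w, r)
          |         })
          |         http.ListenAndServe(":8080", nil)
          |     }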
        
       ___________________________________________________________________
       (page generated 2023-10-30 23:00 UTC)