[HN Gopher] Pprof++: A Go Profiler with Hardware Performance Mon...
___________________________________________________________________
 
Pprof++: A Go Profiler with Hardware Performance Monitoring
 
Author : vquemener
Score  : 198 points
Date   : 2021-05-11 14:32 UTC (8 hours ago)
 
(HTM) web link (eng.uber.com)
(TXT) w3m dump (eng.uber.com)
 
| avinassh wrote:
| On a similar note, does a similar tool exist for Rust?
| 
| ithkuil wrote:
| I know one crate that offers a Go-like pprof experience:
| https://github.com/tikv/pprof-rs
| 
| But that crate doesn't (yet?) use the hardware-based perf
| counters, so this may not answer your actual question.
| 
| Veserv wrote:
| It is so odd that companies with such immense engineering budgets
| and computational costs continue to use low-resolution profiling
| techniques such as sampling instead of modern high-resolution
| tracing-based solutions. The gap in insight between seeing some
| things and seeing literally everything is similar to the gap
| between no profiler and a sampling profiler. You can actually
| trace the direct causes of slowness instead of just the symptoms.
| The difference is so massive that every time we move a system
| that was optimized using a sampling profiler to a tracing
| profiler, we learn just how little we knew about how it worked
| and how badly the program was behaving, and we routinely see
| 30%-50% speedups from those insights on even heavily optimized
| code.
| 
| The only real downsides to tracing over sampling are that the
| overhead is higher than low-resolution sampling down to ~100
| usec per sample (single-digit to low-double-digit overhead), and
| that any performant solution basically requires instrumentation.
| That means a small overhead if you want it to be runtime-
| toggleable, and tooling support to inject it. However, given the
| immense benefits, it is baffling that companies continue to
| struggle along with such minimal tooling, and with platforms
| where only such tooling is available.
| hodgesds wrote:
| Nice to see Milind and the rest of the team are still making
| progress! One of the alternatives explored in the original
| proposal was using a Go library that wraps perf, which I explored
| quite a bit previously. I haven't had time to continue that work,
| but it is possible to use a perf library within benchmarks (ex:
| https://github.com/hodgesds/compressbench) with some success.
| Getting this profiler merged into the runtime would be really
| nice. Alternatively, if someone wants to extend a perf library
| that is able to resolve runtime stacks using eBPF, that could be
| an interesting alternative.
| 
| jrockway wrote:
| I've also used `perf` to profile Go apps, and the results can be
| pretty interesting. For example, the Go profiler doesn't continue
| out of the process and into the kernel, so are your I/O waits on
| the network, or are they on the local filesystem? `perf` can show
| you the difference. (And it does come up in the real world; you
| can often have Go functions that operate on io.Readers that are
| connected to the network, or are connected to disk files. A
| profile that extends into the kernel can show you which case is
| slow.)
| 
| I've been consistently disappointed by Go's mutex profiling. It
| is kind of useful if you can synthesize the contention, but less
| useful if the contention isn't obvious. For my most recent
| contention debugging issue, I was very close to modifying the
| runtime (mostly sync.Mutex) to emit events that perf could sample
| on, and building a flame graph for cases like "took the fast path
| on sync.Mutex". What takes the fast path under synthetic load
| could easily take the slow path under different load, and that
| would let you identify the bottlenecks before they occur. I ended
| up not needing it, so I didn't do it, but it's interesting to me
| that other people have run into cases where the built-in profiles
| aren't quite enough. There are definitely a lot of improvements
| to be made here.
| 
| (Getting ahead of lock contention has come up a lot in my Go
| career. When I was at Google, I had an application that computed
| metrics over hundreds of thousands of concurrent network streams.
| I used the standard library for metrics, which synchronized on
| sync.RWMutex, and it was just too slow to synchronize millions of
| events per second. This sort of bottleneck will be obvious to
| anyone, probably; you simply can't block all metric-increments on
| reads every 15 seconds and expect to fully utilize 96 vCPUs that
| are trying to write ;) No profile ever pointed me at the culprit
| in a straightforward manner; I had to dig into the time-per-
| assembly-instruction profile, where I noticed an inordinate
| amount of time being spent in an instruction incidental to the
| implementation of sync.Mutex. I ended up switching to
| atomic.AddInt64, which solved my problem. I think the overhead of
| a write barrier is actually too heavy and I wouldn't do that
| today; I'd just maintain goroutine-local counters and synchronize
| them occasionally. You can still fall behind, of course, but your
| program can easily detect that case and log a message like "we're
| too slow, buy another computer!" in that branch. Easier than
| reading the profiles.)
| 
| jrockway wrote:
| Thought it might also be helpful to add that I usually invoke
| perf like this:
| 
|     perf record --call-graph dwarf -e cycles --switch-events \
|         --sample-cpu -p $(pidof my-go-process)
| 
| And then view the results in hotspot
| (https://github.com/KDAB/hotspot).
| 
| I played a little with off-CPU profiling (-e sched:sched_switch),
| but Go kind of outsmarts that by doing its own scheduling (taking
| goroutines off Machines when it thinks it's going to block inside
| a syscall), so the resulting flame graphs are usually rooted in
| runtime.park_m; though if the runtime gets something wrong, you'd
| be able to see it here.
| jeffbee wrote:
| A lot of people don't grok the contention profiler, even people
| who work on service performance full-time. The problem is the
| docs. What you're getting from the contention profile is samples
| of call stacks for contended releases, that is, when a lock is
| released and there are waiters. This isn't explained anywhere,
| to my recollection.
| 
| felixge wrote:
| Yup, there is an upstream issue to improve this:
| https://github.com/golang/go/issues/44920
| 
| FWIW I'm planning to work on the problem from downstream as well
| by adding more docs here:
| https://github.com/felixge/go-profiler-notes . I haven't gotten
| to the mutex profile yet, but I've covered the block profile in
| great detail. It actually overlaps with the mutex profile for
| contention, but it tracks Lock() rather than Unlock(), which is
| subtly different, as you pointed out!
| 
| felixge wrote:
| Very interesting points about mutex profiles. When you say fast
| path, do you mean spin locks?
| 
| For analyzing I/O via perf and looking at kernel stacks, I'm
| surprised that would work. I'd think that perf will not show you
| much waiting on the network, because the netpoller
| implementation in Go uses epoll and friends to make that non-
| blocking under the hood. For disk I/O it probably works. Maybe
| I'm missing something?
| 
| jrockway wrote:
| There's a CompareAndSwap to do a quick acquire when there isn't
| contention:
| https://github.com/golang/go/blob/master/src/sync/mutex.go#L...
| 
| If you look at the slow path, there are many interesting paths
| that a lock can take in there. They are all potentially
| interesting when you're debugging contention.
| 
| jeffbee wrote:
| Trying to figure out how this would be of use to anyone else, in
| practice. You need a hacked-up runtime and a privileged process
| to use the PMU. Compared to just using perf, it seems like a
| hassle.
| 
| jrockway wrote:
| I'm happy to rebuild the binary when I'm stuck on a performance
| problem.
| Obviously, the best case is having all the instrumentation on
| the production system that is experiencing problems. I've run
| into problems that only manifest themselves after running for
| several hours, so the debugging cycle is slow when you aren't
| collecting what you need to debug the issue. In other cases,
| I've had bugs where a test case can reproduce them immediately,
| but the profiler still provides more targeted guesses than
| randomly changing something to see if it improves the outcome
| (even if the edit/test cycle is fast enough to support this).
| That's the case where this sounds helpful.
| 
| petethepig wrote:
| This is really cool stuff; we at Pyroscope [1] would love to add
| support for it.
| 
| Too bad it's a fork of the Go runtime. As @akshayshah mentioned,
| it seems like the Go team is hesitant about merging this
| outright. I wonder if a better approach would be to open up Go
| runtime APIs so that this can be a module instead of a whole
| fork. I think opening up these APIs would also enable more
| innovation in the profiling space.
| 
| [1]: https://github.com/pyroscope-io/pyroscope
| 
| rob74 wrote:
| "Modularizing" the runtime sounds like a good idea, but it also
| adds new constraints: they would have to document the interfaces
| between the various modules and then either keep them unchanged,
| which would hinder further development, or, if they change the
| module interfaces, the modules will have to be updated anyway,
| so having modules won't be much of an advantage...
| 
| akshayshah wrote:
| Full disclosure: I used to work in Uber's infrastructure group.
| 
| I'm always happy to see improvements to Go's profiling - the
| standard library's profiling is good, but it hasn't improved
| much since Go 1.0 (it took years just to get flame graphs
| integrated, and it's still near-impossible to get a usable dump
| of the in-memory object graph).
| 
| That said, I'm _very_ wary of tools that require a fork of the
| compiler and/or runtime. Uber's programming language research
| group is small (though mighty) and has a lot of demands on their
| time. Even over the medium term, it'll be hard to allocate time
| to keep this project current.
| 
| In the Go issue tracker, there's a long discussion [1] of the
| formal proposal [2] for these improvements. My understanding of
| the feedback is that the changes introduce too many Linux- and
| x86-specific APIs. I'd love to see the author narrow the
| proposal to just improving timer fidelity when the OS exposes
| the PMU, without any new APIs. The more controversial proposal
| to collect richer platform- and architecture-specific data can
| be discussed separately.
| 
| Discussion [1]: https://github.com/golang/go/issues/36821
| 
| Proposal [2]:
| https://go.googlesource.com/proposal/+/refs/changes/08/21950...
| 
| dvfjsdhgfv wrote:
| Forgive my ignorance, but I have a very general question. The
| article says Go runs on millions of cores at Uber. Obviously,
| the company is interested in maximizing performance, hence the
| Pprof++ project.
| 
| My experience with Go is very limited, but in my tests it was
| always slower than C. Sometimes just a bit, sometimes 2-5 times.
| So my question is: looking back, don't you guys regret choosing
| this language?
| 
| Please don't misunderstand me, I don't intend to start any
| flamewars, but it seems like you're very much focused on CPU-
| bound performance, and the choice of language is not neutral in
| this case.
| 
| studmuffin650 wrote:
| Full disclosure: I don't work at Uber, but at another large
| enterprise that is at Uber's scale and is almost 100% Go across
| hundreds of microservices.
| 
| Inside an enterprise, there's more to a language than just the
| performance (though that is a large factor).
| You also have to take into account existing tooling (both
| internal and external), developer experience, whether you can
| find enough developers to write it, and source-code
| maintainability, as well as many other common concerns with a
| language. Most languages will do well in a few cases, but none
| are best in class in all cases (or everyone would use it). Go
| does well enough in the performance category while also doing
| moderately to extremely well in other categories. In CPU-bound
| tasks that don't rely on cgo, Go does extremely well in my
| experience. In general, though, for most enterprises, Go strikes
| a happy medium and makes the trade-offs that most developers are
| willing to make.
| 
| hu3 wrote:
| I doubt most of Uber's microservices are CPU-bound. Yes, it's
| natural for a fleet of this size to also be scrutinized for CPU
| savings, but that's not the only cost involved when choosing a
| language. Good tooling, developer experience, and the
| availability of competent engineers come to mind.
| 
| akshayshah wrote:
| At least when I was at Uber (I left in mid-2019), this was
| absolutely the case: most services weren't CPU-bound, but a few
| were. Rewriting the CPU-bound services in C++, Rust, or some
| other language might have helped, but would have required a
| _ton_ of well-understood, reliable libraries for networking,
| observability, authn/authz, database access, etc.
| 
| In nearly all cases, there was plenty of room to make the Go
| service faster. A more careful choice of data structure and
| algorithm, finer-grained locking, fan-out across a goroutine
| pool, or just avoiding a zillion heap allocations solved most
| problems. I don't recall any cases where Go was simply incapable
| of meeting our performance needs.
| 
| As a side benefit, services with more stringent performance
| requirements often exposed inefficiencies in widely used
| libraries. Optimizing those libraries made every Go service
| faster and cut our overall compute spend.
| Avoiding rewrites in C++ or Rust let those wins add up over
| time.
| 
| arp242 wrote:
| Facebook managed to optimize Mercurial (Python) to be faster
| than git (C) for their huge-monorepo use case, simply because
| Python made it so much easier to implement more efficient
| algorithms.
| 
| If you have some fairly simple function/task, then yeah, a C
| version will probably blow the Go version away almost all of the
| time. But that's not necessarily indicative of the real-world
| performance of a full application.
| 
| And of course, there are interests other than "maximize
| performance" too, such as developer efficiency.
| 
| ashkankiani wrote:
| Ya got a source/article on that? I'm curious about it.
| 
| Overall I agree. I'd take a speed hit for ease of development
| most of the time, but there are degrees of speed hit that are
| acceptable depending on the context.
| 
| akshayshah wrote:
| Scaling Mercurial at Facebook:
| https://engineering.fb.com/2014/01/07/core-data/scaling-merc...
| 
| skybrian wrote:
| You could get a 5x slowdown like that for an interpreter written
| in Go (versus C), but that's a special case, and usually it
| could be fixed by switching from an interpreter to a compiler
| (code generation).
| 
| heleninboodler wrote:
| As both a C and Go developer, in my experience a 2-5x increase
| in execution time going from C to Go does not sound reasonable
| at all.
| 
| domano wrote:
| I think finding a good balance between performance and
| development time & risk is important, and C does not seem like a
| very good choice regarding the second part. At least for web
| stuff.
| 
| PragmaticPulp wrote:
| Thanks for the context. Good to keep in mind for the long term.
| 
| That said, I'm very thankful that tools like this are being
| shared with the community, even if they're less than perfect.
| It's great that we have access to so many tools and so much
| research.
| 
| akshayshah wrote:
| Agreed!
| This is super cool, and I hope a working prototype makes
| discussion of the formal proposal move a little faster.
| 
| vvern wrote:
| > it's still near-impossible to get a usable dump of the in-
| memory object graph
| 
| This, lately, is my #1 gripe. I just cannot get viewcore to
| provide valuable insights. Anybody know of any projects in this
| space?
| 
| flakiness wrote:
| They also forked Android to do a similar profiling improvement
| [1] and even wrote a paper on it [2]. It's an impressive
| achievement, and I hope they tried to upstream it as well.
| 
| The Android ART profiler today is still kind of limited (too
| high overhead, or too imprecise), so we tend to switch over to
| simpleperf [3]. However, I think there are things that only
| in-language profilers can do.
| 
| [1] https://eng.uber.com/nanoscope/
| [2] https://2018.splashcon.org/details/vmil-2018/9/Profiling-Android-Applications-with-Nanoscope
| [3] https://developer.android.com/ndk/guides/simpleperf
| 
| etep wrote:
| I'm curious about the results under the section titled
| "Demonstration."
| 
| The author claims that the ground truth is that each goroutine
| utilizes 10% of the CPU time (stipulated that this _should_ be
| the case). But what if the results shown are accurate, i.e. they
| reflect the actual CPU time (because of idiosyncrasies of
| scheduling between the OS, the Go runtime, and anything else
| happening on that system)?
| 
| Does running the new profiler show less variance in the results
| from that initial experiment? Showing this would strengthen the
| claim that the "out of the box" solution is inaccurate.
| 
| felixge wrote:
| I think this is great work. Hopefully it can motivate upstream
| to reconsider the proposal. In the long run, I don't think
| anybody wants to run a forked version of Go for better
| profiling.
| 
| > The ability to monitor go programs with a very high sampling
| > frequency -- up to 10s of microseconds
| 
| I'm curious about the overhead of this and will probably try to
| measure it. This work certainly overcomes the limitations that
| setitimer(2) places on the sampling rate, but faster sampling is
| going to increase the performance overhead significantly.
| 
| In particular, Go's built-in stack unwinding APIs use gopclntab
| (which is Go's Plan 9-inspired version of eh_frame). This is
| rather slow compared to frame-pointer unwinding (up to 55x in my
| testing). I think on average you can expect to spend at least 1
| microsecond on unwinding, and then some additional time hashing
| the stack and updating the in-memory profile (perhaps another
| microsecond? I haven't measured this part yet). I've started
| writing some of this up here [1], but it's still WIP.
| 
| Anyway, based on the above, sampling every 10 usec will likely
| slow the target program by ~20%. This is probably a very
| optimistic estimate because it doesn't take into account all the
| CPU cache thrashing that will happen.
| 
| As I said, I'll try to take a closer look later, but if anybody
| has some insights/comments here I'd be very interested.
| 
| [1] https://github.com/DataDog/go-profiler-notes/blob/main/stack...
| 
| Disclaimer: I work at Datadog on Continuous Go Profiling, but
| I'm here speaking for myself.
| 
| akshayshah wrote:
| I suspect you're right about overhead. 10 usec seems much more
| aggressive than even the author is actually suggesting, though -
| the motivating example was a service with latencies in the 10s
| of ms, so sampling every few ms would be sufficient.
| 
| Also, if anyone reading this isn't familiar with sampling
| profilers, Felix gave a great talk explaining them:
| https://github.com/felixge/talks/blob/master/fgprof/
| 
| felixge wrote:
| Yeah, decent sampling for requests that last 10s of ms seems
| absolutely feasible with this approach, unlike with the current
| setitimer(2) stuff.
| 
| I'm really looking forward to taking it for a spin soon and
| seeing how far the sampling rate can be pushed without
| introducing noticeable overhead in practice!
___________________________________________________________________
(page generated 2021-05-11 23:00 UTC)