[HN Gopher] Hotspot performance engineering fails
___________________________________________________________________
Hotspot performance engineering fails
Author : slimsag
Score  : 97 points
Date   : 2023-04-27 16:40 UTC (6 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| PaulHoule wrote:
| I went through a time when I was pitching a "boxes-and-lines"
| data processing tool like
|
| https://www.knime.com/
|
| which more-or-less passed JSON documents (instead of SQL rows)
| over the lines and found that the kind of people who bought and
| financed database startups wouldn't touch anything that couldn't
| be implemented with columnar processing.
|
| I thought that this kind of system would advance the "low code"
| nature of these systems, because with relational rows many kinds
| of data processing require splitting up the data into streams and
| joining them, whereas an object-relational system lets you
| localize processing in a small area of the graph and also reuse
| parts of a computation.
|
| Columnar processing is so much faster than row-based processing,
| and most investors and partners thought that customers _really_
| needed speed at the expense of being able to write simpler
| pipelines. Even though I had a nice demo of a hybrid
| batch/stream processing system (that gave correct answers), none
| of them cared. Thus, from one viewpoint, architecture is
| everything.
|
| (Funny though, I later worked for a company that had a system
| like this that wasn't quite sure what algebra the tool worked on,
| and the tool didn't quite always get the same answer on each
| run...)
| NovemberWhiskey wrote:
| I don't think I entirely agree with the premise here. Yes, it is
| extremely difficult to engineer performance in after the fact;
| but assuming you've got an architecture that's basically fit for
| purpose (from the performance perspective), then improving by
| targeting hotspots is sound, isn't it? That's literally Amdahl's
| law.
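NovemberWhiskey's appeal to Amdahl's law can be made concrete; a minimal Python sketch of the formula (the numbers are illustrative only, not from the article or the thread):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of total runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# A hotspot taking 40% of runtime, made effectively infinitely fast,
# still caps the overall speedup near 1/(1 - 0.4):
print(round(amdahl_speedup(0.4, 1e9), 2))  # -> 1.67
```

The same formula carries both readings in the thread: targeting the hotspot is sound while p is large, but once the hotspot is gone the remaining (1 - p) bounds any further gain, which is the "it can only take you so far" side.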
| Sesse__ wrote:
| Amdahl's law is specifically about the futility of optimizing
| by removing hotspots... (Or rather, that it can only take you
| so far.)
| tasubotadas wrote:
| The guy invents a strawman to justify premature optimization.
| MattPalmer1086 wrote:
| It was interesting to read about why hotspots are not the whole
| story in performance. They are still important though.
|
| Facebook may have the resources and/or need to do complete
| rewrites of everything to squeeze out more performance, but most
| companies don't.
|
| I've personally improved performance of a lot of code
| significantly by identifying hot spots. So calling hotspot
| performance engineering a fail seems a bit unnecessarily
| provocative.
| jesse__ wrote:
| > Facebook may have the resources and/or need to do complete
| rewrites of everything to squeeze out more performance, but
| most companies don't.
|
| Actually, if you watch the video Casey put together, he very
| clearly demonstrates most companies _do_.
| continuational wrote:
| This is true: when you keep optimizing, you soon face death from
| a thousand paper cuts. But often, it's enough to find that
| bottleneck and make it a few times faster.
| hinkley wrote:
| The solution to this is zone defense instead of man-to-man.
|
| The sad fact is that a manager won't approve you working on
| something that'll save 1% CPU. But once the tall and medium
| tent poles have been knocked down, that's all there is left.
| There are hundreds of them, and they double or triple your
| response time and/or CPU load.
|
| I've had much, much better outcomes by rejecting trying to
| achieve an N% speedup across the entire app, and instead
| picking one subject area of the code and finding 20% there. You
| deep dive into that section, fully absorbing how it works and
| why it works, and you fix every problem you see that registers
| above the noise floor in your perf tool.
Some second and third-tier performance problems complement each
| other, and you can avoid one entirely by altering the other. The
| risk of the 1% changes can be amortized over both the effort you
| expended learning this code, and the testing time required to
| validate 3 large changes scattered across the codebase versus 8
| changes in the same workflow. Much simpler to explain, much
| easier to verify.
|
| Big wins feel good _now_ but the company comes to expect them.
| In the place where I used this best, I delivered 20%
| performance improvements per release for something like 8
| releases in a row, before I ran out of areas I hadn't touched
| before. Often I'd find a perf issue in how the current section
| of code talks to another, and that would inform what section of
| code I worked on next, while the problem domain was still fresh
| in my brain.
| charcircuit wrote:
| It's always about architecture. In the micro, these are the
| hotspots you optimize; in the macro, these are the large rewrites
| you see.
|
| Performance is not the only thing that you should optimize your
| architecture for. Factors like adaptability, robustness, ease of
| understanding, speed of implementation, maintenance cost, etc.
| are things that you should consider. The factors that are the
| best today are not always still the best in the future, which is
| why rewrites are a part of any software's life cycle.
| aranchelk wrote:
| That stuff is boring. Don't be a killjoy.
| sosodev wrote:
| This is true if your end goal is to have a super fast program but
| that is very rarely the case. The GTA Online loading time issues
| went unnoticed for years because Rockstar just didn't care that
| the loading times were long. Users still played the game and
| spent a ton of money.
|
| Performance hotspots often are the difference between acceptable
| and unacceptable performance. I'm sure I'm not the only person
| who has seen that be the case many times.
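The hotspot-finding workflow the thread keeps referring to can be sketched with Python's stdlib profiler; the functions below are toy code invented for illustration, with a deliberately quadratic hotspot:

```python
import cProfile
import io
import pstats

def hot(n):
    # Deliberate hotspot: membership tests against a growing list, O(n^2) overall.
    seen = []
    for i in range(n):
        if i not in seen:
            seen.append(i)
    return len(seen)

def cold(n):
    # Cheap by comparison: a single linear pass.
    return sum(range(n))

def workload():
    hot(2000)
    cold(2000)

prof = cProfile.Profile()
prof.enable()
workload()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("tottime").print_stats(3)
report = out.getvalue()
# The report ranks hot() at the top by own time: that is the hotspot
# a profiler hands you, before any architectural questions come up.
```

This is the "easy" half of the article's argument: the profiler finds `hot()` immediately; what it cannot tell you is whether the surrounding design should be doing that work at all.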
| hinkley wrote:
| I don't think people understand the ways that we have adapted
| to delays. At least once a month I complain about how, when we
| were kids, commercials were when you went for a pee break or to
| get a snack. There was no pause button. Binge watching on
| streaming always means you have to interrupt or wait twenty-five
| minutes.
|
| I suspect if you spied on a bunch of GTA players you'd find
| them launching the game _and then_ going to the fridge, rather
| than the other way around.
| eklitzke wrote:
| > This is true if your end goal is to have a super fast program
| but that is very rarely the case.
|
| This is true in some banal sense, but kind of misses the point
| that there are certain domains where high performance software
| is a given, and in other domains it may rarely be important. If
| you're working on games, certain types of financial systems,
| autonomous vehicles, operating systems, etc., then high
| performance is critical and something you need to think about
| quite literally from day one.
| tonyarkles wrote:
| > This is true in some banal sense, but kind of misses the
| point that there are certain domains where high performance
| software is a given
|
| I work in a field where we're trying to squeeze the maximum
| amount of juice out of a fixed amount of compute (the
| hardware we're using only gets a rev every couple of years).
| My background (MSc + past work) was primarily in distributed
| systems performance analysis, and we definitely designed our
| system from day one to have an architecture that could
| support high performance.
|
| The GP's comment irks me. There are so many tools I use day-
| to-day that are ancillary to the work I do where the
| performance is absolutely miserable. I stare at them in
| disbelief. I'm processing 500MB/s of high resolution image
| data on about 30W in my primary system. How the hell does it
| take 5 seconds for a friggin' email to load in a local
| application?
How does it take 3 seconds _for a password
| search dialog_ to open when I click on it? How does WhatsApp
| consume the same amount of memory as QGIS loaded up with
| hundreds of geoprojected high-resolution images?
|
| I agree that many systems _don't_ require maximum-throughput
| heavy optimization, but there's a spectrum here and it's
| infuriating to me how far left on that spectrum a lot of
| applications are.
| vgatherps wrote:
| I feel the same frustration. I work in a field with
| stupendously tight latency constraints and am shocked by the
| disparity between how much work we fit into tiny deadlines and
| how horrifyingly slow GUI software written by well-resourced
| mega corporations is.
|
| It feels to me like user interfaces are somehow not
| considered high-performance applications because they
| aren't doing super-high-throughput stuff, they're "just a
| gui", they're running on a phone, etc. All of that is true
| but it misses that GUIs are latency/determinism sensitive
| applications.
|
| I remember hearing some quote about how Apple was the
| _only_ software company that systematically measured
| response time on their GUIs, and I'd believe it because my
| Apple products are by far the snappiest and most responsive
| computing devices I have (the only thing that even competes
| is a very beefy desktop).
| tonyarkles wrote:
| Yeah, exactly, like... we're doing microsecond-precise
| high-bandwidth imaging and processing it real-time (not
| in the Hard Real-Time sense, but in the "we don't have
| enough RAM to buffer more than a couple of seconds' worth
| of frames and we don't post-process it after the fact"
| real-time sense) with a team of... 3-5 or so dedicated to
| the end-to-end flow from photons to ML engine to disk.
| The ML models themselves are a different team that we
| just have to bonk once in a while if they do something
| that hurts throughput too badly.
|
| I'm sure we'd be bored as hell working on UI performance
| optimization, but if we could gamify it somehow... :D
| manv1 wrote:
| TL;DR: "It's better to design a fast system from the get-go
| instead of trying to fix a slow system later."
|
| That's basically true. I worked on a system that was
| Java/scala/spring/hibernate and it was just slow. It was slow
| when it was servicing an empty request, and it just went downhill
| from there. They just built it wrong... and they went ahead and
| built it wrong again.
|
| Today, I could replace it with a few hundred lines of node in
| AWS/Lambda and get multiple orders of magnitude of performance.
| pestatije wrote:
| [flagged]
| tonyarkles wrote:
| > Today, I could replace it with a few hundred lines of node in
| AWS/Lambda and get multiple orders of magnitude of performance.
|
| I had a fun bake-off a few years back. I was in more of a
| devOPS role (i.e. mostly Ops but writing code here and there
| when needed) and we needed something akin to an API Gateway but
| with some very domain-specific routing logic. One of the
| developers and I talked it through; he wanted to do Node, I
| suggested it would be a perfect place for Go. We decided to do
| two parallel (~500 LOC) implementations over a weekend and run
| them head-to-head on Monday.
|
| The code, logically, ended up coming out quite similar, which
| made us both pretty happy. Then... we started the benchmarking.
| They were neck and neck! For a fixed level of throughput, Go
| was only winning by maybe 5% on latency. That stayed true up
| until about 10krps, at which point Node flatlined because it
| was saturating a single CPU and Go just kept going and going
| and going until it saturated all of the cores on the VM we were
| testing on.
|
| Could we have scaled out the Node version to multiple nodes in
| the cluster? Sure.
At 10krps though, it was already using 2-3x | the RAM that the Go version was using at 80krps, and | replicating 8 copies of it vs the 2x we did with the Go version | (just for redundancy) starts to have non-trivial resource | costs. | | And don't get me wrong, we had a bunch of the exact same | Java/scala/spring/hibernate type stuff in the system as well, | and it was dog-ass slow in comparison while also eating RAM | like it was candy. | manv1 wrote: | Yeah, the one time I used go it was pretty good. The big | question is always whether your stuff spends more time | waiting or more time processing. For the former, it's node. | For the latter, it's go. | ummonk wrote: | From my experience it's better to just consider performance from | the get-go, and carefully consider which tech stack you're using | and how the specific logic / system architecture you've chosen | will be performant. It's much easier than being stuck with | performance problems down the road that will need a painful | rewrite. | | The whole mantra of avoiding "premature optimizations" was | applicable in an era when "optimizations" meant rewriting C code | in assembly. | govolckurself wrote: | [dead] | secondcoming wrote: | Well, Lemire is renowned for his SIMD algos. | dilap wrote: | Yep. | | You need to be thinking about performance from the very | beginning, if you're ever going to be fast. | | Because, like the article said, "overall architecture trumps | everything". You (probably) can't go back and fix that without | doing a rewrite. | | (Though it can be OK to have particular small parts where say | "we'll do this in a slow way and it's clear how we'll swap it | out into a faster way later if it matters".) | | But if your approach is just "don't even worry about | performance, that's premature optimization", you'll be in for a | world of pain when you want to make it fast. | attractivechaos wrote: | A catch in Knuth's famous quote is how to define "premature". 
I | am not old enough to see how programmers in his time thought | about "premature", but my impression is quite a few modern | programmers think all optimizations are premature. | smolder wrote: | The other thing that's changed from the 'every optimization is | premature' era is that shrinking CPUs don't result in big gains | in frequency anymore -- Moore's law isn't going to make your | python run at C speed no matter how long you wait for better | hardware. | 0x000xca0xfe wrote: | And the speed of light ensures that memory latencies won't | get much better until CPUs are small cubes made of SRAM. | amluto wrote: | Come again? | | Modern servers seem to have about 100ns latency to main | memory. The speed of light (actually electrical signals) | delay is maybe 1-2ns. | actionfromafar wrote: | Ehrm. Small _spheres_ of SRAM, if I may. | speed_spread wrote: | That's one way of dealing with corner cases. | cma wrote: | That and memory latency has improved much slower than | everything else, so pointer chasing implicit throughout | languages like Python is just horrendously slow. SRAM for | bigger cache isn't scaling down anymore either in the last | several process nodes. | adamnemecek wrote: | I agree with the "premature optimization". It's one of those | phrases like "correlation does not imply causation" that makes | my blood boil. Like cool dude, did you just take freshman CS. | Psychlist wrote: | If you have a fast design/architecture, you may never need to | optimise the code at all. But the flip side is that with a bad | design or bad architecture optimising the implementation won't | save you. With a sufficiently bad architecture starting again | is the only reasonable choice. | | I've seen code that does "fast" searches of a tree in a dumb | way come out O(n^10) or worse (at some point you just stop | counting), and the solution was not to search most of the tree | at all. Find the relevant node and follow links from that. 
| | Meanwhile in my day job performance really doesn't matter. We | need a cloud system for the distributed high bandwidth side, | but the smallest instances we can buy with the necessary | bandwidth have so much CPU and RAM that even quite bad memory | leaks take days to bring an instance down. Admittedly this is | C++ with a sensible design (if I do say so myself) so ... good | design and architecture means you don't have to optimise. | turtleyacht wrote: | > these lines of code [were] pulling data from memory and | _software cannot beat Physics._ [These] are elementary | operations... | | > measuring big effects is easy, measuring small ones becomes | impossible because the _action of measuring interacts_ with the | software | | > to multiply the performance by N, you need ... 2^N | optimizations | | > why companies do full rewrite of their code for performance | saagarjha wrote: | Why quote these lines? | turtleyacht wrote: | Summarizing the article. Also gives me a way to evaluate | performance/optimization. Ideas to hang hooks on. | bastawhiz wrote: | > And that explain why companies do full rewrites of their code | for performance: the effort needed to squeeze more performance | from the existing code becomes too much and a complete rewrite is | cheaper. | | The article provides reasons why optimization gets harder, but no | arguments for why a rewrite is better. It's unclear whether the | author is arguing for rewrites or whether they're simply pointing | out why companies take them on. | | Arguably, though, companies taking on a full rewrite surely must | have considered the cost of optimization (versus naively saying | "the system is slow, replace it!"--though maybe some did). | Rewrites are big, expensive, and time-consuming. It means new | bugs and unknown unknowns, and no time to add features or fix | bugs because you're busy rewriting functional code. It's a | scapegoat for lack of improvement or progress. You shouldn't take | one on lightly. 
| | At the same time, this post also neglects that some efficiency | wins have little to do with the efficiency of the code, but | rather the efficiency of the logic. An N+1 query in your | application looks like your database is slow: you're wasting a | ton of time sitting and waiting for your DB to return | information! But the real problem is that you're repeatedly going | back-and-forth to the database to query lots of little pieces of | information that could have far more efficiently been queried all | at once. | | > It is relatively easy to double the performance of an | unoptimized piece of code, but much harder to multiply it by 10. | You quickly hit walls that can be unsurmountable: the effort | needed to double the performance again would just be too much. | | That's not really true, though. One bad SQL query can go from | many seconds or minutes to milliseconds. One accidentally- | quadratic algorithm can take orders of magnitude more time than a | linear-time algorithm. One bad regexp can account for the | majority of a request. Of course, as you fix the biggest | performance problems, the only problems left are ones that are | smaller than your biggest ones, so you'll have diminishing | returns. | | But it also begs the question, what choices has your existing | code made that makes it _ten times_ slower than you want it to | be? In my experience, you're doing work synchronously that could | have been put in a queue and worked on asynchronously. It's more | often "you're doing more work than you should" or "you're being | inefficient with the resources you have available" than "a | specific piece of code is computationally inefficient". | nemothekid wrote: | > _The article provides reasons why optimization gets harder, | but no arguments for why a rewrite is better. 
It's unclear
| whether the author is arguing for rewrites or whether they're
| simply pointing out why companies take them on._
|
| He didn't argue a rewrite is just "better"; his argument was
| that a rewrite was the only card on the table. The
| _architecture_ was deficient, and to get more performance you
| have to change the architecture, which means a rewrite.
|
| I tend to agree; I take the view that most engineers are smart,
| and compilers/interpreters/virtual machines are even smarter, so
| most targeted optimizations aren't going to result in very much
| gain. A codebase full of N+1 queries or unindexed queries never
| cared about performance to begin with.
|
| For true gains, you will have to think about data, which is the
| true bottleneck for most applications - getting data from
| memory, the disk, or the network will take much longer than any
| instruction cycle. The way memory moves through your
| application is baked into your architecture, and changing this
| will almost always involve a rewrite. To your final point,
|
| > _In my experience, you're doing work synchronously that
| could have been put in a queue and worked on asynchronously._
|
| moving from a synchronous codebase to an async one almost
| always involves a rewrite.
| 0x000xca0xfe wrote:
| Optimizing for modern CPUs means optimizing for predictable
| memory accesses and program flow. Minimizing memory usage helps a
| lot, too.
|
| Unfortunately this is pretty counterintuitive and most
| programming languages do not make it easy. And if you optimize
| for size you almost get laughed at.
| helen___keller wrote:
| This whole thing is basically a straw man. "Performance
| engineering works but sometimes it's not enough to overcome a bad
| architecture". Alright, was that actually in question in the
| first place?
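The N+1 query pattern bastawhiz and nemothekid discuss upthread can be made concrete; a minimal sketch against an in-memory SQLite table (hypothetical schema, Python stdlib only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(100)])
ids = list(range(100))

# N+1 pattern: one round trip per row. The database looks "slow",
# but the real cost is issuing 100 separate queries.
one_by_one = [conn.execute("SELECT name FROM users WHERE id = ?",
                           (i,)).fetchone()[0] for i in ids]

# Batched: the same data fetched in a single query.
marks = ",".join("?" * len(ids))
batched = [row[0] for row in conn.execute(
    f"SELECT name FROM users WHERE id IN ({marks}) ORDER BY id", ids)]

assert one_by_one == batched  # same result, 1/100th the round trips
```

This is the "efficiency of the logic, not the code" case: neither query is individually slow, and no profiler hotspot points at the loop issuing them; the fix is restructuring how the application talks to the database.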
| vgatherps wrote: | You'd be surprised how common the view "Performance doesn't | matter now we'll just fix the hotspot later" is ___________________________________________________________________ (page generated 2023-04-27 23:00 UTC)