[HN Gopher] Hotspot performance engineering fails
       ___________________________________________________________________
        
       Hotspot performance engineering fails
        
       Author : slimsag
       Score  : 97 points
       Date   : 2023-04-27 16:40 UTC (6 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | PaulHoule wrote:
       | I went through a time when I was pitching a "boxes-and-lines"
       | data processing tool like
       | 
       | https://www.knime.com/
       | 
        | which more-or-less passed JSON documents (instead of SQL rows)
        | over the lines, and found that the kind of people who bought and
        | financed database startups wouldn't touch anything that couldn't
        | be implemented with columnar processing.
       | 
        | I thought that this kind of system would advance the "low code"
        | nature of these systems: with relational rows, many kinds of
        | data processing require splitting up the data into streams and
        | joining them, whereas an object-relational system lets you
        | localize processing in a small area of the graph and reuse
        | parts of a computation.
       | 
        | Columnar processing is so much faster than row-based processing
        | that most investors and partners thought customers _really_
        | needed speed at the expense of being able to write simpler
        | pipelines. Even though I had a nice demo of a hybrid batch
        | /stream processing system (that gave correct answers), none of
        | them cared. Thus, from one viewpoint, architecture is everything.
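        | 
        | (A toy sketch of the columnar-vs-row difference, in Python with
        | made-up data -- real engines add vectorization and compression
        | on top of the layout:)
        | 
```python
# Row layout: one record object per row; summing one field has to
# touch every record and chase a pointer per row.
rows = [{"id": i, "price": float(i), "qty": i % 7} for i in range(10_000)]
total_rows = sum(r["price"] for r in rows)

# Columnar layout: each field is one contiguous sequence; summing
# scans a single flat array, which is cache-friendly and easy for an
# engine to vectorize.
prices = [float(i) for i in range(10_000)]
total_cols = sum(prices)

assert total_rows == total_cols
```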
       | 
       | (Funny though, I later worked for a company that had a system
       | like this that wasn't quite sure what algebra the tool worked on
       | and the tool didn't quite always get the same answer on each
       | run...)
        
       | NovemberWhiskey wrote:
       | I don't think I entirely agree with the premise here. Yes, it is
       | extremely difficult to engineer performance in after the fact;
       | but assuming you've got an architecture that's basically fit for
       | purpose (from the performance perspective), then improving by
       | targeting hotspots is sound, isn't it? That's literally Amdahl's
       | law.
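        | 
        | (For concreteness, a hedged sketch of the ceiling Amdahl's law
        | imposes -- the 60% hotspot figure here is just an example:)
        | 
```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of runtime is accelerated by s."""
    return 1.0 / ((1.0 - p) + p / s)

# If the hotspot is 60% of total runtime, a 10x speedup of it yields
# only ~2.17x overall, and even an infinite speedup of the hotspot
# caps out at 1 / (1 - 0.6) = 2.5x: the rest of the program dominates.
print(amdahl_speedup(0.6, 10))    # ~2.17
print(amdahl_speedup(0.6, 1e9))   # ~2.5
```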
        
         | Sesse__ wrote:
         | Amdahl's law is specifically about the futility of optimizing
         | by removing hotspots... (Or rather, that it can only take you
         | so far.)
        
       | tasubotadas wrote:
        | The guy invents a straw man to justify premature optimization.
        
       | MattPalmer1086 wrote:
       | It was interesting to read about why hotspots are not the whole
       | story in performance. They are still important though.
       | 
       | Facebook may have the resources and/or need to do complete
       | rewrites of everything to squeeze out more performance, but most
       | companies don't.
       | 
       | I've personally improved performance of a lot of code
       | significantly by identifying hot spots. So calling hotspot
       | performance engineering a fail seems a bit unnecessarily
       | provocative.
        
         | jesse__ wrote:
         | > Facebook may have the resources and/or need to do complete
         | rewrites of everything to squeeze out more performance, but
         | most companies don't.
         | 
         | Actually, if you watch the video Casey put together, he very
         | clearly demonstrates most companies _do_.
        
       | continuational wrote:
        | This is true: when you keep optimizing, you soon face death
        | from a thousand paper cuts. But often, it's enough to find that
        | one bottleneck and make it a few times faster.
        
         | hinkley wrote:
          | The solution to this is zone defense instead of man-to-man.
         | 
          | The sad fact is that a manager won't approve you working on
          | something that'll save 1% CPU. But once the tall and medium
          | tent poles have been knocked down, that's all that's left:
          | hundreds of 1% problems that together double or triple your
          | response time and/or CPU load.
         | 
         | I've had much, much better outcomes by rejecting trying to
         | achieve an N% speedup across the entire app, and instead
         | picking one subject area of the code and finding 20% there. You
          | deep dive into that section, fully absorbing how it works and
         | why it works, and you fix every problem you see that registers
         | above the noise floor in your perf tool. Some second and third
         | tier performance problems complement each other, and you can
         | avoid one entirely by altering the other. The risk of the 1%
         | changes can be amortized over both the effort you expended
         | learning this code, and the testing time required to validate 3
         | large changes scattered across the codebase versus 8 changes in
         | the same workflow. Much simpler to explain, much easier to
         | verify.
         | 
         | Big wins feel good _now_ but the company comes to expect them.
         | In the place where I used this best, I delivered 20%
         | performance improvements per release for something like 8
          | releases in a row, before I ran out of areas I hadn't touched
         | before. Often I'd find a perf issue in how the current section
         | of code talks to another, and that would inform what section of
         | code I worked on next, while the problem domain was still fresh
         | in my brain.
        
       | charcircuit wrote:
        | It's always about architecture. In the micro, these are the
        | hotspots you optimize; in the macro, these are the large
        | rewrites you see.
       | 
        | Performance is not the only thing you should optimize your
        | architecture for. Factors like adaptability, robustness, ease of
        | understanding, speed of implementation, maintenance cost, etc.
        | are things you should consider. The factors that are best today
        | are not always still the best in the future, which is why
        | rewrites are a part of any software's life cycle.
        
         | aranchelk wrote:
         | That stuff is boring. Don't be a killjoy.
        
       | sosodev wrote:
       | This is true if your end goal is to have a super fast program but
        | that is very rarely the case. The GTA Online loading time issues
        | went unnoticed for years because Rockstar just didn't care that
        | the loading times were long. Users still played the game and
       | spent a ton of money.
       | 
       | Performance hotspots often are the difference between acceptable
       | and unacceptable performance. I'm sure I'm not the only person
       | who has seen that be the case many times.
        
         | hinkley wrote:
          | I don't think people understand the ways that we have adapted
          | to delays. At least once a month I complain about how, when we
          | were kids, commercials were when you went for a pee break or
          | to get a snack. There was no pause button. Binge-watching on
          | streaming always means you have to interrupt or wait
          | twenty-five minutes.
         | 
         | I suspect if you spied on a bunch of GTA players you'd find
         | them launching the game _and then_ going to the fridge, rather
         | than the other way around.
        
         | eklitzke wrote:
         | >This is true if your end goal is to have a super fast program
         | but that is very rarely the case.
         | 
         | This is true in some banal sense, but kind of misses the point
         | that there are certain domains where high performance software
         | is a given, and in other domains it may rarely be important. If
         | you're working on games, certain types of financial systems,
         | autonomous vehicles, operating systems, etc. then high
         | performance is critical and something you need to think about
         | quite literally from day one.
        
           | tonyarkles wrote:
           | > This is true in some banal sense, but kind of misses the
           | point that there are certain domains where high performance
           | software is a given
           | 
           | I work in a field where we're trying to squeeze the maximum
           | amount of juice out of a fixed amount of compute (the
           | hardware we're using only gets a rev every couple of years).
           | My background (MSc + past work) was in primarily distributed
           | systems performance analysis, and we definitely designed our
           | system from day one to have an architecture that could
           | support high performance.
           | 
           | The GP's comment irks me. There are so many tools I use day-
           | to-day that are ancillary to the work I do where the
           | performance is absolutely miserable. I stare at them in
           | disbelief. I'm processing 500MB/s of high resolution image
           | data on about 30W in my primary system. How the hell does it
           | take 5 seconds for a friggin' email to load in a local
           | application? How does it take 3 seconds _for a password
           | search dialog_ to open when I click on it? How does WhatsApp
           | consume the same amount of memory as QGIS loaded up with
           | hundreds of geoprojected high-resolution images?
           | 
            | I agree that many systems _don't_ require maximum-throughput
           | heavy optimization, but there's a spectrum here and it's
           | infuriating to me how far left on that spectrum a lot of
           | applications are.
        
             | vgatherps wrote:
             | I feel the same frustration. I work in a field with
              | stupendously tight latency constraints, and am shocked by
              | the disparity between how much work we fit into tiny
              | deadlines and how horrifyingly slow GUI software written
              | by well-resourced megacorporations is.
             | 
              | It feels to me like user interfaces are somehow not
              | considered high-performance applications because they
              | aren't doing super-high-throughput stuff, they're "just a
              | GUI", they're running on a phone, etc. All of that is
              | true, but it misses that GUIs are latency- and
              | determinism-sensitive applications.
             | 
             | I remember hearing some quote about how Apple was the
             | _only_ software company that systematically measured
              | response time on their GUIs, and I'd believe it because my
             | apple products are by far the snappiest and most responsive
             | computing devices I have (the only thing that even competes
             | is a very beefy desktop).
        
               | tonyarkles wrote:
                | Yeah, exactly, like... we're doing microsecond-precise
                | high-bandwidth imaging and processing it in real time
                | (not in the Hard Real-Time sense, but in the "we don't
                | have enough RAM to buffer more than a couple of seconds'
                | worth of frames and we don't post-process it after the
                | fact" real-time sense) with a team of... 3-5 or so
                | dedicated to the end-to-end flow from photons to ML
                | engine to disk.
               | The ML models themselves are a different team that we
               | just have to bonk once in a while if they do something
               | that hurts throughput too badly.
               | 
               | I'm sure we'd be bored as hell working on UI performance
               | optimization, but if we could gamify it somehow... :D
        
       | manv1 wrote:
       | TL; DR: "It's better to design a fast system from the get-go
       | instead of trying to fix a slow system later."
       | 
       | That's basically true. I worked on a system that was
       | Java/scala/spring/hibernate and it was just slow. It was slow
       | when it was servicing an empty request, and it just went downhill
       | from there. They just built it wrong...and they went ahead and
       | built it wrong again.
       | 
        | Today, I could replace it with a few hundred lines of node in
        | AWS/Lambda and get multiple orders of magnitude better
        | performance.
        
         | pestatije wrote:
         | [flagged]
        
         | tonyarkles wrote:
          | > Today, I could replace it with a few hundred lines of node
          | in AWS/Lambda and get multiple orders of magnitude better
          | performance.
         | 
         | I had a fun bake-off a few years back. I was in more of a
         | devOPS role (i.e. mostly Ops but writing code here and there
         | when needed) and we needed something akin to an API Gateway but
         | with some very domain-specific routing logic. One of the
         | developers and I talked it through, he wanted to do Node, I
         | suggested it would be a perfect place for Go. We decided to do
         | two parallel (~500 LOC) implementations over a weekend and run
         | them head-to-head on Monday.
         | 
         | The code, logically, ended up coming out quite similar, which
         | made us both pretty happy. Then... we started the benchmarking.
         | They were neck and neck! For a fixed level of throughput, Go
         | was only winning by maybe 5% on latency. That stayed true up
         | until about 10krps, at which point Node flatlined because it
         | was saturating a single CPU and Go just kept going and going
         | and going until it saturated all of the cores on the VM we were
         | testing on.
         | 
         | Could we have scaled out the Node version to multiple nodes in
         | the cluster? Sure. At 10krps though, it was already using 2-3x
         | the RAM that the Go version was using at 80krps, and
         | replicating 8 copies of it vs the 2x we did with the Go version
         | (just for redundancy) starts to have non-trivial resource
         | costs.
         | 
         | And don't get me wrong, we had a bunch of the exact same
         | Java/scala/spring/hibernate type stuff in the system as well,
         | and it was dog-ass slow in comparison while also eating RAM
         | like it was candy.
        
           | manv1 wrote:
            | Yeah, the one time I used Go it was pretty good. The big
            | question is always whether your stuff spends more time
            | waiting or more time processing. For the former, it's Node.
            | For the latter, it's Go.
        
       | ummonk wrote:
       | From my experience it's better to just consider performance from
        | the get-go: carefully consider which tech stack you're using
        | and whether the specific logic / system architecture you've
        | chosen will be performant. It's much easier than being stuck
        | with performance problems down the road that will need a painful
        | rewrite.
       | 
       | The whole mantra of avoiding "premature optimizations" was
       | applicable in an era when "optimizations" meant rewriting C code
       | in assembly.
        
         | govolckurself wrote:
         | [dead]
        
         | secondcoming wrote:
         | Well, Lemire is renowned for his SIMD algos.
        
         | dilap wrote:
         | Yep.
         | 
         | You need to be thinking about performance from the very
         | beginning, if you're ever going to be fast.
         | 
         | Because, like the article said, "overall architecture trumps
         | everything". You (probably) can't go back and fix that without
         | doing a rewrite.
         | 
         | (Though it can be OK to have particular small parts where say
         | "we'll do this in a slow way and it's clear how we'll swap it
         | out into a faster way later if it matters".)
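          | 
          | (One hedged way to leave that seam, sketched in Python with
          | made-up names -- callers depend only on the interface, so the
          | slow piece can be swapped later:)
          | 
```python
from bisect import bisect_left
from typing import Protocol

class Matcher(Protocol):
    def find(self, haystack: list, needle: str) -> int: ...

class LinearMatcher:
    """Deliberately simple O(n) scan: slow but trivially correct."""
    def find(self, haystack, needle):
        for i, item in enumerate(haystack):
            if item == needle:
                return i
        return -1

class SortedMatcher:
    """Drop-in O(log n) replacement, assuming the input is sorted."""
    def find(self, haystack, needle):
        i = bisect_left(haystack, needle)
        return i if i < len(haystack) and haystack[i] == needle else -1
```
          | Because callers only see Matcher, swapping LinearMatcher for
          | SortedMatcher later is a one-line change, not a rewrite.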
         | 
         | But if your approach is just "don't even worry about
         | performance, that's premature optimization", you'll be in for a
         | world of pain when you want to make it fast.
        
         | attractivechaos wrote:
          | A catch in Knuth's famous quote is how to define "premature".
          | I am not old enough to have seen how programmers in his time
          | thought about "premature", but my impression is that quite a
          | few modern programmers think all optimizations are premature.
        
         | smolder wrote:
         | The other thing that's changed from the 'every optimization is
         | premature' era is that shrinking CPUs don't result in big gains
         | in frequency anymore -- Moore's law isn't going to make your
          | Python run at C speed no matter how long you wait for better
         | hardware.
        
           | 0x000xca0xfe wrote:
           | And the speed of light ensures that memory latencies won't
           | get much better until CPUs are small cubes made of SRAM.
        
             | amluto wrote:
             | Come again?
             | 
             | Modern servers seem to have about 100ns latency to main
             | memory. The speed of light (actually electrical signals)
             | delay is maybe 1-2ns.
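              | 
              | (A back-of-envelope check, with assumed numbers: a ~15 cm
              | CPU-to-DIMM trace and signals at roughly half the speed
              | of light in copper:)
              | 
```python
C = 3.0e8          # speed of light in vacuum, m/s
v = 0.5 * C        # rough signal speed in a PCB trace (assumption)
trace_m = 0.15     # assumed one-way CPU-to-DIMM distance, ~15 cm

one_way_ns = trace_m / v * 1e9
round_trip_ns = 2 * one_way_ns
print(round_trip_ns)   # ~2 ns, vs ~100 ns total DRAM latency --
                       # the rest is DRAM row access, queuing, etc.
```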
        
             | actionfromafar wrote:
             | Ehrm. Small _spheres_ of SRAM, if I may.
        
               | speed_spread wrote:
               | That's one way of dealing with corner cases.
        
           | cma wrote:
            | That, and memory latency has improved much more slowly than
            | everything else, so the pointer chasing implicit throughout
            | languages like Python is just horrendously slow. SRAM for
            | bigger caches isn't scaling down anymore either, as of the
            | last several process nodes.
        
         | adamnemecek wrote:
          | I agree about "premature optimization". It's one of those
          | phrases, like "correlation does not imply causation", that
          | makes my blood boil. Like, cool dude, did you just take
          | freshman CS?
        
         | Psychlist wrote:
         | If you have a fast design/architecture, you may never need to
         | optimise the code at all. But the flip side is that with a bad
         | design or bad architecture optimising the implementation won't
          | save you. With a sufficiently bad architecture, starting again
          | is the only reasonable choice.
         | 
         | I've seen code that does "fast" searches of a tree in a dumb
         | way come out O(n^10) or worse (at some point you just stop
         | counting), and the solution was not to search most of the tree
         | at all. Find the relevant node and follow links from that.
         | 
         | Meanwhile in my day job performance really doesn't matter. We
         | need a cloud system for the distributed high bandwidth side,
         | but the smallest instances we can buy with the necessary
         | bandwidth have so much CPU and RAM that even quite bad memory
         | leaks take days to bring an instance down. Admittedly this is
         | C++ with a sensible design (if I do say so myself) so ... good
         | design and architecture means you don't have to optimise.
        
       | turtleyacht wrote:
       | > these lines of code [were] pulling data from memory and
       | _software cannot beat Physics._ [These] are elementary
       | operations...
       | 
       | > measuring big effects is easy, measuring small ones becomes
       | impossible because the _action of measuring interacts_ with the
       | software
       | 
       | > to multiply the performance by N, you need ... 2^N
       | optimizations
       | 
       | > why companies do full rewrite of their code for performance
        
         | saagarjha wrote:
         | Why quote these lines?
        
           | turtleyacht wrote:
           | Summarizing the article. Also gives me a way to evaluate
           | performance/optimization. Ideas to hang hooks on.
        
       | bastawhiz wrote:
       | > And that explain why companies do full rewrites of their code
       | for performance: the effort needed to squeeze more performance
       | from the existing code becomes too much and a complete rewrite is
       | cheaper.
       | 
       | The article provides reasons why optimization gets harder, but no
       | arguments for why a rewrite is better. It's unclear whether the
       | author is arguing for rewrites or whether they're simply pointing
       | out why companies take them on.
       | 
       | Arguably, though, companies taking on a full rewrite surely must
       | have considered the cost of optimization (versus naively saying
       | "the system is slow, replace it!"--though maybe some did).
       | Rewrites are big, expensive, and time-consuming. It means new
       | bugs and unknown unknowns, and no time to add features or fix
       | bugs because you're busy rewriting functional code. It's a
       | scapegoat for lack of improvement or progress. You shouldn't take
       | one on lightly.
       | 
       | At the same time, this post also neglects that some efficiency
       | wins have little to do with the efficiency of the code, but
       | rather the efficiency of the logic. An N+1 query in your
       | application looks like your database is slow: you're wasting a
       | ton of time sitting and waiting for your DB to return
       | information! But the real problem is that you're repeatedly going
       | back-and-forth to the database to query lots of little pieces of
       | information that could have far more efficiently been queried all
       | at once.
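        | 
        | (The N+1 shape and its fix, as a minimal sqlite3 sketch with a
        | made-up schema:)
        | 
```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 2, 'z');
""")

# N+1 pattern: one query for the list, then one more query per row.
authors = con.execute("SELECT id, name FROM authors").fetchall()
slow = [(name, con.execute(
            "SELECT COUNT(*) FROM posts WHERE author_id = ?", (aid,)
        ).fetchone()[0])
        for aid, name in authors]

# Same answer in a single round trip to the database.
fast = con.execute("""
    SELECT a.name, COUNT(p.id)
    FROM authors a LEFT JOIN posts p ON p.author_id = a.id
    GROUP BY a.id ORDER BY a.id
""").fetchall()

assert slow == fast
```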
       | 
       | > It is relatively easy to double the performance of an
       | unoptimized piece of code, but much harder to multiply it by 10.
       | You quickly hit walls that can be unsurmountable: the effort
       | needed to double the performance again would just be too much.
       | 
       | That's not really true, though. One bad SQL query can go from
       | many seconds or minutes to milliseconds. One accidentally-
       | quadratic algorithm can take orders of magnitude more time than a
       | linear-time algorithm. One bad regexp can account for the
       | majority of a request. Of course, as you fix the biggest
       | performance problems, the only problems left are ones that are
       | smaller than your biggest ones, so you'll have diminishing
       | returns.
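        | 
        | (The classic accidentally-quadratic shape, as a toy Python
        | example -- same output, very different scaling:)
        | 
```python
def dedupe_quadratic(items):
    # Looks innocent, but `x not in seen` scans a list each time,
    # so the whole pass is O(n^2).
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedupe_linear(items):
    # Same result; a set makes membership O(1), so the pass is O(n).
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```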
       | 
        | But it also raises the question: what choices has your existing
        | code made that make it _ten times_ slower than you want it to
        | be? In my experience, you're doing work synchronously that could
       | have been put in a queue and worked on asynchronously. It's more
       | often "you're doing more work than you should" or "you're being
       | inefficient with the resources you have available" than "a
       | specific piece of code is computationally inefficient".
        
         | nemothekid wrote:
          | > _The article provides reasons why optimization gets harder,
          | but no arguments for why a rewrite is better. It's unclear
          | whether the author is arguing for rewrites or whether they're
          | simply pointing out why companies take them on._
          | 
          | He didn't argue a rewrite is just "better"; his argument was
          | that a rewrite was the only card on the table. The
          | _architecture_ was deficient, and to get more performance you
          | have to change the architecture, which means a rewrite.
         | 
         | I tend to agree; I take the view that most engineers are smart,
         | and compilers/interpreters/virtual machines are even smarter so
         | most targeted optimizations aren't going to result in very much
         | gain. A codebase full of N+1 queries or unindexed queries never
         | cared about performance to begin with.
         | 
         | For true gains, you will have to think about data which is the
         | true bottleneck for most applications - getting data from
          | memory, the disk, or the network will take much longer than
          | any instruction cycle. The way memory moves through your
         | application is baked into your architecture and changing this
         | will almost always involve a rewrite. To your final point,
         | 
          | > _In my experience, you're doing work synchronously that
          | could have been put in a queue and worked on asynchronously._
         | 
         | moving from a synchronous codebase to an async one almost
         | always involves a rewrite.
        
       | 0x000xca0xfe wrote:
       | Optimizing for modern CPUs means optimizing for predictable
       | memory accesses and program flow. Minimizing memory usage helps a
       | lot, too.
       | 
        | Unfortunately this is pretty counterintuitive, and most
        | programming languages do not make it easy. And if you optimize
        | for size, you almost get laughed at.
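        | 
        | (A sketch of what "predictable accesses" means, in Python -- in
        | CPython the boxing overhead swamps the cache effect, so this
        | only illustrates the access pattern; in C or NumPy the
        | sequential version can be several times faster:)
        | 
```python
N = 1_000
grid = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(g):
    # Walks each inner list front to back: sequential, predictable.
    return sum(v for row in g for v in row)

def sum_col_major(g):
    # Hops between rows on every step: strided, cache-unfriendly.
    return sum(g[i][j] for j in range(N) for i in range(N))

assert sum_row_major(grid) == sum_col_major(grid)
```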
        
       | helen___keller wrote:
       | This whole thing is basically a straw man. "Performance
       | engineering works but sometimes it's not enough to overcome a bad
       | architecture". Alright, was that actually in question in the
       | first place?
        
         | vgatherps wrote:
          | You'd be surprised how common the view "performance doesn't
          | matter now; we'll just fix the hotspot later" is.
        
       ___________________________________________________________________
       (page generated 2023-04-27 23:00 UTC)