[HN Gopher] Extreme HTTP Performance Tuning
       ___________________________________________________________________
        
       Extreme HTTP Performance Tuning
        
       Author : talawahtech
       Score  : 219 points
       Date   : 2021-05-20 20:01 UTC (2 hours ago)
        
 (HTM) web link (talawah.io)
 (TXT) w3m dump (talawah.io)
        
       | jeffbee wrote:
       | Very nice round-up of techniques. I'd throw out a few that might
       | or might not be worth trying: 1) I always disable C-states deeper
       | than C1E. Waking from C6 takes upwards of 100 microseconds, way
       | too much for a latency-sensitive service, and it doesn't save
       | _you_ any money when you are running on EC2; 2) Try receive flow
       | steering for a possible boost above and beyond what you get from
       | RSS.
       | 
       | Would also be interesting to discuss the impacts of turning off
       | the xmit queue discipline. fq is designed to reduce frame drops
       | at the switch level. Transmitting as fast as possible can cause
       | frame drops which will totally erase all your other tuning work.
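        | 
        | For reference, on Linux the idle states can be inspected and
        | disabled per-CPU through sysfs. A rough sketch (paths per the
        | cpuidle sysfs interface; the `keep` list and helper names are
        | made up; the writes require root):

```python
from pathlib import Path

def deep_cstates(base="/sys/devices/system/cpu", keep=("POLL", "C1", "C1E")):
    # Find idle states deeper than C1E on every CPU
    found = []
    for cpu in sorted(Path(base).glob("cpu[0-9]*")):
        for st in sorted((cpu / "cpuidle").glob("state[0-9]*")):
            name = (st / "name").read_text().strip()
            if name not in keep:
                found.append((cpu.name, st.name, name))
    return found

def disable(states, base="/sys/devices/system/cpu"):
    # Writing "1" to .../disable turns that state off (root required)
    for cpu, st, _ in states:
        Path(base, cpu, "cpuidle", st, "disable").write_text("1")
```

        | A boot-time alternative is the `intel_idle.max_cstate=1`
        | kernel parameter.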
        
         | talawahtech wrote:
         | Thanks!
         | 
         | > I always disable C-states deeper than C1E
         | 
         | AWS doesn't let you mess with c-states for instances smaller
         | than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
         | just for kicks, but it didn't make a difference. Once this test
         | starts, all CPUs are 99+% Busy for the duration of the test. I
         | think it would factor in more if there were lots of CPUs, and
         | some were idle during the test.
         | 
         | > Try receive flow steering for a possible boost
         | 
         | I think the stuff I do in the "perfect locality" section[2]
         | (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
         | flow steering would be trying to do, but more efficiently.
         | 
         | > Would also be interesting to discuss the impacts of turning
         | off the xmit queue discipline
         | 
         | Yea, noqueue would definitely be a no-go on a constrained
         | network, but when running the (t)wrk benchmark in the cluster
         | placement group I didn't see any evidence of packet drops or
         | retransmits. Drop only happened with the iperf test.
         | 
         | 1.
         | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
         | 
         | 2. https://talawah.io/blog/extreme-http-performance-tuning-
         | one-...
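          | 
          | As a rough sketch of the CBPF trick (Linux-only; the
          | classic-BPF constants are transcribed from <linux/filter.h>,
          | and the sock_fprog packing assumes a 64-bit platform):

```python
import ctypes
import socket
import struct

# Classic-BPF constants from <linux/filter.h>; Python has no cBPF
# binding, so the values are redefined here.
BPF_LD, BPF_W, BPF_ABS = 0x00, 0x00, 0x20
BPF_RET, BPF_A = 0x06, 0x10
SKF_AD_OFF, SKF_AD_CPU = -0x1000, 36
SO_ATTACH_REUSEPORT_CBPF = 51

def cpu_filter():
    # Two-instruction cBPF program: "A = current CPU; return A".
    # SO_REUSEPORT uses the return value as the index into the
    # listener group, pinning each flow to the listener on its CPU.
    insns = [
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, (SKF_AD_OFF + SKF_AD_CPU) & 0xFFFFFFFF),
        (BPF_RET | BPF_A, 0, 0, 0),
    ]
    return b"".join(struct.pack("HBBI", *i) for i in insns)

def attach(sock, filt=None):
    # Attach the program to one socket of a SO_REUSEPORT group.
    filt = filt or cpu_filter()
    buf = ctypes.create_string_buffer(filt, len(filt))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HL", len(filt) // 8, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, fprog)
```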
        
         | duskwuff wrote:
         | Does C-state tuning even do anything on EC2? My intuition says
         | it probably doesn't pass through to the underlying hardware --
         | once the VM exits, it's up to the host OS what power state the
         | CPU goes into.
        
           | jeffbee wrote:
           | It definitely works and you can measure the effect. There's
           | official documentation on what it does and how to tune it:
           | 
           | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo.
           | ..
        
         | xtacy wrote:
         | I suspect that the web server's CPU usage will be pretty high
         | (almost 100%), so C-state tuning may not matter as much?
         | 
         | EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
         | so it might not be as effective. For a uniform request workload
         | like the one in the article, statically binding flows to a NIC
         | queue should be sufficient. :)
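          | 
          | For completeness, enabling RFS is just two procfs/sysfs
          | knobs. A sketch (device name, table size, and function name
          | are placeholders; root required):

```python
from pathlib import Path

def enable_rfs(entries=32768,
               proc="/proc/sys/net/core/rps_sock_flow_entries",
               queues_dir="/sys/class/net/eth0/queues"):
    # Size the global flow table, then split it across the rx queues
    Path(proc).write_text(str(entries))
    rx = sorted(Path(queues_dir).glob("rx-*"))
    for q in rx:
        (q / "rps_flow_cnt").write_text(str(entries // max(len(rx), 1)))
    return len(rx)
```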
        
       | fierro wrote:
        | How can you be sure the estimated max server capability is not
        | actually just a limitation in the _client_, i.e., the client
        | maxes out at _sending_ 224k requests/second?
       | 
       | I see that this is clearly not the case here, but in general how
       | can one be sure?
        
       | alufers wrote:
        | That is one hell of a comprehensive article. I wonder how much
        | impact such extreme optimizations would have on a real-world
        | application, one which, for example, does DB queries.
       | 
       | This experiment feels similar to people who buy old cars and
       | remove everything from the inside except the engine, which they
       | tune up so that the car runs faster :).
        
         | talawahtech wrote:
         | This comprehensive level of extreme tuning is not going to be
         | _directly_ useful to most people; but there are a few things in
         | there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
         | servers and frameworks adopt. Similarly I think it is good to
         | be aware of the adaptive interrupt capabilities of AWS
         | instances, and the impacts of speculative execution
         | mitigations, even if you stick to the defaults.
         | 
         | More importantly it is about the idea of using tools like
         | Flamegraphs (or other profiling tools) to identify and
         | eliminate _your_ bottlenecks. It is also just fun to experiment
         | and share the results (and the CloudFormation template). Plus
         | it establishes a high water mark for what is possible, which
         | also makes it useful for future experiments. At some point I
         | would like to do a modified version of this that includes DB
         | queries.
        
         | 101008 wrote:
         | Yes, my experience (not much) is that what makes YouTube or
         | Google or any of those products really impressive is the speed.
         | 
         | YouTube or Google Search suggestion is good, and I think it
         | could be replicable with that amount of data. What is insane is
         | the speed. I can't think how they do it. I am doing something
          | similar for the company I work for and it takes seconds (and the
         | amount of data isn't that much), so I can't wrap my head around
         | it.
         | 
         | The point is that doing only speed is not _that_ complicated,
         | and doing some algorithms alone is not _that_ complicated. What
         | is really hard is to do both.
        
           | ecnahc515 wrote:
           | A lot of this is just spending more money and resources to
           | make it possible to optimize for speed.
           | 
            | Sufficient caching and a lot of parallelism make this
            | possible. That costs money though. Caching means storing
           | data twice. Parallelism means more servers (since you'll
           | probably be aiming to saturate the network bandwidth for each
           | host).
           | 
           | Pre-aggregating data is another part of the strategy, as that
           | avoids using CPU cycles in the fast-path, but it means
           | storing even more copies of the data!
           | 
           | My personal anecdotal experience with this is with SQL on
           | object storage. Query engines that use object storage can
           | still perform well with the above techniques, even though
            | querying large amounts of data from object storage is slow. You
           | bypass the slowness of object storage if you pre-cache the
           | data somewhere else that's closer/faster for recent data. You
           | can have materialized views/tables for rollups of data over
           | longer periods of time, which reduces the data needed to be
           | fetched and cached. It also requires less CPU due to working
           | with a smaller amount of pre-calculated data.
           | 
           | Apply this to every layer, every system, etc, and you can get
           | good performance even with tons of data. It's why doing
            | machine-learning in real-time is way harder than pre-computing
           | models. Streaming platforms make this all much easier as you
           | can constantly be pre-computing as much as you can, and pre-
           | filling caches, etc.
           | 
           | Of course, having engineers work on 1% performance
           | improvements in the OS kernel, or memory allocators, etc will
           | add up and help a lot too.
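            | 
            | A toy illustration of the rollup idea (all names made up):

```python
from collections import defaultdict

def rollup_hourly(events):
    # Pre-aggregate raw (unix_ts, value) events into hourly sums so the
    # fast path reads one row per hour instead of every raw event
    agg = defaultdict(float)
    for ts, value in events:
        agg[ts - ts % 3600] += value
    return dict(agg)

def query(rollups, start, end):
    # Serve a range query entirely from the pre-computed rollup
    return sum(v for hour, v in rollups.items() if start <= hour < end)
```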
        
           | simcop2387 wrote:
           | I've had them take seconds for suggestions before when doing
           | more esoteric searches. I think there's an inordinate amount
           | of cached suggestions and they have an incredible way to look
           | them up efficiently.
        
       | fabioyy wrote:
       | did you try DPDK?
        
       | strawberrysauce wrote:
       | Your website is super snappy. I see that it has a perfect
       | lighthouse score too. Can you explain the stack you used and how
       | you set it up?
        
       | miohtama wrote:
        | How much headroom would there be if one were to use a
        | unikernel and skip the application space altogether?
        
       | 0xbadcafebee wrote:
       | Very well written, bravo. TOC and reference links makes it even
       | better.
        
       | the8472 wrote:
       | Since it's CPU-bound and spends a lot of time in the kernel would
       | compiling the kernel for the specific CPU used make sense? Or are
       | the CPU cycles wasted on things the compiler can't optimize?
        
       | drenvuk wrote:
        | I'm of two minds with regards to this: This is cool, but
        | unless you have no authentication and no data to fetch remotely
        | or from disk, this is really just telling you what the ceiling
        | is for everything you could possibly run.
       | 
        | As for this article, you tweaked so many knobs to get this to
        | run faster that it's incredibly informative. Thank you for
       | sharing.
        
         | joshka wrote:
         | > this is really just telling you what the ceiling is
         | 
         | That's a useful piece of info to know when performance tuning a
         | real world app with auth / data / etc.
        
       | bigredhdl wrote:
       | I really like the "Optimizations That Didn't Work" section. This
       | type of information should be shared more often.
        
       | Thaxll wrote:
       | There was a similar article from Dropbox years ago:
       | https://dropbox.tech/infrastructure/optimizing-web-servers-f...
       | still very relevant
        
       | 120bits wrote:
       | Very well written.
       | 
        | - I have a nodejs server for the APIs and it's running on an
        | m5.xlarge instance. I haven't done much research on what
        | instance type I should go for. I looked it up and it seems like
        | c5n.xlarge (mentioned in the article) is compute optimized. The
        | cost difference between m5.xlarge and c5n.xlarge isn't much.
        | So, I'm assuming that switching to a c5 instance would be
        | better, right?
        | 
        | - Is having nginx handle the requests a better option here, with
        | a reverse proxy set up for NodeJS? I'm thinking of taking small
        | steps on scaling an existing framework.
        
         | talawahtech wrote:
         | Thanks!
         | 
         | The c5 instance type is about 10-15% faster than the m5, but
         | the m5 has twice as much memory. So if memory is not a concern
         | then switching to c5 is both a little cheaper and a little
         | faster.
         | 
         | You shouldn't need the c5n, the regular c5 should be fine for
         | most use cases, and it is cheaper.
         | 
         | Nginx in front of nodejs sounds like a solid starting point,
         | but I can't claim to have a ton of experience with that combo.
        
         | danielheath wrote:
         | For high level languages like node, the graviton2 instances
         | offer vastly cheaper cpu time (as in, 40%). That's the m6g /
         | c6g series.
         | 
         | As in all things, check the results on your own workload!
        
         | [deleted]
        
         | nodesocket wrote:
         | m5 has more memory, if you application is memory bound stick
         | with that instance type.
         | 
         | I'd recommend just using a standard AWS application load
         | balancer in front of your Node.js app. Terminate SSL at the ALB
         | as well using certificate manager (free). Will run you around
         | $18 a month more.
        
       | secondcoming wrote:
       | Fantastic article. Disabling spectre mitigations on all my team's
       | GCE instances is something I'm going to check out.
       | 
       | Regarding core pinning, the usual advice is to pin to the CPU
       | socket physically closest to the NIC. Is there any point doing
       | this on cloud instances? Your actual cores could be anywhere. So
       | just isolate one and hope for the best?
        
         | brobinson wrote:
          | There are a bunch more mitigations that can be disabled than
          | he disables in the article. I usually refer to
          | https://make-linux-fast-again.com/
        
         | halz wrote:
         | Pinning to the physically closest core is a bit misleading.
         | Take a look at output from something like `lstopo`
         | [https://www.open-mpi.org/projects/hwloc/], where you can
         | filter pids across the NUMA topology and trace which components
         | are routed into which nodes. Pin the network based workloads
         | into the corresponding NUMA node and isolate processes from
         | hitting the IRQ that drives the NIC.
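          | 
          | In concrete terms the pinning step is small. A sketch
          | (assumed paths; root required for the IRQ write):

```python
import os
from pathlib import Path

def pin_irq(irq, cpu_list, root="/proc/irq"):
    # Steer one NIC queue's interrupt to the given CPUs, e.g. "2-3"
    Path(root, str(irq), "smp_affinity_list").write_text(cpu_list)

def pin_self(cpus):
    # Pin the current process onto the same CPUs (Linux-only syscall)
    os.sched_setaffinity(0, cpus)
```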
        
       | ArtWomb wrote:
       | Wow. Such impressive bpftrace skill! Keeping this article under
       | my pillow ;)
       | 
       | Wonder where the next optimization path leads? Using huge memory
       | pages. io_uring, which was briefly mentioned. Or kernel bypass,
       | which is supported on c5n instances as of late...
        
       | diroussel wrote:
       | Did you consider wrk2?
       | 
       | https://github.com/giltene/wrk2
       | 
       | Maybe you duplicated some of these fixes?
        
         | talawahtech wrote:
          | Yea, I looked at wrk2 but it was a no-go right out the gate.
         | From what I recall the changes to handle coordinated omission
         | use a timer that has a 1ms resolution. So basically things
         | broke immediately because all requests were under 1ms.
        
       | specialist wrote:
       | What is the theoretical max req/s for a 4 vCPU c5n.xlarge
       | instance?
        
         | talawahtech wrote:
         | There is no published limit, but based on my tests the network
         | device for the c5n.xlarge has a hard limit of 1.8M pps (which
         | translates directly to req/s for small requests without
         | pipelining).
         | 
         | There is also a quota system in place, so even though that is
         | the hard limit, you can only operate at those speeds for a
         | short time before you start getting rate-limited.
        
       ___________________________________________________________________
       (page generated 2021-05-20 23:00 UTC)