[HN Gopher] Extreme HTTP Performance Tuning
___________________________________________________________________
Extreme HTTP Performance Tuning
Author : talawahtech
Score : 219 points
Date : 2021-05-20 20:01 UTC (2 hours ago)
(HTM) web link (talawah.io)
(TXT) w3m dump (talawah.io)
| jeffbee wrote:
| Very nice round-up of techniques. I'd throw out a few that might
| or might not be worth trying: 1) I always disable C-states deeper
| than C1E. Waking from C6 takes upwards of 100 microseconds, way
| too much for a latency-sensitive service, and it doesn't save
| _you_ any money when you are running on EC2; 2) Try receive flow
| steering for a possible boost above and beyond what you get from
| RSS.
|
| Would also be interesting to discuss the impacts of turning off
| the xmit queue discipline. fq is designed to reduce frame drops
| at the switch level. Transmitting as fast as possible can cause
| frame drops which will totally erase all your other tuning work.
| talawahtech wrote:
| Thanks!
|
| > I always disable C-states deeper than C1E
|
| AWS doesn't let you mess with c-states for instances smaller
| than a c5.9xlarge[1]. I did actually test it out on a 9xlarge
| just for kicks, but it didn't make a difference. Once this test
| starts, all CPUs are 99+% busy for the duration of the test. I
| think it would factor in more if there were lots of CPUs, and
| some were idle during the test.
|
| > Try receive flow steering for a possible boost
|
| I think the stuff I do in the "perfect locality" section[2]
| (particularly SO_ATTACH_REUSEPORT_CBPF) achieves what receive
| flow steering would be trying to do, but more efficiently.
|
| > Would also be interesting to discuss the impacts of turning
| off the xmit queue discipline
|
| Yea, noqueue would definitely be a no-go on a constrained
| network, but when running the (t)wrk benchmark in the cluster
| placement group I didn't see any evidence of packet drops or
| retransmits. Drops only happened with the iperf test.
|
| 1.
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
|
| 2. https://talawah.io/blog/extreme-http-performance-tuning-one-...
| duskwuff wrote:
| Does C-state tuning even do anything on EC2? My intuition says
| it probably doesn't pass through to the underlying hardware --
| once the VM exits, it's up to the host OS what power state the
| CPU goes into.
| jeffbee wrote:
| It definitely works and you can measure the effect. There's
| official documentation on what it does and how to tune it:
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processo...
| xtacy wrote:
| I suspect that the web server's CPU usage will be pretty high
| (almost 100%), so C-state tuning may not matter as much?
|
| EDIT: also, RSS happens on the NIC. RFS happens in the kernel,
| so it might not be as effective. For a uniform request workload
| like the one in the article, statically binding flows to a NIC
| queue should be sufficient. :)
| fierro wrote:
| How can you be sure the estimated max server capability is not
| actually just a limitation in the _client_, i.e., that the
| client maxes out at _sending_ 224k requests / second?
|
| I see that this is clearly not the case here, but in general how
| can one be sure?
| alufers wrote:
| That is one hell of a comprehensive article. I wonder how much
| impact such extreme optimizations would have on a real-world
| application which, for example, does DB queries.
|
| This experiment feels similar to people who buy old cars and
| remove everything from the inside except the engine, which they
| tune up so that the car runs faster :).
| talawahtech wrote:
| This comprehensive level of extreme tuning is not going to be
| _directly_ useful to most people; but there are a few things in
| there like SO_ATTACH_REUSEPORT_CBPF that I hope to see more
| servers and frameworks adopt.
| Similarly I think it is good to
| be aware of the adaptive interrupt capabilities of AWS
| instances, and the impacts of speculative execution
| mitigations, even if you stick to the defaults.
|
| More importantly it is about the idea of using tools like
| Flamegraphs (or other profiling tools) to identify and
| eliminate _your_ bottlenecks. It is also just fun to experiment
| and share the results (and the CloudFormation template). Plus
| it establishes a high water mark for what is possible, which
| also makes it useful for future experiments. At some point I
| would like to do a modified version of this that includes DB
| queries.
| 101008 wrote:
| Yes, my (limited) experience is that what makes YouTube or
| Google or any of those products really impressive is the speed.
|
| YouTube or Google Search suggestion is good, and I think it
| could be replicated with that amount of data. What is insane is
| the speed. I can't think how they do it. I am doing something
| similar for the company I work for and it takes seconds (and the
| amount of data isn't that much), so I can't wrap my head around
| it.
|
| The point is that doing speed alone is not _that_ complicated,
| and doing the algorithms alone is not _that_ complicated. What
| is really hard is to do both.
| ecnahc515 wrote:
| A lot of this is just spending more money and resources to
| make it possible to optimize for speed.
|
| Sufficient caching and a lot of parallelism make this
| possible. That costs money though. Caching means storing
| data twice. Parallelism means more servers (since you'll
| probably be aiming to saturate the network bandwidth for each
| host).
|
| Pre-aggregating data is another part of the strategy, as that
| avoids using CPU cycles in the fast-path, but it means
| storing even more copies of the data!
|
| My personal anecdotal experience with this is with SQL on
| object storage.
| Query engines that use object storage can
| still perform well with the above techniques, even though
| querying large amounts of data from object storage is slow. You
| can bypass the slowness of object storage if you pre-cache the
| data somewhere else that's closer/faster for recent data. You
| can have materialized views/tables for rollups of data over
| longer periods of time, which reduces the data needed to be
| fetched and cached. It also requires less CPU due to working
| with a smaller amount of pre-calculated data.
|
| Apply this to every layer, every system, etc, and you can get
| good performance even with tons of data. It's why doing
| machine-learning in real-time is way harder than pre-computing
| models. Streaming platforms make this all much easier as you
| can constantly be pre-computing as much as you can, and
| pre-filling caches, etc.
|
| Of course, having engineers work on 1% performance
| improvements in the OS kernel, or memory allocators, etc will
| add up and help a lot too.
| simcop2387 wrote:
| I've had them take seconds for suggestions before when doing
| more esoteric searches. I think there's an inordinate amount
| of cached suggestions and they have an incredible way to look
| them up efficiently.
| fabioyy wrote:
| did you try DPDK?
| strawberrysauce wrote:
| Your website is super snappy. I see that it has a perfect
| lighthouse score too. Can you explain the stack you used and how
| you set it up?
| miohtama wrote:
| How much headroom would there be if one were to use a unikernel
| and skip the application space altogether?
| 0xbadcafebee wrote:
| Very well written, bravo. TOC and reference links make it even
| better.
| the8472 wrote:
| Since it's CPU-bound and spends a lot of time in the kernel,
| would compiling the kernel for the specific CPU used make sense?
| Or are the CPU cycles wasted on things the compiler can't
| optimize?
| drenvuk wrote:
| I'm of two minds with regard to this: this is cool, but unless
| you have no authentication and no data to fetch remotely or on
| disk, this is really just telling you what the ceiling is for
| everything you could possibly run.
|
| As for this article, there are so many knobs that you tweaked to
| get this to run faster that it's incredibly informative. Thank
| you for sharing.
| joshka wrote:
| > this is really just telling you what the ceiling is
|
| That's a useful piece of info to know when performance tuning a
| real world app with auth / data / etc.
| bigredhdl wrote:
| I really like the "Optimizations That Didn't Work" section. This
| type of information should be shared more often.
| Thaxll wrote:
| There was a similar article from Dropbox years ago:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| still very relevant
| 120bits wrote:
| Very well written.
|
| - I have a nodejs server for the APIs and it's running on an
| m5.xlarge instance. I haven't done much research on what
| instance type I should go for. I looked it up and it seems like
| the c5n.xlarge (mentioned in the article) is compute optimized.
| The cost difference between m5.xlarge and c5n.xlarge isn't much.
| So I'm assuming that switching to a c5 instance would be better,
| right?
|
| - Is having nginx handle the requests a better option here? And
| setting up a reverse proxy for NodeJS? I'm thinking of taking
| small steps on scaling an existing framework.
| talawahtech wrote:
| Thanks!
|
| The c5 instance type is about 10-15% faster than the m5, but
| the m5 has twice as much memory. So if memory is not a concern
| then switching to c5 is both a little cheaper and a little
| faster.
|
| You shouldn't need the c5n, the regular c5 should be fine for
| most use cases, and it is cheaper.
|
| Nginx in front of nodejs sounds like a solid starting point,
| but I can't claim to have a ton of experience with that combo.
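For reference, the nginx-in-front-of-Node.js setup discussed above usually amounts to a small reverse-proxy config along these lines (a generic sketch; the upstream name, port 3000, and headers are placeholder assumptions, not from the thread):

```nginx
# Hypothetical minimal reverse proxy: nginx terminates client
# connections and forwards requests to a local Node.js server.
upstream node_app {
    server 127.0.0.1:3000;   # assumed Node.js listen port
    keepalive 64;            # reuse upstream connections
}

server {
    listen 80;

    location / {
        proxy_pass http://node_app;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # needed for upstream keepalive
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

With this shape, nginx absorbs slow clients and handles keep-alive and (optionally) TLS, while Node.js only sees clean local connections; it is a common first step before scaling out.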
| danielheath wrote:
| For high-level languages like node, the graviton2 instances
| offer vastly cheaper CPU time (as in, 40%). That's the m6g /
| c6g series.
|
| As in all things, check the results on your own workload!
| [deleted]
| nodesocket wrote:
| m5 has more memory; if your application is memory bound, stick
| with that instance type.
|
| I'd recommend just using a standard AWS application load
| balancer in front of your Node.js app. Terminate SSL at the ALB
| as well using certificate manager (free). Will run you around
| $18 a month more.
| secondcoming wrote:
| Fantastic article. Disabling spectre mitigations on all my team's
| GCE instances is something I'm going to check out.
|
| Regarding core pinning, the usual advice is to pin to the CPU
| socket physically closest to the NIC. Is there any point doing
| this on cloud instances? Your actual cores could be anywhere. So
| just isolate one and hope for the best?
| brobinson wrote:
| There are a bunch more mitigations that can be disabled than he
| disables in the article. I usually refer to
| https://make-linux-fast-again.com/
| halz wrote:
| Pinning to the physically closest core is a bit misleading.
| Take a look at output from something like `lstopo`
| [https://www.open-mpi.org/projects/hwloc/], where you can
| filter pids across the NUMA topology and trace which components
| are routed into which nodes. Pin the network-based workloads
| into the corresponding NUMA node and isolate processes from
| hitting the IRQ that drives the NIC.
| ArtWomb wrote:
| Wow. Such impressive bpftrace skill! Keeping this article under
| my pillow ;)
|
| Wonder where the next optimization path leads? Using huge memory
| pages. io_uring, which was briefly mentioned. Or kernel bypass,
| which is supported on c5n instances as of late...
| diroussel wrote:
| Did you consider wrk2?
|
| https://github.com/giltene/wrk2
|
| Maybe you duplicated some of these fixes?
| talawahtech wrote:
| Yea, I looked at wrk2 but it was a no-go right out of the gate.
| From what I recall, the changes to handle coordinated omission
| use a timer that has a 1ms resolution. So basically things
| broke immediately because all requests were under 1ms.
| specialist wrote:
| What is the theoretical max req/s for a 4 vCPU c5n.xlarge
| instance?
| talawahtech wrote:
| There is no published limit, but based on my tests the network
| device for the c5n.xlarge has a hard limit of 1.8M pps (which
| translates directly to req/s for small requests without
| pipelining).
|
| There is also a quota system in place, so even though that is
| the hard limit, you can only operate at those speeds for a
| short time before you start getting rate-limited.
___________________________________________________________________
(page generated 2021-05-20 23:00 UTC)