# Linux Kernel vs DPDK: HTTP Performance Showdown

July 4, 2022, by Marc Richards

# Overview

In this post I will use a simple HTTP benchmark to do a head-to-head performance comparison between the Linux kernel's network stack and a kernel-bypass stack powered by DPDK. I will run my tests using Seastar, a C++ framework for building high-performance server applications. Seastar has support for building apps that use either the Linux kernel or DPDK for networking, so it is the perfect framework for this comparison. I will build on a lot of the ideas and techniques from my previous performance tuning post, so it might be worth it to at least read the overview section before continuing.

# In Defense of the Kernel

Bypassing the kernel can open up a whole new world of high throughput and low latency. Depending on who you ask, you might hear that bypassing the kernel will result in a 3-5x performance improvement. However, most of those comparisons are done without much optimization on the kernel side. The Linux kernel is designed to be fast, but it is also designed to be multi-purpose, so it isn't perfectly optimized for high-speed networking by default. On the other hand, kernel-bypass technologies like DPDK take a single-minded approach to networking performance: an entire network interface is dedicated to a single application, and aggressive busy polling is used to achieve high throughput and low latency.

For this post I wanted to see what the performance gap would look like when a finely tuned kernel/application goes head to head with kernel-bypass in a no-holds-barred fight. DPDK advocates suggest that bypassing the kernel is necessary because the kernel is "slow", but in reality a lot of DPDK's performance advantages come not from bypassing the kernel, but from enforcing certain constraints. As it turns out, many of these advantages can be achieved while still using the kernel. By turning off some features, turning on others, and tuning the application accordingly, one can achieve performance that approaches kernel-bypass speeds. Here are a few DPDK strategies that can also be accomplished using the kernel:

* Busy polling (interrupt moderation + net.core.busy_poll=1)
* Perfect locality (RSS + XPS + SO_ATTACH_REUSEPORT_CBPF)
* Simplified TCP/IP subsystem (disable iptables, syscall auditing, and AF_PACKET sockets)

One advantage that kernel-bypass technologies still have is that they avoid the syscall overheads that arise from transitioning (and copying data) back and forth between user-land and the kernel. So DPDK should still have the overall advantage; the question is how much of an advantage.

# Disclaimer

Work on this exploratory project was sponsored by the folks over at ScyllaDB, Inc, the primary stewards of the open-source Seastar framework and organizers of P99 CONF. After I spoke at P99 CONF last year, they contacted me to see if there were any areas of mutual interest that we could explore.
My last experiment made me curious about doing a kernel vs DPDK showdown, and Seastar fit the bill perfectly, so this post is the result of that engagement. All technical discussions took place on their public Slack channel and mailing list.

# Roadmap

This post is pretty long, so here is a high-level outline in case you want to jump to a particular area of interest.

# Getting Started

* Building Seastar
* Benchmark Setup

# DPDK Setup and Optimizations

* DPDK on AWS
* DPDK Optimization

# Kernel Stack Optimizations

* Baseline Kernel Performance
* OS Level Optimizations
* Perfect Locality and Busy Polling (It took several tries to get this working)
* Constant Context Switching
* It is better to RECV
* Remember to Flush

# Results, Caveats, and Curiosities

* And the winner is...
* DPDK Caveats
* Speculative Execution Mitigations

# The End

* Conclusion
* Appendix

# Building Seastar

I had a bit of a challenge getting Seastar built initially. I wanted to use Amazon Linux 2 since I am pretty familiar with it, but it became clear that I was fighting a losing battle with outdated dependencies. I switched to vanilla CentOS 8 and managed to get it running despite a few issues, but I still didn't feel like I was on solid enough ground. After a brief dalliance with CentOS Stream 9, I asked for help in the public Slack channel, and I was pointed in the direction of Fedora 34 as the best OS for building the most recent version of the codebase.

I actually did most of my research and testing using Fedora 34 (kernel 5.15), but while Fedora may be great for its cutting-edge updates, sometimes the cutting edge becomes the bleeding edge. By the time I decided to start reproducing my results from scratch, I realized that the latest Fedora 34 updates were upgrading the kernel from 5.11 straight to version 5.16. Unfortunately kernel 5.16 triggered a performance regression for my tests, so I needed an alternative. As it turns out, Amazon Linux 2022 is based on Fedora 34 but has a more conservative kernel update policy, opting to stick with the 5.15 LTS release, so I chose AL 2022 as the new base OS for these tests. In a bit of revisionist history, for the rest of the post I will just pretend like I was using it the whole time.

# HTTP Server

I started out using Seastar's built-in HTTP server (httpd) for my tests, but I decided to go a level down from httpd to a bare-bones TCP server that just pretends to be an HTTP server. The server sends back a fixed HTTP response without doing any parsing or routing. This simplifies my analysis and highlights the effect of each change that I make more clearly.

In particular I wanted to eliminate Seastar's built-in HTTP parser from the equation. Before I removed it, performance would vary significantly based on how many HTTP headers the client sent. So rather than go down the rabbit hole of figuring out what was going on there, I decided to just cheat by using my simple tcp_httpd server instead.

# Source Code

I opened a few PRs on the main Seastar repo based on the work I did for this project, but most of the changes aren't suitable for upstreaming given that they depend on epoll, and current development is now focused around aio and io_uring. All the patches used in this post are available in my Seastar repo on GitHub.

# Benchmark Setup

This is a basic overview of the benchmark setup on AWS.
I used the Techempower JSON Serialization test as the reference benchmark for this experiment.

# Hardware

* Server: 4 vCPU c5n.xlarge instance
* Client: 16 vCPU c5n.4xlarge instance (the client becomes the bottleneck if I try to use a smaller instance size)
* Network: Server and client located in the same availability zone (use2-az1) in a cluster placement group

# Software

* Operating System: Amazon Linux 2022 (kernel 5.15)
* Server: My simple tcp_httpd server: `sudo ./tcp_httpd --reactor-backend epoll`
* Client: I made a few modifications to wrk, the popular HTTP benchmarking tool, and nicknamed it twrk. twrk delivers more consistent results on short, low latency test runs. The standard version of wrk should yield similar numbers in terms of throughput, but twrk allows for improved p99 latencies and adds support for displaying p99.99 latency.

# Benchmark Configuration

I ran twrk manually from the client using the following parameters:

* No pipelining
* 256 connections
* 16 threads (1 per vCPU), with each thread pinned to a vCPU
* 1 second warmup before stats collection starts, then the test runs for 5s

```
twrk --latency --pin-cpus -H 'Host: server.tld' "http://172.31.XX.XX:8080/json" -t 16 -c 256 -D 1 -d 5
```

# DPDK on AWS

Getting Seastar and DPDK working on AWS was no walk in the park. The DPDK documentation for the AWS ENA driver has improved significantly in recent times, but it was a little bit rougher when I started, and it was difficult to find working examples of using Seastar with DPDK. Thankfully, between assistance on the Slack channel and my stubborn persistence, I was able to get things running. Here are some of the highlights for those looking to do the same:

1. DPDK needs to be able to take over an entire network interface, so in addition to the primary interface for connecting to the instance via SSH (eth0/ens5), you will also need to attach a secondary interface dedicated to DPDK (eth1/ens6).
2. DPDK relies on one of two available kernel frameworks for exposing direct device access to user-land: VFIO or UIO. VFIO is the recommended choice, and it is available by default on recent kernels. By default, VFIO depends on hardware IOMMU support to ensure that direct memory access happens in a secure way, however IOMMU support is only available for *.metal EC2 instances. For non-metal instances, VFIO supports running without IOMMU by setting `enable_unsafe_noiommu_mode=1` when loading the kernel module.
3. Seastar uses DPDK 19.05, which is a little outdated at this point. The AWS ENA driver has a set of patches for DPDK 19.05 which must be applied to get Seastar running on AWS. I backported the patches to my DPDK fork for convenience.
4. Last but not least, I encountered a bug in the DPDK/ENA driver that resulted in the following error message: `runtime error: ena_queue_start(): Failed to populate rx ring`. This issue was fixed in the DPDK codebase last year, so I backported the change to my DPDK fork.

Using the tcp_httpd app, I ran my benchmark with DPDK as the underlying network stack:

```
sudo ./tcp_httpd --network-stack native --dpdk-pmd
```

```
Running 5s test @ http://172.31.12.71:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    205.32us   36.57us   1.34ms  62.00us   69.36%
    Req/Sec     74.80k     1.81k    77.85k   69.06k   73.85%
  Latency Distribution
     50.00%  204.00us
     90.00%  252.00us
     99.00%  297.00us
     99.99%  403.00us
  5954189 requests in 5.00s, 0.86GB read
Requests/sec: 1190822.80
```

DPDK performance clocks in at an impressive 1.19M req/s right out of the gate.
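Before digging into the flame graphs, it is worth pausing on what tcp_httpd actually does, since the exact same application is used for both the DPDK and kernel numbers in this post. Stripped of Seastar's reactor machinery, its per-connection behaviour boils down to something like the following plain-epoll loop. This is a simplified, single-threaded sketch for orientation only (minimal error handling, and the canned Techempower-style JSON payload is assumed rather than copied from tcp_httpd); it is not the actual Seastar code:

```cpp
// Bare-bones "pretend HTTP" server: accept connections, read whatever the
// client sent, and reply with a canned HTTP response. No parsing, no routing.
#include <arpa/inet.h>
#include <cerrno>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static const char RESPONSE[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Content-Length: 27\r\n\r\n"
    "{\"message\":\"Hello, World!\"}";

int main() {
    int listener = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int one = 1;
    // SO_REUSEPORT so that several of these loops (one per vCPU) can share the port.
    setsockopt(listener, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, SOMAXCONN);

    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listener;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listener, &ev);

    epoll_event events[256];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(epfd, events, 256, -1);        // wait for readiness events
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listener) {                          // new connection
                int conn = accept4(listener, nullptr, nullptr, SOCK_NONBLOCK);
                if (conn < 0) continue;
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {                                       // request bytes available
                ssize_t r = recv(fd, buf, sizeof(buf), 0);
                if (r == 0 || (r < 0 && errno != EAGAIN)) { close(fd); continue; }
                if (r > 0) send(fd, RESPONSE, sizeof(RESPONSE) - 1, 0);
            }
        }
    }
}
```

Every kernel-side optimization later in the post (busy polling, SO_ATTACH_REUSEPORT_CBPF, the epoll timeout, read vs recv) is ultimately about making a loop like this cheaper; on the DPDK side, Seastar replaces the sockets and epoll with its own user-space TCP stack running directly on top of the NIC.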
# Initial Flame Graph

Flame graphs provide a unique way to visualize CPU usage and identify your application's most frequently used code paths. They are a powerful optimization tool, as they allow you to quickly identify and eliminate bottlenecks. Clicking the image below will open the original SVG file that was generated by the Flamegraph tool. These SVGs are interactive: you can click a segment to drill down for a more detailed view, or you can search (Ctrl + F or click the link at the top right) for a function name.

Note that each complete flame graph captures four near-identical stacks representing the 4 reactor threads (one per vCPU), but throughout the post we will mostly focus on analyzing the data for a single reactor/vCPU.

Flame graph - DPDK without write combining

# Flame Graph Analysis

A quick look at the flame graph is enough to see that the eth_ena_xmit_pkts function looks suspiciously large, weighing in at 53.1% of the total flame graph.

# DPDK Optimization

On 5th+ generation instances the ENA hardware/driver supports an LLQ (Low Latency Queue) mode for improved performance. When using these instances, it is strongly recommended that you enable the write combining feature of the respective kernel module (VFIO or UIO), otherwise performance will suffer due to slow PCI transactions.

The VFIO module doesn't support write combining by default, but the ENA team provides a patch and a script to automate the process of adding WC support to the kernel module. I originally had a couple of issues getting it working with kernel 5.15, but the ENA team was pretty responsive about getting them fixed. The team also recently indicated that they intend to upstream the VFIO patch, which will hopefully make things even more painless in the future.

```
Running 5s test @ http://172.31.12.71:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    153.79us   31.63us   1.43ms  52.00us   68.70%
    Req/Sec     95.18k     2.31k   100.94k   89.75k   68.88%
  Latency Distribution
     50.00%  152.00us
     90.00%  195.00us
     99.00%  233.00us
     99.99%  352.00us
  7575198 requests in 5.00s, 1.09GB read
Requests/sec: 1515010.51
```

Enabling write combining brings performance from 1.19M req/s to 1.51M req/s, a 27% performance increase.

Flame graph - DPDK with write combining

# Flame Graph Analysis

Our flame graph now looks a lot more balanced, and eth_ena_xmit_pkts has dropped from 53.1% of the flame graph to a mere 6.1%.

# A Tall Order

DPDK has thrown down the gauntlet with an absolutely massive showing. 1.51M requests per second on a 4 vCPU instance is HUGE. Can the kernel even get close?

# Baseline Kernel Performance

Starting with an unmodified AL 2022 AMI, tcp_httpd performance starts out at around 358k req/s. In absolute terms this is really, really fast, but it is underwhelming by comparison.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    711.06us   97.91us   1.65ms 108.00us   70.06%
    Req/Sec     22.48k    205.46    23.10k   21.83k   68.62%
  Latency Distribution
     50.00%  696.00us
     90.00%    0.85ms
     99.00%    0.96ms
     99.99%    1.10ms
  1789658 requests in 5.00s, 264.55MB read
Requests/sec: 357927.16
```

Flame graph - Initial

# OS Level Optimizations

I won't go into a lot of detail about the specific Linux changes that I made. At a high level, the changes are very similar in nature to the tweaks that I made to Amazon Linux 2/kernel 4.14 in my previous post.
That being said, there was actually a significant performance regression for this workload going from kernel 4.14 to 5.15, and it took quite a bit of work to get performance back up to par. But I want to stay focused on the kernel vs DPDK comparison for now, so I will save those details for another day, and another post. Here is a high-level overview of the OS optimizations used:

* Disable Speculative Execution Mitigations
* Configure RSS and XPS for perfect locality
* Interrupt Moderation and Busy Polling
* Disable Raw/Packet Sockets (FYI it wasn't quite the same nosy neighbor this time around)
* GRO, Congestion Control, and Static Interrupt Moderation
* A handful of new optimizations

Our OS optimizations took throughput from 358k req/s to a whopping 726k req/s, a solid 103% performance improvement.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    346.76us   86.26us   1.51ms  62.00us   72.62%
    Req/Sec     45.61k     0.88k    48.82k   42.50k   70.15%
  Latency Distribution
     50.00%  347.00us
     90.00%  455.00us
     99.00%  564.00us
     99.99%  758.00us
  3630818 requests in 5.00s, 536.71MB read
Requests/sec: 726153.58
```

Flame graph - OS Optimizations

# Perfect Locality and Busy Polling

The OS level changes to enable perfect locality/busy polling don't really have much effect until the application is properly configured as well. My next step was to add SO_ATTACH_REUSEPORT_CBPF support to my Seastar fork so that the perfect locality setup would be complete.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    338.93us   90.62us   1.56ms  61.00us   68.11%
    Req/Sec     46.57k     2.67k    54.00k   40.32k   64.29%
  Latency Distribution
     50.00%  330.00us
     90.00%  466.00us
     99.00%  562.00us
     99.99%  759.00us
  3706485 requests in 5.00s, 547.89MB read
Requests/sec: 741286.62
```

Throughput moved from 726k req/s to an underwhelming 741k req/s. A 2% performance bump was way below my expectations for this change.

Flame graph - Enable SO_ATTACH_REUSEPORT_CBPF

# Flame Graph Analysis

The flame graph shows zero evidence of busy polling. Perfect locality and busy polling work together in a virtuous cycle, so the lack of busy polling is a strong indicator that something is wrong with our setup.

Perfect locality requires the OS and application to be configured so that once a network packet arrives on a given queue, all further processing is handled by the same vCPU/queue silo for both incoming and outgoing data. This means that both the order in which processes/threads are started and the CPUs to which they are pinned must be controlled.

# Perfect Locality and Busy Polling: Take two

I created a bpftrace script to take a closer look at what was actually going on. The script attaches kprobes to reuseport_alloc() and reuseport_add_sock() to track the process/thread startup order and CPU affinity. The results immediately showed the problem. Even though the reactor threads are started sequentially (tcp_httpd/reactor-0, reactor-1, reactor-2, reactor-3), the CPU pinning is out of order (0, 2, 1, 3).

```
tcp_httpd, cpu=0, socket 0
reactor-1, cpu=2, socket 1
reactor-2, cpu=1, socket 2
reactor-3, cpu=3, socket 3
```

Further investigations revealed that Seastar uses hwloc to understand the hardware topology and optimize accordingly.
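To see why the startup order and pinning matter, it helps to recall what the attached CBPF program actually does. The classic filter simply returns the id of the CPU handling the packet, and the kernel uses that value as an index into the SO_REUSEPORT group, in the order the sockets were added. Below is a sketch of that standard approach (not necessarily the exact code in my Seastar fork); it only steers correctly if socket N in the group is owned by the thread pinned to CPU N.

```cpp
#include <cstdint>
#include <linux/filter.h>   // sock_filter, sock_fprog, BPF_* macros, SKF_AD_*
#include <sys/socket.h>

// Attach a classic BPF program that selects reuseport socket[cpu_id] for each
// incoming connection. Assumes the listening sockets were added to the
// SO_REUSEPORT group in CPU order (socket 0 -> CPU 0, socket 1 -> CPU 1, ...).
static int attach_cpu_steering(int listen_fd) {
    struct sock_filter code[] = {
        // A = id of the CPU that is processing the packet
        { BPF_LD | BPF_W | BPF_ABS, 0, 0,
          static_cast<uint32_t>(SKF_AD_OFF + SKF_AD_CPU) },
        // return A, used as the index into the reuseport group
        { BPF_RET | BPF_A, 0, 0, 0 },
    };
    struct sock_fprog prog = { 2, code };
    return setsockopt(listen_fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                      &prog, sizeof(prog));
}
```

With the pinning out of order (0, 2, 1, 3), the CPU id returned by the filter no longer matches the position of that vCPU's socket in the group, so connections land on a reactor running on a different vCPU and the perfect-locality chain is broken.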
However, Seastar's default CPU allocation strategy (derived from the hwloc topology) is not optimal for our use case, so after raising the issue on the mailing list, I added a function to my fork that exposes the mapping between reactor shard ids and cpu ids to apps that build on Seastar. I modified tcp_httpd to ensure that the cpu ids and socket ids matched. This resulted in the expected output from my bpftrace script.

```
tcp_httpd, cpu=0, socket 0
reactor-2, cpu=1, socket 1
reactor-1, cpu=2, socket 2
reactor-3, cpu=3, socket 3
```

Performance improves slightly, but still leaves a lot to be desired.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    317.99us   74.65us   1.39ms  78.00us   76.29%
    Req/Sec     49.51k     2.01k    54.74k   44.35k   68.88%
  Latency Distribution
     50.00%  312.00us
     90.00%  405.00us
     99.00%  531.00us
     99.99%  749.00us
  3938893 requests in 5.00s, 582.25MB read
Requests/sec: 787768.20
```

A 6% performance improvement this time, but still well below expectations.

Flame graph - Optimize SO_ATTACH_REUSEPORT_CBPF

# Flame Graph Analysis

The flame graph doesn't show much in the way of change either, and there is still no busy polling happening, so something else is wrong. I dug into my bag of performance analysis tools to see if I could figure out what else was going on.

# Impatiently Waiting

I was able to use libreactor as a point of reference for how a fully optimized epoll-based HTTP server should behave, and contrast that against tcp_httpd. Running a 10 second syscount trace (`syscount -d 10`) while the benchmark ran for both libreactor and tcp_httpd produced some enlightening results:

```
# libreactor
SYSCALL            COUNT
recvfrom         9755167
sendto           9353652
epoll_wait        754685
read                  94
bpf                   43
newfstatat            18
ppoll                 11
pselect6               7
futex                  5
write                  5

# tcp_httpd
SYSCALL            COUNT
epoll_pwait      7525419
read             7272935
sendto           6926720
epoll_ctl         824992
poll               76612
timerfd_settime    34276
rt_sigprocmask     11356
ioctl               6447
membarrier          5676
newfstatat            18
```

For libreactor, the top two syscalls were send/recv, with epoll_wait coming in at a distant third. Conversely, with tcp_httpd, epoll_pwait was the number one syscall. This was a pretty good indicator that I needed to take a look at how epoll_pwait was called in the Seastar codebase.

The epoll_pwait syscall waits for events associated with file descriptors. In our case we are dealing with socket file descriptors (each representing a TCP connection) specifically, and each event indicates readiness to send or receive data. The original epoll_(p)wait syscall can be thought of as taking 3 types of values for the timeout parameter:

* -1: The syscall waits indefinitely for an event
* 0: The syscall returns immediately, whether or not any events are ready
* n: The syscall waits until either a file descriptor delivers an event or n milliseconds have passed

libreactor uses a relatively simple reactor engine built entirely around epoll, so it can afford to wait indefinitely for the next event. Seastar, on the other hand, is a bit more sophisticated. Seastar supports a number of different high-resolution timers, poll functions, and cross-reactor message queues, and it tries to enforce certain guarantees about how long tasks are expected to run. Within the main do_run loop, Seastar calls epoll_pwait with a timeout of 0 (it doesn't wait at all), which is why we are not seeing any busy polling happen. Calling epoll_pwait with an indefinite timeout is a non-starter for Seastar, and even calling it with the epoll_pwait minimum value of 1ms is probably a little too long.
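To make those three timeout modes concrete, here is how they look at the call site. This is a standalone illustration with hypothetical variable names, not code from either project:

```cpp
#include <sys/epoll.h>

// Illustration only: the three ways of using epoll_wait's millisecond timeout.
void timeout_modes(int epfd, epoll_event* events, int max_events) {
    // 1. Block until at least one event arrives -- what libreactor can afford to do.
    int n_block = epoll_wait(epfd, events, max_events, -1);

    // 2. Return immediately, ready or not -- what Seastar's do_run loop does,
    //    which is why the reactor never stays in the kernel long enough to busy poll.
    int n_poll = epoll_wait(epfd, events, max_events, 0);

    // 3. Wait for at most 1 ms -- the smallest non-zero timeout epoll_(p)wait can express.
    int n_wait = epoll_wait(epfd, events, max_events, 1);

    (void)n_block; (void)n_poll; (void)n_wait;
}
```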
That even 1ms would be too long is suggested by the fact that Seastar's default value for how long tasks should run in a single cycle (task-quota-ms) is 0.5 (500us). In order to strike a balance between the framework's latency expectations and my performance goals, I decided to make use of the relatively new epoll_pwait2 syscall. epoll_pwait2 is equivalent to epoll_pwait, but the timeout argument can be specified with nanosecond resolution. I settled on a timeout value of 100us as a good balance between performance and latency guarantees.

The new syscall is available as of kernel 5.11, but the corresponding glibc wrapper isn't available until glibc 2.35, and Amazon Linux 2022 ships with glibc 2.34. To work around that, I hacked up a wrapper function named epoll_pwait_us (sketched a little further down), and I updated my Seastar fork to call it with a value of 100.

```
Running 5s test @ http://172.31.XX.XX:8080/
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    273.38us   39.11us   1.37ms  79.00us   71.32%
    Req/Sec     57.48k    742.64    59.34k   55.62k   67.98%
  Latency Distribution
     50.00%  271.00us
     90.00%  322.00us
     99.00%  378.00us
     99.99%  613.00us
  4575332 requests in 5.00s, 676.32MB read
Requests/sec: 915053.04
```

Performance moves from 788k req/s to 915k req/s, a solid 16% jump. Now we're cooking!

Flame graph - epoll_wait 100us

# Flame Graph Analysis

Looking at the flame graph you can clearly see that busy polling has finally kicked in, and looking at our syscall count we see the expected pattern emerge.

```
# tcp_httpd
SYSCALL            COUNT
read             8422317
sendto           7964784
epoll_ctl         450827
epoll_pwait2      375947
poll               79836
ioctl                202
bpf                   49
newfstatat            18
ppoll                 11
```

# Constant Context Switching

I continued comparing tcp_httpd to libreactor using a few more perf tools to see if I could pick up any more anomalies. Sure enough, using `sar -w 1` to monitor context switches produced some eyebrow-raising numbers for tcp_httpd.

```
# libreactor
01:13:50 AM    proc/s   cswch/s
01:13:57 AM      0.00    277.00
01:13:58 AM      0.00    229.00
01:13:59 AM      0.00    290.00
01:14:00 AM      0.00    340.00

# tcp_httpd
01:03:03 AM    proc/s   cswch/s
01:03:04 AM      0.00  17132.00
01:03:05 AM      0.00  17060.00
01:03:06 AM      0.00  17048.00
01:03:07 AM      0.00  17026.00
```

Looking at the flame graph without zooming in, I noticed that for each reactor thread, Seastar creates a matching timer thread named timer-0, timer-1, etc. At first I didn't pay much attention to them since I wasn't explicitly setting any timers, and they were barely visible on the flame graph, but in light of the context switching numbers I decided to take a closer look.

For each reactor/cpu core, start_tick() starts a thread using the task_quota_timer_thread_fn() function. The function waits for the reactor's _task_quota_timer to expire and then interrupts the main thread by calling request_preemption(). This is done to make sure that the tasks on the main thread don't hog resources by running for more than X ms without preemption. But for our specific workload, it causes excessive context switching and tanks performance. What we want to do is set it just long enough so that reactor::run_some_tasks() can complete all tasks and reset preemption without ever being interrupted.

It should be noted that Seastar's default aio backend seems to make use of some aio-specific preempting functionality to handle the task quota, so this particular behavior is limited to the epoll backend.

Seastar allows users to pass in a value to set the --task-quota-ms via the command line. The default value is 0.5, but I found 10ms to be a more reasonable value for this workload.
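As promised, here is roughly what an epoll_pwait_us-style shim can look like when the glibc wrapper for epoll_pwait2 isn't available. This is a sketch built directly on the raw syscall and it does not forward a signal mask; the actual helper in my fork may differ in its details.

```cpp
#include <ctime>
#include <sys/epoll.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_epoll_pwait2
#define SYS_epoll_pwait2 441   // same number across modern Linux architectures
#endif

// Wait for epoll events with a microsecond-resolution timeout by invoking
// epoll_pwait2 (kernel >= 5.11) directly, since glibc only gained a wrapper
// in 2.35 and Amazon Linux 2022 ships glibc 2.34.
static int epoll_pwait_us(int epfd, struct epoll_event* events, int maxevents,
                          long timeout_us) {
    struct timespec ts;
    ts.tv_sec = timeout_us / 1000000;
    ts.tv_nsec = (timeout_us % 1000000) * 1000;
    // With a null sigmask, the final sigsetsize argument is ignored by the kernel.
    return static_cast<int>(syscall(SYS_epoll_pwait2, epfd, events, maxevents,
                                    &ts, nullptr, 0));
}
```

The reactor's polling call then becomes something along the lines of `epoll_pwait_us(epfd, events, n, 100)` instead of `epoll_pwait(..., 0)`, which is what produces the 100us busy-wait window seen above. With that digression out of the way, back to the context switches: here is what sar reports once --task-quota-ms is raised to 10.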
```
# tcp_httpd with --task-quota-ms 10
01:04:58 AM    proc/s   cswch/s
01:04:59 AM      0.00   1327.00
01:05:00 AM      0.00   1303.00
01:05:01 AM      0.00   1339.00
01:05:02 AM      0.00   1296.00
```

The number of context switches per second dropped dramatically, from 17k to 1.3k.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    259.14us   29.51us   1.51ms  77.00us   71.92%
    Req/Sec     60.55k    532.97    61.77k   58.82k   66.71%
  Latency Distribution
     50.00%  257.00us
     90.00%  296.00us
     99.00%  337.00us
     99.99%  557.00us
  4820680 requests in 5.00s, 712.59MB read
Requests/sec: 964121.54
```

Throughput moved from 915k req/s to 964k req/s, a 5.3% improvement.

Flame graph - task-quota-ms 10

# Flame Graph Analysis

The change in the flame graph is pretty subtle. If you zoom out to the "all" view and then search for timer- you will see a small section to the far right go from 0.7% of the previous flame graph to 0.1% of the current one. Flame graphs are extremely useful, but they don't always capture the performance impact in a proportional way. Sometimes you have to rummage around in your perf toolbox to find the right tool to pick up an anomaly.

# It is better to RECV

This is a simple fix that I figured out back when I was optimizing libreactor. When working with sockets, it is a little more efficient to use Linux's recv/send functions than the more general-purpose read/write. Generally the difference is negligible, however when you move beyond 50k req/s it starts to add up. Seastar was already using send for outgoing data, but it was using read for incoming requests, so I made the relatively simple change to switch it to recv instead.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    253.53us   30.51us   1.21ms  93.00us   74.63%
    Req/Sec     61.72k    597.21    62.99k   58.46k   71.81%
  Latency Distribution
     50.00%  250.00us
     90.00%  291.00us
     99.00%  342.00us
     99.99%  652.00us
  4911503 requests in 5.00s, 726.02MB read
Requests/sec: 982287.44
```

Throughput moved from 964k req/s to 982k req/s, just under a 2% performance improvement.

Flame graph - recv syscall

# Flame Graph Analysis

If you look at the read/recv stack on the left side of the flame graph you will see that __libc_recv gets to the point a lot more directly than __libc_read.

# Remember to Flush

I found the final optimization by roaming around the codebase and switching things on/off to see what they did. When using the epoll reactor backend, the batch_flushes option on output_stream defers the call to send() when flush() is invoked, instead of sending right away. It is designed as an optimization for RPC workloads that may call flush() multiple times, but it doesn't provide any benefit for our simple request/response workload. As a matter of fact, it adds a little bit of overhead, so as a quick fix I just disabled batch_flushes.

```
Running 5s test @ http://172.31.XX.XX:8080/json
  16 threads and 256 connections
  Thread Stats   Avg      Stdev     Max      Min   +/- Stdev
    Latency    246.66us   34.32us   1.25ms  61.00us   74.07%
    Req/Sec     63.30k     0.88k    65.72k   61.63k   66.84%
  Latency Distribution
     50.00%  246.00us
     90.00%  288.00us
     99.00%  333.00us
     99.99%  436.00us
  5038933 requests in 5.00s, 744.85MB read
Requests/sec: 1007771.89
```

Throughput moved from 982k req/s to 1.0M req/s, a 2.2% performance boost.
Flame graph - fully optimized

# Flame Graph Analysis

The flame graph shows that the send stack moved from batch_flush_pollfn::poll to output_stream::flush.

Our optimization efforts have rewarded us with a psychologically satisfying base 10 number for our final figure: 1.0M req/s using nothing more than the good old Linux kernel.

# And the winner is...

In the end, DPDK still maintains a solid 51% performance lead over the kernel. Whether that is a lot or a little depends on your perspective. The way I look at it, when you compare the unoptimized and optimized versions of the kernel/application, we have narrowed DPDK's performance advantage from 4.2x to just 1.5x.

Graph - kernel vs DPDK

# DPDK Caveats

DPDK's 51% advantage is nothing to scoff at, however I would be remiss if I sent you down a DPDK rabbit hole without adding some disclaimers about DPDK's challenges.

1. To start, it is a bit of a niche technology, so finding articles and examples online (especially for use-cases outside established areas) can be challenging.
2. Bypassing the kernel means you also bypass its time-tested TCP stack. If your application uses a TCP-based protocol like HTTP, you need to provide your own TCP networking stack in userspace. There are frameworks like Seastar and F-Stack that help, but migrating your application to them may be non-trivial.
3. Working with a custom framework might also mean that you are tied to the specific DPDK version that it supports, which might not be the version supported by your network driver or kernel.
4. In bypassing the kernel you also bypass a rich ecosystem of existing tools and features for securing, monitoring and configuring your network traffic. Many of the tools and techniques that you are accustomed to no longer work.
5. If you use poll-mode processing your CPU usage will always be 100%. In addition to not being energy efficient/environmentally friendly, it also makes it difficult to quickly assess/troubleshoot your workload using CPU usage as a gauge.
6. DPDK-based applications take full control of the network interface, which means:
   + You must have more than one interface.
   + If you want to modify device settings, you have to do it before startup, or through the application.
   + If you want to capture metrics, the application has to be configured to do it; it is much harder to troubleshoot on the fly.

That being said, there may be reasons to pursue a custom TCP/IP stack other than pure performance. An in-application TCP stack allows the application to precisely control memory allocation (avoiding contention for memory between the application and kernel) and scheduling (avoiding contention for CPU time). This can be important for applications that strive not only for maximum throughput, but also for excellent p99 latency. At the end of the day it is about balancing priorities. As an example, even though the ScyllaDB team occasionally gets reports of reactor stalls related to the kernel network stack, they still choose to stick with the kernel for their flagship product, because switching to DPDK would be far from a simple undertaking.

# Speculative Execution Mitigations

Early on in this post I glossed over the OS level optimizations that I made before I started optimizing the application. From a high level the changes were similar to my previous post, and for those who want to dig into more details, I plan to write up a kernel 4.14 vs 5.15 post "at some point".
Nevertheless, there is one particular optimization that deserves further analysis in this kernel vs DPDK showdown: disabling speculative execution mitigations. I won't rehash my opinion on these mitigations; you can read it here. For the purposes of this post I turned them off, but if you look at the graph below you will see that turning them back on shows some interesting results.

Graph - kernel vs DPDK with mitigations on/off

As you can see, while disabling mitigations yields a 33% performance improvement on the kernel side, it has zero impact on DPDK performance. This leads to two main takeaways:

1. For environments where speculative execution mitigations are a must, DPDK represents an even bigger performance improvement over the kernel TCP stack.
2. Kernel technologies like io_uring that bypass the syscall interface for I/O have even greater potential for improving performance on the majority of workloads. Most people don't disable Spectre mitigations, so solutions that work with them enabled are important.

I am not 100% sure that all of the mitigation overhead comes from syscalls, but it stands to reason that a lot of it arises from security hardening in user-to-kernel and kernel-to-user transitions. The impact is certainly visible in the syscall-related functions on the flame graph.

Flame graph - Mitigations ON

# Conclusion

We have demonstrated that even when the OS and application are optimized to the extreme, DPDK still has a 51% performance lead over the kernel networking stack. Instead of seeing that difference as an insurmountable hurdle, I see the gap as unrealized potential on the kernel side. The gap simply raises the question: to what extent can the Linux kernel be further optimized for thread-per-core applications without compromising its general-purpose nature? DPDK gives us an idea of what is possible under ideal circumstances, and serves as a target to strive towards. Even if the gap can't fully be closed, it quantifies the task and throws the obstacles into sharper focus.

One very obvious obstacle is the overhead of the syscall interface when doing millions of syscalls per second. Thankfully, io_uring seems to offer a potential solution for that particular challenge. I have been keeping a close eye on io_uring, as it is still under pretty heavy development. I am particularly excited to see the recent wave of networking-focused optimizations like busy poll support, recv hints, and even experimental explorations like lockless TCP support. It remains very high on my list of things to test "real soon".

# Appendix

# Special Thanks

Special thanks to my reviewers, Dor and Kenia, and to everyone on the Seastar Slack channel and mailing list, particularly Piotr, Avi and Max.

# C/C++ Primers

I used my limited C knowledge, combined with basic pattern recognition, to fumble around Seastar's C++ codebase for way longer than I probably should have, but when it came time to add the get_cpu_to_shard_mapping() function, I decided to stop fooling myself and learn a little C++. If you find yourself in a similar predicament, I recommend A Tour of C++ as a decent primer. If you need a quick C refresher as well, I recommend Essential C and Pointers and Memory.