[HN Gopher] Linux network performance parameters ___________________________________________________________________ Linux network performance parameters Author : dreampeppers99 Score : 285 points Date : 2023-09-06 11:56 UTC (8 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | freedomben wrote: | Could anyone recommend a video or video series covering similar | material? | | There's lots on networking in general, but I've had a hard time | finding some on the Linux-specific implementation. | 8K832d7tNmiQ wrote: | I'm also seconding this, but from a microcontroller perspective. | | I want to try developing a simple TCP echo server for a | microcontroller, but most examples just use the vendor's own | TCP library and put no effort into explaining how to manually set up | and establish a connection to the router. | patmorgan23 wrote: | Well, you can always read the standard. | [deleted] | mikece wrote: | How long until Linux reaches network performance parity with | FreeBSD (or surpasses it)? | [deleted] | [deleted] | dekhn wrote: | Linux has better network performance than FreeBSD in nearly | every use case I've seen. | nolist_policy wrote: | From every benchmark I've seen so far, Linux has always been | faster than the BSDs. | | For example, look at these benchmarks from 2003[1]. Makes you | wonder where the myth comes from. | | The newest benchmark I could find[2] points in the same | direction. | | Does anyone have more recent data? | | [1] http://bulk.fefe.de/scalability/ [2] | https://matteocroce.medium.com/linux-and-freebsd-networking-... | Thaxll wrote: | This is kind of an urban legend; do you think the multi- | million-server fleets at Google, Amazon, etc. have those | performance issues? | dijit wrote: | This is "appeal to authority" fallacy incarnate. | | Google/Amazon "etc;" are likely happy to pay the cost because | it really is "good enough" and the benefits of Linux over | FreeBSD are otherwise quite considerable. | | Google in particular seems blissfully happy to literally | throw hardware at problems, since hardware is (for them | especially) fundamentally extremely cheap. | | Even multiple percentage gains in _throughput_ are not | necessary for most applications, and Linux is decent enough | with latency if you avoid having complex IP/NFTables rules | and avoid CONNTRACK like the plague. | | As u/jeffbee says anyway, most of the larger tech companies | these days are using userland networking and bypassing the | kernel almost completely. | [deleted] | Thaxll wrote: | I know they bypass the kernel but my point still stands: | most of the servers on the internet run on Linux, that's a | fact, so there has been more money, time, and manpower | invested in that OS than in any other. | dijit wrote: | Your point is that popularity means that it will improve. | | This is true, to a point. | | Counterpoint: Windows Desktop Experience. | | EDIT: that comment was glib, let me do a proper | counterpoint. | | Common areas are some of the least maintained in | reality; I can think of meet-me-rooms or central fibre | hubs in major cities; they are expensive and subject to a | lot of the whims of the major provider. | | Crucially, despite large amounts of investment, the | underlying architecture or infrastructure remains, even | if the entire fabric of the area changes around it. 
| _Most_ providers using these kinds of common areas do | everything they can to _avoid touching_ the area itself, | especially as after a while it becomes very difficult to | navigate and politically charged. | | Fundamentally, the _architecture_ of Linux's network | stack really is "good enough", which is almost worse | than you would originally think, since "good enough" means | there's no reason to look there. There is an old parable | about "worse is better": if something is truly | broken, people will put effort into fixing it. | | Linux's networking stack is _fine_, it's just not quite | as good an architecture as the FreeBSD one. The FreeBSD one | has a lot less attention on it, but fundamentally it's a | cleaner implementation and easier to get much more out | of. | | You will find the same argument ad infinitum regarding | other subjects such as Epoll vs IOCP vs kqueue (Epoll was | _abysmally terrible_, though, and ended up being replaced | by IO_URING, but even that took over a decade). | [deleted] | Thaxll wrote: | Yes, things improve when we're talking about multiple | billions of dollars of infra cost. | | Linux is not your "random on the side" feature that is | good enough. | dijit wrote: | To start with: it's not that much infra cost. | | Especially since you don't even know what you're | attempting to optimise for. | | Latency? The p99 of Linux is _fine_; nobody is going to care | that the request took 300ms longer. Even in aggregate | across a huge fleet of machines, waiting an extra 3ms is | totally, _totally_ fine. | | Throughput? You'll bottleneck on something else most | likely anyway; getting a storage array to hydrate at line | rate for 100Gb/s is difficult, and _anyway_ you want to do | authentication and distribution of chunks and metadata | operations anyway, right? | | You're forgetting that it's _likely_ an additional cost | of a couple million dollars per year in absolute hardware | to solve that issue with throughput, which is, in TCO | terms, a couple of developers. | | Engineering effort to replace the foundation of an OS? | Probably an order of magnitude more. It definitely carries | significantly more risk, and the potential for political | backlash from upheaving some other company's | workflow that is weird. | | Hardware isn't so expensive really. | | Of course, you could just bypass the kernel with much | less effort and avoid all of this shit entirely. | tptacek wrote: | Do you know for a fact that Google primarily uses userland | networking, or does that just seem accurate to you? | drewg123 wrote: | Google makes heavy use of userspace networking. I was | there roughly a decade ago. At least at that time, a | major factor in the choice of userspace over kernel | networking was time to deployment. Services like the ones | described above were built on the monorepo, and could be | deployed in seconds at the touch of a button. | | Meanwhile, Google had a building full of people | maintaining the Google kernel (eg, maintaining rejected | or unsubmitted patches that were critical for business | reasons), and it took many months to do a kernel release. | tptacek wrote: | Yes. I don't think anyone is disputing that Google does | significant userspace networking things. But the premise | of this thread is that "ordinary" (ie: non-network- | infrastructure --- SDN, load balancer, routing) | applications, things that would normally just get BSD | sockets, are based on userspace networking. That seems | not to be the case. 
| dijit wrote: | I can't honestly answer that with the NDA I signed. | | However there is some public information on _some_ | components that has been shared in this thread which | allows you to draw your own conclusion. | tptacek wrote: | Yes, the one link shared says essentially the opposite | thing. | dijit wrote: | You may have read it wrong. | tptacek wrote: | Did I? How? | dijit wrote: | For one by assuming the work that is done primarily for | microkernels/appliances is the absolute limit of | userspace networking at Google and that similar work | would not go into a hypervisor (hypervisors which are | universally treated as a vSwitch in almost all virtual | environments the world over). | | And making that assumption when there are many public | examples of Google doing this in other areas such as | gVisor and Netstack? | tptacek wrote: | If you have information about other userspace networking | projects at Google, I'd love to read it, but the Snap | paper repeatedly suggests that the userspace networking | characteristics of the design are distinctive. Certainly, | most networking at Google isn't netstack. Have you done | much with netstack? It is many things, but ultra-high- | performance isn't one of them. | dijit wrote: | userspace networking will take different forms depending | on the use-case. | | Which is one of the arguments of why to do it that way; | instead of using general purpose networking. | | I haven't the time or inclination to find anything public | on this, nor am I interested really in convincing you. | Ask a former googler. | tptacek wrote: | OK. I did. They said "no, it's not the case that | networking at Google is predominately user-mode". (They | also said "it depends on what you mean by most"). Do you | have more you want me to relay to them? Did you work on | this stuff at Google? | | Per the Snap thread above: if you're building a router or | a load balancer or some other bit of network | infrastructure, it's not unlikely that there's userland | IP involved. But if you're shipping a normal program on, | like, Borg or whatever, it's kernel networking. | dijit wrote: | I worked as a Google partner for some specialised | projects within AAA online gaming. | | I continue in a similar position today and thus my NDA is | still in complete effect which limits what I can say if | there's nothing public. | | I have not worked for Google, just very closely. | tptacek wrote: | Oh. Then, unless a Googler jumps in here and says I'm | wrong: no, ordinary applications at Google are not as a | rule built on userspace networking. That's not my opinion | (though: it was my prior, having done a bunch of | userspace networking stuff), it's the result of asking | Google people about it. | | Maybe it's all changed in the last year! But then: that | makes all of this irrelevant to the thread, about FreeBSD | vs. Linux network stack performance. | ori_b wrote: | Do you think that (outside of a few special cases) they're | using anything near the network bandwidth available to them? | | I would expect in the 1% to 10% bandwidth utilization, on | average. From my vague recollection, that's what it was at FB | when I was there. They put stupid amounts of network capacity | in so that the engineers rarely have to think about the | capacity of the links they're using, and that if their needs | grow, they're not bottlenecked on a build out. | | To answer the original question, it's complicated. 
I have a | weird client where freebsd gets 450 MiB/s, and Linux gets 85 | with the default congestion control algorithm. Changing the | congestion control algorithm can get me between 1.7 MiB/s and | 470 MiB/s. So, better performance... Under what | circumstances? | jeffbee wrote: | The big guys don't have the patience to wait for Linux kernel | networking to be fast and scalable. They bypass the kernel | and take over the hardware. | | https://blog.acolyer.org/2019/11/11/snap-networking/ | tptacek wrote: | _Over the course of several years, the architecture | underpinning Snap has been used in production for multiple | networking applications, including network virtualization | for cloud VMs [19], packet-processing for Internet peering | [62], scalable load balancing [22], and Pony Express, a | reliable transport and communications stack that is our | focus for the remainder of this paper._ | | This paper suggests, as I would have expected, that Google | uses userland networking in strategic spots where low-level | network development is important (SDNs and routing), and | not for normal applications. | jeffbee wrote: | "and Pony Express" is the operative phrase. As the paper | states on page 1, "Snap is deployed to over half of our | fleet of machines and supports the needs of numerous | teams." According to the paper it is not niche. | nolist_policy wrote: | Makes sense, they're probably using QUIC in lots of | products and the kernel can't accelerate that anyways, it | would only pass opaque UDP packets to and from the | application. | devonkim wrote: | Last I remember as of at least 7 years ago Google et al | were using custom NIC firmware to avoid having the kernel | get involved in general (I think they managed to do a lot | of Maglev directly on the NICs) because latency is so | dang important at high speed networking speeds that | letting anything context switch and need to wait on the | kernel is a big performance hit. Not a lot of room for | latency when you're working at 100 Gbps. | tptacek wrote: | Isn't Pony Express a ground-up replacement for all of | TCP/IP? It doesn't even present a TCP/UDP socket | interface. | jeffbee wrote: | Correct. That is my point. The sockets interface, and | design choices within the Linux kernel, make ordinary TCP | sockets too difficult to exploit in a datacenter | environment. The general trend is away from TCP sockets. | QUIC (HTTP/3) is a less extreme retreat from TCP, moving | all the flow control, congestion, and retry logic out of | the kernel and into the application. | | An example of how Linux TCP is unsuitable for datacenters | is that the minimum RTO is hard-coded to 200ms, which is | essentially forever. People have been trying to land | better or at least more configurable parameters upstream | for decades. I am hardly the first person to point out | the deficiencies. Google presented tuning Linux for | datacenter applications at LPC 2022, and their deck has | barely changed in 15 years. | tptacek wrote: | At the point where we're talking about applications that | don't even use standard protocols, we've stopped | supplying data points about whether FreeBSD's stack is | faster than Linux's, which is the point of the thread. | | _Later_ | | Also, the idea that QUIC is a concession made to | intractable Linux stack problems (the subtext I got from | that comment) seems pretty off, since the problems QUIC | addresses (HOLB, &c) are old, well known, and were the | subject of previous attempts at new transports (SCTP, | notably). 
| corbet wrote: | That's funny ... the "big guys" are some of the biggest | contributors to the Linux network stack, almost as if they | were actually using it and cared about how well it works. | jeffbee wrote: | History has shown that tons of Linux networking | scalability and performance contributions have been | rejected by the gatekeepers/maintainers. The upstream | kernel remains unsuitable for datacenter use, and all the | major operators bypass or patch it. | eddtests wrote: | Do you have links on this? I've not heard anything about | it. | tptacek wrote: | I believe they're paraphrasing the Snap paper, and also | that they're extrapolating too far from it. | sophacles wrote: | All the major operators sometimes bypass or patch it for | some use cases. For others they use it as is. For others | still, they laugh at you for taking the type of drugs that | makes one think any CPU is sufficient to handle | networking in code. | | Networking isn't a one-size-fits-all thing - different | networks have different needs, and different systems in | any network will have different needs. | | Userland networking is great until you start needing to | deal with weird flows or unexpected traffic - then you | end up either needing something a bit more robust, and | your performance starts dropping because you added a | bunch of branches to your code, or you switch over to a | kernel implementation that handles those cases. I've seen | a few cases of userland networking being slower than just | using the kernel - and being kept because sometimes what | you care about is control over packet lifecycle more | than raw throughput. | | Kernels prioritize robust network stacks that can handle | a lot of cases well enough. Different implementations | handle different scenarios better - there's plenty of | very high performance networking done with vanilla Linux | and vanilla FreeBSD. | sophacles wrote: | Performance parity on which axis? For which use case? | | Talking generally about "network performance" is approximately | as useful as talking generally about "engine performance". Just | like it makes no sense to compare a weed-eater engine to a | locomotive diesel without talking about use case and desired | outcomes, it makes no sense to compare "performance of the FreeBSD | network stack" and "the Linux network stack" without understanding | the role those systems will be playing in the network. | | Depending on context, FreeBSD, Linux or various userland stacks | can be a great, average, or terrible choice. | circularfoyers wrote: | Can you provide some examples of different contexts where | Linux or FreeBSD might be better or worse choices? | sophacles wrote: | Sure: | | Linux is a networking Swiss Army knife (or maybe a | Dremel). It can do a lot of stuff reasonably well. It has | all sorts of knobs and levers, so you can often configure | it to do really weird stuff. I tend to reach for it first | to understand the shape of a problem/solution. | | BSD is fantastic for a lot of server applications, | particularly single-tenant, high-throughput ones like mail | servers, dedicated app servers, etc. A great series of case | studies has come out of Netflix on this (google for | "800Gbps on freebsd netflix" for example - every iteration | of that presentation is fantastic and usually discussed | here at least once, and Drew G. shows up in comments and | answers questions). 
| | It's also pretty nice for firewalling/routing small and | medium networks - (opn|pf)sense are both great systems for | this built on FreeBSD (apologies for the drama injection | this may cause below). | | One of the reasons I reach for Linux first, unless I already | know the scope and shape of the problem, is that the entire | "userland vs kernel" distinction is much blurrier there. | Linux allows you to pass some or all traffic to userland at | various points in the stack and in various ways, and inject | code at the kernel level via eBPF, leading to a lot of | hybrid solutions - this is nice in middleboxes where you | want some dynamism and control, particularly in multi- | tenant networks (and that's the space my work is in, so it's | what I know best). | | Please bear in mind that these are my opinions and | uses/takes on the tools. Just like with programming, there's | a certain amount of "art" (or maybe "craft") to this, and | other folks will have different (but likely just as valid) | views - there are a lot of ways to do anything in networking. | [deleted] | doctorpangloss wrote: | Does performance tuning for Wi-Fi adapters matter? | | On desktops, other than disabling features, can anything fix the | problems with i210 and i225 Ethernet? Those seem to be the two | most common NICs nowadays. | | I don't really understand why common networking hardware and | drivers are so flawed. There is a lot of attention paid to | RISC-V. How about starting with a fully open and correct NIC? | They'll shove it in there if it's cheaper than an i210. Or maybe | that's impossible. | jeffbee wrote: | i225 is just broken, but I get excellent performance from i210. | 1Gb is hardly challenging on a contemporaneous CPU, and the | i210 offers 4 queues. What's your beef with i210? | doctorpangloss wrote: | There are a lot of problems with the i210. Here's a sample: | | https://www.google.com/search?q=i210+proxmox+e1000e+disable | | Most people don't really use their NICs "all the time" "with | many hosts." The i210 in particular will hang after a few | months of e.g. etcd cluster traffic on 9th and 10th gen | Intel, which is common for SFF PCs. | | On Windows, the NDIS driver works a lot better. Many | disconnects under similar traffic load as on Linux, and features | like receive side coalescing are broken. They also don't | provide proper INFs for Windows server editions, just | because. | | I assume Intel does all of this on purpose. I don't think | their functionally equivalent server SKUs are this broken. | | Apparently the 10Gig patents are expiring very soon. That | will make Realtek, Broadcom and Aquantia's chips a lot | cheaper. IMO, motherboards should be much smaller, shipping | with BMC and way more rational IO: SFP+, 22110, Oculink, U.2, | and PCIe spaced for Infinity Fabric & NVLink. Everyone should | be using LVFS for firmware - NVMe firmware, despite having a | standardized update mechanism, is a complete mess with bugs on every | major controller. | | I share all of this as someone with experience in operating | commodity hardware at scale. People are so wasteful with | their hardware. | trustingtrust wrote: | There are 3 revisions of i225, and Intel essentially got rid | of it and launched i226. That one also seems to be | problematic [1]. Why is it exponentially harder to make a | 2.5Gbps NIC when the 1Gbps NICs (i210 and i211) have worked | well for them? Shouldn't it be trivial to make it 2.5x? They | seem to make good 10Gbps NICs, so I would assume 2.5Gbps | shouldn't need a fifth try from Intel? 
| | [1] - https://shorturl.at/esCNP | jeffbee wrote: | The bugs I am aware of are on the PCIe side. i225 will lock | up the bus if it attempts to do PTM to support PTP. That's | a pretty serious bug. You would think Intel has this nailed, | since they invented PCIe, and PCI for that matter. | Apparently not. Maybe they outsourced it. | elabajaba wrote: | > Does performance tuning for Wi-Fi adapters matter? | | If you're willing to potentially sacrifice 10-20% of (max local | network) throughput, you can drastically improve Wi-Fi fairness | and improve ping times/reduce bufferbloat (random ping spikes | will still happen on Wi-Fi though). | | There's a huge thread https://forum.openwrt.org/t/aql-and-the- | ath10k-is-lovely/590... that has stuff about enabling and | tuning AQM, and some of the tradeoffs between throughput and | latency. | gjulianm wrote: | This is great, not just the parameters themselves but all the | steps that a packet follows from the point it enters the NIC | until it gets to userspace. | | Just one thing to add regarding network performance: if you're | working on a system with multiple CPUs (which is usually the case | in big servers), check NUMA allocation. Sometimes the network | card will be attached to one CPU while the application is executing on a | different one, and that can affect performance too. | klabb3 wrote: | A random thing I ran into with the defaults (Ubuntu Linux): | | - net.ipv4.tcp_rmem ~ 6MB | | - net.core.rmem_max ~ 1MB | | So.. the tcp_rmem value wins by default, meaning that the | TCP receive window for a vanilla TCP socket actually goes up to | 6MB if needed (in reality, 3MB because of the halving, but let's | ignore that for now since it's a constant). | | But if I "setsockopt SO_RCVBUF" in a user-space application, I'm | actually capped at a maximum of 1MB, even though I already have 6MB. | If I try to _reduce it_ from 6MB to e.g. 4MB, it will result in | 1MB. This seems very strange. (Perhaps I'm holding it wrong?) | | (Same applies to SO_SNDBUF/wmem...) | | To me, it seems like Linux is confused about the precedence order | of these options. Why not have core.rmem_max be larger and the | authoritative directive? Is there some historical reason for | this? | pengaru wrote: | The net.ipv4.tcp_rmem max is a limit for the auto-tuning the kernel | performs. | | Once you do SO_RCVBUF, the auto-tuning is out of the picture for | that socket, and net.core.rmem_max becomes the max. | | It's pretty clearly documented @ Documentation/networking/ip- | sysctl.rst | | Edit: downvotes, really? smh | dekhn wrote: | And to add: the kernel autotunes better than you can, so | leave that enabled unless you're Vint Cerf, Jim Gettys, or | Vern Paxson. | napkin wrote: | Just changing Linux's default congestion control | (net.ipv4.tcp_congestion_control) to 'bbr' can make a _huge_ | difference in some scenarios, I guess over distances with | sporadic packet loss and jitter, and encapsulation. | | Over the last year, I was troubleshooting issues with the | following connection flow: | | client host <-- HTTP --> reverse proxy host <-- HTTP over | Wireguard --> service host | | On average, I could not get better than 20% of theoretical max | throughput. Also, connections tended to slow to a crawl over | time. I had hacky solutions like forcing connections to close | frequently. Finally switching congestion control to 'bbr' gives | close to theoretical max throughput and reliable connections. | | I don't really understand enough about TCP to understand why it | works. The change needed to be made on both sides of Wireguard. 
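As a rough illustration of the two knobs discussed above - pengaru's point that an explicit SO_RCVBUF disables the kernel's auto-tuning and is clamped by net.core.rmem_max, and napkin's switch of the congestion control to BBR - here is a minimal sketch, assuming a Linux host, Python 3.6+, and an available tcp_bbr module; the buffer size and error handling are illustrative only and not taken from the thread:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

    # Per-socket congestion control: the socket-level counterpart of the
    # system-wide sysctl net.ipv4.tcp_congestion_control that napkin changed.
    # Setting "bbr" fails unless the tcp_bbr module is loaded (and, for
    # unprivileged processes, listed in net.ipv4.tcp_allowed_congestion_control).
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16).split(b"\0", 1)[0])
    try:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
    except OSError:
        print("bbr not available on this kernel")

    # Receive buffer: an explicit SO_RCVBUF turns off receive-buffer
    # auto-tuning for this socket and is clamped to net.core.rmem_max,
    # which is why klabb3 sees ~1MB no matter what value is requested.
    # (The kernel doubles the requested value to cover bookkeeping overhead.)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
    print("effective SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

Leaving SO_RCVBUF untouched (per dekhn) keeps the auto-tuning, whose ceiling is the third value of net.ipv4.tcp_rmem; net.core.rmem_max only caps sockets that set the option explicitly, which matches klabb3's observation of a 6MB window despite a 1MB rmem_max.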
| drewg123 wrote: | The difference is that BBR does not use loss as a signal of | congestion. Most TCP stacks will cut their send windows in half | (or otherwise greatly reduce them) at the first sign of loss. | So if you're on a lossy VPN, or sending a huge burst at 1Gb/s | on a 10Mb/s VPN uplink, TCP will normally see loss, and back | way off. | | BBR tries to find the Bottleneck Bandwidth rate, e.g., the bandwidth | of the narrowest or most congested link. It does this by | measuring the round trip time, and increasing the transmit rate | until the RTT increases. When the RTT increases, the assumption | is that a queue is building at the narrowest portion of the | path and the increase in RTT is proportional to the queue | depth. It then drops the rate until the RTT normalizes due to the | queue draining. It sends at that rate for a period of time, and | then slightly increases the rate to see if the RTT increases again | (if not, it means that the queuing it saw before was due to | competing traffic which has cleared). | | I upgraded from a 10Mb/s cable uplink to 1Gb/s symmetrical | fiber a few years ago. When I did so, I was ticked that my | upload speed on my corp. VPN remained at 5Mb/s or so. When I | switched to RACK TCP (or BBR) on FreeBSD, my upload went up by | a factor of 8 or so, to about 40Mb/s, which is the limit of the | VPN. | [deleted] ___________________________________________________________________ (page generated 2023-09-06 20:00 UTC)