[HN Gopher] Boosting upload speed and improving Windows' TCP stack ___________________________________________________________________ Boosting upload speed and improving Windows' TCP stack Author : el_duderino Score : 167 points Date : 2021-05-18 17:28 UTC (5 hours ago) (HTM) web link (dropbox.tech) (TXT) w3m dump (dropbox.tech) | mrpippy wrote: | API Monitor is really useful, but unfortunately is closed-source | and hasn't been updated in a few years. | rootsudo wrote: | "Dropbox is used by many creative studios, including video and | game productions. These studios' workflows frequently use offices | in different time zones to ensure continuous progress around the | clock. " | | Honestly I don't understand these orgs that don't go | OneDrive/O365 suite. What product value does dropbox have when | competing within Microsoft's own ecosystem? | chokeartist wrote: | I got excited when I saw that fancy Microsoft Message Analyzer | tool and wanted to try it out. Sadly it appears to be retired and | removed by MSFT? Sad! | hyperrail wrote: | Yeah, I have no idea either why Microsoft would want to remove | Message Analyzer completely, even if they could not maintain | it. You can still download it through the Internet Archive: | | * 32-bit x86: | https://web.archive.org/web/20191104120802/https://download.... | | * 64-bit x86: | https://web.archive.org/web/20190420141924/http://download.m... | | (those links via: | https://www.reddit.com/r/sysadmin/comments/e4qocq/microsoft_... | ) | | Or use the even older Microsoft utility Network Monitor, which | is still available on Microsoft's website: | https://www.microsoft.com/en-us/download/details.aspx?id=486... | | Supposedly Microsoft is working on adding to the existing | Windows Performance Analyzer (great GUI tool for ETW | performance tracing) to display ETW packet captures, which will | succeed Message Analyzer and Network Monitor: | https://techcommunity.microsoft.com/t5/networking-blog/intro... 
| jabroni_salad wrote:
| It's really too bad. I'm happy enough to use Wireshark, but I
| liked that MMA could filter by PID.
| Agingcoder wrote:
| Excellent article!
|
| I got hit by the exact same issue which is described in the
| Fermilab paper, namely packet reordering caused by Intel drivers.
| It took me several days to diagnose the problem. Interestingly
| enough, the problem virtually disappeared when running tcpdump,
| which, after a lot of reading on the innards of the Linux TCP
| stack, and prodding with ebpf, eventually led me to conjecture
| that it was a scheduling/core placement issue. Pinning my process
| clearly made the problem disappear, and then finding the paper
| nailed it.
|
| Networks are not my specialty (I come from a math background, am
| self-taught, and had always dismissed them as mere plumbing),
| but I have to say that I came out of this difficult (for me)
| investigation with a great appreciation for networking in
| general, and now enjoy reading anything I can find about them.
|
| It's never too late to learn, and I have yet to find something in
| software engineering which is not interesting once you take a
| closer look at it!
| stephc_int13 wrote:
| Is TCP the best choice? Why not UDP?
| SaveTheRbtz wrote:
| We will eventually be migrating to UDP (HTTP/3) once it is
| rolled out on Envoys[0] on Dropbox Edge Network[1].
|
| [0] https://dropbox.tech/infrastructure/how-we-migrated-
| dropbox-...
|
| [1] https://dropbox.tech/infrastructure/dropbox-traffic-
| infrastr...
| bob1029 wrote:
| This is a good question in my opinion.
|
| Theoretically, UDP would be the best choice if you had the time
| & money to spend on building a very application-specific layer
| on top that replicates many of the semantics of TCP. I am not
| aware of any apps that require 100% of the TCP feature set, so
| there is always an opportunity to optimize.
| | You would essentially be saying "I know TCP is great, but we | have this one thing we really prefer to do our way so we can | justify the cost of developing an in-house mostly-TCP clone and | can deal with the caveats of UDP". | | If you know your communications channel is very reliable, UDP | can be better than TCP. | | Now, I am absolutely not advocating that anyone go out and do | this. If you are trying to bring a product like Dropbox to | market (and you don't have their budget), the last thing you | want to do is play games with low-level network abstractions | across thousands of potential client device types. TCP is an | excellent fit for this use case. | michaelmcmillan wrote: | And reimplement TCP on top? Would not recommend. | arduinomancer wrote: | Sure if you want to re-build TCP yourself on top of UDP | [deleted] | willis936 wrote: | It's an ideal application of TCP. Dropbox servers are | continually flooded by traffic from clients, so the good | congestion behavior from TCP is valuable. There is also less | need to implement error detection/correction/retransmission in | higher layers. | jandrese wrote: | Bulk data transfer is TCP's bread and butter. This is the | protocol living the dream. | mwcampbell wrote: | I wonder how the Dropbox developers managed to get in contact | with the Windows core TCP team. Maybe I'm too cynical, but I'm | surprised that Microsoft would go out of their way to work with a | competitor like this. | paxys wrote: | Microsoft is a massive and highly compartmentalized company. | Windows kernel developers have no reason to see Dropbox as a | competitor. | toast0 wrote: | Even if OneDrive vs Dropbox is important, this is a win for | Windows in general. People will switch OSes because the TCP | throughput is better on the other side; it's easy to measure | and easy to compare and makes a nice item in a pros and cons | list. 
|
| Fixing something like this can help lots of use cases, but may
| have been difficult to spot, so I'm sure the Windows TCP team
| was thrilled to get the detailed, reproducible report.
| freerk wrote:
| How come Linux doesn't have this issue? Why did Microsoft have to
| fix TCP with the RACK-TLP RFC when both the Linux and macOS
| implementations did fine already?
| SaveTheRbtz wrote:
| Microsoft devs explain this in their "Algorithmic improvements
| boost TCP performance on the Internet"[1] article.
|
| TL;DR is that they had RACK (RFC draft) implemented as an MVP
| but w/o the reordering heuristic.
|
| [1] https://techcommunity.microsoft.com/t5/networking-
| blog/algor...
| thrdbndndn wrote:
| Cool article, but I'm not impressed by Dropbox's upload speed on
| my Windows computer, at all.
|
| I just tested rn with Dropbox, Google Drive, and OneDrive, all
| with their native desktop apps. I simply put a 300MB file in the
| folder and let it sync.
| DB: 500 KiB/s
| GD: 3 MiB/s
| OD: 11 MiB/s (my max bandwidth, with 100Mbps)
|
| I don't know what causes the disparity here, but I have been
| annoyed by this for years, and it's the same across multiple
| computers I use at different locations.
|
| Another funny thing is if you just use the webpage, both GD and
| DB can reach 100Mbps easily.
|
| Edit: should mention Google's DriveFS can reach max speed too,
| but it's not available for my personal account (which uses the
| "Backup and sync for Google" app).
| pityJuke wrote:
| Google are migrating Backup and Sync to DriveFS soon [0], but
| you can upgrade right now. Now, I don't remember how I did it,
| but I do have Drive FS on my personal account.
|
| [0]:
| https://support.google.com/googleone/answer/10309431#zippy=
| thrdbndndn wrote:
| Good to know! Definitely will try it later, but I currently
| have a backup job (one-way photo backup, not sync GD) set up
| on my second GDrive account which I don't want to touch...
| yet.
| encryptluks2 wrote:
| Google Drive is a gem.
I hope it lasts forever cause no one is
| competing with them.
| fletchowns wrote:
| I disagree. I had my machine backed up to Google Drive using
| their Backup and Sync program and when I got a new machine
| there was no reasonable way to restore the data from the old
| machine to the new machine, using Google Drive. Sure I can
| copy data from my old machine, but what if it was lost or
| stolen? If the app can't handle this use case, what's the
| point of it? The only way to restore the files is in small
| chunks using the web-based interface - not reasonable for
| tens of thousands of files and hundreds of gigabytes.
|
| The workaround was to back everything up to the "Google
| Drive" folder since this seems to be the only folder that
| Backup and Sync can actually restore.
| CPAhem wrote:
| There are better Google Drive clients, like SyncDocs, that
| can actually restore properly.
| vladvasiliu wrote:
| Had a somewhat similar issue, but with Drive File Stream.
|
| At one point I set it up to use my second SSD as the local
| storage. Then I needed that SSD elsewhere, so I just took
| it out. It was impossible to restart the damn thing. It
| kept complaining about missing folders. I even tried
| uninstalling and reinstalling it, but it kept its settings.
|
| Since I barely used that machine, if ever, and I'm not
| particularly familiar with Windows, I never really looked
| into how to completely clean up the configuration. But the
| point is that there clearly are some pretty stupid
| decisions about some products.
| encryptluks2 wrote:
| Are you saying the files were no longer available in Google
| Drive? Did you download the Drive for desktop client to try
| restoring files or just try reinstalling the Backup and
| Sync client?
| tssva wrote:
| To be fair, it is called Backup and Sync, not Backup, Sync
| and Restore.
|
| But on a more useful note, how I have handled this in the
| past is to download the complete Google Drive data using
| Google Takeout.
Not the greatest solution, but it has
| worked.
| ASalazarMX wrote:
| Except we're still waiting for an official Linux client since
| Google promised it was "coming soon" in 2012.
|
| https://abevoelker.github.io/how-long-since-google-said-a-
| go...
|
| Of course there are alternatives now, but I like to plug this
| page whenever I can.
| encryptluks2 wrote:
| I'm using rclone and honestly prefer it at this point, and
| there are others as well, so while an official client would
| be nice, it is no longer a concern for me.
| strictfp wrote:
| Really? Google Drive sync has been hot garbage for me. Before
| that program came along, everything was fine and dandy, but drive
| sync constantly stumbles over its own feet, restarts and
| fails to up- and download files. I'm longing for Rsync or
| even FTP after trying to use Google Drive to move data.
| Dylan16807 wrote:
| > Edit: should mention Google's DriveFS can reach max speed
| too, but it's not available for my personal account (which uses
| the "Backup and sync for Google" app).
|
| That thing is far _too_ aggressive about network bandwidth. It
| will upload 20 files at the same time and the speed limit
| setting doesn't work.
| vladvasiliu wrote:
| Is 20 files some kind of hyperbole? If not, how do you get it
| to do that?
|
| I've never seen it transfer more than five files at a time,
| which sometimes drives me crazy when there are a lot of small
| files to sync.
| Dylan16807 wrote:
| It's not hyperbole. I make a backup of my computer, with
| the output being a bunch of 500MB files. And I would then
| copy or move those files into a folder on the file stream
| drive. It's not entirely consistent, and it used to do
| less, but with some update it decided that it should upload
| _way too many_ files at once. I've had to switch to an
| entirely different program to upload those files
| sequentially.
| Isthatablackgsd wrote:
| What is the program that you are using?
I am currently
| using odrive for macOS since they don't have DriveFS
| support for Apple Silicon. odrive works OK, it just has a
| weird file conflict sometimes.
| Dylan16807 wrote:
| I'm using rclone to do big uploads.
|
| I still use DriveFS for everything else, at least for
| now. Rclone is capable of mounting the drive but it's not
| really designed for that.
| Groxx wrote:
| Yea, Dropbox on my Macs has continuously been outrageously slow
| at uploading. Everything else is multiples faster.
|
| Dropbox does at least _resume_ fairly reliably though, so I can
| generally ignore it the whole time... unless I have something I
| want to sync ASAP. Then I sometimes use the web UI and cross my
| fingers that I don't get a connection hiccup.
| nailer wrote:
| Are you using the version of Windows with the fix mentioned in
| the article?
| drewg123 wrote:
| Is Google Drive using QUIC? If so, then it's using the same BBR
| congestion algorithm as the BBR TCP stack, and BBR's algorithm,
| which does not view loss as congestion, will help a lot.
|
| It would be interesting to re-try the experiment on Linux or
| FreeBSD using BBR as the TCP stack and see if the results are
| any better for Dropbox.
|
| FWIW, my corp openvpn is kinda terrible. My upload speeds via
| the vpn did not improve at all when I moved and upgraded from
| 10Mb/s to 1Gb/s upstream speeds. When I switched to BBR, my
| bandwidth went from ~8Mb/s -> 60Mb/s, which I think is the limit
| of the corp vpn endpoint.
| virtuallynathan wrote:
| Looking at the flows on my network while uploading a file, it
| seems Google Drive's Mac client just uses regular old TCP,
| same for the website.
| tmashb wrote:
| QUIC is UDP, and TCP does not use CCA in userspace.
| dochtman wrote:
| QUIC does run in user space, and also uses congestion
| controllers running inside the QUIC stack, in user space.
|
| (I work on a QUIC implementation in Rust.)
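An aside on the congestion-control subthread above: on Linux, the kernel congestion control algorithm can be selected per TCP socket with the `TCP_CONGESTION` socket option, which is one low-effort way to try BBR as drewg123 describes. A minimal, Linux-only sketch (the fallback list is an assumption; `bbr` is only usable if the kernel module is loaded):

```python
import socket

# Candidates in order of preference; "reno" is always compiled in.
PREFERRED = ("bbr", "cubic", "reno")

def pick_congestion_control(sock: socket.socket) -> str:
    """Select the first available TCP congestion control algorithm
    for this socket, falling back when the kernel lacks a module."""
    for algo in PREFERRED:
        try:
            sock.setsockopt(socket.IPPROTO_TCP,
                            socket.TCP_CONGESTION, algo.encode())
            return algo
        except OSError:
            continue  # algorithm not available in this kernel
    raise RuntimeError("no usable congestion control algorithm")

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(pick_congestion_control(sock))
sock.close()
```

The system-wide default lives in the `net.ipv4.tcp_congestion_control` sysctl. QUIC stacks, by contrast, run their congestion controller in user space, which is why they can ship BBR without any kernel support.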
| [deleted]
| [deleted]
| [deleted]
| [deleted]
| tmashb wrote:
| QUIC is a protocol... "...CCA in userspace" CCA stands
| for congestion control algorithm.
| AndrewDucker wrote:
| QUIC absolutely uses congestion control. See section 6 here:
| https://tools.ietf.org/id/draft-ietf-quic-recovery-26.html
| [deleted]
| [deleted]
| tmashb wrote:
| No denying in that.
| [deleted]
| [deleted]
| [deleted]
| [deleted]
| SaveTheRbtz wrote:
| Interesting, can you try disabling the upload limiter in settings?
| Also what is your RTT to `nsf-1.dropbox.com`?
|
| PS. One known problem that we have right now is that we use a
| multiplexed HTTP/2 connection, therefore:
|
| 1) We rely on the host's TCP congestion control. (We have not yet
| switched to HTTP/3 w/ BBR.)
|
| 2) We currently use a single TCP connection: it is more fair to
| the other traffic on the link but can become a bottleneck on
| large RTTs.
| thrdbndndn wrote:
| Tried to change upload speed to no limit, doesn't make much
| difference.
|
| Ping result:
| Pinging nsf-env-1.dropbox-dns.com [162.125.3.12] with 32 bytes of data:
| Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
| Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
| Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
| Reply from 162.125.3.12: bytes=32 time=27ms TTL=55
| Ping statistics for 162.125.3.12:
| Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
| Approximate round trip times in milli-seconds:
| Minimum = 27ms, Maximum = 27ms, Average = 27ms
|
| App Ver. 122.4.4867
|
| Is the OS being Win7 a factor? (Work computer, can't update
| [yet]).
|
| Download speed is normal (100Mbps).
| kevingadd wrote:
| Strange. Dropbox has no problem hitting mid-50s MiB/s if not
| more on my gigabit connection. I wonder if it's a routing issue
| and your path to their datacenters is bad?
| thrdbndndn wrote:
| It uploads fine with the web version, so I doubt it's a routing
| issue (granted, they could use a different datacenter).
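Since ICMP ping is often blocked on corporate networks, RTT comparisons like the one above can also be approximated by timing a TCP three-way handshake. A rough sketch (connect time is roughly one round trip plus socket setup; the hostname in the comment is only an example):

```python
import socket
import time

def tcp_connect_rtt(host: str, port: int = 443) -> float:
    """Return the wall-clock time (seconds) of a TCP three-way
    handshake, which approximates one network round trip."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=5):
        return time.monotonic() - start

# Example (needs network access):
# print(f"{tcp_connect_rtt('www.dropbox.com') * 1000:.1f} ms")
```

This measures handshake latency only; it says nothing about throughput, which is exactly why loss recovery and congestion control dominate the upload-speed story in the article.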
| lowleveldesign wrote:
| If you're on a recent Windows system, you should have pktmon [1]
| available. I believe it's the "netsh trace" successor and has
| a much nicer command line. And you no longer need an external tool
| to convert the trace to pcapng format.
|
| [1] https://docs.microsoft.com/en-us/windows-
| server/networking/t...
| emmericp wrote:
| The real root cause for all that flow director mess and core
| balancing is that there's a huge disconnect between how the
| hardware works and what the socket API offers by default.
|
| The scaling model of the hardware is rather simple: hash over
| packet headers and assign a queue based on this. And each queue
| should be pinned to a core by pinning the interrupts, so you get
| easy flow-level scaling. That's called RSS. It's simple and
| effective. What it means is: the hardware decides which core
| handles which flow. I wonder why the article doesn't mention RSS
| at all?
|
| Now the socket API works in a different way: your application
| decides which core handles which socket and hence which flow. So
| you get cache misses if you don't take into account how the
| hardware is hashing your flows. That's bad. So you can do some
| workarounds by using flow director to explicitly redirect flows
| to cores that handle things, but that's just not really an elegant
| solution (and the flow director lookup tables are small-ish).
|
| I didn't follow kernel development regarding this recently, but
| there should be some APIs to get a mapping from a connection
| tuple to the core it gets hashed to on RX (the hash function should
| be standardized to Toeplitz IIRC; the exact details on which
| fields and how they are put into the function are somewhat
| hardware- and driver-specific but usually configurable). So you'd
| need to take this information into account when scheduling your
| connections to cores.
If you do that, you don't get any cache
| misses and don't need to rely on the limited capabilities of
| explicit per-flow steering.
|
| Note that this problem will mostly go away once TAPS finally
| replaces BSD sockets :)
| SaveTheRbtz wrote:
| We didn't mention RSS/RPS in the post mostly because they are
| stable. (Albeit relatively ineffective in terms of L2 cache
| misses.) FlowDirector, OTOH, breaks that stability and causes a
| lot of migrations, and hence a lot of re-ordering.
|
| Anyways, nice reference for TAPS! For those wanting to dig into
| it a bit more, consider reading an introductory paper (before a
| myriad of RFC drafts from the "TAPS Working Group"):
| https://arxiv.org/pdf/2102.11035.pdf
|
| PS. We went through most of our low-level web-server
| optimizations for the Edge Network in an old blogpost:
| https://dropbox.tech/infrastructure/optimizing-web-servers-f...
| tyingq wrote:
| Interesting. Is the Dropbox client still an obfuscated Python
| app? I'm curious if they spawn new processes for simultaneous
| uploads since they probably aren't threading.
| Twisol wrote:
| > On one hand, Dropbox Desktop Client has just a few settings.
| On the other, behind this simple UI lies some pretty
| sophisticated Rust code with multi-threaded compression,
| chunking, and hashing. On the lowest layers, it is backed up by
| HTTP/2 and TLS stacks.
|
| And I found another Dropbox blog post about rewriting their
| sync engine from Python to Rust:
| https://dropbox.tech/infrastructure/rewriting-the-heart-of-o...
|
| But it isn't clear whether the outer shell of the app might
| still be Python.
| kevingadd wrote:
| The Windows client I have installed appears to be a native app
| using Qt 5 and Qt5WebEngine (embedded Chromium) with an
| absolutely bonkers number of threads (240). It's possible
| there's still Python in there, but I suspect not; their UI has
| been completely overhauled since the Python days.
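The Toeplitz hash mentioned in the RSS discussion above is small enough to sketch in full. RSS-capable NICs compute it over the flow tuple and use low bits of the result to pick a receive queue, which is the flow-to-core stability that flow director steering can break. The key below is the standard verification key from Microsoft's RSS documentation, not any particular NIC's configuration:

```python
import ipaddress
import struct

# Microsoft's 40-byte RSS verification key (from the RSS docs).
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """For every set bit i of the input (MSB first), XOR in the
    32-bit window of the key that starts at bit offset i."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                offset = 8 * i + b
                result ^= (key_int >> (key_bits - 32 - offset)) & 0xFFFFFFFF
    return result

def rss_hash_ipv4_tcp(src_ip: str, dst_ip: str,
                      src_port: int, dst_port: int) -> int:
    """RSS input order per the spec: src addr, dst addr, src port, dst port."""
    data = (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack(">HH", src_port, dst_port))
    return toeplitz_hash(RSS_KEY, data)

# One of Microsoft's published IPv4-with-TCP test vectors:
print(hex(rss_hash_ipv4_tcp("66.9.149.187", "161.142.100.80", 2794, 1766)))
```

The NIC then indexes its indirection table with low bits of this 32-bit value to choose a queue (and hence a core), which is why aligning outgoing connections with the hash, as described in the comments below, requires recomputing it in software.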
| dijit wrote:
| I had a similar issue with Windows kernels "recently" (2016~?)...
|
| I don't have the memory or patience to write a long and inspiring
| blog post, but it comes down to:
|
| Even with IOCP/multiple threads: network traffic is single-
| threaded in the kernel; even worse, there's a mutex there,
| putting the effective PPS limit for Windows at something like
| 1.1M at 3.0GHz.
|
| The task of this machine was /basically/ a connection multiplexer
| with some TLS offloading; so listen on a socket, get an encrypted
| connection, check your connection pool and forward where
| appropriate.
|
| Our machine basically sat waiting (in kernel space) for this lock
| 99.7% of the time, 0.3% was spent on SSL handshaking.
|
| We solved our "issue" by spreading such load over many more
| machines and gave them low-core-count high-clock-speed Xeons
| instead of the normal complement of 20vCPU Xeons.
|
| AFAIK that issue persists, I'd be interested to know if someone
| else managed to coerce Windows to do the right thing here.
| kevingadd wrote:
| Do you know whether it's a single thread for all network
| devices, or just per device? It would be interesting if this
| ended up being a driver-level constraint or something that can
| be fixed by having multiple NICs in the machine.
| dijit wrote:
| it was LACP'd... buuuuuuut; it was only one PCIe card.
|
| Otherwise you can't easily do LACP because it can't be
| offloaded to the card.
|
| We tried without LACP, but again, only one PCIe card.
| toast0 wrote:
| I did some work optimizing a similar problem, but simpler and
| on another OS[1]. The basic concept that worked was Receive
| Side Scaling (RSS), which was developed by Microsoft for
| Windows Server. Did you come across that? It needs support in
| the NIC and the driver, but Intel gigE cards do it, so you
| don't need the really fancy cards.
I don't know what the
| interface is like for Windows, but inbound RSS for FreeBSD is
| pretty easy, and skimming Windows docs, it seemed like you
| could do more advanced things there.
|
| The harder part was aligning the outgoing connections; for max
| performance, you want all of the related connections pinned to
| the same CPU, so that there's no inter-CPU messaging; for me
| that meant a frontend connection needs to hash to the same NIC
| queue as the backend connection; for you, that needs to be all
| of the demultiplexed connections on the same queue as the
| multiplexed connection. Windows may have an API to make
| connections that will hash properly, FreeBSD didn't (doesn't?),
| so my code had to manage the local source ip and port when
| connecting to remote servers so that the connection would hash
| as needed. Assuming a lot of connections, you end up needing to
| self-manage source ip and port anyway, and at least HAProxy has
| code for that already, but running the RSS hash to qualify
| ports was new development, and a bit tricky because bulk
| calculating it gets costly.
|
| Once I got everything set up well with respect to CPUs, things
| got a lot better; still had some kernel bottlenecks though. I
| wouldn't know how to resolve that for Windows, but there were
| some easy wins for FreeBSD.
|
| Low core count is the right way to go though; I think the NICs
| I used could only do 16-way RSS hashing, so my dual 14-core
| Xeons (2690v4) weren't a great fit; 12 cores were 100% idle all
| the time; something power-of-two would be best.
|
| Email in profile if you want to continue the discussion off HN
| (or after it fizzles out here).
|
| [1] Load balancing/proxying, but no TLS and no multiplexing, on
| FreeBSD.
| drewg123 wrote:
| Do you actually use RSS via options RSS / options PCBGROUP?
| I've tried it several times, and it's just so hard to get
| right & have matching cores / rx rings, etc.
I've made it | work with a local patch to nginx, but it was so fragile that | I abandoned it. | | I had been thinking that RSS/PCBGROUP was totally abandoned | and could potentially be removed. | toast0 wrote: | I no longer work where I did this (and it's been shut down, | as it was a transitional proxy), so I can't be 100% sure | what the kernel configuration was; I was able to release | patches on the HAProxy mailing list, although they weren't | incorporated, but at least I can reference them [1]. | | But yes, I think I ended up using both RSS and PCBGROUP. | This was on a server running only one application (plus | like sshd and crond and whatever), so it was dead simple to | line up listen socket RSS and cpu affinity; I had a config | generator script that would look at the number of | configured queues and tell HAProxy process 0 to bind to cpu | 0 and rss queue 0, up until I ran out of RSS queues; we | needed a config generator script anyway, because the | backend configuration was subject to frequent changes. If | it was only listen sockets, RSS would have been sufficient | without needing PCBGROUP, but locking around opening new | outgoing sockets was a bottleneck and PCBGROUP helped | considerably, but it was still a bottleneck. This was on | FreeBSD 12. | | Edit: I also found some patches[2] I sent to freebsd- | transport that I don't know if anyone saw; I don't remember | if I updated the patches after this... I know I tried some | more stuff that I wasn't able to get working. Don't apply | these patches blindly, but these were some of the things I | had to fiddle with anyway. I think I saw there was some | stuff in 13 that likely made outgoing connections better. | | [1] https://www.mail- | archive.com/haproxy@formilux.org/msg34548.h... | | [2] https://lists.freebsd.org/pipermail/freebsd- | transport/2019-J... | betaporter wrote: | Sounds like you didn't have receive side scaling enabled; by | default flows are queued to core 0 to prevent reordering. 
If
| you enable RSS, your flows will be hashed to core-specific
| queues.
|
| It's inaccurate to describe traffic processing as single-
| threaded in the kernel.
| jandrese wrote:
| Did it _have_ to be Windows? This is the sort of thing Linux or
| *BSD boxes are better suited for. I wouldn't even consider a
| Windows machine for the task unless there's some sort of
| licensed software you need to run on it to get the job done.
| dijit wrote:
| > This is the sort of thing Linux or *BSD boxes are better
| suited for.
|
| Definitely, though enabling conntrack on Linux has similar
| characteristics (it forces a single thread with some kind of
| internal mutex), though it can do 5x the b/w.
|
| We tried having stateful firewalls in front of our Windows
| boxen, that's how I know.
|
| Seems like Cloudflare has an older blog post detailing this
| too: https://blog.cloudflare.com/conntrack-tales-one-
| thousand-and...
|
| Anyway, to answer your question: AAA GameDev (and their
| backends, if highly tailored) are Windows.
| strictfp wrote:
| Maybe try Wine? Seriously, it might be very low effort to
| get the binary to run on Wine.
| brundolf wrote:
| Dropbox always publishes such good technical blog posts. And as a
| user, it's reassuring to see how much they still care about
| technical excellence.
| whatever_dude wrote:
| Do they? I constantly see Dropbox taking days to sync files
| that are 30kb in size. Or doing dumbfounding things like
| downloading all files, then re-uploading all files when I set sync
| to "online only" on a folder if just one of the files is not set
| to online only.
|
| Maybe they have grand academic visions and papers, but I've
| been using them for well over a decade and I feel the client
| quality has gone downhill over the past few years. They keep
| adding unnecessary stuff like a redundant file browser while
| the core service suffers.
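Core pinning comes up repeatedly in this thread: it fixed the packet-reordering diagnosis near the top, and it underpins the RSS queue alignment discussed above. For reference, on Linux a process can pin itself without external tools such as taskset; a minimal sketch:

```python
import os

def pin_to_cpu(cpu: int) -> None:
    """Restrict the calling process (pid 0 = self) to one CPU so the
    scheduler can no longer migrate it between cores mid-flow."""
    os.sched_setaffinity(0, {cpu})

# Pin ourselves to the lowest CPU we are currently allowed to run on.
target = min(os.sched_getaffinity(0))
pin_to_cpu(target)
print(os.sched_getaffinity(0))  # now a single-CPU set
```

Per-thread pinning (e.g. `pthread_setaffinity_np`) and lining those pins up with the NIC's RSS indirection table is the harder part the comments above describe.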
| brundolf wrote: | Maybe my usage stays in the golden path, but I've been using | them for ten years too and I have no complaints about the | core functionality. My only real complaint is that they've | been adding lots of features I don't care about, getting | slightly pushy about convincing you to try them, etc. But I | haven't seen the core stuff actually go downhill. ___________________________________________________________________ (page generated 2021-05-18 23:00 UTC)