[HN Gopher] Achieving reliable UDP transmission at 10 Gb/s [pdf] (2017)
       ___________________________________________________________________
        
       Achieving reliable UDP transmission at 10 Gb/s [pdf] (2017)
        
       Author : pmoriarty
       Score  : 92 points
       Date   : 2020-04-19 17:48 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Matthias247 wrote:
       | Reading through the paper I can't see what the authors mean by
       | "reliable transmission" there, or how they achieve it.
       | 
       | I only see them referencing increased socket buffers, which then
       | led - in combination with the available (and non-congested)
       | network bandwidth and their app's sending behavior - to no
       | transmission errors. As soon as you change any of those
       | parameters it seems like the system would break down, and they
       | have absolutely no measures in place to "make it reliable".
       | 
       | The right answer still seems to be: implement a congestion
       | controller, retransmits, etc. - which essentially ends up
       | reimplementing TCP/SCTP/QUIC/etc.
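       | 
       | As a rough illustration of what such a reliability layer
       | involves, here is a minimal stop-and-wait sender in C (a
       | hypothetical sketch, not anything from the paper; the header
       | layout and sizes are made up):
       | 
       |     /* Hypothetical stop-and-wait reliability over UDP: tag each
       |      * datagram with a sequence number and retransmit until the
       |      * receiver echoes it back as an ACK. Real protocols (TCP,
       |      * QUIC) add windowing and congestion control on top. */
       |     #include <stdint.h>
       |     #include <string.h>
       |     #include <sys/types.h>
       |     #include <sys/socket.h>
       |     
       |     /* Returns 0 on success, -1 if no ACK after max_tries. `sock`
       |      * must be a connected UDP socket with SO_RCVTIMEO set to the
       |      * desired retransmit timeout. */
       |     int send_reliable(int sock, uint32_t seq, const void *data,
       |                       size_t len, int max_tries)
       |     {
       |         char pkt[1500];
       |         if (len > sizeof(pkt) - sizeof(seq))
       |             return -1;
       |         memcpy(pkt, &seq, sizeof(seq));  /* 4-byte seq header */
       |         memcpy(pkt + sizeof(seq), data, len);
       |     
       |         for (int i = 0; i < max_tries; i++) {
       |             send(sock, pkt, sizeof(seq) + len, 0);
       |             uint32_t ack;
       |             ssize_t n = recv(sock, &ack, sizeof(ack), 0);
       |             if (n == (ssize_t)sizeof(ack) && ack == seq)
       |                 return 0;  /* acknowledged */
       |             /* timeout or stray packet: retransmit */
       |         }
       |         return -1;  /* gave up */
       |     }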
        
         | rubatuga wrote:
         | They want reliable UDP, not TCP. They state that very clearly.
        
           | zamadatix wrote:
           | Yes, but they didn't do anything to make UDP reliable. They
           | just said that in their test scenario they didn't notice
           | any loss at the application layer after increasing the
           | socket receive buffer, and called it a day, because
           | elsewhere in the paper they noted: "For some detector
           | readout it is not even evident that guaranteed delivery is
           | necessary. In one detector prototype we discarded around
           | 24% of the data due to threshold suppression, so spending
           | extra time making an occasional retransmission may not be
           | worth the added complexity."
           | 
           | I think the paper meant "reliable" in a different way than
           | most would take "reliable" to mean in a paper about
           | networking - similar to if someone wrote a paper about
           | "Achieving an asynchronous database for timekeeping", spent
           | a lot of time talking about databases, but it turned out
           | that by "asynchronous" they meant you could enter your
           | hours at the end of the week rather than the moment you
           | walked in or out of the door.
        
         | bcoates wrote:
         | Having end-to-end control of their topology in production is
         | the measure they're using to make it reliable. Since they're
         | saturating the link, the receiver parameters are reasonably
         | robust: the sender physically cannot burst any faster and
         | overrun the receiver.
         | 
         | Retransmit-based systems are probably unusable in this
         | application: even over the short hop, the bandwidth-delay
         | product is probably much bigger than the buffer on the sensor.
         | The only case where a retransmit would happen is receiver
         | buffer overflow, which is catastrophic: the retransmit would
         | cause even more overflow.
         | 
         | If you had to fix random packet loss in a system like this you
         | wouldn't want to use retransmission, you'd need to do FEC.
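         | 
         | To illustrate the FEC idea, here is the simplest possible
         | scheme in C (a hypothetical sketch; real deployments use
         | stronger codes such as Reed-Solomon): XOR every group of k
         | data packets into one extra parity packet, so the receiver
         | can rebuild any single lost packet per group without a
         | retransmit.
         | 
         |     /* Single-parity FEC sketch (hypothetical): for every k data
         |      * packets, send one parity packet that is the XOR of all k.
         |      * XORing the parity with the k-1 packets that did arrive
         |      * reproduces the one that didn't. */
         |     #include <stddef.h>
         |     #include <string.h>
         |     
         |     #define PKT_LEN 1024  /* illustrative fixed packet size */
         |     
         |     /* XOR-accumulate one data packet into the parity buffer. */
         |     void parity_add(unsigned char parity[PKT_LEN],
         |                     const unsigned char pkt[PKT_LEN])
         |     {
         |         for (size_t i = 0; i < PKT_LEN; i++)
         |             parity[i] ^= pkt[i];
         |     }
         |     
         |     /* Rebuild the one missing packet of a group: XOR the parity
         |      * with every received packet; the survivors cancel out. */
         |     void rebuild_lost(unsigned char out[PKT_LEN],
         |                       const unsigned char *received[],
         |                       size_t n_received,
         |                       const unsigned char parity[PKT_LEN])
         |     {
         |         memcpy(out, parity, PKT_LEN);
         |         for (size_t i = 0; i < n_received; i++)
         |             parity_add(out, received[i]);
         |     }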
        
           | aDfbrtVt wrote:
           | EPON already includes an RS(255,223) ECC scheme as part of
           | the standard.
        
         | tomohawk wrote:
         | If you have a very low error-rate link, the main point at
         | which packet loss will occur for UDP is on the receiving
         | system. If the receive buffer is not large enough, it can
         | fill up while the receiving app is busy doing other things,
         | and packets will then be dropped.
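         | 
         | The usual first fix is to enlarge the per-socket receive
         | buffer; a sketch in C (the 12 MB figure simply mirrors the
         | sysctls quoted in the TLDR at the bottom of this thread):
         | 
         |     /* Enlarge the UDP receive buffer so bursts survive moments
         |      * when the receiving app is busy. Linux caps this at
         |      * net.core.rmem_max, so that sysctl usually needs raising
         |      * too. */
         |     #include <stdio.h>
         |     #include <sys/socket.h>
         |     
         |     int set_rcvbuf(int sock, int bytes)
         |     {
         |         if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes,
         |                        sizeof(bytes)) < 0)
         |             return -1;
         |         /* Read back the effective size: Linux reports double
         |          * the requested value to account for bookkeeping. */
         |         socklen_t len = sizeof(bytes);
         |         getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, &len);
         |         printf("effective receive buffer: %d bytes\n", bytes);
         |         return 0;
         |     }
         | 
         | e.g. set_rcvbuf(sock, 12582912) to match the TLDR's rmem_max.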
        
       | ignoramous wrote:
       | > (abstract) _Optimizations for throughput are: MTU, packet
       | sizes, tuning Linux kernel parameters, thread affinity, core
       | locality and efficient timers._
       | 
       | Cloudflare's u/majke shared a series of articles on a similar
       | topic [0][1][2] (focused on achieving line rate with higher
       | packets-per-second and lower latency rather than throughput)
       | that I found super helpful, especially since they are so very
       | thorough [3].
       | 
       | Speaking of throughput, u/drewg123 wrote an article on how
       | Netflix does 100gbps _with_ FreeBSD's network stack [4], and
       | here's BBC on how they do so by _bypassing_ Linux's network
       | stack [5].
       | 
       | ---
       | 
       | [0] https://news.ycombinator.com/item?id=10763323
       | 
       | [1] https://news.ycombinator.com/item?id=12404137
       | 
       | [2] https://news.ycombinator.com/item?id=17063816
       | 
       | [3] https://news.ycombinator.com/item?id=12408672
       | 
       | [4] https://news.ycombinator.com/item?id=15367421
       | 
       | [5] https://news.ycombinator.com/item?id=16986100
        
       | a_t48 wrote:
       | I wish I had seen this at my last job. This is something I had to
       | set up and it was painful - lots of trial and error.
        
       | snisarenko wrote:
       | Optimizing UDP transmission over the internet is an interesting
       | topic.
       | 
       | I remember reading a paper a while ago that showed that if you
       | send two consecutive UDP packets with the exact same data over
       | the internet, at least one of them will arrive at the
       | destination with a pretty high success rate (something like
       | 99.99%).
       | 
       | I wonder if this still works with current internet
       | infrastructure, and if this trick is still used in real-time
       | streaming protocols.
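       | 
       | A sketch of how that trick might look in C (hypothetical; I
       | don't have the paper at hand, so the details are made up): send
       | each datagram twice under the same sequence number and drop
       | duplicates at the receiver.
       | 
       |     /* Hypothetical duplicate-send sketch: every datagram goes out
       |      * twice, so a single random drop still leaves one copy. */
       |     #include <stddef.h>
       |     #include <stdint.h>
       |     #include <sys/socket.h>
       |     
       |     void send_twice(int sock, const void *pkt, size_t len)
       |     {
       |         send(sock, pkt, len, 0);  /* first copy */
       |         send(sock, pkt, len, 0);  /* redundant copy */
       |     }
       |     
       |     /* Receiver-side dedup: returns 1 if this sequence number is
       |      * new. A real implementation needs a window or bitmap rather
       |      * than one counter, since UDP can also reorder packets. */
       |     int is_new(uint32_t *highest_seen, uint32_t seq)
       |     {
       |         if (seq <= *highest_seen)
       |             return 0;  /* duplicate (or stale reordered packet) */
       |         *highest_seen = seq;
       |         return 1;
       |     }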
        
         | wmf wrote:
         | So basically rate 1/2 FEC.
        
           | [deleted]
        
         | zamadatix wrote:
         | 99.99% for two tries would be a 1% drop chance per packet,
         | which I'd say is pretty lenient - we average better than that
         | on our sites running off 4G (jitter is horrible though, and
         | that will kill any real-time protocol without huge added
         | delays).
         | 
         | Generally you'd just implement a more generic FEC algorithm
         | though, unless you had 2 separate paths you wanted to try
         | (e.g. race a cable modem and 4G with every packet, and if one
         | side drops it, hope the other side still finishes the race),
         | as there are FEC options that allow non-integer redundancy
         | levels and can reduce header overhead compared to sending
         | multiple copies of small packets.
        
           | syrrim wrote:
           | >99.99% for two tries would be a 1% drop chance
           | 
           | Not per se. The drop chance for consecutive packets is likely
           | correlated, such that if you know the first one was dropped
           | you should increase your prior that the second one will also
           | be dropped.
        
             | zamadatix wrote:
             | Depends on the cause and the root question. For instance,
             | in the most common scenario - congestion - routers do
             | intelligent random drops with increasing probability as
             | the buffer fills:
             | https://en.wikipedia.org/wiki/Random_early_detection. The
             | internet actually relies on this random low drop chance
             | to make things work smoothly, rather than waiting until
             | things are falling apart to signal all streams to slow
             | down at once while it catches up. The same randomness
             | applies to transmission bit errors, which also cause
             | drops, though there the randomness is not by design so
             | much as a property of the noise causing them.
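             | 
             | The RED ramp from that link boils down to a few lines
             | (sketch; parameter names are illustrative):
             | 
             |     /* Random Early Detection drop probability, per the
             |      * linked article: never drop below min_th, always drop
             |      * above max_th, and ramp linearly up to max_p in
             |      * between, driven by the average queue depth. */
             |     double red_drop_prob(double avg_queue, double min_th,
             |                          double max_th, double max_p)
             |     {
             |         if (avg_queue < min_th)
             |             return 0.0;  /* queue healthy: never drop */
             |         if (avg_queue >= max_th)
             |             return 1.0;  /* overloaded: always drop */
             |         return max_p * (avg_queue - min_th)
             |                      / (max_th - min_th);
             |     }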
             | 
             | On the other hand, if the root question is whether there
             | is an outage-style issue, then yeah: if the path to the
             | destination is hard down, no number of packets is going
             | to help, because they are all going to drop. Likewise, if
             | the question is "on a short enough time scale, is the
             | reliability of delivering a single packet somewhere on
             | the internet ever less than 99%?" then yeah: somewhere
             | there is a failure scenario, and on a short enough time
             | scale any failure scenario can be made to show 0%
             | reliability.
        
         | mcguire wrote:
         | Odds are, at least one of the links between the source and
         | destination will be shared. If so, sending two packets is an
         | expensive attempt at reliability; it cuts the usable
         | bandwidth in half. Besides, a single data packet will already
         | arrive with a fairly high success rate.
        
         | tomohawk wrote:
         | It depends on the characteristics of the transmission line.
         | If the loss is purely random, that is one thing, but often if
         | one packet is dropped or corrupted, there is a higher
         | probability that the following ones will meet the same fate.
         | For example, if the transmission is over a microwave link, it
         | is easy to see how something could cause a few thousand
         | packets in a row to go missing.
        
         | ignoramous wrote:
         | u/noselasd:
         | 
         | > _Also keep in mind this note:
         | http://technet.microsoft.com/en-us/library/cc940021.aspx_
         | 
         | > _Basically, if you send() 2 or more UDP datagrams in quick
         | succession, and the OS has to resolve the destination with
         | ARP, all but the first packet are dropped until you get an
         | ARP reply (this behavior isn't entirely unique to Windows,
         | btw)._
         | 
         | https://news.ycombinator.com/item?id=8468313
        
       | zamadatix wrote:
       | "In a readout system such as ours the network only consists of a
       | data sender and a data receiver with an optional switch
       | connecting them. Thus the only places where congestion occurs are
       | at the sender or receiver. The readout system will typically
       | produce data at near constant rates during measurements so
       | congestion at the receiver will result in reduced data rates by
       | the transmitter when using TCP."
       | 
       | At that point a better paper title would have been "Increasing
       | buffers and optimizing application syscalls to receive 10 Gb/s
       | of data", as it has nothing to do with achieving reliable UDP
       | transmission - which it doesn't even seem they needed:
       | 
       | "For some detector readout it is not even evident that guaranteed
       | delivery is necessary. In one detector prototype we discarded
       | around 24% of the data due to threshold suppression, so spending
       | extra time making an occasional retransmission may not be worth
       | the added complexity"
       | 
       | As far as actual reliable UDP testing at high speeds goes, one
       | might also want to consider the test scenario, as not all
       | Ethernet connections are equal. The 2 meter passive DACs used
       | here probably achieve a ~10^-18 bit error rate (BER), i.e. 1
       | bit error in every ~100 petabytes transferred. Go optical, on
       | the other hand, and even with forward error correction (FEC)
       | it's not uncommon to see transmission loss in the real world.
       | E.g. something a little more current,
       | https://blogs.cisco.com/sp/transforming-enterprise-applicati...,
       | is happy to call 10^-12 with FEC "traditionally considered to
       | be 'error free'", which would likely have resulted in lost
       | packets even in this 400 GB transfer test (though again, they
       | were fine with up to 24% loss in some cases, so I don't think
       | they were as worried about reliability as the paper title
       | would suggest).
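       | 
       | Back-of-the-envelope check of that last claim, assuming
       | independent bit errors (400 GB being the paper's test volume):
       | 
       |     /* Expected bit errors = BER * bits transferred (independence
       |      * assumed). Even at the 1e-12 "error free" threshold, a
       |      * 400 GB transfer expects a few flipped bits, i.e. a few
       |      * corrupted or lost packets. */
       |     #include <stdio.h>
       |     
       |     int main(void)
       |     {
       |         double ber   = 1e-12;       /* "error free" with FEC */
       |         double bytes = 400e9;       /* the ~400 GB transfer */
       |         double bits  = bytes * 8.0; /* 3.2e12 bits */
       |         printf("expected bit errors: %.1f\n", ber * bits); /* ~3.2 */
       |         return 0;
       |     }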
       | 
       | Generally, if you have any of these: 1) unknown congestion, 2)
       | unknown speed, or 3) unknown tolerance for error, you'll have
       | to do something that eats CPU time and massive amounts of
       | buffer for reliability. If you need the best reliability you
       | can get but don't have the luxury of retransmitting for
       | whatever reason, then as much error correction in the
       | upper-level protocol as you can afford from a CPU perspective
       | is your best bet.
       | 
       | If you want to see a modern take on achieving reliable
       | transmission over UDP check out HTTP/3.
        
         | aDfbrtVt wrote:
         | Traditional "error free" transmission in optical comms is a
         | 1E-15 BER. I can't access the EPON standard right now, but my
         | experience with other IEEE standards tells me they're
         | probably guaranteeing 1E-15 for the worst-case optical link.
         | This link is pretty close to optimal, so 400 GB of data is
         | nowhere near the amount needed to say anything with certainty
         | about the BER of the channel.
        
           | zamadatix wrote:
           | IEEE only guarantees 10^-12, which is almost certainly why
           | 1st-gen 25G products were released exactly when they were
           | able to hit that. My estimate that a 2 m 10G DAC from 2017
           | would have a BER of ~10^-18 comes from personal experience
           | (as unlikely as it sounds, I have done extensive testing on
           | 7 of the exact model of server and NIC in our lab,
           | purchased around the same time, though with a different
           | switch), not from the 400 GB transfers in the paper.
        
         | ignoramous wrote:
         | > _Generally if you have any of these: 1) unknown congestion 2)
         | unknown speed 3) unknown tolerance for error_
         | 
         | > ... _If you want to see a modern take on achieving reliable
         | transmission over UDP check out HTTP /3._
         | 
         | Not an expert, but I have seen folks here complain that QUIC
         | / HTTP3 doesn't have proper congestion control the way uTP
         | (BitTorrent over UDP) does with LEDBAT:
         | https://news.ycombinator.com/item?id=10546651
        
           | wmf wrote:
           | LEDBAT-style congestion control is not appropriate for
           | "foreground" web traffic; it would result in lower
           | performance than TCP-based HTTP. Fixing bufferbloat is an
           | ongoing project, and it isn't fair to blame QUIC for being
           | no worse than TCP.
        
       | mynegation wrote:
       | Relevant discussion on HN from 4 months ago of IBM's proprietary
       | large data transfer tool:
       | https://news.ycombinator.com/item?id=21898072
        
       | [deleted]
        
       | exdsq wrote:
       | Can you do something similar with TCP, increasing the packet
       | size so that the "TCP overhead" is reduced compared to 64-byte
       | payloads, while keeping TCP's reliability advantage over UDP?
        
         | zamadatix wrote:
         | MTU is the maximum transmission unit, so increasing it does
         | nothing to make 64-byte packets more efficient. You should
         | try to send as much data as you can in one go, and the socket
         | will automatically figure out how to split it up as best it
         | can. Most systems default to a 1500-byte MTU, so the OS will
         | chunk the data to fit in multiple 1500-byte packets. The OS
         | will also usually try to coalesce a bunch of small payloads
         | into one larger packet via e.g.
         | https://en.wikipedia.org/wiki/Nagle%27s_algorithm, but that's
         | not guaranteed and is much less CPU-efficient even when it
         | does work.
         | 
         | 99% of the time you are transferring data you don't need to
         | think this deeply about networking, though. E.g. I have the
         | exact same DL360 Gen9 servers with the same 10G NICs in my
         | lab, and 10G TCP streams run just fine on them without manual
         | tweaking. Setting the MTU to 9000 does make things more
         | efficient, but that's about as far as I'd go without a
         | particularly strong driver to optimize (e.g. "we've got 2,000
         | of these servers and if we could get by with 5% fewer it'd
         | save your yearly salary" kinds of things).
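         | 
         | For the "send as much as you can in one go" part, a sketch in
         | C (a hypothetical helper; the kernel does the MTU-sized
         | segmentation itself):
         | 
         |     /* Hand the kernel one large buffer and let it segment to
         |      * the MTU - far cheaper than thousands of tiny send()
         |      * calls that Nagle may or may not coalesce. */
         |     #include <stddef.h>
         |     #include <sys/types.h>
         |     #include <sys/socket.h>
         |     
         |     ssize_t send_all(int sock, const char *buf, size_t len)
         |     {
         |         size_t off = 0;
         |         while (off < len) {
         |             ssize_t n = send(sock, buf + off, len - off, 0);
         |             if (n < 0)
         |                 return -1;     /* caller inspects errno */
         |             off += (size_t)n;  /* send() may take only part */
         |         }
         |         return (ssize_t)off;
         |     }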
        
         | toast0 wrote:
         | In the system proposed, not really.
         | 
         | To use TCP instead of UDP there are two big problems:
         | 
         | 1) the sensor device would need to keep unacknowledged data in
         | memory, but it may not have enough memory for that
         | 
         | 2) if they're running at line rate (max bandwidth in this case)
         | in UDP, there's no bandwidth left to retransmit data
         | 
         | All of the buffer manipulation is going to be more CPU
         | intensive on both sides as well, and you'd also run into
         | congestion control limiting the data rate in the early part
         | of the capture.
         | 
         | For a system like this, while UDP doesn't guarantee
         | reliability, careful network setup (either sensor direct to
         | recorder, or a dedicated network with sufficient capacity and
         | no outside traffic) combined with careful software setup
         | allows for a very low probability of lost packets despite
         | there being no ability to retransmit.
        
       | fulafel wrote:
       | This would be interesting to try at today's faster Ethernet
       | speeds; I wonder how it goes at 100G.
        
       | otterley wrote:
       | (2017)
        
       | rubatuga wrote:
       | TLDR:
       | 
       |     sysctl -w net.core.rmem_max=12582912
       |     sysctl -w net.core.wmem_max=12582912
       |     sysctl -w net.core.netdev_max_backlog=5000
       |     ifconfig eno49 mtu 9000 txqueuelen 10000 up
        
       ___________________________________________________________________
       (page generated 2020-04-19 23:00 UTC)