[HN Gopher] 5Gbps Ethernet on the Raspberry Pi Compute Module 4
       ___________________________________________________________________
        
       5Gbps Ethernet on the Raspberry Pi Compute Module 4
        
       Author : geerlingguy
       Score  : 124 points
       Date   : 2020-10-30 18:25 UTC (4 hours ago)
        
 (HTM) web link (www.jeffgeerling.com)
 (TXT) w3m dump (www.jeffgeerling.com)
        
       | ProAm wrote:
       | That was a fun read. Thanks.
        
       | unilynx wrote:
       | > "I need four computers, and they all need gigabit network
       | interfaces... where could I find four computers to do this?"
       | 
       | Why not loop the ports back to themselves? IIRC, 1gbit ports
       | should autodetect when they're cross connected so it wouldn't
       | even need special cables
        
         | geerlingguy wrote:
         | Would that truly be able to test send / receive of a full (up
         | to) gigabit of data to/from the interface? If it's loopback, it
         | could test either sending 500 + receiving 500, or... sending
         | 500 + receiving 500. It's like sending data through localhost,
         | it doesn't seem to reflect a more real-world scenario (but
         | could be especially helpful just for testing).
        
           | nitrogen wrote:
           | I think maybe they meant linking Port 1 to Port 2, and Port 3
           | to Port 4? Also I believe gigabit ethernet can be full
           | duplex, so you should be able to send 1000 and receive 1000
           | on a single interface at the same time if it's in full duplex
           | mode.
        
         | adrian_b wrote:
         | When you loop back Ethernet links in the same computer, you
         | need to take care with the configuration, because normally the
         | operating system will not route the Ethernet packets through
         | the external wires but will process them as if they were for
         | localhost, so you will see a very high speed that has no
         | relationship to the actual Ethernet speed.
         | 
         | How to force the packets through the external wires depends on
         | the operating system. On Linux you must use namespaces and
         | assign the two Ethernet interfaces that are looped on each
         | other to two distinct namespaces, then set appropriate routes.
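         | 
         | A minimal sketch of that setup (eth1/eth2 and the 10.0.0.x
         | addresses are placeholders; adjust to the real interface
         | names):
         | 
         |     ip netns add ns1
         |     ip netns add ns2
         |     ip link set eth1 netns ns1
         |     ip link set eth2 netns ns2
         |     ip netns exec ns1 ip addr add 10.0.0.1/24 dev eth1
         |     ip netns exec ns2 ip addr add 10.0.0.2/24 dev eth2
         |     ip netns exec ns1 ip link set eth1 up
         |     ip netns exec ns2 ip link set eth2 up
         |     # traffic between the namespaces now has to cross the wire
         |     ip netns exec ns2 iperf3 -s &
         |     ip netns exec ns1 iperf3 -c 10.0.0.2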
        
       | q3k wrote:
       | Seems to be in the same ballpark as when I got ~3.09Gbps on the
       | Pi4's PCIe, but on a single 10G link:
       | https://twitter.com/q3k/status/1225588859716632576
        
         | geerlingguy wrote:
         | Oh, nice! How did I not find your tweets in all my searching
         | around?
        
           | q3k wrote:
           | Shitposting on Twitter makes for bad SEO :).
        
       | baybal2 wrote:
       | A much easier option:
       | 
       | Get a USB 3.0 2.5G or 5G card. With fully functional DMA on the
       | USB controller, it can get quite close to the PCIe option.
       | 
       | A setback for all Linux users at the moment:
       | 
       | The only chipmaker making USB NICs that do 2.5G+ is RealTek, and
       | RealTek chose to use the USB NCM API for their latest chips.
       | 
       | And as we know, Linux support for NCM is currently very slow and
       | buggy.
       | 
       | I barely got 120 megs out of it. I'd welcome any kernel hacker
       | taking on the problem.
        
         | [deleted]
        
         | vetinari wrote:
         | > The only chipmaker making USB NICs doing 2.5G+ is RealTek,
         | and RealTek chose to use USB NCM API for their latest chips.
         | 
         | QNAP QNA-UC5G1T uses Marvell AQtion AQC111U. Might be worth a
         | try.
        
       | escardin wrote:
       | It's probably outside the scope (and possibly cheating) but could
       | a DPDK stack & supported nic[1] push you past the PCIe limit?
       | 
       | [1] https://core.dpdk.org/supported/
        
         | q3k wrote:
         | Does DPDK actually let you not have to DMA packet data over to
         | the system memory and back?
        
       | geerlingguy wrote:
       | I think I've found the bottleneck now that I have the setup up
       | and running again today--ksoftirqd quickly hits 100% CPU and
       | stays that way until the benchmark run completes.
       | 
       | See: https://github.com/geerlingguy/raspberry-pi-pcie-
       | devices/iss...
        
         | iscfrc wrote:
         | You might want to try enabling jumbo frames by setting the MTU
         | to something >1500 bytes. Doing so should reduce the number of
         | IRQs per unit of time since each frame will be carrying more
         | data and therefore there will be fewer of them.
         | 
         | According to the Intel 82580EB datasheet[1] it supports an MTU
         | of "9.5KB." It's unclear if that means 9500 or 9728 bytes.
         | 
         | I looked briefly for a datasheet that includes the Ethernet
         | specs of the Broadcom BCM2711 but didn't immediately find
         | anything.
         | 
         | Recent versions of iproute2 can output the maximum MTU of an
         | interface via:
         | 
         |     # Look for "maxmtu" in the output
         |     ip -d link list
         | 
         | Barring that, you can try incrementally upping the MTU until
         | you run into errors.
         | 
         | The MTU of an interface can be set via:
         | 
         |     ip link set $interface mtu $mtu
         | 
         | Note that for symmetrical testing via direct crossover you'll
         | want to have the MTU be the same on each interface pair.
         | 
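         | For example (eth1 and <peer-ip> are placeholders; 9000 is a
         | commonly supported jumbo value, not a guarantee for this
         | hardware):
         | 
         |     ip link set eth1 mtu 9000        # repeat on the peer
         |     # verify: 8972 payload + 28 bytes of headers = 9000
         |     ping -M do -s 8972 <peer-ip>
         | 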
         | [1]
         | https://www.intel.com/content/www/us/en/embedded/products/ne...
         | (pg. 25, "Size of jumbo frames supported")
        
           | geerlingguy wrote:
           | I set the MTU to its max (just over 9000 on the intel, heh),
           | but that didn't make a difference. The one thing that did
           | move the needle was overclocking the CPU to 2.147 GHz (from
           | base 1.5 GHz clock), and that got me to 3.4 Gbps. So it seems
           | to be a CPU constraint at this point.
        
             | neurostimulant wrote:
             | I wonder if using user-space tcp stack (or anything that
             | could bypass the kernel) could push the number higher.
        
         | syoc wrote:
         | I would have a look at sending data with either DPDK
         | (https://doc.dpdk.org/burst-replay/introduction.html) or
         | AF_PACKET and mmap (https://sites.google.com/site/packetmmap/)
         | 
         | You can also use ethtool -C on the NICs on both ends of the
         | connection to rate-limit the IRQ handling, allowing you to
         | optimize for throughput instead of latency.
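         | 
         | Roughly like this (the exact values are guesses; check the
         | current settings with ethtool -c first):
         | 
         |     # favor throughput: turn off adaptive coalescing and
         |     # batch more work per interrupt
         |     ethtool -C eth1 adaptive-rx off adaptive-tx off
         |     ethtool -C eth1 rx-usecs 100 tx-usecs 100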
        
       | drewg123 wrote:
       | _So theoretically, 5 Gbps was possible_
       | 
       | No, it is not. That NIC is a PCIe Gen2 NIC. By using only a
       | single lane, you're limiting the bandwidth to ~500MB/sec
       | theoretical. That's 4Gb/s theoretical, and getting 3Gb/s is ~75%
       | of the theoretical bandwidth, which is pretty decent.
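       | 
       | (The arithmetic: PCIe Gen2 signals at 5 GT/s per lane with 8b/10b
       | encoding, so one lane carries at most 5 * 8/10 = 4 Gb/s of
       | payload, i.e. ~500 MB/s, before protocol overhead.)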
        
         | geerlingguy wrote:
         | I'll take pretty decent, then :)
         | 
         | I mean, before this the most I had tested successfully was a
         | little over 2 Gbps with three NICs on a Pi 4 B.
        
           | drewg123 wrote:
           | Can you run an lspci -vvv on the Intel NIC? I just re-read
           | things, and it seems like 1 of those Gb/s is coming from the
           | on-board NIC. I'm curious if maybe PCIe is running at Gen1.
        
             | geerlingguy wrote:
             | Here you go! https://pastebin.com/A8gsGz3t
        
               | drewg123 wrote:
                | So it's running Gen2 x1, which is good. I was afraid
                | that it might have downshifted to Gen1. Other threads
                | point to your CPU being pegged, and I would tend to
                | agree with that.
               | 
                | What direction are you running the streams in? In
                | general, sending is much more efficient than receiving
                | ("it's better to give than to receive"). From your
                | statement that ksoftirqd is pegged, I'm guessing you're
                | receiving.
               | 
               | I'd first see what bandwidth you can send at with iperf
               | when you run the test in reverse so this pi is sending.
               | Then, to eliminate memory bw as a potential bottleneck,
               | you could use sendfile. I don't think iperf ever
                | supported sendfile (but it's been years since I've used
               | it). I'd suggest installing netperf on this pi, running
               | netserver on its link partners, and running "netperf
               | -tTCP_SENDFILE -H othermachine" to all 5 peers and see
               | what happens.
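                | 
                | For the reversed iperf3 run, something like this (the
                | peer address is a placeholder):
                | 
                |     # on the peer
                |     iperf3 -s
                |     # on the Pi: the client sends by default...
                |     iperf3 -c 10.0.0.2 -t 30
                |     # ...and -R reverses it so the Pi receives instead
                |     iperf3 -c 10.0.0.2 -t 30 -R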
        
         | stkdump wrote:
         | Well, when a LAN is 1Gb/s they are actually not talking about
         | real bits. It actually is 100MB/s max, not 125MB/s as one might
         | expect. Back in the old days they used to call it baud.
        
           | wmf wrote:
           | This is wrong; 1 Gbps Ethernet is 125 MB/s (including
           | headers/trailer and inter-packet gap so you only get ~117 in
           | practice). Infiniband, SATA, and Fibre Channel cheat but
           | Ethernet doesn't.
        
       | geerlingguy wrote:
       | Sorry about the slightly-clickbaity title. I actually have at
       | least a 10 GbE card (and switch) on the way to test those and see
       | if I can get more out of it, but for _this_ test, I had a
       | 4-interface Intel I340-T4, and I managed to get a maximum
       | throughput of 3.06 Gbps when pumping bits through all 4 of those
       | plus the built-in Gigabit interface on the Compute Module.
       | 
       | For some reason I couldn't break that barrier, even though all
       | the interfaces can do ~940 Mbps on their own, and any three on
       | the PCIe card can do ~2.8 Gbps. It seems like there's some sort
       | of upper limit around 3 Gbps on the Pi CM4 (even when combining
       | the internal interface) :-/
       | 
       | But maybe I'm missing something in the Pi OS / Debian/Linux
       | kernel stack that is holding me back? Or is it a limitation on
        | the SoC? I thought the Ethernet chip was separate from the PCIe
       | lanes on it, but maybe there's something internal to the BCM2711
       | that's bottlenecking it.
       | 
       | Also... tons more detail here:
       | https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...
        
         | wil421 wrote:
         | Do you think an SFP+ nic would work? It would be cool to try
         | out fiber.
        
           | baybal2 wrote:
            | There is no SFP option on 5 Gbps NICs, as I understand it,
            | per the standard.
        
         | mmastrac wrote:
         | Awesome work. Been watching your videos on these (the video
         | card one was especially interesting).
         | 
         | At what point are you saturating the poor little ARM CPU (or
         | its tiny PCIe interface)?
        
           | geerlingguy wrote:
           | Heh, I know that ~3 Gbps is the maximum you can get through
            | the PCIe interface (x1, PCIe 2.0), so that is expected. But
            | I was hoping the internal Ethernet interface was separate
            | and could add one more 1 Gbps... the CPU didn't seem to be
            | maxed
           | out and was also not overheating at the time (especially not
           | with my 12" fan blasting on it).
        
             | dualboot wrote:
              | With some tuning you should be able to saturate the PCIe
              | x1 slot.
             | 
             | Excellent reading on this available here :
             | 
             | http://www.intel.com/content/dam/doc/application-
             | note/82575-...
             | 
             | and here :
             | 
             | https://blog.cloudflare.com/how-to-achieve-low-latency/
             | 
             |  _Edit : with the inbound 10Gb card referenced_
        
             | toast0 wrote:
             | Was all this TCP? You might try UDP as well, in case you're
             | hitting a bottleneck in the tcp stack.
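              | 
              | A quick check, assuming iperf3 (-u selects UDP and -b 0
              | lifts its default UDP rate cap; the address is a
              | placeholder):
              | 
              |     iperf3 -c 10.0.0.2 -u -b 0 -t 30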
        
         | stratosmacker wrote:
         | Jeff,
         | 
         | First off, thank you for doing this kind of 'r&d', it is really
         | exciting to see what the Pi is capable of after less than a
         | decade.
         | 
         | Would you be interested in someone testing a SAS PCI card? I'm
         | going to pick up one of these as soon as they're not
         | backordered...
        
         | monocasa wrote:
         | You might be hitting the limits of the RAM. I think LPDDR3
         | maxes out at ~4.2Gbps, and running other bus masters like the
         | HDMI and OS itself would be cutting into that.
        
           | wmf wrote:
           | 32-bit LPDDR4-3200 should give 12.8 Gbytes/s which is 102
           | Gbits/s.
        
             | monocasa wrote:
             | You can't just multiply width*frequency for DRAM these
             | days, as much as I wish we still lived in the days of
             | ubiquitous SRAM.
             | 
             | The chip in some of the 2GB RPI4s is rated for only
             | 3.7Gbps.
             | 
             | https://www.samsung.com/semiconductor/dram/lpddr4/K4F6E304H
             | B...
        
               | wmf wrote:
                | No, that chip is rated for 3.7 Gbps _per pin_ and it's
               | 32 bits wide. Even at ~60% efficiency you're an order of
               | magnitude off.
        
               | monocasa wrote:
               | Real world tests are seeing around 3 to 4 Gbps of memory
               | bandwidth.
               | 
               | https://medium.com/@ghalfacree/benchmarking-the-
               | raspberry-pi...
               | 
               | LPDDR cannot sustain anywhere near the max speed of the
               | interface. It's more of a hope that you can burst
               | something out and go to sleep rather than trying to
                | maintain that speed. In a lot of ways DRAM hasn't gotten
                | faster in decades, when you look at how latency in
                | clocks nearly always increases at the same rate as the
                | interface speed. And LPDDR is the niche where that shows
                | up the most, because it doesn't have oodles of dies to
                | interleave to hide that issue.
        
               | mlyle wrote:
               | Bits aren't bytes.
        
               | monocasa wrote:
               | The y axis is labeled "megabits per second".
        
               | hedgehog wrote:
               | Those numbers look way off, maybe they mixed up the
               | units? Should be a few GBps at least.
        
               | wmf wrote:
               | Innumeracy strikes again. It's actually 4-5 Gbytes/s [1]
               | plus whatever bandwidth the video scanout is stealing
               | (~400 Mbytes/s?). That's only ~40% efficient which is
               | simultaneously terrible and pretty much what you'd expect
               | from Broadcom. However 4 Gbytes/s is 32 Gbits/s which
               | leaves plenty of headroom to do 5 Gbits/s of network I/O.
               | 
               | [1]
               | https://www.raspberrypi.org/forums/viewtopic.php?t=271121
        
               | mmastrac wrote:
               | Is there a way to see if you are hitting memory bandwidth
               | issues in Linux?
        
               | monocasa wrote:
               | Not in a holistic way AFAIK, and for sure not rigged up
               | to the Raspbian kernel (since all of that lives on the
               | videocore side), but I bet Broadcom or the RPi foundation
               | has access to some undocumented perf counters on the DRAM
               | controller that could illuminate this if they were the
               | ones debugging it.
        
         | CyberDildonics wrote:
         | Instead of lying and then apologizing once you get what you
         | want, it would be better to just not lie in the first place.
        
           | geerlingguy wrote:
           | Technically it's not a lie--there are 5x1 Gbps of interfaces
           | here. But I wanted to acknowledge that I used a technicality
           | to get the title how I wanted it, because if I didn't do
           | that, a lot of people wouldn't read it, and then we wouldn't
           | get to have this enlightening discussion ;)
        
         | ksec wrote:
          | >Sorry about the slightly-clickbaity title.
          | 
          | Well yes, because 5Gbps Ethernet is actually a thing (NBase-T
          | or 5GBASE-T). So 1Gbps x 5 would be more accurate.
          | 
          | Can't wait to see results on 10GbE though :)
          | 
          | P.S. I really wish 5Gbps Ethernet were more common.
        
           | geerlingguy wrote:
           | True true... though in my work trying to get a flexible 10
            | GbE network set up in my house, I've found that the support
            | for 2.5 and 5 GbE is iffy at best on many devices :(
        
           | ncrmro wrote:
            | My AT&T router made by Nokia has one 5GbE port, and the
            | fiber plugs in directly with SFP!
        
         | StillBored wrote:
          | It's a single-lane PCIe Gen2 interface. The max theoretical
          | is 500MB/sec. So you can't ever touch 10G with it. In
          | reality, getting 75% of theoretical on PCIe tends to be a
          | rough upper limit on most PCIe interfaces, so the 3Gbit
          | you're seeing is pretty close to what one would expect.
          | 
          | edit: Oh, it's 3Gbit across 5 interfaces, one of which isn't
          | PCIe, so the PCIe side is probably only running at about 50%.
          | It might be interesting to see if the CPUs are pegged (or
          | just one of them). Even so, PCIe on the rpi isn't coherent,
          | so that is going to slow things down too.
        
           | leptons wrote:
           | >It might be interesting to see if the CPUs are pegged (or
           | just one of them).
           | 
           | This is very likely the answer. I see a lot of people who
           | think of the Pi as some kind of workhorse and are trying to
           | use it for things that it simply can't do. The Pi is a great
           | little piece of hardware, but it's not really made for this
           | kind of thing. I'd never think about using a Raspberry Pi if
           | I had to think about "saturating a NIC".
        
             | geerlingguy wrote:
             | Well it can saturate up to two, and almost three, gigabit
             | NICs now. So not too shabby.
             | 
             | But I like to know the limits so I can plan out a project
             | and know whether I'm safe using a Pi, or a 3-5x more
             | expensive board or small PC :)
        
           | geerlingguy wrote:
           | It looks like the problem is `ksoftirqd` gets pegged at 100%
           | and the system just queues up packets, slowing everything
           | down. See: https://github.com/geerlingguy/raspberry-pi-pcie-
           | devices/iss...
        
             | StillBored wrote:
              | So, this is sorta indicative of an RSS problem, but on
              | the rpi it could be caused by other things. Check
              | /proc/interrupts to ensure you have balanced MSIs,
              | although that itself could be a problem too.
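              | 
              | A quick way to eyeball the distribution (eth1 is a
              | placeholder for whatever the igb interfaces are named):
              | 
              |     watch -n1 'grep eth1 /proc/interrupts'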
             | 
             | edit: run `perf top` to see if that gives you a better
             | idea.
        
               | geerlingguy wrote:
                | Results:
                | 
                |     15.96%  [kernel]  [k] _raw_spin_unlock_irqrestore
                |     12.81%  [kernel]  [k] mmiocpy
                |      6.26%  [kernel]  [k] __copy_to_user_memcpy
                |      6.02%  [kernel]  [k] __local_bh_enable_ip
                |      5.13%  [igb]     [k] igb_poll
               | 
               | When it hit full blast, I started getting "Events are
               | being lost, check IO/CPU overload!"
        
               | SoapSeller wrote:
                | Another idea would be to increase interrupt coalescing
                | via ethtool -c/-C.
        
             | dualboot wrote:
             | This is common even on x86 systems.
             | 
              | You have to set the IRQ affinity to utilize the available
              | CPU cores.
              | 
              | There is a script called "set_irq_affinity" included with
              | the driver source you used to compile the drivers.
              | 
              | Example (sets IRQ affinity for all available cores):
             | 
             | [path-to-i40epackage]/scripts/set_irq_affinity -x all ethX
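              | 
              | The manual equivalent, roughly (the IRQ numbers below are
              | placeholders; take the real ones from /proc/interrupts):
              | 
              |     # note each queue's IRQ number
              |     grep eth1 /proc/interrupts
              |     # pin queue 0's IRQ to CPU 1, queue 1's to CPU 2, etc.
              |     echo 1 > /proc/irq/55/smp_affinity_list
              |     echo 2 > /proc/irq/56/smp_affinity_list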
        
               | geerlingguy wrote:
               | So like https://pastebin.com/2Z4UECPq ? -- this didn't
               | make a difference in the overall performance :(
        
               | dualboot wrote:
               | Looks like the script needs to be adjusted to function on
               | the Pi.
               | 
               | I wish I had the cycles and the kit on hand to play with
               | this!
        
       ___________________________________________________________________
       (page generated 2020-10-30 23:00 UTC)