[HN Gopher] How 1500 bytes became the MTU of the internet ___________________________________________________________________ How 1500 bytes became the MTU of the internet Author : petercooper Score : 357 points Date : 2020-02-19 12:13 UTC (10 hours ago) (HTM) web link (blog.benjojo.co.uk) (TXT) w3m dump (blog.benjojo.co.uk) | rcarmo wrote: | I used to glue stuff together to FDDI rings and Token Ring | networks back in the day (I used Xylan switches, which had ATM-25 | UTP line cards among other long-forgotten oddities), and MTU | sizes always struck me as being particularly arbitrary. | | But I'm not really sure about the clock sync limitations being a | factor here. It was way back in the deepest past. | | What I do remember vividly is the mess that physical layer | networking evolved into over the years thanks to dial-up and DSL | (ever had to set your MTU to 1492 to accommodate an extra PPP | header?). | | And something is obviously wrong today, since we're still using | the same baseline value for our gigabit fiber to the home | connections, our 3/4/5G (scratch to taste) mobile phones, etc. | kalleboo wrote: | > _ever had to set your MTU to 1492 to accommodate an extra PPP | header?_ | | I had to replace my Apple AirPort Extreme when I got gigabit | fiber since it didn't have a manual MTU setting and it didn't | autodetect the MTU properly over PPPoE... In 2020 I still need | to manually set the MTU on my Ubiquiti USG... | mrkstu wrote: | Oh, and throw IPsec and a bunch of other protocols into the | mix- networking is often a fragile beast: | | https://www.networkworld.com/article/2224654/mtu-size-issues... | hylaride wrote: | PPPoE was such an ugly mess. In its early days there were | various server-side OSes (IRIX and one other I can't remember) | that had piss-poor TCP/IP MTU implementations. 
The result was | that you ran into random websites that just took forever to | load, as only packets that happened not to fill the full 1492 limit | eventually got the data through. By around 2003 I rarely | encountered it anymore, but then I moved to a place with cable | and never had to deal with it again. | neurostimulant wrote: | > ever had to set your MTU to 1492 to accommodate an extra PPP | header? | | Ah, I was always wondering why my ISP configured my fiber | modem's MTU to 1492. So it's due to using PPPoE? Is there no | way to use a bigger MTU when using PPPoE? | jburgess777 wrote: | RFC 4638 provides a mechanism for this. It relies on the | Ethernet devices supporting a 'baby jumbo' MTU of 1508 bytes, | and support for it is still a bit scarce. | | https://tools.ietf.org/html/rfc4638 | toast0 wrote: | Nowadays, there's PPPoA (over ATM) which wraps at a lower | level, and allows 1500 byte ethernet payloads through. But | running the ethernet over ATM at 1508 MTU so that PPPoE would | be 1500 was probably out of reach --- when PPPoE was | introduced, the customer endpoint was often the customer PC, | and some of those were using fairly old NICs that might not | have supported larger packets. | | Sadly, smaller than 1500 byte MTUs still cause issues for | some people to this day. It's all fine if everything is | properly configured, or if at least everything sends and | receives ICMP, but if something is silently dropping packets, | you're in for a bad day. These days, I think it's usually | problems with customers sending large packets, as opposed to | early days where receiving large packets would routinely | fail, but a lot of that is because large sites gave up on | sending large packets. | rcarmo wrote: | Yes, PPPoA was also a thing I dealt with, and another | source of irritating MTU issues.
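The PPPoE arithmetic this subthread keeps circling can be written down directly (a minimal sketch; the 6-byte PPPoE header plus 2-byte PPP protocol ID come from RFC 2516, and the 1508-byte "baby jumbo" figure from RFC 4638):

```python
# PPPoE adds a 6-byte PPPoE header plus a 2-byte PPP protocol ID
# inside the standard 1500-byte Ethernet payload, leaving 1492 for IP.
ETHERNET_PAYLOAD = 1500
PPPOE_HEADER = 6       # version/type, code, session ID, length
PPP_PROTOCOL_ID = 2    # e.g. 0x0021 for IPv4

pppoe_mtu = ETHERNET_PAYLOAD - PPPOE_HEADER - PPP_PROTOCOL_ID
print(pppoe_mtu)  # 1492

# RFC 4638's "baby jumbo" approach: raise the Ethernet payload to 1508
# so the IP layer gets its full 1500 bytes back.
baby_jumbo_payload = ETHERNET_PAYLOAD + PPPOE_HEADER + PPP_PROTOCOL_ID
print(baby_jumbo_payload)  # 1508
```

Which is why the two magic numbers in this thread are 1492 (shrink IP to fit) and 1508 (grow Ethernet to compensate).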
| franga2000 wrote: | Not my field, so I might be making an obvious error here, but: | | If there are efficiency gains to be had from using jumbo frames, | wouldn't setting my MTU to a multiple of 1500 still be of some | benefit? If my PC, my switch and my router all support it, that | would still be a tiiiny improvement. If the server's network does | as well, and let's say both of our direct providers, even if none | of the exchanges or backbones in between do, that would still be | an efficiency gain for ~10% of the link, right? | benjojo12 wrote: | Locally you can set your MTU to larger than 1500, but if you | (generally) try and send a packet towards the internet larger | than 1500 it will be dropped without a trace, or it will be | dropped and an ICMP message will be generated to tell your | system to lower the MTU. Assuming you have not firewalled off | ICMP ;) | | As a handy feature on Linux at least, you can set your MTU to | 9000 locally, and then set the default (internet generally) | route to have a MTU of 1500 to prevent issues: | | ip route add 0.0.0.0/0 via 10.11.11.1 mtu 1500 | luma wrote: | Over-sized packets can (and generally will) be fragmented by | your router. They shouldn't be dropped unless you've | intentionally set DNF. | zajio1am wrote: | AFAIK, most OSes today set DNF by default. | benjojo12 wrote: | Fragments are very hit or miss on the internet, | https://blog.cloudflare.com/ip-fragmentation-is-broken/ | duxup wrote: | They will fragment them but many times you will see | performance or other misc issues ... eventually. | apexalpha wrote: | Oh I never knew this. I wonder if I could enable Jumbo Frames | to stream 4k content more efficiently on my local LAN. | a_t48 wrote: | You could...if the software on both your server and your | media pc support large frames.
And you're willing to deal | with every once in a while some piece of software doing the | wrong thing and sending out every packet with a large MTU | without doing detection on max packet size. | duxup wrote: | Potentially, but troubleshooting performance issues from | mismatched MTUs can be brutal, so most providers drop anything | over 1500. | | Many devices can do over 1500, but anyone who has done so | without careful consideration knows the outcome isn't | predictable unless everyone on the network is prepared to do | so. | | A dedicated / controlled SAN type environment can do it just | fine, beyond that it can be difficult. | tambourine_man wrote: | Nah, it's 1492 forever! | dredmorbius wrote: | Found the ADSL user. | willis936 wrote: | IEEE 802 history is disappearing without a trace? Afaik it's | pretty well documented, you just need to be a member for some of | the stuff. | | http://www.ieee802.org/ | | I feel like the last piece we're missing in this story is the | performance impact of fragmentation. Like why not just set all | new hardware to an MTU of 9000 and wait ten years? | teddyh wrote: | My favorite Ethernet resource is Charles Spurgeon's Ethernet | (IEEE 802.3) Web Site: http://www.ethermanage.com/resources/ | | It used to have even more stuff, but I think he removed a lot | when he got his book published. | cesarb wrote: | > Like why not just set all new hardware to an MTU of 9000 and | wait ten years? | | The hardware in question is Ethernet NICs. However, for you to | set the MTU on an Ethernet NIC to 9000, _every_ device on the | same Ethernet network (at least the same Ethernet VLAN), | including all other NICs and switches, including ones which | aren't connected yet, must also support and be configured for | that MTU. And this also means you cannot use WiFi on that | Ethernet network (since, at least last time I looked, WiFi | cannot use a MTU that large).
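franga2000's "tiiiny improvement" can be put in numbers. A rough goodput model (the constants assume untagged Ethernet framing and option-free IPv4/TCP headers; the function name is just for illustration):

```python
# Fraction of wire time that carries application payload at a given MTU.
# Per-frame costs: 14 B Ethernet header, 4 B FCS, 8 B preamble/SFD,
# 12 B minimum inter-frame gap, plus 20 B IPv4 and 20 B TCP headers.
def goodput_efficiency(mtu):
    framing = 14 + 4 + 8 + 12
    payload = mtu - 20 - 20
    return payload / (mtu + framing)

print(f"MTU 1500: {goodput_efficiency(1500):.1%}")  # about 95%
print(f"MTU 9000: {goodput_efficiency(9000):.1%}")  # about 99%
```

So jumbo frames buy roughly four percentage points of goodput, which is why the gain only really matters on saturated links (and why the per-packet CPU savings, not shown here, are often the stronger argument).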
| willis936 wrote: | Sending a jumbo frame down a line that has hardware that | doesn't support jumbo frames somewhere along the way does not | mean the packet gets dropped. The NIC that would send the | jumbo frame fragments the packet down to the lower MTU. So | what's the performance impact of that fragmentation? If it | isn't higher than the difference in bandwidth overhead from | headers of 9000 MTU traffic vs. 1500 MTU traffic then why not | transition to 9000 MTU? | tyingq wrote: | It does mean packets sent to another local, non-routed, | non-jumbo-frame interface would get lost. So you could, for | example, _maybe_ talk to the internet, but you couldn't | print anything to the printer down the hall. | cesarb wrote: | AFAIK, Ethernet has no support for fragmentation; I've | never seen, in the Ethernet standards I've read (though I | might have missed it), a field saying "this is a fragment | of a larger frame". There's fragmentation in the IP layer, | but it needs: (a) that the frame contains an IP packet; (b) | that the IP packet can be fragmented (no "don't fragment" | on IPv4, or a special header on IPv6); (c) that the sending | host knows the receiving host's MTU; (d) that it's not a | broadcast or multicast packet (which have no singular | "receiving host"). | | You can have working fragmentation if you have two separate | Ethernet segments, one for 1500 and the other for 9000, | connected by an IP router; the cost (assuming no broken | firewalls blocking the necessary ICMP packets, which sadly | is still too common) is that the initial transmission will | be _resent_ since most modern IP stacks set the "don't | fragment" bit (or don't include the extra header for IPv6 | fragmentation). | vlan0 wrote: | PMTU-D will save their ass in some cases. But it's not safe | to assume all routers in the path will respond to ICMP.
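cesarb's point that fragmentation lives at the IP layer, not in Ethernet, can be sketched. An IPv4 router splits the payload at 8-byte-aligned boundaries and sets "more fragments" on all but the last piece. This is a toy model (real fragmentation also copies the IP header and certain options into every fragment):

```python
# Sketch of IPv4 fragmentation: split a datagram's payload into
# pieces that fit the next-hop MTU. Offsets are counted in 8-byte
# units, so every fragment's data except the last must be a
# multiple of 8 bytes.
def fragment(payload_len, mtu, ip_header=20):
    max_data = (mtu - ip_header) // 8 * 8
    frags, offset = [], 0
    while offset < payload_len:
        data = min(max_data, payload_len - offset)
        more_fragments = offset + data < payload_len
        frags.append((offset // 8, data, more_fragments))
        offset += data
    return frags

# A jumbo-frame host's 8980-byte payload crossing a 1500-MTU link:
for off_units, length, more in fragment(8980, 1500):
    print(off_units, length, more)
```

Seven fragments for one jumbo datagram, each needing its own header and reassembly state at the far end, which is exactly the cost toast0's "drop fragments without processing them" comment is about.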
| toast0 wrote: | It doesn't matter that the routers respond to ICMP, it | matters that they generate them, and that they're | addressed properly, and that intermediate routers don't | drop them. | | Some routers will generate the ICMPs, but are rate | limited, and the underlying poor configuration means that | the rate limits are hit continuously and most connections | are effectively in a path MTU blackhole. | vlan0 wrote: | >It doesn't matter that the routers respond to ICMP, it | matters that they generate them, and that they're | addressed properly, and that intermediate routers don't | drop them. | | >Some routers will generate the ICMPs, but are rate | limited, and the underlying poor configuration means that | the rate limits are hit continuously and most connections | are effectively in a path MTU blackhole. | | Sure. But I'm not about to sit here and name all the | different reasons for folks. And since most here do not | have a strong networking background, running consumer | grade routers at home, it seemed most applicable. | | I could have used a more encompassing term like PMTU-D | blackhole, but I didn't. | sathackr wrote: | But how does the NIC know that, 11 hops away, there is a | layer 2 device, which cannot communicate with the | NIC (switches do not typically have the ability to | communicate directly with the devices generating the | packets), that only supports a 1500 byte frame? | | Now you need Path MTU discovery, which as the article | indicates, has its own set of issues. (Overhead from trial | and error, ICMP being blocked due to security concerns, | etc...) | willis936 wrote: | Why should it need to? Ethernet is designed to have non- | deterministic paths (except in cases of automotive, | industrial, and time sensitive networks). If you get to a | hop that doesn't support jumbo frames then break it into | smaller frames and send them individually. The higher | layers don't care if the data comes in one frame or ten.
| [deleted] | wbl wrote: | If you block ICMP you deserve what you get. Don't do | this. (Edit: don't block ICMP) | oarsinsync wrote: | So now you're trying to communicate from your home | machine to some random host on the internet (website, | VPS, streaming service), and you're configured for MTU | 9000, the remote service is also configured for MTU 9000, | but some transit provider in the middle is not, and | they've disabled ICMP for $reasons. | | They blocked ICMP, do you deserve what you get? | wbl wrote: | Transit providers should push packets and generally do. | With PMTU failures it's usually clueless network admins | on firewalls nearer endpoints. And no, you don't, and I | wish the admin responsible could feel your pain. | oarsinsync wrote: | > Transit providers should | | Agreed | | > and generally do | | Agreed. | | Now if you can make it 'will always just push packets', | we'll be golden. | | Unfortunately, there are enough ATM/MPLS/SONET/etc | networks being run by people who no longer understand | what they're doing, that we're never going to get there. | | To make matters more entertaining, IPv6 depends on icmp6 | even more. | [deleted] | toast0 wrote: | > Sending a jumbo frame down a line that has hardware that | doesn't support jumbo frames somewhere along the way does | not mean the packet gets dropped | | Almost all IP packets on the internet at large have the 'do | not fragment' flag set. IP defragmentation performance | ranges from pretty bad to an easy DDoS vector, so a lot of | high traffic hosts drop fragments without processing them. | | If we had truncation (with a flag) instead of | fragmentation, that might have been usable, because the | endpoints could determine in-band the max size datagram and | communicate it and use that; but that's not what we have. | zamadatix wrote: | Fragmentation/reassembly is an L3 concept and not | guaranteed to work with large MTUs even when it is there.
| Avamander wrote: | The worst case I just recently encountered with Jumbo Frames | was with NetworkManager trying to follow the local DNS server's | advertised MTU, but when the local interface doesn't support | Jumbo Frames it just dies and keeps looping. | | Even if you really want devices to use JF, some fail | miserably because it's just not well thought out. | tyingq wrote: | _" why not just set all new hardware to an MTU of 9000"_ | | Routers can fragment the packets, switches can't. So that would | be pretty chaotic for non-techie installed equipment. | kitteh wrote: | Plenty of routers today that can't fragment packets. And they | have rate limiters where they can only generate a small | amount of ICMP 3/4s (maybe 50 a second). | brutt wrote: | Last time I saw a hardware Ethernet switch was 20 years ago. | 8-[ ] | tyingq wrote: | There's one in your house probably. It won't frag packets | between your wired PC and your wired printer. | | There are also certainly a shit load of them in closets and | top-of-rack all over where I work. | vlan0 wrote: | >I feel like the last piece we're missing in this story is the | performance impact of fragmentation. Like why not just set all | new hardware to an MTU of 9000 and wait ten years? | | Because a node with a MTU of 9000 will very likely be unable to | determine the MTU of every link in its path. At best, you'll | see fragmentation. At worst, the node's packets will be | registered as interface errors when it encounters an interface | lower than 9k. Neither of those are desirable. | [deleted] | phicoh wrote: | The problem seems to be that both the IEEE and the IETF don't | want to do anything. | | IEEE could define a way to support larger frames. 'just wait 10 | years' doesn't strike me as the best solution, but at least it | is a solution. In my opinion a better way would be if all devices | would report the max frame length they support. Bridges would | just report the minimum over all ports on the same VLAN.
When | there are legacy devices that don't report anything, just stay | at 1500. | | IETF can also do something today by having hosts probe the | effective max. frame length. There are drafts but they don't go | anywhere because too few people care. | hinkley wrote: | > If we look at data from a major internet traffic exchange point | (AMS-IX), we see that at least 20% of packets transiting the | exchange are the maximum size. | | He's so optimistic. My brain heard this as " _only_ 20% of | packets [...] are the maximum size" | | What are all of those 64 byte packets? Interactive shells, or | some other low bitrate protocol? | wmf wrote: | Probably mostly ACKs. | hinkley wrote: | Well now I feel dumb. | labawi wrote: | Note that maximum size is defined as >=1514, with 50% of | packets being >= 1024. It very well may be that ~45% of packets | are >= 1400 bytes. | | The transfer graph is wrong - it shows packet count | distribution, not size. Quick math says roughly 90% of transferred | bytes are in >= 1024 byte packets. | mhandley wrote: | For 802.11, the biggest overhead is not packet headers but the | randomized medium acquisition time so as to minimize collisions. | 1500 bytes is way too small here with modern 802.11, so if you | only send one packet for each medium acquisition, you end up with | something upwards of 90% overhead. The solution 802.11n and later | uses here is to use Aggregate MPDUs (AMPDUs). For each medium | acquisition, the sender can send multiple packets in a contiguous | burst, up to 64 KBytes. This ends up adding a lot of mechanism, | including a sliding window block ack, and it impacts queuing | disciplines, rate adaptation and pretty much everything else. | Life would be so much simpler if the MTU had simply grown over | time in proportion to link speeds. | saber6 wrote: | > Life would be so much simpler if the MTU had simply grown | over time in proportion to link speeds. | | For everything besides real-time, maybe.
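labawi's "quick math" is worth making explicit: a packet-count distribution has to be weighted by packet size to get the share of bytes. The bucket shares below are invented for illustration (they are not AMS-IX's real numbers), but they show how half the packets can carry roughly 90% of the bytes:

```python
# (approx. packet size in bytes, fraction of packet count) - hypothetical
# buckets: lots of tiny ACK-sized packets, a long tail of full frames.
buckets = [(64, 0.35), (256, 0.10), (768, 0.05), (1280, 0.10), (1500, 0.40)]

total_bytes = sum(size * share for size, share in buckets)
big_packet_bytes = sum(size * share for size, share in buckets if size >= 1024)

print(f"share of packets >= 1024 B: {sum(s for sz, s in buckets if sz >= 1024):.0%}")
print(f"share of bytes   >= 1024 B: {big_packet_bytes / total_bytes:.0%}")
```

The count distribution and the byte distribution answer different questions, which is the error labawi is pointing out in the transfer graph.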
| sjwright wrote: | The M in MTU stands for maximum, not mandatory. | Avamander wrote: | Tell that to NetworkManager | LeifCarrotson wrote: | It ends up being mandatory if you're sharing a non-MIMO | link with other systems that are using large packets. | saber6 wrote: | I understand. I have architected networks for over a decade | now. The real issue is serialization delay. If I have a | tiny voice packet that has to wait to be physically | transmitted behind a huge dump truck packet (big), it can | still be a problem even with high speed links with regards | to microbursts. | wtallis wrote: | > Life would be so much simpler if the MTU had simply grown | over time in proportion to link speeds. | | The problem is that the world went wireless, so _maximum_ link | speeds grew a lot but _minimum_ link speeds are still | relatively low. A single 64kB packet tying up a link for | multiple milliseconds--unconditionally delaying everything else | in the queue by at least that much--is not what we want. | mhandley wrote: | 802.11 AMPDUs already tie up the link for ~4ms in normal | operation. Without this, the medium acquisition overheads | kill throughput. But you're correct that a single 64KB packet | sent at MCS-0 would take a lot longer than that. | | 802.11 already includes a fragmentation and reassembly | mechanism at the 802.11 level, distinct from any end-to-end | IP fragmentation. Unlike IP fragmentation, fragments are | retransmitted if lost. So you could use 802.11 fragmentation | for large packets sent at slow link speeds to avoid tying up | the link for a long time. | btown wrote: | Especially since there are a lot of low latency applications | (games, etc.) that take advantage of being able to fit data | in a single packet that will not be held up due to other | applications sharing the link that might try to stuff larger | packets down the link. 
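wtallis's and mhandley's airtime concern is easy to quantify: ignoring preambles and contention (so these are lower bounds), the time one frame occupies the medium is just size over rate. The 6 Mb/s figure stands in for a low 802.11 MCS:

```python
# Time (in ms) that one frame of `frame_bytes` occupies a link at `mbps`.
def airtime_ms(frame_bytes, mbps):
    return frame_bytes * 8 / (mbps * 1e6) * 1e3

print(f"{airtime_ms(1500, 6):.2f} ms")    # a 1500 B frame at 6 Mb/s
print(f"{airtime_ms(65536, 6):.1f} ms")   # a 64 KB aggregate at 6 Mb/s: ~87 ms
print(f"{airtime_ms(65536, 600):.2f} ms") # the same aggregate at 600 Mb/s
```

A 64 KB frame at the slowest rates blocks the shared medium for tens of milliseconds, which is exactly why 802.11 caps aggregates by airtime rather than simply raising the MTU.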
| inetknght wrote: | > _The problem is that the world went wireless, so maximum | link speeds grew a lot but minimum link speeds are still | relatively low._ | | I would argue: the problem is that the MTU isn't negotiated | at all, but especially not based on link availability. | snuxoll wrote: | IPv6 tries to solve this with path MTU discovery. | inetknght wrote: | Yes, but IPv6 is still at a higher level than Ethernet, | Wifi, et al and is therefore subject to the limitations | of the lower level framing | jandrese wrote: | Sure, I mean that's what pMTUd is all about. One big | difference with IPv6: Routers can't fragment packets. | They either send or they don't. | snuxoll wrote: | Sure? | | At this point 1500 is the standard, we can't ever hope to | increase it without a way to negotiate the acceptable | value across the entire transmission path - that's what | IPv6 gives us. | inetknght wrote: | I'm not sure that negotiating the acceptable value across | the entire transmission path is a reasonable thing to do. | I'm not sure that IPv6 _should_ be aware of a | minimum/maximum MTU of the underlying transmission path, | particularly since that path can often change | transparently and each segment is subject to different | requirements. | [deleted] | fulafel wrote: | Is there a way to set a bigger MTU with wireless, like there is | with wired ethernet? | fireattack wrote: | Probably a dumb question: why is the maximum size (and the bucket | with the most packets) in the AMS-IX graph 1514 bytes instead of the 1500 | bytes that got discussed in the article? | ra1n85 wrote: | 1500 bytes is the MTU of IP, in most cases. It often excludes | the Ethernet header, which is 14 bytes excluding the FCS, | preamble, IFG, and any VLANs. | | If we have a 1500 byte MTU for IP, then we need at least a 1514 | byte MTU for IP + Ethernet. We often call the > 1514B MTU the | "interface MTU". It's unnecessarily confusing.
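ra1n85's zoo of MTU numbers reduces to a few constants (untagged Ethernet II assumed; a VLAN tag would add 4 bytes to each total):

```python
IP_MTU   = 1500  # what the OS calls the interface MTU
ETH_HDR  = 14    # dst MAC (6) + src MAC (6) + EtherType (2)
FCS      = 4     # frame check sequence (CRC-32)
PREAMBLE = 8     # preamble + start-of-frame delimiter
IFG      = 12    # minimum inter-frame gap

print(IP_MTU + ETH_HDR)                         # 1514: what packet captures (and AMS-IX) count
print(IP_MTU + ETH_HDR + FCS)                   # 1518: the maximum untagged frame on the wire
print(IP_MTU + ETH_HDR + FCS + PREAMBLE + IFG)  # 1538: the full per-packet cost in wire time
```

So the AMS-IX graph's 1514-byte bucket is the same packets the article calls 1500-byte packets, measured one layer down.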
| Animats wrote: | The original MTU was 576 bytes, enough for 512 bytes of payload | plus 64 bytes for the IP and TCP header with a few options. 1500 | bytes is a Berkeleyism, because their TCP was originally | Ethernet-only. | wmf wrote: | Yeah, didn't T1 and ISDN use 576 to limit serialization delay | and jitter? The backbone probably switched to 1500 when OC-3 | was adopted. | tssva wrote: | The default MTU for a T1/E1 was usually 1500. The default for | HSSI was 4470 which meant the default for DS3 circuits was | 4470. This was also the usual default MTU for IP over ATM | which is what most OC-3 circuits would have been using when | they were initially rolled out for backbone use. This | remained the usual default MTU all the way through OC-192 | circuits running packet over SONET. | | I left the ISP backbone and large enterprise WAN field around | that time and can't speak to more recent technologies. | jleahy wrote: | As others have said, with Manchester encoding 10BASE2 is self- | clocking; you can use the data to keep your PLL locked, just as | you would on modern ethernet standards. However I imagine with | these standards you may not even have needed an expensive/power- | hungry PLL; probably you could just multi-sample at a higher | clock rate like a UART did (I don't actually know how this | silicon was designed in practice). | | Further, PLLs have not gotten better, but a lot worse. Maybe | back when 10BASE2 was introduced you could train a PLL on 16 | transitions and then have acquired lock, but there's no way you | can do that anymore (at modern data rates). PCI express takes | thousands of transitions to exit L0s->L0, which is all to allow | for PLL lock. | | My best guess for the 1500 number is that with a 200ppm clock | difference between the sender and receiver (the maximum allowed | by the spec, which says your clock must be +-100ppm) then after | 1500 bytes you have slipped 0.3 bytes.
You don't want to slip | more than half a byte during a packet as it may result in a | duplicated or skipped byte in your system clock domain. | (200 x 1e-6) x 1500 = 0.3. | Unklejoe wrote: | I thought most Ethernet PHYs don't actually lock to the clock, | but instead use a FIFO that starts draining once it's half way | full. The size of this FIFO is such that it doesn't under or | overflow given the largest frame size and worst case 200 PPM | difference. | | I figured this is what the interframe gap is for - to allow the | FIFO to completely drain. | saber6 wrote: | The IFG is really more to let the receiver know where one stream | of bits stops and the next stream of bits starts. How they | handle the incoming spray of data is up to them on a | queue/implementation level. | zamadatix wrote: | I've always wondered how 9000 became "jumbo". Technically | anything over 1500 is considered jumbo and there is no standard. | The largest I've seen is 16k. I think there are some CRC accuracy | concerns at larger sizes but 9k still seems quite arbitrary for | computer land. | ajross wrote: | Ethernet frame size was never strictly limited. The way the | packet length works with Ethernet II frames (802.3 is more | explicit, but never really caught on) is that the hardware | needs to read all the way to the end of the packet and detect a | valid CRC and a gap at the end before it knows the thing is | done. So there's no reason beyond buffer size to put a fixed | limit on it, and different hardware had different SRAM | configurations. | | Wikipedia has this link showing that 9000 bytes was picked by | one site c. 2003 simply because it was generally well-supported | by their existing hardware: | https://noc.net.internet2.edu/i2network/jumbo-frames/rrsum-a... | cesarb wrote: | The explanation according to | https://web.archive.org/web/20010221204734/http://sd.wareone... | is: "First because ethernet uses a 32 bit CRC that loses its | effectiveness above about 12000 bytes.
And secondly, 9000 was | large enough to carry an 8 KB application datagram (e.g. NFS) | plus packet header overhead." | | That is, 9000 is the first multiple of 1500 which can carry an | 8192-byte NFS packet (plus headers), while still being small | enough that the Ethernet CRC has a good probability to detect | errors. | gargs wrote: | This reminds me of various Windows applications back in the day | (Windows 3.1 and 95) that claimed to fine tune your connection, | and one of the tricks they used was changing the MTU setting, as | far as I can recall. Could anyone share how that worked? | ndespres wrote: | If your computer sends a larger MTU than the next device | upstream can handle, the packets will be fragmented, leading to | increased CPU usage, increased work by the driver, higher I/O | on the network interface, higher CPU load on your router or | modem, etc., depending on where the bottleneck is. For example if | you connect over Ethernet to a DSL modem, or to a router that | has a DSL uplink, all your packets will be fragmented. This is | because DSL uses 8 bytes per packet for PPPoE framing. | So if you send a 1500 byte packet to the modem, it will get | broken up by the modem into 2 packets: one is 1492+8 bytes, and | the other is 8+8 bytes. | | But your PC is still sending more packets.. the modem is | struggling to fragment them all and send them upstream.. its | memory buffer is filling up.. your computer is retrying packets | that it never got a response on.. | | By lowering your computer's MTU to 1492 to start with, you avoid | the extra work by the modem, which can yield a considerable speed | increase. | 2rsf wrote: | I remembered something different, related to the shared medium and | CSMA/CD, where 1500 ensured fairness, and the minimum of 46 | related to propagation time over the longest allowable cable | | More at: | | https://networkengineering.stackexchange.com/questions/2962/...
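cesarb's two constraints pin 9000 down fairly well, as a quick check shows (the 20 + 20 + 8 header budget below is illustrative, not taken from the quoted page):

```python
NFS_DATAGRAM = 8192   # an 8 KB application read/write
HEADERS = 20 + 20 + 8 # e.g. IPv4 + TCP plus a little encapsulation slack

print(9000 >= NFS_DATAGRAM + HEADERS)  # True: fits 8 KB of data plus headers
print(9000 - NFS_DATAGRAM - HEADERS)   # bytes of headroom left over
print(9000 % 1500 == 0)                # True: exactly six standard frames
print(9000 < 12000)                    # True: below where CRC-32 weakens
```

Any multiple of 1500 from 9000 up to 10500 would satisfy the first three checks; 9000 is simply the smallest, which keeps buffer requirements down.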
| CGamesPlay wrote: | I think you're saying that the smallest bucket of packets are all | packets that would have been combined with a larger packet if | that had been an option... but that doesn't make sense. That | class of packets includes TCP SYN, ACK, RST, and 128 bytes could | fit an entire instant message on many protocols. | MrLeap wrote: | For.. reasons, I found myself having to make a 'driver' for a | PoE+ sensing device this month. The manufacturer had an SDK, but | compiling it requires an old version of Visual Studio, a bouquet | of dependencies, and it had no OSX support. None of the bundled | applications would do what I needed (namely, let me forward the | raw sensing data to another application.. _SOMEHOW_ ). | | The data isn't encoded in the usual ways, so even 4 hours of | begging FFMPEG were to no avail. | | A few glances at wireshark payloads, the roughly translated | documentation, and weighing my options, I embarked on a harrowing | journey to synthesize the correct incantation of bytes to get the | device to give me what I needed. | | I've never worked with RTP/RTSP prior to this -- and I was | disheartened to see nodejs didn't have any nice libraries for | them. Oh well, it's just udp when it comes down to it, right? | | SO MY NAIVETE BEGOT A JOURNEY INTO THE DARKNESS. Being a bit of | an unknown-unknown, this project did _not_ budget time for the | effort this relatively impromptu initiative required. An element | of sentimentality for the customer, and perhaps delusions of | grandeur, I convinced myself I could just crunch it out in a few | days. | | A blur of coffee and 7 days straight crunch later, I built a | daisy chain of crazy that achieved the goal I set out for. I read | rfc3550 so many times I nearly have it committed to memory. The | final task was to figure out how to forward the stream I had | ensorcelled to another application.
UDP seemed like the "right" | choice, if I could preserve the heavy lifting I had accomplished | to reassemble the frames of data.. MTU sizes are not big enough | to accommodate this (hence probably why the device uses RTP, | LOL.). OSX supports some hilariously massive MTU's (It's been a | few days, but I want to say something like 13,000 bytes?) Still, | I'd have to chunk and reassemble each frame into quarters. Having | to write _additional_ client logic to handle drops and OOO and | relying on OSX's embiggened MTU's when I wanted this to be | relatively OS independent... and the SHIP OR DIE pressure from | above made me do bad. At this point, I was so crunched out that | the idea of writing reconnect logic and doing it with TCP was | painful so I'm here to confess... I did bad... | | The client application spawns a webserver, and the clients poll | via HTTP at about 30Hz. Ahhh it's gross... | | I'm basically adrift on a misery raft of my own manufacture. | Maybe protobufs would be better? I've slept enough nights to take | a melon baller to the bad parts.. | sneak wrote: | What does it sense that changes >=30 times a second? | dahfizz wrote: | You only use the Real Time Protocol (RTP) when you need time | sensitive data streaming (typically audio or video) | ses1984 wrote: | I'm guessing video frames given ffmpeg was part of the story. | Craighead wrote: | 60 Hz electricity in American electric systems maybe? Thus | it's polling every other wave? | jsight wrote: | I was curious about that too. Lots of references to video | related standards that imply it's a PoE camera, but then why | isn't the data encoded in the usual ways? What does that | mean? | MrLeap wrote: | What codec would you use for a camera that captures not | RGB, but poetry of the soul? | | CONTEXTLESS, HEADERLESS, ENDLESS BYTE STREAMS OF COURSE, | where the literal, idealized (remember udp) position of | each byte is part of a vector in a non-euclidean coordinate | system.
| cfallin wrote: | > What codec would you use for a camera that captures not | RGB, but poetry of the soul? | | I would love to read a collaborative work between you and | James Mickens -- this genre of writing seems sadly under- | present in the computing world... | jtbayly wrote: | This needs to be its own post. Lol. | hinkley wrote: | https://en.m.wikipedia.org/wiki/Jumbo_frame | | The wiki page talks about getting 5% more data through at full | saturation but it doesn't mention an important detail that I | recall from when it was proposed. | | It turned out with gigabit Ethernet or higher that a single TCP | connection cannot saturate the channel with an MTU of 1500 | bytes. The bandwidth went up but the latency did not go down, | and ACKs don't arrive fast enough to keep the sender from | getting throttled by the TCP windowing algorithm. | | If I have a typical network with a bunch of machines on it | nattering at each other, that might not sound so bad. But when | I really just need to get one big file or stream from one | machine to another, it becomes a problem. | | So they settled on a multiple of 1500 bytes to avoid uneven | packet fragmentation (if you get half packets every nth packet | you lose that much throughput). Somehow that multiple became 6. | | And then other people wanted bigger or smaller and I'm not | quite sure how OS X ended up with 13000. You're gonna get | 8x1500 + 1000 there. Or worse, 9000 + 4000. | hinkley wrote: | In college I only had one group project, which scandalized me | but apparently lots of others found this normal. We had to fire | UDP packets over the network and feed them to an MJPeG card. | You got more points based on the quality of the video stream. | | My very industrious teammate did 75% of the work (4 man team, I | did 20%, if you are generous with the value of debugging). One | of the things we/he tried was to just drop packets that arrived | out of order rather than reorder them. 
Turned out the | reordering logic was reducing framerates. So he ran some trials | and looked at OOO traffic, and across the three or so routers | between source and sink he never observed a single packet | arriving out of order. So we just dropped them instead and got | ourselves a few more frames per second. | pantalaimon wrote: | Tbh that's what most real time video/audio applications will | do. Reordering adds latency and that is worse than the | occasional dropped frame. | MrLeap wrote: | I can drop a frame, I can't casually drop misordered | packets. It takes many packets to build a frame. I have to | reorder interframe packets (actually I just insert-in- | order). If I drop packets, I get data scrolling like a | busted CRT raster. | | I'm using a KoalaBarrel. Koalas receive envelopes full of | eucalyptus leaves. Koalas have to eat their envelopes in | order. First koala to get his full subscription becomes fat | enough to crush all the koalas beneath him. Keep adding | koalas. Disregard letters addressed to dead koalas. | anticensor wrote: | > embiggened | | For non-native speakers: embiggened means huge, enlarged, | overgrown. | | _I am not a native speaker of English either_ | squiggleblaz wrote: | *For non-Simpsons watchers | | The word was created as a joke in a Simpsons episode, a word | used in Springfield only. It is described as "perfectly | cromulent" by a Springfielder, which is evidently meant to | mean "acceptable" or "ordinary" but is another | Springfieldism. | | The joke may be lost on future generations who don't realise | they're not normal words. | skykooler wrote: | Actually, "embiggened" is an actual word, though archaic; it's | been around for over 130 years. The coinage of | "cromulent" to describe it as such was the joke there, not | "embiggen" itself. | | Source: https://en.wiktionary.org/wiki/embiggen | kahirsch wrote: | It was used _once_ in 1884 and the writer there | specifically said he invented it.
There are no other | recorded uses of the word before The Simpsons. | kalleboo wrote: | The show writers thought they came up with the word on | their own; they didn't know about the previous usage of | the word in 1884 (the episode was written in 1996, and the | internet wasn't quite as full of facts back then), so | "embiggen" was still supposed to be a joke. | MrLeap wrote: | To be fair to everyone, I've had native English speakers tell | me what I speak is barely English. | IshKebab wrote: | I don't think the Ethernet Frame Overhead graph is correct. | Surely the overhead is proportionally higher, per amount of data, | for smaller packets. That graph shows that the overhead is just | proportional to the amount of data sent, irrespective of the | packet size, which can't be right. | tartoran wrote: | I find that technology cements in strata (the archaeology term) | just as the layers that accumulate as the result of natural | processes and human activity. The dynamics are not exactly the | same but the tendency is similar. I wonder whether we'll always | be capable of digging down deeper to the beginnings as things get | more and more complicated. | smoyer wrote: | The article talks about how the 1500-byte MTU came about but | doesn't mention that the problem of clock recovery was solved by | using 4b/5b or 8b/10b encoding when sending Ethernet through | twisted-pair wiring. This encoding technique also provides a | neutral voltage bias. | | EDIT: As pointed out below, I failed to account for the clock | rate being 25% faster than the bit rate in my original assertion | that Ethernet over twisted-pair was only 80% efficient due to the | encoding (see below) | mchristen wrote: | There has to be something else going on here because I | routinely achieve > 800 Mbps on my gigabit network over copper. | jws wrote: | Yes, the wire symbol rates are higher. For instance, 100 Mbit | Ethernet has a 125 million symbols per second wire rate.
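[Editor's note] IshKebab's intuition checks out with quick arithmetic. Ethernet's per-frame cost on the wire is fixed: a 14-byte header, a 4-byte FCS, 8 bytes of preamble/SFD, and a 12-byte interframe gap, 38 bytes in all, regardless of payload size. So small packets pay proportionally more. A quick Python sanity check:

```python
# Fixed per-frame wire cost for Ethernet:
# header (14) + FCS (4) + preamble/SFD (8) + interframe gap (12)
PER_FRAME_OVERHEAD = 14 + 4 + 8 + 12  # = 38 bytes

def efficiency(payload_bytes: int) -> float:
    """Fraction of wire capacity carrying payload at a given payload size."""
    return payload_bytes / (payload_bytes + PER_FRAME_OVERHEAD)

for payload in (46, 500, 1500, 9000):
    print(f"{payload:5d}-byte payload: {efficiency(payload):6.1%}")
# 46-byte payloads are ~55% efficient; 1500-byte ~97.5%; 9000-byte ~99.6%
```

At 1500 bytes the wire is already about 97.5% efficient, which is why jumbo frames only buy a few percent, while minimum-size payloads burn nearly half the wire on overhead.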
| adrianmonk wrote: | I'm not a hardware engineer, but from some quick research it | appears that 100 megabit ethernet ("fast ethernet") transmits | at effectively 125 MHz. So the 100 megabit number describes | the usable bit rate, not the electrical pulses on the wire. | | Gigabit Ethernet is more complicated, and it uses multiple | voltages and all four pairs of wires bidirectionally. So it | is not just a single serial stream of on/off. | Unklejoe wrote: | > Ethernet through twisted-pair wiring only provides 80% of the | listed bit-rate | | Actually, they already accounted for this in the advertised | speed. | | In other words, a 1 GbE SerDes runs at 1.250 Gbit/s, so you end | up with an actual 1 Gbit/s bandwidth. | | The reason you don't actually hit 1 Gbit/s in practice is due | to other overheads such as the interframe gaps, preambles, FCS, | etc. | blattimwind wrote: | > The reason you don't actually hit 1 Gbit/s in practice is | due to other overheads such as the interframe gaps, | preambles, FCS, etc. | | Actually Gigabit Ethernet is highly efficient; it can | give you 98-99% of line rate as the payload rate. | smoyer wrote: | You're absolutely correct ... it's been a long time since I | was designing fiber transceivers but I should have remembered | this. Ultimately efficiency is also affected by other layers | of the protocol stack too (UDP versus TCP headers) which also | explains why larger frames can be more efficient. In the | early days of RTP and RTSP, there were many discussions about | frame size, how it affected contention and prioritization and | whether it actually helped to have super-frames if the | intermediate networks were splitting and combining the frames | anyway. | anonymousiam wrote: | A minor factoid the article does not mention: ATM is an alternative | to Ethernet that's used in many optical fiber environments. The | "transfer unit" size of the ATM "cell" is 53 bytes (5 for the | header and 48 for the payload).
This is much smaller than 1500. | | Another quirky story from the past: Sometime around 20 years ago | I was having a bizarre networking problem. I could telnet into a | host with no trouble, and the interactive session would be going | just fine until I did something that produced a large volume of | output (such as 'cat' on a large file). At that point the session | would freeze and I would eventually get disconnected. After | troubleshooting for a while I identified the problem as one of | the Ethernet NICs on the client host. It was a premium NIC (3Com | 3C509). Nonetheless, the NIC crystal oscillator frequency had | drifted sufficiently that it would lose clock synchronization to | the incoming frame if the MTU was larger than about 1000. | saber6 wrote: | ATM is mostly dead. The only places it exists now are legacy | deployments. Everyone has been deploying MPLS/IP instead of ATM | for the past 15-20 years. | gerdesj wrote: | The picture in the article looks like a 3c509 - three media | types and 100Mbs-1. Cor! I have loads of them somewhere. Plus a | fair few 905 and 595s. | gugagore wrote: | It would be nice to corroborate this reason with another source, | because my understanding is that clock synchronization was not a | factor in determining the MTU, which seems really more like an OSI | layer 2/3 consideration. | | I am surprised the PLLs could not maintain the correct clocking | signal, since the signal encodings for early ethernet were | "self-clocking" [1,2,3] (so even if you transmitted all 0s or all 1s, | you'd still see plenty of transitions on the wire). | | Note that this is different from, for example, the color burst at | the beginning of each line in color analog TV transmission [4]. | It is also used to "train" a PLL, which is used to demodulate the | color signal transmission. After the color burst is over, the PLL | has nothing to synchronize to. But the 10base2/5/etc have a | carrier throughout the entire transmission.
| | [1] | https://en.wikipedia.org/wiki/Ethernet_physical_layer#Early_... | | [2] https://en.wikipedia.org/wiki/10BASE2#Signal_encoding | | [3] http://www.aholme.co.uk/Ethernet/EthernetRx.htm | | [4] https://en.wikipedia.org/wiki/Colorburst | stripline wrote: | I also don't believe this is the reason. Early Ethernet | physical standards used Manchester encoding to recover the data | clock. | peteri wrote: | I would agree; when I worked on an Ethernet chipset back in | 1988/9, keeping the PLL synched was not a problem. I can't | remember what the maximum packet size we supported was (my | guess is 2048) but that was more of a buffering to SRAM and | needing more space for counters. | | The datasheet for the NS8391 has no such requirement for PLL | sync. | | https://archive.org/details/bitsavers_nationaldaDataCommunic. | .. | trixie_ wrote: | Kind of expected an article titled 'How 1500 bytes became the MTU | of the internet' to tell us how 1500 bytes became the MTU of the | internet. | | Even I could have told you, 'the engineers at the time picked 1500 | bytes'. | leroman wrote: | Looks like a ripe low-hanging fruit for SpaceX Starlink to pick.. | leroman wrote: | Why the downvote? Possibly facilitating the end-to-end | transport will allow them to offer jumbo packets | ekimekim wrote: | This would only be possible if you were talking from a jumbo- | configured client (let's say you've set up your laptop | correctly), across a jumbo-configured network (Starlink, in | your scenario), to a jumbo-configured server (here's the | problem). | | The problem is that Starlink only controls the steps from | your router to "the internet". If you're trying to talk to | spacex.com it'd be possible, but if you're trying to talk to | google.com then now you need Starlink to be peering with ISPs | that have jumbo frames, and they need to peer with ISPs with | jumbo frames, etc etc and then also google's servers need to | support jumbo frames.
| | Basically, the problem is that Starlink is not actually end | to end, if you're trying to reach arbitrary servers on the | internet. It just connects you to the rest of the internet, | and you're back to where you started. | | This is also true for any other ISP; Starlink is not special | in this regard. | Avamander wrote: | True, you'd expect endpoints to support Jumbo Frames as | well, but why not start at least making it possible. It's a | chicken-and-egg problem otherwise. IPv6 was the same at start. | saber6 wrote: | Because you don't know what you're talking about and are | engaging in "what if"-isms? There is no business case to | solve with jumbo frames over the Internet. I've been in this | business for 20 years. Seen this argument a dozen times. It | never changes. | dooglius wrote: | Most connections will not be peer-to-peer over Starlink, so | you need to deal with the least common denominator. | hylaride wrote: | Well, depending on the quality of the connections, a | re-transmit of a jumbo frame could mean having to re-transmit a | lot more data. | | But since the local network and the end network where the | servers are located will almost certainly be 1500, the point | is all but moot. | bjornsing wrote: | Actually, MTUs below 1500 bytes are pretty common, e.g. with PPP | over Ethernet or other forms of encapsulation/tunneling. | russfink wrote: | IIRC it was called "thinnet" (10B2). I loved the vampire taps on | thick net. | alexforencich wrote: | I think the author may have made a mistake in some of the math. | The frame size distribution plots are likely based on the number | of frames, not the amount of data contained in said frames. The | 1500 byte and other large frames should therefore account for the | lion's share of the actual data transferred. Correcting this | error will totally change the final two graphs. | labawi wrote: | Yes. But only the "AMS-IX traffic by packet size range" graph | is wildly inaccurate.
Ethernet frame overhead is per-packet and | presumably right. | alexforencich wrote: | Ah yeah, that's probably true. According to some back of the | envelope math, it seems like the distribution should be more | like 5%, 1%, 1%, 3%, 50%, 39%, ignoring the first and last | size bins. | afandian wrote: | Off-topic but looking at that old network card picture reminded | me of a very vague memory of more than one card with a component | that looked like a capacitor, except it looked cracked. | | Is my mind playing tricks? Were they faulty units or was there | meant to be a crack? | | This picture could be the same thing: | | https://www.vogonswiki.com/images/3/37/Viglen_Ethergen_PnP_2... | gerdesj wrote: | Old network card eh? It's a 3Com 3C509: | | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... | | I still have a load of them gathering dust somewhere. However a | system with ISA on it is a bit rare now and I'm not sure I can | be bothered to compile a modern kernel small enough to boot on | one. Besides, it will probably need cross-compiling on | something with some grunt that has heard of the G unit prefix. | mertenVan wrote: | Software developer talking confidently about electrical | engineering issues he knows nothing about. How cute. /s | | All Ethernet adapters since the first Alto card had self-clocking | data recovery [1]. | | Clock accuracy was never a problem, as long as it was within the | acceptable range required for PLL lock/track loop. | | The reason for 1500 MTU is that for packet-based systems, you | don't want infinitely large packets. You want _small_ packets, | but large enough so that packet overhead is insignificant, which | in engineering terms means less than 2%-5% overhead. Thus 1500 | max packet size. Everything above that just makes switching and | buffering needlessly expensive; SRAM was hella expensive back | then. Still is today (in terms of silicon area).
| | Look at all the memory chips on Xerox Alto's Ethernet board | (below) - memory chips were already taking ~50% of the board | area! | | [1] Schematic of the original Alto Ethernet card clock recovery | circuit: https://www.righto.com/2017/11/fixing-ethernet-board- | from-vi... | | EDIT: Lol! Author has completely replaced erroneous explanation | with correct explanation, including link to seminal paper about | packet switching. Good. | mlyle wrote: | > How cute. /s | | > EDIT: Lol! Author has completely replaced erroneous | explanation with correct explanation, including link to seminal | paper about packet switching. Good. | | Don't be a jerk. Being right doesn't give you the right to make | fun of people. | mertenVan wrote: | Noted. I find the over-confidence over a completely imagined | issue funny and interesting. I make that mistake too. It's | always interesting to do a post-mortem: why was I so | confident? how did I miss the correct answer? I respect the | author for doing such a fast turnaround :) | hackmiester wrote: | Good to know that your intent wasn't malicious, but fwiw, I | also didn't find the tone particularly appropriate for HN, | either. | contingencies wrote: | The hardware world is full of this. | generatorguy wrote: | I think because in the hardware world the cost for being | wrong is so much higher since you can't push out an over- | the-air update or update your saas or whatever. So if you | don't know you stay quiet instead of being wrong. | kingosticks wrote: | > Everything above that just makes switching and buffering | needlessly expensive, SRAM was hella expensive back then. Still | is today (in terms of silicon area). | | Why does a larger MTU make switching more expensive? | | And why does it affect buffering? Won't the internal buses and | data buffers of networking chips be disconnected from the MTU? | Surely they'll be buffering in much smaller chunks, maybe | dictated by their SRAM/DRAM technology.
Otherwise, when you | consider the vast amount of 64B packets, buffering with 1500B | granularity would be extremely expensive. | mertenVan wrote: | I suggest you read the paper linked by the blog author | ("Ethernet: Distributed Packet Switching for Local Computer | Networks"), specifically Section 6 (performance and | efficiency 6.3). It will answer all your questions. | | > Why does a larger MTU make switching more expensive? | | Switching requires storage of the entire packet in SRAM. | | Larger MTU = More SRAM chips | | If existing MTU is already 95% network efficient (see paper), | then larger MTU is simply wasted money. | mprovost wrote: | Traditionally it's been true that you need SRAM for the | entire packet, which also increases latency since you have | to wait for the entire packet to arrive down the wire | before retransmitting it. But modern switches are often | cut-through to reduce latency and start transmitting as | soon as they see enough of the headers to make a decision | about where to send it. This also means that they can't | checksum the entire packet, which was another nice feature | with having it all in memory. So if it detects corruption | towards the end of the incoming packet it's too late since | the start has already been sent - most switches will | typically stamp over the remaining contents and send | garbage so it fails a CRC check on the receiver. | | Which raises another point in relation to the 1500 MTU - | all of the CRC checks in various protocols were designed | around that number. Even the checksum in the TCP header | stops being effective with larger frames, so you end up | having to do checksums at the application level if you care | about end to end data integrity. | | https://tools.ietf.org/html/draft-ietf-tcpm-anumita-tcp- | stro... | mertenVan wrote: | You're describing cut-through switching [1]. 
Because of | its disadvantage, it is usually limited to uses that | require pure performance, such as HFT (High Frequency | Trading). Traditional store-and-forward switching is | still commonly used (or some hybrid approach). | | "The advantage of this technique is speed; the | disadvantage is that even frames with integrity problems | are forwarded. Because of this disadvantage, cut-through | switches were limited to specific positions within the | network that required pure performance, and typically | they were not tasked with performing extended | functionality (core)." [2] | | [1] https://en.wikipedia.org/wiki/Cut-through_switching | | [2] http://www.pearsonitcertification.com/articles/articl | e.aspx?... | mlyle wrote: | > Which raises another point in relation to the 1500 MTU | - all of the CRC checks in various protocols were | designed around that number. | | Hmm. Why is this? It seems if we have a CRC-32 in | Ethernet (and most other layer 2 protocols), we'll have a | guarantee to reject certain types of defects entirely... | But mostly we're relying on the fact that we'll have a 1 | in 4B chance of accepting each bad frame. Having a bigger | MTU means fewer frames to pass the same data, so it would | seem to me we have a lower chance of accepting a bad | frame per amount of end-user data passed. | | TCP itself has a weak checksum at any length. The real | risk is of hosts corrupting the frame between the actual | CRCs in the link layer protocols. E.g. you receive frame, | NIC sees it is good in its memory, then when DMA'd to bad | host memory it is corrupted. TCP's sum is not great | protection against this at any frame length. | mprovost wrote: | The risk is that multiple bits in the same packet are | flipped, which the CRC can't detect. If the bit error | rate of the medium is constant, then the larger the | frame, the more likely that is to occur. 
Also as Ethernet | speeds increase, the underlying BER stays the same (or | gets worse), so the chances of encountering errors in a | specific time period go up. 100G Ethernet transmits a | scary amount of bits so something that would have been | rare in 10Base-T might happen every few minutes. | thehappypm wrote: | When networks were new, computers connected to each other using a | shared trunk that you _physically_ drilled into. It's a non- | trivial problem to send data over a shared channel; it's very | easy for two systems to clobber each other. A primitive, but | somewhat effective mechanism is ALOHA | (https://en.wikipedia.org/wiki/ALOHAnet), where multiple senders | randomly try to send their message to a single receiver. The | single receiver then repeats back any messages it successfully | receives. In that way the sender is able to confirm its message | got through -- an ack. After a certain amount of time with no | ack, senders repeat their messages. As you can imagine, shorter | packets are less likely to cause collisions. | | Ethernet uses something similar, but is able to detect if someone | else is using the wire, called carrier sense. A relatively short | maximum packet size of 1500 bytes reduced the likelihood of | collisions. | blitmap wrote: | Does multiplexing over Ethernet exist? | 5436436347 wrote: | Not anymore for all practical purposes, but it once did for | the very old 10Base-2 standard for Ethernet over coaxial | cable. This is practically why the old MII Ethernet PHY | interface protocol had the collision-sense lines to indicate | to the MAC to stop sending data if it detects incoming data, | in attempts to minimize collisions. | | https://en.wikipedia.org/wiki/10BASE2 | blitmap wrote: | This is very cool history, and something I never would have | stumbled upon myself. Thank you for sharing!
:-) | throw0101a wrote: | Unreliable IP fragmentation, and the brokenness of Path MTU | Discovery (PMTUD), is causing the DNS folks to put a clamp on the | size of (E)DNS message size: | | * https://dnsflagday.net/2020/ ___________________________________________________________________ (page generated 2020-02-19 23:00 UTC)