[HN Gopher] What developers should know about TCP
       ___________________________________________________________________
        
       What developers should know about TCP
        
       Author : todsacerdoti
       Score  : 218 points
        Date   : 2020-05-14 10:03 UTC (1 day ago)
        
 (HTM) web link (robertovitillo.com)
 (TXT) w3m dump (robertovitillo.com)
        
       | citrin_ru wrote:
        | A very important point every developer should know: a successful
        | write(2) syscall doesn't guarantee that the data was received by
        | the remote application. TCP is often described as a protocol that
        | guarantees packet delivery, and this is misleading.
       | 
        | write(2) returning without an error means only that the data has
        | been placed in an OS kernel buffer. The kernel will then try to
        | send it to the remote host. If a couple of packets are lost,
        | that's not a problem - the kernel will retry a few times. But if
        | power is lost shortly after a write, the data may never hit the
        | wire. The network link may also be broken for a long time: the OS
        | will retry, but only for a limited time, and then give up. And
        | the remote host can crash at any time before the remote
        | application actually reads the data.
       | 
        | So if you need reliable delivery, you need acknowledgements at
        | the application protocol level, despite the fact that TCP already
        | has acknowledgements.
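
A minimal sketch of what such an application-level acknowledgement looks like, in Python over a local socketpair standing in for a real connection (reliable_send and the one-shot ACK protocol are hypothetical, for illustration only):

```python
import socket
import threading

def reliable_send(sock, payload: bytes, timeout=5.0):
    """send() succeeding only means the data reached a kernel buffer;
    wait for an application-level ACK to learn the peer processed it."""
    sock.sendall(payload)
    sock.settimeout(timeout)
    if sock.recv(3) != b"ACK":
        raise ConnectionError("no application-level acknowledgement")

# Demo: the receiver acknowledges only after it has actually read the data.
a, b = socket.socketpair()

def receiver():
    b.recv(1024)       # the remote application really read the data...
    b.sendall(b"ACK")  # ...and says so explicitly

t = threading.Thread(target=receiver)
t.start()
reliable_send(a, b"important record")
t.join()
acknowledged = True
print("delivered and acknowledged")
```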
        
       | commandlinefan wrote:
       | When I first started working with computer networks, I just
       | thought of TCP/IP as "low-level stuff" and I focused instead on
       | the higher level stuff. After I kept running into
       | incomprehensible errors seemingly over and over again, I finally
        | broke down and picked up a copy of Richard Stevens' "TCP/IP
        | Illustrated". Hands down, the best investment of time I've ever
       | made. If you deal with distributed systems (hint, you do), you
       | _need_ to understand how they actually work.
        
         | c0nsumer wrote:
          | A lot of the benefit I add in my day job is bridging the gap
          | between high-level folks (OS people) and what's-actually-
          | happening-on-the-wire.
         | 
         | While so, so, so much of this is rarely the network, knowing
         | how to look under the covers and see what's actually hitting
         | the wire (versus what the API call asked for) leads to far, far
         | faster resolution of problems.
         | 
         | It's frustrating to me that so many people see this as a
         | mystery of "knowing networking" when it's really just basic
         | protocol analysis.
        
         | non-entity wrote:
          | Is this still a recommended book? I was looking for a good
          | TCP/IP reference book, but many seemed rather old. Of course, I
          | imagine protocols like that don't get modified too much.
        
           | travmatt wrote:
            | I just finished Kurose's "Computer Networking: A Top-Down
            | Approach" and I'd recommend it.
        
           | commandlinefan wrote:
           | Well, it's definitely out of date: the first edition predates
           | even IPv6 (and the second edition is awful, don't buy it).
           | Still, the way it's laid out is so well done that once you
           | understand how TCP/IP worked in the mid-90's, you'll easily
           | be able to work out the evolution of it since on your own.
           | It's a shame there's no better up-to-date book, but Stevens
           | was one-of-a-kind. The Comer book isn't bad (but it's not
           | really good, either), and the Kurose & Ross book is less not
           | bad (and more not good), but even though both are more
           | modern, I'd still recommend TCP/IP Illustrated to really
           | understand what's going on in the network stack.
        
         | rb808 wrote:
          | Great, I thought, I'll take a look. Three volumes, each over
          | 1000 pages? Any other suggestions? Did you mean all 3 books?
        
           | commandlinefan wrote:
           | Hehe - I did end up reading all 3, and enjoyed them all, but
           | I'd say I got 90% of the value from volume 1. Volume 2 walks
           | through the BSD implementation of TCP/IP, which is
           | fascinating, but way more detail than you'd ever need to
           | know, and volume 3 goes off into some esoteric topics that
           | seemed promising at the time but mostly ended up being
           | abandoned (along with a brief discussion of HTTP as it was
           | around the 90's).
           | 
           | If you're going to read it, though, find a used copy of the
           | original Stevens' first edition, not that terrible desecrated
           | second edition.
        
             | [deleted]
        
             | Bootvis wrote:
             | What is wrong with the second edition?
        
               | commandlinefan wrote:
               | It was rewritten by a different author (the original
               | author, Richard Stevens, died in a car accident in the
               | late 90's). I guess the new guy tried his best, but he
               | just doesn't have the writing skill that Stevens had.
        
           | tenant wrote:
           | Depressing, isn't it? There are so many books that I probably
           | should read about almost any number of topics that, in my
           | work, I "touch on".
        
             | irrational wrote:
             | I have books that I purchased decades ago like this that
             | are still languishing on my bookshelf.
        
             | rb808 wrote:
             | Lol I'll probably buy it and put on my bookshelf with all
             | the others.
        
           | hilem wrote:
           | Not OP but the first volume is the one that's cited
           | frequently and the only one of the series I believe to have a
           | second edition.
        
           | kevstev wrote:
           | Just read the first one- it reads more like a novel than a
           | textbook IMHO, though I may be biased- I have always been
           | fascinated by networks and when I was coming of age this was
           | the "high tech" of the time- I used to read RFCs for fun (I
           | highly recommend this as well if you want to dig a little
           | deeper- Jon Postel's are great reads).
           | 
           | This is one of the best written textbooks, if not the best, I
           | have ever read.
        
         | outworlder wrote:
         | > I focused instead on the higher level stuff.
         | 
         | That's fine. But every developer should have a basic
         | understanding of networking. But that can also be dangerous.
         | 
         | I still have people in the company who swear you can't have
         | more than 65k incoming connections to a machine, because
         | "that's how many ports there are". Don't get me started on all
          | the misconceptions about TCP_TW_REUSE and TCP_TW_RECYCLE. Lengthy
          | discussions because apparently "TIME_WAIT is bad and uses up
          | ports!" (see also, 65k). For context, these are servers, with
         | multiple clients, from different source IPs.
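
To make the misconception concrete: a connection is identified by the full (source IP, source port, destination IP, destination port) tuple, so the server's single listening port is not the limit. A toy sketch with made-up addresses:

```python
# A connection is identified by the 4-tuple, not the server port alone.
server = ("203.0.113.10", 443)        # one listening endpoint (hypothetical IP)
clients = [("198.51.100.%d" % h, p)   # several clients, each using its own ports
           for h in range(1, 4) for p in (50000, 50001)]

# Every (client IP, client port) pair yields a distinct connection,
# even though the destination IP and port are identical for all of them.
conns = {(c_ip, c_port, *server) for (c_ip, c_port) in clients}
print(len(conns))  # 6 distinct connections to the same server port
```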
        
       | crazygringo wrote:
       | > _But what about large files, such as videos? Surely there is a
       | latency penalty for receiving the first byte, but shouldn't it be
       | smooth sailing after that?_
       | 
        | So the article's (unstated) conclusion seems to be that, as long
        | as there isn't network congestion, it _is_ smooth sailing after
        | that.
       | 
        | But congestion reduces bandwidth, and of course that applies just
        | as much to a national backbone as to the last mile.
       | 
       | So I'm curious: where _does_ most packet loss occur? Is it last-
       | mile, at your ISP, or along major backbones? Because that has
       | major implications as to whether caching video content closer to
       | users actually results in higher-quality video (e.g. supporting
       | 1080p instead of 720p) or not.
        
         | boryas wrote:
         | > where does most packet loss occur
         | 
         | Here's an interesting paper from SIGCOMM (it won best paper at
         | the conference in 2018, FWIW) that attempts to figure out what
         | links are congested without direct access to ISP networks:
         | https://www.caida.org/publications/papers/2018/inferring_per...
        
         | z3t4 wrote:
          | I've been debugging packet loss issues lately and they all
          | occurred in the datacenter. Backbones and network exchanges
          | already move so much traffic that things like everyone working
          | remotely only increase traffic by a few percent, and they have
          | _a lot_ of spare capacity in order to handle spikes, or when a
          | new game is released and everyone downloads it at the same
          | time.
         | 
          | So yes, it would really help to have more decentralisation,
          | like putting the content closer to the user.
        
       | jeffbee wrote:
       | Just enough information to be dangerous? Article attributes
       | behaviors of loss-based congestion control schemes like Reno and
       | Cubic to TCP itself. In practice, the congestion control scheme
       | is not really part of the protocol (there is, for example, BBR).
       | There's also ECN, showing that loss is not the only way to
       | discover congestion.
        
         | convolvatron wrote:
          | The RTT discussion was a little misleading. It's true that slow
          | start rates are entirely dependent on RTT... but eventually the
          | sawtooth should reach the same steady state.
          | 
          | There is work showing that higher-RTT connections do
          | statistically suffer a smaller fair share, but that's a
          | subtler, if related, issue. Actually, I really wish the author
          | had shown the sawtooth.
        
           | toast0 wrote:
           | At some point, with increasing bandwidth and increasing RTT,
           | you end up with your effective bandwidth capped by receive
           | windows and/or send buffers. Cross country high def video
           | might not be quite enough to hit that, but intercontinental
           | high def video would be.
           | 
           | Being closer means faster initial 'slow start', but also
           | faster 'slow start' on congestion, which is why you get a
           | bigger share.
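
To make the window math concrete, here is a quick bandwidth-delay product calculation (the 100 Mbit/s path and 150 ms intercontinental RTT are assumed figures):

```python
# Bandwidth-delay product: the receive window / send buffer needed
# to keep a link fully utilized.
bandwidth_bits_per_s = 100e6   # assumed 100 Mbit/s path
rtt_s = 0.150                  # assumed 150 ms intercontinental RTT

bdp_bytes = bandwidth_bits_per_s * rtt_s / 8
print(f"{bdp_bytes / 1e6:.2f} MB window needed to saturate the link")
```

With windows smaller than this (the old default of 64 KB, say), throughput is capped at window / RTT regardless of the link's raw bandwidth.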
        
             | convolvatron wrote:
              | Sure, but that's really just a window being under the
              | bandwidth-delay product. The discussion makes it seem like
              | you suffer an outright linear performance hit.
        
       | [deleted]
        
       | 29athrowaway wrote:
       | The RFC is useful as well.
       | 
       | https://tools.ietf.org/html/rfc793
       | 
       | TCP state machine diagrams can be useful too.
        
       | [deleted]
        
       | freefriedrice wrote:
       | EDIT: Sure wish I could delete this post.
       | 
       | Wait, this isn't TCP, this is protocol level above TCP, right?
       | TCP doesn't shape traffic by itself through rate limiting and
       | congestion analysis, does it? I thought the layer above it used
       | TCP to send/receive the buffer size, and that has nothing to do
       | with TCP.
       | 
       | Am I wrong?
        
         | zwkrt wrote:
         | You are wrong! Obviously the application layer on top of TCP
         | could be the bottleneck, but TCP itself has mechanisms to
         | ensure traffic is flowing as fast and as smoothly as possible.
         | Look up "TCP Flow control" and "TCP Congestion Control"
        
         | scott_s wrote:
         | TCP definitely does congestion control itself:
         | https://en.wikipedia.org/wiki/TCP_congestion_control
        
       | duxup wrote:
       | When I used to do networking tech support for some networking
        | equipment, the guys who sat next to me supported the load
       | balancer product.
       | 
        | I swear a high percentage of their calls were questions about how
        | the load balancer wasn't working and was sending all the traffic
        | to one server, and then, after some investigation, we'd discover
        | all traffic was in fact directed to that lone server... because
        | the client code had the IP of that server hard-coded. A tedious
        | discussion would then ensue about how that is not how to do it.
       | 
       | The next week? Same angry call...
       | 
       | Partly that is what inspired my decision to change careers. "Man
       | if these developers can't figure out basic networking, maybe I
       | could be a developer...?"
        
       | Matthias247 wrote:
       | What they mostly should know: TCP provides a bidirectional stream
       | of bytes on the application level. It does NOT provide a stream
       | of packets.
       | 
        | That means whatever you pass to a send() call is not necessarily
        | the same amount of data the receiver will observe in a single
        | read() call. You might get more or fewer bytes, since the
        | transport layer is free to buffer and to fragment data.
       | 
        | I have seen the assumption that TCP has packet boundaries at the
        | application level made far too often - typically in Stack
        | Overflow questions like: "I don't receive all data. Is my
        | OS/library broken?"
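
A minimal sketch of the read loop this implies, in Python over a local socketpair: recv() may return any non-empty prefix of the bytes in flight, so the caller must accumulate until it has what it needs (recv_exactly is an illustrative helper name, not a library function):

```python
import socket

def recv_exactly(sock, n):
    """Accumulate exactly n bytes, looping because a single recv()
    may return fewer bytes than requested."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # peer closed before sending n bytes
            raise ConnectionError("connection closed mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Demo: even though the sender calls send twice, the receiver sees one
# contiguous byte stream with no boundaries between the two calls.
a, b = socket.socketpair()
a.sendall(b"hello ")
a.sendall(b"world")
msg = recv_exactly(b, 11)
print(msg)  # b'hello world'
```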
        
         | nicolaslem wrote:
         | One way to stop falling into this trap is by knowing what
         | happens behind the send syscall: the application is not sending
         | bytes down the wire, it just fills a buffer in the OS. Once in
         | the buffer there is no boundary between bytes from different
         | send calls. Same thing for receiving, in reverse.
        
         | anilakar wrote:
         | In the fall of 2016 I had a lengthy email exchange with an
         | industrial automation vendor who didn't understand this issue.
         | I even mailed them a short Python proof-of-concept snippet that
         | slept a few milliseconds between the write() calls and in
         | response got back my code "fixed" with the sleep removed.
         | 
         | In between the emails I googled a bit and found the changelogs
         | for the RTOS they were using. Turned out that it was a bug in
         | the upstream HTTP server. This also meant that the platform
         | they were using had all the security holes from those five-plus
         | years. The bug was later silently fixed when they acquired a
         | newer release from upstream.
         | 
         | Currently I'm having a similar issue with the very same vendor.
         | This time they don't understand why client-side authentication
         | means no authentication at all and why passwords must not be
         | stored in plain text in the database that can be remotely
         | backed up from the device.
        
           | irrational wrote:
           | Why don't you tell us the vendor's name? It seems like the
           | responsible thing to do.
        
             | anilakar wrote:
             | Even after the bug gets fixed, it'll probably take years
             | for all the embedded devices in the public internet to get
             | patched, so no.
        
               | laughinghan wrote:
               | But in the meantime, won't the vendor keep adding more
               | broken devices to the public internet, making the problem
               | worse?
               | 
                | The longer it takes for this problem to become public,
                | won't more harm be caused when it finally does?
        
           | outworlder wrote:
           | > This time they don't understand why client-side
           | authentication means no authentication at all
           | 
           | I've seen this... with an intern! I can't imagine dealing
           | with a whole team like that.
        
           | throwaway_pdp09 wrote:
           | How do you not kill these people? How do you put up with it?
           | How do vendors like this survive?
        
             | maartenh wrote:
             | Just like in nature, they survive because they are good
             | enough, and don't experience enough competition to be
             | eliminated by selection.
        
               | the8472 wrote:
               | full disclosure could put some selective pressure on
               | them.
        
               | jolmg wrote:
               | Depending on what kind of vendor we're talking about, it
               | might be that such aspects aren't even part of what makes
               | them competitive. The average user is not going to know
               | about these types of issues, and so they're not even
               | going to consider such issues when evaluating the vendor.
        
             | ink_13 wrote:
             | Just about every industrial automation vendor is like this
             | in my experience. They never upgrade because they don't
             | want to break anything.
        
         | richardwhiuk wrote:
         | If you do want that, then SCTP will provide it.
        
           | [deleted]
        
         | jes5199 wrote:
         | if you turn off Nagle's algorithm, it gets closer to this
         | though
        
           | jfkebwjsbx wrote:
           | No, it has nothing to do with that.
        
         | wahern wrote:
         | A version of Microsoft Exchange had a bug in its SMTP
         | implementation that was tickled when lines crossed packet
         | boundaries. (EDIT: The issue was more likely a bug in
         | Exchange's TLS record processing, breaking when a logical line
         | crossed TLS records.) My async SMTP library used a simple fifo
         | for buffering outbound data which didn't realign the write
         | pointer to 0 except when it was completely drained, so when
         | reading slices (iovec's) from the fifo for write-out it would
         | occasionally call write/send with an incomplete line (i.e. part
         | of a line that wrapped around from the end of the fifo buffer
         | array to the front) even if the application had only written
         | full lines. (At the time it didn't support writev/sendmsg,
         | though I'm not sure it would have helped as the TLS record
         | layer might still have been prone to splitting logical lines
         | across packets.) There was no bug here on my end--everything
         | would be sent correctly--but you can't tell the customer that
         | he can't send e-mail to some third-party because that third-
         | party is using a broken version of Exchange.
         | 
         | The first quick fix was to unconditionally realign the fifo
         | contents after every write (the fifo had a realign method), but
         | that ran into a computational complexity problem when you had
         | lots of small lines (e.g. the application caller dumped a huge
         | message into the buffer and then flushed it out in one go) and
         | a high-latency connection that resulted in many short writes;
         | you were constantly memmove'ing the megabytes of remaining
         | contents in the buffer for every tiny write you did. So then I
         | ended up having to add a new interface to the fifo that
         | returned a slice up to a limit but always ending with a
         | specified delimiter (e.g. "\n") if the delimiter was within the
         | maximum chunk size.
         | 
         | Of course, none of these fixes would have completely remedied
         | the issue as lower layers (the TLS stack, the kernel TCP stack)
         | could have still potentially split logical lines, and I'm sure
         | did on occasion. But it at least seemed to put us on equal
         | footing with everybody else in terms of how often it happened,
         | which is really the best anybody could have done. Complaints
         | did die down.
        
         | outworlder wrote:
         | > What they mostly should know: TCP provides a bidirectional
         | stream of bytes on the application level. It does NOT provide a
         | stream of packets.
         | 
         | > That means whatever you pass to a send() call is not
         | necessarily the same amount of data the receiver will observe
         | in a single read() call.
         | 
         | Yes, this. For god's sake, listen to them.
         | 
         | I had to fight a coworker on this. I had quickly created some
         | client code just to validate that the server was working. Due
         | to some quirk, all the messages were arriving in full in every
         | read call. He told me to ship it.
         | 
            | I said no! "I need to check if there's more data and if so
            | add a loop to read again." "But it is working, release it."
            | That went on for a while, to no avail. He wouldn't look at
            | the documentation either.
         | 
            | Eventually he had to leave for the day, and I took the time
            | to implement it correctly.
         | 
         | I started including basic TCP questions on interviews. Not many
         | people even get past the TCP handshake (if they even know about
         | that).
        
           | scott_s wrote:
           | The problem here was not a lack of knowledge of a particular
           | subject. The problem is that this person was unwilling to
           | learn about a thing they thought they knew.
        
             | draw_down wrote:
             | That's correct.
        
           | Ididntdothis wrote:
           | "But it is working, release it".
           | 
           | Famous last words :-)
        
             | austincheney wrote:
             | Sounds like how most software handles security until it's
             | audited.
        
           | SilasX wrote:
           | Stupid question: why would you be writing code that works at
           | the level of TCP? Don't you usually want to use the OS's (or
           | some popular library's) TCP software stack?
        
             | jfkebwjsbx wrote:
             | It seems to me GP is talking about using TCP, not
             | implementing it.
        
         | austincheney wrote:
         | Your terminology is a little off. TCP does not provide anything
         | for the application layer as it is transport layer. The
         | application layer rides on top of that. Examples of transport
         | protocols are TCP and UDP while application protocols are
         | things like http, ssh, irc, and all those things your
         | applications use.
         | 
          | The network layer on which the transport layer rides is packet
          | switched. TCP uses segments, with each segment having its own
          | header and sequence numbers. A stream is just a series of
          | segments flowing across a single established connection,
          | without a predefined termination segment.
        
           | Matthias247 wrote:
            | I didn't mean to talk about OSI terminology. It was more
            | about: [user-space] applications which use the TCP/IP stack
            | do not observe packet boundaries, whereas the kernel
            | certainly does. Obviously this is a bit ambiguous, and you
            | can even get packet boundaries in user space by running a TCP
            | stack there. But for most TCP/IP usages it holds true.
        
             | austincheney wrote:
             | > It was more about: [user-space] applications which use
             | the TCP/IP stack do not observe packet boundaries
             | 
             | That is still a bit imprecise. Userland applications won't
             | directly see TCP as they are just looking at an application
             | protocol. Typically it's the OS that packages and unpacks
             | the application protocol data into a TCP segment, so of
              | course the userland application won't see it, since it's not
             | managing that part of the communication.
             | 
             | https://en.wikipedia.org/wiki/Transmission_Control_Protocol
             | #...
             | 
              | There are some exceptions where some application platforms
              | allow developers to write custom TCP protocols, such as
              | Node.js, but these exceptions generally apply to network
              | services and don't commonly apply to the end-user
              | application experience.
             | 
             | https://nodejs.org/dist/latest-v14.x/docs/api/net.html#net_
             | n...
        
         | twotwotwo wrote:
         | Yeah. Fun problem for beginners, because 1) your incorrect code
         | may work for a while when reads/writes are small or it's only
         | run on a local network or such, 2) you might design a broken
          | _protocol_ if you don't understand fragmentation, etc., which
         | will tend to be harder than (say) an isolated client bug to
         | fix, 3) the implementation-dependent nature of fragmentation
         | can make it look like you hit a language/library/OS issue, 4)
         | your language/library may or may not offer tools to help a
         | beginner to implement a delimited or framed wire format
         | properly (ideally with things like record-size limits and
         | timeouts).
         | 
         | Not sure it says anything you haven't, but a StackOverflow
         | answer on fragmentation (framed by asker as Go not behaving
         | like C) is one of the more-read ones I've written:
         | https://stackoverflow.com/questions/26999615/go-tcp-read-is-...
        
         | Unklejoe wrote:
         | > stream of bytes
         | 
         | I've always wondered: What's the best/defacto way to delimit
         | this back into packets at the application level on the
         | receiving end?
         | 
         | I would think the obvious approach would be to insert some
         | magic word into the stream so that you can re-sync.
         | 
         | Or is this not an issue since you know that once you're
         | connected, you'll never drop a single byte, therefore, the only
         | way to get out of sync would be a program error?
        
           | mytailorisrich wrote:
            | The standard way is to include explicit information about the
            | length of the message that follows.
           | 
           | For example if the message is x bytes long then you first
           | send 'x' then you send the x bytes of the message.
           | 
           | Or your messages have a defined header that contains the
           | length of the message payload.
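
A sketch of this length-prefix framing in Python (send_msg/recv_msg and the 4-byte big-endian header are illustrative choices, not a standard):

```python
import socket
import struct

def _recv_exactly(sock, n):
    """Loop until exactly n bytes arrive; recv() may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        buf += chunk
    return buf

def send_msg(sock, payload: bytes):
    # Prefix each message with a 4-byte big-endian length header.
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack("!I", _recv_exactly(sock, 4))
    return _recv_exactly(sock, length)

# Demo: message boundaries survive the byte stream intact.
a, b = socket.socketpair()
send_msg(a, b"first message")
send_msg(a, b"second")
m1 = recv_msg(b)
m2 = recv_msg(b)
print(m1, m2)  # b'first message' b'second'
```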
        
           | jstanley wrote:
           | You will never drop a single byte.
           | 
           | If you need some packet-oriented messaging, you could use
           | something like http://jsonlines.org/ (i.e. JSON messages
           | separated by newline characters), or
           | https://github.com/protocolbuffers/protobuf if it's more
           | performance-critical.
        
             | timeinput wrote:
              | Protobuf isn't self-delimiting, so you still have to have
              | some extra wrapper around it to carry the length.
             | 
             | I like zeromq to get to a packet based system.
        
           | genpfault wrote:
           | Netstrings[1] :)
           | 
           | [1]: https://en.wikipedia.org/wiki/Netstring
        
           | vasilvv wrote:
            | It will never get out of sync, because TCP guarantees that
            | the bytes will be delivered in the same order they were sent.
           | 
           | The best approach is typically put a length in front of every
           | message. The good things about that approach are:
           | 
            | 1. The receiver can allocate a buffer that is exactly the
            | size it needs to fit the message.
            | 
            | 2. The receiver can check whether the message is too long
            | before reading the entire message.
           | 
           | The only disadvantage is that you have to know the length of
           | all messages in advance.
        
         | fenwick67 wrote:
         | This probably bites lots of newbies, since when you're just
         | sending traffic over localhost, the send()s and read()s tend to
         | line up.
        
           | yjftsjthsd-h wrote:
           | I have often wished for an "unhelpful testing environment" of
           | sorts, to deal with these things before they get out of hand.
           | It would feature a compiler that had creatively different
           | interpretations of undefined behaviors, randomly compile
           | against glibc and musl, have a base OS lovingly crafted from
           | Ubuntu, but with most coreutils replaced with busybox and/or
           | BSD versions. And, now, I suppose, it would have a customized
           | network stack (kernel module?) that would randomly
           | reorder/drop/duplicate packets, randomly reselect MTU on
           | every boot, or maybe just randomly fragments things
           | regardless of MTU. Ideally it would come with a FAQ of "my
           | program broke on X; what did I do wrong?".
           | 
           | The idea being that if your software is actually written to
           | relevant standards, and actually handles things properly
           | outside the golden path, then it should still work fine. If,
           | however, you accidentally did something implementation-
           | defined, or that only worked by coincidence, this system
           | _will_ break it.
        
             | jeroenhd wrote:
             | There are tools that intentionally insert failures into the
             | network streams of applications. A few of them are
             | described here: https://medium.com/@docler/network-issues-
             | simulation-how-to-...
             | 
             | The other linking/OS problems can probably be automated
             | with some simple integration tests and a bunch of different
             | docker containers to compile the code in. Should be
             | possible to squeeze it into a CI/CD flow somewhere with
             | some clever tricks.
        
             | Matthias247 wrote:
             | I created such an environment for my unit-tests: Wrapping
             | TCP sockets in a stream which only accepts 1 byte at a time
             | in both directions and returns EAGAIN on every second read
             | provides an easy way to make sure the code on top of the
             | socket does perform all the correct retries.
             | 
              | That will most likely not help newcomers who directly
              | write their code against the OS socket. But once you get a
              | better understanding of the topic and start adding tests to
              | your codebase, it's rather easy to add.
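
A rough Python equivalent of such a test wrapper (TrickleSocket is a hypothetical name; it delivers one byte per call and simulates EAGAIN on every second read, so code under test must handle both short reads and retries):

```python
import socket

class TrickleSocket:
    """Test wrapper: returns at most one byte per recv() and raises
    BlockingIOError (the Python face of EAGAIN) on every second call."""
    def __init__(self, sock):
        self._sock = sock
        self._calls = 0

    def recv(self, n):
        self._calls += 1
        if self._calls % 2 == 0:
            raise BlockingIOError  # simulated EAGAIN
        return self._sock.recv(1)  # never more than one byte at a time

def robust_read(sock, n):
    """A reader that tolerates both short reads and EAGAIN."""
    buf = b""
    while len(buf) < n:
        try:
            chunk = sock.recv(n - len(buf))
        except BlockingIOError:
            continue               # retry, as real code would after polling
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf

a, b = socket.socketpair()
a.sendall(b"hi!")
data = robust_read(TrickleSocket(b), 3)
print(data)  # b'hi!'
```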
        
         | brlewis wrote:
         | For me, at least in this decade, it would have been better if I
         | didn't know that. I put off learning websockets longer than I
         | should have because I don't find packet boundaries fun to deal
         | with, and my interest in websockets was mainly for fun. Then
         | when I finally picked websockets up I was pleasantly surprised
         | that message framing is built in.
        
       | ex3ndr wrote:
        | The biggest issue with TCP is that a connection can randomly
        | freeze, on pretty much any network, and you have to restart
        | it. You CANNOT rely on the socket being closed by either
        | side; you have to monitor the connection yourself.
        | 
        | I am super puzzled why something like websockets doesn't
        | solve this problem. A simple heartbeat would solve it, but no
        | one implements it.
        
         | gsich wrote:
         | You can use keepalives at the protocol (TCP) level.
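          | For example, in Python (the helper name is my own; the
          | TCP_KEEP* tunables are Linux option names and differ on
          | other platforms, hence the hasattr guards):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalives so a dead peer is detected even while
    the application is idle."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):   # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):  # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):    # failed probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```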
        
       | dblohm7 wrote:
       | This reminds me of an issue I had to debug over a decade ago. Our
       | product had its own protocol written atop TCP, but its handshake
       | was written in a way such that it was much slower than it should
       | have been due to delays caused by the Nagle algorithm.
       | 
       | Turning on TCP_NODELAY was a quick-n-dirty fix, but the real fix
       | was to rewrite the handshake to be more compatible with the inner
       | workings of TCP.
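        | The quick-n-dirty fix is a single socket option, shown here
        | in Python for illustration (the real fix, as noted, is to
        | batch the handshake into fewer writes so Nagle and delayed
        | ACK don't interact badly):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle: small writes go out immediately instead of waiting
# to be coalesced, trading extra small packets for lower latency.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```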
        
       | resca79 wrote:
        | I loved this area when I was at university. At the end of the
        | Computer Networking course I did a project based on
        | https://www.isi.edu/nsnam/ns/
        | 
        | It was really fun, especially because it allows you to
        | understand all the networking layers better.
        | 
        | I ran some tests on network topologies to minimize lost TCP
        | packets as much as possible, given different traffic patterns.
        
       | vinay_ys wrote:
        | The single biggest TCP issue I have had to debug and fix,
        | numerous times, is improper connection reuse leading to TCP
        | port exhaustion, which causes seemingly random delays and
        | timeout failures at higher-level protocols, usually HTTP.
        | This one issue has taken down multi-billion dollar production
        | systems.
       | 
       | So, I hope people learn to check their http client/server
       | implementations to have proper connection handling. Client should
       | have a thoughtfully sized bounded connection pool with reasonably
       | large idle timeout. It shouldn't close the connection after every
       | application request (say, http request). There shouldn't be
       | sockets in TIME_WAIT state accumulating at the client end.
       | 
       | Server should accept thoughtfully limited number of connections
       | per client. Server should never close the connection except when
       | it is shutting down.
       | 
       | There should be tcp keepalive messages to keep the connection
       | alive with intermediate hop stateful firewalls (connection
       | tracking table entries in firewalls expire when the connection is
       | idle for too long) and to detect stale connections and re-
       | establish them.
       | 
        | All of these things can be verified by analyzing a packet
        | capture. You can get a manageably sized pcap file by
        | filtering on client/server ip/port-range pairs for at least
        | 330 seconds.
       | 
        | Knowing the tools to understand and debug TCP issues is an
        | essential skill: the socket statistics command ss, and
        | wireshark/tshark with Lua scripting, are super useful.
        | Knowing higher-level application protocols like TLS and HTTP
        | is essential too.
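        | The client-side pooling idea can be sketched like this in
        | Python (class and parameter names are illustrative, not a
        | real library; production HTTP clients usually handle this for
        | you): reuse idle connections instead of opening one per
        | request, so TIME_WAIT sockets don't pile up.

```python
import queue
import socket

class ConnectionPool:
    """Minimal bounded connection pool sketch (illustrative only)."""

    def __init__(self, host, port, maxsize=10,
                 connect=socket.create_connection):
        self._addr = (host, port)
        self._idle = queue.LifoQueue(maxsize)  # bounded, as advised above
        self._connect = connect                # injectable for testing

    def acquire(self):
        try:
            return self._idle.get_nowait()     # reuse an idle connection
        except queue.Empty:
            return self._connect(self._addr)   # else dial a new one

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)        # keep it warm for reuse
        except queue.Full:
            conn.close()                       # pool is bounded; drop extra
```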
        
       | bsamuels wrote:
       | Why doesn't the congestion control part of TCP prevent buffer
       | bloat[1]? Is it because ISP throttling of the internet connection
       | doesn't touch the TCP packets themselves?
       | 
       | I recently started doing off-site backups, which requires my
       | entire internet uplink to be used for uploading said backups for
       | about a week at a time. The internet basically becomes unusable
       | because all the packets end up in a buffer on the router and
       | latency spikes to 5000ms.
       | 
       | [1]
       | https://www.bufferbloat.net/projects/bloat/wiki/What_can_I_d...
        
         | the8472 wrote:
         | > Why doesn't the congestion control part of TCP prevent buffer
         | bloat[1]?
         | 
         | It can. Enable BBR + fq/fq_codel on the box in question and
         | CAKE on your router.
        
         | milesvp wrote:
          | This is a fundamental problem on the internet. RAM is so
          | cheap that every device has oversized buffers, which defeat
          | proper TCP back pressure. Eric Raymond gave a talk on this
          | a few years ago. He was going to distribute a lot of small
         | embedded devices around the world to measure this to try to
         | address it. I'm curious what happened to that effort.
        
         | jeffbee wrote:
         | If there is a huge FIFO queue on your router, the rate-finding
         | algorithms associated with TCP will be forced to conclude that
         | the RTT to your site is enormous. They may try to open the
         | window to compensate, but here's a fun fact: most operating
         | system default settings are insufficient to utilize very high
         | bandwidth-delay products. If you want to send a 1gbps flow
         | across an 80ms distance on Linux, you'll need to change some
         | parameters with sysctl before it will work. If your apparent
         | RTT is 5000ms, the flow you can get will be reduced in
         | proportion.
         | 
         | In any case, the solution to bufferbloat is queue discipline,
         | not congestion control.
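          | The arithmetic behind that: the bandwidth-delay product is
          | the amount of data that must be in flight to keep the pipe
          | full, and it easily exceeds common default window limits
          | (on Linux the knobs involved are the net.ipv4.tcp_rmem and
          | net.ipv4.tcp_wmem sysctls).

```python
# Bandwidth-delay product: bytes in flight needed to fill the link.
link_bps = 1_000_000_000       # 1 Gbps link
rtt = 0.080                    # 80 ms round-trip time
bdp = int(link_bps / 8 * rtt)  # bytes
print(bdp)                     # ~10 MB of window required
```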
        
           | jfkebwjsbx wrote:
           | Up to what speeds/latencies are the default sysctl parameters
           | alright? Is there any easy way to know whether you are
            | getting hit by this? Nowadays many people are getting 1 Gbps
           | links at home!
           | 
           | What do you mean by queue discipline?
        
             | jeffbee wrote:
             | You know, the worst part is that Linux sets the maximum
             | receive window size at boot time depending on how much
             | memory the system contains, ensuring that it's never quite
             | right. On this machine, with 32GB of main memory, it
             | defaults to 6291456 bytes.
        
               | jfkebwjsbx wrote:
               | I see, thanks!
               | 
               | What about the queue discipline?
        
               | jeffbee wrote:
               | If you face a choice of what frame to put on the wire at
               | any moment, the queue discipline makes that choice. The
               | easiest policy is to simply send the oldest frame, but
               | this is also the worst policy.
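                | A toy illustration of why (illustrative Python, not a
                | real qdisc): FIFO sends the oldest frame regardless
                | of flow, so an interactive flow waits behind a bulk
                | transfer's whole backlog, while round-robin across
                | flows interleaves them.

```python
from collections import deque

# Frames tagged by flow: a bulk transfer (A) and an interactive flow (B).
frames = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("A", 4), ("B", 2)]

# FIFO: oldest first -- B's frames wait behind the entire A backlog.
fifo_order = list(frames)

# Round-robin: alternate between flows that still have queued frames.
queues = {}
for flow, seq in frames:
    queues.setdefault(flow, deque()).append((flow, seq))
rr_order = []
while any(queues.values()):
    for flow in list(queues):
        if queues[flow]:
            rr_order.append(queues[flow].popleft())

print(fifo_order[:4])  # [('A', 1), ('A', 2), ('A', 3), ('B', 1)]
print(rr_order[:4])    # [('A', 1), ('B', 1), ('A', 2), ('B', 2)]
```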
        
               | jfkebwjsbx wrote:
               | Ah, so the eviction/priority algorithm. Thanks!
        
         | vasilvv wrote:
         | Most of the common TCP congestion control algorithms (Reno,
         | Cubic) are loss-based: they try to send more and more data
         | until the link no longer can buffer all of the packets, and
         | drops some of them. Naturally, this approach requires the
         | buffer to fill up, causing the latency to spike.
         | 
         | There are algorithms that try to use increased delay as a
         | signal that the link is full. This approach has multiple
         | problems, one of which is that delay can be really noisy on
         | wireless networks; another is that if you have a loss-based and
         | a delay-based connection sharing the same link, the delay-based
         | one will get much less than a fair share of its bandwidth.
         | People have been trying to make an algorithm that both coexists
         | with Reno/CUBIC and does not induce bufferbloat for the last 25
         | years or so, and there's been some progress, but none of it has
         | reached the point where it could be used as a default
         | congestion control for all operating systems.
         | 
         | The problem of "I have files to transfer in background, but I
         | want my connection to yield to more important traffic" can
          | actually be solved using a special congestion control algorithm
         | called LEDBAT [1]; it's used by Apple for things like software
         | updates, and BitTorrent uses it too. Unfortunately, I think
         | only Apple implements it in its TCP stack, so anyone who wants
         | to do that would have to roll their own thing using UDP.
         | 
         | [1] https://en.wikipedia.org/wiki/LEDBAT
        
         | kqr wrote:
         | Big buffers that can be filled fast trick congestion control
         | algorithms into thinking your wire is really fast. The point of
         | the buffer is to be transparent to the transmitting ends, so
         | they see the packets going out at lightning speed and assume
         | it's because they're actually going that fast, and not just
         | piled into a buffer that fast.
        
         | toast0 wrote:
         | > Why doesn't the congestion control part of TCP prevent buffer
         | bloat[1]? Is it because ISP throttling of the internet
         | connection doesn't touch the TCP packets themselves?
         | 
         | Most of the congestion control algorithms use packet loss as
         | the only indicator of congestion. In a network with oversized
         | buffers, congestion will result in delay and not packet loss.
          | If the delay gets large enough, receive and congestion windows
         | will restrict the effective bandwidth, but the latency at that
         | point is terrible.
         | 
         | There are some alternate congestion control algorithms which do
         | use latency as a signal, but they aren't universally available,
         | and may not be a good fit for all flows.
         | 
         | For your backup use case, probably the simplest thing is to
         | reduce your sendbuffers for the backup sender process. Although
         | allowing packets to drop instead of queue at your router/modem
          | would really be best, often that's difficult to achieve.
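          | Reducing the send buffer is one socket option, shown here
          | in Python (64 KB is an arbitrary example value): it caps
          | how much unacknowledged data the backup process can queue
          | in the kernel, which in turn limits how deep it can fill
          | the router's buffer.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Cap the kernel send buffer; note Linux may round or double the
# value you request.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 64 * 1024)
```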
        
           | api wrote:
           | A major reason explicit congestion notification is not used
           | is firewalls that block anything that isn't bog standard TCP
           | or UDP. Some even ban odd combinations of flags. There are
           | enough of these to make ECN useless.
        
             | toast0 wrote:
             | A router that is willing to buffer 5 seconds worth of
             | packets probably wasn't going to mark for congestion and
             | drop either.
             | 
             | Note also, Apple is using MP-TCP and ECN in iOS, and the
             | world didn't stop. It might not work everywhere, and I
             | don't praise Apple lightly, but there's a pretty clear path
             | to using things like this. Send a syn with it enabled, wait
             | a bit, and send one with it disabled. Keep track of
             | networks where it doesn't work and stop trying it there. If
             | you have leverage, yell at people to not do dumb things,
             | otherwise, let them figure out why expensive things work
              | better on their competitors' networks. You can't rely on
             | being able to use these things, but you can use them for
             | progressive enhancement.
        
       ___________________________________________________________________
       (page generated 2020-05-15 23:00 UTC)