[HN Gopher] Better visibility into Linux packet-dropping decisions
       ___________________________________________________________________
        
       Better visibility into Linux packet-dropping decisions
        
       Author : rwmj
       Score  : 88 points
       Date   : 2022-03-03 17:39 UTC (5 hours ago)
        
 (HTM) web link (lwn.net)
 (TXT) w3m dump (lwn.net)
        
       | egberts1 wrote:
        | There are roughly 2,700 places where the kernel can decide to
        | drop a packet, and only about 0.6% of them are reported to the
        | log.
       | 
        | One thing I've noticed is that introducing another function
        | argument can perturb tightly written loops and call sites (in
        | other words, instruction-cache busting).
       | 
        | I am quite sure Jens Axboe has something to say about this
        | while he is busy pushing the Linux kernel toward multi-million
        | IOPS and packet-processing throughput.
        
         | rwmj wrote:
         | Error/packet-drop paths can be moved to a cold path outside of
         | hot loops using if(unlikely(...)), although I agree that does
         | require analysis & code modifications.
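The unlikely() pattern described above can be illustrated with a minimal user-space sketch. The kernel's unlikely() is just a wrapper around __builtin_expect; process_packet and handle_drop below are hypothetical names for illustration, not kernel symbols:

```c
#include <stdio.h>

/* The kernel's unlikely() is a hint built on __builtin_expect: it tells
 * the compiler the condition is almost never true, so the drop path can
 * be laid out out-of-line, away from the hot loop's instruction cache. */
#define unlikely(x) __builtin_expect(!!(x), 0)

long drops;  /* count of dropped packets */

/* Cold path: rarely executed, so its code need not pollute the
 * i-cache footprint of the fast path. */
void handle_drop(int pktlen)
{
    drops++;
    fprintf(stderr, "dropping %d-byte packet\n", pktlen);
}

/* Hot path: the common case falls straight through. */
int process_packet(int pktlen)
{
    if (unlikely(pktlen <= 0)) {
        handle_drop(pktlen);
        return -1;
    }
    /* ...normal packet processing would go here... */
    return 0;
}
```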
        
       | contingencies wrote:
       | Cool but I think the following are safe assumptions: Most packets
       | are dropped "elsewhere". Most packets are dropped due to
       | misconfiguration, route loss (often due to power loss, link
        | change or link integrity issues), or firewalling. Well-written
       | applications tend to survive regardless. IPv6 is nominally better
       | than IPv4 in terms of surviving link state changes.
       | 
        | Also: The recent trend of pushing everything over WebSockets is
        | bad, as it effectively regresses to circuit-switched networking
        | with undefined QoS guarantees / failure modes and crap tooling.
        | Hopefully we've already passed 'peak websocket'.
        
         | toast0 wrote:
         | > Most packets are dropped "elsewhere".
         | 
         | This is probably true, but elsewhere packet loss is often
         | diffuse. If you're looking into packet loss, it's likely
         | because something is overloaded on your machine resulting in a
         | burst of loss that affects user experience or your stats.
         | Surviving is great, but eliminating bottlenecks you can control
         | is better.
        
           | jeffbee wrote:
           | I don't take the statement that packets are dropped elsewhere
           | as obviously true. Your own box will drop due to queue
           | discipline buffer sizes, NIC queue pruning, and a hundred
           | other reasons.
        
       | predakanga wrote:
       | I recently came across another useful utility for debugging
       | unexpected packet drops - PWRU[0] (Packet, Where Are You) by
       | Cilium.
       | 
       | It uses eBPF to try to trace the path of the packet through the
       | kernel. Haven't needed to use it yet, but it could have saved me
       | a lot of trouble in the past.
       | 
       | [0]: https://github.com/cilium/pwru
        
       | tptacek wrote:
       | drop_mon (or whatever it's called) is one of the weirder things
       | in the Linux kernel. It has only one implementation I've found,
       | "dropwatch", which is, to put it gently, not a great example of a
       | modern C CLI program --- for instance, the kernel subsystem gives
       | you snapshots of packet contents themselves, and there is already
       | a very flexible and easy-to-use library for filtering packets
       | based on their contents with an enormous ecosystem, but all
       | dropwatch will do is print dumps.
       | 
       | I threw together a half-assed POC alternative implementation in
       | Go a couple months ago, using Matt Layher's fantastic netlink
       | libraries:
       | 
       | https://github.com/superfly/dropspy
       | 
       | I have the impression that the drop_mon stuff isn't taken super
       | seriously by anyone, but it's incredibly useful when you're
       | debugging complicated networking stuff.
        
         | suprjami wrote:
         | There's a dropwatch in SystemTap which produces similar output
         | to the dropwatch program but has been significantly more useful
         | for our purposes. Adding a backtrace to that can be useful.
          | Beyond that we usually instrument the kernel either to printk
          | when decisions are made, or to add fields to the skb, set
          | them as packets pass through different parts of the stack,
          | and finally instrument the points which react to those
          | fields. That's a lot quicker and more efficient than looking
          | at every packet.
        
       | spockz wrote:
        | Is there any way to get statistics about the number of packet
        | drops, retransmits, etc. from the application level?
       | 
       | We are running jvm (netty client and tomcat server) applications
       | in K8s and are experiencing p99 delays for some requests. Server
       | side, application metrics have a p99 of 8ms, but clients
       | experience a p99 of 2000ms. In this same interval other requests
        | continue happily. I suspect either scheduling or something in
        | the network.
       | 
       | Any suggestions on how to figure out how to detect where this
       | happens? (Putting wireshark in between would be a last resort.)
        
         | jeffbee wrote:
         | Better than wireshark would be a BPF program. For established
         | sockets you can get retransmission stats out of the kernel with
         | the `ss` tool, but that leaves you no visibility into transient
         | sockets.
         | 
         | Switching to QUIC or other user-space protocol would give you
         | the best observability.
        
         | xxpor wrote:
         | If it's almost exactly 2000 ms, it sounds like a retransmission
         | timer, i.e. packet loss.
        
           | spockz wrote:
           | Thanks. I'll look into that. It actually appears to be
           | multiples of 2000 + some additional processing times.
        
           | jeffbee wrote:
           | More specifically, it's probably a lost SYN. 2s is too long
           | to be a single retransmission on an established flow.
        
         | namibj wrote:
          | If it's using kernel TCP, switch to a TCP congestion control
         | that uses tail loss probes. If you have enough control and the
         | network you control supports it, consider L4S TCP technology or
         | at least normal (old-style) ECN-based congestion control.
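Switching congestion control per socket, as suggested, is a one-line setsockopt on Linux. A minimal sketch; which algorithms are available depends on what the kernel has built in or loaded:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a kernel TCP congestion control algorithm for one socket.
 * Linux-specific; the algorithm must be built in or loaded (see
 * /proc/sys/net/ipv4/tcp_available_congestion_control). */
int set_congestion_control(int fd, const char *algo)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      algo, strlen(algo));
}
```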
        
         | toast0 wrote:
          | netstat -s gives protocol statistics, some of which might be
          | useful. You might also have ethernet interface statistics
          | somewhere, in case the interface is dropping packets (on
          | FreeBSD, interface stats for nice drivers will be in sysctl
          | dev.X, where X is the driver name; some drivers have better
          | data than others, but I haven't debugged Linux issues enough
          | to find the same data there).
         | 
         | Either way, tcpdump on the host or the client (sometimes you
         | need dumps from both) should tell the tale. You probably don't
         | need or want wireshark in between the peers, a capture from
          | either can be loaded at your leisure. If you suspect a non-
          | application issue, you can do -s 64 or so and not capture most
         | of the application data, but still have all the TCP headers.
         | pcaps aren't always exactly the truth about what's on the wire,
         | but they're pretty close.
         | 
         | Personally, I get a lot of mileage out of Wireshark, so every
         | problem report looks like a pcap. If the server is really busy,
         | the pcaps get too big, and you have to do things like short
         | bursts of capturing and hope you see the issue in the burst, or
         | sampling, but if your server isn't very busy, you can do things
         | like pcap all the packets, and ask for problem reports to
         | include client ip and port number and then you can find it in
         | the big pcap a lot easier.
        
           | tptacek wrote:
           | Wireshark, tcpdump, and pcap tools in general are probably
           | way overkill for basic network statistics issues.
        
             | toast0 wrote:
             | Overkill is the best kind of kill! Client and server
             | disagree on p99 times by a lot and there's enough reports
             | to actually look into it says there's probably something
             | interesting going on, and pcap will tell you what it is.
             | And you'll almost always see at least a few other things
             | that might be worth fixing while you're looking at pcaps;
             | some of which might actually be fixable.
             | 
             | Sometimes you even get lucky and can see what traffic was
             | going on before the mysterious gap in packets, and make a
             | good guess at what's blocking your rx queues or whatever.
             | 
             | Stats are nice too, of course.
        
         | tptacek wrote:
         | Yes, there are a bunch of different network statistics, which
         | you can scrape for instance out of proc (see the "net"
         | subdirectory, and, particularly, "net/snmp"; it's trivially
         | parseable).
         | 
         | A more ordinary way to come at this would be to hook up a
         | Prometheus node_exporter and just look at this stuff in
         | Grafana.
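The /proc/net/snmp format mentioned above pairs, per protocol, a header line of field names with a matching line of values. A small sketch of parsing it; snmp_field is a hypothetical helper, written against a string so it also works on a saved copy of the file:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Look up `field` under protocol `proto` in /proc/net/snmp-formatted
 * text.  The format is a "Proto: name name ..." header line followed
 * by a "Proto: value value ..." line.  Returns the value, or -1 if
 * the protocol or field is not found. */
long snmp_field(const char *text, const char *proto, const char *field)
{
    char prefix[32];
    snprintf(prefix, sizeof prefix, "%s: ", proto);

    const char *hdr = strstr(text, prefix);
    if (!hdr)
        return -1;
    const char *vals = strstr(hdr + 1, prefix);  /* second line = values */
    if (!vals)
        return -1;

    /* Walk the header and value lines in lockstep, token by token. */
    const char *h = hdr + strlen(prefix);
    const char *v = vals + strlen(prefix);
    while (*h && *h != '\n') {
        size_t hl = strcspn(h, " \n");
        size_t vl = strcspn(v, " \n");
        if (hl == strlen(field) && strncmp(h, field, hl) == 0)
            return strtol(v, NULL, 10);
        h += hl; if (*h == ' ') h++;
        v += vl; if (*v == ' ') v++;
    }
    return -1;
}
```

Reading the live counters is then just fopen("/proc/net/snmp") plus this helper; node_exporter's netstat collector exposes the same counters, which is the more ordinary route described above.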
        
       ___________________________________________________________________
       (page generated 2022-03-03 23:00 UTC)