[HN Gopher] Better visibility into Linux packet-dropping decisions
___________________________________________________________________

Better visibility into Linux packet-dropping decisions

Author : rwmj
Score : 88 points
Date : 2022-03-03 17:39 UTC (5 hours ago)

(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)

| egberts1 wrote:
| There are arguably some 2,700 ways to lose a packet within the
| kernel, and only about six-tenths of one percent of them are
| reported to the log.
|
| One thing I've noticed is that introducing another function
| argument can disturb tightly written loops and calls (in other
| words, it can bust the instruction cache).
|
| I am quite sure Jens Axboe has something to say about this while
| he is busy pushing the Linux kernel toward multi-million-IOPS
| I/O and packet-processing throughput.
|
| rwmj wrote:
| Error/packet-drop paths can be moved to a cold path outside of
| hot loops using if (unlikely(...)) (see the sketch further
| down), although I agree that does require analysis & code
| modifications.
|
| contingencies wrote:
| Cool, but I think the following are safe assumptions: most
| packets are dropped "elsewhere". Most packets are dropped due to
| misconfiguration, route loss (often due to power loss, link
| change or link-integrity issues), or firewalling. Well-written
| applications tend to survive regardless. IPv6 is nominally
| better than IPv4 at surviving link-state changes.
|
| Also: the recent trend of putting everything down web sockets is
| bad, as it effectively regresses to circuit-switched networking
| with undefined QoS guarantees / failure types and crap tooling.
| Hopefully we've already passed 'peak websocket'.
|
| toast0 wrote:
| > Most packets are dropped "elsewhere".
|
| This is probably true, but elsewhere the packet loss is often
| diffuse. If you're looking into packet loss, it's likely because
| something is overloaded on your machine, resulting in a burst of
| loss that affects user experience or your stats. Surviving is
| great, but eliminating the bottlenecks you can control is
| better.
|
| jeffbee wrote:
| I don't take the statement that packets are dropped elsewhere as
| obviously true. Your own box will drop due to queue-discipline
| buffer sizes, NIC queue pruning, and a hundred other reasons.
|
| predakanga wrote:
| I recently came across another useful utility for debugging
| unexpected packet drops - PWRU [0] (Packet, Where Are You) by
| Cilium.
|
| It uses eBPF to trace the path of a packet through the kernel.
| I haven't needed to use it yet, but it could have saved me a lot
| of trouble in the past.
|
| [0]: https://github.com/cilium/pwru
|
| tptacek wrote:
| drop_mon (or whatever it's called) is one of the weirder things
| in the Linux kernel. It has only one implementation I've found,
| "dropwatch", which is, to put it gently, not a great example of
| a modern C CLI program. For instance, the kernel subsystem gives
| you snapshots of the packet contents themselves, and there is
| already a very flexible and easy-to-use library for filtering
| packets based on their contents, with an enormous ecosystem, but
| all dropwatch will do is print dumps.
|
| I threw together a half-assed POC alternative implementation in
| Go a couple of months ago, using Matt Layher's fantastic netlink
| libraries:
|
| https://github.com/superfly/dropspy
|
| I have the impression that the drop_mon stuff isn't taken super
| seriously by anyone, but it's incredibly useful when you're
| debugging complicated networking stuff.
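The cold-path pattern rwmj describes above, as a minimal sketch:
handle_pkt() and drop_cold() are hypothetical names, while
unlikely(), kfree_skb(), net_ratelimit(), pr_debug() and the
NET_RX_* codes are real kernel interfaces.

  #include <linux/if_ether.h>
  #include <linux/net.h>
  #include <linux/netdevice.h>
  #include <linux/skbuff.h>

  /* Cold path: kept out of line, and rate-limited so a flood of
   * drops cannot flood the log. */
  static noinline void drop_cold(struct sk_buff *skb, const char *why)
  {
          if (net_ratelimit())
                  pr_debug("dropping skb, len=%u: %s\n", skb->len, why);
          kfree_skb(skb);
  }

  /* Hot path: the branch hint lets the compiler lay the drop
   * branch out of the straight-line instruction stream. */
  static int handle_pkt(struct sk_buff *skb)
  {
          if (unlikely(skb->len < ETH_HLEN)) {
                  drop_cold(skb, "runt frame");
                  return NET_RX_DROP;
          }
          /* ... normal per-packet work ... */
          return NET_RX_SUCCESS;
  }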
| suprjami wrote:
| There's a dropwatch in SystemTap which produces similar output
| to the dropwatch program but has been significantly more useful
| for our purposes. Adding a backtrace to that can be useful.
| Beyond that, we usually instrument the kernel to either printk
| when decisions are made, or add fields in the skb, set those
| fields to values as the skb passes through different parts of
| the stack, and finally add things which react to those fields
| (see the sketch further down). That's a lot quicker and more
| efficient than looking at every packet.
|
| spockz wrote:
| Is there any way to get statistics about the number of packet
| drops, retransmits, etc. from the application level? (One
| mechanism is sketched further down.)
|
| We are running JVM (Netty client and Tomcat server) applications
| in K8s and are experiencing p99 delays for some requests. Server
| side, application metrics have a p99 of 8ms, but clients
| experience a p99 of 2000ms. In the same interval other requests
| continue happily. I suspect either scheduling somehow, or
| something in the network.
|
| Any suggestions on how to detect where this happens? (Putting
| Wireshark in between would be a last resort.)
|
| jeffbee wrote:
| Better than Wireshark would be a BPF program. For established
| sockets you can get retransmission stats out of the kernel with
| the `ss` tool, but that leaves you no visibility into transient
| sockets.
|
| Switching to QUIC or another user-space protocol would give you
| the best observability.
|
| xxpor wrote:
| If it's almost exactly 2000 ms, it sounds like a retransmission
| timer, i.e. packet loss.
|
| spockz wrote:
| Thanks. I'll look into that. It actually appears to be multiples
| of 2000 ms plus some additional processing time.
|
| jeffbee wrote:
| More specifically, it's probably a lost SYN. 2s is too long to
| be a single retransmission on an established flow.
|
| namibj wrote:
| If it's using kernel TCP, switch to a TCP congestion control
| that uses tail loss probes. If you have enough control and the
| network you control supports it, consider L4S TCP technology, or
| at least normal (old-style) ECN-based congestion control.
|
| toast0 wrote:
| netstat -s gives protocol statistics, some of which might be
| useful. You might also have Ethernet interface statistics
| somewhere, in case the interface is dropping packets. (On
| FreeBSD, interface stats for nice drivers will be in sysctl
| dev.X, where X is the driver name; some drivers have better data
| than others, but I haven't debugged Linux issues enough to find
| the same data there.)
|
| Either way, tcpdump on the host or the client (sometimes you
| need dumps from both) should tell the tale. You probably don't
| need or want Wireshark in between the peers; a capture from
| either side can be loaded at your leisure. If you suspect a
| non-application issue, you can do -s 64 or so and not capture
| most of the application data, but still have all the TCP
| headers. pcaps aren't always exactly the truth about what's on
| the wire, but they're pretty close.
|
| Personally, I get a lot of mileage out of Wireshark, so every
| problem report looks like a pcap. If the server is really busy,
| the pcaps get too big, and you have to do things like short
| bursts of capturing and hope you see the issue in the burst, or
| sampling. But if your server isn't very busy, you can pcap all
| the packets and ask for problem reports to include client IP and
| port number, and then you can find it in the big pcap a lot
| easier.
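One concrete form of the skb-tagging approach suprjami describes
at the top of this thread, as a sketch. The assumptions: it
borrows the real skb->mark field as the debug tag, which is only
safe when nothing else on the box (routing rules, netfilter) uses
marks, and DBG_MARK plus both functions are made up for
illustration.

  #include <linux/if_ether.h>
  #include <linux/net.h>
  #include <linux/skbuff.h>

  #define DBG_MARK 0x44454247  /* "DEBG" */

  /* Early in the path: tag only the packets of interest. */
  static void tag_interesting(struct sk_buff *skb)
  {
          if (skb->protocol == htons(ETH_P_IP))
                  skb->mark = DBG_MARK;
  }

  /* At the suspect drop site: react only to tagged packets,
   * which is far cheaper than logging every packet. */
  static void check_tag(struct sk_buff *skb)
  {
          if (skb->mark == DBG_MARK && net_ratelimit())
                  pr_info("tagged skb reached %s\n", __func__);
  }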
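On spockz's question about application-level retransmit
statistics: on Linux a process can read per-socket counters
straight from the kernel with getsockopt(TCP_INFO); this is the
same struct tcp_info that `ss -ti` displays. A minimal sketch,
where report_retrans() is a hypothetical helper but the socket
option and its fields are real:

  #include <stdio.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>   /* TCP_INFO, struct tcp_info */
  #include <sys/socket.h>

  /* Print retransmission counters for a connected TCP socket. */
  static void report_retrans(int fd)
  {
          struct tcp_info ti;
          socklen_t len = sizeof(ti);

          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                  /* tcpi_total_retrans: lifetime retransmitted
                   * segments; tcpi_rto: current retransmission
                   * timeout in microseconds, handy for explaining
                   * ~2000 ms stalls. */
                  printf("retrans=%u total_retrans=%u rto=%uus lost=%u\n",
                         ti.tcpi_retrans, ti.tcpi_total_retrans,
                         ti.tcpi_rto, ti.tcpi_lost);
  }

The JVM doesn't expose struct tcp_info directly, so in spockz's
setup this would mean a small JNI helper or a sidecar; the
observed 2000 ms multiples could then be compared against
tcpi_rto directly.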
| tptacek wrote:
| Wireshark, tcpdump, and pcap tools in general are probably way
| overkill for basic network-statistics issues.
|
| toast0 wrote:
| Overkill is the best kind of kill! Client and server disagreeing
| on p99 times by a lot, with enough reports to actually look into
| it, says there's probably something interesting going on, and
| pcap will tell you what it is. And you'll almost always see at
| least a few other things that might be worth fixing while you're
| looking at pcaps, some of which might actually be fixable.
|
| Sometimes you even get lucky and can see what traffic was going
| on before the mysterious gap in packets, and make a good guess
| at what's blocking your rx queues or whatever.
|
| Stats are nice too, of course.
|
| tptacek wrote:
| Yes, there are a bunch of different network statistics, which
| you can scrape, for instance, out of proc (see the "net"
| subdirectory and, particularly, "net/snmp"; it's trivially
| parseable, as sketched below).
|
| A more ordinary way to come at this would be to hook up a
| Prometheus node_exporter and just look at this stuff in Grafana.
___________________________________________________________________
(page generated 2022-03-03 23:00 UTC)
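A postscript on "trivially parseable": /proc/net/snmp holds two
lines per protocol, a header line of field names and a matching
line of values. A minimal sketch of a reader that pairs the two
up and prints TCP's OutSegs and RetransSegs; this is plain
userspace C, and only the file format described above is assumed.

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char hdr[1024], val[1024];
          FILE *f = fopen("/proc/net/snmp", "r");

          if (!f) {
                  perror("/proc/net/snmp");
                  return 1;
          }
          /* The file is strictly paired: a name line, then a
           * value line with the same "Proto:" prefix. */
          while (fgets(hdr, sizeof(hdr), f) && fgets(val, sizeof(val), f)) {
                  char *hs, *vs;
                  char *h, *v;

                  if (strncmp(hdr, "Tcp:", 4) != 0)
                          continue;
                  h = strtok_r(hdr, " \n", &hs);  /* skip "Tcp:" */
                  v = strtok_r(val, " \n", &vs);  /* skip "Tcp:" */
                  while ((h = strtok_r(NULL, " \n", &hs)) &&
                         (v = strtok_r(NULL, " \n", &vs)))
                          if (!strcmp(h, "OutSegs") ||
                              !strcmp(h, "RetransSegs"))
                                  printf("%s = %s\n", h, v);
          }
          fclose(f);
          return 0;
  }

These counters are cumulative since boot, so you would sample
twice and take the difference; node_exporter's netstat collector
exposes the same fields, which is effectively the Grafana route
mentioned above.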