[HN Gopher] Tracking NFS problems down to the SFP level
___________________________________________________________________

Tracking NFS problems down to the SFP level

Author : CaliforniaKarl
Score  : 30 points
Date   : 2021-02-05 20:17 UTC (1 day ago)

(HTM) web link (news.sherlock.stanford.edu)
(TXT) w3m dump (news.sherlock.stanford.edu)

| lykr0n wrote:
| This seems like an issue that could have been resolved a lot
| quicker if they were doing network monitoring on the host side
| and the switch side.
|
| Ideally you would be able to spot a large number of link errors
| on a port/switch/host and fix it before it becomes a problem.
| toast0 wrote:
| I no longer have access to it, but I wrote a tool to find these
| types of problems at my last job. It didn't seem generally
| applicable enough to try to get it open sourced (and I didn't
| want to polish it enough for that either).
|
| The key insight is that LACP is almost always configured to use
| a hash of { Source IP, Dest IP, Protocol, Source Port, Dest
| Port }, so packets from each TCP and UDP flow will always be
| sent on the same individual link. (This is directional, though,
| so traffic may go from peer A to peer B on cable X and from B to
| A on cable Y.)
|
| So the way to confirm a broken link is to open a bunch of UDP
| flows (on different ports) between peer A and peer B, send data,
| and measure loss (and/or delay!). If you see zero loss / uniform
| delay, either none of your flows cross the broken link, or the
| problem isn't a broken link, or the rate of issues is too low to
| detect. Once you've found a broken flow, you can use a 'paris
| traceroute' tool to confirm which IP routers it sits between.
| Paris traceroute holds the UDP source and destination ports
| fixed, so the route across LACP should stay the same. I
| contributed support for this in mtr, but I'm not sure if it
| still works; if you see 100% packet loss with mtr in fixed UDP
| port mode, send me an email (address in profile) and I'll ask
| you for data and try to debug.
|
| Once you've narrowed it down to the two routers the link sits
| between, it should be easy enough to confirm: usually through
| link quality counters, and if not, by pulling links and seeing
| if things get better or worse.
|
| If you have a long network path, and most links are LACP, the
| total number of possible paths between two peers gets large, and
| there's a chance you might not be able to survey them all from a
| single pair of servers, so you may have to try a few different
| hosts.
|
| You can find packet loss, but also congestion/buffering, this
| way. In an ideal world, all the link error counters would be
| collected and anomalies would be addressed, but it seems that
| doesn't always happen.
| pwarner wrote:
| Yeah, we had major problems with a top-2 cloud provider where
| our on-prem-to-cloud link dropped packets. We narrowed it down
| with iperf to the packet loss only happening on some ephemeral
| source ports: ports were either always ok or always slow. We
| destroyed and recreated the cloud gateways and all was well. I
| should say another engineer figured it out. The cloud provider
| tried to blame our side. They did not excel at operations...
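A minimal sketch (not toast0's actual tool) of the multi-flow UDP
probe described above: pin each flow's source and destination port so
its 5-tuple, and therefore its LACP hash, stays fixed, send a known
number of packets per flow, and compare per-flow delivery on the far
side. The port range, flow count, packet count, and file name are
arbitrary illustrative choices.

    #!/usr/bin/env python3
    # lacp_probe.py - hypothetical sketch of a multi-flow UDP loss probe.
    # Send traffic over many fixed UDP 5-tuples so different flows land
    # on different LACP member links, then compare per-flow loss.
    import argparse
    import socket
    import time
    from collections import Counter

    NUM_FLOWS = 64        # how many distinct 5-tuples to probe
    PKTS_PER_FLOW = 200   # packets sent per flow
    BASE_PORT = 33000     # first source/destination port

    def send(dest_ip):
        for i in range(NUM_FLOWS):
            sport = dport = BASE_PORT + i
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("", sport))  # fix the source port so the LACP hash is stable
            for seq in range(PKTS_PER_FLOW):
                s.sendto(f"{i} {seq}".encode(), (dest_ip, dport))
                time.sleep(0.001)
            s.close()

    def receive(seconds):
        counts = Counter()
        socks = []
        for i in range(NUM_FLOWS):
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.bind(("", BASE_PORT + i))
            s.setblocking(False)
            socks.append(s)
        deadline = time.time() + seconds
        while time.time() < deadline:
            for i, s in enumerate(socks):
                try:
                    while True:      # drain everything queued on this flow
                        s.recv(2048)
                        counts[i] += 1
                except BlockingIOError:
                    pass
            time.sleep(0.01)
        for i in range(NUM_FLOWS):
            loss = 100.0 * (1 - counts[i] / PKTS_PER_FLOW)
            print(f"flow port {BASE_PORT + i}: "
                  f"{counts[i]}/{PKTS_PER_FLOW} received ({loss:.1f}% loss)")

    if __name__ == "__main__":
        p = argparse.ArgumentParser()
        p.add_argument("mode", choices=["send", "recv"])
        p.add_argument("peer", nargs="?", help="destination IP when sending")
        args = p.parse_args()
        if args.mode == "send" and not args.peer:
            p.error("send mode needs a peer IP")
        send(args.peer) if args.mode == "send" else receive(30)

Run it in recv mode on peer B, then in send mode on peer A pointing at
B. A handful of flows with much higher loss than the rest suggests a
bad member link somewhere on the path, which a fixed-port traceroute
(mtr in its fixed UDP port mode, as toast0 mentions) can then localize
to a pair of routers.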
| azinman2 wrote:
| Given enough of these fault-analysis articles, I wonder if it's
| possible to compile them into some kind of decision-tree-like
| interface where you can describe your problem and have it guide
| you through possible failure scenarios (plus diagnostic steps).
| Would be cool to collect all of this knowledge somewhere beyond
| Google, as this is the type of stuff where Google often breaks
| down.
| toast0 wrote:
| 1) Run tcpdump on both sides and compare.
|
| 2) If both sides have the same tcpdump, it's not a network
| problem. Find the software problem. truss or strace can help.
|
| 3) If the sides differ, figure out whether the network is broken
| or the OS/network card is lying.
|
| 4) If the network is broken, fix it ;)
|
| 5) If the OS/network card is lying, turn off the lying (mostly
| offloads, like segmentation, large receive, and checksum) and go
| back to step 1.
|
| This is basically the common debugging pattern: I'm not getting
| the results that I expect, so find a way to observe when/where
| the data in transit changes from what I expect to something
| else, being as explicit as needed along the way about what data
| is expected. The closer you can narrow down where the failure
| occurs, the more likely you are to find it, or to find the
| person responsible for fixing it, hand them your investigation
| results, and let them fix it.
| physicsgraph wrote:
| I agree with your suggestion conceptually, but the starting
| point of a decision tree strongly depends on many factors: which
| version of the software is being used, how the software was
| compiled, running on what OS, with what patches, on what
| hardware, under what environmental conditions, in support of
| what application usage patterns, under what load, etc. Merely
| capturing the (potentially) relevant input conditions becomes
| challenging, never mind the process of eliminating irrelevant
| variables. And that's all premised on the idea that the problem
| is recurring (rather than some fluke that no one else has
| encountered).
|
| I think that's why Stack Overflow-style sites focused on shallow
| conditions flourish -- the deep dives are usually specific to a
| given situation.
| sneak wrote:
| My meta-fix for such a thing would be to hack up a script to
| start putting these interface error counters into Prometheus and
| then alerting on a spike above some threshold.
___________________________________________________________________
(page generated 2021-02-06 23:00 UTC)
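A rough sketch of the kind of script sneak describes: republish the
per-interface error counters Linux keeps under
/sys/class/net/<iface>/statistics so Prometheus can scrape them and
alert on spikes. The metric names, port, and poll interval are
arbitrary choices, and the standard node_exporter already exposes
similar counters (node_network_receive_errs_total and friends), so
this is only illustrative.

    #!/usr/bin/env python3
    # netdev_errors_exporter.py - hypothetical sketch: expose interface
    # error counters from /sys/class/net/*/statistics to Prometheus.
    import os
    import time
    from prometheus_client import Gauge, start_http_server

    NET = "/sys/class/net"
    STATS = ["rx_errors", "tx_errors", "rx_crc_errors",
             "rx_dropped", "tx_dropped"]

    # One gauge per statistic, labelled by interface. A Gauge is used
    # (rather than a Counter) because we republish the kernel's running
    # total as-is instead of incrementing our own counter.
    gauges = {
        stat: Gauge(f"netdev_{stat}",
                    f"{stat} from {NET}/<iface>/statistics",
                    ["iface"])
        for stat in STATS
    }

    def scrape_once():
        for iface in os.listdir(NET):
            for stat in STATS:
                path = os.path.join(NET, iface, "statistics", stat)
                try:
                    with open(path) as f:
                        gauges[stat].labels(iface=iface).set(int(f.read()))
                except (OSError, ValueError):
                    pass  # defensive: skip anything unreadable

    if __name__ == "__main__":
        start_http_server(9101)  # arbitrary port; point Prometheus at /metrics
        while True:
            scrape_once()
            time.sleep(15)

On the Prometheus side, an alert on something like
rate(netdev_rx_errors[5m]) > 0 would catch the kind of spike sneak
mentions; rate() is meaningful here because the underlying kernel
counter only increases, even though it is republished as a gauge.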