[HN Gopher] Tracking NFS problems down to the SFP level
       ___________________________________________________________________
        
       Tracking NFS problems down to the SFP level
        
       Author : CaliforniaKarl
       Score  : 30 points
        Date   : 2021-02-05 20:17 UTC (1 day ago)
        
 (HTM) web link (news.sherlock.stanford.edu)
 (TXT) w3m dump (news.sherlock.stanford.edu)
        
       | lykr0n wrote:
        | This seems like an issue that could have been resolved a lot
        | quicker if they were doing network monitoring on the host side
        | and the switch side.
        | 
        | Ideally you would be able to spot a large number of link errors
        | on a port/switch/host, and fix it before it becomes a problem.
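        | 
        | (A minimal sketch of the host-side half of that, assuming Linux
        | and its sysfs counters; the interface name is a placeholder,
        | and the switch side would come from SNMP or the vendor's API
        | instead.)
        | 
        |     # Poll per-interface error counters from Linux sysfs and
        |     # report any that increase between samples.
        |     import time
        |     from pathlib import Path
        | 
        |     IFACE = "eth0"        # placeholder interface name
        |     COUNTERS = ["rx_errors", "tx_errors",
        |                 "rx_crc_errors", "rx_dropped"]
        | 
        |     def read_counters():
        |         stats = Path(f"/sys/class/net/{IFACE}/statistics")
        |         return {c: int((stats / c).read_text()) for c in COUNTERS}
        | 
        |     prev = read_counters()
        |     while True:
        |         time.sleep(10)
        |         cur = read_counters()
        |         for name, value in cur.items():
        |             if value > prev[name]:
        |                 print(f"{IFACE} {name} +{value - prev[name]}")
        |         prev = cur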
        
       | toast0 wrote:
       | I no longer have access to it, but I wrote a tool to find these
       | types of problems at my last job. It didn't seem generally
       | applicable enough to try to get it open sourced (and I didn't
       | want to polish it enough for that either).
       | 
        | The key insight is that LACP is almost always configured to use
        | a hash of { Source IP, Dest IP, Protocol, Source Port, Dest
        | Port }, so that packets from each TCP and UDP flow will always
        | be sent on the same individual link. (This is directional,
        | though, so it may go from peer A to peer B on cable X and from
        | B to A on cable Y.)
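        | 
        | (A toy illustration of the idea, not any vendor's actual hash:
        | the 5-tuple deterministically picks one member link, so varying
        | only the source port spreads probe flows across the bundle.)
        | 
        |     import hashlib
        | 
        |     def member_link(src, dst, proto, sport, dport, n_links):
        |         # Toy stand-in for a switch's LAG hash
        |         key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
        |         h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
        |         return h % n_links
        | 
        |     for sport in range(40000, 40008):
        |         print(sport, member_link("10.0.0.1", "10.0.0.2",
        |                                  "udp", sport, 5001, 4))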
       | 
       | So the way to confirm a broken link is to connect a bunch of UDP
       | flows (on different ports) between peer A and peer B, send data,
       | and measure loss (and/or delay!). If you see zero loss / uniform
        | delay, then either none of your flows cross the broken link, the
        | problem isn't a broken link, or the rate of issues is too low to
       | detect. Once you've found a broken flow, you can use a 'paris
       | traceroute' tool to confirm the IP routers it's between. Paris
       | traceroute holds the UDP source and destination ports fixed, so
        | the route over LACP should stay the same. I contributed support for
       | this in mtr, but I'm not sure if it still works; if you see 100%
       | packet loss with mtr in fixed udp port mode, send me an email
       | (address in profile) and I'll ask you for data and try to debug.
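        | 
        | (A rough sketch of that probe, assuming Python's standard
        | socket module and a cooperating UDP echo responder on peer B;
        | the hostname, port range, and packet count are placeholders.
        | Note that an echo round trip exercises both directions, so a
        | lossy flow could be broken on either leg.)
        | 
        |     import socket
        | 
        |     PEER = ("peer-b.example.com", 9000)   # placeholder echo svc
        |     PACKETS = 200
        | 
        |     for sport in range(42000, 42032):     # one flow per src port
        |         s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        |         s.bind(("", sport))
        |         s.settimeout(0.2)
        |         received = 0
        |         for i in range(PACKETS):
        |             s.sendto(i.to_bytes(4, "big"), PEER)
        |             try:
        |                 s.recv(64)
        |                 received += 1
        |             except socket.timeout:
        |                 pass
        |         loss = 100.0 * (PACKETS - received) / PACKETS
        |         print(f"sport {sport}: {loss:.1f}% loss")
        |         s.close()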
       | 
        | Once you narrow it down to the two routers the link is between,
        | it should be easy enough to confirm: usually through link
        | quality counters, and if not, by pulling links and seeing
        | whether things get better or worse.
       | 
        | If you have a long network path, and most links are LACP, the
        | total number of possible paths between two peers gets large
        | (four hops of four-member bundles already gives 4^4 = 256
        | directed paths), and there's a chance that you might not be
        | able to survey them all from a pair of servers; so you may
        | have to try a few different hosts.
       | 
       | You can find packet loss, but also congestion/buffering this way.
       | In an ideal world, all the link error counters would be collected
       | and anomalies would be addressed, but it seems that doesn't
       | always happen.
        
         | pwarner wrote:
          | Yeah, we had major problems with a top-2 cloud provider where
          | our on-prem-to-cloud link dropped packets. We narrowed it down
          | with iperf to the packet loss only happening on some ephemeral
          | source ports: a given port was either always OK or always
          | slow. We destroyed and recreated the cloud gateways and all
          | was well. I should say another engineer figured it out. The
          | cloud provider tried to blame our side. They did not excel at
          | operations...
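          | 
          | (A sketch of that kind of sweep; the server name is a
          | placeholder, iperf3's --cport flag pins the client source
          | port in reasonably recent versions, and the JSON field names
          | may differ by version.)
          | 
          |     import json, subprocess
          | 
          |     SERVER = "cloud-gw.example.com"   # placeholder iperf3 server
          | 
          |     for cport in range(50000, 50016):
          |         out = subprocess.run(
          |             ["iperf3", "-c", SERVER, "-u", "-b", "10M", "-t", "3",
          |              "--cport", str(cport), "-J"],
          |             capture_output=True, text=True, check=True).stdout
          |         summary = json.loads(out)["end"]["sum"]
          |         print(f"cport {cport}: "
          |               f"{summary.get('lost_percent', '?')}% lost")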
        
       | azinman2 wrote:
        | Given enough of these fault-analysis articles, I wonder if it's
        | possible to compile them into some kind of decision-tree-like
        | interface where you can describe your problems and have it
        | guide you through possible failure scenarios (plus diagnostic
        | steps). Would be cool to collect all of this knowledge outside
        | of Google, as this is the type of stuff where Google often
        | breaks down.
        
         | toast0 wrote:
         | 1) run tcpdump on both sides and compare
         | 
         | 2) If both sides have the same tcpdump, it's not a network
         | problem. Find the software problem. truss or strace can help
         | 
          | 3) If the sides differ, figure out if the network is broken, or
          | the os/network card is lying.
         | 
         | 4) If the network is broken, fix it ;)
         | 
          | 5) If the OS/network card is lying, turn off the lying (mostly
          | offloads, like segmentation, large receive, and checksum) and
          | go back to step 1 (see the sketch below)
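          | 
          | (A sketch of step 5 on Linux, shelling out to ethtool; the
          | interface name is a placeholder and some NICs won't let you
          | toggle every feature.)
          | 
          |     # Disable NIC offloads so tcpdump sees what is actually
          |     # on the wire (tso/gso/gro/lro, rx/tx checksumming).
          |     import subprocess
          | 
          |     IFACE = "eth0"
          |     for feature in ["tso", "gso", "gro", "lro", "rx", "tx"]:
          |         subprocess.run(["ethtool", "-K", IFACE, feature, "off"],
          |                        check=False)
          | 
          |     subprocess.run(["ethtool", "-k", IFACE])  # show the result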
         | 
          | This is basically the common debugging pattern: I'm not getting
          | the results that I expect, so find a way to observe when/where
          | the data in flight changes from what I expect to something
          | else, being as explicit as needed along the way about what
          | data is expected. The closer you can narrow down where the
          | failure occurs, the more likely you are to find it, or to find
          | the person responsible for fixing it and hand them your
          | investigation results.
        
         | physicsgraph wrote:
         | I agree with your suggestion conceptually, but the starting
         | point of a decision tree strongly depends on many factors:
          | which version of the software is being used, how the software
          | was compiled, running on what OS, with what patches, on what
          | hardware, under what environmental conditions, in support of
          | what application usage patterns, under what load, etc. Merely
         | capturing the (potentially) relevant input conditions becomes
         | challenging, never mind the process of eliminating irrelevant
         | variables. And that's all premised on the concept that a
         | problem is recurring (rather than some fluke that no one else
         | has encountered).
         | 
         | I think that's why Stack Overflow websites focused on shallow
         | conditions flourish -- the deep dives are usually specific to a
         | given situation.
        
       | sneak wrote:
        | My meta-fix for such a thing would be to hack up a script to
        | start putting these interface error counters into Prometheus,
        | and then alert on a spike above some threshold.
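        | 
        | (A rough sketch of such a script, assuming Linux sysfs and the
        | prometheus_client library; the metric name and exporter port
        | are placeholders, and node_exporter already exposes most of
        | these counters if you can run it. The alert itself would be a
        | Prometheus rule on the rate of increase, not shown here.)
        | 
        |     import time
        |     from pathlib import Path
        |     from prometheus_client import Gauge, start_http_server
        | 
        |     ERRORS = Gauge("net_interface_errors",
        |                    "Per-interface error counters from sysfs",
        |                    ["interface", "counter"])
        |     COUNTERS = ["rx_errors", "tx_errors",
        |                 "rx_crc_errors", "rx_dropped"]
        | 
        |     def collect():
        |         for iface in Path("/sys/class/net").iterdir():
        |             for counter in COUNTERS:
        |                 path = iface / "statistics" / counter
        |                 if path.exists():
        |                     ERRORS.labels(iface.name, counter).set(
        |                         int(path.read_text()))
        | 
        |     start_http_server(9101)   # placeholder exporter port
        |     while True:
        |         collect()
        |         time.sleep(15)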
        
       ___________________________________________________________________
       (page generated 2021-02-06 23:00 UTC)