[HN Gopher] Just Say No to Paxos Overhead: Replacing Consensus w...
       ___________________________________________________________________
        
       Just Say No to Paxos Overhead: Replacing Consensus with Network
       Ordering (2016)
        
       Author : yagizdegirmenci
       Score  : 41 points
       Date   : 2021-05-04 20:35 UTC (2 hours ago)
        
 (HTM) web link (www.usenix.org)
 (TXT) w3m dump (www.usenix.org)
        
       | infogulch wrote:
       | Kinda neat. Splits the big problem of consensus into ordering and
       | replication, and then leans on a network device like a switch to
       | 'solve' the ordering problem in the context of a single data
       | center. The key observation is that all those packets are going
        | through the switch anyway, and the switch has enough spare
        | compute to maintain a counter and add it as a header field on
        | each packet, and it can easily be dynamically programmed with
        | SDN...
       | 
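        | A minimal sketch of that stamping step, in Go (all types and
        | names here are illustrative, not from the paper):
        | 
        |     type GroupID uint32
        | 
        |     type Packet struct {
        |         Group  GroupID
        |         SeqNum uint64 // written by the sequencer
        |         Body   []byte
        |     }
        | 
        |     // Sequencer keeps one monotonic counter per multicast
        |     // group and stamps it into every packet it forwards.
        |     type Sequencer struct {
        |         counters map[GroupID]uint64
        |     }
        | 
        |     func NewSequencer() *Sequencer {
        |         return &Sequencer{counters: map[GroupID]uint64{}}
        |     }
        | 
        |     func (s *Sequencer) Stamp(p *Packet) {
        |         s.counters[p.Group]++
        |         p.SeqNum = s.counters[p.Group]
        |         // ...then multicast p to the group members as usual
        |     }
        | 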
       | I bet public clouds could offer this as a specialized 'ordered
       | multicast vnet' infrastructure primitive to build intra-dc
       | replicated systems on top of.
        
         | strictfp wrote:
         | Reminds me of the hack of using auxiliary information such as
         | AWS group metadata to determine ordering.
         | 
         | If there's already an ordering system in place, directly or
         | indirectly, why not use it?
        
       | jpgvm wrote:
        | They aren't wrong, but they are basically just digging up the
        | Tandem architecture and adapting it to the RoCE/FCoE-enhanced
        | Ethernet we have today.
       | 
        | Ironically, that is where most of our newer interconnects have
        | their heritage. PCI-e is a descendant of Infiniband, which in
        | turn is a descendant of ServerNet, which was developed at
        | Tandem as the internal interconnect for their NonStop systems,
        | for the very use cases this paper describes.
       | 
        | i.e. the very ordered networking this relies on, and the
        | architectures it enables, were invented ~25 years ago.
       | 
        | Unfortunately, the reality today is that we don't build data
        | centres, especially not for very highly available systems.
        | Instead we pay the latency penalty to build geographically
        | redundant systems using cheap rented virtual hardware from 3rd
        | parties.
       | 
        | This is in part because the always-on nature of the internet
        | changed the availability requirements for most software (which
        | is now delivered as SaaS) from business hours in one timezone
        | to always-on everywhere in the world. "The DC lost power", or
        | some variation of that, is no longer an acceptable excuse for a
        | business-critical application to be down.
       | 
       | I'm old and grumpy though so whatever, everything old becomes new
       | again eventually.
        
         | AceJohnny2 wrote:
          | That's funny, the tech lead at my last company was formerly
          | from Tandem, and the architecture of our HA product reflected
          | that :)
         | 
         | It was a cool system, with architecture completely alien to the
         | consumer products I work on nowadays, and I kinda miss QNX.
         | 
         | Did you know a gregk?
        
         | strictfp wrote:
         | When working on networked control systems I realized that
         | ordered networking is the reason why automation still uses
         | proprietary protocols and interconnects such as Profibus and
         | Profinet.
         | 
         | If you want good guarantees you need determinism.
        
       | wahern wrote:
       | > The first aspect of our design is network serialization, where
       | all OUM packets for a particular group are routed through a
       | sequencer on the common path.
       | 
       | This solution actually just shunts the problem to a different
       | layer. To be robust to sequencer failure and rollover, you will
       | need to rely on an inner consensus protocol to choose sequencers.
        | Which is basically how Multi-Paxos, Raft, etc. work: you use
        | the costly consensus protocol to pick a leader, and thereafter
        | simply rely on the leader to serialize and ensure consistency.
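        | 
        | As a sketch of why (illustrative Go; the epoch and function
        | names are mine, not the paper's): every stamp has to say which
        | sequencer incarnation produced it, and that incarnation has to
        | come from somewhere consensus-backed.
        | 
        |     type Stamp struct {
        |         Epoch uint64 // sequencer incarnation, consensus-chosen
        |         Seq   uint64 // counter within that incarnation
        |     }
        | 
        |     // Deliverable reports whether a receiver may apply the
        |     // next message, given the last stamp it applied.
        |     func Deliverable(last, next Stamp) bool {
        |         switch {
        |         case next.Epoch == last.Epoch:
        |             return next.Seq == last.Seq+1 // no gaps in-epoch
        |         case next.Epoch > last.Epoch:
        |             // New sequencer: only safe after a consensus
        |             // round agrees on where the old epoch ended.
        |             return false
        |         default:
        |             return false // stale sequencer, drop
        |         }
        |     }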
       | 
       | It seems like an interesting paper w/ a novel engineering model
       | and useful proofs. But from an abstract algorithmic perspective
       | it doesn't actually offer anything new, AFAICT. There are an
       | infinite number of ways to shuffle things around to minimize the
       | components and data paths directly reliant on the consensus
       | protocol. None obviate the need for an underlying consensus
       | protocol, absent problems where invariants can be maintained
       | independently.
        
       | brighton36 wrote:
        | These consensus systems are usually solutions in search of a
        | problem. It's pretty rare for them to offer an efficiency gain
        | in practice...
        
         | pfraze wrote:
          | To be clear, this paper is referring to non-decentralized
          | (highly consistent, I assume) consensus algorithms where the
          | goal is to operate a cluster of machines as one logical
          | system.
         | You use this, for instance, to maintain high uptime in the face
         | of individual machines going down.
         | 
         | I suspect you were reacting to decentralized consensus
         | (blockchains) which is a pretty different space.
        
       | klodolph wrote:
       | Yeah, this paper gets dredged up every now and then. Seems a bit
       | like a cheap trick to say "just say no to Paxos overhead" and
       | then _sequence network requests._ I'm all for exploring
       | alternatives to Paxos, but if you're doing something so novel, my
       | response is that I'll believe it when I see it operate at scale.
       | 
       | I'm just not sure how you would sequence requests in a typical
       | setup, with Clos fabrics everywhere, possibly when your
       | redundancy group is spread across different geographical
       | locations. Wouldn't you need some kind of queue to reorder
       | messages? That queue could get large, and quickly.
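        | 
        | As a sketch of what that queue looks like (Go, names mine):
        | the receiver has to hold back everything past a gap, so the
        | buffer grows with however far out of order the fabric can
        | deliver.
        | 
        |     type Reorderer struct {
        |         next    uint64            // next seq num to deliver
        |         pending map[uint64][]byte // held-back messages
        |     }
        | 
        |     func NewReorderer() *Reorderer {
        |         // Sequence numbers assumed to start at 1.
        |         return &Reorderer{next: 1, pending: map[uint64][]byte{}}
        |     }
        | 
        |     func (r *Reorderer) Receive(seq uint64, msg []byte,
        |             deliver func([]byte)) {
        |         r.pending[seq] = msg
        |         // Drain in order; anything past a gap stays queued.
        |         for {
        |             m, ok := r.pending[r.next]
        |             if !ok {
        |                 return
        |             }
        |             deliver(m)
        |             delete(r.pending, r.next)
        |             r.next++
        |         }
        |     }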
       | 
       | Paxos and Raft have the advantage of being simple and easy to
       | understand. (Not necessarily easy to incorporate into system
       | designs, sure, and the resulting systems can be fairly
       | complicated, but Paxos and Raft themselves are simple enough to
       | fit on notecards.)
        
         | gfv wrote:
         | >in a typical setup, with Clos fabrics everywhere
         | 
         | You choose a single spine switch to carry the multicast
          | messages destined for the process group. The paper also
          | explicitly notes that different process groups need not share
          | a sequencer.
         | 
         | >when your redundancy group is spread across different
         | geographical locations
         | 
          | The paper's applicability is limited to data-center networks
          | with programmable switching hardware.
        
       ___________________________________________________________________
       (page generated 2021-05-04 23:00 UTC)