[HN Gopher] Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering
___________________________________________________________________

Just Say No to Paxos Overhead: Replacing Consensus with Network
Ordering (2016)

Author : yagizdegirmenci
Score  : 41 points
Date   : 2021-05-04 20:35 UTC (2 hours ago)

(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)

| infogulch wrote:
| Kinda neat. Splits the big problem of consensus into ordering and
| replication, then leans on a network device like a switch to
| 'solve' the ordering problem in the context of a single data
| center. The key observation is that all those packets are going
| through the switch anyway, the switch has enough spare compute to
| maintain a counter and add it as a header to packets, and it can
| easily be dynamically programmed with SDN...
|
| I bet public clouds could offer this as a specialized 'ordered
| multicast vnet' infrastructure primitive to build intra-dc
| replicated systems on top of.

| strictfp wrote:
| Reminds me of the hack of using auxiliary information, such as AWS
| group metadata, to determine ordering.
|
| If there's already an ordering system in place, directly or
| indirectly, why not use it?

| jpgvm wrote:
| They aren't wrong, but they are basically just digging up the
| Tandem architecture and adapting it to the RoCE/FCoE-enhanced
| Ethernet we have today.
|
| Ironically, that is where most of our newer interconnects have
| their heritage. PCI-e is a descendant of InfiniBand, which in turn
| is a descendant of ServerNet, which was developed at Tandem as the
| internal interconnect for their NonStop systems, for the very use
| cases this paper describes.
|
| i.e. the very ordered networking this relies on, and the
| architectures it enables, were invented ~25 years ago.
|
| Unfortunately, the reality today is that we don't build our own
| data centres, especially not for very highly available systems.
| Instead we pay the latency penalty to build geographically
| redundant systems on cheap rented virtual hardware from third
| parties.
|
| This is in part because the always-on nature of the internet
| changed the availability requirements for most software (which is
| now delivered as SaaS) from business hours in one timezone to
| always-on everywhere in the world. "The DC lost power", or some
| variation of that, is no longer an acceptable excuse for a
| business-critical application to be down.
|
| I'm old and grumpy though, so whatever; everything old becomes new
| again eventually.

| AceJohnny2 wrote:
| That's funny, the tech lead at my last company was formerly from
| Tandem, and the architecture of our HA product reflected that :)
|
| It was a cool system, with an architecture completely alien to the
| consumer products I work on nowadays, and I kinda miss QNX.
|
| Did you know a gregk?

| strictfp wrote:
| When working on networked control systems I realized that ordered
| networking is the reason automation still uses proprietary
| protocols and interconnects such as Profibus and Profinet.
|
| If you want good guarantees, you need determinism.

| wahern wrote:
| > The first aspect of our design is network serialization, where
| > all OUM packets for a particular group are routed through a
| > sequencer on the common path.
|
| This solution actually just shunts the problem to a different
| layer. To be robust to sequencer failure and rollover, you will
| need to rely on an inner consensus protocol to choose sequencers.
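A minimal sketch of the counter-stamping idea the comments above describe,
written as an ordinary Go program rather than the in-switch implementation
the paper targets; the `OUMHeader` and `Sequencer` names and fields are
illustrative, and the epoch field is a stand-in for whatever marks a change
of sequencer:

```go
package main

import "fmt"

// OUMHeader is an illustrative stand-in for the ordering header a
// sequencer prepends to each multicast packet: the destination group,
// the sequencer epoch that stamped it, and its position in that epoch.
type OUMHeader struct {
	GroupID uint32
	Epoch   uint64 // bumped whenever a new sequencer is designated
	SeqNum  uint64 // monotonically increasing within an epoch
}

// Sequencer keeps one counter per group. In the paper's setting this
// state would live in a programmable switch; here it is an ordinary map.
type Sequencer struct {
	epoch    uint64
	counters map[uint32]uint64
}

func NewSequencer(epoch uint64) *Sequencer {
	return &Sequencer{epoch: epoch, counters: make(map[uint32]uint64)}
}

// Stamp assigns the next sequence number for the packet's group. The
// payload is untouched; ordering metadata is all that gets added.
func (s *Sequencer) Stamp(groupID uint32) OUMHeader {
	s.counters[groupID]++
	return OUMHeader{GroupID: groupID, Epoch: s.epoch, SeqNum: s.counters[groupID]}
}

func main() {
	seq := NewSequencer(1) // epoch 1: the first designated sequencer
	for i := 0; i < 3; i++ {
		fmt.Printf("%+v\n", seq.Stamp(42)) // replicas of group 42 see 1, 2, 3, ...
	}
}
```

Stamping is the only thing the network adds; replication and recovery stay
with the replicas.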
| That is basically how Multi-Paxos, Raft, etc. work: you use the
| costly consensus protocol to pick a leader, and thereafter simply
| rely on the leader to serialize and ensure consistency.
|
| It seems like an interesting paper with a novel engineering model
| and useful proofs. But from an abstract algorithmic perspective it
| doesn't actually offer anything new, AFAICT. There are an infinite
| number of ways to shuffle things around to minimize the components
| and data paths directly reliant on the consensus protocol. None
| obviate the need for an underlying consensus protocol, absent
| problems where invariants can be maintained independently.

| brighton36 wrote:
| These consensus systems are usually solutions in search of a
| problem. It's pretty rare for them to offer real efficiency gains
| in practice...

| pfraze wrote:
| To be clear, this paper is referring to non-decentralized
| (strongly consistent, I assume) consensus algorithms, where the
| goal is to operate a cluster of machines as one logical system.
| You use this, for instance, to maintain high uptime in the face of
| individual machines going down.
|
| I suspect you were reacting to decentralized consensus
| (blockchains), which is a pretty different space.

| klodolph wrote:
| Yeah, this paper gets dredged up every now and then. It seems a
| bit like a cheap trick to say "just say no to Paxos overhead" and
| then _sequence network requests_. I'm all for exploring
| alternatives to Paxos, but if you're doing something so novel, my
| response is that I'll believe it when I see it operate at scale.
|
| I'm just not sure how you would sequence requests in a typical
| setup, with Clos fabrics everywhere, possibly with your redundancy
| group spread across different geographical locations. Wouldn't you
| need some kind of queue to reorder messages? That queue could get
| large, quickly.
|
| Paxos and Raft have the advantage of being simple and easy to
| understand. (Not necessarily easy to incorporate into system
| designs, sure, and the resulting systems can be fairly
| complicated, but Paxos and Raft themselves are simple enough to
| fit on notecards.)

| gfv wrote:
| > in a typical setup, with Clos fabrics everywhere
|
| You choose a single spine switch to carry the multicast messages
| destined for the process group. The paper also explicitly notes
| that different process groups need not share a sequencer.
|
| > when your redundancy group is spread across different
| > geographical locations
|
| The paper's applicability is limited to data-center networks with
| programmable switching hardware.
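To make the reordering and failover concerns above concrete, here is a
matching sketch of the receiver side, using the same illustrative header as
the earlier sequencer sketch; the `Receiver` type and the returned strings
are stand-ins, not the paper's actual protocol. In-order packets take a
coordination-free fast path; only a detected gap (dropped multicast) or an
epoch change (sequencer failover) forces the replicas into an agreement
step.

```go
package main

import "fmt"

// OUMHeader matches the illustrative header from the sequencer sketch.
type OUMHeader struct {
	GroupID uint32
	Epoch   uint64
	SeqNum  uint64
}

// Receiver is one replica's view: which epoch it is in and the last
// sequence number it delivered in order.
type Receiver struct {
	epoch   uint64
	lastSeq uint64
}

// Deliver classifies a stamped message. Only the in-order case is free of
// cross-replica coordination in this toy model.
func (r *Receiver) Deliver(h OUMHeader) string {
	switch {
	case h.Epoch != r.epoch:
		// A new sequencer was designated; replicas have to agree on
		// where the old epoch ended before adopting the new one.
		return "view change: agree on end of old epoch"
	case h.SeqNum == r.lastSeq+1:
		r.lastSeq = h.SeqNum
		return "deliver in order (fast path, no coordination)"
	case h.SeqNum > r.lastSeq+1:
		// A multicast was dropped on the way to this replica; replicas
		// must agree to treat it as a no-op or recover it from a peer.
		return "gap detected: coordinate before delivering"
	default:
		return "old or duplicate: ignore"
	}
}

func main() {
	r := &Receiver{epoch: 1}
	for _, h := range []OUMHeader{{42, 1, 1}, {42, 1, 2}, {42, 1, 4}, {42, 2, 1}} {
		fmt.Printf("epoch %d seq %d -> %s\n", h.Epoch, h.SeqNum, r.Deliver(h))
	}
}
```

On this reading, the consensus protocol wahern points to does not
disappear; it moves off the per-request path and onto the drop-recovery and
sequencer-failover paths.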