[HN Gopher] Paxos vs. Raft: Have we reached consensus on distrib...
       ___________________________________________________________________
        
       Paxos vs. Raft: Have we reached consensus on distributed consensus?
        
       Author : yagizdegirmenci
       Score  : 113 points
       Date   : 2021-07-14 10:39 UTC (2 days ago)
        
 (HTM) web link (charap.co)
 (TXT) w3m dump (charap.co)
        
       | aneutron wrote:
        | It's not possible to do so, since it would be done
        | asynchronously.
       | 
       | I'm sorry, I'll see myself out.
        
         | erik_seaberg wrote:
         | Obviously we need a third distributed consensus algorithm as a
         | tiebreaker.
        
           | _jal wrote:
           | That really should be generalized to support N algorithms.
           | Some sort of quorum protocol, maybe.
        
       | josh2600 wrote:
       | Ok so having worked on distributed consensus a bunch here are a
       | couple thoughts in no particular order:
       | 
       | * In the real world, servers misbehave, like, a lot, so you have
       | to deal with that fact. All assumptions about network robustness
       | in particular will be proven wrong on a long enough timeline.
       | 
       | * Leader election in a world without a robust network is an
       | exercise in managing acceptable failure tolerances. Every
       | application has a notion of acceptable failure rates or
       | acceptable downtime. Discovering that is a non-trivial effort.
       | 
       | Jeff Dean and others at Google famously came to the conclusion
       | that it was ok for some parts of the Gmail service to be down for
       | limited periods of time. Accepting that all self-healing/robust
       | systems will eventually degrade and have to be restored is the
       | first step in building something manageable. The AXD301 is the
       | most robust system ever built by humans to my knowledge (I think
       | it did 20 years of uptime in production). Most other systems will
       | fail long before that. Managing systems as they fail is an art,
       | particularly as all systems operate in a degraded state.
       | 
       | In short, in a lab environment networks function really well. In
       | the real world, it's a jungle. Plan accordingly.
        
         | goodpoint wrote:
          | Most importantly: leader election should be used only as a
          | last resort.
         | 
         | Leaderless systems with eventual consistency are often possible
         | and more robust.
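A minimal illustration of the leaderless, eventually consistent style this comment describes is a state-based CRDT such as a grow-only counter. This sketch is not from the comment; the names (`merge`, `r1`, `r2`) are invented for the example:

```python
def merge(a, b):
    """State-based G-counter merge: keep the per-replica maximum.
    The merge is commutative, associative, and idempotent, so
    replicas converge to the same total without any leader."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

# Two replicas accept increments concurrently, then gossip their states.
r1 = {"r1": 3}                      # replica r1 saw 3 increments
r2 = {"r2": 5}                      # replica r2 saw 5 increments
merged = merge(r1, r2)
assert merged == merge(r2, r1)      # exchange order doesn't matter
assert sum(merged.values()) == 8    # both sides agree on the total
```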
        
       | brickbrd wrote:
       | In practice, for the systems where I built a replication system
       | from the ground up, once you factor in all the performance,
       | scale, storage layer and networking implications, this Paxos vs.
       | Raft thing is largely a theoretical discussion.
       | 
        | Basic Paxos is, well, too basic, and people mostly run
       | modifications of this to get higher throughput and better
       | latencies. After those modifications, it does not look very
       | different from Raft with modifications applied for storage
       | integration and so on.
        
         | arielweisberg wrote:
          | This is exactly my take as well. Multi-Paxos and Raft seem very
         | similar to me. Calling out what the exact differences and
         | tradeoffs are would be good blog/research fodder.
        
         | ignoramous wrote:
          | > _Basic Paxos is, well, too basic, and people mostly run
          | modifications of this to get higher throughput and better
          | latencies. After those modifications, it does not look very
          | different from Raft._
         | 
         | Alan Vermeulen, one of the founding AWS engineers, calls
         | inventing newer solutions to distributed consensus an exercise
         | in _re-discovering_ Paxos.
         | 
         | https://youtu.be/QVvFVwyElLY?t=2367
        
       | tschellenbach wrote:
       | Stream's consensus algorithms are all Raft based, the Go Raft
       | ecosystem is very solid. We did end up forking some of the
       | libraries, but nothing major.
        
       | lowbloodsugar wrote:
       | Problem: "Raft protocol described and analyzed in English has
       | problems." Solution: "Here is a modification to the protocol,
       | described in English, that does not have such a problem."
       | 
        | Seems like there is a common failure mode of "describing
        | distributed protocols in plain English and thinking this is a
        | proof"?
       | 
       | The actual problem here is not "There was a problem in the Raft
       | protocol, and I figured it out and provided a work around". The
       | actual problem here is "Reasonably experienced software engineers
       | reviewed the specification and didn't see any problems." This
       | actual problem has not been addressed by the article.
        
         | adsharma wrote:
         | Author seems to be using https://github.com/ailidani/paxi for
         | actual implementation and proof.
         | 
         | I'm more of a python/rust guy. There have been some attempts to
         | make model checkers in rust:
         | https://github.com/stateright/stateright
         | 
         | The issue is that rust is a very large language and it's hard
         | to get it right.
         | 
         | I have a python implementation of raft over here:
         | 
         | https://github.com/adsharma/raft/tree/master/raft/states
         | 
          | That's small enough to be self-contained, and could perhaps be
          | run through a model checker some day and transpiled to many
          | statically typed languages.
         | 
         | The issue with TLA+ proofs such as:
         | 
         | https://github.com/fpaxos/raft.tla
         | 
         | is that it's hard to tell if a particular C++ or Rust
         | implementation conforms to the spec.
         | 
         | So how do we check and transpile?
         | 
         | * https://www.philipzucker.com/Modelling_TLA_in_z3py/ *
         | https://github.com/adsharma/py2many
        
           | hwayne wrote:
           | > is that it's hard to tell if a particular C++ or Rust
           | implementation conforms to the spec.
           | 
            | You can use either TLC's graph output or simulation mode to
            | generate lots of TLA+ traces. Then you write an orchestrator
            | that reads each trace as JSON, runs the same actions in each
            | step on the program, and confirms they reach the same final
            | states. You can go the other way, too, where you take program
            | traces and map them onto the TLA+ state space.
           | 
           | It's a bit tricky to set up, but it's possible. I've had
           | several clients who've built something similar.
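The orchestrator described above can be sketched in a few lines of Python. The JSON trace shape and the toy `CounterImpl` here are assumptions for illustration only, not TLC's actual output format:

```python
import json

def replay_trace(trace_json, impl):
    """Replay a model-checker trace (dumped as JSON) against an
    implementation, checking the implementation's state after each step."""
    for step in json.loads(trace_json):
        impl.apply(step["action"], step.get("args", {}))
        if impl.state() != step["state"]:
            return False          # divergence: implementation != model
    return True

class CounterImpl:
    """Toy stand-in for the real system: a counter with two actions."""
    def __init__(self):
        self.value = 0
    def apply(self, action, args):
        if action == "Inc":
            self.value += args.get("by", 1)
        elif action == "Reset":
            self.value = 0
    def state(self):
        return {"value": self.value}

trace = json.dumps([
    {"action": "Inc", "args": {"by": 2}, "state": {"value": 2}},
    {"action": "Reset", "state": {"value": 0}},
])
assert replay_trace(trace, CounterImpl())
```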
        
             | im_down_w_otp wrote:
             | This is, incidentally, one of the kinds of use cases that
             | informed part of our product functionality.
             | 
             | 1. A way to write down fairly complicated system/behavior-
             | level specs for distributed systems.
             | 
             | 2. A way to instrument implementations to get the necessary
             | signal out of the system to verify it nominally.
             | 
             | 3. A way to automatically explore adverse conditions to
             | validate that the system keeps behaving as needed and
             | specifically isolate contributing factors for when it
             | doesn't.
             | 
             | The very first prototype was leveraged (in-part, as there
              | was much more to the system under test) in a consensus
             | mechanism that was central to the correct, stable, and most
             | importantly... safe... operation of some critical software
             | infrastructure. It was necessary to try to explore where
             | the model assumptions broke down in practice in a real
             | system.
        
             | adsharma wrote:
              | https://www.microsoft.com/en-us/research/blog/p-programming-...
              | makes the case better than I can for why it's important to
              | have the specification and implementation come from the
              | same source, as opposed to maintaining two and then
              | verifying via traces that they're compatible.
             | 
             | But definitely worth doing (like Jepsen does) to find
             | problems in the implementation.
        
               | hwayne wrote:
               | Yeah, if you can do it economically, having a single
               | source is preferable.
        
           | toolslive wrote:
            | You can abstract IO and asynchrony away from the
            | implementation and model it as a pure function. Then build
            | and inspect the state space. A detailed description was done
            | here:
           | 
            | https://incubaid.wordpress.com/2012/10/25/caulking-your-dist...
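A rough sketch of that idea, assuming a toy two-node replication protocol (the `step`/`explore` names and the message shapes are invented for illustration): each transition is a pure function, so every message delivery order can be enumerated and an invariant checked in each reachable state:

```python
from collections import deque

def step(state, msg):
    # Pure transition: (state, message) -> (new state, messages sent).
    # No IO and no clocks, so the full state space is enumerable.
    node, kind, val = msg
    s = dict(state)
    out = ()
    if node == "A" and kind == "write":
        s["A"] = val
        out = (("B", "repl", val),)        # replicate the write to B
    elif node == "B" and kind == "repl":
        s["B"] = max(s["B"], val)          # apply monotonically
    return s, out

def explore(init_state, init_msgs, invariant):
    """BFS over (node states, in-flight messages), delivering the
    pending messages in every possible order."""
    seen = set()
    frontier = deque([(init_state, init_msgs)])
    while frontier:
        state, msgs = frontier.popleft()
        key = (tuple(sorted(state.items())), tuple(sorted(msgs)))
        if key in seen:
            continue
        seen.add(key)
        assert invariant(state), f"invariant violated: {state}"
        for i, m in enumerate(msgs):
            s2, out = step(state, m)
            frontier.append((s2, msgs[:i] + msgs[i + 1:] + out))
    return len(seen)

# Invariant: B only ever holds a value that was actually written (or 0).
n = explore({"A": 0, "B": 0},
            (("A", "write", 1), ("A", "write", 2)),
            lambda s: s["B"] in (0, 1, 2))
```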
        
           | littlestymaar wrote:
           | > The issue is that rust is a very large language and it's
           | hard to get it right.
           | 
              | In my experience, Rust's type system (especially the _Send_
              | and _Sync_ traits, which prevent data races) makes it
              | easier - not harder - to get big things to work.
        
             | adsharma wrote:
             | My intention is to reuse the rust type system where we can
             | and yet keep the language small enough to be able to
             | implement and verify non-trivial real world programs.
             | 
              | https://github.com/adsharma/py2many/blob/main/doc/langspec.m...
              | https://github.com/adsharma/py2many/blob/main/CHANGELOG.md#r...
        
       | hutrdvnj wrote:
       | Can someone provide a short description of the differences
       | between Paxos and Raft?
        
         | polynomial wrote:
          | Paxos is a family of algorithms aimed at distributed
          | consistency / monotonic state guarantees. However, Paxos
          | allows a server with an out-of-date log to be elected leader
          | (provided it then brings its log up to date), whereas Raft
          | requires a server's log to be up-to-date before it can be
          | elected leader. Moreover, Raft has a reputation for being more
          | understandable than Paxos.
         | 
          | edit: It looks like the linked paper covers the main
          | differences, albeit in a more detailed manner. Also, it seems
          | as if the author rejects the idea that Raft is more
          | understandable and makes a case for why he thinks Paxos is
          | more understandable.
        
           | anonymousDan wrote:
           | Personally I find paxos more understandable. For example, KTH
           | have a really nice incremental development of Multi-Paxos
           | called Sequence Paxos: https://arxiv.org/abs/2008.13456
        
         | ignoramous wrote:
         | Raft is very similar to Multi-Paxos. These Raft and Paxos user
         | studies comparing the "understandability" of the two protocols
         | might be helpful: https://youtu.be/JEpsBg0AO6o (Paxos),
         | https://youtu.be/YbZ3zDzDnrw (Raft).
         | 
         | See also, comparison of various "Paxos" implementations:
         | https://muratbuffalo.blogspot.com/2016/04/consensus-in-cloud...
        
       | jiryu wrote:
       | I gave the presentation in the linked article, here's a written
        | version of the presentation:
        | https://emptysqua.re/blog/paxos-vs-raft/
       | 
       | I hope that the "Paxos vs Raft" debate can die down, now that
       | engineers are learning TLA+ and distributed systems more
       | thoroughly. These days we can design new protocols and prove
       | their correctness, instead of always relying on academics. For
       | example, at MongoDB we considered adopting the reconfiguration
       | protocol from Raft, but instead we designed our own and checked
       | it with TLA+. See "Design and Verification of a Logless Dynamic
       | Reconfiguration Protocol in MongoDB Replication" for details:
       | https://arxiv.org/pdf/2102.11960.pdf
        
       | butterisgood wrote:
       | VR for the win... (no not that VR)
       | http://pmg.csail.mit.edu/papers/vr-revisited.pdf
       | 
        | Ok, maybe not for the win, but it's worth a look. I'm actually
        | fairly certain one of the Paxos implementations I've worked with
        | and used is really more of a VR take on Paxos anyway.
        
       | benlivengood wrote:
       | Since the article mentions Google as the outlier preferring
       | Paxos, I may be able to shed some light from a few years ago.
       | 
       | The Paxos, paxosdb, and related libraries (despite the name, all
       | are multi-paxos) are solid and integrated directly into a number
       | of products (Borg, Chubby, CFS, Spanner, etc.). There are years
       | of engineering effort and unit tests behind the core Paxos
       | library and so it makes sense to keep using and improving it
       | instead of going off to Raft. As far as I am aware the Google
       | Paxos implementation predates Raft by quite a while.
       | 
        | I think in general, if most other people use Raft, it's better
        | for the community to have a single, stable, well-tested shared
        | implementation, for much the same reason it's good for Google to
        | stick with Paxos.
        
         | tyingq wrote:
         | This makes sense to me. Very few of us have the resources to
         | maintain, for example, the kind of globally synced (way beyond
         | typical NTP) clock infrastructure that Google has
         | (TrueTime[1]).
         | 
          | [1] https://cloud.google.com/spanner/docs/true-time-external-con...
        
           | gravypod wrote:
           | If you want to know just how hard this can be with bare metal
           | that you control, take a look at this:
           | https://signalsandthreads.com/clock-synchronization/
        
             | jeffbee wrote:
             | On the other hand, if you run in Google Compute Engine you
             | get accurate clock discipline for free.
        
       | littlestymaar wrote:
       | Heidi Howard, the first author of this paper did two videos about
       | her paper:
       | 
       | - A 10' short intro https://www.youtube.com/watch?v=JQss0uQUc6o
       | 
       | - A more in depth one :
       | https://www.youtube.com/watch?v=0K6kt39wyH0
        
       ___________________________________________________________________
       (page generated 2021-07-16 23:00 UTC)