[HN Gopher] Paxos vs. Raft: Have we reached consensus on distrib... ___________________________________________________________________ Paxos vs. Raft: Have we reached consensus on distributed consensus? Author : yagizdegirmenci Score : 113 points Date : 2021-07-14 10:39 UTC (2 days ago) (HTM) web link (charap.co) (TXT) w3m dump (charap.co) | aneutron wrote: | It's not possible to do so, since it would be done | asynchronously. | | I'm sorry, I'll see myself out. | erik_seaberg wrote: | Obviously we need a third distributed consensus algorithm as a | tiebreaker. | _jal wrote: | That really should be generalized to support N algorithms. | Some sort of quorum protocol, maybe. | josh2600 wrote: | Ok, so having worked on distributed consensus a bunch, here are a | couple of thoughts in no particular order: | | * In the real world, servers misbehave, like, a lot, so you have | to deal with that fact. All assumptions about network robustness | in particular will be proven wrong on a long enough timeline. | | * Leader election in a world without a robust network is an | exercise in managing acceptable failure tolerances. Every | application has a notion of acceptable failure rates or | acceptable downtime. Discovering that is a non-trivial effort. | | Jeff Dean and others at Google famously came to the conclusion | that it was ok for some parts of the Gmail service to be down for | limited periods of time. Accepting that all self-healing/robust | systems will eventually degrade and have to be restored is the | first step in building something manageable. The AXD301 is the | most robust system ever built by humans to my knowledge (I think | it did 20 years of uptime in production). Most other systems will | fail long before that. Managing systems as they fail is an art, | particularly as all systems operate in a degraded state. | | In short, in a lab environment networks function really well. In | the real world, it's a jungle. Plan accordingly.
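The failure-tolerance budget discussed above comes down to majority quorums, the mechanism both Paxos and Raft share: any two majorities of a cluster intersect, so a decision acknowledged by a majority survives any minority of failures. A minimal sketch of that arithmetic (function names are illustrative, not from any Paxos or Raft library):

```python
# Illustrative sketch: the majority-quorum arithmetic behind Paxos and Raft.
# A cluster of n nodes needs floor(n/2) + 1 votes to decide anything, which
# means it tolerates floor((n - 1) / 2) simultaneous node failures.

def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority can still be assembled."""
    return (n - 1) // 2

def has_quorum(acks: int, n: int) -> bool:
    """True once a proposal (or candidate) has majority acknowledgement."""
    return acks >= quorum_size(n)

# 3 nodes: quorum 2, tolerates 1; 5 nodes: quorum 3, tolerates 2;
# note an even-sized cluster of 4 also only tolerates 1 failure.
for n in (3, 4, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

This is also why clusters are usually sized with an odd number of nodes: going from 3 to 4 raises the quorum without raising the number of tolerated failures.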
| goodpoint wrote: | Most importantly: leader election should be used only as a last | resort. | | Leaderless systems with eventual consistency are often possible | and more robust. | brickbrd wrote: | In practice, for the systems where I built a replication system | from the ground up, once you factor in all the performance, | scale, storage layer and networking implications, this Paxos vs. | Raft thing is largely a theoretical discussion. | | Basic Paxos is, well, too basic, and people mostly run | modifications of it to get higher throughput and better | latencies. After those modifications, it does not look very | different from Raft with modifications applied for storage | integration and so on. | arielweisberg wrote: | This is exactly my take as well. Multi-Paxos and Raft seem very | similar to me. Calling out what the exact differences and | tradeoffs are would be good blog/research fodder. | ignoramous wrote: | > _Basic Paxos is, well, too basic, and people mostly run | modifications of it to get higher throughput and better | latencies. After those modifications, it does not look very | different from Raft._ | | Alan Vermeulen, one of the founding AWS engineers, calls | inventing newer solutions to distributed consensus an exercise | in _re-discovering_ Paxos. | | https://youtu.be/QVvFVwyElLY?t=2367 | tschellenbach wrote: | Stream's consensus algorithms are all Raft based; the Go Raft | ecosystem is very solid. We did end up forking some of the | libraries, but nothing major. | lowbloodsugar wrote: | Problem: "Raft protocol described and analyzed in English has | problems." Solution: "Here is a modification to the protocol, | described in English, that does not have such a problem." | | Seems like there is a common failure mode of "describing | distributed protocols in plain English and thinking this is a | proof"? | | The actual problem here is not "There was a problem in the Raft | protocol, and I figured it out and provided a workaround".
The | actual problem here is "Reasonably experienced software engineers | reviewed the specification and didn't see any problems." This | actual problem has not been addressed by the article. | adsharma wrote: | Author seems to be using https://github.com/ailidani/paxi for | actual implementation and proof. | | I'm more of a python/rust guy. There have been some attempts to | make model checkers in rust: | https://github.com/stateright/stateright | | The issue is that rust is a very large language and it's hard | to get it right. | | I have a python implementation of raft over here: | | https://github.com/adsharma/raft/tree/master/raft/states | | That's small enough to be self-contained and perhaps run | through a model checker some day and be transpiled to many | statically typed languages. | | The issue with TLA+ proofs such as: | | https://github.com/fpaxos/raft.tla | | is that it's hard to tell if a particular C++ or Rust | implementation conforms to the spec. | | So how do we check and transpile? | | * https://www.philipzucker.com/Modelling_TLA_in_z3py/ * | https://github.com/adsharma/py2many | hwayne wrote: | > is that it's hard to tell if a particular C++ or Rust | implementation conforms to the spec. | | You can use either TLC's graph output or simulation mode to | generate lots of TLA+ traces. Then you write an orchestrator | that reads each trace as JSON, runs the same actions in each | step on the program, and confirms they have the same final | states. You can go the other way, too, where you take program | traces and map them onto the TLA+ state space. | | It's a bit tricky to set up, but it's possible. I've had | several clients who've built something similar. | im_down_w_otp wrote: | This is, incidentally, one of the kinds of use cases that | informed part of our product functionality. | | 1. A way to write down fairly complicated system/behavior- | level specs for distributed systems. | | 2.
A way to instrument implementations to get the necessary | signal out of the system to verify it nominally. | | 3. A way to automatically explore adverse conditions to | validate that the system keeps behaving as needed and | specifically isolate contributing factors for when it | doesn't. | | The very first prototype was leveraged (in part, as there | was much more to the system under test) in a consensus | mechanism that was central to the correct, stable, and most | importantly... safe... operation of some critical software | infrastructure. It was necessary to try to explore where | the model assumptions broke down in practice in a real | system. | adsharma wrote: | https://www.microsoft.com/en-us/research/blog/p-programming-... makes the case better | than I can on why it's important to have the specification | and implementation come from the same source, as opposed to | maintaining two and then verifying via traces that they're | compatible. | | But definitely worth doing (like Jepsen does) to find | problems in the implementation. | hwayne wrote: | Yeah, if you can do it economically, having a single | source is preferable. | toolslive wrote: | you can abstract IO and async'ness from the implementation, | and model it as a pure function. Then build and inspect the | state space. A detailed description was done here: | | https://incubaid.wordpress.com/2012/10/25/caulking-your-dist... | littlestymaar wrote: | > The issue is that rust is a very large language and it's | hard to get it right. | | In my experience, Rust's type system (especially the _Send_ and | _Sync_ traits, which prevent data races) makes it easier - | not harder - to get big things to work. | adsharma wrote: | My intention is to reuse the rust type system where we can | and yet keep the language small enough to be able to | implement and verify non-trivial real-world programs. | | https://github.com/adsharma/py2many/blob/main/doc/langspec.m...
https://github.com/adsharma/py2many/blob/main/CHANGELOG.md#r... | hutrdvnj wrote: | Can someone provide a short description of the differences | between Paxos and Raft? | polynomial wrote: | Paxos is a family of algorithms which are aimed at distributed | consistency / monotonic state guarantees. However, Paxos allows | servers with out-of-order logs to be elected leader | (provided they then reorder their logs), whereas Raft requires a | server's log to be up-to-date before it can be elected leader. | Moreover, Raft has a reputation for having better | understandability than Paxos. | | edit: It looks like the linked paper covers the main | differences, albeit in a more detailed manner. Also, it seems as | if the author rejects the idea that Raft is more understandable | and makes a case why she thinks Paxos is more understandable. | anonymousDan wrote: | Personally I find paxos more understandable. For example, KTH | have a really nice incremental development of Multi-Paxos | called Sequence Paxos: https://arxiv.org/abs/2008.13456 | ignoramous wrote: | Raft is very similar to Multi-Paxos. These Raft and Paxos user | studies comparing the "understandability" of the two protocols | might be helpful: https://youtu.be/JEpsBg0AO6o (Paxos), | https://youtu.be/YbZ3zDzDnrw (Raft). | | See also a comparison of various "Paxos" implementations: | https://muratbuffalo.blogspot.com/2016/04/consensus-in-cloud... | jiryu wrote: | I gave the presentation in the linked article; here's a written | version of the presentation: https://emptysqua.re/blog/paxos-vs-raft/ | | I hope that the "Paxos vs Raft" debate can die down, now that | engineers are learning TLA+ and distributed systems more | thoroughly. These days we can design new protocols and prove | their correctness, instead of always relying on academics. For | example, at MongoDB we considered adopting the reconfiguration | protocol from Raft, but instead we designed our own and checked | it with TLA+.
See "Design and Verification of a Logless Dynamic | Reconfiguration Protocol in MongoDB Replication" for details: | https://arxiv.org/pdf/2102.11960.pdf | butterisgood wrote: | VR for the win... (no, not that VR) | http://pmg.csail.mit.edu/papers/vr-revisited.pdf | | Ok, maybe not for the win, but it's worth a look. I'm actually | fairly certain one of the Paxos implementations I've worked with | and used is really more of a VR bend to Paxos anyway. | benlivengood wrote: | Since the article mentions Google as the outlier preferring | Paxos, I may be able to shed some light from a few years ago. | | The Paxos, paxosdb, and related libraries (despite the name, all | are multi-paxos) are solid and integrated directly into a number | of products (Borg, Chubby, CFS, Spanner, etc.). There are years | of engineering effort and unit tests behind the core Paxos | library, and so it makes sense to keep using and improving it | instead of going off to Raft. As far as I am aware, the Google | Paxos implementation predates Raft by quite a while. | | I think in general, if most other people use Raft, it's better for | the community to have a single, stable, and well-tested shared | implementation, for much the same reason it's good for Google to | stick with Paxos. | tyingq wrote: | This makes sense to me. Very few of us have the resources to | maintain, for example, the kind of globally synced (way beyond | typical NTP) clock infrastructure that Google has | (TrueTime[1]). | | [1] https://cloud.google.com/spanner/docs/true-time-external-con... | gravypod wrote: | If you want to know just how hard this can be with bare metal | that you control, take a look at this: | https://signalsandthreads.com/clock-synchronization/ | jeffbee wrote: | On the other hand, if you run in Google Compute Engine you | get accurate clock discipline for free.
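The TrueTime idea referenced above can be sketched in a few lines: the clock API returns an interval guaranteed to contain true time, and a transaction's commit "waits out" the uncertainty so its timestamp is definitely in the past before the commit becomes visible. This is an illustrative model only; the names (`TTInterval`, `tt_now`, `commit_wait`) and the `EPSILON_S` bound are assumptions, not Google's actual API:

```python
import time
from dataclasses import dataclass

# Illustrative model of a TrueTime-style interval clock and Spanner-style
# commit wait. All names and the uncertainty bound are hypothetical.

EPSILON_S = 0.007  # assumed clock-uncertainty bound (~7 ms, for illustration)

@dataclass
class TTInterval:
    earliest: float  # true time is guaranteed to be >= earliest
    latest: float    # ... and <= latest

def tt_now() -> TTInterval:
    """An interval that brackets true time, given the uncertainty bound."""
    t = time.time()
    return TTInterval(t - EPSILON_S, t + EPSILON_S)

def commit_wait(commit_ts: float) -> None:
    """Block until commit_ts is definitely in the past on every clock.

    After this returns, any node's tt_now().earliest exceeds commit_ts,
    so no later transaction can be assigned an earlier timestamp.
    """
    while tt_now().earliest <= commit_ts:
        time.sleep(EPSILON_S / 2)
```

A tighter uncertainty bound means a shorter commit wait, which is why the sub-NTP clock infrastructure tyingq mentions matters: with loose clocks this sketch would stall every commit for tens of milliseconds.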
| littlestymaar wrote: | Heidi Howard, the first author of this paper, did two videos about | her paper: | | - A 10-minute short intro: https://www.youtube.com/watch?v=JQss0uQUc6o | | - A more in-depth one: | https://www.youtube.com/watch?v=0K6kt39wyH0 ___________________________________________________________________ (page generated 2021-07-16 23:00 UTC)