[HN Gopher] Paxos ___________________________________________________________________ Paxos Author : joeyespo Score : 83 points Date : 2022-01-06 19:14 UTC (3 hours ago) (HTM) web link (martinfowler.com) (TXT) w3m dump (martinfowler.com) | tombert wrote: | I have a bit of a man-crush on Leslie Lamport (and have | (unsuccessfully) tried to get TLA+ to be used at every job I've | had for the last five years), and I have read through both the | original Paxos paper in addition to the Paxos Made Simple paper, | and I'm just going to say that I still found it about five times | harder to grok than Raft. Honestly the TLA+ specification of | Paxos made it click more than the papers for me, but that could | just be because of the repeated readings before. | | It's a beautiful algorithm, but Raft has the advantage of being a | lot simpler. | uvdn7 wrote: | I am also a huge fan of Lamport. Paxos is so concise and of | such beauty that once you understand it, all other consensus | protocols seem just different ways of phrasing Paxos with | additional features (e.g. membership changes, make it into a | log, etc.). Paxos is the smallest, most elegant consensus | building block that I have seen so far. Raft's protocol covers | a much bigger area than just consensus (it gives you a log, | supports reconfiguration, etc.). Paxos' entire protocol fits a | single slide. Period. | | I wrote a few posts about how to understand paxos, and this is | my personal favorite - https://blog.the-pans.com/understanding- | paxos/. | | Lamport did comment on it (via email) saying that understanding | Paxos is less important than proving it works (via TLA+). | MatteoFrigo wrote: | Your view that Paxos is an algorithm for read-modify-write is | a really deep insight. Once you adopt this point of view, | things like Raft are just RMW updates of a log. (I like to | think of Paxos as the fault-tolerant version of the load- | linked/store-conditional pair of instructions.) | | However, there is a difficulty in expressing exactly what | RMW-Paxos does, because it may end up "modifying" a value | that was never "committed". | | For example, assume that I have three acceptors storing a | replicated counter. The proposer executes phase-1, increments | the value received from the acceptor with the highest ballot, | and sends the value to all acceptors in phase-2. Consider now | an execution where all acceptors initially hold 0. The | proposer manages to store a new value 1 in one acceptor, and | then crashes. Thus, 1 was not committed and a reader may | conclude that the consensus value is still 0. Now a new | proposer arrives, receives (1, 0, 0) from the three | acceptors, chooses 1 as the newest value, and writes (2, 2, | 2) to all acceptors. Thus "2" is now committed, and the | commit of the "2" retroactively commits the "1". Thus, the | condition for being committed is no longer "there exists a | phase-2 quorum ...", but it is more complicated and depends | on future history. | | I have verified with TLA+ that this algorithm does indeed | implement an atomic read-modify-write operation, but only | under a complicated notion of being "committed" that boils | down to either having a phase-2 quorum, or one of the | successor operations having a phase-2 quorum. | | Did you ever figure out a simpler way to say this? | iambvk wrote: | Instead of Read-Modify-Write, I like to think of it as | Read/Determine/Write cycle. | | A note worthy point is that, Paxos actually expects | proposers to _drop_ their value and _carry-forward_ | someone-else 's value to complete the consensus. This is | typically not an intended behavior from client's point of | view, but proposers role is to form the consensus, not push | their value. | | Read phase is basically understanding if there is any value | that could've been chosen when F nodes are unavailable. | | Determine phase is to choose to carry-forward any value | (if-any) from the Read phase or use the new value from the | client. | | Write phase is to actually ask majority of the acceptors to | save the value selected in the Determine phase. | MatteoFrigo wrote: | You are totally correct that in Paxos the proposer is | supposed to "carry-forward" the value proposed by | somebody else. In this view, which is how everybody | understands Paxos, Paxos is an algorithm for consensus. | | However, the proposer can also _modify_ the value | proposed by somebody else, and it turns out that this is | a perfectly valid algorithm for implementing a | distributed read-modify-write. This variant does more | than just consensus---it basically behaves like a fault- | tolerant memory with an atomic update operation (think | compare-and-swap or load-linked /store-conditional). In | his blog post, GP also mentions that paxos is a read- | modify-write transaction, but perhaps I read too much | into what he is saying. Either way, this RMW variant is | correct, I have verified it exhaustively with TLA+, and | it is used in a few production systems that I have | implemented. | | The difficulty is now to say exactly what this RMW | variant does. I was hoping that GP (or anybody else) may | have some insights. | tombert wrote: | > Lamport did comment on it (via email) saying that | understanding Paxos is less important than proving it works | (via TLA+). | | I suppose that's not wrong; if I know for a fact that | something is going to behave how I expect it too, I don't | generally care how it works. I can treat the algorithm as a | black box and just go from there. | | That said, I dont' completely agree. If I am implementing | Paxos in NewLangOfTheMonth, it's likely that a straight one- | to-one port of another version will not fully work due to | tiny differences in the language semantics. If I don't | understand Paxos, then it's unlikely I'll be able to correct | any mistakes in the implementation. | iambvk wrote: | I totally agree. Paxos is such a simple solution that it is | so amazing. | loquor wrote: | > I have a bit of a man-crush on Leslie Lamport | | +1 - I really admire him too. I wonder, what makes people like | him and Edsger Dijkstra so charismatic in this sense? | User23 wrote: | Dijkstra was and Lamport is very good at writing clearly and | distinctly. Many people with that level of raw brainpower | don't make that effort, so the ones that do are easy to | admire and be grateful to. | tombert wrote: | I think a lot of it has to do with the fact that he's a very | prolific writer and speaker, in addition to contributing to a | fairly diverse set of Computer Science fields. | | I usually enjoy Lamport's speeches, and I have said before | that his book Specifying Systems should be mandatory reading | for nearly anyone interested in CS, due to how approachable | it is, and yet how deep it goes. | solidangle wrote: | You might be interested in this paper [1] which presents Paxos | in the same way as Raft is presented in the Raft paper. In my | view it really illustrates that most of the simplicity of Raft | comes from its excellent presentation and not necessarily from | its algorithmic simplicity. | | [1]https://arxiv.org/pdf/2004.05074 | kkwteh wrote: | Paxos has a reputation of being hard to understand. I was under | the impression that Raft was now preferred for distributed | consensus. The creators of the Raft algorithm explicitly had | understandability as a goal when designing it and called out | Paxos as being hard to understand. | | https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf | solidangle wrote: | (Multi-)Paxos and Raft are incredibly similar. I would even | argue that Raft is just a variant of Paxos in the same way that | e.g. Fast Paxos and EPaxos are variants of Paxos. Most of the | Raft algorithm is identical to the Paxos algorithm. The main | difference between the two algorithms is how leadership | election works. In my view the main reason why Raft is | considered to be simpler is because it was presented in a | really clear paper. Paxos becomes equally simple when using a | similar representation as used in that paper. See this paper: | https://arxiv.org/pdf/2004.05074 | daveevad wrote: | My grey matter is failing me but does anyone remember a rant | about 5-10 years ago about developing software in this era that | had some punch line about consensus algorithms that went a little | like, "so, there's this guy named Diego?" | | edit: found it | | https://circleci.com/blog/its-the-future/ | lowbloodsugar wrote: | >Why don't I just use Google's thing? | | >-You think that's going to be around in 6 months? | | bahaha. | jldugger wrote: | > Unmesh Joshi | | It feels kinda strange to see an author other than Martin Fowler | blogging on martinfowler.com.... | DerpyBaby123 wrote: | Maybe so, although he has been featuring other authors on his | blog regularly now. He works with them to keep the quality | high, check them out. | Jtsummers wrote: | This came up with another martinfowler.com post, and I checked | the past couple years. Roughly half the content is Fowler's, | and the other half are other writers (from memory, not | recounting it now). I don't know the ratio going back prior | years (only checked 2021 and 2020) but I've seen lots of posts | by others on his site for the past few years. | | https://martinfowler.com/articles/patterns-of-distributed-sy... | - other content by the same author under title _Patterns of | Distributed Systems_. | throwaway984393 wrote: | I think a flow chart would have been easier to understand, or | perhaps a video or .gif | uvdn7 wrote: | No, it doesn't. A video might help you _feel_ like you | understand it but it doesn't really help you understand it. | That's just not how distributed systems work. | spicybright wrote: | Now that's a little silly to say, some people grok more from | video, some more from articles with pictures. | rdtsc wrote: | Is there a mistake on the second diagram where prepare(1,e) is | sent to Byzantium? Wonder if it should be prepare(1,a), since | Athens hasn't learned the (1,e) value which only Ephesus and | Delphi know about at that point? | | Otherwise I think it's a pretty good description of Paxos. I like | that it walks through a good number of failure scenarios and | mentions interesting corner cases such as a lack of an explicit | commit in the original papers and that reading also needs to run | the full Paxos algorithm. | Jtsummers wrote: | It does seem to be a typo, based on the table and text below | it. | arsh wrote: | There are huge gaps between the scientific paper and how you | actually operate and scale Paxos. This can be seen in Paxos Made | Live - An Engineering Perspective by Tushar Chandra et.al. | | Similarly, it also has real drawbacks like being slow, or lacking | liveness guarantees. | | See other alternatives in https://aws.amazon.com/builders- | library/leader-election-in-d... ___________________________________________________________________ (page generated 2022-01-06 23:00 UTC)