[HN Gopher] Implementing a distributed key-value store on top of... ___________________________________________________________________ Implementing a distributed key-value store on top of implementing Raft in Go Author : simjue Score : 219 points Date : 2023-05-25 13:26 UTC (9 hours ago) (HTM) web link (notes.eatonphil.com) (TXT) w3m dump (notes.eatonphil.com) | didip wrote: | Big fan of Phil! | | A lot of us tinkered like him, but very few came back and write | the findings nicely in an easy to read form. | eatonphil wrote: | Glad to hear it! :) | yi_xuan wrote: | Recently, I have been watching the course MIT6.824. This article | appeared very timely :) | | Here is another raft implementation in Go | https://github.com/eliben/raft | siftrics wrote: | This is essentially do-it-yourself etcd. | | https://github.com/etcd-io/etcd | esafak wrote: | And tidb | moderation wrote: | TIKV [0] is closer (written in Rust). | | 0. https://github.com/tikv/tikv | bit_flipper wrote: | Your note about encoding/gob being inefficient is somewhat | accurate for how you're using it, but I want to talk a bit about | how you could improve your use. | | encoding/gob is intended for streams, not stateless | marshals/unmarshals. The first thing that is sent over the stream | is the type information the receiver should expect, that's why | your payload was so large. After the first type is received, | subsequent messages are much smaller. You can see this by | extending your example to do multiple writes; each write after | the first is only 10 bytes: https://play.golang.com/p/Po_iaXrTUER | | You have to plan differently, but you could get large | improvements to transmission sizes by changing to append only | files and creating the gob encoder once per file. If you find | you're creating a gob encoder/decoder very often, that's a | telltale sign you're not using it as intended. | eatonphil wrote: | Another thing I glossed over (unintentionally) is that I only | started limiting batch size after I switched to the custom | encoder. So persist() being a function of length wouldn't be | quite true anymore. | | However, I still keep seeing encoding/gob high up in the | profiler taking a lot of time doing reflection during RPC | calls. | | So it does still seem like it's not ideal. Though I may still | just not be understanding how to use net/rpc correctly either. | candiddevmike wrote: | > encoding/gob is intended for streams, not stateless | marshals/unmarshals | | I don't understand this within the context of the rest of your | comment. I use gob for marshaling stuff to storage all the | time, I'm not aware of a better way to do that (serialize data | to binary). | bit_flipper wrote: | Sorry, I should have been more precise. encoding/gob is not | optimized for situations where you create an encoder or | decoder, read/write a single value, then discard that | encoder/decoder. As the author noted, payloads for a single | call to Encode() are quite large. Additionally, re- | instantiating a gob encoder for each call to Encode() is very | expensive allocation-wise and benchmarks where this happens | will show gob to perform poorly in these scenarios. You can | certainly still use gob this way, and if the performance | works for you then have at it! But it performs significantly | better in situations where you make multiple calls to | Encode() with the same encoder. | throwaway894345 wrote: | The parent is saying that gob includes the type information | in the serialized payload. This is extraneous information in | many cases, and the result is an unnecessarily large payload. | dotnwat wrote: | epic | cheeseprocedure wrote: | I was lucky enough to take David Beazley's "Rafting Trip," a | five-day training course that guides each student through | building their own Raft implementation from scratch: | | https://www.dabeaz.com/raft.html | | I'd recommend the course to anyone with development experience | working with or near distributed systems. David is a fantastic | instructor and facilitator, and the blend of student backgrounds | led to some great learning and discussion. | | (I have no financial or personal interest here; I just loved the | course.) | geospeck wrote: | Thanks for the post! | | Another great blog post series about implementig Raft in Go that | I found is this one | https://eli.thegreenplace.net/2020/implementing-raft-part-0-... | eatonphil wrote: | Yes I definitely referred to his posts and project while | working on this. I'm a big fan of his blog! | Xeoncross wrote: | These kinds of posts are my favorite part of HN. The deep dives | into LLM's, state machine theory, fourier transforms, locking | data structures, and memory allocation that is miles above the | basic posts around the internet you get with searching - but not | quite a full book yet. | | The author spent 7 months tinkering and cared enough to come back | and share that with us. | beoberha wrote: | Phil rocks. Highly suggest following him on twitter. He's | pretty outspoken about writing. This is a great post of his | that I need to reflect on more: https://notes.eatonphil.com/is- | it-worth-writing-about.html | eatonphil wrote: | Glad that post resonated! <3 | quaintdev wrote: | I wish there was a separate catalog of these posts which I | could browse through. | restlake wrote: | Cool post. Wonder if it would have helped to take a look at the | MIT distributed systems course on the web and YouTube - one of | the projects is exactly this (Go Raft implementation) | | https://pdos.csail.mit.edu/6.824/labs/lab-shard.html | eatonphil wrote: | I just don't love learning from videos personally. But I'm sure | those are good if you do! | powerset wrote: | I highly recommend MIT open courseware 6.824. Incredibly valuable | for learning distributed systems, and one of the lab assignments | is implementing raft in Go. | http://nil.csail.mit.edu/6.824/2022/schedule.html | | There are a ton of fascinating and potentially frustrating edge | cases and gotchas to implementing raft correctly. There's no | better way to understand it than actually implementing it, and I | probably never would have done it myself without these course | materials. | rochak wrote: | Unfortunately, I hit a roadblock while implementing the Raft | assignment. I knew it was simply beyond my capabilities but | would have made through if I had anyone I could reach out to. I | second this recommendation, but make sure you know what you are | signing up for. This course is as hard as they come. | valzam wrote: | I have found the performance tests very tricky to get to pass | without having any input from others. The assignment is | really very unforgiving, I would wager the test suite is | comparable to how commercial Raft implementations are tested | (e.g. https://github.com/hashicorp/raft) | leetrout wrote: | So kinda like "build your own consul" | eatonphil wrote: | I'm pretty sure consul builds on top of hashicorp's Raft | implementation so I'd say end result, yes, but that this post | goes into the Raft implementation. | | And I'm sure consul does plenty my silly key-value state | machine doesn't. | | I'm not super familiar with it. | roncesvalles wrote: | Reminded me of etcd, which is exactly this - Raft-based key- | value store written in Go. | | https://github.com/etcd-io/etcd | fredski42 wrote: | Here is nice interactive and visual representation of how raft | works: http://thesecretlivesofdata.com/raft/ | thinkharderdev wrote: | That's really cool! | hintymad wrote: | A side topic: how can a company find people like the author, who | can really articulate how to build a system? I've been | interviewing senior engineers for years, and more than 95% of | them, if not more, are really box drawers. They throw terms | around confidently but fail miserably the moment I ask for | specifics. No, I don't ask for details like how Raft works. That | level of details probably would fail all the candidates. Just | simple things like how you organize the data, how you route the | traffic, how you manage metadata, or how your data path keeps | metadata in sync. Really basic ones, yet most interviewers fail. | Indeed, the more senior they are, the more likely they lose touch | of the details -- even those who claim to work on distributed | databases. | pavluha wrote: | Try to interview more senior people, who still remember their C | course in an University, or even know what is assembler. | Currently, most graduates (from my experience) don't know how | an integer is stored in memory. | mydogcanpurr wrote: | I don't even know what a good answer to that is. Some | integers are stored in registers. Some are stored on the | stack. Some are stored on the heap. Or you could be asking | how integers are represented, with two's complement being the | usual way. Did I pass your ambiguous question? Probably not. | qznc wrote: | You find them via their blogs. | bombela wrote: | Where do you find the jobs where people actually care about | engineering? Because real work is mostly careful plumbing | without breaking what works, but in interviews you must | leetcode. Real systems only work when kept reasonably simple, | but system design interviews are all about drawing boxes. | BossingAround wrote: | I recently interviewed with a company, and the interviewer | asked me "how do you organize data." I wasn't sure if they | wanted to talk about classes, modules, databases, k-v stores, | hashing data and routing distributed requests to the same pods. | I asked, and they answered "I mean in general, how do you | organize data." | | After talking for a bit about pretty much all of the above, the | interviewer asked "have you used dictionaries?" | | The reason I'm telling the story is, if a lot of your | candidates fail to answer your questions, the problem might be | in the question. | skippyboxedhero wrote: | I have had that exact experience before. It is fine, you | realise the company is toasted and you move on (the place | that asked me this had just got rinsed by an offshore | consulting firm, ecomm site built with a graph database, | comedic stuff). | maximinus_thrax wrote: | > After talking for a bit about pretty much all of the above, | the interviewer asked "have you used dictionaries?" | | Hahaha classic. | | I was once in an interview (which I failed) and was asked a | problem related to some sort of monotonic queue. I wrote the | solution and was going through it, when the interviewer | asked: 'what CS concept are you using in this solution'? I | didn't really understand what he was asking for and in turn I | asked for more clarification like 'Are you referring to the | data structures I used? Or the technique (it was some sort of | greedy)?' 'no'. After a couple of back-and-forth questions | and answers, he finally tells me he was looking for me to say | 'a state machine'. Seriously? | no_wizard wrote: | I would have halted for a moment and asked something like: | _organize data for what purpose?_ At which point I would | expect there to be some amount of clarification. As it sounds | to me like they 're talking about organizing some set of data | for lookup, as opposed to organizing data around how it flows | through an application, for example | aranw wrote: | I often find when you are given questions like that no | matter how you probe you get a really defensive interviewer | who doesn't want to give away the answer they are | expecting. I've been in very similar situations with vague | questions and I've tried to probe for more details and just | been met with defensiveness with an attitude that says they | expect me to know exactly what they mean | d0mine wrote: | It is worse: your attempt to clarify the question may be | regarded by some interviewers as a sign you are not | senior dev (they think [erroneously in my opinion] that | "senior" means that you must figure out fine details | yourself). | maximinus_thrax wrote: | One a rare occasion I actually got feedback after the | interview (it was for a startup), It was because I asked | 'too many questions' it was decided that I'd need way | more mentoring than they can afford for the senior | position I was applying for. | no_wizard wrote: | This seems odd to me, though its entirely possible my | experience thus far hasn't been this. I've been told at | the last 3 places I worked that one of the reasons I was | hired was how inquisitive I was during the interview | stage, asking _lots_ of questions before coming up with | solutions | | I am most definitely not ruling out what you're saying | may be common case and I am just very lucky, but I do | want to act as a sort of counter point to see if others | can weigh in so we can get more of an industry sense | around this | zimpenfish wrote: | Anecdatum: my experience has been the same as others - | asking questions about the interview questions generally | leads to a bad outcome. In 25 years, I reckon it's less | than a handful of times where asking questions of the | interviewers has actually got a positive response. | | Hell, even when in jobs, asking questions about projects | I'm assigned has sometimes got a negative response... | pavluha wrote: | There are plenty of different types of dictionaries, | organizing their data differently. | nitwit005 wrote: | You seem confident you wouldn't have failed the author of this | blog post, but I wouldn't be so confident. | | Interviews are a high pressure situation where people want an | immediate answer, without much of any time to think, let alone | research options or run an experiment. You get people's | absolute worst possible results. | convolvatron wrote: | I have worked in distributed databases. I can't seem to find | any host organization that's actually building anything and | really cares about those details. I know several other senior | systems engineers in the same boat. where are you guys hiding? | bastardoperator wrote: | Maybe you're the problem? Looking at your questions, you're | asking for opinions, not answers or solutions. Putting people | on the spot is bad form in my opinion. If I have a technical | challenge I want you to solve, I'll give you the time for that | so I can actually see your work and discuss it with you. | | I've seen people that can talk the talk but not walk the walk | and vice versa, they can't articulate well, but their approach | and work speaks for itself. | hintymad wrote: | Maybe I am. For a senior enough role, my questions are open | ended. I'd expect the candidate to drive conversations and | examine alternatives. I ask follow-up questions based on what | the candidate mentions. Yeah, such may be indeed their | opinions, but at least they should be opinions derived from | assumptions, constraints, and fundamentals. | darth_avocado wrote: | That's a terrible way to interview. Interviews are high | stress, time constrained settings where you're talking to | people you don't know. If you want thoughtful answers, you | need to be precise in what you're asking. You can't be like | "tell me about yourself" and then complain I didn't get | from the candidate whether steak is their favorite food. | skippyboxedhero wrote: | Do you understand the difference between writing a blog post | about something and being able to talk about it in an | interview? The majority of work in software is not being an | "expert" on any one thing because the one thing that employers | require is changing quite frequently. Rather it is being able | to build up experience and carrying knowledge forward to new | areas. | | Most technical interviews are conducted in a completely inane | way where the goal is to memorize pieces of information. If | that was how software worked, the best engineer would just be | Google. 95% of people who know enough to build this system | would likely be unable to answer your questions because, more | than likely, they do not currently work in a job that requires | them to use this information every day. This does not mean they | do not understand it or are unable to build it. | | The reason why companies can't find people like this is a | combination of not understanding what software development is | (and not understanding people, it is really the blind leading | the blind but technical people are often as bad as | interviewing). | | You are asking completely wrong questions. You aren't starting | from the right place. | hintymad wrote: | > Rather it is being able to build up experience and carrying | knowledge forward to new areas. | | Then I'd expect the candidate could articulate her experience | and knowledge in some way, no? Otherwise, how would I know | the candidate has the built-up expertise? Of course, I assume | we can only have interviews. Otherwise, we can have other | means, like mini-project, onsite project, or a writeup. Some | candidates do like the alternatives more, and some not. | | > where the goal is to memorize pieces of information | | I disagree. The goal is to see if a candidate does have what | the candidate herself claims to know. I find it hard to | imagine that a candidate claims to be an expert in a field | yet couldn't articulate even one thing in depth for hours if | not days, let alone 30 minutes of interview time. Note this | is not about any specific details, but the general picture | and insights that an expert can convey. This is like a PhD | oral defense. The candidate talks about the topic that _the | candidate_ is familiar with, and the professors dive in on | such topics. I don 't see what's wrong with that. | skippyboxedhero wrote: | Candidates aren't defending a PHd. The interviewer is | nowhere near competent enough for this. | | > The goal is to see if a candidate does have what the | candidate herself claims to know. | | The process you are using does not tell you that. You will | notice that your explanation is full of things that you | expect, not based in an understanding of how reality | actually works. And the result is, unsurprisingly, one | where you do not understand the output. | samsquire wrote: | Thank you for the write up eatonphil. | | I experimentally implemented Raft in Java but I am not very | confident that I did it correctly. | | I wish there was a way to implement stateful programs that | guarantee "forward progress" and are "steady state systems". I | think essentially a state machine that cannot be trapped in a | state. Debugging the absence of something of forward moving | progress or lack of causation is very difficult. | | When there's essentially different actors in the system and they | can interact with eachother by communicating, they each have a | number of states they can get into. There's no guarantee that the | system shall converge on a state that forward progress can be | made. Maybe TLA+ is the right answer here. | | YMMV but I think (my) reasoning over stateful systems is rather | difficult, I think there's lots of hidden states that we cannot | easily detect or reason about because they're in our blind spots. | Especially related to synchronization. I think it's part of what | makes multithreading and distributed systems so hard, because | every component can be in a different state and if something is | not where it is expected to be, the baton doesn't get passed to | the correct state. If you check for something too early, you have | a race condition. | | If we could see in slow motion what was going on, an interaction | between different actors, we could work out why something happens | the way it does. But usually the logs are too numerous to get to | this detail. I think animation can save us, but what does a Raft | animation look like? | | How often have you seen an endless spinner? It's as if a | completion event was raised but didn't get detected and the | system is waiting for something that shall never occur. I want | this kind of error to be impossible. This is one form of hidden | state that prevents progress. | | I wrote an eventually consistent mesh protocol in Python and | tested it with Jepsen, it is not linearizable because the | consistency level is "eventually consistent". | | I don't understand how Raft can scale writes or reads across | multiple machines due to the round trip time talking to other | nodes. | qznc wrote: | I think you should try TLA+. I found it surprisingly easy: | https://beza1e1.tuxen.de/tla-plus.html | | Still haven't found an opportunity to use it professionally | though. | no_wizard wrote: | Sounds like a job for state machines like you can build out | with a library like xstate[0] (though I'm sure there are | similar libraries in whatever language you choose. Python has | one called automat[1]) | | These exist to formalize state logic (current state, computing | state, transitioning state etc), you can even produce diagrams | based on their definitions. Advanced libraries like xstate even | have Actors are part of the core of the library | | [0]: https://stately.ai/docs/xstate | | [1]: https://github.com/glyph/Automat | tbrockman wrote: | Don't have anything to comment on the rest, but my | understanding with respect to: | | > I don't understand how Raft can scale writes or reads across | multiple machines due to the round trip time talking to other | nodes. | | Is that at the point where that becomes a bottleneck you scale | to multiple clusters, where the read/write destination cluster | is determined by a key and whatever your preferred hashing | mechanism is. | DylanSp wrote: | Microsoft has a library/tool called Coyote* that helps with | testing distributed systems; you can write | tests/specifications, Coyote will systematically explore | nondeterminism in your system and check if your tests still | pass. If there's a failure, it'll show the sequence of events | that led to the failing test. | | I started a project to implement Raft with a KV-store on top, | similar to the article, meaning to use Coyote to test it; I | didn't get that far before losing interest, though. It's | reassuring to read that it took Phil several months to write | the code in the post, it's good to know that this is a | decidedly nontrivial problem. | | * https://github.com/microsoft/coyote ___________________________________________________________________ (page generated 2023-05-25 23:00 UTC)