[HN Gopher] Let's build a distributed Postgres proof of concept ___________________________________________________________________ Let's build a distributed Postgres proof of concept Author : eatonphil Score : 168 points Date : 2022-05-17 16:21 UTC (6 hours ago) (HTM) web link (notes.eatonphil.com) (TXT) w3m dump (notes.eatonphil.com) | cube2222 wrote: | Really nice blog post! Especially because it actually builds | something that works in the end, while keeping stuff fairly | simple. | | Another approach could be using PostgreSQL as a stateless | scalable query execution engine, while stubbing out the storage | layer with a remote replicated consistent storage solution. | | This is also what I think Aurora does. | cdumler wrote: | Is this going down the route of YugabyteDB[1]? | | [1] https://www.yugabyte.com | vitruvvius wrote: | now extend into raft atop rdma :) | bogomipz wrote: | I really enjoyed this post. I had a question, is the raft-boltdb | instance the actual raft log where the sql commands are being | written to then? | | It looks like the apply method is responsible for writing the sql | statements to the raft log as well as executing the sql queries. | Is waiting for a quorum of writes to the raft log by the other | members not needed? Or is this all just handled under the hood by | the raft libraries being used. | otoolep wrote: | It's all handled by the Raft libraries (I know this because | rqlite uses the same Raft library at this blog post). When | integrating with Hashicorp's Raft code, your code must | implement Apply()[1]. The Hashicorp code promises not to call | Apply() until a quorum of nodes have been persisted the data to | the Raft log. That way when your code is called to modify your | state machine (FSM), quorum has already been achieved by the | Raft system. | | Architecturally (in rqlite's case) a node looks something like | this: | https://docs.google.com/presentation/d/1Q8lQgCaODlecHa2hS-Oe... | | [1] https://pkg.go.dev/github.com/hashicorp/raft#FSM | bogomipz wrote: | Thanks. So this embedding of a database engine into the Raft | FSM is the general pattern then? | | It looks like Phil's post uses boltdb for the Postgres | storage engine as well as for Raft log(via Hashicorpo's | implementation Raft lib.) | | Thanks for the link to the slides as well. I've seen rqlite | mentioned a few times in the last few week and so it was on | my short list of things to read up on. | otoolep wrote: | Yes, embedding any kind of store can work. For example, | here is an in-memory KV store embedded in Raft: | | https://github.com/otoolep/hraftd | | This is kind of a reference use of Hashicorp's Raft. | remram wrote: | distributed _Postgres-wire-compatible*_ proof of concept | KingOfCoders wrote: | Time might be up (saying as a long term PG user) with Litestream | adding distributed features to SQLite, local inprocess databases | might be the next thing. | nijave wrote: | Litestream allows scale up application architectures whereas | Postgres allows scale out architectures (by only scaling up the | database portion instead of also scaling up application | servers). | | It's basically the same thing as monolith vs microservices but | extending monolith to the data persistence layer. With | horizontally scaling apps being the predominant architecture | right now, I don't really see Litestream changing much. | | If you're going to horizontally scale Litestream with a | multiple writers you're going to end up introducing all the | network and synchronization pieces Postgres architectures | already have. | cromd wrote: | Local inprocess databases might be a bigger thing, but won't | they do nothing to help cases where 10 different applications | on different machines need to read and write in the same | database? I would think the target audiences are very | different, i.e. people who are finding non-distributed PG | insufficient probably can't alleviate their problems by | switching to SQLite | KingOfCoders wrote: | It depends on your access pattern, if every application is | write heavy to the same data set, this won't work. If every | application is light on writes and those are mostly local | (like for many SaaS companies), this will work fine with the | future Litestream that redirect writes to one db and then | distributes the data (if you can live with slightly stale | data). Also depends on the size of you company. Slack does | 300k/s messages, this won't work :-) | cromd wrote: | Oh, maybe I am behind on this "future litestream" that | "redirects writes". I only knew of the ability to | sync/backup/restore. Do you have a link to something | discussing those other features? | ankraft wrote: | It doesn't redirect writes but you can create read-only | replicas: https://github.com/benbjohnson/litestream-read- | replica-examp... | SahAssar wrote: | Litestream still only does single-master, right? To me it feels | like the next step up would be something like foundationdb | (true multi-master) with a good psql layer. | justinsaccount wrote: | Interesting. I figured this would be like rqlite but on top of | postgresql instead, but this builds a simple sql layer on top of | bbolt. | otoolep wrote: | I was half-thinking about taking the Postgres wire protocol | code[1] that CockroachDB uses, and adding it to rqlite[2]. Then | you could talk to rqlite using Postgres tooling (perhaps). The | only thing is that rqlite doesn't support distributed | transactions in the usual sense, so there might be some | difficulty mapping Postgres transaction control to rqlite. But | still, could be interesting. | | [1] https://github.com/jackc/pgproto3 | | [2] https://github.com/rqlite/rqlite | | Disclaimer: I am the creator of rqlite. | awinter-py wrote: | > What is CockroachDB under the hood? ... wire protocol, a | storage layer, a Raft implementation, PostgreSQL's grammar | definition. ... To be absurdly reductionist, CockroachDB is just | the glue around these libraries | | ^ semi off topic, but love the idea of distinguishing certain | projects as 'just glue'. (Not a dis, glue matters). Especially | interesting in the context of OSS tools whiteboxed by cloud | vendors with 'proprietary glue' | eatonphil wrote: | I said absurdly reductionist and definitely meant that. :) | There is a ton going on under the hood to make it work well, to | fully implement postgres, and to work efficiently. | jchw wrote: | I don't think we should be saying "fully implement | PostgreSQL." People are feeling burned by this notion because | it sounds like a drop-in replacement, but it isn't really | that simple. Being wire compatible is mostly just convenient, | since you can reuse existing drivers. But existing software | built on top of PostgreSQL won't necessarily work out of the | box... | | Maybe this is a bit nit picky, but, still. | eatonphil wrote: | I've written about the "world of postgres wire compatible" | [0] and yes there are some databases that implement the | protocol and don't implement postgres (see: immudb and | ravendb) I give databases like CockroachDB more credit for | actually making their intention and generally following | through with implementing Postgres. Also unlike some others | on my list CockroachDB really is meant as an OLTP | replacement. Performance characteristics and operations | will definitely differ but as solely a user/application | developer you shouldn't notice as much. | | [0] | https://datastation.multiprocess.io/blog/2022-02-08-the- | worl... | [deleted] | truth_seeker wrote: | I would rather use LOGICAL replication to achieve 1MASTER-1SLAVE | or 1MASTER-Multi SLAVE or MULTI MASTER - Multi Slave | configuration after working schema design details and read-write | query patterns across tables. | 3iggy wrote: | I was expecting a guerilla db instead of a cockroach db. Like | serialising the db for dropbox. This is like a drifter camping on | private land... you can say it isnt really decentralised but i'll | just move somewhere else. | eloff wrote: | The word used in the title and article is distributed not | decentralized. | fsckboy wrote: | not challenging you, just asking, what is the distinction | you're drawing between distributed and decentralized? | | one idea would be, the db is completely synchronized on 3 | redundant servers, so it's decentralized | | each column of a table is stored on a different server, so | it's distributed...? | ragona wrote: | I'd say distributed is about redundancy and scale within a | single entity's infrastructure, where as decentralized | implies multiple entities working together. | jen20 wrote: | Decentralised implies lack of a global leader. However, | each raft group has a leader, so you can only take this | so far in a multi-group raft design. | wiz21c wrote: | sidenote : implementing RAFT myself was really tough. The | original paper has 2 or 3 obscure points which were really hard | to figure out... | [deleted] | mjb wrote: | It is tough. | | My approach when learning new protocols like Raft or Paxos is | to implement them in Pluscal (TLA+'s higher-level language) or | P (https://github.com/p-org/P). I've found that helps separate | the protocol-level concerns from the implementation-level | concerns (sockets? wire format?) in a way that reduces the | difficulty of learning the protocol. | eatonphil wrote: | Link to your implementation or notes? :) | tynpeddler wrote: | Echoing the other, how did you configure your test suite to | check for correctness? | SahAssar wrote: | Something something jepsen test? | emteycz wrote: | Please share the source! I'd appreciate it very much. | ilovepostgres9 wrote: | azurezyq wrote: | Raft is the secret sauce behind many distributed databases, like | CockroachDB, TiDB, Yugabyte, etc.. | | But to make it scalable and production grade, you have to run | multiple Raft operations in parallel _and_ manage them well. | E.g., sharding them correctly to make the load balanced and | adjusting dynamically. | | Here's a pleasant article to read: | https://en.pingcap.com/blog/how-tikv-reads-and-writes/ | rad_gruchalski wrote: | There's also YugabyteDB which is actual Postgres with distributed | flex. It even supports regular Postgres extensions. Anything in | golang imitating Postgres will be subpar for years to come. | | YugabyteDB splits tables into tablets (partitions), every tablet | has its own raft. It's a modified rocksdb under the hood. | | OTOH, it's cool to see how someone builds something like this | from scratch and is willing to talk about it. ___________________________________________________________________ (page generated 2022-05-17 23:00 UTC)