[HN Gopher] Let's build a distributed Postgres proof of concept
       ___________________________________________________________________
        
       Let's build a distributed Postgres proof of concept
        
       Author : eatonphil
       Score  : 168 points
       Date   : 2022-05-17 16:21 UTC (6 hours ago)
        
 (HTM) web link (notes.eatonphil.com)
 (TXT) w3m dump (notes.eatonphil.com)
        
       | cube2222 wrote:
       | Really nice blog post! Especially because it actually builds
       | something that works in the end, while keeping stuff fairly
       | simple.
       | 
       | Another approach could be using PostgreSQL as a stateless
       | scalable query execution engine, while stubbing out the storage
       | layer with a remote replicated consistent storage solution.
       | 
       | This is also what I think Aurora does.
        
       | cdumler wrote:
       | Is this going down the route of YugabyteDB[1]?
       | 
       | [1] https://www.yugabyte.com
        
       | vitruvvius wrote:
        | Now extend it into Raft atop RDMA :)
        
       | bogomipz wrote:
        | I really enjoyed this post. I had a question: is the
        | raft-boltdb instance the actual Raft log that the SQL commands
        | are being written to, then?
        | 
        | It looks like the Apply method is responsible for writing the
        | SQL statements to the Raft log as well as executing the SQL
        | queries. Is waiting for a quorum of writes to the Raft log by
        | the other members not needed? Or is this all just handled
        | under the hood by the Raft libraries being used?
        
         | otoolep wrote:
         | It's all handled by the Raft libraries (I know this because
         | rqlite uses the same Raft library at this blog post). When
         | integrating with Hashicorp's Raft code, your code must
         | implement Apply()[1]. The Hashicorp code promises not to call
         | Apply() until a quorum of nodes have been persisted the data to
         | the Raft log. That way when your code is called to modify your
         | state machine (FSM), quorum has already been achieved by the
         | Raft system.
         | 
         | Architecturally (in rqlite's case) a node looks something like
         | this:
         | https://docs.google.com/presentation/d/1Q8lQgCaODlecHa2hS-Oe...
         | 
         | [1] https://pkg.go.dev/github.com/hashicorp/raft#FSM
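          | 
          | Roughly, the contract looks like this (a sketch, not the
          | post's or rqlite's exact code; execSQL is a made-up helper
          | standing in for whatever your storage layer does):
          | 
          |     package store
          |     
          |     import (
          |         "io"
          |     
          |         "github.com/hashicorp/raft"
          |     )
          |     
          |     // sqlFSM is the state machine handed to raft.NewRaft().
          |     // The library only calls Apply() after the log entry
          |     // has been committed, i.e. replicated to a quorum.
          |     type sqlFSM struct{}
          |     
          |     func (f *sqlFSM) Apply(l *raft.Log) interface{} {
          |         // l.Data is whatever the leader passed to
          |         // raft.Apply(); assumed here to be a SQL statement.
          |         return execSQL(string(l.Data))
          |     }
          |     
          |     func (f *sqlFSM) Snapshot() (raft.FSMSnapshot, error) {
          |         return nil, nil // snapshotting elided in this sketch
          |     }
          |     
          |     func (f *sqlFSM) Restore(rc io.ReadCloser) error {
          |         return rc.Close() // restore elided in this sketch
          |     }
          |     
          |     // execSQL is hypothetical: run the statement against
          |     // the local storage engine (bbolt in the post's case).
          |     func execSQL(stmt string) error { return nil }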
        
           | bogomipz wrote:
           | Thanks. So this embedding of a database engine into the Raft
           | FSM is the general pattern then?
           | 
            | It looks like Phil's post uses boltdb for the Postgres
            | storage engine as well as for the Raft log (via
            | Hashicorp's Raft library implementation).
            | 
            | Thanks for the link to the slides as well. I've seen rqlite
            | mentioned a few times in the last few weeks, so it was on
            | my short list of things to read up on.
        
             | otoolep wrote:
             | Yes, embedding any kind of store can work. For example,
             | here is an in-memory KV store embedded in Raft:
             | 
             | https://github.com/otoolep/hraftd
             | 
             | This is kind of a reference use of Hashicorp's Raft.
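              | 
              | The write path in that style is small too: serialise a
              | command and hand it to Raft on the leader, then wait on
              | the future. A loose sketch (not hraftd's exact code; the
              | command shape is an assumption):
              | 
              |     package kv
              |     
              |     import (
              |         "encoding/json"
              |         "time"
              |     
              |         "github.com/hashicorp/raft"
              |     )
              |     
              |     // set replicates a key/value write through Raft. ra
              |     // is the node's *raft.Raft handle; only the leader
              |     // may call Apply.
              |     func set(ra *raft.Raft, key, value string) error {
              |         cmd, err := json.Marshal(map[string]string{
              |             "op": "set", "key": key, "value": value,
              |         })
              |         if err != nil {
              |             return err
              |         }
              |         // The returned future resolves once the entry
              |         // is committed and the FSM has applied it.
              |         return ra.Apply(cmd, 10*time.Second).Error()
              |     }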
        
       | remram wrote:
        | distributed _Postgres-wire-compatible_ proof of concept
        
       | KingOfCoders wrote:
        | Time might be up (saying this as a long-term PG user): with
        | Litestream adding distributed features to SQLite, local
        | in-process databases might be the next thing.
        
         | nijave wrote:
          | Litestream allows scale-up application architectures,
          | whereas Postgres allows scale-out architectures (by only
          | scaling up the database portion instead of also scaling up
          | the application servers).
          | 
          | It's basically the same thing as monolith vs. microservices,
          | but extending the monolith to the data persistence layer.
          | With horizontally scaling apps being the predominant
          | architecture right now, I don't really see Litestream
          | changing much.
          | 
          | If you're going to horizontally scale Litestream with
          | multiple writers, you're going to end up introducing all the
          | network and synchronization pieces Postgres architectures
          | already have.
        
         | cromd wrote:
          | Local in-process databases might be a bigger thing, but
          | won't they do nothing to help cases where 10 different
          | applications on different machines need to read and write
          | the same database? I would think the target audiences are
          | very different, i.e. people who are finding non-distributed
          | PG insufficient probably can't alleviate their problems by
          | switching to SQLite.
        
           | KingOfCoders wrote:
            | It depends on your access pattern. If every application is
            | write-heavy against the same data set, this won't work. If
            | every application is light on writes and those are mostly
            | local (as for many SaaS companies), this will work fine
            | with the future Litestream that redirects writes to one db
            | and then distributes the data (if you can live with
            | slightly stale data). It also depends on the size of your
            | company. Slack does 300k messages/s; this won't work :-)
        
             | cromd wrote:
             | Oh, maybe I am behind on this "future litestream" that
             | "redirects writes". I only knew of the ability to
             | sync/backup/restore. Do you have a link to something
             | discussing those other features?
        
               | ankraft wrote:
               | It doesn't redirect writes but you can create read-only
               | replicas: https://github.com/benbjohnson/litestream-read-
               | replica-examp...
        
         | SahAssar wrote:
         | Litestream still only does single-master, right? To me it feels
         | like the next step up would be something like foundationdb
         | (true multi-master) with a good psql layer.
        
       | justinsaccount wrote:
       | Interesting. I figured this would be like rqlite but on top of
       | postgresql instead, but this builds a simple sql layer on top of
       | bbolt.
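        | 
        | For a sense of what "a simple sql layer on top of bbolt"
        | bottoms out in: parsed rows end up as key/value puts. A sketch
        | (the bucket/key layout and JSON encoding here are made up, not
        | the post's actual scheme):
        | 
        |     package rows
        |     
        |     import (
        |         "encoding/json"
        |     
        |         bolt "go.etcd.io/bbolt"
        |     )
        |     
        |     // writeRow stores one row as a JSON blob keyed by rowID
        |     // inside a per-table bucket.
        |     func writeRow(db *bolt.DB, table, rowID string,
        |         row map[string]interface{}) error {
        |         return db.Update(func(tx *bolt.Tx) error {
        |             b, err := tx.CreateBucketIfNotExists([]byte(table))
        |             if err != nil {
        |                 return err
        |             }
        |             v, err := json.Marshal(row)
        |             if err != nil {
        |                 return err
        |             }
        |             return b.Put([]byte(rowID), v)
        |         })
        |     }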
        
         | otoolep wrote:
         | I was half-thinking about taking the Postgres wire protocol
         | code[1] that CockroachDB uses, and adding it to rqlite[2]. Then
         | you could talk to rqlite using Postgres tooling (perhaps). The
         | only thing is that rqlite doesn't support distributed
         | transactions in the usual sense, so there might be some
         | difficulty mapping Postgres transaction control to rqlite. But
         | still, could be interesting.
         | 
         | [1] https://github.com/jackc/pgproto3
         | 
         | [2] https://github.com/rqlite/rqlite
         | 
         | Disclaimer: I am the creator of rqlite.
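          | 
          | For flavour, the server side of pgproto3 is pleasantly
          | small. Here's a sketch of a minimal handshake-plus-query
          | loop (not the post's or CockroachDB's code; SSL is simply
          | refused and the SQL text is dropped on the floor):
          | 
          |     package pgfront
          |     
          |     import (
          |         "net"
          |     
          |         "github.com/jackc/pgproto3/v2"
          |     )
          |     
          |     // serveConn speaks just enough of the Postgres wire
          |     // protocol to accept a psql connection and acknowledge
          |     // simple queries.
          |     func serveConn(conn net.Conn) error {
          |         defer conn.Close()
          |         be := pgproto3.NewBackend(
          |             pgproto3.NewChunkReader(conn), conn)
          |     
          |         // Startup: refuse SSL if offered, accept everything
          |         // else without authentication.
          |         for {
          |             msg, err := be.ReceiveStartupMessage()
          |             if err != nil {
          |                 return err
          |             }
          |             if _, ok := msg.(*pgproto3.SSLRequest); ok {
          |                 conn.Write([]byte("N")) // "no SSL"
          |                 continue
          |             }
          |             break
          |         }
          |         be.Send(&pgproto3.AuthenticationOk{})
          |         be.Send(&pgproto3.ReadyForQuery{TxStatus: 'I'})
          |     
          |         for {
          |             msg, err := be.Receive()
          |             if err != nil {
          |                 return err
          |             }
          |             switch m := msg.(type) {
          |             case *pgproto3.Query:
          |                 // m.String is the SQL text; a real server
          |                 // would hand it to the consensus/storage
          |                 // layer here instead of ignoring it.
          |                 _ = m.String
          |                 be.Send(&pgproto3.CommandComplete{
          |                     CommandTag: []byte("OK")})
          |                 be.Send(&pgproto3.ReadyForQuery{TxStatus: 'I'})
          |             case *pgproto3.Terminate:
          |                 return nil
          |             }
          |         }
          |     }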
        
       | awinter-py wrote:
       | > What is CockroachDB under the hood? ... wire protocol, a
       | storage layer, a Raft implementation, PostgreSQL's grammar
       | definition. ... To be absurdly reductionist, CockroachDB is just
       | the glue around these libraries
       | 
        | ^ semi off-topic, but I love the idea of distinguishing
        | certain projects as 'just glue'. (Not a diss; glue matters.)
        | Especially interesting in the context of OSS tools whiteboxed
        | by cloud vendors with 'proprietary glue'.
        
         | eatonphil wrote:
         | I said absurdly reductionist and definitely meant that. :)
         | There is a ton going on under the hood to make it work well, to
         | fully implement postgres, and to work efficiently.
        
           | jchw wrote:
           | I don't think we should be saying "fully implement
           | PostgreSQL." People are feeling burned by this notion because
           | it sounds like a drop-in replacement, but it isn't really
           | that simple. Being wire compatible is mostly just convenient,
           | since you can reuse existing drivers. But existing software
           | built on top of PostgreSQL won't necessarily work out of the
           | box...
           | 
            | Maybe this is a bit nitpicky, but still.
        
             | eatonphil wrote:
              | I've written about the "world of postgres wire
              | compatible" [0], and yes, there are some databases that
              | implement the protocol and don't implement Postgres
              | (see: immudb and ravendb). I give databases like
              | CockroachDB more credit for making their intention clear
              | and generally following through with implementing
              | Postgres. Also, unlike some others on my list,
              | CockroachDB really is meant as an OLTP replacement.
              | Performance characteristics and operations will
              | definitely differ, but as solely a user/application
              | developer you shouldn't notice as much.
             | 
             | [0]
             | https://datastation.multiprocess.io/blog/2022-02-08-the-
             | worl...
        
         | [deleted]
        
       | truth_seeker wrote:
        | I would rather use LOGICAL replication to achieve a 1-master/
        | 1-slave, 1-master/multi-slave, or multi-master/multi-slave
        | configuration, after working out schema design details and
        | read-write query patterns across tables.
        
       | 3iggy wrote:
        | I was expecting a guerrilla db instead of a cockroach db. Like
        | serialising the db for Dropbox. This is like a drifter camping
        | on private land... you can say it isn't really decentralised,
        | but I'll just move somewhere else.
        
         | eloff wrote:
         | The word used in the title and article is distributed not
         | decentralized.
        
           | fsckboy wrote:
           | not challenging you, just asking, what is the distinction
           | you're drawing between distributed and decentralized?
           | 
           | one idea would be, the db is completely synchronized on 3
           | redundant servers, so it's decentralized
           | 
           | each column of a table is stored on a different server, so
           | it's distributed...?
        
             | ragona wrote:
              | I'd say distributed is about redundancy and scale within
              | a single entity's infrastructure, whereas decentralized
              | implies multiple entities working together.
        
               | jen20 wrote:
               | Decentralised implies lack of a global leader. However,
               | each raft group has a leader, so you can only take this
               | so far in a multi-group raft design.
        
       | wiz21c wrote:
        | Sidenote: implementing Raft myself was really tough. The
        | original paper has 2 or 3 obscure points which were really
        | hard to figure out...
        
         | [deleted]
        
         | mjb wrote:
         | It is tough.
         | 
         | My approach when learning new protocols like Raft or Paxos is
          | to implement them in PlusCal (TLA+'s higher-level language) or
         | P (https://github.com/p-org/P). I've found that helps separate
         | the protocol-level concerns from the implementation-level
         | concerns (sockets? wire format?) in a way that reduces the
         | difficulty of learning the protocol.
        
         | eatonphil wrote:
         | Link to your implementation or notes? :)
        
         | tynpeddler wrote:
          | Echoing the other comments: how did you configure your test
          | suite to check for correctness?
        
           | SahAssar wrote:
           | Something something jepsen test?
        
         | emteycz wrote:
         | Please share the source! I'd appreciate it very much.
        
       | ilovepostgres9 wrote:
        
       | azurezyq wrote:
        | Raft is the secret sauce behind many distributed databases,
        | like CockroachDB, TiDB, Yugabyte, etc.
        | 
        | But to make it scalable and production-grade, you have to run
        | multiple Raft operations in parallel _and_ manage them well,
        | e.g., sharding them correctly to keep the load balanced and
        | adjusting dynamically.
       | 
       | Here's a pleasant article to read:
       | https://en.pingcap.com/blog/how-tikv-reads-and-writes/
        
       | rad_gruchalski wrote:
        | There's also YugabyteDB, which is actual Postgres with a
        | distributed flex. It even supports regular Postgres
        | extensions. Anything in Golang imitating Postgres will be
        | subpar for years to come.
        | 
        | YugabyteDB splits tables into tablets (partitions); every
        | tablet has its own Raft group. It's a modified RocksDB under
        | the hood.
       | 
       | OTOH, it's cool to see how someone builds something like this
       | from scratch and is willing to talk about it.
        
       ___________________________________________________________________
       (page generated 2022-05-17 23:00 UTC)