[HN Gopher] Rqlite: The lightweight, distributed relational data...
       ___________________________________________________________________
        
       Rqlite: The lightweight, distributed relational database built on
       SQLite
        
       Author : pavanyara
       Score  : 175 points
       Date   : 2021-01-22 13:48 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mech422 wrote:
       | So it looks like you can now distribute SQLite at the:
       | 
       | Stmt level: https://github.com/rqlite/rqlite
       | 
       | VFS Level: https://github.com/canonical/dqlite
       | 
       | Block Level: https://github.com/benbjohnson/litestream
       | 
       | Really cool enhancements to an awesome project!
        
       | f430 wrote:
       | Could you use this to build a decentralized p2p app? If so, what
       | gotchas and limitations are there?
        
         | otoolep wrote:
         | No, rqlite is not suitable for that kind of application. All
         | writes must go through the leader.
        
       | hermitcrab wrote:
       | So if:
       | 
       | Alice, Bob and Charlie have a synced copy of the same database
       | 
        | Charlie goes on a plane and adds loads of records without a
       | connection to the other databases
       | 
       | Alice and Bob make no changes
       | 
       | Charlie comes home and syncs
       | 
       | Will Charlie lose all his changes, as his database is different
       | to Alice and Bob's?
       | 
        | What happens if Alice, Bob and Charlie all make changes offline
       | then resync?
        
         | mrkurt wrote:
         | It doesn't work that way. rqlite is effectively a
         | leader/follower model that uses raft for leader consensus.
         | Writes can only happen online, and only to the leader.
        
           | hermitcrab wrote:
           | Ok, thanks!
        
         | teraflop wrote:
         | As per the description, all updates must be replicated to a
         | quorum of instances. If Charlie is on a plane without the
         | ability to contact a quorum, he can't add records in the first
         | place.
         | 
         | This is a fundamentally necessary tradeoff to provide strong
         | consistency, as described by the CAP theorem.
        
           | unnouinceput wrote:
           | In which case Charlie will have an additional local DB to
           | record those records and when he gets back will use another
           | protocol/method/system/whatever to add those new records? How
            | about if everybody goes and adds records on the same table?
           | 
           | Here is a real life scenario that I had to deal with in the
            | past. A technician (carpenter) goes to a client's home to repair
           | furniture in the middle of nowhere, so no internet. Adds the
           | necessary paperwork which is pictures, declarations, contract
           | (signed and scanned) to the Clients table. This company was
           | employing hundreds of such technicians throughout the many
           | counties of Germany, each with a laptop on them running this
            | app which was the backbone for getting paid/doing the work.
            | And it was not uncommon to have more than one carpenter go to
            | a client's home and do the repairs. Since each carpenter was
            | paid according to his own work, each of them would create
            | entries
           | in their local Clients table and when getting back to HQ
           | their work was manually uploaded to central DB, and only
           | after that they got paid. I automated that (that was the job,
           | to eliminate the thousands of hours that carpenters were
           | wasting manually).
           | 
           | So given the above scenario, how is this system going to
           | achieve that? Same table, and same client details even in
           | table Clients, just different rows for different carpenters
           | (foreign key to table Carpenters).
        
             | wtallis wrote:
             | > So given the above scenario, how is this system going to
             | achieve that?
             | 
             | I don't think it is. You're describing a use case that is
             | distributed but explicitly does not want to enforce
             | consistency--you want offline workers to all be able to
             | keep working, and you're enforcing consistency after the
             | fact and outside of the database itself.
        
             | renewiltord wrote:
             | This tool does not handle that problem. It is not meant to.
             | It's for simultaneously available replicas. And this is the
             | rare moment where thinking about replication vs
             | synchronization as different is worthwhile.
             | 
             | You usually replicate for failure tolerance and performance
             | (this project only aims for the former).
        
             | vorpalhex wrote:
              | As other commenters have mentioned, this tool is not
              | intended for that kind of case. You want a tool like
              | PouchDB, which handles this kind of setup but has a
              | different set of tradeoffs (it's eventually consistent,
              | not strongly consistent).
        
         | adsharma wrote:
         | Charlie doesn't have to lose the data he saved on the plane.
         | Don't know what the rqlite implementation does.
         | 
         | In the second case, Alice-Bob consensus overrides Charlie
        
         | [deleted]
        
         | fipar wrote:
          | I have not read the full description of this project yet, but
          | it does mention the use of Raft for consensus, so in your
          | example, I would expect Charlie to not be able to add any
          | records while being disconnected, because, if my understanding
          | is correct:
          | 
          | - Charlie would either be the leader, but then without getting
          | confirmation of writes from enough followers, he would not be
          | able to do any writes himself, or
          | 
          | - Charlie would be a follower, and while disconnected would
          | obviously get no writes from the leader.
        
         | NDizzle wrote:
         | What's messed up is that I was doing this kind of thing with
         | Lotus Domino in the late 90s. I'm sure others were doing it
         | before me, too.
         | 
         | Sometimes you had conflicts that needed resolution, but those
         | weren't that frequent for our use case.
        
       | mshenfield wrote:
       | Going from a local db to one over a network has at least one
        | risk. The SQLite docs give developers the okay to write "n+1"
        | style queries (https://www.sqlite.org/np1queryprob.html). When
        | the db is on the same file system as the application this pattern
        | is fine. But as soon as you add a network call it becomes a
        | potential bottleneck.
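        | 
        | A minimal sketch of the difference (hypothetical users/orders
        | schema, plain database/sql in Go, nothing rqlite-specific):
        | 
        |     package example
        | 
        |     import "database/sql"
        | 
        |     // nPlusOne issues one query per user: fine against a local
        |     // file, but N extra round trips over a network.
        |     func nPlusOne(db *sql.DB) error {
        |         rows, err := db.Query(`SELECT id FROM users`)
        |         if err != nil {
        |             return err
        |         }
        |         defer rows.Close()
        |         for rows.Next() {
        |             var id, n int64
        |             if err := rows.Scan(&id); err != nil {
        |                 return err
        |             }
        |             err = db.QueryRow(`SELECT COUNT(*) FROM orders
        |                 WHERE user_id = ?`, id).Scan(&n)
        |             if err != nil {
        |                 return err
        |             }
        |         }
        |         return rows.Err()
        |     }
        | 
        |     // singleQuery fetches the same data in one round trip.
        |     func singleQuery(db *sql.DB) (*sql.Rows, error) {
        |         return db.Query(`SELECT u.id, COUNT(o.id)
        |             FROM users u
        |             LEFT JOIN orders o ON o.user_id = u.id
        |             GROUP BY u.id`)
        |     }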
        
       | alberth wrote:
       | @otoolep
       | 
       | SQLite has a great post on "When to Use" (and not use) SQLite.
       | 
        | Would be great if you included these same use cases in the ReadMe
        | docs and made it clear if Rqlite can address them.
       | 
       | https://www.sqlite.org/whentouse.html
        
       | scottlamb wrote:
       | It looks like this layers Raft on top of SQLite. I don't like
       | when systems replicate high-level changes like "update users set
       | salary = salary + 1000 where ...;" Instead, I prefer they
       | replicate low-level changes like "replace key/block X, which
       | should have contents C_x, with C_x'".
       | 
        | Why? Imagine you're doing a rolling update. Some of your replicas
       | are running the newer version of SQLite and some are running the
       | older version. They may not execute the high-level query in
       | exactly the same way. For example, in the absence of an "order
       | by" clause, select results' order is unstated. So imagine someone
       | makes a mutation that depends on this: "insert ... select ...
       | limit". (Maybe a dumb example but it can happen anyway.) Now the
       | databases start to diverge, not only in underlying bytes and
       | implementation-defined ordering but in actual row data.
       | 
       | I worked on a major distributed system that originally replicated
       | high-level changes and switched to replicating low-level changes
       | for this reason. We had a system for detecting when replicas
       | didn't match, and replication of high-level changes was the
       | biggest reason for diffs. (Hardware error was the second biggest
       | reason; we added a lot of checksumming because of that.)
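        | 
        | A sketch of the hazard (hypothetical tables, plain database/sql
        | in Go): with statement-based replication each replica re-runs
        | the unordered SELECT itself, so different SQLite versions may
        | pick different rows; page- or block-level replication ships the
        | leader's resulting bytes instead.
        | 
        |     package example
        | 
        |     import "database/sql"
        | 
        |     // No ORDER BY, so the 10 chosen rows are not guaranteed to
        |     // be the same across engine versions.
        |     const stmt = `INSERT INTO archive
        |         SELECT * FROM events LIMIT 10`
        | 
        |     func applyOnEachReplica(replicas []*sql.DB) error {
        |         for _, db := range replicas {
        |             if _, err := db.Exec(stmt); err != nil {
        |                 return err
        |             }
        |         }
        |         return nil
        |     }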
        
         | lrossi wrote:
         | If you replicate low level changes, you might not be able to do
         | a live upgrade/downgrade if the version change affects the on
         | disk format.
         | 
         | Another downside is that you might also propagate data
         | corruption in case of bugs in the DB software (e.g. memory
         | corruption) or hardware defects.
        
           | scottlamb wrote:
           | > If you replicate low level changes, you might not be able
           | to do a live upgrade/downgrade if the version change affects
           | the on disk format.
           | 
           | It certainly requires care to ensure all the replicas have
           | software capable of understanding the new format before it's
           | actually written, but it can be done. Likewise after writing
           | the new format, you want to have a roll-back plan.
           | 
           | In SQLite's case, https://sqlite.org/formatchng.html says:
           | "Since 2004, there have been enhancements to SQLite such that
           | newer database files are unreadable by older versions of the
           | SQLite library. But the most recent versions of the SQLite
           | library should be able to read and write any older SQLite
           | database file without any problems." I don't believe
           | upgrading SQLite automatically starts using any of those
           | enhancements; you'd have to do a schema change like "PRAGMA
           | journal_mode=WAL;" first.
           | 
           | > Another downside is that you might also propagate data
           | corruption in case of bugs in the DB software (e.g. memory
           | corruption) or hardware defects.
           | 
           | This happens regardless.
        
         | otoolep wrote:
         | rqlite creator here.
         | 
         | I understand what you're saying, but I don't think it's a
         | compelling objection. Obviously, differences between versions
          | -- even patched versions -- can result in subtle, unintended
          | differences in how the code works for a given program. But
          | there is no reason to think a system that operates at a lower
          | level ("replace key/block X, which should have contents C_x,
          | with C_x'") is less vulnerable to this kind of issue, compared
          | to one that operates at a higher level, i.e. statement-based
          | replication, which rqlite uses. In fact I would argue that a
          | system operating at a higher level of abstraction is _less_
          | vulnerable to such subtle changes.
        
         | [deleted]
        
         | xd wrote:
          | In MySQL/MariaDB this is what's known as non-deterministic
          | behaviour, so row-based or mixed replication is used to
          | mitigate it.
          | 
          | Statement-based (high-level) replication is very useful for
          | e.g. "insert into tbl0 select col0 from tbl1 order by col1", as
          | you would only need to send the query, not the individual row
          | data.
        
         | tyingq wrote:
         | Dqlite replicates at the VFS layer of sqlite, which sounds like
         | what you're looking for. https://github.com/canonical/dqlite
        
           | hinkley wrote:
           | I haven't gotten a straight answer out of the k3s people
           | about why they dumped dqlite, just some comment about bugs.
           | 
           | I could see myself using dqlite in the future so I'd like
           | some more user reports from the trenches. Anyone shed some
           | light on this?
        
             | tyingq wrote:
             | The initial issue seems to be saying that it's because they
             | need to have etcd anyway, so consolidating on that removes
             | a dependency. Which fits their simplicity goal. Though the
             | issue appears to have been created by a user, not a
             | maintainer.
             | 
             |  _" Since the HA direction needs etcd anyway.. I'm
             | proposing dropping support for sqlite as the default
             | embedded non-HA option and switch to embedded etcd as the
             | default. This will reduce overall effort of maintainability
             | of two entirely different datastores."_
             | 
             | https://github.com/k3s-io/k3s/issues/845
        
               | hinkley wrote:
               | I accepted that reason for them, but as I don't benefit
               | directly from switching to etcd, I'd rather know about
               | what started the conversation than how it was concluded.
        
               | merb wrote:
                | dqlite support was flaky and went through their
                | translation layer, which probably added complexity.
        
         | ttul wrote:
         | Sounds like you need to submit a patch!
        
           | scottlamb wrote:
           | lol, this is an unhelpful reflex answer to any criticism of
           | open source software. What I'm describing is a redesign.
           | There's no point in submitting a patch for that. It'd be much
           | more effort than a few lines of code, and it wouldn't be
           | accepted. Open source means that anyone can fork. It doesn't
           | mean that maintainers will automatically merge patches
           | replacing their software with completely different software.
           | The only way that will happen is if the maintainers decide
           | for themselves it needs to be redesigned, and that starts
           | with discussion rather than a patch. It's also a long shot
           | compared to just finding different software that already has
           | the design I prefer.
           | 
           | If I want replicated SQLite, I'll look at dqlite or
           | litestream instead, which sound more compatible with my
           | design sensibilities. (Thanks, tyingq and benbjohnson!)
        
             | monadic3 wrote:
             | Frankly your bad-faith commentary isn't helping the
             | conversation either. I sincerely appreciate your cleaning
             | up the tone at the end.
        
         | benbjohnson wrote:
         | I just open-sourced a streaming replication tool for SQLite
         | called Litestream that does physical replication (raw pages)
         | instead of logical replication (SQL commands). Each approach
         | has its pros and cons. Physical replication logs tend to be
         | larger than logical logs but I agree that you avoid a lot of
         | issues if you do physical replication.
         | 
         | https://github.com/benbjohnson/litestream
        
           | hinkley wrote:
              | Do you use compression? And if so, how does that affect the
              | relative amount of network traffic vs logical replication?
        
             | benbjohnson wrote:
             | Yes, Litestream uses LZ4 compression. I originally used
              | gzip but the compression speed was pretty slow. B-tree
              | pages tend to compress well because they are typically only
              | 50-75% full: they need space to insert new records, and
              | pages split when they get full.
             | 
             | I'm seeing files shrink down to 14% of their size (1.7MB
             | WAL compressed to 264KB). However, your exact compression
             | will vary depending on your data.
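              | 
              | A quick way to measure this against your own WAL files (the
              | lz4 package here is github.com/pierrec/lz4/v4, just one
              | common Go implementation; not a claim about Litestream's
              | internals):
              | 
              |     package example
              | 
              |     import (
              |         "bytes"
              |         "os"
              | 
              |         "github.com/pierrec/lz4/v4"
              |     )
              | 
              |     // compressedSize returns the LZ4-framed size of a
              |     // file, to compare against its on-disk size.
              |     func compressedSize(path string) (int, error) {
              |         data, err := os.ReadFile(path)
              |         if err != nil {
              |             return 0, err
              |         }
              |         var buf bytes.Buffer
              |         w := lz4.NewWriter(&buf)
              |         if _, err := w.Write(data); err != nil {
              |             return 0, err
              |         }
              |         if err := w.Close(); err != nil {
              |             return 0, err
              |         }
              |         return buf.Len(), nil
              |     }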
        
               | hinkley wrote:
                | Ah, that makes sense. Most inserts don't split pages, so
                | they touch around n pages in the worst case, but once in
                | a while you get 2n page writes where most of the pages
                | are half full, and so they compress better.
               | 
               | So how does that compare to logical replication? (Also I
               | imagine packet size plays a role, since you have to flush
               | the stream quite frequently, right? 1000 bytes isn't much
               | more expensive than 431)
        
               | benbjohnson wrote:
               | Litestream defaults to flushing out to S3 every 10
               | seconds but that's mainly because of PUT costs. Each
                | request costs $0.000005 so it costs about $1.30 per month.
               | If you flushed every second then it'd cost you $13/month.
               | 
               | Logical replication would have significantly smaller
               | sizes although the size cost isn't a huge deal on S3.
               | Data transfer in to S3 is free and so are DELETE
               | requests. The data only stays on S3 for as long as your
               | Litestream retention specifies. So if you're retaining
               | for a day then you're just keeping one day's worth of WAL
               | changes on the S3 at any given time.
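                | 
                | The arithmetic, for reference (assuming S3 PUT pricing of
                | about $0.005 per 1,000 requests; prices vary by region):
                | 
                |     package example
                | 
                |     // monthlyPutCost returns the S3 PUT cost in dollars
                |     // for one flush every intervalSec seconds over a
                |     // 30-day month.
                |     func monthlyPutCost(intervalSec float64) float64 {
                |         const perRequest = 0.000005 // $0.005 / 1000
                |         reqs := 30 * 24 * 3600 / intervalSec
                |         return reqs * perRequest
                |     }
                | 
                |     // monthlyPutCost(10) ~ $1.30
                |     // monthlyPutCost(1)  ~ $13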
        
           | mrkurt wrote:
           | I've been following this, and am anxious for the direct-to-
           | sqlite replication.
           | 
           | One of rqlite's big limitations is that it resyncs the entire
           | DB at startup time. Being able to start with a "snapshot" and
           | then incrementally replicate changes would be a big help.
        
             | CuriouslyC wrote:
             | This is tangential, but depending on your sql needs,
             | CouchDB's replication story is amazing, and you can
             | replicate to the browser using PouchDB. There is an
             | optional SQL layer, but obviously the good replication
             | story comes with some trade-offs.
        
             | otoolep wrote:
             | rqlite creator here.
             | 
             | I'm not sure I follow why it's a "big limitation"? Is it
             | causing you long start-up times? I'm definitely interested
             | in improving this, if it's an issue. What are you actually
             | seeing?
             | 
             | Also, rqlite does do log truncation (as per Raft spec), so
              | after a certain number of log entries (8192 by default)
             | node restarts work _exactly_ like you suggested. The SQLite
             | database is restored from a snapshot, and any remaining
             | Raft Log entries are applied to the database.
        
               | mrkurt wrote:
               | Ah, ok that's some nuance I didn't know about!
               | 
               | We're storing a few GB of data in the sqlite DB.
                | Rebuilding that when rqlite restarts is a slow and
                | intensive process compared to just reusing the file on
                | disk.
               | 
               | Our particular use case means we'll end up restarting
               | 100+ replica nodes all at once, so the way we're doing
               | things makes it more painful than necessary.
        
               | otoolep wrote:
               | But how do you know it's intensive? Are you watching disk
               | IO? Is there a noticeable delay when the node starts
               | before it's ready to receive requests?
               | 
               | Try setting "-raft-snap" to a lower number, maybe 1024,
                | and see if it helps. You'll have far fewer log entries
               | to apply on startup. However the node will perform a
               | snapshot more often, and writes are blocked during the
               | snapshotting. It's a trade-off.
               | 
               | It might be possible to always restart using some sort of
               | snapshot, independent of Raft, but that would add
               | significant complexity to rqlite. The fact the SQLite
               | database is built from scratch on startup, from the data
                | in the Raft log, means rqlite is much more robust.
        
               | mrkurt wrote:
               | Oh, we're reading the sqlite files directly. rqlite is
               | really just a mechanism for us to propagate read only
               | data to a bunch of clients.
               | 
               | We need that sqlite file to never go away. Even a few
               | seconds is bad. And since our replicas are spread all
               | over the world, it's not feasible to move 1GB+ data from
               | the "servers" fast enough.
               | 
               | Is there a way for us to use that sqlite file without it
               | ever going away? We've thought about hardlinking it
               | elsewhere and replacing the hardlink when rqlite is up,
               | but haven't built any tooling to do that.
        
               | otoolep wrote:
               | Hmmmm, that's a different issue. :-)
               | 
               | Today the rqlite code deletes the SQLite database (if
               | present) and then rebuilds it from the Raft log. It makes
               | things so simple, and ensures the node can always
               | recover, regardless of the prior state of the SQLite
               | database -- basically the Raft log is the only thing that
               | matters and that is _guaranteed_ to be the same under
               | each node.
               | 
               | The fundamental issue here is that Raft can only
               | guarantee that the Raft log is in consensus, so rqlite
                | can rely on that. It's always possible that one of the
                | copies of SQLite under a single node gets a different
                | state than all other nodes. This is because the change to
               | the Raft log, and corresponding change to SQLite, are not
               | atomic. Blowing away the SQLite database means a restart
               | would fix this.
               | 
               | If this is important -- and what you ask sounds
               | reasonable for the read-only case that rqlite can support
               | -- I guess the code could rebuild the SQLite database in
               | a temporary place, wait until that's done, and then
               | quickly swap any existing SQLite file with the rebuilt
               | copy. That would minimize the time the file is not
               | present. But the file has to go away at some point.
               | 
               | Alternatively rqlite could open any existing SQLite file
               | and DROP all data first. At least that way the _file_
                | wouldn't disappear, but the data in the database would
               | wink out of existence and then come back. WDYT?
        
               | mrkurt wrote:
               | Rebuilding and then moving it in place sounds pretty nice
               | to me.
        
             | jlongster wrote:
             | I built an app (https://actualbudget.com/) that uses a
             | local sqlite db and syncs changes, and that's exactly how
             | it works. It takes quite a different approach though, using
             | CRDTs to represent changes and those are synced around.
             | When a fresh client comes into play, it downloads the
             | latest sqlite snapshot from a server and then syncs up.
        
             | benbjohnson wrote:
             | That can be painful for sure. Litestream will do a snapshot
             | on startup if it detects that it can't pick up from where
             | it left off in the WAL. That can happen if Litestream is
              | shut down and another process performs a checkpoint. But
             | generally a restart will just use the existing snapshot &
             | continue with the WAL replication.
        
           | jgraettinger1 wrote:
           | Here's another approach to the problem [0]:
           | 
           | This package is part of Gazette [1], and uses a gazette
           | journal (known as a "recovery log") to power raw bytestream
           | replication & persistence.
           | 
           | On top of journals, there's a recovery log "hinting"
           | mechanism [2] that is aware of file layouts on disk, and
           | keeps metadata around the portions of the journal which must
           | be read to recover a particular on-disk state (e.x. what are
           | the current live files, and which segments of the log hold
           | them?). You can read and even live-tail a recovery log to
           | "play back" / maintain the on-disk file state of a database
           | that's processing somewhere else.
           | 
            | Then, there's a package providing RocksDB with a Rocks
            | environment that's configured to transparently replicate all
            | database file writes into a recovery log [3]. Because RocksDB
            | is a continuously compacted LSM-tree and we're tracking
            | live files, it regularly deletes files, which allows
            | "dropping" the chunks of the recovery log journal that would
            | otherwise have to be read or stored to recover the full
            | database.
           | 
           | For the SQLite implementation, SQLite journals and WAL's are
           | well-suited to recovery logs & their live file tracking,
           | because they're short-lived ephemeral files. The SQLite page
           | DB is another matter, however, because it's a super-long
           | lived and randomly written file. Naively tracking the page DB
           | means you must re-play the _entire history_ of page mutations
           | which have occurred.
           | 
           | This implementation solves this by using a SQLite VFS which
           | actually uses RocksDB under the hood for the SQLite page DB,
           | and regular files (recorded to the same recovery log) for
           | SQLite journals / WALs. In effect, we're leveraging RocksDB's
           | regular compaction mechanisms to remove old versions of
           | SQLite pages which must be tracked / read & replayed.
           | 
           | [0] https://godoc.org/go.gazette.dev/core/consumer/store-
           | sqlite
           | 
           | [1] https://gazette.readthedocs.io/en/latest/
           | 
           | [2] https://gazette.readthedocs.io/en/latest/consumers-
           | concepts....
           | 
           | [3] https://godoc.org/go.gazette.dev/core/consumer/store-
           | rocksdb
        
           | webmaven wrote:
           | _> I just open-sourced a streaming replication tool for
           | SQLite called Litestream that does physical replication (raw
           | pages) instead of logical replication (SQL commands). Each
           | approach has its pros and cons. Physical replication logs
           | tend to be larger than logical logs but I agree that you
           | avoid a lot of issues if you do physical replication._
           | 
           | Hmm. Not having dug into your solution much, is it safe to
           | say that the physical replication logs have something like
           | logical checkpoints? If so, would it make sense to only keep
           | physical logs on a relatively short rolling window, and
           | logical logs (ie. only the interleaved logical checkpoints)
           | longer?
        
             | benbjohnson wrote:
             | I suppose you could save both the physical and logical logs
              | if you really needed long-term retention. SQLite databases
             | (and b-trees in general) tend to compress well so the
             | physical logging isn't as bad as it sounds. You could also
             | store a binary diff of the physical page which would shrink
             | it even smaller.
             | 
             | One benefit to using physical logs is that you end up with
             | a byte-for-byte copy of the original data so it makes it
             | easy to validate that your recovery is correct. You'd need
             | to iterate all the records in your database to validate a
             | logical log.
             | 
             | However, all that being said, Litestream runs as a separate
             | daemon process so it actually doesn't have access to the
             | SQL commands from the application.
        
       | szszrk wrote:
       | rqlite is mentioned here quite often, multiple times last year. I
       | don't think this entry brings anything new.
        
       | foolinaround wrote:
       | We currently use browsers on several devices (both laptops and
       | android) and rely on google sync currently. Maybe this could be
       | used to sync bookmarks, history etc across my devices but still
       | keep my data local to me?
        
         | JoachimSchipper wrote:
         | This uses Raft, so a quorum of devices would need to be online
         | at the same time. That's not what you want for browser sync.
        
       | blackbear_ wrote:
       | I know nothing of consensus algorithms and distributed systems so
       | bear with me please.
       | 
       | > rqlite uses Raft to achieve consensus across all the instances
       | of the SQLite databases, ensuring that every change made to the
       | system is made to a quorum of SQLite databases, or none at all.
       | 
       | What I understood from this sentence is that, if we have three
       | instances, rqlite will make sure that every change is written to
       | at least two. But what if two changes are written to two
       | different pairs of instances? Then the three instances will have
       | three different versions of the data. For example, change X is
       | written to instances A and B, and change Y is written to B and C.
       | Now A has X, B has X and Y, and C has Y only. How do you decide
       | who is right?
        
         | edoceo wrote:
         | Raft consensus: https://raft.github.io/
         | 
          | Surprisingly easy to understand, and a cool visual.
        
         | teraflop wrote:
         | In brief: at any point in time, one of the replicas is a
         | "leader" which controls the order in which operations are
         | committed. The changes occur in a defined sequence, and other
         | replicas may lag behind the leader, but cannot be inconsistent
         | with it.
         | 
         | Your example can't happen, because if (for instance) A is the
         | leader, then C will not apply change Y without contacting the
         | leader, which will tell it to apply X first.
         | 
         | If you want more details about how this handles all the edge
         | cases -- for instance, what happens if the leader crashes --
         | the Raft paper is quite accessible:
         | https://raft.github.io/raft.pdf
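          | 
          | A highly simplified sketch (in Go) of the property described
          | above, with terms, elections and log repair omitted: a single
          | leader assigns every write a log index, and an entry only
          | becomes committed once a majority of the cluster has it.
          | 
          |     package example
          | 
          |     type Entry struct {
          |         Index int
          |         Cmd   string // e.g. an SQL statement in rqlite
          |     }
          | 
          |     // Follower applies entries strictly in index order.
          |     type Follower interface {
          |         Append(e Entry)
          |     }
          | 
          |     type Leader struct {
          |         log       []Entry
          |         followers []Follower
          |         acks      map[int]int
          |         committed int // highest committed index
          |         cluster   int // total nodes, leader included
          |     }
          | 
          |     func NewLeader(fs []Follower) *Leader {
          |         return &Leader{
          |             followers: fs,
          |             acks:      map[int]int{},
          |             cluster:   len(fs) + 1,
          |         }
          |     }
          | 
          |     // Propose appends a command at the next index and sends
          |     // it to every follower.
          |     func (l *Leader) Propose(cmd string) int {
          |         e := Entry{Index: len(l.log) + 1, Cmd: cmd}
          |         l.log = append(l.log, e)
          |         for _, f := range l.followers {
          |             f.Append(e)
          |         }
          |         return e.Index
          |     }
          | 
          |     // Ack records a follower acknowledgement; once the leader
          |     // plus a majority of followers hold the entry, everything
          |     // up to that index is committed.
          |     func (l *Leader) Ack(index int) {
          |         l.acks[index]++
          |         if l.acks[index]+1 >= l.cluster/2+1 &&
          |             index > l.committed {
          |             l.committed = index
          |         }
          |     }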
        
           | hinkley wrote:
           | TL;DR: Raft updates are serialized (as in sequential).
        
         | whizzter wrote:
         | The semantics of Raft has a "simple" (compared to the harder to
         | understand Paxos) forward motion of events that is supposed to
         | guarantee that you won't get into weird states regardless of if
          | any particular node(s) goes down (I think it can survive (N/2)-1
         | dead machines in a cluster of N).
         | 
         | Raft is based on having a leader decide what the next COMMIT is
         | going to be, so B could never have X and Y at the same time
         | (they could both be queued but other mechanisms could reject
         | them).
         | 
         | Also data is not considered committed until more than half the
         | cluster has acknowledged it (at which point the leader will
         | know it and handle going forward), leader election also works
         | in a similar way iirc.
         | 
         | As others mentioned, the visualization on
          | https://raft.github.io/ is really good (you can interact with
          | it to create commits and control downtime of machines).
        
           | hinkley wrote:
           | It's 1/2 + 1 isn't it? So if the leader goes down at the
              | exact moment of quorum, you can still get quorum again.
           | 
           | That would mean in 3 servers you need 2.5 aka 3 machines to
           | commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a
           | wash anyway, because as the servers go up the fraction you
           | need for quorum goes down, but the odds of falling behind or
           | failing outright go up too. Not to mention the time during
           | which 1/n machines are down due to an upgrade gets longer and
           | longer the more machines you have, increasing the chances of
           | double fault.
        
             | simtel20 wrote:
             | > It's 1/2 + 1 isn't it?
             | 
             | The parent post is talking about the number that can go
             | down while maintaining quorum, and you're talking about the
             | number that need to remain up to maintain quorum. So you're
             | both correct.
             | 
             | However:
             | 
             | > That would mean in 3 servers you need 2.5 aka 3 machines
             | to commit a change.
             | 
             | That seems wrong. You need N//2 +1 where "//" is floor
             | division, so in a 3 node cluster, you need 3//2 +1, or 1+1
             | or 2 nodes to commit a change.
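              | 
              | The same rule in Go (integer division already floors for
              | positive values):
              | 
              |     package example
              | 
              |     // quorum returns the minimum number of nodes that
              |     // must agree in an n-node cluster.
              |     func quorum(n int) int { return n/2 + 1 }
              | 
              |     // quorum(3) == 2, quorum(4) == 3,
              |     // quorum(5) == 3, quorum(7) == 4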
        
               | hinkley wrote:
               | I think I see the problem.
               | 
               | 'Simple majority' is based on the number of the machines
               | that the leader knows about. You can only change the
               | membership by issuing a write. Write quorum and
               | leadership quorum are two different things, and if I've
               | got it right, they can diverge after a partition.
               | 
               | I'm also thinking of double faults, because the point of
               | Raft is to get past single fault tolerance.
               | 
               | [edit: shortened]
               | 
               | After a permanent fault (broken hardware) in a cluster of
               | 5, the replacement quorum member can't vote for writes
               | until it has caught up. It can vote for leaders, but it
               | can't nominate itself. Catching up leaves a window for
               | additional faults.
               | 
                | It's always 3/5 for writes and elections; the difference
                | is that the _ratio_ of original machines that have to
                | confirm a write can go to 100% of survivors, instead of
                | the 3/4 of reachable machines. Meaning network jitter
                | and packet loss slow down writes until the network
                | recovers, and an additional partition can block writes
                | altogether, even with 3/5 surviving the partition.
        
             | teraflop wrote:
             | > It's 1/2 + 1 isn't it?
             | 
             | > That would mean in 3 servers you need 2.5 aka 3 machines
             | to commit a change. Then 4/5, 5/7, 6/9, 7/11.
             | 
             | No, the requirement isn't 1/2 + 1. Any _strict_ majority of
              | the cluster is enough to elect a leader. So you need 2/3,
             | or 3/4, or 3/5, and so on.
             | 
             | > Not to mention the time during which 1/n machines are
             | down due to an upgrade gets longer and longer the more
             | machines you have, increasing the chances of double fault.
             | 
             | Generally, this is not the case. If individual machine
             | failures are random and equally probable, and if each
             | machine is down on average less than 50% of the time, then
             | adding more machines makes things better, not worse. (This
             | is a basic property of the binomial distribution.)
             | 
             | Of course, if you have a single point of failure somewhere
             | -- e.g. a network switch -- this assumption can be
             | violated, but that's true regardless of how many machines
             | you have.
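              | 
              | A back-of-the-envelope check of that claim, assuming
              | independent failures with per-node availability p: the
              | cluster has quorum when at least floor(n/2)+1 nodes are
              | up, which is a binomial tail sum.
              | 
              |     package example
              | 
              |     import "math"
              | 
              |     func binom(n, k int) float64 {
              |         r := 1.0
              |         for i := 1; i <= k; i++ {
              |             r = r * float64(n-k+i) / float64(i)
              |         }
              |         return r
              |     }
              | 
              |     // clusterAvailability is P(at least a quorum of the
              |     // n nodes is up) when each node is up with prob. p.
              |     func clusterAvailability(n int, p float64) float64 {
              |         sum := 0.0
              |         for k := n/2 + 1; k <= n; k++ {
              |             sum += binom(n, k) *
              |                 math.Pow(p, float64(k)) *
              |                 math.Pow(1-p, float64(n-k))
              |         }
              |         return sum
              |     }
              | 
              |     // With p = 0.99: n=3 gives ~0.9997, n=5 gives
              |     // ~0.99999 -- more nodes help as long as p > 0.5.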
        
               | hinkley wrote:
               | If the leader is down (the scenario you clipped out in
               | your response) you need a strict majority with an even
               | number of machines.
        
               | hinkley wrote:
               | This is right for the wrong reason. See follow-up down-
               | thread.
        
         | jasonjayr wrote:
          | IIRC, a monotonic counter is involved. The odd one out will
          | realize it's behind the highest sequence number and discard
          | its updates to resync with the majority consensus.
         | 
         | Edit: http://thesecretlivesofdata.com/raft/ if you have some
         | time seems to be a good step by step explanation on how it
         | works in detail.
        
           | adsharma wrote:
           | The visual is indeed cool. I also thought it'd be nice to use
            | a chat-like interface to learn Raft.
           | 
            |     Alice: /set a 1
            |     Alice: /set b 2
            |     Bob: /status
           | 
           | Etc
           | 
           | https://github.com/adsharma/zre_raft
           | https://github.com/adsharma/raft
           | 
           | Bug reports welcome
        
       | ericlewis wrote:
       | Expensify had a version of something like this back in like
       | 2013/14 I think.
        
         | moderation wrote:
         | The project is BedrockDB [0] and has been previously discussed
         | [1].
         | 
         | 0. https://bedrockdb.com/
         | 
         | 1. https://news.ycombinator.com/item?id=12739771
        
           | ericlewis wrote:
           | Nice! Though the blockchain part is new to me. Interesting
           | they kept growing this.
        
       | Conlectus wrote:
       | One thing that jumps out at me after reading a lot of Jepsen
        | analyses - does Rqlite assume that partitions form equivalence
        | relations? That is, that every node belongs to one and only one
        | partition group? This is not always the case in practice.
        
         | yjftsjthsd-h wrote:
         | So the case of A can talk to B, B can talk to C, but A can't
         | talk to C? (Making sure that I understand how you can be in
         | multiple partitions)
        
       | fnord123 wrote:
       | FoundationDB and Comdb2 also use sqlite as a storage engine.
       | Curious that they decided to implement yet another one.
       | 
       | https://www.foundationdb.org/
       | 
       | http://comdb2.org/
        
         | tyingq wrote:
         | Rqlite appears to predate comdb2.
        
           | rapsey wrote:
           | Literally the first sentence.
           | 
           | > Comdb2 is a relational database built in-house at Bloomberg
           | L.P. over the last 14 years or so.
           | 
           | rqlite is not 14 years old.
        
             | tyingq wrote:
              | I was looking at the github repo history. Was it publicly
             | visible sooner than that would imply?
        
               | tyingq wrote:
               | Answering my own question, Comdb2 was made available to
               | the public on 1 January 2016, well after rqlite launched.
        
       | peter_d_sherman wrote:
       | First of all, great idea, and a brilliant and highly laudable
       | effort!
       | 
       | Favorited!
       | 
       | One minor caveat ("Here be Dragons") I have (with respect to my
       | own future adoption/production use), however:
       | 
       | https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md
       | 
       | > _" Does rqlite support transactions?
       | 
       | It supports a form of transactions. You can wrap a bulk update in
       | a transaction such that all the statements in the bulk request
        | will succeed, or none of them will. However the behaviour of
       | rqlite is undefined if you send explicit BEGIN, COMMIT, or
       | ROLLBACK statements. This is not because they won't work -- they
       | will -- but if your node (or cluster) fails while a transaction
       | is in progress, the system may be left in a hard-to-use state. So
       | until rqlite can offer strict guarantees about its behaviour if
       | it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK
       | is officially unsupported. Unfortunately this does mean that
       | rqlite may not be suitable for some applications."_
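        | 
        | For reference, the supported form of transaction looks roughly
        | like this over the HTTP API (a sketch in Go, assuming rqlite's
        | documented /db/execute endpoint with the `transaction` query
        | parameter on the default port 4001):
        | 
        |     package example
        | 
        |     import (
        |         "bytes"
        |         "encoding/json"
        |         "net/http"
        |     )
        | 
        |     // execInTransaction sends a bulk request; all statements
        |     // are applied or none are, with no explicit BEGIN/COMMIT.
        |     func execInTransaction(stmts []string) (*http.Response, error) {
        |         body, err := json.Marshal(stmts)
        |         if err != nil {
        |             return nil, err
        |         }
        |         return http.Post(
        |             "http://localhost:4001/db/execute?transaction",
        |             "application/json",
        |             bytes.NewReader(body),
        |         )
        |     }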
       | 
       | PDS: Distributed transactions are extremely difficult to get
       | exactly right -- so I'm not trying to criticize all of the hard
       | work and effort that everyone has put into this (again, it's a
       | great idea, and I think it has a terrific future).
       | 
       | But Distributed Transactions -- are what differentiate something
        | like rqlite from, say, something like CockroachDB (https://www.co
       | ckroachlabs.com/docs/stable/architecture/life-...).
       | 
       | Of course, CockroachDB is a pay-for product with an actual
       | company with many years of experience backing it, whereas rqlite,
       | as far as I can intuit, at this point in time (someone correct me
       | if I am wrong), appears to be a volunteer effort...
       | 
        | Still, I think that rqlite, despite this, has a glorious and
       | wonderful future!
       | 
       | Again, a brilliant and laudable effort, suitable for many use
       | cases presently, and I can't wait to see what the future holds
       | for this Open Source project!
       | 
       | Maybe in the future some code-ninja will step up to the plate and
       | add fully guaranteed, safe, distributed transactions!
       | 
       | Until then, it looks like a great idea coupled with a great
       | software engineering effort!
       | 
       | As I said, "Favorited!".
        
       | jchrisa wrote:
       | I'm curious how this relates to the Calvin protocol as
       | implemented by FaunaDB. They both use Raft, but FaunaDB and
       | Calvin have additional details about how transactions are retried
       | and aborted. https://fauna.com/blog/consistency-without-clocks-
       | faunadb-tr...
        
       | ClumsyPilot wrote:
       | I think microk8s uses this to form a cluster, and k3s used to use
        | it but moved back to etcd.
       | 
       | Would be good to hear from someone who used it what are the pros
       | and cons of such a setup
        
         | fasteo wrote:
         | AFAIK, microk8s uses a similar - but not this - form of
         | distributed sqlite. Specifically, it uses dqlite[1] "a C
         | library that implements an embeddable and replicated SQL
         | database engine with high-availability and automatic failover."
         | 
         | [1] https://github.com/canonical/dqlite
        
           | tyingq wrote:
           | Probably worth mentioning that Canonical initially made
           | dqlite to be the backing store for LXD. It uses the sqlite
           | VFS as the client entry point, so it's a very easy transition
           | for an existing sqlite app, just recompile with the new
           | header.
        
       ___________________________________________________________________
       (page generated 2021-01-22 23:00 UTC)