[HN Gopher] Rqlite: The lightweight, distributed relational data...
___________________________________________________________________

Rqlite: The lightweight, distributed relational database built on
SQLite

Author : pavanyara
Score : 175 points
Date : 2021-01-22 13:48 UTC (9 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| mech422 wrote:
| So it looks like you can now distribute SQLite at the:
|
| Stmt level: https://github.com/rqlite/rqlite
|
| VFS Level: https://github.com/canonical/dqlite
|
| Block Level: https://github.com/benbjohnson/litestream
|
| Really cool enhancements to an awesome project!
| f430 wrote:
| Could you use this to build a decentralized p2p app? If so, what
| gotchas and limitations are there?
| otoolep wrote:
| No, rqlite is not suitable for that kind of application. All
| writes must go through the leader.
| hermitcrab wrote:
| So if:
|
| Alice, Bob and Charlie have a synced copy of the same database
|
| Charlie goes on a plane and adds loads of records without a
| connection to the other databases
|
| Alice and Bob make no changes
|
| Charlie comes home and syncs
|
| Will Charlie lose all his changes, as his database is different
| to Alice and Bob's?
|
| What happens if Alice, Bob and Charlie all make changes offline
| and then resync?
| mrkurt wrote:
| It doesn't work that way. rqlite is effectively a
| leader/follower model that uses Raft for leader consensus.
| Writes can only happen online, and only to the leader.
| hermitcrab wrote:
| Ok, thanks!
| teraflop wrote:
| As per the description, all updates must be replicated to a
| quorum of instances. If Charlie is on a plane without the
| ability to contact a quorum, he can't add records in the first
| place.
|
| This is a fundamentally necessary tradeoff to provide strong
| consistency, as described by the CAP theorem.
| unnouinceput wrote:
| In which case Charlie will have an additional local DB to
| record those records, and when he gets back will use another
| protocol/method/system/whatever to add those new records? How
| about if everybody goes and adds records to the same table?
|
| Here is a real-life scenario that I had to deal with in the
| past. A technician (carpenter) goes to a client's home to
| repair furniture in the middle of nowhere, so no internet. He
| adds the necessary paperwork, which is pictures, declarations,
| and a contract (signed and scanned), to the Clients table. This
| company was employing hundreds of such technicians throughout
| the many counties of Germany, each with a laptop running this
| app, which was the backbone for getting paid/doing the work.
| And it was not uncommon to have more than one carpenter go to a
| client's home and do the repairs. Since each carpenter was paid
| according to his own work, each of them would create entries in
| their local Clients table, and when they got back to HQ their
| work was manually uploaded to the central DB; only after that
| did they get paid. I automated that (that was the job: to
| eliminate the thousands of hours that carpenters were wasting
| manually).
|
| So given the above scenario, how is this system going to
| achieve that? Same table, and same client details even in
| table Clients, just different rows for different carpenters
| (foreign key to table Carpenters).
| wtallis wrote:
| > So given the above scenario, how is this system going to
| achieve that?
|
| I don't think it is.
| You're describing a use case that is distributed but explicitly
| does not want to enforce consistency -- you want offline workers
| to all be able to keep working, and you're enforcing consistency
| after the fact and outside of the database itself.
| renewiltord wrote:
| This tool does not handle that problem. It is not meant to.
| It's for simultaneously available replicas. And this is the
| rare moment where thinking about replication vs
| synchronization as different is worthwhile.
|
| You usually replicate for failure tolerance and performance
| (this project only aims for the former).
| vorpalhex wrote:
| As other commenters have mentioned, this tool is not
| intended for that kind of case. You want a tool like
| PouchDB, which handles this kind of setup but has a
| different set of tradeoffs (it's eventually consistent,
| not strongly consistent).
| adsharma wrote:
| Charlie doesn't have to lose the data he saved on the plane.
| Don't know what the rqlite implementation does.
|
| In the second case, Alice-Bob consensus overrides Charlie
| [deleted]
| fipar wrote:
| I have not read the full description of this project yet, but
| it does mention the use of Raft for consensus, so in your
| example I would expect Charlie to not be able to add any
| records while disconnected, because, if my understanding is
| correct:
|
| - Charlie would either be the leader, but then without getting
| confirmation of writes from enough followers, he would not be
| able to do any writes himself, or
|
| - Charlie would be a follower, and while disconnected would
| obviously get no writes from the leader.
| NDizzle wrote:
| What's messed up is that I was doing this kind of thing with
| Lotus Domino in the late 90s. I'm sure others were doing it
| before me, too.
|
| Sometimes you had conflicts that needed resolution, but those
| weren't that frequent for our use case.
| mshenfield wrote:
| Going from a local db to one over a network has at least one
| risk. The SQLite docs give developers the okay to write "n+1"
| style queries (https://www.sqlite.org/np1queryprob.html). When
| the db is on the same file system as the application, this
| pattern is fine. But as soon as you add a network call it
| becomes a potential bottleneck.
| alberth wrote:
| @otoolep
|
| SQLite has a great post on "When to Use" (and not use) SQLite.
|
| Would be great if you included these same use cases in the
| ReadMe docs and made it clear whether rqlite can address them.
|
| https://www.sqlite.org/whentouse.html
| scottlamb wrote:
| It looks like this layers Raft on top of SQLite. I don't like
| when systems replicate high-level changes like "update users set
| salary = salary + 1000 where ...;" Instead, I prefer they
| replicate low-level changes like "replace key/block X, which
| should have contents C_x, with C_x'".
|
| Why? Imagine you're doing a rolling update. Some of your replicas
| are running the newer version of SQLite and some are running the
| older version. They may not execute the high-level query in
| exactly the same way. For example, in the absence of an "order
| by" clause, select results' order is unspecified. So imagine
| someone makes a mutation that depends on this: "insert ... select
| ... limit". (Maybe a dumb example but it can happen anyway.) Now
| the databases start to diverge, not only in underlying bytes and
| implementation-defined ordering but in actual row data.
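|
| (To make that concrete, here's a toy sketch -- mine, not
| rqlite's code -- of two "replicas" applying an identical
| statement log and still being free to diverge; the SQLite
| driver import is an assumption, any driver works:)
|
|     package main
|
|     import (
|         "database/sql"
|         "fmt"
|         "log"
|
|         _ "github.com/mattn/go-sqlite3" // assumed driver
|     )
|
|     func main() {
|         for _, replica := range []string{"replica A", "replica B"} {
|             db, err := sql.Open("sqlite3", ":memory:")
|             if err != nil {
|                 log.Fatal(err)
|             }
|             // The identical statement log is applied to both replicas...
|             for _, stmt := range []string{
|                 "CREATE TABLE users (id INTEGER, name TEXT)",
|                 "INSERT INTO users VALUES (1,'a'), (2,'b'), (3,'c')",
|                 // ...but without ORDER BY, *which* row satisfies
|                 // the LIMIT is left to the query plan, which may
|                 // change between SQLite versions.
|                 "CREATE TABLE archive AS SELECT * FROM users LIMIT 1",
|             } {
|                 if _, err := db.Exec(stmt); err != nil {
|                     log.Fatal(err)
|                 }
|             }
|             var id int
|             var name string
|             err = db.QueryRow("SELECT id, name FROM archive").Scan(&id, &name)
|             if err != nil {
|                 log.Fatal(err)
|             }
|             fmt.Printf("%s archived: %d/%s\n", replica, id, name)
|             db.Close()
|         }
|     }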
|
| I worked on a major distributed system that originally replicated
| high-level changes and switched to replicating low-level changes
| for this reason. We had a system for detecting when replicas
| didn't match, and replication of high-level changes was the
| biggest reason for diffs. (Hardware error was the second-biggest
| reason; we added a lot of checksumming because of that.)
| lrossi wrote:
| If you replicate low-level changes, you might not be able to do
| a live upgrade/downgrade if the version change affects the
| on-disk format.
|
| Another downside is that you might also propagate data
| corruption in case of bugs in the DB software (e.g. memory
| corruption) or hardware defects.
| scottlamb wrote:
| > If you replicate low level changes, you might not be able
| to do a live upgrade/downgrade if the version change affects
| the on disk format.
|
| It certainly requires care to ensure all the replicas have
| software capable of understanding the new format before it's
| actually written, but it can be done. Likewise, after writing
| the new format, you want to have a roll-back plan.
|
| In SQLite's case, https://sqlite.org/formatchng.html says:
| "Since 2004, there have been enhancements to SQLite such that
| newer database files are unreadable by older versions of the
| SQLite library. But the most recent versions of the SQLite
| library should be able to read and write any older SQLite
| database file without any problems." I don't believe
| upgrading SQLite automatically starts using any of those
| enhancements; you'd have to do a schema change like "PRAGMA
| journal_mode=WAL;" first.
|
| > Another downside is that you might also propagate data
| corruption in case of bugs in the DB software (e.g. memory
| corruption) or hardware defects.
|
| This happens regardless.
| otoolep wrote:
| rqlite creator here.
|
| I understand what you're saying, but I don't think it's a
| compelling objection. Obviously, differences between versions
| -- even patched versions -- can result in subtle, unintended
| differences in how the code works for a given program. But
| there is no reason to think a system that operates at a lower
| level ("replace key/block X, which should have contents C_x,
| with C_x'") is less vulnerable to this kind of issue, compared
| to one that operates at a higher level, i.e. statement-based
| replication, which rqlite uses. In fact, I would argue that a
| system that operates at a higher level of abstraction is _less_
| vulnerable to such subtle changes.
| [deleted]
| xd wrote:
| In MySQL/MariaDB this is what's known as non-deterministic
| behaviour, so row or mixed replication is used to mitigate it.
|
| Statement-based (high-level) replication is very useful for
| e.g. "insert into tbl0 select col0 from tbl1 order by col1", as
| you would only need to send the query, not the individual row
| data.
| tyingq wrote:
| Dqlite replicates at the VFS layer of sqlite, which sounds like
| what you're looking for. https://github.com/canonical/dqlite
| hinkley wrote:
| I haven't gotten a straight answer out of the k3s people
| about why they dumped dqlite, just some comment about bugs.
|
| I could see myself using dqlite in the future, so I'd like
| some more user reports from the trenches. Can anyone shed some
| light on this?
| tyingq wrote:
| The initial issue seems to say that it's because they
| need to have etcd anyway, so consolidating on that removes
| a dependency, which fits their simplicity goal.
| Though the issue appears to have been created by a user, not a
| maintainer.
|
| _"Since the HA direction needs etcd anyway.. I'm
| proposing dropping support for sqlite as the default
| embedded non-HA option and switch to embedded etcd as the
| default. This will reduce overall effort of maintainability
| of two entirely different datastores."_
|
| https://github.com/k3s-io/k3s/issues/845
| hinkley wrote:
| I accepted that reason for them, but as I don't benefit
| directly from switching to etcd, I'd rather know about
| what started the conversation than how it was concluded.
| merb wrote:
| dqlite support was flaky and went through their
| translation layer, which probably added complexity.
| ttul wrote:
| Sounds like you need to submit a patch!
| scottlamb wrote:
| lol, this is an unhelpful reflex answer to any criticism of
| open source software. What I'm describing is a redesign.
| There's no point in submitting a patch for that. It'd be much
| more effort than a few lines of code, and it wouldn't be
| accepted. Open source means that anyone can fork. It doesn't
| mean that maintainers will automatically merge patches
| replacing their software with completely different software.
| The only way that will happen is if the maintainers decide
| for themselves it needs to be redesigned, and that starts
| with discussion rather than a patch. It's also a long shot
| compared to just finding different software that already has
| the design I prefer.
|
| If I want replicated SQLite, I'll look at dqlite or
| litestream instead, which sound more compatible with my
| design sensibilities. (Thanks, tyingq and benbjohnson!)
| monadic3 wrote:
| Frankly, your bad-faith commentary isn't helping the
| conversation either. I sincerely appreciate your cleaning
| up the tone at the end.
| benbjohnson wrote:
| I just open-sourced a streaming replication tool for SQLite
| called Litestream that does physical replication (raw pages)
| instead of logical replication (SQL commands). Each approach
| has its pros and cons. Physical replication logs tend to be
| larger than logical logs, but I agree that you avoid a lot of
| issues if you do physical replication.
|
| https://github.com/benbjohnson/litestream
| hinkley wrote:
| Do you use compression? And if so, how does that affect the
| relative amount of network traffic vs logical?
| benbjohnson wrote:
| Yes, Litestream uses LZ4 compression. I originally used
| gzip but the compression speed was pretty slow. B-tree
| pages tend to compress well: they tend to be 50-75% full,
| because they need space to insert new records and because
| pages split when they get full.
|
| I'm seeing files shrink down to 14% of their size (1.7MB
| WAL compressed to 264KB). However, your exact compression
| will vary depending on your data.
| hinkley wrote:
| Ah, that makes sense. Most inserts don't split pages, so
| you're around n worst-case pages, but once in a while you
| get 2n updates where most of them are half full, and so
| compress better.
|
| So how does that compare to logical replication? (Also I
| imagine packet size plays a role, since you have to flush
| the stream quite frequently, right? 1000 bytes isn't much
| more expensive than 431.)
| benbjohnson wrote:
| Litestream defaults to flushing out to S3 every 10
| seconds, but that's mainly because of PUT costs. Each
| request costs $0.000005, so it costs about $1.30 per
| month. If you flushed every second then it'd cost you
| $13/month.
|
| Logical replication would have significantly smaller
| sizes, although the size cost isn't a huge deal on S3.
| Data transfer in to S3 is free and so are DELETE
| requests. The data only stays on S3 for as long as your
| Litestream retention specifies. So if you're retaining
| for a day then you're just keeping one day's worth of WAL
| changes on S3 at any given time.
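|
| (Checking that arithmetic -- a throwaway sketch using the
| $0.000005/PUT price, i.e. $0.005 per 1,000 requests, that those
| monthly totals imply:)
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         const putCost = 0.000005 // USD per S3 PUT request
|         const secondsPerMonth = 30 * 24 * 60 * 60
|
|         for _, interval := range []float64{10, 1} { // flush period in seconds
|             puts := secondsPerMonth / interval
|             fmt.Printf("flush every %.0fs: %.0f PUTs/month ~= $%.2f\n",
|                 interval, puts, puts*putCost)
|         }
|         // flush every 10s:  259200 PUTs/month ~= $1.30
|         // flush every 1s:  2592000 PUTs/month ~= $12.96
|     }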
| mrkurt wrote:
| I've been following this, and am anxious for the direct-to-
| sqlite replication.
|
| One of rqlite's big limitations is that it resyncs the entire
| DB at startup time. Being able to start with a "snapshot" and
| then incrementally replicate changes would be a big help.
| CuriouslyC wrote:
| This is tangential, but depending on your SQL needs,
| CouchDB's replication story is amazing, and you can
| replicate to the browser using PouchDB. There is an
| optional SQL layer, but obviously the good replication
| story comes with some trade-offs.
| otoolep wrote:
| rqlite creator here.
|
| I'm not sure I follow why it's a "big limitation"? Is it
| causing you long start-up times? I'm definitely interested
| in improving this, if it's an issue. What are you actually
| seeing?
|
| Also, rqlite does do log truncation (as per the Raft spec),
| so after a certain number of log entries (8192 by default)
| node restarts work _exactly_ like you suggested. The SQLite
| database is restored from a snapshot, and any remaining
| Raft log entries are applied to the database.
| mrkurt wrote:
| Ah, ok, that's some nuance I didn't know about!
|
| We're storing a few GB of data in the sqlite DB. Rebuilding
| those when rqlite restarts is a slow and intensive process
| compared to just using the file on disk over again.
|
| Our particular use case means we'll end up restarting
| 100+ replica nodes all at once, so the way we're doing
| things makes it more painful than necessary.
| otoolep wrote:
| But how do you know it's intensive? Are you watching disk
| IO? Is there a noticeable delay when the node starts
| before it's ready to receive requests?
|
| Try setting "-raft-snap" to a lower number, maybe 1024,
| and see if it helps. You'll have far fewer log entries
| to apply on startup. However, the node will perform a
| snapshot more often, and writes are blocked during
| snapshotting. It's a trade-off.
|
| It might be possible to always restart using some sort of
| snapshot, independent of Raft, but that would add
| significant complexity to rqlite. The fact that the SQLite
| database is built from scratch on startup, from the data
| in the Raft log, means rqlite is much more robust.
| mrkurt wrote:
| Oh, we're reading the sqlite files directly. rqlite is
| really just a mechanism for us to propagate read-only
| data to a bunch of clients.
|
| We need that sqlite file to never go away. Even a few
| seconds is bad. And since our replicas are spread all
| over the world, it's not feasible to move 1GB+ of data
| from the "servers" fast enough.
|
| Is there a way for us to use that sqlite file without it
| ever going away? We've thought about hardlinking it
| elsewhere and replacing the hardlink when rqlite is up,
| but haven't built any tooling to do that.
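|
| (A throwaway sketch of that swap idea -- not rqlite code, and
| assuming a POSIX filesystem, where rename(2) replaces the
| target atomically so readers never see a missing file:)
|
|     package main
|
|     import (
|         "io"
|         "log"
|         "os"
|     )
|
|     // swapIn replaces dst with a copy of src without dst ever
|     // disappearing: write a temp file in dst's directory (so the
|     // rename stays on one filesystem), then rename it over dst.
|     func swapIn(src, dst string) error {
|         in, err := os.Open(src)
|         if err != nil {
|             return err
|         }
|         defer in.Close()
|
|         tmp, err := os.CreateTemp(".", "rebuild-*")
|         if err != nil {
|             return err
|         }
|         defer os.Remove(tmp.Name()) // no-op once the rename succeeds
|
|         if _, err := io.Copy(tmp, in); err != nil {
|             tmp.Close()
|             return err
|         }
|         if err := tmp.Close(); err != nil {
|             return err
|         }
|         return os.Rename(tmp.Name(), dst)
|     }
|
|     func main() {
|         // Hypothetical file names, for illustration only.
|         if err := swapIn("rebuilt.db", "live.db"); err != nil {
|             log.Fatal(err)
|         }
|     }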
| otoolep wrote:
| Hmmmm, that's a different issue. :-)
|
| Today the rqlite code deletes the SQLite database (if
| present) and then rebuilds it from the Raft log. It makes
| things so simple, and ensures the node can always
| recover, regardless of the prior state of the SQLite
| database -- basically, the Raft log is the only thing that
| matters, and it is _guaranteed_ to be the same on each
| node.
|
| The fundamental issue here is that Raft can only
| guarantee that the Raft log is in consensus, so rqlite
| can rely on that. It's always possible that one of the
| copies of SQLite under a single node ends up in a
| different state from all the other nodes. This is because
| the change to the Raft log, and the corresponding change
| to SQLite, are not atomic. Blowing away the SQLite
| database means a restart fixes this.
|
| If this is important -- and what you ask sounds
| reasonable for the read-only case that rqlite can support
| -- I guess the code could rebuild the SQLite database in
| a temporary place, wait until that's done, and then
| quickly swap any existing SQLite file with the rebuilt
| copy. That would minimize the time the file is not
| present. But the file has to go away at some point.
|
| Alternatively, rqlite could open any existing SQLite file
| and DROP all data first. At least that way the _file_
| wouldn't disappear, but the data in the database would
| wink out of existence and then come back. WDYT?
| mrkurt wrote:
| Rebuilding and then moving it in place sounds pretty nice
| to me.
| jlongster wrote:
| I built an app (https://actualbudget.com/) that uses a
| local sqlite db and syncs changes, and that's exactly how
| it works. It takes quite a different approach though, using
| CRDTs to represent changes, and those are synced around.
| When a fresh client comes into play, it downloads the
| latest sqlite snapshot from a server and then syncs up.
| benbjohnson wrote:
| That can be painful for sure. Litestream will do a snapshot
| on startup if it detects that it can't pick up from where
| it left off in the WAL. That can happen if Litestream is
| shut down and another process performs a checkpoint. But
| generally a restart will just use the existing snapshot &
| continue with the WAL replication.
| jgraettinger1 wrote:
| Here's another approach to the problem [0]:
|
| This package is part of Gazette [1], and uses a Gazette
| journal (known as a "recovery log") to power raw bytestream
| replication & persistence.
|
| On top of journals, there's a recovery log "hinting"
| mechanism [2] that is aware of file layouts on disk, and
| keeps metadata around the portions of the journal which must
| be read to recover a particular on-disk state (e.g. what are
| the current live files, and which segments of the log hold
| them?). You can read and even live-tail a recovery log to
| "play back" / maintain the on-disk file state of a database
| that's processing somewhere else.
|
| Then, there's a package providing RocksDB with a Rocks
| environment that's configured to transparently replicate all
| database file writes into a recovery log [3]. Because RocksDB
| is a continuously compacted LSM-tree and we're tracking live
| files, it's regularly deleting files, which allows for
| "dropping" chunks of the recovery log journal which must be
| read or stored in order to recover the full database.
|
| For the SQLite implementation, SQLite journals and WALs are
| well-suited to recovery logs & their live-file tracking,
| because they're short-lived, ephemeral files. The SQLite page
| DB is another matter, however, because it's a super-long-
| lived and randomly written file.
| Naively tracking the page DB means you must re-play the
| _entire history_ of page mutations which have occurred.
|
| This implementation solves this by using a SQLite VFS which
| actually uses RocksDB under the hood for the SQLite page DB,
| and regular files (recorded to the same recovery log) for
| SQLite journals / WALs. In effect, we're leveraging RocksDB's
| regular compaction mechanisms to remove old versions of
| SQLite pages which must be tracked / read & replayed.
|
| [0] https://godoc.org/go.gazette.dev/core/consumer/store-sqlite
|
| [1] https://gazette.readthedocs.io/en/latest/
|
| [2] https://gazette.readthedocs.io/en/latest/consumers-concepts....
|
| [3] https://godoc.org/go.gazette.dev/core/consumer/store-rocksdb
| webmaven wrote:
| _> I just open-sourced a streaming replication tool for
| SQLite called Litestream that does physical replication (raw
| pages) instead of logical replication (SQL commands). Each
| approach has its pros and cons. Physical replication logs
| tend to be larger than logical logs but I agree that you
| avoid a lot of issues if you do physical replication._
|
| Hmm. Not having dug into your solution much, is it safe to
| say that the physical replication logs have something like
| logical checkpoints? If so, would it make sense to only keep
| physical logs on a relatively short rolling window, and keep
| logical logs (i.e. only the interleaved logical checkpoints)
| longer?
| benbjohnson wrote:
| I suppose you could save both the physical and logical logs
| if you really needed long-term retention. SQLite databases
| (and b-trees in general) tend to compress well, so the
| physical logging isn't as bad as it sounds. You could also
| store a binary diff of the physical page, which would shrink
| it even smaller.
|
| One benefit to using physical logs is that you end up with
| a byte-for-byte copy of the original data, so it makes it
| easy to validate that your recovery is correct. You'd need
| to iterate over all the records in your database to validate
| a logical log.
|
| However, all that being said, Litestream runs as a separate
| daemon process, so it actually doesn't have access to the
| SQL commands from the application.
| szszrk wrote:
| rqlite is mentioned here quite often, multiple times last year. I
| don't think this entry brings anything new.
| foolinaround wrote:
| We currently use browsers on several devices (both laptops and
| Android) and rely on Google sync currently. Maybe this could be
| used to sync bookmarks, history etc. across my devices but still
| keep my data local to me?
| JoachimSchipper wrote:
| This uses Raft, so a quorum of devices would need to be online
| at the same time. That's not what you want for browser sync.
| blackbear_ wrote:
| I know nothing of consensus algorithms and distributed systems,
| so bear with me please.
|
| > rqlite uses Raft to achieve consensus across all the instances
| of the SQLite databases, ensuring that every change made to the
| system is made to a quorum of SQLite databases, or none at all.
|
| What I understood from this sentence is that, if we have three
| instances, rqlite will make sure that every change is written to
| at least two. But what if two changes are written to two
| different pairs of instances? Then the three instances will have
| three different versions of the data. For example, change X is
| written to instances A and B, and change Y is written to B and C.
| Now A has X, B has X and Y, and C has Y only. How do you decide
| who is right?
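|
| (The replies below explain it properly; as a toy illustration
| of the key idea -- every change funnels through one leader,
| which fixes a single log order -- here is a mini append-only
| log. This is only an illustration, not rqlite's code:)
|
|     package main
|
|     import "fmt"
|
|     // The leader assigns every change a position in one log.
|     type leader struct{ log []string }
|
|     func (l *leader) propose(change string) {
|         l.log = append(l.log, change) // the single global order
|     }
|
|     // Followers apply entries strictly in log order, so a
|     // follower can hold a *prefix* of the leader's log, but never
|     // "Y without X" -- the split described above can't arise.
|     type follower struct{ applied []string }
|
|     func (f *follower) replicate(from *leader) {
|         f.applied = append(f.applied, from.log[len(f.applied):]...)
|     }
|
|     func main() {
|         l := &leader{}
|         l.propose("X")
|         l.propose("Y")
|
|         a, c := &follower{}, &follower{}
|         a.replicate(l)                    // a catches up: [X Y]
|         fmt.Println(a.applied, c.applied) // c lags: [] -- never [Y]
|     }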
| edoceo wrote:
| Raft consensus: https://raft.github.io/
|
| Surprisingly easy to understand, and a cool visual.
| teraflop wrote:
| In brief: at any point in time, one of the replicas is a
| "leader", which controls the order in which operations are
| committed. The changes occur in a defined sequence, and other
| replicas may lag behind the leader, but cannot be inconsistent
| with it.
|
| Your example can't happen, because if (for instance) A is the
| leader, then C will not apply change Y without contacting the
| leader, which will tell it to apply X first.
|
| If you want more details about how this handles all the edge
| cases -- for instance, what happens if the leader crashes --
| the Raft paper is quite accessible:
| https://raft.github.io/raft.pdf
| hinkley wrote:
| TL;DR: Raft updates are serialized (as in sequential).
| whizzter wrote:
| The semantics of Raft have a "simple" (compared to the
| harder-to-understand Paxos) forward motion of events that is
| supposed to guarantee that you won't get into weird states,
| regardless of whether any particular node(s) go down (I think
| it can survive (N/2)-1 dead machines in a cluster of N).
|
| Raft is based on having a leader decide what the next COMMIT is
| going to be, so B could never have X and Y at the same time
| (they could both be queued, but other mechanisms could reject
| them).
|
| Also, data is not considered committed until more than half the
| cluster has acknowledged it (at which point the leader will
| know it and handle going forward); leader election also works
| in a similar way, iirc.
|
| As others mentioned, the visualization on
| https://raft.github.io/ is really good (you can affect it to
| create commits and control downtime of machines).
| hinkley wrote:
| It's 1/2 + 1, isn't it? So if the leader goes down at the
| exact moment of quorum, you can still get quorum again.
|
| That would mean in 3 servers you need 2.5 aka 3 machines to
| commit a change. Then 4/5, 5/7, 6/9, 7/11. And I think it's a
| wash anyway, because as the servers go up the fraction you
| need for quorum goes down, but the odds of falling behind or
| failing outright go up too. Not to mention the time during
| which 1/n machines are down due to an upgrade gets longer and
| longer the more machines you have, increasing the chances of
| a double fault.
| simtel20 wrote:
| > It's 1/2 + 1 isn't it?
|
| The parent post is talking about the number that can go
| down while maintaining quorum, and you're talking about the
| number that need to remain up to maintain quorum. So you're
| both correct.
|
| However:
|
| > That would mean in 3 servers you need 2.5 aka 3 machines
| to commit a change.
|
| That seems wrong. You need N//2 + 1, where "//" is floor
| division, so in a 3-node cluster you need 3//2 + 1, or 1+1,
| or 2 nodes to commit a change.
| hinkley wrote:
| I think I see the problem.
|
| 'Simple majority' is based on the number of machines
| that the leader knows about. You can only change the
| membership by issuing a write. Write quorum and
| leadership quorum are two different things, and if I've
| got it right, they can diverge after a partition.
|
| I'm also thinking of double faults, because the point of
| Raft is to get past single-fault tolerance.
|
| [edit: shortened]
|
| After a permanent fault (broken hardware) in a cluster of
| 5, the replacement quorum member can't vote for writes
| until it has caught up. It can vote for leaders, but it
| can't nominate itself. Catching up leaves a window for
| additional faults.
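|
| (The N//2 + 1 rule above, spelled out -- a throwaway sketch,
| not from any Raft implementation:)
|
|     package main
|
|     import "fmt"
|
|     func main() {
|         for _, n := range []int{3, 4, 5, 7, 9, 11} {
|             quorum := n/2 + 1      // smallest strict majority
|             survives := n - quorum // failures tolerated while committing
|             fmt.Printf("cluster of %2d: quorum %2d, survives %d\n",
|                 n, quorum, survives)
|         }
|         // Note that 3 -> 4 raises the quorum (2 -> 3) without
|         // raising the failures tolerated (still 1), which is why
|         // even-sized clusters are rarely worth it.
|     }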
|
| It's always 3/5 for writes and elections; the difference is
| that the _ratio_ of original machines that have to confirm a
| write can go to 100% of survivors, instead of the 3/4 of
| reachable machines. Meaning network jitter and packet loss
| slow down writes until it recovers, and an additional
| partition can block writes altogether, even with 3/5
| surviving the partition.
| teraflop wrote:
| > It's 1/2 + 1 isn't it?
|
| > That would mean in 3 servers you need 2.5 aka 3 machines
| to commit a change. Then 4/5, 5/7, 6/9, 7/11.
|
| No, the requirement isn't 1/2 + 1. Any _strict_ majority of
| the cluster is enough to elect a leader. So you need 2/3,
| or 3/4, or 3/5, and so on.
|
| > Not to mention the time during which 1/n machines are
| down due to an upgrade gets longer and longer the more
| machines you have, increasing the chances of double fault.
|
| Generally, this is not the case. If individual machine
| failures are random and equally probable, and if each
| machine is down on average less than 50% of the time, then
| adding more machines makes things better, not worse. (This
| is a basic property of the binomial distribution.)
|
| Of course, if you have a single point of failure somewhere
| -- e.g. a network switch -- this assumption can be
| violated, but that's true regardless of how many machines
| you have.
| hinkley wrote:
| If the leader is down (the scenario you clipped out in
| your response) you need a strict majority with an even
| number of machines.
| hinkley wrote:
| This is right for the wrong reason. See follow-up down-
| thread.
| jasonjayr wrote:
| IIRC, a monotonic counter is involved. The odd one out will
| realize it's behind the highest sequence number and discard
| its updates to resync with the majority consensus.
|
| Edit: http://thesecretlivesofdata.com/raft/ if you have some
| time seems to be a good step-by-step explanation of how it
| works in detail.
| adsharma wrote:
| The visual is indeed cool. I also thought it'd be nice to use
| a chat-like interface to learn Raft:
|
|     Alice: /set a 1
|     Alice: /set b 2
|     Bob: /status
|
| Etc.
|
| https://github.com/adsharma/zre_raft
| https://github.com/adsharma/raft
|
| Bug reports welcome
| ericlewis wrote:
| Expensify had a version of something like this back in like
| 2013/14, I think.
| moderation wrote:
| The project is BedrockDB [0] and has been previously discussed
| [1].
|
| 0. https://bedrockdb.com/
|
| 1. https://news.ycombinator.com/item?id=12739771
| ericlewis wrote:
| Nice! Though the blockchain part is new to me. Interesting
| they kept growing this.
| Conlectus wrote:
| One thing that jumps out at me after reading a lot of Jepsen
| analyses -- does rqlite assume that partitions form equivalence
| relations? That is, that all nodes belong to one and only one
| partition group? This is not always the case in practice.
| yjftsjthsd-h wrote:
| So the case of A can talk to B, B can talk to C, but A can't
| talk to C? (Making sure that I understand how you can be in
| multiple partitions.)
| fnord123 wrote:
| FoundationDB and Comdb2 also use sqlite as a storage engine.
| Curious that they decided to implement yet another one.
|
| https://www.foundationdb.org/
|
| http://comdb2.org/
| tyingq wrote:
| Rqlite appears to predate comdb2.
| rapsey wrote:
| Literally the first sentence.
|
| > Comdb2 is a relational database built in-house at Bloomberg
| L.P. over the last 14 years or so.
|
| rqlite is not 14 years old.
| tyingq wrote:
| I was looking at the GitHub repo history.
| Was it publicly visible sooner than that would imply?
| tyingq wrote:
| Answering my own question: Comdb2 was made available to
| the public on 1 January 2016, well after rqlite launched.
| peter_d_sherman wrote:
| First of all, great idea, and a brilliant and highly laudable
| effort!
|
| Favorited!
|
| One minor caveat ("Here be Dragons") I have (with respect to my
| own future adoption/production use), however:
|
| https://github.com/rqlite/rqlite/blob/master/DOC/FAQ.md
|
| > _"Does rqlite support transactions?
|
| It supports a form of transactions. You can wrap a bulk update in
| a transaction such that all the statements in the bulk request
| will succeed, or none of them will. However, the behaviour of
| rqlite is undefined if you send explicit BEGIN, COMMIT, or
| ROLLBACK statements. This is not because they won't work -- they
| will -- but if your node (or cluster) fails while a transaction
| is in progress, the system may be left in a hard-to-use state. So
| until rqlite can offer strict guarantees about its behaviour if
| it fails during a transaction, using BEGIN, COMMIT, and ROLLBACK
| is officially unsupported. Unfortunately this does mean that
| rqlite may not be suitable for some applications."_
|
| PDS: Distributed transactions are extremely difficult to get
| exactly right -- so I'm not trying to criticize all of the hard
| work and effort that everyone has put into this (again, it's a
| great idea, and I think it has a terrific future).
|
| But distributed transactions are what differentiate something
| like rqlite from, say, something like CockroachDB
| (https://www.cockroachlabs.com/docs/stable/architecture/life-...).
|
| Of course, CockroachDB is a pay-for product with an actual
| company with many years of experience backing it, whereas rqlite,
| as far as I can intuit, at this point in time (someone correct me
| if I am wrong), appears to be a volunteer effort...
|
| Still, I think that rqlite, despite this, has a glorious and
| wonderful future!
|
| Again, a brilliant and laudable effort, suitable for many use
| cases presently, and I can't wait to see what the future holds
| for this open source project!
|
| Maybe in the future some code-ninja will step up to the plate and
| add fully guaranteed, safe, distributed transactions!
|
| Until then, it looks like a great idea coupled with a great
| software engineering effort!
|
| As I said, "Favorited!"
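|
| (For reference, a sketch of the *supported* bulk-transaction
| form against rqlite's HTTP API. The /db/execute endpoint,
| default port 4001, and the transaction query parameter are
| from the rqlite docs as I recall them -- treat the exact shape
| as an assumption:)
|
|     package main
|
|     import (
|         "bytes"
|         "encoding/json"
|         "fmt"
|         "log"
|         "net/http"
|     )
|
|     func main() {
|         // rqlite applies all of these statements, or none of them.
|         stmts := []string{
|             "CREATE TABLE IF NOT EXISTS accounts (id INTEGER, balance INTEGER)",
|             "INSERT INTO accounts VALUES (1, 100)",
|             "UPDATE accounts SET balance = balance - 10 WHERE id = 1",
|         }
|         body, err := json.Marshal(stmts)
|         if err != nil {
|             log.Fatal(err)
|         }
|
|         resp, err := http.Post(
|             "http://localhost:4001/db/execute?transaction",
|             "application/json",
|             bytes.NewReader(body),
|         )
|         if err != nil {
|             log.Fatal(err)
|         }
|         defer resp.Body.Close()
|         fmt.Println("status:", resp.Status)
|     }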
| jchrisa wrote:
| I'm curious how this relates to the Calvin protocol as
| implemented by FaunaDB. They both use Raft, but FaunaDB and
| Calvin have additional details about how transactions are
| retried and aborted.
| https://fauna.com/blog/consistency-without-clocks-faunadb-tr...
| ClumsyPilot wrote:
| I think microk8s uses this to form a cluster, and k3s used to
| use it but moved back to etcd.
|
| Would be good to hear from someone who used it what the pros
| and cons of such a setup are.
| fasteo wrote:
| AFAIK, microk8s uses a similar - but not this - form of
| distributed sqlite. Specifically, it uses dqlite[1], "a C
| library that implements an embeddable and replicated SQL
| database engine with high-availability and automatic failover."
|
| [1] https://github.com/canonical/dqlite
| tyingq wrote:
| Probably worth mentioning that Canonical initially made
| dqlite to be the backing store for LXD. It uses the sqlite
| VFS as the client entry point, so it's a very easy transition
| for an existing sqlite app; just recompile with the new
| header.
___________________________________________________________________
(page generated 2021-01-22 23:00 UTC)