[HN Gopher] LiteFS a FUSE-based file system for replicating SQLite
___________________________________________________________________
  LiteFS a FUSE-based file system for replicating SQLite
  Author : sysbot
  Score  : 136 points
  Date   : 2022-07-26 16:14 UTC (6 hours ago)
  (HTM) web link (github.com)
  (TXT) w3m dump (github.com)
| ok_dad wrote:
| Very cool, you should add Raft so every node could be the primary
| if the primary fails. You just need to add election and a few
| minor state things on top of what's there already, I think.
| gmemstr wrote:
| And I /just/ got my infrastructure bits and pieces running
| Litestream! Guess I'll have to figure out if it's worth switching
| to this -- my gut reaction is no, since I only really run one pod
| at a time, so Litestream serves the purpose of not only saving
| the database offsite but also restoring it. But I will be keeping
| a very close eye on this thanks in part to my love of SQLite.
|
| Hats off to Ben and Fly.io, you're doing some cool stuff.
| benbjohnson wrote:
| Thanks! Yeah, if you don't need multiple replicas then
| Litestream should work just fine. I'd say stick to that for
| now.
| moderation wrote:
| > LiteFS is intended to provide easy, live, asynchronous
| replication across ephemeral nodes in a cluster. This approach
| makes trade-offs as compared with simpler disaster recovery tools
| such as Litestream and more complex but strongly-consistent tools
| such as rqlite.
|
| I think rqlite having a single binary that handles Raft /
| consensus _and_ includes SQLite makes it simpler. Beyond 'hello
| world', Consul isn't trivial to run, and Fly have blogged about
| this [0]
|
| 0. https://fly.io/blog/a-foolish-consistency/
| hobo_mark wrote:
| Tangentially related: I'd like to use Litestream but my SQLite
| files are several gigabytes. Is there a way to lazily download
| the db only once it's being accessed? (using something like
| userfaultfd maybe?
| just thinking out loud)
| mandeepj wrote:
| > is there a way to lazily download the db only once it's being
| accessed?
|
| Not sure what your full scenario is, but since you mentioned
| "lazily download" you might have luck here:
| https://news.ycombinator.com/item?id=27016630
|
| tl;dr: using HTTP range requests and a smaller page size might
| be the way to go
| benbjohnson wrote:
| I'm not sure how that could be implemented with Litestream
| since it runs as a separate process. It could be possible with
| LiteFS where it just pages in data on the fly. That's on the
| roadmap but it's probably still a ways out.
| hinkley wrote:
| I have an adjacent problem, and I haven't been able to find
| anyone who has a fix for me.
|
| One perfectly reasonable use case for a read replica of a
| database is a bastion server: database + web server on a machine
| that is firewalled both from the internet and from the business
| network. With read-only access there is a much smaller blast
| radius if someone manages to compromise the machine.
|
| The problem is that every single replication implementation I've
| seen expects the replicas to phone home to the master copy, _not_
| for the master copy to know of the replicas and stream updates to
| them. This means that your bastion machine needs to be able to
| reach into your LAN, which defeats half the point.
|
| The most important question is, "what options exist to support
| this?" but I think the bigger question is why we treat replicas
| as if they are full peers of the system of record when so often
| they are not - mechanically or philosophically - and in some
| cases couldn't be even if we wanted to (e.g., a database without
| multi-master support).
| ok_dad wrote:
| Try Litestream; I think it's a push system.
| benbjohnson wrote:
| That's an interesting idea. I had plans to introduce
| "candidates" [1] (e.g.
| nodes that could become the primary) but
| I like the idea of reversing the communication and connecting
| from primary to replica. I added an issue to the LiteFS project
| to track it. Thanks!
|
| [1] https://github.com/superfly/litefs/issues/16
|
| [2] https://github.com/superfly/litefs/issues/24
| mandeepj wrote:
| > a database without multi-master support
|
| I believe Cassandra does not have a master/follower
| architecture; it follows a ring-based structure.
| 0x457 wrote:
| Yeah, Cassandra uses a Distributed Hash Table with Consistent
| Hashing.
|
| When a new node is added, it takes responsibility for a segment
| of the ring. When a node is removed, its segment gets
| redistributed.
|
| It's however in no way a drop-in replacement for an RDBMS and
| requires careful planning around application read and write
| patterns that is very different from your typical RDBMS. It
| definitely can't be used in this scenario - every node needs to
| be able to access every other node, and the client must be able
| to access at least one node.
| rhizome wrote:
| Look into reverse SSH tunnelling: SSH from primary to
| secondary, which then connects back to the primary through the
| already-established SSH connection.
| hinkley wrote:
| That's the closest I've been able to come up with, but it does
| have the problem that anything local can typically connect to
| that tunnel. In the bastion situation we generally can't assume
| that the machine isn't compromised - otherwise why did we put
| it outside of the firewall?
|
| To be fair, there are a number of ways a hostile endpoint can
| screw with another server even just by screwing around with TCP
| protocol behavior, so perhaps I'm putting too fine a point on
| it.
| koolba wrote:
| > The problem is that every single replication implementation
| I've seen expects the replicas to phone home to the master
| copy, not for the master copy to know of the replicas and
| stream updates to them.
| This means that your bastion machine
| needs to be able to reach into your LAN, which defeats half the
| point.
|
| You can set up a PostgreSQL replica to be driven purely off of
| archive logs. It does not need direct access to the source
| database, as it can pull the archive files via read-only access
| to a third location (e.g. a file server or S3) that gets pushed
| to by the source database server. The catch is that it will only
| be updated when a WAL file is pushed, which can be driven either
| by size (automatically on an "active" database) or time (every N
| seconds or minutes). If you're fine with potentially being a
| minute behind the source, you can easily set this up.
| benbjohnson wrote:
| LiteFS author here (also Litestream author). I'm happy to answer
| any questions folks have about how it works or what's on the
| roadmap.
| bbertelsen wrote:
| Can you describe one or more use cases?
| qbasic_forever wrote:
| It might be interesting for embedded scenarios where you need
| an always-available database but can't guarantee there will be
| a connection available consistently to a central database
| server.
| benbjohnson wrote:
| One main advantage of LiteFS is being able to deploy a
| SQLite-based application and not have downtime during deploys.
| This is especially true in ephemeral systems like Kubernetes
| where pods will roll over.
|
| Another use case is moving data close to users. If you're only
| targeting users in a single country then it's not as big of a
| deal, but RTT from the US to Europe is ~100ms and US to Asia is
| ~250ms. That's a big latency hit depending on what you're
| trying to do.
| laweijfmvo wrote:
| A while ago, when I discovered Litestream, I played around with
| it and just set up cloning to a local file://; is this the same
| idea?
| benbjohnson wrote:
| Sorta. LiteFS sets up a FUSE mount, so once you have that in
| place you can create a database within that mount directory and
| it is automatically replicated.
| anon291 wrote:
| Why a FUSE filesystem instead of just using SQLite's VFS
| features? FUSE adds an extra user-kernel-user context switch,
| which is expensive compared to a quick user-user call for the
| VFS drivers.
| benbjohnson wrote:
| The VFS system is great and I plan to support that as well.
| SQLite is simple to get started with but it can be complicated
| to integrate a VFS depending on the language you're using. My
| goal is to make the deployment as simple as possible and also
| to support legacy applications.
|
| There's some interesting work I'm looking at with the VFS and
| WASM where you could deploy something like this across pure
| serverless functions like Vercel or Deno.
| tptacek wrote:
| Would doing the VFS thing require participating applications
| to be linked to a LiteFS library?
| benbjohnson wrote:
| That's a good question. I thought that was the case but I just
| double-checked and it looks like you can load a VFS as an
| extension [1].
|
| [1] https://www.sqlite.org/loadext.html
| GauntletWizard wrote:
| Awesome! Glad to see it.
|
| I was very excited for the experimental support for live
| replication in Litestream, but as I understand it, that didn't
| pan out. This looks to be the successor, with fanout
| replication an explicit feature.
|
| Using a FUSE layer that's detecting changes is sure to have
| some performance tradeoffs. What benchmarks are under way? Do
| you need any help?
| benbjohnson wrote:
| Thanks! In hindsight, I'm glad I pulled the live replication
| from Litestream and left it focused on disaster recovery only.
| A lot of what I'm doing in LiteFS would be much more difficult
| if I didn't have control over the write path. I'm able to
| perform rolling checksums to verify database integrity across
| replicas, and that would be nearly impossible with Litestream
| (which runs as a separate process).
|
| The FUSE layer does have performance trade-offs.
| Read queries aren't affected as much since the OS page cache is
| in the kernel and avoids the FUSE layer entirely. Most databases
| are under 10GB (honestly, probably less than 1GB) so you can fit
| most of that in the page cache for any moderately sized VM. The
| write side will incur a performance hit, but I haven't taken
| benchmarks yet. There's a lot that can still be optimized around
| that.
|
| The biggest help that I could use is just to get feedback on how
| it feels to use and any ways I can improve it. The end goal is
| to make it ridiculously simple to spin up and "just work", so
| any ideas towards that goal would be awesome.
| dan353hehe wrote:
| I'm curious, was there a specific reason that FUSE was used
| instead of the VFS layer inside SQLite itself? I realize that
| it would mean that a custom module would have to be loaded, and
| maybe that was undesirable. Full disclosure: I have not read the
| link yet as I am a little pressed for time, but I will later
| tonight when I have time. If it's mentioned in the readme, feel
| free to ignore me :) I ask because I have been building a VFS
| for SQLite to handle multiple distributed writers and
| replicating data blocks around to different places.
| benbjohnson wrote:
| It's a good question. It's mainly about reducing friction for
| the end user. I wrote up a related response here:
| https://news.ycombinator.com/item?id=32243305
| tantaman wrote:
| Very cool. I've been working on CRDTs for SQLite to allow for
| conflict-free replication. It's still super early but I have a
| prototype that works and, if you're up to it, I would love to
| pick your brain on the details of SQLite WALs and journaling.
|
| https://www.loom.com/share/0934f93364d340e0ba658146a974edb4
| benbjohnson wrote:
| Right on, I always thought a CRDT SQLite database would be
| awesome. I thought about looking at the session extension [1]
| to handle the merges but I didn't get very far into it.
|
| Feel free to ping me at benbjohnson@yahoo.com if you want to
| chat about WAL & journaling. I think it'd be difficult to
| implement CRDTs at the journal level since it's all physical
| pages. Merging data in pages is going to sometimes cause splits
| and that'll be hairy to deal with.
|
| [1] https://www.sqlite.org/sessionintro.html
| tantaman wrote:
| That was my assumption, but it's been suggested a few times
| that I implement it there. Nice to hear some validation that
| that isn't a great spot.
| liuliu wrote:
| Does LiteFS need to do "pattern matching" to know what SQLite
| writes (a txn or not, e.g.)? With Litestream, it seems to
| simply use the SQLite API. Just curious how you think through
| this.
| benbjohnson wrote:
| Transactions start with the creation of the journal file and
| end when it's unlinked, so there's some "pattern matching" but
| it's not terribly complicated. Once the file is unlinked,
| LiteFS can read and verify the journal and pull the change set
| of pages from the database file.
|
| Litestream interacts with the SQLite API for locking but it
| uses a regular file descriptor to read the WAL. It has to do
| similar parsing of pages, but it's a bit more complicated since
| it needs to re-verify its position every time it reads from
| the WAL.
| liuliu wrote:
| I am under the impression that if you are in WAL mode, there
| is no rollback journal created?
| epilys wrote:
| Kudos on the release! I've done almost exactly the same thing
| in Rust with a WAL only; writes are appended to a distributed
| WAL which achieves consensus via Raft and distributed time
| leases. It all worked so neatly I was impressed. It was for a
| client though, so it's unfortunately proprietary.
| benbjohnson wrote:
| Thanks! That sounds like a fun project. Did you run the WAL
| writes through Raft or did you just use Raft for leader
| election?
| Cyberdog wrote:
| Given a situation where replication is desirable, is using
| SQLite + LiteFS a better choice than just replicated Postgres?
| benbjohnson wrote:
| Postgres is great and replicating it can work as well. One
| benefit to SQLite is that it's in-process, so you avoid most
| per-query latency and don't have to worry as much about N+1
| query performance issues. From my benchmarks, I see latency
| from an application node to a Postgres node as high as 1
| millisecond -- even if both are in the same region. SQLite, on
| the other hand, has per-query latency overhead of about
| 10-20us.
|
| Another goal is to simplify deployment. It's a lot easier and
| more cost-effective to just deploy 10 application nodes across
| a bunch of regions rather than having to also deploy Postgres
| replicas in each of those regions too.
| ctur wrote:
| If a "FUSE to replicate SQLite" solution came from anywhere
| else, I'd be quite skeptical, but there is a lot of very
| interesting tech coming out of fly.io these days and Ben
| certainly knows this space well. It still feels a little like a
| hack and a piercing of layers of abstraction (less so than,
| say, Litestream).
|
| I love it when at first glance it isn't clear if a project is a
| crazy idea from someone just goofing around vs a highly
| leveraged crazy idea that will be a foundational part of a
| major technology shift.
|
| I suspect it's the latter, and the strategy is to layer this on
| top of Litestream to create an easy way to use SQLite
| transparently in a widely distributed multi-node environment
| (Litestream providing the backups and/or read-only replication
| to remote sites, with LiteFS handling low-latency local access
| in a cluster, POP, or data center).
|
| Cool stuff.
| It will be fun to see where Fly takes this :)
| tmp_anon_22 wrote:
| > a highly leveraged crazy idea that will be a foundational
| part of a major technology shift
|
| Has anything other than the Cloud presented a true foundational
| shift in how applications are built? Kubernetes, Serverless,
| Blockchain, React, Swift - these things are big but not big
| enough.
|
| I think we just like pretending every little thing is the next
| big thing.
| benbjohnson wrote:
| Thanks for the vote of confidence! I can understand the "hack"
| feel -- it's a trade-off. If I wrote it the "proper" way and
| integrated directly into the SQLite source or used a VFS then
| it'd be a lot harder to deploy for most folks. By making it a
| FUSE file system, someone can use it without really knowing
| much about it from the application's perspective.
|
| As for strategy, it unfortunately doesn't work to layer with
| Litestream, as backups need some strict control over who is the
| current primary. Instead, I'm adding S3 replication support [1]
| directly into LiteFS. LiteFS also uses a different
| transactional file format called LTX, so it wouldn't be
| compatible with Litestream. The LTX format is optimized for
| compactions so point-in-time restores can be nearly instant.
|
| The end goal isn't much of a secret. We want to let folks spin
| up nodes in regions across the world, automatically connect
| them to one another, and have the whole thing be as easy as a
| single-node app. We still have a ways to go on that vision but
| that's what we're going for.
|
| [1] https://github.com/superfly/litefs/issues/18
| ignoramous wrote:
| > _I'm adding S3 replication support directly into LiteFS._
|
| Nice! There's a lot of value one can get out of a blob store,
| despite it appearing seemingly at odds with block-device
| dependent systems, like most SQL DBMSes.
|
| When a database at BigCloud layered replication (real-time
| backups) atop S3, they did so by shipping both the WAL and the
| on-disk files. For write-heavy tables, the WAL was streamed
| every second, and on-disk files (snapshots) every 30 minutes
| (or at some apt size-based threshold).
|
| While WAL streaming also doubled as a key foundation for them
| to build materialized views, support real-time triggers, and
| act as an online data-verification layer, S3 itself served as
| insurance against hardware errors (memory, CPU, network, disk)
| and data corruption.
|
| https://web.archive.org/web/20220712155558/https://www.useni. ..
| (keyword search _S3_ )
|
| Elasticsearch / OpenSearch does something similar, but it only
| implements snapshot-based replication to S3 (periodic backups).
|
| https://web.archive.org/web/20190722153122/https://www.micro. ..
| / https://archive.is/Q5jUj (docs)
| mandeepj wrote:
| DB sharding and replication is a fascinating subject and a
| matter of deep interest to me.
|
| Ben and Matt - I appreciate your contributions in this area.
| I'm interested in making contributions along with you. Please
| let me know if you are looking for help. Much thanks.
| benbjohnson wrote:
| Thanks! I think sharding is really interesting -- especially
| with a lightweight database like SQLite. I'm not looking for
| contributions right now but I would love to hear any feedback
| on the approach taken with LiteFS. I want to make it as easy to
| run as possible.
| mandeepj wrote:
| I'll give LiteFS a run soon. SQLite is cross-platform, but it
| seems like LiteFS is not. True?
|
| https://github.com/superfly/litefs
| benbjohnson wrote:
| Yes, that's correct. It's Linux-only right now, although I'll
| probably support macOS too (via MacFUSE).
___________________________________________________________________
(page generated 2022-07-26 23:00 UTC)