[HN Gopher] LiteFS a FUSE-based file system for replicating SQLite
       ___________________________________________________________________
        
       LiteFS a FUSE-based file system for replicating SQLite
        
       Author : sysbot
       Score  : 136 points
       Date   : 2022-07-26 16:14 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ok_dad wrote:
       | Very cool, you should add raft so every node could be the primary
       | if the primary fails. You just need to add election and a few
       | minor state things on top of what's there already, I think.
        
       | gmemstr wrote:
       | And I /just/ got my infrastructure bits and pieces running
       | Litestream! Guess I'll have to figure out if it's worth switching
       | to this -- my gut reaction is no, since I only really run one pod
       | at a time, so Litestream serves the purpose of not only saving
       | the database offsite but also restoring it. But I will be keeping
       | a very close eye on this thanks in part to my love of SQLite.
       | 
       | Hats off to Ben and Fly.io, you're doing some cool stuff.
        
         | benbjohnson wrote:
         | Thanks! Yeah, if you don't need multiple replicas then
         | Litestream should work just fine. I'd say stick to that for
         | now.
        
       | moderation wrote:
       | > LiteFS is intended to provide easy, live, asychronous
       | replication across ephemeral nodes in a cluster. This approach
       | makes trade-offs as compared with simpler disaster recovery tools
       | such as Litestream and more complex but strongly-consistent tools
       | such as rqlite.
       | 
       | I think rqlite having a single binary that handles Raft /
       | consensus _and_ includes SQLITE makes it simpler. Beyond 'hello
       | world', Consul isn't trivial to run and Fly have blogged about
       | this [0]
       | 
       | 0. https://fly.io/blog/a-foolish-consistency/
        
       | hobo_mark wrote:
       | Tangentially related: I'd like to use litestream but my SQLite
       | files are several gigabytes, is there a way to lazily download
       | the db only once it's being accessed? (using something like
       | userfaultfd maybe? just thinking out loud)
        
         | mandeepj wrote:
         | > is there a way to lazily download the db only once it's being
         | accessed?
         | 
         | Not sure what's your full scenario, because you mentioned
         | "lazily download" so thought you might have a luck here
         | https://news.ycombinator.com/item?id=27016630
         | 
         | tl;dr: using http range and a smaller page size - might be the
         | way to go
        
         | benbjohnson wrote:
         | I'm not sure how that could be implemented with Litestream
         | since it runs as a separate process. It could be possible with
         | LiteFS where it just pages in data on the fly. That's on the
         | roadmap but it's probably still a ways out.
        
       | hinkley wrote:
       | I have an adjacent problem, and I haven't been able to find
       | anyone who has a fix for me.
       | 
       | One perfectly reasonable use case for a read replica of a
       | database is a bastion server. Database + web server on a machine
       | that is firewalled both from the internet and from the business
       | network. With read only access there is a much smaller blast
       | radius if someone manages to compromise the machine.
       | 
       | The problem is that every single replication implementation I've
       | seen expects the replicas to phone home to the master copy, _not_
       | for the master copy to know of the replicas and stream updates to
       | them. This means that your bastion machine needs to be able to
       | reach into your LAN, which defeats half the point.
       | 
       | The most important question is, "what options exist to support
       | this?" but I think the bigger question is why do we treat
       | replicas as if they are full peers of the system of record when
       | so often not only are they not - mechanically or philosophically
       | - and in some cases couldn't be even if we wanted to? (eg, a
       | database without multi-master support).
        
         | ok_dad wrote:
         | Try litestream I think it's a push system.
        
         | benbjohnson wrote:
         | That's an interesting idea. I had plans to introduce
         | "candidates" [1] (e.g. nodes that could become the primary) but
         | I like the idea of reversing the communication and connecting
         | from primary to replica. I added an issue to the LiteFS project
         | to track it. Thanks!
         | 
         | [1] https://github.com/superfly/litefs/issues/16
         | 
         | [2] https://github.com/superfly/litefs/issues/24
        
         | mandeepj wrote:
         | > a database without multi-master support
         | 
         | I believe Cassandra does not have a Master\follower
         | architecture; it's following a ring based structure.
        
           | 0x457 wrote:
           | Yeah, Cassandra uses a Distributed Hash Table with Consistent
           | Hashing.
           | 
           | When a new node added, it takes responsibility for a segment
           | of a ring. When node removed, it's segment get redistributed.
           | 
           | It's however in no way a drop-in replacement for RDBMS and
           | requires a careful planning around application read and
           | writes patterns that is very different from your typical
           | RDBMS. Definitely, can't be used in this scenario - every
           | node needs to be able to access every other node and client
           | must be able to access at least one node.
        
         | rhizome wrote:
         | Look into reverse SSH tunnelling. SSH from primary to
         | secondary, which then connects back to the primary through the
         | already-established SSH connection.
        
           | hinkley wrote:
           | That's the closest I've been able to come up with, but it
           | does have the problem that anything local can typically
           | connect to that tunnel. In the bastion situation we generally
           | don't assume that the machine is not compromised. Otherwise
           | why did we put it outside of the firewall?
           | 
           | To be fair, there are a number of ways a hostile endpoint can
           | screw with another server even just by screwing around with
           | TCP protocol behavior, so perhaps I'm putting too fine a
           | point on it.
        
         | koolba wrote:
         | > The problem is that every single replication implementation
         | I've seen expects the replicas to phone home to the master
         | copy, not for the master copy to know of the replicas and
         | stream updates to them. This means that your bastion machine
         | needs to be able to reach into your LAN, which defeats half the
         | point.
         | 
         | You can set up a PostgreSQL replica to be driven purely off of
         | archive logs. It does not need direct access to the source
         | database as it can pull the archive files via read only to a
         | third location (e.g. file server or S3) that gets pushed by the
         | source database server. The catch is that it will only be
         | updated when an WAL file is pushed which can be driven either
         | by size (automatically on an "active" database) or time (every
         | N seconds or minutes). If you're fine with potentially being a
         | minute behind the source, you can easily set this up.
        
       | benbjohnson wrote:
       | LiteFS author here (also Litestream author). I'm happy to answer
       | any questions folks have about how it works or what's on the
       | roadmap.
        
         | bbertelsen wrote:
         | Can you describe one or more use cases?
        
           | qbasic_forever wrote:
           | It might be interesting for embedded scenarios where you need
           | an always-available database but can't guarantee there will
           | be a connection available consistently to a central database
           | server.
        
           | benbjohnson wrote:
           | One main advantage of LiteFS is being able to deploy a
           | SQLite-based application and not have downtime during
           | deploys. This is especially true in ephemeral systems like
           | Kubernetes where pods will rollover.
           | 
           | Another use case is moving data close to users. If you're
           | only targeting users in a single country then it's not as big
           | of a deal but RTT from the US to Europe is ~100ms and US to
           | Asia is ~250ms. That's a big latency hit depending on what
           | you're trying to do.
        
         | laweijfmvo wrote:
         | A while ago, when I discovered litestream, I played around with
         | it and just setup cloning to a local file://; is this the same
         | idea?
        
           | benbjohnson wrote:
           | Sorta. LiteFS sets up a FUSE mount so once you have that in
           | place then you can create a database within that mount
           | directory and it is automatically replicated.
        
         | anon291 wrote:
         | Why a fuse filesystem instead of just using sqlite's vfs
         | features? FUSE adds an extra user-kernel-user context switch
         | which is expensive compared to a quick user-user call for the
         | vfs drivers.
        
           | benbjohnson wrote:
           | The VFS system is great and I plan to support that as well.
           | SQLite is simple to get started with but it can be
           | complicated to integrate a VFS depending on the language
           | using. My goal is to make the deployment as simple as
           | possible and also to support legacy applications.
           | 
           | There's some interesting work I'm looking at with the VFS and
           | WASM where you could deploy something like this across pure
           | serverless functions like Vercel or Deno.
        
             | tptacek wrote:
             | Would doing the VFS thing require participating
             | applications to be linked to a LiteFS library?
        
               | benbjohnson wrote:
               | That's a good question. I thought that was the case but I
               | just double checked and it looks like you can load a VFS
               | as an extension [1].
               | 
               | [1] https://www.sqlite.org/loadext.html
        
         | GauntletWizard wrote:
         | Awesome! Glad to see it.
         | 
         | I was very excited for the experimental support for live
         | replication in Litestream, but as I understand that didn't pan
         | out. This looks to be the successor, with fanout replication an
         | explicit feature.
         | 
         | Using a FUSE layer that's detecting changes is sure to have
         | some performance tradeoffs. What benchmarks are under way? Do
         | you need any help?
        
           | benbjohnson wrote:
           | Thanks! In hindsight, I'm glad I pulled the live replication
           | in Litestream and left it focused on disaster recovery only.
           | A lot of what I'm doing in LiteFS would be much more
           | difficult if I didn't have control over the write path. I'm
           | able to perform rolling checksums to verify database
           | integrity across replicas and that would be nearly impossible
           | with Litestream (which runs as a separate process).
           | 
           | The FUSE layer does have performance trade-offs. Read queries
           | aren't affected as much since the OS page cache is in the
           | kernel and avoids the FUSE layer entirely. Most databases are
           | under 10GB (or honestly probably less than 1GB) so you can
           | fit most of that in the page cache for any moderately sized
           | VM. The write side will incur a performance hit but I haven't
           | taken benchmarks yet. There's a lot that can still be
           | optimized around that.
           | 
           | The biggest help that I could use is just to get feedback on
           | how it feels to use and any ways I can improve it. The end
           | goal is to make it ridiculously simple to spin up and "just
           | work" so any ideas towards that goal would be awesome.
        
             | dan353hehe wrote:
             | I'm curious, was there a specific reason that FUSE was used
             | instead of the FVS layer inside to SQLite itself? I realize
             | that it would mean that a custom module would have to be
             | loaded, and maybe that was undesirable. Full disclosure I
             | have not read the link yet as I am a little pressed for
             | time, but I will later tonight when I have time. If it's
             | mentioned in the readme, feel free to ignore me :) I ask
             | because I have been building a VFS for SQLite to handle
             | multiple distributed writers and replicating data blocks
             | around to different places.
        
               | benbjohnson wrote:
               | It's a good question. It's mainly about reducing friction
               | for the end user. I wrote up a related response here:
               | https://news.ycombinator.com/item?id=32243305
        
         | tantaman wrote:
         | Very cool. I've been working on CRDTs for SQLite to allow for
         | conflict-free replication. Its still super early but I have a
         | prototype that works and, if you're up to it, I would love to
         | pick your brain on the details of sqlite WALs and journaling.
         | 
         | https://www.loom.com/share/0934f93364d340e0ba658146a974edb4
        
           | benbjohnson wrote:
           | Right on, I always thought a CRDT SQLite database would be
           | awesome. I thought about looking at the session extension[1]
           | to handle the merges but I didn't get very far into it.
           | 
           | Feel free to ping me at benbjohnson@yahoo.com if you want to
           | chat WAL & journaling. I think it'd be difficult to implement
           | CRDT at the journal level since it's all physical pages.
           | Merging data in pages is going to sometimes cause splits and
           | that'll be hairy to deal with.
           | 
           | [1] https://www.sqlite.org/sessionintro.html
        
             | tantaman wrote:
             | that was my assumption but it's been suggested a few times
             | to implement it there. Nice to hear some validation that
             | that isn't a great spot.
        
         | liuliu wrote:
         | Does LiteFS needs to do "pattern matching" to know what SQLite
         | writes (a txn or not e.g.)? With Litestream, it seems simply
         | use SQLite API. Just curious how do you think through this.
        
           | benbjohnson wrote:
           | Transactions start with the creation of the journal file and
           | end when it's unlinked so there's some "pattern matching" but
           | it's not terribly complicated. Once the file is unlinked,
           | LiteFS can read and verify the journal and pull the change
           | set of pages from the database file.
           | 
           | Litestream interacts with the SQLite API for locking but it
           | uses a regular file descriptor to read the WAL. It has to do
           | similar parsing of pages but it's a bit more complicated
           | since it needs to re-verify its position every time it reads
           | from the WAL.
        
             | liuliu wrote:
             | I am under the impression that if you are in the WAL mode,
             | there is no rollback journal created?
        
         | epilys wrote:
         | Kudos on the release! I've done almost exactly the same thing
         | in Rust with WAL only; writes are appended to a distributed wal
         | which achieves consensus by raft and distributed time leases.
         | it all worked so nearly I was impressed. It was for a client
         | though, so it's unfortunately proprietary.
        
           | benbjohnson wrote:
           | Thanks! That sounds like a fun project. Did you run the WAL
           | writes through Raft or did you just use Raft for leader
           | election?
        
         | Cyberdog wrote:
         | Given a situation where replication is desirable, is using
         | SQLite + LiteFS a better choice than just replicated Postgres?
        
           | benbjohnson wrote:
           | Postgres is great and replicating it can work as well. One
           | benefit to SQLite is that it's in-process so you avoid most
           | per-query latency so you don't have to worry as much about
           | N+1 query performance issues. From my benchmarks, I see
           | latency from an application node to a Postgres node as high
           | as 1 millisecond -- even if both are in the same region.
           | SQLite, on the other hand, has per-query latency overhead of
           | about 10-20us.
           | 
           | Another goal is to simplify deployment. It's a lot easier and
           | cost-effective to just deploy out 10 application nodes across
           | a bunch of regions rather than having to also deploy Postgres
           | replicas in each of those regions too.
        
       | ctur wrote:
       | If a "FUSE to replicate SQLite" solution came from anywhere else,
       | I'd be quite skeptical, but there is a lot of very interesting
       | tech coming out of fly.io these days and Ben certainly knows this
       | space well. It still feels a little like a hack and piercing of
       | layers of abstraction (less so than, say, litestream).
       | 
       | I love it when at first glance it isn't clear if a project is a
       | crazy idea from someone just goofing around vs a highly leveraged
       | crazy idea that will be a foundational part of a major technology
       | shift.
       | 
       | I suspect it's the latter and the strategy though is to layer
       | this on top of litestream to create an easy way to use sqlite
       | transparently in a widely distributed multi-node environment
       | (litestream providing the backups and/or readonly replication to
       | remote sites, with LiteFS handling low latency local access in a
       | cluster, POP, or data center).
       | 
       | Cool stuff. It will be fun to see where fly takes this :)
        
         | tmp_anon_22 wrote:
         | > a highly leveraged crazy idea that will be a foundational
         | part of a major technology shift
         | 
         | Has anything other then the Cloud presented a true foundational
         | shift in how applications are built? Kubernetes, Serverless,
         | Blockchain, React, Swift, these things are big but not big
         | enough.
         | 
         | I think we just like pretending every little thing is the next
         | big thing.
        
         | benbjohnson wrote:
         | Thanks for the vote of confidence! I can understand the "hack"
         | feel -- it's a trade-off. If I wrote it the "proper" way and
         | integrated directly into the SQLite source or used a VFS then
         | it'd be a lot harder to deploy for most folks. By making it a
         | FUSE file system, someone can use it without really knowing
         | much about it from the application's perspective.
         | 
         | As for strategy, it unfortunately doesn't work to layer with
         | Litestream as backups need some strict control over who is the
         | current primary. Instead, I'm adding S3 replication support [1]
         | directly into LiteFS. LiteFS also uses a different
         | transactional file format called LTX so it wouldn't be
         | compatible with Litestream. The LTX format is optimized for
         | compactions so point-in-time restores can be nearly instant.
         | 
         | The end goal isn't much of a secret. We want to let folks spin
         | up nodes in regions across the world, automatically connect to
         | one another, and have the whole thing have the ease of a single
         | node app. We still have a ways to go on that vision but that's
         | what we're going for.
         | 
         | [1] https://github.com/superfly/litefs/issues/18
        
           | ignoramous wrote:
           | > _I 'm adding S3 replication support directly into LiteFS._
           | 
           | Nice! There's a lot of value one can get out of a blob store,
           | despite it appearing seemingly at odds with block-device
           | dependent systems, like most sql dbms.
           | 
           | When a database at BigCloud layered replication (real-time
           | backups) atop S3, they did so by shipping both the WAL and
           | the on-disk files. For write heavy tables, WAL was streamed
           | every second, and on-disk files (snapshots) every 30mins (or
           | at some apt size-based threshold).
           | 
           | While WAL streaming also doubled-up as a key foundation for
           | them to build materialized views, support real-time triggers,
           | and act as an online data-verification layer; S3 itself
           | served as an insurance against hardware errors (memory, cpu,
           | network, disk) and data corruption.
           | 
           | https://web.archive.org/web/20220712155558/https://www.useni.
           | .. (keyword search _S3_ )
           | 
           | Elasticsearch / OpenSearch does something similar but it only
           | implements snapshot-based replication to S3 (periodic
           | backups).
           | 
           | https://web.archive.org/web/20190722153122/https://www.micro.
           | .. / https://archive.is/Q5jUj (docs)
        
       | mandeepj wrote:
       | Db sharding and replication is a fascinating subject and a matter
       | of deep interest to me.
       | 
       | Ben and Matt - appreciate your contributions in this area. I'm
       | interested in making contributions along with you. Please let me
       | know if you are looking for help. Much Thanks.
        
         | benbjohnson wrote:
         | Thanks! I think sharding is really interesting -- especially
         | with a lightweight database like SQLite. I'm not looking for
         | contributions right now but I would love to hear any feedback
         | on the approach taken with LiteFS. I want to make it as easy to
         | run as possible.
        
           | mandeepj wrote:
           | I'd give LiteFS a run soon. SqlLite is cross-platform, but
           | seems like LiteFS is not. True?
           | 
           | https://github.com/superfly/litefs
        
             | benbjohnson wrote:
             | Yes, that's correct. It's Linux only right now although
             | I'll probably support macOS too (via MacFUSE).
        
       ___________________________________________________________________
       (page generated 2022-07-26 23:00 UTC)