[HN Gopher] WAL Mode in LiteFS
       ___________________________________________________________________
        
       WAL Mode in LiteFS
        
       Author : eatonphil
       Score  : 95 points
       Date   : 2023-01-04 16:36 UTC (6 hours ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | mbStavola wrote:
       | Having an interest in Sqlite and LiteFS, but not having explored
       | either very much, I wonder what the DB administration aspect
       | looks like?
       | 
       | Say I want to connect a SQL client to my production DB: do I
       | expose a port on my production app that speaks the protocol? Or
       | do I run a separate container that is basically just a SQLite
       | replica that I can connect to? Or is there some other approach
       | that I'm just not seeing?
       | 
       | Fly, Sqlite, and LiteFS is definitely an interesting combination
       | but I'm still sort of mentally stuck in the managed (and
       | standalone) DB model.
        
         | bob1029 wrote:
         | IMO, the _entire point_ of using something embedded like SQLite
         | is to avoid the network stack.
         | 
         | Using SQLite, you can reliably satisfy queries in tens-to-
         | hundreds of microseconds.
         | 
         | I cannot imagine any scenario wherein a single-node hosted DB
         | (across the LAN) can outperform a properly-configured SQLite
         | instance (in same process).
         | 
         | DB administration for us is included in an admin web app that
         | is part of the product stack.
        
           | gunnarmorling wrote:
           | > I cannot imagine any scenario wherein a single-node hosted
           | DB (across the LAN) can outperform a properly-configured
           | SQLite instance (in same process).
           | 
           | SQLite only supports one writer at a time, so any use case
           | with more than a handful of concurrent writes will likely be
           | better off with a remote DB which typically can handle
           | thousands of active connections issuing transactions.
        
         | benbjohnson wrote:
         | Currently, you need to SSH in and use the sqlite3 CLI on the
         | server. There has been some work in this area but it's all
         | still rough around the edges. I wrote a server called
         | Postlite[1] that exposes remote SQLite databases over the
         | Postgres wire protocol but it's _very_ alpha. :)
         | 
         | I'd love to see more work in this area. Ricardo Ander-Egg wrote
         | a remote management tool called litexplore[2] that connects
         | over SSH to the SQLite CLI behind the scenes. I haven't used it
         | but I think there's a lot of potential with that approach.
         | 
         | [1]: https://github.com/benbjohnson/postlite
         | 
         | [2]: https://github.com/litements/litexplore
        
       | PreInternet01 wrote:
       | Even if you're not using LiteFS _yet_, the following bit of
       | SQLite advice makes this article absolutely worth the 11 minutes
       | it allegedly takes to read:
       | 
       | > Use WAL mode
       | 
       | (For bonus points, also bump cache_size up to 20000 or so and set
       | synchronous=NORMAL)
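       | 
       | In practice that boils down to a few pragmas run on each new
       | connection. A minimal sketch using Python's built-in sqlite3
       | module (any driver that can execute pragmas works the same way):
       | 
       |     import sqlite3
       | 
       |     conn = sqlite3.connect("app.db")
       |     # Persistent setting stored in the database file: readers
       |     # and the single writer stop blocking each other.
       |     conn.execute("PRAGMA journal_mode=WAL")
       |     # Per-connection: fsync at checkpoints instead of on every
       |     # commit (faster, with a small durability trade-off).
       |     conn.execute("PRAGMA synchronous=NORMAL")
       |     # Per-connection: 20000 pages of cache, roughly 80 MB at
       |     # the default 4 KiB page size.
       |     conn.execute("PRAGMA cache_size=20000")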
        
         | bob1029 wrote:
         | Just to give you an idea of how much difference this makes, I
         | ran the following benchmark as of a few minutes ago:
         | 
         |  _3 new SQLite databases, each with 1 text column, inserting
         | 1000 rows of 1 char each. Tested using .NET 6 and the latest
         | copy of System.Data.SQLite._
         | 
         |     Default / no pragmas:            8806ms
         |     WAL mode only:                   2819ms
         |     WAL mode + synchronous=NORMAL:     44ms
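         | 
         | For anyone wanting to reproduce the shape of this test, here
         | is a rough Python equivalent (not the original .NET code; each
         | INSERT runs as its own transaction, which is exactly what the
         | per-commit fsync punishes):
         | 
         |     import os, sqlite3, time
         | 
         |     def bench(path, pragmas):
         |         # start from a clean slate, including WAL/SHM files
         |         for suffix in ("", "-wal", "-shm"):
         |             try:
         |                 os.remove(path + suffix)
         |             except FileNotFoundError:
         |                 pass
         |         # isolation_level=None -> autocommit, one txn/INSERT
         |         conn = sqlite3.connect(path, isolation_level=None)
         |         for p in pragmas:
         |             conn.execute(p)
         |         conn.execute("CREATE TABLE t (v TEXT)")
         |         start = time.perf_counter()
         |         for _ in range(1000):
         |             conn.execute("INSERT INTO t (v) VALUES ('x')")
         |         elapsed = time.perf_counter() - start
         |         conn.close()
         |         return elapsed
         | 
         |     for name, pragmas in [
         |         ("default", []),
         |         ("wal", ["PRAGMA journal_mode=WAL"]),
         |         ("wal+normal", ["PRAGMA journal_mode=WAL",
         |                         "PRAGMA synchronous=NORMAL"]),
         |     ]:
         |         ms = bench(f"bench_{name}.db", pragmas) * 1000
         |         print(f"{name}: {ms:.0f}ms")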
        
           | benbjohnson wrote:
           | The SYNCHRONOUS pragma is great but I'll mention that there
           | is a durability trade-off. In "NORMAL" mode, there is not an
           | fsync() after every transaction, so you could lose recent
           | transactions if you shut down unexpectedly. The WAL is append-
           | only so you don't risk data corruption (which is great).
        
             | TillE wrote:
             | > unexpectedly shutdown
             | 
             | Right, a power cut or kernel panic. It's certainly
             | something to be aware of, but there are probably much more
             | likely ways to lose some data (programming errors, etc).
        
               | tptacek wrote:
               | Or a point-in-time snapshot of the block device being
               | made.
        
               | bob1029 wrote:
               | This is actually how we recommend our customers perform
               | backups of our product. Crash-consistent snapshots of the
               | entire VM's disk every so often. 100% of our installs are
               | in virtualized/cloud environments, so this is a very
               | convenient and natural way to go about things.
               | 
               | Some loss around the edges (and between snapshots) is
               | understood and accepted by all. We've made it clear to
               | our customers that by making this trade-off, we can
               | vastly simplify operational costs around the system (i.e.
               | we only need 1 self-contained VM per environment).
        
               | tptacek wrote:
               | Oh, for sure. I'm just saying that in cloud environments
               | the "power cut off" scenario is more common than it
               | looks. :)
        
             | bob1029 wrote:
             | We took the tradeoff in our product due to the strong
             | performance upside.
             | 
             | Our reasoning goes something like - whatever happened
             | _approximately_ at the edge of the power-loss abyss is
             | considered part of the explosion. We have never held out
             | hope that we'd be able to get information to disk up until
             | the last microsecond.
             | 
             | Letting WAL roll us back a few transactions is not a huge
             | deal for our customers. Even if this breaks logical
             | consistency with regard to external systems. We have many
             | degrees of "redo" and the back office is always going to be
             | a thing in our line of business.
             | 
             | We file this under the "edge cases not economically worth
             | targeting" bucket and take our 2 ~free orders of magnitude.
        
               | vlovich123 wrote:
               | Are you sure that those failure modes are the only ones
               | guaranteed? I'd be worried about durability failures that
               | happen due to non-obvious interactions (e.g. committing
               | transaction A, losing transaction B, committing
               | transaction C) or the database getting left in a state
               | where it can't even start.
        
               | bob1029 wrote:
               | WAL enforces serialization semantics at crash recovery
               | time. The only real variable is how many WAL frames made
               | it to disk before the rug pull.
        
             | aidenn0 wrote:
             | > The WAL is append-only so you don't risk data corruption
             | > (which is great).
             | 
             | I'm not sure this is true with all filesystems. I think
             | there are some filesystems in which a crash during append
             | can end up with the file enlarged, but the data not written
             | (IIRC I saw something like this with XFS when I was working
             | on a kernel module that kept crashing the kernel).
        
               | benbjohnson wrote:
               | > I think there are some filesystems in which a crash
               | during append can end up with the file enlarged
               | 
               | The SQLite WAL file itself has a running checksum that
               | starts from the beginning so if you had an enlarged file
               | or even a block of zero'd out bytes in the middle, SQLite
               | would still recover gracefully. It recovers up to the
               | last valid WAL frame that contains a set "commit" flag.
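               | 
               | For the curious, a rough sketch of walking a WAL file's
               | frames in Python, following the documented WAL format
               | (32-byte header, 24-byte frame headers). It skips the
               | running-checksum verification that SQLite performs
               | during real recovery:
               | 
               |     import struct, sys
               | 
               |     WAL_HDR = 32    # magic, version, page size, ...
               |     FRAME_HDR = 24  # page no, commit size, salts, ...
               | 
               |     def scan_wal(path):
               |         with open(path, "rb") as f:
               |             hdr = f.read(WAL_HDR)
               |             # page size is bytes 8-11, big-endian
               |             (psize,) = struct.unpack(">I", hdr[8:12])
               |             frames = commits = 0
               |             while True:
               |                 fh = f.read(FRAME_HDR)
               |                 if len(fh) < FRAME_HDR:
               |                     break
               |                 pgno, csize = struct.unpack(">II", fh[:8])
               |                 if len(f.read(psize)) < psize:
               |                     break  # truncated tail
               |                 frames += 1
               |                 # nonzero commit size => commit frame
               |                 if csize:
               |                     commits += 1
               |             print(frames, "frames,", commits, "commits")
               | 
               |     scan_wal(sys.argv[1])  # e.g. mydb.db-wal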
        
               | aidenn0 wrote:
               | I'm honestly not surprised that SQLite handles this
               | situation (I seriously doubt there's any file-system
               | oddity I've run into that Hipp hasn't). But just being
               | "append only" is insufficient.
        
         | [deleted]
        
           | [deleted]
        
         | cldellow wrote:
         | WAL and synchronous=NORMAL is huge.
         | 
         | I love databases that choose safe defaults.
         | 
         | That said, many real-life applications of databases don't need
         | to block for confirmation that the transaction has been durably
         | committed -- ETL jobs, integration tests, most logging, etc.
         | 
         | When you discover that you can get a 20-50x speedup for
         | basically free, it's like Christmas.
        
       | zamalek wrote:
       | What are the practical limits (_not_ theoretical limits) for
       | LiteFS? Are we talking 100's of GB, or something smaller?
        
         | benbjohnson wrote:
         | We're targeting databases of 1 to 10 GB right now and those
         | seem to work pretty well overall. I'm sure we'll expand that
         | target over time as LiteFS matures though.
         | 
         | Most of the requests I've seen have been to support a lot of
         | smaller databases (e.g. hundreds or thousands of 1GB databases)
         | rather than one huge 100GB database. You can do interesting
         | things like sharding. Or if you're a SaaS company, you could
         | do one SQLite database per customer. That has some nice
         | isolation properties, and it eases SQLite's single-writer
         | restriction since your writes are spread across multiple
         | databases.
        
           | ignoramous wrote:
           | Hi Ben
           | 
           | > _We're targeting databases of 1 to 10 GB right now and
           | those seem to work pretty well overall._
           | 
           | What are some reasons you reckon that the current setup won't
           | scale beyond 10GB? Or, is it some arbitrary threshold beyond
           | which you folks don't stress test things?
           | 
           | Also, if I may, you mentioned on Twitter that this work was 3
           | months in the making with 100s of PRs. Leaving aside
           | stability-related bugs, which design decisions made earlier on
           | caused painful bugs / roadblocks? Consequently, what things
           | majorly surprised you in a way that perhaps has altered your
           | approach / outlook towards this project or engineering in
           | general?
           | 
           | Thanks.
        
             | benbjohnson wrote:
             | > What are some reasons you reckon that the current setup
             | won't scale beyond 10GB?
             | 
             | It's more of an arbitrary threshold right now. A lot of
             | testing that we do right now is chaos testing where we
             | frequently kill nodes to ensure that the cluster recovers
             | correctly and we try to test a range of database sizes
             | within that threshold. Larger databases should work fine,
             | but you also run into SQLite's single-writer limitation.
             | Also, the majority of databases we see in the wild are less
             | than 10GB.
             | 
             | > Leaving aside stability-related bugs, which design
             | decisions made earlier on caused painful bugs /
             | roadblocks?
             | 
             | So far the design decisions have held up pretty well. Most
             | of the PRs were either stability related or WAL related.
             | That being said, the design is pretty simple. We convert
             | transactions into files and then ship those files to other
             | nodes and replay them.
             | 
             | We recently added LZ4 compression (which will be in the
             | next release). There was a design issue there with how we
             | were streaming data that we had to fix up. We relied on the
             | internal data format of our transaction files to delineate
             | them, but that would have meant decompressing them just to
             | read that framing. We had to alter our streaming protocol a
             | bit to do chunk encoding.
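             | 
             | (Roughly, the idea is to length-prefix each compressed
             | chunk so the receiver can split the stream without
             | decompressing it first. A tiny sketch using zlib from the
             | Python stdlib rather than LZ4, and not our actual wire
             | format:)
             | 
             |     import io, struct, zlib
             | 
             |     def write_chunk(out, payload: bytes):
             |         data = zlib.compress(payload)
             |         # 4-byte big-endian length, then compressed bytes
             |         out.write(struct.pack(">I", len(data)))
             |         out.write(data)
             | 
             |     def read_chunks(stream):
             |         while True:
             |             hdr = stream.read(4)
             |             if len(hdr) < 4:
             |                 break
             |             (n,) = struct.unpack(">I", hdr)
             |             yield zlib.decompress(stream.read(n))
             | 
             |     # round-trip through an in-memory buffer
             |     buf = io.BytesIO()
             |     write_chunk(buf, b"txn file bytes")
             |     buf.seek(0)
             |     assert next(read_chunks(buf)) == b"txn file bytes"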
             | 
             | I think our design decisions will be tested more once we
             | expand to doing pure serverless & WASM implementations. I'm
             | curious how things will hold up then.
             | 
             | > Consequently, what things majorly surprised you in a way
             | that perhaps has altered your approach / outlook towards
             | this project or engineering in general?
             | 
             | One thing that's surprised me is that we originally wrote
             | LiteFS to be used with Consul so it could dynamically
             | change its primary node. We kinda threw in our "static"
             | leasing implementation for one of our internal use cases.
             | But it turns out that for a lot of ancillary cache use
             | cases, the static leasing works great! Losing write
             | availability for a couple seconds during a deploy isn't
             | necessarily a big deal for all applications.
        
               | infogulch wrote:
                | Have you compared LZ4 to other compression algorithms,
                | zstd for example?
                | (https://github.com/phiresky/sqlite-zstd)
               | 
               | Given that LiteFS operates at the filesystem layer via
               | FUSE, have you considered it against designs that use
               | built-in features native to some filesystems? For
               | example, I've considered a similar system design based on
               | a single ZFS primary node that streams ZFS snapshots to
               | reader nodes. With some coordination service (e.g.
               | consul) it could still allow for the whole node promotion
               | process.
        
               | benbjohnson wrote:
               | > Have you compared LZ4 to other compression algorithms,
               | zstd for example?
               | 
               | I looked at zstd as well. IIRC, LZ4 compresses faster and
               | zstd has a better compression ratio. A lot of the files
               | being compressed are short-lived so it's typically better
               | to compress faster with LiteFS. Maybe we'll support more
               | compression formats in the future if it's useful.
               | 
               | > have you considered it against designs that use built-
               | in features native to some filesystems
               | 
               | The ZFS idea is interesting. I targeted FUSE because it
               | seemed like a lower barrier to entry and easier for users
               | to get set up.
        
           | zamalek wrote:
           | Awesome, thanks! These ballpark figures would be a great
           | improvement to your documentation.
        
           | pgm8705 wrote:
           | I'm curious how one goes about handling the need to query
           | data across customers with the one-database-per-customer
           | pattern.
        
             | ithrow wrote:
             | with a script
        
             | benbjohnson wrote:
             | It depends on what you're trying to do. If you're trying to
             | query revenue then that's usually stored in a third-party
             | system like Stripe. If you're tracking metrics then those
             | can be aggregated into a time series database like
             | Prometheus and viewed through Grafana.
             | 
             | It's definitely a trade-off. Isolation is pretty compelling
             | for a lot of mid-market and enterprise customers but it
             | does come at a cost of making it a pain to query across
             | customers. But then again, that's kinda the point. :)
        
             | giraffe_lady wrote:
             | You can attach multiple dbs and join across them and it
             | works very well. The default limit is ten and the
             | compile-time maximum is 125, so there's definitely a limit.
             | But if you really needed to do more, I can think of some
             | workarounds.
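             | 
             | For example, with Python's sqlite3 module (file, table, and
             | column names are made up):
             | 
             |     import sqlite3
             | 
             |     conn = sqlite3.connect("tenant_a.db")
             |     # attach a second per-customer database as "b"
             |     conn.execute("ATTACH DATABASE 'tenant_b.db' AS b")
             |     # query across both databases in one statement
             |     rows = conn.execute("""
             |         SELECT email FROM users
             |         UNION ALL
             |         SELECT email FROM b.users
             |     """).fetchall()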
        
           | Scarbutt wrote:
           | What problems regarding resource usage can come up when
           | trying to use thousands of sqlite databases simultaneously
           | from a single process?
        
             | benbjohnson wrote:
             | You'll probably hit file descriptor limits so you'll need
             | to up those. There's also some memory overhead per
             | connection. I'd recommend limiting the number of concurrent
              | connections on a server if you're running on a resource-
              | constrained box. In practice, you probably won't have all
             | thousand databases querying at the same time and reopening
             | SQLite connections is pretty fast.
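              | 
              | One way to keep the descriptor count bounded is a small
              | LRU of open connections that closes the least recently
              | used one once you hit a cap. A sketch (the path layout and
              | cap are made up):
              | 
              |     import sqlite3
              |     from collections import OrderedDict
              | 
              |     MAX_OPEN = 256  # stay well under the fd limit
              |     _open = OrderedDict()  # path -> sqlite3.Connection
              | 
              |     def connect(tenant: str) -> sqlite3.Connection:
              |         path = f"/data/{tenant}.db"  # hypothetical layout
              |         conn = _open.pop(path, None)
              |         if conn is None:
              |             if len(_open) >= MAX_OPEN:
              |                 # evict the least recently used connection
              |                 _, old = _open.popitem(last=False)
              |                 old.close()
              |             conn = sqlite3.connect(path)
              |             conn.execute("PRAGMA journal_mode=WAL")
              |         _open[path] = conn  # now the most recently used
              |         return conn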
        
       | spiffytech wrote:
       | If a LiteFS node is way behind on replication, can it let the app
       | know when it's caught up?
       | 
       | What I'd like to do: have one Fly instance running and another in
       | cold standby. On deploy, the cold instance boots, fails the
       | health check until replication catches up, then takes over while
       | the first instance goes cold.
        
         | benbjohnson wrote:
         | > If a LiteFS node is way behind on replication, can it let the
         | app know when it's caught up?
         | 
         | LiteFS exposes the replication position via a file[1] suffixed
         | with "-pos" for each database. That has a monotonically
         | increasing TXID that you could compare to the primary.
         | 
         | We track latency via a Prometheus metric but we don't currently
         | expose that easily to the application. I added an issue for
         | that[2].
         | 
         | > have one Fly instance running and another in cold standby. On
         | deploy, the cold instance boots, fails the health check until
         | replication catches up, then takes over while the first
         | instance goes cold.
         | 
         | When LiteFS starts up, it waits for the node to either connect
         | to the primary and catch up or become the primary itself before
         | it starts the user's application (if running in a supervisor
         | mode). That would probably get you most of the way there to
         | what you're trying to do.
         | 
         | [1]: https://fly.io/docs/litefs/position/
         | 
         | [2]: https://github.com/superfly/litefs/issues/253
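          | 
          | As a sketch of that comparison, assuming the "-pos" file holds
          | a hex TXID followed by a checksum as described in [1] (check
          | that page for the exact format); how the replica learns the
          | primary's TXID is left to your app:
          | 
          |     def read_txid(pos_path: str) -> int:
          |         # e.g. pos_path = "/litefs/my.db-pos"
          |         with open(pos_path) as f:
          |             return int(f.read().split("/")[0], 16)
          | 
          |     def caught_up(pos_path: str, primary_txid: int) -> bool:
          |         # healthy once the local TXID reaches the primary's
          |         return read_txid(pos_path) >= primary_txid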
        
       ___________________________________________________________________
       (page generated 2023-01-04 23:01 UTC)