[HN Gopher] WAL Mode in LiteFS ___________________________________________________________________ WAL Mode in LiteFS Author : eatonphil Score : 95 points Date : 2023-01-04 16:36 UTC (6 hours ago) (HTM) web link (fly.io) (TXT) w3m dump (fly.io) | mbStavola wrote: | Having an interest in Sqlite and LiteFS, but not having explored | either very much, I wonder what the DB administration aspect | looks like? | | Say I want to connect a SQL client to my production DB, do I | expose a port on my production app that speaks the protocol? Or | maybe I have a separate container that is basically just the | Sqlite replica that I can connect to? Or maybe there is some | other approach that I'm just not seeing? | | Fly, Sqlite, and LiteFS is definitely an interesting combination | but I'm still sort of mentally stuck in the managed (and | standalone) DB model. | bob1029 wrote: | IMO, the _entire point_ of using something embedded like SQLite | is to avoid the network stack. | | Using SQLite, you can reliably satisfy queries in tens-to-hundreds of microseconds. | | I cannot imagine any scenario wherein a single-node hosted DB | (across the LAN) can outperform a properly-configured SQLite | instance (in same process). | | DB administration for us is included in an admin web app that | is part of the product stack. | gunnarmorling wrote: | > I cannot imagine any scenario wherein a single-node hosted | DB (across the LAN) can outperform a properly-configured | SQLite instance (in same process). | | SQLite only supports one writer at a time, so any use case | with more than a handful of concurrent writes will likely be | better off with a remote DB, which typically can handle | thousands of active connections issuing transactions. | benbjohnson wrote: | Currently, you need to SSH in and use the sqlite3 CLI on the | server. There has been some work in this area but it's all | still rough around the edges. I wrote a server called | Postlite[1] that exposes remote SQLite databases over the | Postgres wire protocol but it's _very_ alpha. :) | | I'd love to see more work in this area. Ricardo Ander-Egg wrote | a remote management tool called litexplore[2] that connects | over SSH to the SQLite CLI behind the scenes. I haven't used it | but I think there's a lot of potential with that approach. | | [1]: https://github.com/benbjohnson/postlite | | [2]: https://github.com/litements/litexplore | PreInternet01 wrote: | Even if you're not using LiteFS _yet_, the following bit of | SQLite advice makes this article absolutely worth the 11 minutes | it allegedly takes to read: | | > Use WAL mode | | (For bonus points, also bump cache_size up to 20000 or so and set | synchronous=NORMAL) | bob1029 wrote: | Just to give you an idea of how much difference this makes, I | ran the following benchmark as of a few minutes ago: | | _3 new SQLite databases each with 1 text column, inserting | 1000 rows of 1 char per. Testing using .NET6 and latest copy of | System.Data.SQLite._ | | Default / No Pragmas: 8806ms | WAL Mode Only: 2819ms | WAL Mode + Synchronous Normal: 44ms | benbjohnson wrote: | The SYNCHRONOUS pragma is great but I'll mention that there | is a durability trade-off. In "NORMAL" mode, there is not an | fsync() after every transaction so you could lose recent | transactions if you unexpectedly shut down. The WAL is append-only so you don't risk data corruption (which is great). | TillE wrote: | > unexpectedly shut down | | Right, a power cut or kernel panic.
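For reference, the pragmas recommended above can be applied from application code roughly like this (a minimal sketch using Python's standard sqlite3 module; the database file name is hypothetical, and the cache_size value is simply the one suggested above). Note that journal_mode persists in the database file, while synchronous and cache_size are per-connection and must be set on every new connection:

        import sqlite3

        conn = sqlite3.connect("app.db")           # hypothetical database file
        conn.execute("PRAGMA journal_mode=WAL")    # write-ahead log instead of a rollback journal
        conn.execute("PRAGMA synchronous=NORMAL")  # skip the fsync on every commit; small durability trade-off
        conn.execute("PRAGMA cache_size=20000")    # larger page cache, per the suggestion above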
It's certainly | something to be aware of, but there are probably much more | likely ways to lose some data (programming errors, etc). | tptacek wrote: | Or a point-in-time snapshot of the block device being | made. | bob1029 wrote: | This is actually how we recommend our customers perform | backups of our product. Crash-consistent snapshots of the | entire VM's disk every so often. 100% of our installs are | in virtualized/cloud environments, so this is a very | convenient and natural way to go about things. | | Some loss around the edges (and between snapshots) is | understood and accepted by all. We've made it clear to | our customers that by making this trade-off, we can | vastly simplify operational costs around the system (i.e. | we only need 1 self-contained VM per environment). | tptacek wrote: | Oh, for sure. I'm just saying that in cloud environments | the "power cut off" scenario is more common than it | looks. :) | bob1029 wrote: | We took the tradeoff in our product due to the strong | performance upside. | | Our reasoning goes something like - whatever happened | _approximately_ at the edge of the power-loss abyss is | considered part of the explosion. We have never held out | hope that we'd be able to get information to disk up until | the last microsecond. | | Letting WAL roll us back a few transactions is not a huge | deal for our customers, even if this breaks logical | consistency with regard to external systems. We have many | degrees of "redo" and the back office is always going to be | a thing in our line of business. | | We file this under the "edge cases not economically worth | targeting" bucket and take our 2 ~free orders of magnitude. | vlovich123 wrote: | Are you sure that those failure modes are the only ones | guaranteed? I'd be worried about durability failures that | happen due to non-obvious interactions (e.g. committing | transaction A, losing transaction B, committing | transaction C) or the database getting left in a state | where it can't even start. | bob1029 wrote: | WAL enforces serialization semantics at crash recovery | time. The only real variable is how many WAL frames made | it to disk before the rug pull. | aidenn0 wrote: | > The WAL is append-only so you don't risk data corruption | (which is great). | | I'm not sure this is true with all filesystems. I think | there are some filesystems in which a crash during append | can end up with the file enlarged, but the data not written | (IIRC I saw something like this with XFS when I was working | on a kernel module that kept crashing the kernel). | benbjohnson wrote: | > I think there are some filesystems in which a crash | during append can end up with the file enlarged | | The SQLite WAL file itself has a running checksum that | starts from the beginning so if you had an enlarged file | or even a block of zero'd out bytes in the middle, SQLite | would still recover gracefully. It recovers up to the | last valid WAL frame that contains a set "commit" flag. | aidenn0 wrote: | I'm honestly not surprised that SQLite handles this | situation (I seriously doubt there's any file-system | oddity I've run into that Hipp hasn't). But just being | "append only" is insufficient. | [deleted] | [deleted] | cldellow wrote: | WAL and synchronous=NORMAL is huge. | | I love databases that choose safe defaults. | | That said, many real-life applications of databases don't need | to block for confirmation that the transaction has been durably | committed -- ETL jobs, integration tests, most logging, etc.
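A rough way to reproduce the kind of difference bob1029 measured above, again with Python's standard sqlite3 module (timings are machine- and driver-dependent; autocommit mode is used so that each INSERT is its own transaction, matching the benchmark's one-row-per-transaction pattern):

        import sqlite3, tempfile, time, os

        def bench(pragmas):
            # Time 1000 single-row inserts, one transaction each, under the given pragmas.
            with tempfile.TemporaryDirectory() as tmp:
                conn = sqlite3.connect(os.path.join(tmp, "bench.db"), isolation_level=None)
                for p in pragmas:
                    conn.execute("PRAGMA " + p)
                conn.execute("CREATE TABLE t (v TEXT)")
                start = time.perf_counter()
                for _ in range(1000):
                    conn.execute("INSERT INTO t (v) VALUES ('x')")
                elapsed = time.perf_counter() - start
                conn.close()
                return elapsed

        print("defaults:                  %.3fs" % bench([]))
        print("WAL only:                  %.3fs" % bench(["journal_mode=WAL"]))
        print("WAL + synchronous=NORMAL:  %.3fs" % bench(["journal_mode=WAL", "synchronous=NORMAL"]))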
| | When you discover that you can get a 20-50x speedup for | basically free, it's like Christmas. | zamalek wrote: | What are the practical limits (_not_ theoretical limits) for | LiteFS? Are we talking 100's of GB, or something smaller? | benbjohnson wrote: | We're targeting databases of 1 to 10 GB right now and those | seem to work pretty well overall. I'm sure we'll expand that | target over time as LiteFS matures though. | | Most of the requests I've seen have been to support a lot of | smaller databases (e.g. hundreds or thousands of 1GB databases) | rather than one huge 100GB database. You can do interesting | things like sharding. Or if you're a SaaS company, you could | do one SQLite database per customer. That has some nice | isolation properties and it eases SQLite's single writer | restriction as your writes are spread across multiple | databases. | ignoramous wrote: | Hi Ben | | > _We're targeting databases of 1 to 10 GB right now and | those seem to work pretty well overall._ | | What are some reasons you reckon that the current setup won't | scale beyond 10GB? Or, is it some arbitrary threshold beyond | which you folks don't stress test things? | | Also, if I may, you mentioned on Twitter that this work was 3 | months in the making with 100s of PRs. Leaving aside | stability related bugs, what design decisions made previously | were the cause of painful bugs / roadblocks? Consequently, what | things majorly surprised you in a way that perhaps has | altered your approach / outlook towards this project or | engineering in general? | | Thanks. | benbjohnson wrote: | > What are some reasons you reckon that the current setup | won't scale beyond 10GB? | | It's more of an arbitrary threshold right now. A lot of | testing that we do right now is chaos testing where we | frequently kill nodes to ensure that the cluster recovers | correctly and we try to test a range of database sizes | within that threshold. Larger databases should work fine | but you also run into SQLite's single-writer limitation. | Also, the majority of databases we see in the wild are less | than 10GB. | | > Leaving aside stability related bugs, what design | decisions made previously were the cause of painful bugs / | roadblocks? | | So far the design decisions have held up pretty well. Most | of the PRs were either stability related or WAL related. | That being said, the design is pretty simple. We convert | transactions into files and then ship those files to other | nodes and replay them. | | We recently added LZ4 compression (which will be in the | next release). There was a design issue there with how we | were streaming data that we had to fix up. We relied on the | internal data format of our transaction files to delineate | them but that would mean we'd need to uncompress them to | read that. We had to alter our streaming protocol a bit to | do chunk encoding. | | I think our design decisions will be tested more once we | expand to doing pure serverless & WASM implementations. I'm | curious how things will hold up then. | | > Consequently, what things majorly surprised you in a way | that perhaps has altered your approach / outlook towards | this project or engineering in general? | | One thing that's surprised me is that we originally wrote | LiteFS to be used with Consul so it could dynamically | change its primary node. We kinda threw in our "static" | leasing implementation for one of our internal use cases.
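As an illustration of the one-database-per-customer pattern mentioned above (a hypothetical sketch; the directory layout and tenant_id naming are made up, not anything LiteFS prescribes):

        import sqlite3
        from pathlib import Path

        DB_DIR = Path("/data/tenants")   # hypothetical directory: one SQLite file per customer
        DB_DIR.mkdir(parents=True, exist_ok=True)

        def connect(tenant_id: str) -> sqlite3.Connection:
            # Open (or create) the database belonging to a single tenant.
            conn = sqlite3.connect(str(DB_DIR / (tenant_id + ".db")))
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("PRAGMA synchronous=NORMAL")
            return conn

Each tenant then gets its own single-writer lock, so writes for different customers never contend with each other.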
| But it turns out that for a lot of ancillary cache use | cases, the static leasing works great! Losing write | availability for a couple seconds during a deploy isn't | necessarily a big deal for all applications. | infogulch wrote: | Have you compared LZ4 to other compression algorithms, | zstd for example? ( https://github.com/phiresky/sqlite- | zstd ) | | Given that LiteFS operates at the filesystem layer via | FUSE, have you considered it against designs that use | built-in features native to some filesystems? For | example, I've considered a similar system design based on | a single ZFS primary node that streams ZFS snapshots to | reader nodes. With some coordination service (e.g. | consul) it could still allow for the whole node promotion | process. | benbjohnson wrote: | > Have you compared LZ4 to other compression algorithms, | zstd for example? | | I looked at zstd as well. IIRC, LZ4 compresses faster and | zstd has a better compression ratio. A lot of the files | being compressed are short-lived so it's typically better | to compress faster with LiteFS. Maybe we'll support more | compression formats in the future if it's useful. | | > have you considered it against designs that use built- | in features native to some filesystems | | The ZFS idea is interesting. I targeted FUSE because it | seemed like a lower barrier to entry and easier for users | to get set up. | zamalek wrote: | Awesome, thanks! These ballpark figures would be a great | improvement to your documentation. | pgm8705 wrote: | I'm curious how one goes about handling the need to query | data across customers with the one-database-per-customer | pattern. | ithrow wrote: | with a script | benbjohnson wrote: | It depends on what you're trying to do. If you're trying to | query revenue then that's usually stored in a third-party | system like Stripe. If you're tracking metrics then those | can be aggregated into a time series database like | Prometheus and viewed through Grafana. | | It's definitely a trade-off. Isolation is pretty compelling | for a lot of mid-market and enterprise customers but it | does come at a cost of making it a pain to query across | customers. But then again, that's kinda the point. :) | giraffe_lady wrote: | You can attach multiple dbs and join across them and it | works very well. The default limit is ten and I think the | max is a hundred so there's definitely a limit. But if you | really needed to do more I can think of some workarounds. | Scarbutt wrote: | What problems regarding resource usage can come up when | trying to use thousands of sqlite databases simultaneously | from a single process? | benbjohnson wrote: | You'll probably hit file descriptor limits so you'll need | to up those. There's also some memory overhead per | connection. I'd recommend limiting the number of concurrent | connections on a server if you're running on a resource | constrained box. In practice, you probably won't have all | thousand databases querying at the same time and reopening | SQLite connections is pretty fast. | spiffytech wrote: | If a LiteFS node is way behind on replication, can it let the app | know when it's caught up? | | What I'd like to do: have one Fly instance running and another in | cold standby. On deploy, the cold instance boots, fails the | health check until replication catches up, then takes over while | the first instance goes cold. | benbjohnson wrote: | > If a LiteFS node is way behind on replication, can it let the | app know when it's caught up? 
| | LiteFS exposes the replication position via a file[1] suffixed | with "-pos" for each database. That has a monotonically | increasing TXID that you could compare to the primary. | | We track latency via a Prometheus metric but we don't currently | expose that easily to the application. I added an issue for | that[2]. | | > have one Fly instance running and another in cold standby. On | deploy, the cold instance boots, fails the health check until | replication catches up, then takes over while the first | instance goes cold. | | When LiteFS starts up, it waits for the node to either connect | to the primary and catch up or become the primary itself before | it starts the user's application (if running in a supervisor | mode). That would probably get you most of the way there to | what you're trying to do. | | [1]: https://fly.io/docs/litefs/position/ | | [2]: https://github.com/superfly/litefs/issues/253 ___________________________________________________________________ (page generated 2023-01-04 23:01 UTC)