[HN Gopher] Show HN: Reduce SQLite database size by up to 80% wi...
       ___________________________________________________________________
        
       Show HN: Reduce SQLite database size by up to 80% with transparent
       compression
        
       Author : phiresky
       Score  : 360 points
       Date   : 2022-08-01 09:36 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wmanley wrote:
       | This is really neat. I like how the compression can be applied to
       | specific columns rather than whole rows - or whole pages. I could
       | imagine storing JSON from an HTTP request here and using
       | Generated Columns[1] to extract the bits my application is
       | interested in.
       | 
       | I also like how simple and powerful the custom partitioning is.
       | 
       | [1]: https://www.sqlite.org/gencol.html
        
         | wmanley wrote:
         | On that subject I've often thought of writing a combination
         | decompressor/JSON parser that can parse the contents of the
         | dictionary before decompression for faster JSON parsing and
         | lower memory usage.
        
           | aabbcc1241 wrote:
           | Maybe the compression algorithm used in
           | https://github.com/beenotung/compress-json can inspire you.
        
             | wmanley wrote:
             | Thanks, but that wasn't quite what I meant. What I was
             | thinking about was processing existing gzipped or zstd
             | compressed JSON rather than having a new compression
             | format.
             | 
             | I could see how customising the compressor could be helpful
             | - a bit like how there's an `--rsyncable` option to `gzip`,
             | but I'd like to keep compatibility with existing formats. I
             | like to avoid coming up with new formats whenever possible
             | because you might want to mine the data in 20 years time -
             | and if it's a custom format it'll be much harder to deal
             | with.
             | 
             | I don't have a current use-case for this, but I like
             | thinking about it. In particular I like data stores to be
             | as simple as possible. Managing persistent state is the
             | hard bit of most systems as unlike with runtime state or
             | code you have to manage forwards and backwards
             | compatibility, migrations, etc. and you can't just "turn it
             | off and on again" to fix issues.
        
           | gcr wrote:
           | a lot of document key/value databases that store arbitrary
           | JSON tend to work this way (for example Cassandra does this,
           | IIRC). some will even automatically build indices for you
           | (though hand-tuning is usually better)
        
         | klysm wrote:
         | I've been following that pattern in PostgreSQL and love it. It
         | cleanly decouples the ingest and keeps the stack simple.
        
           | masklinn wrote:
           | Do you mean gencols or compression?
        
             | klysm wrote:
             | I meant consuming the raw json into a jsonb column and then
             | using views or triggers to maintain the structured version
             | of it.
        
           | KingOfCoders wrote:
           | Can you elaborate?
        
             | allan_s wrote:
              | Not the author, but if you have a write-intensive workflow
              | and I/O starts to be your bottleneck, compressing the data
              | application-side allows your database to write fewer bytes,
              | and thus to handle more load for the same capacity.
        
               | OnlyMortal wrote:
               | Be aware of the number of iops. That's a killer.
        
             | rovr138 wrote:
             | Look into ETL vs ELT
             | 
              | With ETL, you grab the data from somewhere, transform it to
              | clean/format it, then load it (in this case into postgres).
              | 
              | With ELT, you grab the data and load it as is. Then you use
              | another step to extract what you need or transform it. You
              | keep the original data in case you need it there too.
        
         | gcr wrote:
         | wait, are you saying that the columns of view tables could be
         | compressed? that doesn't make sense to me cause views don't use
         | extra space to begin with....?
         | 
         | storing raw JSON in sqlite means the entire blob would be
         | compressed and live in a single column, right?
        
           | wmanley wrote:
           | You'd compress the JSON, and leave the data that you extract
           | from the JSON uncompressed using Generated Columns[1]. That
           | way you can save the original JSON data, but still have fast
           | querying, etc.
           | 
            | It would look something like this:
            | 
            |     CREATE TABLE t1(
            |         json TEXT,
            |         data INT GENERATED ALWAYS AS
            |             (json_extract(json, '$.d')) STORED
            |     );
           | 
           | If all you need is fast querying you could make `data` a
           | `VIRTUAL` column and set up an index on it.
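            | 
            | A minimal sketch of that variant (same hypothetical table and
            | column names as above, driven from Python's sqlite3 module):
            | 
            |     import sqlite3
            | 
            |     db = sqlite3.connect(":memory:")
            |     db.executescript("""
            |         CREATE TABLE t1(
            |             json TEXT,
            |             data INT GENERATED ALWAYS AS
            |                 (json_extract(json, '$.d')) VIRTUAL
            |         );
            |         CREATE INDEX t1_data ON t1(data);
            |     """)
            |     db.execute("INSERT INTO t1(json) VALUES (?)",
            |                ('{"d": 42}',))
            |     # An index on the VIRTUAL column can serve this query.
            |     row = db.execute(
            |         "SELECT data FROM t1 WHERE data = 42").fetchone()
            |     print(row)  # (42,)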
           | 
           | You can read more about this kind of technique here:
           | https://dgl.cx/2020/06/sqlite-json-support
           | 
           | [1]: https://www.sqlite.org/gencol.html
        
       | hda111 wrote:
       | I love the idea. Can I use it with Django projects somehow?
        
       | d1l wrote:
       | I am happy for the author, but want to warn the credulous that
       | this is a very strange way to approach compression with SQLite.
       | The much more common way is to use SQLite vfs:
       | 
       | https://www.sqlite.org/vfs.html
       | 
       | Briefly search for SQLite compressed vfs and you will find many
       | extensions. Some are quite mature.
       | 
       | I get the impression that the author doesn't know about the vfs,
       | as it's not even mentioned in the readme. The clickbait title
       | similarly seems amateurish. Caveat emptor.
        
         | Dowwie wrote:
          | To call it strange or unusual is fine, but that isn't a
          | compelling argument against using it. Is it better or worse
          | than a vfs approach, and why?
        
           | d1l wrote:
            | Why not read the sqlite team's own explanation:
           | 
           | https://www.sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki
        
           | kevingadd wrote:
           | Taking an amateurish or fancy or experimental approach to
           | compressing a sqlite database is a great way to end up with
           | data loss. Doesn't make it a worthless or "bad" project but
           | it's worthy of a warning.
        
         | phiresky wrote:
         | I did actually previously write a VFS, though it did something
         | else entirely:
         | 
         | https://phiresky.github.io/blog/2021/hosting-sqlite-database...
         | 
         | You're right that a comparison to compression VFS is completely
         | missing. I knew about their existence but I have to admit I
         | didn't look too closely.
         | 
          | Note that the main "novelty" here is training a shared /
          | partitioned compression dictionary specifically to take
          | advantage of redundancy that wouldn't appear within only one
          | row or even within a database page / block. The compression
          | happens at the row level and can additionally use application
          | knowledge - you couldn't do that at the VFS level. For example,
          | you can have separate compression dictionaries per column and
          | per group of rows with some commonality.
         | 
         | I'll have to compare to a compression vfs (do you have a
         | favorite?) and see if maybe these two methods can even be
         | combined.
         | 
         | Edit: I see that https://github.com/mlin/sqlite_zstd_vfs does
         | actually also train dictionaries. It's still at the database-
         | page level so can't take application knowledge into account or
         | compress only parts of the data, but that's still pretty smart
         | then.
        
           | kevingadd wrote:
           | How does this compare with modern compression algorithms like
           | Brotli that do context modeling, etc? I've found that they
           | manage to aggressively compress types of data you wouldn't
           | expect to compress well, to the point that investing energy
           | into doing compression yourself doesn't provide big returns
            | anymore. The downside is that codecs like Brotli tend to
            | compress very slowly, but I can imagine a setup where you
            | only compress old rows. So it would be cool to see an
            | experiment that just compresses rows or columns with Brotli
            | and compares the sizes with your method.
        
             | vlovich123 wrote:
              | Zstd with dictionary training should beat Brotli, and
              | Brotli will struggle with non-text data, although I haven't
              | benchmarked it.
              | 
              | Your underlying point remains valid, though: the
              | incremental complexity of building that training data
              | probably doesn't warrant it, because the situations where
              | it becomes valuable are quite rare, particularly for
              | typical SQLite databases. Still a neat trick though.
        
         | skrebbel wrote:
         | > amateurish
         | 
         | There's no need to be this dismissive.
        
           | 867-5309 wrote:
           | welcome to HN
        
             | skrebbel wrote:
             | Meh, I find HN to be among the least dismissive places on
             | the internet, especially when you only consider nerd-heavy
             | places. I like to think that this is in part due to
             | comments like the one I wrote being commonplace. I've been
             | at the receiving end of them as well (incl a few kicks
             | under the butt from dang himself) and it's helped me assume
             | good intent, not be overly dismissive, etc. I'm just
             | passing the favour along.
        
               | aidenn0 wrote:
               | I think the frustrating thing about HN is the
               | (relatively) high number of comments that are both
               | dismissive and substantive. It's much easier (for my
               | brain at least) to ignore comments where the entirety is
               | "this sucks" or "lame" (or even "If I made something this
               | crappy I would throw myself off a building" which isn't
               | uncommon in some forums).
               | 
                | In comparison, the gp comment is both dismissive _and_
                | substantive in that it brings up SQLite VFS layers that
                | can accomplish a similar goal. The author of TFA even
                | updated the README to mention these, addressing gp's
                | complaint.
               | 
               | The whole "this is crap because of X" where X is
               | something that subjectively could be considered
               | legitimate criticism is relatively common on HN, so
                | comments _that I don't automatically skip over_ are
               | rather more dismissive than on other forums.
        
         | zimpenfish wrote:
         | > Briefly search for SQLite compressed vfs and you will find
         | many extensions.
         | 
         | I found 3 with a brief search - CEVFS, sqlite_zstd_vfs, and
         | ZIPVFS (not free, therefore discounting.)
         | 
          | Neither CEVFS nor sqlite_zstd_vfs supports WAL mode (which
         | means, e.g., no litestream replication).
         | 
         | Can you recommend a mature one that does support WAL mode?
        
         | skerit wrote:
         | Let's not water down the meaning of the word "clickbait". I
         | have no problem with this title.
        
       | aabbcc1241 wrote:
        | I'm using sqlite on zfs and btrfs with transparent compression
        | supported by the file system. The query speed is very good and
        | the compression ratio is also very good (database below 3GB
        | uncompressed, 22% of that after compression).
        
         | cmurf wrote:
         | Did you enable WAL? Any casual comparison in performance
         | between WAL and no-WAL?
        
           | aabbcc1241 wrote:
            | I enabled WAL. It performs better than the default (which is
            | to keep a rollback journal).
        
         | usgroup wrote:
         | What do you think about zfs vs btrfs for your use case? Why do
         | you use both?
        
         | [deleted]
        
       | [deleted]
        
         | [deleted]
        
       | logicallee wrote:
       | This mentions something I've been wondering about:
       | 
       | >Depending on the data, this can reduce the size of the database
       | by 80% while keeping performance mostly the same (or even
       | improving it, since the data to be read from disk is smaller).
       | 
       | "Even improving it" is a bold claim!
       | 
        | Modern CPUs are often hungry for data, aren't they? While being
       | incredibly fast and mostly idle?
       | 
       | Is reading a smaller compressed file and decompressing it in the
       | CPU's hyperfast cache before use faster than reading it
       | uncompressed?
       | 
       | Maybe you can speed up an entire server, even one with lots of
       | free space and no need for compression, by adding full disk
       | compression anyway!
       | 
        | Unless the CPUs are at 100% utilization this could always be
        | used. Since the CPU will then use the data, it might be able to
        | decompress it locally faster than the bus can keep dropping it
        | off at file-read speeds.
       | 
        | This chart I found, which goes up to 2000, suggests cache access
        | had an increasing performance gap even with RAM bandwidth, to say
        | nothing of SSDs[1]
       | 
       | It is from here[2]
       | 
       | (However, compressing a full disk can only help if there are
       | extra CPU cycles.
       | 
       | For activity spikes when the CPU doesn't have extra cycles or the
       | cache for it, both compressed and uncompressed versions could be
       | available, at most doubling disk usage, so that the uncompressed
       | file could be read instead.
       | 
       | This can ensure the CPU isn't burdened with a decompression task,
       | since this only helps if the fastest caches are used for it and
       | they might have better things to do than uncompress files.
       | 
       | However, whenever the CPU has idle cores and lots of cache, which
       | is most of the time, it could request the compressed version. If
       | it is busy it could ask for the original version which saves the
       | decompression work.)
       | 
       | Can this help solve the problem of waiting for the disk?
       | 
       | [1] http://www.extremetech.com/wp-content/uploads/2014/08/CPU-
       | DR...
       | 
       | [2]
       | https://www.extremetech.com/extreme/188776-how-l1-and-l2-cpu...
        
         | outworlder wrote:
         | > Maybe you can speed up an entire server, even one with lots
         | of free space and no need for compression, by adding full disk
         | compression anyway!
         | 
          | Yes, although not all filesystems support compression. One of
          | the filesystems that does support it, and where this is
          | regularly benchmarked, is ZFS. It is also interesting that
          | while you are definitely using more CPU to read a compressed
          | stream, if that causes you to finish your task earlier, then it
          | can be a net gain even from a CPU usage standpoint.
         | 
         | eg.
         | 
         | https://www.servethehome.com/the-case-for-using-zfs-compress...
        
       | andrewmcwatters wrote:
        | Oh, interesting. I scrape GitHub's APIs to generate user and
        | repository statistics data, then compress it with ZStandard to
       | allow the databases to fit within GitHub's file size
       | restrictions.[1] So, this is very relevant to me.
       | 
       | It's pleasantly surprising how much data you can fit within a
       | ZStandard-compressed SQLite database. The two technologies fit
       | very well together.
       | 
       | [1]: https://github.com/andrewmcwattersandco/github-statistics
        
       | mansoor_ wrote:
       | What does "transparent" mean in this context?
        
         | phiresky wrote:
         | Transparent to the application - basically the SQL queries
         | (select, insert, update) in your code base don't have to change
         | at all.
         | 
         | This would be in contrast to e.g. compressing the data in your
         | code before inserting it into the database, then decompressing
         | after you load it.
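          | 
          | For contrast, a minimal sketch of the non-transparent approach
          | (hypothetical table and data, using Python's sqlite3 and zlib):
          | 
          |     import sqlite3
          |     import zlib
          | 
          |     db = sqlite3.connect(":memory:")
          |     db.execute("CREATE TABLE docs(id INTEGER PRIMARY KEY, body BLOB)")
          |     # The application compresses before INSERT...
          |     db.execute("INSERT INTO docs(body) VALUES (?)",
          |                (zlib.compress(b'{"hello": "world"}'),))
          |     # ...and has to remember to decompress after SELECT.
          |     row = db.execute("SELECT body FROM docs WHERE id = 1").fetchone()
          |     print(zlib.decompress(row[0]))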
        
       | AtNightWeCode wrote:
       | LZ77+ ftw. I am a heavy user of Zstd. I think Brotli might be a
       | better choice here. Or at least for my use-cases.
        
       | quietbritishjim wrote:
       | One option not listed (under "Other options for compressing data
       | in a database") but probably closest to what this solution does
       | under the hood, is to do row-interdependent compression at
       | application level.
       | 
        | That is, while saving rows to the database, pick out every (say)
        | 10th row and compress it by itself, but keep the compression
        | state in memory. Then for the next 9 rows, compress each row
        | against that dictionary and store a reference to the row it's
        | based on. (For example, in Python's zlib module, you'd use the
        | zdict parameter [1] of compress and decompress.) A bit like
        | keyframes in video compression.
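        | 
        | A rough sketch of that idea with Python's zlib (rows made up; a
        | real implementation would also store which "keyframe" row each
        | compressed row references):
        | 
        |     import zlib
        | 
        |     keyframe = b'{"user": "alice", "event": "view", "path": "/home"}'
        |     row = b'{"user": "alice", "event": "view", "path": "/about"}'
        | 
        |     # Compress the row against the keyframe as a preset dictionary.
        |     co = zlib.compressobj(zdict=keyframe)
        |     blob = co.compress(row) + co.flush()
        | 
        |     # Decompression needs the same dictionary to reconstruct the row.
        |     do = zlib.decompressobj(zdict=keyframe)
        |     assert do.decompress(blob) == row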
       | 
       | You can potentially get better results this way than a generic
       | solution like the article, because you might have some
       | application knowledge about what rows are likely to be similar to
       | others. But it obviously requires application level code (e.g.
       | fetching a logical row requires fetching two physical rows so you
       | have enough information to decode), and it would be a nightmare
       | if you need to remove or update the rows.
       | 
       | [1] https://docs.python.org/3/library/zlib.html#zlib.compressobj
        
         | phiresky wrote:
         | > You can potentially get better results this way than a
         | generic solution like the article because you might have some
         | application knowledge about what rows are likely to be similar
         | to others
         | 
          | The linked project actually does exactly this - you can
          | basically specify an SQL expression (such as
          | strftime('%Y-%m', date)) that is then used to group rows
          | together in order to decide what to train dictionaries on.
          | What you're describing as picking every 10th row would be the
          | expression (id/10). So you can use application knowledge, you
          | just don't have to.
          | 
          | The difference is that in my project the dictionary is stored
          | separately, so it's not affected by row deletions, and the
          | trained dictionary is based on all samples, not just one of
          | them. Since the dictionary is stored separately, it's only
          | worth it if you combine at least maybe 100-1000 rows under one
          | dictionary.
         | 
         | > Python's zlib module, you'd use the zdict parameter
         | 
         | Yes, the training feature of zstd is basically a better version
         | of keeping another sample of data around and using it as a
         | prefix. In fact you can use a data sample instead of a trained
          | dictionary as the `-D` parameter of the zstd command line; zstd
          | just has the additional feature of reducing a set of samples
         | down to the most essential (data-saving) parts.
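          | 
          | The same mechanism outside SQLite, sketched with the
          | python-zstandard package (samples are made up; the extension
          | does this internally via its SQL functions instead):
          | 
          |     import zstandard as zstd
          | 
          |     # Many small, similar "rows" to train on.
          |     samples = [b'{"event": "view", "page": "/item/%d"}' % i
          |                for i in range(5000)]
          | 
          |     dictionary = zstd.train_dictionary(4096, samples)
          |     cctx = zstd.ZstdCompressor(dict_data=dictionary, level=19)
          |     dctx = zstd.ZstdDecompressor(dict_data=dictionary)
          | 
          |     blob = cctx.compress(samples[0])
          |     assert dctx.decompress(blob) == samples[0]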
        
           | quietbritishjim wrote:
            | Thanks for making this project, it looks fantastic! I might
            | end up using it if I can figure out how to use it from
            | Python.
           | 
           | I did mention that this is what you're doing, I just thought
           | it wasn't clear from the blog post. But I did miss that you
           | can use an SQL expression to choose the grouping. This is
            | ideal for the application I'm thinking of, which has a
            | "message type" field that corresponds pretty closely with
            | which rows are similar.
        
             | phiresky wrote:
              | You can use it from python as follows:
              | 
              |     db = sqlite3.connect(...)
              |     db.enable_load_extension(True)
              |     db.load_extension("libsqlite_zstd.so")
             | 
             | You probably still have to figure out how to make that work
             | cross-platform / how to ship that binary though.
        
               | quietbritishjim wrote:
                | Oh interesting, I was assuming that recompiling the whole
                | sqlite module would be in order. Thanks again.
        
         | goldenkey wrote:
         | Your explanation is great. I like the key frame codec analogy.
         | Like you said though, this really belongs at the DB layer. An
         | application level solution would be much more than a nightmare.
         | 
         | Plugin/extension/module development is severely underused in
         | our field. Extensibility, if possible, is often the best way to
         | handle edge cases or increments in a dependency, without
         | forking it.
         | 
         | See "Benefits of Being an Extension to PostgreSQL":
         | https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...
         | 
          | Some great software that comes to mind that is extensible in
          | brilliant ways through plugins:
         | 
         | postgres: https://www.postgresql.org/docs/current/external-
         | extensions....
         | 
         | nginx: https://www.nginx.com/resources/wiki/modules/
         | 
         | sqlite: https://www.sqlite.org/loadext.html
         | 
         | redis: https://redis.io/docs/modules/
         | 
         | babel: https://babeljs.io/docs/en/plugins/
        
       | yababa_y wrote:
       | This makes me wonder how sqlite augmented with blosc or some
       | similar faster-than-memory compressor might behave. Size and
       | speed wins?
        
         | alecco wrote:
         | AFAIK Blosc is just a container. Not a good fit for row-level
         | compression.
        
           | yababa_y wrote:
           | the vfs isn't row level, though!
        
       | avereveard wrote:
       | I don't want to diss on the effort, I have an honest question:
       | wouldn't this something that would be more appropriate to be done
       | at the filesystem level?
        
         | hexmiles wrote:
          | It depends on the data and application.
          | 
          | I use a similar approach for compressing data that goes into a
          | couple of databases/stores. Having control over when and how to
          | compress data can significantly improve performance; also, it
          | sometimes makes sense to keep data compressed even in memory,
          | since that can allow you to make better use of it.
          | 
          | Also keep in mind that often you can't control the filesystem.
          | 
          | One use case I am working on right now is storing highly
          | repetitive data on very slow flash storage (which technically
          | supports compression, but it is slow and the compression ratio
          | is not great); by compressing it myself I can significantly
          | improve performance.
         | 
         | edit: grammar
        
         | phiresky wrote:
         | It would be interesting to try something like this on the file
         | system level, yes.
         | 
         | Basically you could heuristically find interesting files, then
         | train a compression dictionary for e.g. each file or each
         | directory. Then you'd compress 32kB blocks of each file with
         | the corresponding dictionary. Note that this would be pretty
         | different from how existing file system compression works (by
         | splitting everything into blocks of e.g. 128kB and just
         | compressing those normally).
         | 
         | I think it would be hard to find heuristics that work well for
         | this method. The result would be pretty different than what I'm
         | doing here with different tradeoffs. sqlite-zstd is able to use
         | application-level information of which data is similar and will
         | compress well together. You couldn't do that within the
         | filesystem. It also works "in cooperation" with the file format
         | of the database instead of against it - e.g. you probably
         | wouldn't want to compress the inner B-tree nodes.
         | 
          | On the other hand, it would then work for any file format, not
          | just SQLite databases - e.g. a huge folder of small JSON files.
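          | 
          | A rough sketch of that hypothetical (one trained dictionary per
          | directory, fixed 32kB blocks) using the python-zstandard
          | package; the directory path is made up, and as noted this is
          | not how existing filesystem compression works:
          | 
          |     import os
          |     import zstandard as zstd
          | 
          |     directory = "some/dir"  # hypothetical
          |     samples = [open(os.path.join(directory, name), "rb").read()
          |                for name in os.listdir(directory)]
          |     dictionary = zstd.train_dictionary(16384, samples)
          |     cctx = zstd.ZstdCompressor(dict_data=dictionary)
          | 
          |     def compress_blocks(path, block_size=32 * 1024):
          |         # Each 32kB block is compressed independently, so the
          |         # result stays randomly accessible per block.
          |         with open(path, "rb") as fh:
          |             while block := fh.read(block_size):
          |                 yield cctx.compress(block)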
        
         | gred wrote:
         | > wouldn't this something that would be more appropriate to be
         | done at the filesystem level?
         | 
         | Sometimes? You've gotten a ton of replies digging into the
         | "why" and "when", and I only wanted to add that your question
         | brought to mind this quote from the official SQLite website:
         | 
         | > "Think of SQLite not as a replacement for Oracle but as a
         | replacement for fopen()"
         | 
         | https://www.sqlite.org/about.html
        
         | xdfgh1112 wrote:
         | Yeah but your app doesn't have control over the filesystem.
         | Usually an sqlite db is bundled with an application or created
         | by it.
         | 
         | That said, it looks like you could get the same results by
         | doing client side compression and decompression of data
         | yourself, and storing compressed blobs in sqlite. It only
         | compresses text column data, not entire pages.
        
           | d1l wrote:
           | https://www.sqlite.org/vfs.html
        
           | Gehinnn wrote:
           | But it uses a shared zstd dictionary among rows!
           | 
           | The details are described here:
           | https://phiresky.github.io/blog/2022/sqlite-zstd/
        
         | marginalia_nu wrote:
          | Context-aware compression can be a lot faster than blind
          | filesystem-level compression.
          | 
          | Some data, like plain text, JSON, or XML, compresses extremely
          | well. You're almost certainly looking at compression rates of
          | up to 90%. Dense binary formats, especially formats that are
          | already compressed like most image formats, will barely
          | compress at all (sometimes they can even get bigger), and
          | attempting to compress them is an enormous waste of processing
          | power.
        
           | JulianWasTaken wrote:
            | Continuing the hypothetical without really knowing about
            | today's filesystems, such a system could presumably calculate
            | a moving per-page, per-fd, or per-process compression factor
            | and turn it down or up if it started seeing lots of
            | incompressible data?
        
             | marginalia_nu wrote:
              | For a columnar database this would probably work fairly
              | well, as it would put data of similar compressibility
              | adjacent to more of its kind, but more row-based databases
              | where various types of data are intermingled would probably
              | not benefit nearly as much.
        
           | zimpenfish wrote:
            | Tried this on my twitter database's `json` column (i.e. the
            | JSON response from the API fetching a single tweet) with a
           | `dict_chooser` of '%Y-%W' to bundle weeks together and it
           | took 985k rows down from 6.6G to 1.5G. Not as good as
           | compressing the whole thing obviously but really handy for a
           | queryable backup.
        
           | cmurf wrote:
           | At least in btrfs there's compress vs compress-force mount
           | options. compress uses a cheap estimator of compressibility,
           | whereas compress-force always passes data through the
           | compression algorithm.
           | 
           | ZSTD has its own estimator to avoid doing unnecessary work,
           | but I haven't found a comparison of these estimators (which
           | is cheaper).
           | 
            | compress results in a max extent size of 128K for compressed
            | extents, and 128M for uncompressed data, whereas compress-
            | force results in a 512K max extent size for uncompressed
            | data. While there's a cost to tracking more extents, it might be
           | less than the cost of frequent modification of large extents,
           | particularly if snapshots are used. Of course there's no one
            | answer; it'd be workload- and hardware-dependent. Hard drives
            | have a significant latency impact from non-contiguous
            | extents; NVMe much less so.
        
             | phiresky wrote:
             | The zstd estimator is not that great, so you should
             | definitely keep the auto-detection of the file system on.
             | 
             | If you try to compress a completely incompressible file, it
             | will still be extremely slow at higher compression levels: 
             | https://github.com/borgbase/vorta/issues/485#issuecomment-8
             | 8...
        
               | magicalhippo wrote:
               | I can't recall if they actually implemented or just
               | talked about it, but during the ZFS team meetings there
               | was some discussion about using LZ4 as a "compression
               | canary", to determine if it was worth attempting
               | Zstandard. That is, first try compressing with LZ4, and
               | if it did compress, do actual compression with Zstandard.
               | 
               | IIRC the argument went that LZ4 had a very quick bail-out
               | routine, and quite fast compression, so the overall
               | overhead would be quite small, at least for the higher
               | Zstandard compression levels.
        
               | Dagger2 wrote:
               | They implemented it; it landed in May:
               | https://github.com/openzfs/zfs/commit/f375b23c
        
         | fnord123 wrote:
          | Great idea. zfs can compress data on disk. But it's also useful
          | to keep the data compressed in memory so you can fit more of it
          | there. There are a lot of CPU/memory tradeoffs, so it's worth
          | having the option to do it at the data level.
         | 
         | It's also worth noting that compression on the application
         | layer works way better on column stores than row stores.
        
           | ciupicri wrote:
           | > Btrfs also supports transparent file compression. There are
           | three algorithms available: ZLIB, LZO and ZSTD (since v4.14).
           | Basically, compression is on a file by file basis. You can
           | have a single btrfs mount point that has some files that are
           | uncompressed, some that are compressed with LZO, some with
           | ZLIB, for instance (though you may not want it that way, it
           | is supported).
           | 
           | https://btrfs.wiki.kernel.org/index.php/Compression
        
         | Someone wrote:
          | No. If you did that, SQLite would have to (logically)
          | decompress just about your entire database on every query. For
         | example, how would it read the database schema to determine
         | whether _"select foo from bar"_ is valid sql for the database?
         | If it figures that out, how would it figure out there's an
         | appropriate index to use? How would it read that index?
         | 
         | Databases do a lot of seeking into database files and typically
         | only read a small part of them. That's not a scenario that's a
         | good fit with whole-file compression.
         | 
          | Compressing individual pages is a bit better, but I think this
          | approach can still do better.
         | 
         | Your file system probably would cache the decompressed database
         | file, but that would use memory, lots of it if your database is
         | large.
        
           | formerly_proven wrote:
           | FS-level compression is usually limited to compressing
           | relatively small blocks (think ~dozens of sectors), they
           | don't normally compress a large file as a single solid block.
        
             | Someone wrote:
             | Yes, but still, the file system would do more decompression
             | than needed, and that will cost performance (CPU and
             | caching)
        
         | ajconway wrote:
          | This kind of solution should work fine even if the database is
          | encrypted.
        
         | antman wrote:
         | With this you can selectively compress per column
        
       | josephernest wrote:
        | Let's say I have a web visits analytics log database, with a
        | column "user-agent" and a column "url" that most of the time have
        | the same values (99.9% of rows have one of 100 typical values).
       | 
       | Would this work well with compression here?
       | 
       | ie. from 50 bytes on average for these columns... to a few bytes
       | only?
        
         | hnarn wrote:
         | Probably yes, but presumably you could achieve this by
         | compressing the data on the file system level as well (for
         | example with ZFS).
        
         | Scaevolus wrote:
          | Yes, if you use the "dictionary" option it will build a buffer
          | of common strings and use it during compression, so your
          | user-agent column will largely become references to the
          | dictionary. Make sure you set compact to true! See this for
          | more information: https://facebook.github.io/zstd/#small-data
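          | 
          | Roughly what enabling it could look like from Python, going by
          | the project README (function names as I remember them from the
          | README; the exact config keys may differ, and the table and
          | column names here are made up):
          | 
          |     import sqlite3
          | 
          |     db = sqlite3.connect("analytics.db")
          |     db.enable_load_extension(True)
          |     db.load_extension("libsqlite_zstd.so")  # platform-specific
          | 
          |     # One shared dictionary ('a') for the whole column; a date-
          |     # based dict_chooser would instead group rows e.g. per month.
          |     db.execute("""SELECT zstd_enable_transparent('{
          |         "table": "visits",
          |         "column": "user_agent",
          |         "compression_level": 19,
          |         "dict_chooser": "''a''"
          |     }')""")
          |     # Rows get compressed during maintenance, not at INSERT time.
          |     db.execute("SELECT zstd_incremental_maintenance(NULL, 1)")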
        
           | josephernest wrote:
            | Will this mode require a separate file alongside the data.db
            | SQLite file, to store the compression dictionary?
           | 
           | From your last link:
           | 
           | > The result of this training is stored in a file called
           | "dictionary", which must be loaded before compression and
           | decompression.
           | 
           | Also, what happens if a column has been compressed based on
           | 100 typical values, and then later we insert thousands of new
           | rows with 500 new frequent values. Does the compression
           | dictionary get automatically updated? Then do old rows need
           | to be rewritten?
           | 
           | PS what is compact mode?
        
             | Scaevolus wrote:
             | It stores these compression dictionaries in the database:
             | 
             | > If dictionary is an int i, it is functionally equivalent
             | to zstd_decompress(data, (select dict from _zstd_dict where
             | id = i)).
             | 
              | Compact saves you 8 bytes per compressed value by removing
              | some metadata zstd uses:
             | 
             | > if compact is true, the output will be without magic
             | header, without checksums, and without dictids. This will
             | save 4 bytes when not using dictionaries and 8 bytes when
             | using dictionaries. this also means the data will not be
             | decodeable as a normal zstd archive with the standard
             | tools. The same compact argument must also be passed to the
             | decompress function.
             | 
             | You will need to update the dicts occasionally, but there's
             | a helper function to do this for you, read the docs about
             | zstd_incremental_maintenance.
        
       | cryptonector wrote:
       | The image near the top of the readme implies that compressing the
       | whole file means not having random access, which is indeed true,
       | but it also implies that there's nothing between whole file
       | compression and row/column level compression, and _that_ is not
       | true. You could compress pages, kinda like ZFS compresses blocks,
       | and still maintain random access, and get pretty close to whole-
       | file compression.
        
       | leobg wrote:
       | This looks great. How would I apply this retroactively? I.e. tell
       | SQLite to compress column x in an existing database, say, of type
       | TEXT?
        
       | Joel_Mckay wrote:
       | sudo apt-get install archivemount
       | 
       | archivemount mydbimage.tgz /mnt/mydbimage -o big_writes
       | 
       | // -o readonly disable write support
       | 
       | // -o nosave do not save changes upon unmount.
       | 
       | //open/cache /mnt/mydbimage/mydbimage.db
       | 
       | //flush
       | 
       | fusermount -u /mnt/mydbimage
       | 
       | //No LGPL/Rust/cargo dependency issues...
       | 
       | //as SQLite's best use-case is embedding ;)
        
         | marginalia_nu wrote:
         | Eh, tar is probably not an amazing format for this, since it's
         | designed for tape drives, and has ridiculously slow random
         | access.
        
           | Joel_Mckay wrote:
           | archivemount supports numerous formats as it is using
           | libarchive.
           | 
            | My point was that it doesn't contaminate the parent project's
            | license.
        
         | xyzzy_plugh wrote:
         | I understand why you suggest this, even though archivemount is
         | LGPL, but I digress.
         | 
         | FUSE is cool and all but given it's "userspace" and we're just
         | reading a single file here, I don't find this compelling versus
         | reading from a block device in-process.
         | 
         | TFA is compelling because it's compression over specific data
         | sets without being in the hot path for every operation. Putting
         | a sqlite.db in a tarball is... problematic. There is no random
         | access, so if your database doesn't fit in memory/cache,
         | performance will be abysmal.
        
           | Joel_Mckay wrote:
           | LGPL is not great for porting, and linux block caching is far
           | from the performance bottleneck if you know what you are
           | doing.
           | 
           | I think we can agree it depends on the use-case. =)
        
       ___________________________________________________________________
       (page generated 2022-08-01 23:00 UTC)