[HN Gopher] SQLite as an Application File Format (2014)
       ___________________________________________________________________
        
       SQLite as an Application File Format (2014)
        
       Author : afiori
       Score  : 418 points
       Date   : 2020-06-13 12:28 UTC (10 hours ago)
        
 (HTM) web link (sqlite.org)
 (TXT) w3m dump (sqlite.org)
        
       | twarge wrote:
       | Quickbooks for Mac uses sqlite for its file format and this
       | allowed users to query the books with any SQL browser!
       | 
       | Alas, they encrypted the file starting two years ago.
        
       | hexmiles wrote:
       | are there any other "file format oriented" data storage system??
       | I see software tend to use zip file, sqlite or good old ole
       | structured storage.
       | 
       | In my job we used a fat filesystem as a storage system and
       | recently switched to sqlite, and while i love it i didn't really
       | find any alternative.
        
       | mstade wrote:
       | SQLite is amazing - trivial to embed in almost every context I've
       | tried, rock solid, incredibly flexible, performs well enough for
       | most use cases and for the others it probably still performs
       | better than the solution I had in mind. I wouldn't dream of using
       | anything else for application data, unless you _need_ something
       | text based for human readers as well, and can 't deal with having
       | people use a SQLite client. And even then, really consider
       | whether you _need_ this.
       | 
       | With SQLite embeddable in websites thanks to wasm, and the
       | ability to create object URLs it's also pretty trivial to make
       | full blown (read-only) SPAs delivered as a single HTML file. This
       | last bit might seem crazy - and it kind of is - but if your
       | clients are all on a LAN in a corporate network so bandwidth and
       | latency aren't really an issue it makes a bit more sense.
       | 
       | I love SQLite, hands down one of my favorite tools.
        
       | em500 wrote:
       | A lot of well known apps are exactly doing this, using a sqlite
       | db as the internal file format: Whatsapp, WeChat, Google Keep,
       | Apple iMessage, Notes and Calendar...
       | 
       | Probably tons more if you care to dig around.
        
       | juskrey wrote:
       | Years pass, now decades, yet SQLite is still the most pleasant DB
       | to work with.
        
         | TedDoesntTalk wrote:
         | Agree, but it also side-steps some complex topics like
         | sharding, replication, high-availability, etc. I'm not
         | suggesting those topics should be addressed by SQLite; that
         | simplicity is an asset.
         | 
         | Has anyone used SQLite remotely over a network?
        
           | polyrand wrote:
           | You can use rqlite for that:
           | 
           | https://github.com/rqlite/rqlite
           | 
           | You can have a single over-the-network database or a cluster
           | of replica dbs (through Raft consensus).
        
           | rlpb wrote:
           | Dqlite exists - it's SQLite with some of these feature added.
           | https://dqlite.io/
        
           | allenu wrote:
           | I am working on a flash card app and specifically did not
           | want to have to create a SaaS to support cross-device data
           | sync, so have opted instead to create a journal of the data
           | mutations on each device and having those (i.e. it's CRDT)
           | uploaded to a simple file store (like Dropbox) for each
           | device to share. Each device stores its own local data using
           | SQLite, but does a merge on all the journaled commands to
           | stay up to date.
           | 
           | So, basically I rolled my own sync solution that just happens
           | to use SQLite. It's worth noting that I started this project
           | by storing the data in my own format (JSON, at first) and
           | quickly realized it was growing too large and was taking too
           | long to serialize/deserialize. I'm very glad to have ended up
           | using SQLite instead because it is super easy to use and has
           | been reliable.
        
           | mjw1007 wrote:
           | When Subversion changed their working copy format to rely on
           | an SQLite database, I had a number of problems using working
           | copies on network file systems.
           | 
           | No doubt this was an indication that the network file system
           | was incorrectly configured, but I think the fact that it
           | worked in practice with the file-based format and didn't with
           | the SQLite format is a strike against the idea of using
           | SQLite for what the user sees as saving a file.
        
             | m_sahaf wrote:
             | SQLite recommends against sharing it on NFS for multiple
             | access because locking is broken on NFS for all kinds of
             | files, not just SQLite.
             | 
             | > But use caution: this locking mechanism might not work
             | correctly if the database file is kept on an NFS
             | filesystem. This is because fcntl() file locking is broken
             | on many NFS implementations. You should avoid putting
             | SQLite database files on NFS if multiple processes might
             | try to access the file at the same time.
             | 
             | Source: https://www.sqlite.org/faq.html#q5
        
               | catalogia wrote:
               | Also WAL mode isn't going to work on network filesytems.
        
               | mjw1007 wrote:
               | Fair enough, but that means if someone's following the
               | advice in the article ("It deserves your consideration as
               | the standard file format on your next application
               | design") they need to consider "will our users want to
               | use our files on network filesystems?".
        
             | HelloNurse wrote:
             | Working copies on network file systems are a horrible idea.
             | You _detected_ problems because you were using SQLite, but
             | are you sure you didn 't have silent corruption with older
             | Subversion versions?
        
           | trashburger wrote:
           | I think the main reason SQLite doesn't have them is because
           | it's not a goal for the project. SQLite would be monumentally
           | more complex if it supported networking, sharding, etc. If
           | you need those use cases then perhaps it's a good idea to
           | switch to a more scalable DBMS. My favorite is PostgreSQL. If
           | I'm building something which will have user-generated content
           | then I always pick Postgres. Otherwise, SQLite.
        
           | easytiger wrote:
           | It lives in your process. It has no network features. It's
           | not a DBMS.
           | 
           | People have built layers over it though
        
       | gwbas1c wrote:
       | One minor point that needs to be raised: Using SQLite as your
       | application file format locks you into the SQLite source code.
       | 
       | (And in most cases that's not a bad thing, it's free and open
       | source.)
       | 
       | But, if you truly want an open file format, someone needs to be
       | able to independently write a program that can read your file
       | without relying on third party dependencies. This is why the
       | browser vendors decided not to put SQLite into the HTML and
       | JavaScript standards.
       | 
       | It is an awesome product! I've worked with it for over a decade
       | and I'm a fan!
        
       | trevor-e wrote:
       | While we are on the topic of file formats, I stumbled across a
       | cool project the other day that aims to make parsing well-known
       | binary formats much easier: http://kaitai.io/
       | 
       | In high school I was obsessed with the game Starcraft. It came
       | with its own map editor that would save to its own proprietary
       | format. Some smart people came along and reverse engineered that
       | format and allowed us to do all kinds of neat things we weren't
       | supposed to. I see modding communities for games are more popular
       | than ever, and finding this project brought back lots of great
       | memories.
        
         | mdaniel wrote:
         | I'm super-on-board with declarative binary parsing schemes
         | (i.e. not lua or python) but my own attempt to use kaitai was
         | met with the same frustration that presumably caused PDF to be
         | missing from the format gallery list
         | (http://formats.kaitai.io/): jump or offset-based structures
         | 
         | At the time, I tried using a stunt like defining one huge blob
         | that "eats" the main file, then reaching back into it as we
         | learned more, but it looks like somewhere along the way they
         | acquired a "substream" behavior (https://github.com/kaitai-
         | io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth
         | another look
        
           | thechao wrote:
           | Can you explain more? I deal with a shocking number of poorly
           | supported binary formats, and this tool looks awesome.
           | Buuuuttt... a lot of these formats are "headers with offsets
           | to structured data". They're intermixed logs, so multiple
           | incompatible bi art formats are freely mixed in the same
           | file.
        
       | zubairq wrote:
       | Yeah you can use SQLite as a file format. We use it to store our
       | project data at https://github.com/zubairq/pilot
        
       | polyrand wrote:
       | I think SQLite is underrated.
       | 
       | With WAL mode + increasing the page cache you can get some
       | excellent concurrency, even if doing reads and writes at the same
       | time.
       | 
       | With rqlite it's easy to make it a server database and have a
       | cluster of SQLite databases (https://github.com/rqlite/rqlite).
       | 
       | I wouldn't try to create the new Instagram with it, but I think
       | it'd be capable enough for many apps that are built on top of
       | more complex DBs.
        
       | michaelbuckbee wrote:
       | IIRC one of the attempts to deliver on "Github but for Design
       | files" used SQLite as an app file format.
        
       | gjsman-1000 wrote:
       | I worked with an SQLite Database on a personal C#/WinForms app.
       | It was a bit complicated to get started, but afterwards, what a
       | handy trick in comparison to using something like the
       | discontinued SQL Server Compact.
       | 
       | I did, however, "enhance" it a bit (or "proprietarize" it) by
       | encrypting it with a short password and an AES algorithm with
       | some uncommon settings. I never shipped it that way, but the
       | output file looked like any other proprietary app's file - a mess
       | of random symbols.
        
       | networkimprov wrote:
       | SQLite is a fine single-file application file format, but if you
       | want average users to be able to inspect and perhaps fix
       | application data when things go wrong, a tree of JSON-format
       | files is more friendly.
       | 
       | True, this requires care to ensure that such files are updated
       | reliably, but that's not quite rocket science.
        
       | tonyedgecombe wrote:
       | Maybe. I've used XML as a file format for all my recent projects.
       | It's well supported in the .Net framework, I don't need any 3rd
       | party assemblies. Using SQL to write out an object graph would
       | involve more work.
        
         | 737maxtw wrote:
         | Depends on the size or structure of the data.
         | 
         | If you're having to do things like updates over a large
         | dataset, SQLITE can be nice because of the performance boosts
         | from indexing.
        
         | simias wrote:
         | A tree I could understand since after all XML uses a tree-like
         | structure, but for a graph you'll need to store references
         | anyway so flattening and denormalizing it completely doesn't
         | sound like a lot more work to me.
         | 
         | Also XML makes me weep and I'll never willingly opt to use it
         | if there's an alternative, but that's my own prejudice.
        
         | gwbas1c wrote:
         | It depends on how large and complicated your XML file is.
         | 
         | Once your XML file hits a certain size, minor updates incur a
         | gigantic performance penalty because you have to write out the
         | entire XML file every time.
         | 
         | Random writes in the middle of an XML file are impossible. For
         | example, if you were to change an attribute so that it's one
         | character longer, you still have to rewrite the remainder of
         | the file in order to shift everything over by one character.
         | 
         | That's the main reason why SQLite is so popular for
         | applications.
         | 
         | (I know that's from personal experience. I had to support an
         | old version of an application that constantly wrote out XML
         | while the new version that we hadn't shipped used SQLite. A
         | customer that made heavy use of the old version basically hit
         | the limits of XML but the version that used SQLite wasn't ready
         | for them yet.)
         | 
         | But, I'm going to be honest here: We had some tables that only
         | had a few rows, so I moved those to XML files. QE really liked
         | it because it was easy to diagnose issues.
        
       | barking wrote:
       | Are we talking about good old fashioned CRUD applications here or
       | is this something different?
        
       | zzo38computer wrote:
       | They say that Microsoft Word, PowerPoint, etc are "fully custom
       | formats", but actually those formats are "wrapped pile-of-files
       | formats" (and 7-Zip can open them), although the contents of the
       | files inside are custom formats (and they seem to have failed to
       | consider that category).
       | 
       | The application ID number in the SQLite header can be used to
       | identify application file formats. The application ID number is a
       | 32-bit number, and there have been a few different ways to handle
       | it; I have seen the use of hexadecimal and of ASCII; I used base
       | thirty-six, and I have then later seen the suggestion to use
       | RADIX-50. Additionally, there is a document about "defense
       | against dark arts" in case you need to load untrusted files.
       | 
       | TeXnicard uses a SQLite database file (with application ID
       | 1778603844) for the card database file. The version control file
       | (optional, and not fully implemented yet) uses a custom format
       | (which is fully documented), and it does support atomic
       | transactions. It consists of a header followed by a sequence of
       | frames, which are key frames and delta frames. The header of the
       | version control file contains two pointers: one to the beginning
       | of the most recently committed key frame, and the other one to
       | the end of the most recently committed frame (whether a key frame
       | or a delta frame; if all frames are fully committed, this will be
       | equal to the length of the file). These pointers are written only
       | after the rest of the file is written; if it gets interrupted,
       | reads will ignore the partially written data, and further writes
       | will overwrite the partially written data.
       | 
       | ZZ Zero uses a Hamster archive of custom (but documented) formats
       | as its world file format. (A Hamster archive is zero or more
       | "lumps" concatenated together. A lump consists of a null-
       | terminated ASCII filename, 32-bit PDP-endian data size (measured
       | in bytes), and then the data of that lump. The preceding text in
       | these parentheses is the full definition of the Hamster archive
       | format; you can use this to implement your own.)
       | 
       | Free Hero Mesh uses a "pile-of-wrapped-pile-of-files" format. A
       | puzzle set consists of four files: .class (which stores class
       | definitions), .xclass (which stores pictures and sounds to be
       | used by the class definitions), .level (which stores levels), and
       | .solution (which stores solutions). The .class file is a plain
       | text file; the other three are Hamster archives. These are four
       | logically distinct parts of a puzzle set; this allows you to
       | split them apart, to create symlinks to share class definitions
       | with puzzle sets, to substitute your own graphics, to work with
       | multiple solution sets (e.g. per user), etc. If you need to do
       | more than that, then you can of course extract the lumps if
       | needed. For class definitions, you can just copy and paste the
       | text.
       | 
       | MegaZeux used a fully custom format before, but now it uses a ZIP
       | archive with the stuff inside being custom formats (one of which
       | is the "MegaZeux Property List" format, which I have documented
       | in Just Solve The File Format wiki; the authors of MegaZeux did
       | not seem to document this format themself anywhere, so I figured
       | it out and did it by myself).
       | 
       | For some cases, SQLite database is a good application file
       | format; other times, I think other formats (such as text formats)
       | may be better. It depends on the application. XML is too often
       | used for stuff that isn't text markup stuff, and XML is
       | especially bad for stuff that isn't text markup stuff, I think.
       | 
       | If you use SQLite though, you will get more than just the
       | database access. It also gives you the string builder functions,
       | the sqlite3_mprintf function, a page cache implementation, memory
       | usage statistics, and a SQL interpreter; the SQL interpreter can
       | be used as one way to allow user customization and user queries
       | (including batch operations), without having to make an entirely
       | new scripting language to embed.
       | 
       | They mention interfaces of SQLite are available for many other
       | programming languages, although at least one that doesn't seem to
       | have a interface to SQLite is PostScript (although you can use
       | %pipe%, it doesn't work so well especially since it is only a one
       | way pipe), and I am not sure if awk has it either.
        
         | chrismorgan wrote:
         | .doc, .ppt, &c. were fully custom. .docx, .pptx, &c. are ZIP
         | files. Remember that this was written in 2014, and many people
         | were still using the old formats then, because many people
         | didn't have Office 2007 or newer or software that could cope
         | with this new format. (Now I think very few people still use
         | the old Office formats.)
        
       | nathan_f77 wrote:
       | I've been reading through many of the other pages on the SQLite
       | website. I didn't realize that SQLite is developed by a company
       | (Hwaci) and funded by selling support, licenses, and testing
       | services. I had always assumed it was an open source project
       | backed by a non-profit, similar to the Apache Software
       | Foundation.
       | 
       | This is another very interesting example of an open-source
       | business. I would be interested to learn more about how the Hwaci
       | company operates (revenue, number of employees, etc.). I find
       | this very interesting:
       | 
       | > We are a 100% engineering company. There is no sales staff. Our
       | goal is to provide outstanding service and honest advice without
       | spin or sales-talk.
       | 
       | They list some "$8K-50K/year" and "$85k" price tags directly on
       | their "Pro Support" webpage. These would usually be behind a
       | "Schedule a Call" or "Get Quote" button. I've been thinking about
       | doing something similar with my own on-premise licenses and
       | support contracts. I'm not very good at sales and I don't really
       | want to hire a sales team, so I'd be interested to know how this
       | worked out for them.
       | 
       | I also liked this sentence, which is very similar to Basecamp's
       | philosophy (and both companies were started around the same time
       | - 1999 vs 2000):
       | 
       | > Hwaci intends to continue operating in its current form, and at
       | roughly its current size until at least the year 2050.
       | 
       | It's interesting to think that SQLite could have raised money and
       | grown into a billion-dollar public company with thousands of
       | employees.
       | 
       | I'm going to listen to this Changelog interview with Richard Hipp
       | now [2], and also this talk on YouTube [3].
       | 
       | [1] https://sqlite.org/prosupport.html
       | 
       | [2] https://changelog.com/201/
       | 
       | [3] https://www.youtube.com/watch?v=Jib2AmRb_rk
        
         | Varriount wrote:
         | As someone who has had to evaluate and recommended enterprise
         | software for purchasing, a straightforward price tag is going
         | to make me come your product in a better light.
         | 
         | Plus, I have to wonder how much extra profit the "call us"
         | route actually takes in, after you've subtracted costs for the
         | marketing staff it requires (especially if you try to
         | renegotiate the cost after the subscription/license expires).
        
         | TheRealPomax wrote:
         | Also important to note is that enterprise pricing is way higher
         | than you'd think because when an enterprise needs a solution,
         | something like $1000 is literally too cheap to justify the cost
         | of procurement (which involves not just getting a license, but
         | also internal documentation, which in turn needs to be entered
         | into whatever process certification programme is in place,
         | etc.)
        
       | adamnemecek wrote:
       | I have been thinking about this as well lately. For my app
       | (http://ngrid.io) I'm using a custom built Entity-Component-
       | System (https://en.wikipedia.org/wiki/Entity_component_system)
       | based UI framework in Rust.
       | 
       | In the ECS paradigm, everything is stored as struct-of-arrays as
       | opposed to array-of-structs
       | (https://en.wikipedia.org/wiki/AoS_and_SoA) and as a result, your
       | serialization becomes trivial. You are not chasing pointers to
       | other objects when serializing. Apple's Core Data takes the
       | object graph approach and it's a real clusterfuck.
       | 
       | ECS is very similar to databases in that your data is normalized
       | and things are referenced by offsets rather than pointers
       | https://floooh.github.io/2018/06/17/handles-vs-pointers.html).
       | 
       | Why roll your own thing as opposed to use say SQLite? If you use
       | SQLite, you will probably have some OOP on top of it which
       | introduces a serious impedance mismatch. With ECS, you cut that
       | whole layer out.
       | 
       | I'm guessing that I'm losing on some atomicity but for my use
       | case that doesn't matter that much.
        
         | lxdesk wrote:
         | ECS is useful for real-time, but it means that you are rolling
         | your own query logic. This is fine for the runtime of game
         | engines because they generally aren't dealing with that many
         | data types(for the very most complex AAA games, perhaps a few
         | hundred components) and the queries used at runtime are simple
         | and tend to demand optimization at the data structure level.
         | 
         | If your purpose is an editing tool you may want to reconsider.
         | Editing changes the goal in a very substantial way and the
         | complexity of your queries goes way up, which is where SQL
         | syntax absolutely shines.
        
           | TeMPOraL wrote:
           | FWIW, a game I'm writing uses in-memory SQLite as its ECS,
           | mostly because I wanted to see if I could. SQLite is
           | surprisingly performant. My game is not realtime, but I
           | tested it, and I could actually redraw the screen at 60+ FPS
           | while querying for render data every frame, and I still had
           | lots of frame time to spare.
        
       | garganzol wrote:
       | Everything old is new again. I can recall a similar movement at
       | the end of 90s regarding databases as application file formats.
       | That time it was Microsoft Jet Database engine.
        
       | rb808 wrote:
       | I keep switching between csv, parquet and sqlite. I dont need
       | another article to encourage me to switch again! :)
        
       | waynesonfire wrote:
       | What is an application file format?
        
         | detaro wrote:
         | The first section of the article explains it.
        
       | dragly wrote:
       | From experience, I can also recommend using SQLite as an
       | application file format. I landed on SQLite after looking for
       | solutions for a file format for an educational app we made for
       | simulating biological neural networks. The app is cross-platform,
       | written in Qt and the simulations needed to be stored as a JSON
       | describing the network, a thumbnail and some metadata. It was
       | also intended to be extensible with more features and backwards
       | compatible if new versions were released. I considered creating
       | our own simple format, using ZIP files, HDF5, Qt resource files
       | or SQLite.
       | 
       | I landed on SQLite for many of the reasons outlined in this
       | article and in particular because of how easy it was to implement
       | and maintain. SQLite is supported natively in QtSql, which made
       | it extremely easy to write the save and load functions, and later
       | extend these with more data fields. In addition, we did not have
       | to worry about cross-platform support since this was covered by
       | SQLite and Qt already.
        
         | bob1029 wrote:
         | It sounds like future schema changes may be a potential concern
         | for your application. One thing you can look into using is the
         | SQLite user_version pragma. We use this right now to roll our
         | own migrators and it's light years better than how migrators
         | work for Entity Framework, et. al.
         | 
         | https://www.sqlite.org/pragma.html#pragma_user_version
        
         | radarsat1 wrote:
         | Interesting, I used HDF5 in a similar situation because we
         | needed to save a lot of same-sized rows of data (simulation
         | time steps), so a matrix-oriented format seemed to make sense
         | but it wasn't entirely without some need for cross-referencing
         | between tables, so it does make me wonder now if sqlite would
         | have been a comparable or better choice. Any reason for
         | rejecting HDF5 in your case?
        
       | Nihilartikel wrote:
       | Recently I've been using SQLite to store state for a data-syncing
       | CLI utility in Python where I was considering using CSV or JSON
       | at first. The Python 'Dataset' library,
       | https://dataset.readthedocs.io/en/latest/ has taken almost all of
       | the friction out of using SQLite for trivial state persistence.
       | 
       | I don't have to manually manage schema or create tables - they're
       | all lazily created from Dict keys on insertion, but indexes and
       | SQL can still be used as desired.
       | 
       | This lends itself really well to Jupyter notebook assisted
       | development, where functions can be quickly and interactively
       | iterated without having to muck around with existing tables
       | whenever data changes shape.
       | 
       | It's been a real productivity boost, and I've been looking around
       | for something similar to use in Clojure.
        
       | fatboy wrote:
       | We transitioned the file format for the images in our climbing-
       | topo app from a pile of files to SQLite (the images are tiled).
       | 
       | Going from having high hundreds of tiny files on disk per topo
       | photo to just one was an incredible boon to productivity for
       | things like data backup and transferring files onto the device
       | for testing.
        
         | wetmore wrote:
         | Tell me more about your app
        
           | fatboy wrote:
           | It acts as a digital window into our print guidebook
           | catalogue. The books are mostly for the uk, with the rest
           | spread across Europe.
           | 
           | I take the finished desktop-publishing documents and extract
           | and package the data to put into the app, including tiling
           | the images so we can have very high resolution photos on low-
           | end devices.
           | 
           | There's a semi-technical article about the topo-view
           | implementation here:
           | 
           | https://www.ukclimbing.com/articles/features/rockfax_app_dee.
           | ..
        
       | me551ah wrote:
       | Where can you use SQLite?
       | 
       | Embedded Systems: Yes
       | 
       | Raspberry Pi : Yes
       | 
       | Mobile Apps. : Yes
       | 
       | Desktop Apps : Yes
       | 
       | Browsers : No
       | 
       | Servers : Yes
       | 
       | Supercomputers : Yes
        
         | gwbas1c wrote:
         | You used to be able to use it in a browser with Google Gears.
         | 
         | The reason why it was never adopted was because the browser
         | makers wanted to be able to independently implement their own
         | database instead of everyone having to use the same source
         | code.
        
       | ris wrote:
       | Amongst all the praise we're seeing for this approach here, may
       | we please just have a minute's silence for the various exploits
       | sqlite has seen when loading arbitrary database files over the
       | years?
       | 
       | My recent favourite being
       | https://media.ccc.de/v/36c3-10701-select_code_execution_from...
        
         | SQLite wrote:
         | This concern was just raised on the SQLite Forum (probably
         | after showing up here). See my reply at
         | https://sqlite.org/forum/forumpost/8beceed68e for additional
         | insights into the problem and recent SQLite enhancements to
         | address it.
        
           | ris wrote:
           | Absolutely, there are many things that can be done to
           | mitigate this kind of thing, I'm merely advising caution over
           | people immediately reaching for a solution where they're
           | executing thousands upon thousands of lines of code over
           | untrusted files when they can get away with something much
           | simpler.
        
         | zucker42 wrote:
         | Is there any way to mitigate code execution issues (besides
         | just fixing bugs)? Because allow code execution might be a
         | pretty big deal breaker in certain contexts.
        
         | empthought wrote:
         | This is a security flaw of the system or application using
         | SQLite, though. It is trusting data it should not trust and
         | executing code provided with that data.
        
           | roelschroeven wrote:
           | If I use libpng or libjpeg to decode an image and that causes
           | a buffer overrun or another security vulnerability, that is
           | very much a bug in libpng or libjpeg.
           | 
           | The same is true when I open a database in SQLite: if that
           | causes a security problem, it's a bug in that library. I
           | don't even see how you could validate a database file before
           | you hand it over to SQLite.
        
             | empthought wrote:
             | RTFM
             | 
             | https://www.sqlite.org/security.html
        
           | skrebbel wrote:
           | Huh? If you're using SQLite as an application file format,
           | isn't opening untrusted files a key use case? What else,
           | never share files with people? That must be a pretty boring
           | application.
        
             | empthought wrote:
             | Sure but just like a web browser, it's the application's
             | responsibility to sandbox or verify the code, not the http
             | library's.
        
               | skrebbel wrote:
               | That translates to "don't use SQLite", right?
               | 
               | How would an application that uses SQLite as a file
               | format be able to scan for malicious database files that
               | trigger buffer overflows in the SQLite engine? I'm really
               | not sure what you're suggesting.
        
               | empthought wrote:
               | RTFM
               | 
               | https://www.sqlite.org/security.html
        
               | speedgoose wrote:
               | From the description, I didn't watch the video, loading
               | the database and querying it is enough to run the
               | exploit. To compare with your example, it would be like
               | having a remote exploit if your application queries a
               | specific http endpoint.
               | 
               | For sure application developers could sandbox the http
               | library, sqlite, or stop using libraries developed in so
               | unsafe programming languages but it's a bit too early for
               | that.
        
           | [deleted]
        
       | boudewijnrempt wrote:
       | I was surprised, but also intrigued when I discovered that Manga
       | Studio (Clip Studio these days) used sqlite for the native file
       | format, including the pixel planes.
       | 
       | I wish my predecessor at Krita had made that choice, instead of
       | choosing to re-use the KOffice xml-based file format that's
       | basically a zip file with stuff in it. It would have made adding
       | the animation feature so much easier.
        
         | mkesper wrote:
         | So isn't it possible for Krita to create an updating process
         | and switch? Btw. it's really fantastic Krita is on Android now!
        
         | im3w1l wrote:
         | What's wrong with xml + zip?
        
           | sradman wrote:
           | One of the famous SQLite refrains is:
           | 
           | > SQLite does not compete with client/server databases.
           | SQLite competes with fopen().
           | 
           | This undersells SQLite somewhat. Like Berkeley DB, SQLite was
           | created as an alternative to dbm [1] and one of the main use
           | cases is safe multi-process access to a single data file,
           | typically CLI apps (written in C, TCL, etc.).
           | 
           | Client-Server databases tackle multi-user concurrency while
           | embedded databases often tackle multi-process concurrency.
           | 
           | This article has long been part of the SQLite documentation
           | found under the "Advocacy" section. There is also a short
           | version. [2]
           | 
           | [1] https://en.wikipedia.org/wiki/DBM_(computing)
           | 
           | [2] https://sqlite.org/aff_short.html
        
           | boudewijnrempt wrote:
           | It's really hard to update in place, we basically have to
           | rewrite the whole file after a change, and, of course, we've
           | got to do a lot of things ourselves, like inventing an xml
           | format for describing the image.
        
           | derefr wrote:
           | XML-based formats are export formats, not state-keeping
           | formats. To use an XML-based format for storage, need to have
           | a separate, canonical _in-memory_ representation of the data,
           | which you then snapshot and serialize into XML upon request.
           | You may or may not be able to get away with serializing less
           | than your full in-memory object graph upon save, using
           | techniques similar to DOM reconciliation. Either way, you 'll
           | still likely need your entire document/project _represented_
           | in memory.
           | 
           | If you're working with something analogous to a text
           | document, this snapshot-and-serialize approach to saving
           | works fine. If you're working with other types of data,
           | though, this approach only works for trivial projects; once
           | your document exceeds ~100MB, the overhead of
           | snapshotting+serializing your object graph becomes bad enough
           | that people stop saving very often (dangerous!), and it also
           | makes the saving process itself more fragile (since the
           | longer a save takes, the more likely it becomes that the
           | process might be killed by some natural event like a power
           | cut during it+.)
           | 
           | And, once your project size exceeds the average computer's
           | memory capacity, an in-memory canonical representation
           | quickly becomes untenable. You start to have to resort to
           | hacks like forcing the user to "partition" their project,
           | only allowing the user to work with one pieces at a time.
           | 
           | With an applicaton store-keeping format, you have none of
           | these concerns; the store is itself the canonical data
           | location. You don't have a canonical in-memory representation
           | of the data; the in-memory representation is simply a write-
           | through or write-back caching layer for the object graph on
           | disk, and the cache can be flushed at any time. Or you may
           | not have a cache at all; many systems that use SQLite as a
           | file-format just do SQL queries directly whenever they want
           | to know something, never instantiating any intermediate in-
           | memory representation of the data itself, only retrieving
           | "reports" built from it.
           | 
           | + You can fix fragile saving with a WAL log, but now the WAL
           | log is your true application state-keeping format, with the
           | XML format just being a convenient rollup representation of
           | it.
        
             | geokon wrote:
             | I've never had my problem scale to the size that required a
             | database/SQL, but I don't quite get the advantage of your
             | solution. Having all your interactions with data have to go
             | to disk though a cache muddles things b/c it makes it much
             | harder to reason about performance (b/c when do you have a
             | cache miss? and how do you configure a cache properly?) You
             | introduce a lot more blackmagic variables to reason about.
             | 
             | If you're editing images I'd think it'd just makes more
             | sense to have all of your stuff in RAM and then a saving-
             | to-disk is done on a separate thread. I don't quite get why
             | the users would stop saving in this example.
             | 
             | I'm not saying you're wrong - but more asking for some more
             | details b/c I've never imagined using a DB on data that can
             | fit in RAM
        
               | HelloNurse wrote:
               | It's primarily a problem of inflexibility handicapping
               | performance, not of "cache misses" and clever algorithms.
               | 
               | For example, imagine a word processing program opening a
               | document and showing you the first page: you could load
               | 50MB of kitchen sink XML and 250 embedded images from a
               | zip file and then start doing something with the
               | resulting canonical representation, or you could load the
               | bare minimum of metadata (e.g. page size) from the
               | appropriate tables and the content that goes in the first
               | page from carefully indexed tables of objects. Which
               | variant is likely to load faster? Which one is guaranteed
               | to load useless data? Which one can save the document
               | more quickly and efficiently (one paragraph instead of a
               | whole document or a messy update log) when you edit text?
        
             | catalogia wrote:
             | > _it also makes the saving process itself more fragile
             | (since the longer a save takes, the more likely it becomes
             | that the process might be killed by some natural event like
             | a power cut during it+.)_
             | 
             | This is one I take very seriously, after I got bit by it. I
             | was saving state by writing s-expressions to a text file;
             | it seemed a reasonable enough thing to do even with tens of
             | megabytes of it, until my laptop turned off in the middle
             | of a write. After recovering from a backup and losing
             | several hours of work in the process, I switched to SQLite
             | that evening.
        
           | 7thaccount wrote:
           | I don't personally know anything wrong with it, but SQL
           | queries are powerful powerful powerful, so although XML might
           | not have any huge issues, in my personal opinion it is still
           | sub-optimal compared to what is possible with SQL (at least
           | for my uses, which could be very different to yours so take
           | with a grain of salt). In particular, SQLite is a tiny
           | executable with a HUGE amount of tests and is used very
           | widely, so reliable for a lot of tasks as long as the data
           | size stays relatively small and you caste your types
           | properly.
           | 
           | SQLite also has a lot of tutorials and books and language
           | support. I've used it with Python, C#, Perl, TCL, and
           | Powershell with no issues. You can access it via the command-
           | line or you can hook into it with a fully graphical SQL IDE
           | like DB-Visualizer (I really recommend using an IDE for
           | interactive SQL use). If your language doesn't have built-in
           | or library support, even a novice programmer like myself can
           | roll a few functions together to build the tables, update,
           | delete, and run queries to analyze the data if you can run
           | some system commands. It's a wonderful little technology that
           | I feel comfortable reaching to when I need it.
           | 
           | One thing I've shied away from over the years are
           | technologies which require running complex installers as it
           | makes things more confusing and makes it harder for me to
           | share with colleagues that aren't as interested in
           | programming. Both SQLite and DB-Visualizer require zero
           | installation. I just put each in a folder and then run the
           | executable. This is really easy to use to me and easy to get
           | others started too. Note that this is not commercial
           | software, but internal business apps to help people do
           | complex tasks easier. So you have a script that does some
           | data processing pushes that data to SQLite and then the user
           | can bring up DB-VISUALIZER, point it to the little SQLite .db
           | file I created and then get to work. We have a lot of little
           | apps like this and since most of our engineers are really
           | good with SQL, they can do whatever they need efficiently.
        
           | chrisseaton wrote:
           | Seems very hard to seek in XML? And how would you update it
           | in place? Also seems like it would be massive?
        
             | boudewijnrempt wrote:
             | Well, we're not storing pixels in xml, but now we've got
             | two xml files, one with author data, one descibing the
             | image and a folder tree containing various kinds of binary
             | data, all in a zip file. And yes, updating in place is a
             | bother.
        
           | airstrike wrote:
           | Do people really pick XML as a go-to format these days?
           | Probably easier to list what's not wrong with it
        
             | boudewijnrempt wrote:
             | Well, these days 22 years have passed since the original
             | file format was designed -- and not by me, though I
             | probably would have done the same thing, 22 years ago.
             | 
             | Also, it's zip + xml files + binary files, not all xml.
        
               | airstrike wrote:
               | Well, I was responding to the "what's wrong with xml +
               | zip" question, not your remark on your predecessor
        
       | wodenokoto wrote:
       | To my very limited knowledge it is very common for iOS and
       | android apps to store data in SQLite.
       | 
       | Is it possible to extract or change these databases?
       | 
       | I have a few (offline) apps on my phone that I'd love to append
       | data to
        
         | SonOfLilit wrote:
         | It's possible on Android if you have root access, I think.
         | Android is very serious about apps not accessing each other's
         | data.
        
         | kccqzy wrote:
         | For unjailbroken iOS the only way to extract these databases is
         | to make an unencrypted device backup in iTunes (Finder in
         | Catalina) and then look at the backup file. The backed up files
         | won't be conveniently named but you can still find it.
        
           | wodenokoto wrote:
           | Can I push my edits back onto my iPhone or is that a pipe
           | dream?
        
         | muxator wrote:
         | As a sibling comment says, if you are on Android and have root
         | access, it is very easy.
         | 
         | Just have a look at /data/data/appname/.
         | 
         | For example, this is what I copy to make backups of my contacts
         | on my Android phone:
         | 
         | > # pwd
         | 
         | > /data/data/com.android.providers.contacts/databases
         | 
         | >
         | 
         | > # ls -1
         | 
         | > calllog.db
         | 
         | > contacts2.db
         | 
         | > profile.db
         | 
         | This is a simple way for moving data around, restoring
         | applications, performing backups, editing your data (if you
         | know your way inside the app).
         | 
         | Beware of the selinux labels if you are moving files across
         | different devices, as recent android version now run with
         | selinux in enforcing mode.
         | 
         | Just be sure to adjust them (useful commands if you need a
         | reference: ls -lZ, chcon, semanage).
        
       | booleandilemma wrote:
       | SQLite is a database that tries so hard to not be a database :)
        
       | ori_b wrote:
       | Please no. This is an incredibly complicated file format that is
       | not particularly well suited for file saving. It's also very slow
       | ((edit: correction) ~10 transactions per second) if you aim for
       | integrity, or unsafe if you tune for performance (~50k
       | transactions per second, but if your program or computer dies
       | half way, your file is hosed). So, keeping files up to date in
       | place makes your ui janky.
       | 
       | You can work around it by working in memory and writing out a
       | whole new database from in memory structures on save and then do
       | the atomic rename. But if you do that, you are probably better
       | off with json, protobuf, or similar. The libraries around these
       | formats are similarly battle tested, but they fit the needs
       | better, supporting working in ram fully and then saving cleanly
       | and easily.
        
         | kbumsik wrote:
         | > It's also very slow (~100 transactions per second) > But if
         | you do that, you are probably better off with json, protobuf,
         | or similar.
         | 
         | Are you talking about a server application?
         | 
         | > or unsafe if you find for performance (you can get it up to,
         | IIRC, ~50k transactions per second, but if your program or
         | computer dies half way, your file is hosed)
         | 
         | Never heard that SQLite has unsafe operations. Any source?
        
           | ori_b wrote:
           | > _Are you talking about a server application?_
           | 
           | I'm talking about doing a handful of transactions -- even a
           | single one, now that I looked at the actual numbers that
           | sqlite discusses -- being enough to introduce user-visible
           | jank.
           | 
           | > _Never heard that SQLite has unsafe operations. Any
           | source?_
           | 
           | The sqlite docs. You can improve the performance of sqlite by
           | multiple orders of magnitude by messing with things like
           | https://www.sqlite.org/pragma.html#pragma_synchronous, at the
           | cost of safe atomic updates.
        
             | kbumsik wrote:
             | > The sqlite docs. You can improve the performance of
             | sqlite by multiple orders of magnitude by messing with
             | things like
             | https://www.sqlite.org/pragma.html#pragma_synchronous, at
             | the cost of safe atomic updates.
             | 
             | Thanks for the link, but only when the OS or your computer
             | crashes, not even the application itself. Could you
             | introduce any application file format that is safe in that
             | case? Obviously your JSON file format is much more unsafe
             | than that, not to mention the huge overhead of JSON with
             | binary data.
        
               | ori_b wrote:
               | > _Thanks for the link, but only when the OS or your
               | computer crashes, not even the application itself._
               | 
               | Ah, right there are also the journalling knobs that you'd
               | tweak to get to max performance -- either in-memory or no
               | journalling at all make a big impact. There are a bunch
               | of them scattered throughout the docs.
               | 
               | > _Could you introduce any application file format that
               | is safe in that case? Obviously your JSON file format is
               | much more unsafe than that, not to mention the huge
               | overhead of JSON with binary data._
               | 
               | Yes, any format is safe if you write to a temporary file,
               | and then rename(2) it to replace it. This is guaranteed
               | to be atomic, so it works with anything (including
               | sqlite, though at that point all sqlite does is cost you
               | performance and complexity).
        
         | tzs wrote:
         | > Please no. This is an incredibly complicated file format that
         | is not particularly well suited for file saving. It's also very
         | slow (~100 transactions per second) if you aim for integrity,
         | or unsafe if you don't.
         | 
         | The kind of application files they are talking about (things
         | like word processor documents, spreadsheets, drawings, source
         | code control system data) would only be writing sporadically.
         | During one of those sporadic writes they might need to update
         | thousands of rows but those could all be done in one
         | transaction.
        
           | ori_b wrote:
           | I made a mistake -- it's an order of magnitude slower. That
           | means a single transaction is within the threshold for human
           | perception. If you block the UI thread on a transaction,
           | you're now dropping 6 frames at 60fps. If you block other
           | operations on the transaction completing, you end up with
           | lag.
           | 
           | Reference: https://www.sqlite.org/faq.html, question 19.
           | (I've seen similar when testing on SSDs locally).
        
             | lalaland1125 wrote:
             | Why would you block the UI thread on a transaction?
        
               | ori_b wrote:
               | Because you've got another operation triggered by a UI
               | action, and you need the previous one to finish, for
               | example.
        
             | lostmyoldone wrote:
             | If you have an FPS or interactivity target that needs to be
             | guaranteed to be better than about 0.1-1 second - depending
             | on platform - you can't have any disk IO at all on the UI
             | thread, or any thread that's supposed to react at
             | interactive rates at all times.
             | 
             | It has generally become much better in the last decade or
             | two, but one should still expect most OS's to _sometimes_
             | pause for excessive amounts of time on disk IO unless the
             | API is specifically guaranteed to never pause. Even then
             | one would be wise to measure /log deviations if it's
             | critical for the application. OS guarantees might also be
             | contingent on driver/subsystem guarantees, and bad drivers
             | might sometimes affect what seems completely unrelated
             | upstream systems.
        
               | ori_b wrote:
               | > _If you have an FPS or interactivity target that needs
               | to be guaranteed to be better than about 0.1-1 second -
               | depending on platform - you can 't have any disk IO at
               | all on the UI thread, or any thread that's supposed to
               | react at interactive rates at all times._
               | 
               | Yes, and other file formats encourage doing things in
               | memory, so you don't have any disk i/o in the common
               | path.
               | 
               | Using sqlite as a file format strongly discourages the
               | simple, jank-free until you press save workflow of
               | slurping your content, operating on it in memory, and
               | then outputting it all in one operation as a response to
               | an explicit user action. Instead, your whole application
               | gets small but perceptible delays across all operations
               | and interactions.
        
               | lxdesk wrote:
               | That conflates the use of the schema with the use of the
               | format features. You can, in fact, use your custom in-
               | memory structures and then "slurp it out" to BLOB when
               | you're ready to save.
        
               | ori_b wrote:
               | Yes. But if you go that way, though, sqlite is just a
               | slow and clunky format compared to, eg, json or protobuf.
               | I doubt that any program out there uses sqlite this way,
               | other than those which realize they painted themselves
               | into a performance corner and hacked around it for
               | compatibility.
        
               | em500 wrote:
               | Your concerns seem pretty theoretical. Several _very_
               | widely used applications (Whatsapp, iMessage, Apple Notes
               | /Calendar) use sqlite as a file format without much
               | issue.
        
         | SQLite wrote:
         | Storing content in SQLite is actually _faster_ than writing the
         | equivalent content directly to disk, in many situations. See
         | https://www.sqlite.org/fasterthanfs.html for discussion,
         | caveats, and links to source code where you can verify these
         | claims for yourself.
        
       | renewiltord wrote:
       | GeoPKG is an open format that is just a SQLite schema. It's cool!
        
       | tracker1 wrote:
       | Concurrent access to SQLite is the main reason I haven't really
       | looked into Deno. For that matter, I really think that
       | WebAssembly file system implementations should have enough
       | flexibility to be able to work with a SQLite database file in the
       | host OS.
       | 
       | The needs for concurrent SQLite pretty much covers the need for a
       | robust file system.
        
       | pvg wrote:
       | Previously: https://news.ycombinator.com/item?id=18754634
        
       | rlpb wrote:
       | Would such an app hold the sqlite database open while the user
       | has the document "open", and live-write user changes back to the
       | database immediately? Or would it follow the traditional model of
       | the user choosing to "Save"?
       | 
       | I worked on an app that did the former many years ago (to an
       | Access database, not sqlite), and it did not go well because this
       | broke user expectations on the usual "open/save/save as" model.
        
         | catalogia wrote:
         | I think the former is preferable and most modern users will
         | prefer it. However if your users are upset about it, you could
         | do the later, or you could do the former and give them a
         | dummy/placebo "save" option that does nothing, or maybe VACUUMs
         | the DB.
        
         | pmilot wrote:
         | The save model at the UX level is completely orthogonal to the
         | application file format. You could implement either model with
         | an SQLite DB file.
        
         | EvilTerran wrote:
         | I guess you could send changes to the DB on-the-fly inside a
         | transaction, so when the user clicks "save" it's just a matter
         | of running COMMIT? Not sure what effect that would have on
         | performance as the transaction grows, though.
        
         | mkl wrote:
         | I think it would depend on the app (e.g. how document-like it
         | is), but nowadays people will be much more used to the instant-
         | save model due to things like Google Docs, OneNote, and phone
         | apps.
        
       | bstar77 wrote:
       | Is anyone doing this in Javascript (via electron) and encrypting
       | the file in such a way that the data is protected and not an
       | obvious sqlite db?
       | 
       | I have some ideas on how to do this, but I'm curious if there's a
       | "preferred" way to do it.
        
         | gregmac wrote:
         | Why would you want to "protect" the data in this way? It's
         | user-hostile, and anyone determined enough will be able to get
         | at it no matter what you do.
         | 
         | If it's popular and valuable enough, instructions and/or code
         | to break it automatically will then be published, regardless of
         | how much money or time you invested into the protection. (For
         | proof, look no further than the game industry's DRM over the
         | past 30 or so years)
        
           | bstar77 wrote:
           | It's not meant to be user hostile, it's for a game. I allow
           | data exports, but I don't want to expose the inner workings
           | of the game via direct access to the DB.
        
             | lxdesk wrote:
             | Data mining is an old gamedev worry. Nobody who's been
             | doing it for a while particularly cares. If the game is
             | remotely popular, data mining will happen. Disassembly to
             | find the decrypt function is a talent possessed by many,
             | many people who have the reverse-engineering bug. If you
             | then obfuscate the code, you have made things interesting
             | and then increasingly talented people will try to have a go
             | at it. If you combine a changing obfuscating technique with
             | updates, you can slow down community possession of the game
             | with respect to modding etc., and Minecraft worked that way
             | in its beta phases. But it's really all a question of your
             | purpose at that point.
             | 
             | If you haven't already, I suggest cruising through TCRF:
             | https://tcrf.net/The_Cutting_Room_Floor
        
             | Vendan wrote:
             | You do realize, exposing the "inner workings of the game"
             | seems to be at least somewhat responsible for massively
             | increasing the appeal of the game in other cases? see
             | Minecraft, Rimworld, KSP as a few examples...
        
               | serf wrote:
               | ... see Minecraft, Rimworld, KSP as a few examples...
               | 
               | I mean, I agree that games should be moddable, but not
               | from the premise of game profitability and popularity.
               | 
               | The highest-grossing video game franchise list is over-
               | ran by products that are , pretty much, famously
               | unmoddable.[0]
               | 
               | [0]: https://en.wikipedia.org/wiki/List_of_highest-
               | grossing_video...
        
               | TeMPOraL wrote:
               | That list has some old and long-running franchises (e.g.
               | Mario), some semi-interactive movie experiences (e.g.
               | CoD), and some games that have in fact been modded to
               | ridiculous extent (Warcraft, StarCraft, The Sims). I
               | wouldn't read too much from it.
        
         | pilif wrote:
         | SQLite has encryption plugin support, or you could encrypt the
         | file yourself at rest. However, no matter what you are doing,
         | as the application needs to be able to decrypt the file, you
         | have to ship the key and thus the protection will be able to be
         | circumvented.
         | 
         | It's really not worth the effort IMHO
        
           | bstar77 wrote:
           | That's a good point... I don't want to have the game rely on
           | "phoning home" to decrypt the db.
        
       | geff82 wrote:
       | Haha!! I thought it was only me using sqlite as my go-to file
       | format in apps.
        
       | TheRealPomax wrote:
       | The great thing about applications that use sqlite as their
       | application file format is that you can write secondary apps and
       | utilities to supplement the applications themselves.
       | 
       | For example, Adobe's Lightroom uses sqlite as their application
       | file format, which means that it's almost trivial to write an
       | application to help it do things it's either really
       | bad/exceedingly slow at (like removing images from a catalogue
       | that no longer exist on disk) or literally can't do (like
       | generating a playlist of files-on-disk for a specific collection
       | or tag), without being locked into whatever janky scripting
       | solution exists in-app. If there even is one to begin with.
       | 
       | All you need is a programming language with a sqlite connector,
       | and you're in the driving seat. And sure, you'll need to figure
       | out the table schemas you need to care about, but sqlite _comes
       | with sqldiff_ so it 's really easy to figure out which operations
       | hit which tables with only a few minutes of work.
       | 
       | Good luck reverse engineering proprietary file formats in the
       | same way!
        
         | TheRealPomax wrote:
         | Heck, I wrote https://github.com/Pomax/lightroom-catalog-helper
         | exactly because lrcat is an open format that takes barely any
         | time to figure out, which means the only limit to what you can
         | do with "lightroom" is what your programming skills allow for.
        
         | mongol wrote:
         | I tried to do that with a Firefox SQLite database files but it
         | had an exclusive lock.
        
           | TheRealPomax wrote:
           | This is the case for any application that constantly writes
           | to their dbs, so you usually run your own application/utility
           | on them when the "parent" application isn't running.
        
           | simonw wrote:
           | You can usually work around that by creating a copy - I run
           | "cp /path/to/locked.db /tmp" all the time for this.
        
             | pjscott wrote:
             | Make sure to also copy locked.db-wal if it's present. (Is
             | there a better way to do all this? I feel like there really
             | ought to be a better way.)
        
         | petters wrote:
         | It's very annoying when the applications use sqlite with
         | encryption. That is not open source, so regular language
         | bindings can not be used.
        
           | TheRealPomax wrote:
           | I'm not sure how slqite as your application file format has
           | anything to do with open source, honestly. It's certainly
           | super easy for open source projects to use, but a closed
           | source proprietary application using an encrypted sqlite file
           | sounds perfectly sensible to me?
           | 
           | We all win when folks decide to leave it accessible, but I'm
           | not going to hold "encrypting a file format so that people
           | can't easily reverse engineer it" against folks who are
           | trying to sell software.
        
             | labawi wrote:
             | > I'm not going to hold "encrypting a file format so that
             | people can't easily reverse engineer it" against folks who
             | are trying to sell software.
             | 
             | I think I'd mostly disagree. Selling an application is one
             | thing, but the data itself is usually customers' and
             | holding their data hostage is not a proper thing to do.
        
               | TheRealPomax wrote:
               | That's a tenuous argument at best, heavily relying on the
               | application in question not offering any way to export
               | your data, which many applications do come with. Just
               | because the file format is locked, doesn't mean the data
               | in it is locked. You just need the application to unlock
               | it.
        
               | samatman wrote:
               | This is self contradictory.
               | 
               | Listen to what you're saying: the data isn't locked, you
               | just need the... application, to... unlock it.
               | 
               | That's beyond tenuous, it's invalid.
        
             | cfn wrote:
             | I guess he/she meant that the extension that encrypts the
             | database is proprietary. I believe that was the only option
             | for a while but now there is an open source alternative.
        
               | yellowapple wrote:
               | Is the open-source alternative bidirectionally
               | interoperable with the closed-source version (i.e. it can
               | decrypt the proprietary encryption and the proprietary
               | version can decrypt the FOSS version)? If so, then that's
               | handy.
        
             | wrigby wrote:
             | It looks like the actual library to read/write an encrypted
             | SQLite DB is licensed separately from the open source
             | SQLite[1], so even if you have the encryption key, you
             | can't access the encrypted format:
             | 
             | 1: https://www.sqlite.org/see/doc/release/www/readme.wiki
        
         | simonw wrote:
         | Completely agree - I wrote about some fun I had with Apple
         | Photos last month:
         | https://simonwillison.net/2020/May/21/dogsheep-photos/
        
           | tomduncalf wrote:
           | Enjoyed this post a lot and had no idea apple use SQLite u
           | see the hood for a lot of stuff - thanks for sharing!
        
           | TheRealPomax wrote:
           | It opens up so many possibilities. Nice write-up!
        
         | me551ah wrote:
         | The only thing stopping SQLite from becoming the primary
         | application file format is browsers. Every single other client
         | supports SQLite. Embedded, Mobile, Lightroom, Desktop Apps and
         | Cross platform mobile frameworks have really good support for
         | SQLite.
        
           | HelloNurse wrote:
           | On the one hand, browsers dislike access to files (regardless
           | of format) for security reasons. On the other hand, browsers
           | use SQLite a lot. For example Firefox uses SQLite at least
           | for IndexedDB (various files) and for history and bookmarks
           | (places.sqlite): https://developer.mozilla.org/en-
           | US/docs/Mozilla/Firefox_Ope...
        
             | mbreese wrote:
             | Wasn't there an effort a few years back to try to create a
             | W3C standard for accessing data that stalled because all of
             | the browsers used SQLite and no one wanted to create a
             | different implementation? IIRC, the standard required two
             | different implementations and SQLite was used by everyone?
        
           | TheRealPomax wrote:
           | Primary application file format for... what kind of files?
           | Because it's certainly not going to (nor should it ever)
           | replace hypertext documents.
           | 
           | Being able to load a read-only .sqlite database might seem
           | cool, but I can't think of a single instance in which that's
           | smaller and/or more efficient than calling data endpoints
           | that use gzip/brotli for transport.
           | 
           | Or are you thinking "as replacement for IndexedDB"? In which
           | case, hard agree but then it's an actual database, not used
           | as file data container.
        
         | lbutler wrote:
         | I have an application that uses sqlite as their file format but
         | they've cleared the header so you can't open it directly.
         | 
         | I'd love to be able to make secondary applications like you've
         | described but being enterprise software they don't want to make
         | it too easy.
         | 
         | They obviously want to keep people locked in with their $40k
         | per seat application!
         | 
         | I guess the first step is figuring out the page size and other
         | bits the other meta data you set in the header [1].
         | 
         | I know I just have to sit down and understand the format better
         | and I will eventually figure it out...
         | 
         | [1] https://www.sqlite.org/fileformat.html
        
           | TheRealPomax wrote:
           | That's mean, but you'll just have to live with that. If
           | someone takes active steps towards preventing people from
           | making secondary apps/utils, then they're (a) jerks, and (b)
           | allowed to do that.
           | 
           | Might be worth actually contacting them to ask why, if you
           | can make the case that secondary applications will increase
           | the value of their app, not decrease it.
        
             | traverseda wrote:
             | >That's mean, but you'll just have to live with that
             | 
             | I mean I know we're all on board with the idea of
             | intellectual property actually being a thing now, but
             | surely there are limits? I've seen people take the hard-
             | line stance that if something is your property you should
             | be able to dictate _exactly_ under what situation it can be
             | used, but there have to be limits to IP holders rights on
             | some level, and I feel like reverse engineering a file
             | format is a pretty reasonable place to draw that line.
        
               | jtbayly wrote:
               | I didn't take "you'll just have to live with that" as a
               | claim that the user couldn't reverse engineer it. It
               | simply means that they've made it harder for the user to
               | do so, and so the user has to live with it being harder.
        
               | TheRealPomax wrote:
               | this.
        
               | jstanley wrote:
               | > I know we're all on board with the idea of intellectual
               | property actually being a thing now
               | 
               | I don't know that that's true at all! I'd say the
               | pendulum is swinging in the opposite direction.
        
               | matheusmoreira wrote:
               | > we're all on board with the idea of intellectual
               | property actually being a thing now
               | 
               | We most certainly are not. I personally believe that
               | intellectual property as a whole doesn't make sense in
               | the 21st century and should be abolished.
               | 
               | > there have to be limits to IP holders rights on some
               | level
               | 
               | There are. The laws generally recognize fair use and
               | reverse engineering for interoperability.
               | 
               | > I feel like reverse engineering a file format is a
               | pretty reasonable place to draw that line
               | 
               | Absolutely. Unfortunately, in the US it seems
               | corporations can force people to give up their rights by
               | making them agree to it. Therefore, "you must not reverse
               | engineer our software" is a standard clause in every
               | contract and it's not negotiable.
        
               | kstrauser wrote:
               | I'd be more on board with it if it were taxed like other
               | property. That patent's worth $1B, you say? Better let
               | Accounting know. Conversely: you're suing for $1B for a
               | patent violation but you only paid taxes on it being
               | worth $300? $300 max it is, then.
        
               | kgwgk wrote:
               | > taxed like other property.
               | 
               | Most property owned by companies is not taxed.
        
               | [deleted]
        
               | cortesoft wrote:
               | They aren't saying you have to live with the legal
               | restriction, but rather the technical restriction of
               | having to reverse engineer it without the headers
        
               | deathgrips wrote:
               | >we're all on board with the idea of intellectual
               | property actually being a thing now
               | 
               | No. Intellectual property is not genuine property. It is
               | a state granted monopoly and is antithetical to free
               | market principles.
        
             | hamburglar wrote:
             | I'm surprised that nobody's pointed out that there are
             | actually valid reasons other than greed to obscure your
             | file format. It's an implementation detail, not a contract.
             | If customers begin relying on the implementation details,
             | you end up with angry customers when you change the
             | implementation details. A SQLite db without the header is
             | basically a statement saying, "we are using the obvious
             | file format here for our convenience, not for general
             | purpose access. Screw around in here at your own risk."
             | 
             | If you modified their app's internal state db and screwed
             | it up because they have designed their software with
             | certain assumptions that aren't clear from just reading
             | their db schema, that would be a nightmare for them to
             | support. The easiest thing for them to do is just to try to
             | discourage tampering with their internal state.
             | 
             | This is especially true if there's a chance that a market
             | for secondary apps/utils will spring up. If that's to
             | happen and be viable, they absolutely would want to put
             | thought into what their supported interfaces are for those
             | apps/utils, otherwise they will end up painted into a
             | corner and unable to change their architecture without
             | destroying a marketplace.
        
             | lbutler wrote:
             | I can understand why they do it, it's enterprise software
             | so the more open you make it, then the less licences they
             | can sell.
             | 
             | They do have a module you can purchase to run API calls and
             | access their files/software but as you probably guessed
             | that's another $40k license!
             | 
             | Most of my apps I build use this API, but for me to provide
             | to other companies they need them to also buy the API
             | extension.
             | 
             | I'd love to cut out the middle man and I'll do it
             | eventually when I reverse engineer the header!
        
               | mrmonkeyman wrote:
               | I don't get this mentality. You haven't created this
               | system. You have no idea what it takes or else you would
               | have created it yourself. You don't like the price, but
               | you want it anyway. How is this, I mean come on,
               | different from stealing? You even say it: cutting out the
               | middle man. You want to remove their profits.
        
           | kccqzy wrote:
           | I might be wrong but I thought Dropbox used to do that as
           | well. All the sync state is stored in a sqlite file with the
           | header changed or removed.
        
           | freeopinion wrote:
           | At $40K/seat, that must be some special software. Simple
           | economics invites competition. Depending on the software, it
           | may not take too many seats to incent the funding of an
           | alternative.
           | 
           | Maybe this is the kind of software that requires huge
           | development costs. But maybe it would be worth 20 seats'
           | worth of customers joining forces to fund a team of 5 people
           | to build you a competing app tailored to your specific
           | needs/wants and completely under your control.
           | 
           | Granted, that could bump your costs from $800K/year to
           | $1.6M+/year. But only short-term. Once your software is
           | production-ready, you drop the costs of your current
           | software. So think of it more like going from $8M/10 years to
           | $6-10M/10 years but having complete control to add the
           | features you want. And perhaps having the opportunity to
           | recoup $millions/year by licensing to others. Or, open source
           | it and give others the same kind of control while benefiting
           | from the features they add. Spread your development costs
           | across more seats to further lower your $/seat.
           | 
           | Or, look at the 100 employees your vendor currently has and
           | lose heart, then hope somebody with deep pockets funds a
           | competitor.
        
             | lbutler wrote:
             | This particular software is for water utilities to model
             | and simulate their water and waste water networks.
             | 
             | It's mostly used so utilites can forecast growth in their
             | areas for the next 25+ years and see the impact on their
             | networks and feed into their capital work projects.
             | 
             | A decently sized utility may spend up to $200M/yr on
             | capital works so $40k isn't even a line item!
             | 
             | There is completion in the market but consultants are
             | forced to use what their clients pick and most utilites
             | aren't that price sensitive.
             | 
             | There are also open source alternatives by the EPA[1][2],
             | and most commercial operators are just wrappers around this
             | public domain software.
             | 
             | I'm trying to create FOSS to help view and run these
             | models.
             | 
             | [1] https://en.m.wikipedia.org/wiki/EPANET
             | 
             | [2] https://en.m.wikipedia.org/wiki/Storm_Water_Management_
             | Model
        
       | praveen9920 wrote:
       | The article has great points.
       | 
       | Main reason for existence of lot of file formats is that
       | enterprises don't want just about everyone access their files and
       | modifying them. It greatly reduces the usage of their proprietary
       | software hence their revenues.
       | 
       | I remember the days when open office trying to render doc file
       | but formatting used to suck big time.
       | 
       | Open source softwares should leverage this kind of file formats
       | for inter operating.
        
         | agumonkey wrote:
         | it's the old business paradox, if everything was totally lean
         | and solved, how do people earn a living ? you need a strong
         | force to overcome this hurdle and show how to reorganize the
         | system in the large
        
         | catalogia wrote:
         | > _Main reason for existence of lot of file formats is that
         | enterprises don 't want just about everyone access their files
         | and modifying them. It greatly reduces the usage of their
         | proprietary software hence their revenues._
         | 
         | I'm sure this does happen, but it seems more like MBA-paranoia
         | than a legitimate concern. For instance I sincerely doubt Adobe
         | has lost any revenue by using SQLite with lightroom, despite
         | various open source tools being able to interact with their
         | lrcat (sqlite) files.
        
       | catalogia wrote:
       | Is there any good way to use SQLite as an application file format
       | while also using it in WAL mode? They seem sort of mutually
       | exclusive; near as I can figure the way to do this would be to
       | execute `PRAGMA journal_mode=DELETE` when the user requests a
       | save file, then copy the DB to the location the user specifies. I
       | think this negates a few of the advantages of using sqlite as an
       | application file format though.
        
         | Redoubts wrote:
         | I think the backup api or VACUUM INTO are the preferred ways,
         | though you're gonna probably want to turn off wal on the
         | exported copy still.
        
       | GnarfGnarf wrote:
       | I just converted a 90's-vintage xBase app to SQLite. I just love
       | it. Should have done this years ago.
       | 
       | The only thing I miss is the ability to jump X% into an index.
       | Not a deal-breaker, though.
        
       | overcast wrote:
       | What happened to storing blobs in a database being giant rookie
       | mistake? I'm currently dealing with this at the office, where now
       | a SQLServer database has grown to unmanageable size from helpdesk
       | software attachments.
        
         | JulianWasTaken wrote:
         | It's less of a rookie mistake than it seems these days:
         | https://danluu.com/file-consistency/.
        
         | adrianmonk wrote:
         | I don't think there is a consensus that storing blobs in the
         | database is always the wrong decision. Yes, there are
         | drawbacks, but it really depends on the circumstances, and it
         | might be the right choice.
         | 
         | Suppose, for example, that you were going to make your own
         | implementation of Git. The normal implementation has a bunch of
         | blobs under one directory. There's not much use in manipulating
         | this directory separately from everything else under .git. The
         | blobs and most of the non-blob stuff are only useful together
         | anyway, so you already manage them as a single collection.
         | 
         | It could create problems for certain backup tools, so that's a
         | disadvantage. But it also simplifies application development.
         | So it's not a slam dunk one way or the other.
        
         | agumonkey wrote:
         | attaching oversized unreadable scanned pictures of digital text
         | files I guess
        
         | foldr wrote:
         | The suggestion here is to use it as a file format, not as a
         | file system.
        
         | GordonS wrote:
         | I think that refrain usually applies to when the database is
         | used to store relational data, rather than when it's being
         | requisitioned for use as a file format.
        
         | MaxBarraclough wrote:
         | I've never faced this problem so perhaps I'm missing something
         | obvious, but what's the big difference storing binary blobs on
         | the DB vs, say, storing URLs on the DB and hosting the blobs
         | elsewhere? Is the size of the database the main concern?
        
           | raphaelj wrote:
           | Both approaches have their advantages.
           | 
           | Storing blobs on the DB makes the data more consistent (the
           | database is always in sync with the file content), and allows
           | you to use advanced features of the DB on these files
           | (transactions, SQL queries, data rollback ...). You also only
           | have to backup the DB.
           | 
           | Storing links to the objects is usually more scalable, as
           | your DB will not grow as fast. DBs are usually harder to
           | scale, and also more expensive (at least 10x per GB).
           | 
           | It really depends on the project, but I'm favoring more and
           | more storing the data in BLOBs, as it makes backups easier
           | and as I can use SQL queries directly on the data. Databases
           | as a service also make it easy to scale DBs up to +/- 2TB.
           | But the cost might still be an issue.
        
             | ginko wrote:
             | >and also more expensive (at least 10x per GB).
             | 
             | Expensive in what way? Memory/Compute? It can't be
             | licensing money since SQLite is public domain.
        
               | raphaelj wrote:
               | I'm just comparing the per stored GB cost of on demand
               | database services vs on demand storage services.
        
               | Tajnymag wrote:
               | raphaelj is probably talking about classic database
               | machines like PostgreSQL and MariaDB.
               | 
               | In my view, storing blobs in sqlite doesn't have a huge
               | disadvantage tied to it. Sqlite grows pretty linearly and
               | as it stores the whole db in a single file, hosting
               | provider doesn't even have to know about you using it.
        
         | ojnabieoot wrote:
         | Did you read the article? This isn't about storing blobs as
         | data in SQLite, but rather using SQLite-formatted databases as
         | the file format for an application.
         | 
         | As a simple example, Word documents are just zipped XML text
         | files (try unzipping a .docx and looking inside). Instead of
         | using this, you could a SQLite .db file (probably with a
         | different extension), translating the XML files into tables,
         | and folders into databases. The OpenOffice case study has more
         | details: https://sqlite.org/affcase1.html
        
           | overcast wrote:
           | Apparently I'm misreading the following paragraph.
           | 
           | Any application state that can be recorded in a pile-of-files
           | can also be recorded in an SQLite database with a simple
           | key/value schema like this:                   CREATE TABLE
           | files(filename TEXT PRIMARY KEY, content BLOB);
        
             | roenxi wrote:
             | It is reasonable to adopt that style if it is data you
             | don't control. Storing blobs is probably a mistake if it is
             | data you have control over.
             | 
             | Sometimes all that is known about data is it exists - in
             | that case, into the database as a blob it shall go. If it
             | can be decomposed it probably should be.
        
             | ojnabieoot wrote:
             | I agree that is unclear for an introduction, and that the
             | article could have been written better. The "pile-of-files"
             | they are referencing isn't a filesystem of arbitrary and
             | unstructured binary data stored for humans to look at later
             | (like your Helpdesk attachments), but rather structured and
             | specific data for an application to read in a well-defined
             | manner.
             | 
             | Your Helpdesk example used SQLServer problematically
             | because a SQL database shouldn't be used as an arbitrary
             | file store. But if you know what the file structure is and
             | have a reasonable grasp for how it might scale (that each
             | binary blob is small, that each user only adds one row to
             | the database, etc), there are huge advantages to "a SQLite
             | table with lots of binary and text columns" versus "a
             | folder with lots of binary and text files." And if those
             | text files are just small key-value pairs then maybe they
             | should also go in SQLite.
        
           | wenc wrote:
           | The OpenOffice case study is a fascinating thought
           | experiment.
           | 
           | We don't really create too many new file formats these days,
           | and if we did they're highly performance specialized
           | (parquet, arrow).
           | 
           | Just wondering aloud, what recent file format would have
           | benefited from being based on sqlite?
        
         | TheRealPomax wrote:
         | That's the thing: this isn't "a database" anymore when it's
         | used as an application file format, it's now "an application
         | file format" that is conveniently also a sqlite db file so
         | everything in it is trivially accessible.
         | 
         | Storing huge blobs then becomes a matter of "is this my data,
         | or is this general data that I'm merely also making use of".
         | Examples abound of both:
         | 
         | - Adobe Lightroom (a non-destructive photo editor) goes with
         | "these are not my images" and leaves them out of its sqlite
         | "lrcat" files, instead maintaining a record of where files can
         | be found on-disk.
         | 
         | - Clip Studio (a graphics program tailored for digital artists)
         | on the other hand stores everything, because that's what you
         | want: even if you imported 20 images as layers, that data is
         | now strongly tied to a single project, and should absolutely
         | all be in a single file.
         | 
         | So the key point here is that, yes: sqlite files are database,
         | but because they're also single files on disk, their _use_
         | extends far beyond what a normal database allows. As
         | application file format, the rules for application files apply,
         | not the rules for standard databases.
        
         | HelloNurse wrote:
         | Storing blobs is good if the blobs are what you actually want
         | to store; the rookie mistake is storing blobs instead of the
         | well-structured, meaningful, indexable and much more compact
         | columns and additional tables that you should be storing but
         | you are too rookie to to design.
        
         | bob1029 wrote:
         | Storing your blobs outside of the database when they are an
         | inherent part of some larger transactional model is the giant
         | rookie mistake IMO. We did this for a little while for the
         | reasons of "oh my gosh look how big the .db is getting" and it
         | ended up a complete shitshow. Simple tasks like cleaning up old
         | entries can quickly turn into a nightmare if you have to go
         | around disk and make sure you are not leaving garbage behind.
         | There are also other concerns outside the box like VM
         | snapshots. This could leave you in a very inconsistent state
         | regarding what blob you have on disk vs what the database is
         | expecting.
         | 
         | Is having everything in one gigantic .db file an actual
         | downside? What makes a database's size "unmanageable"?
         | Presumably, you'd have to store that business information
         | somewhere anyways, right? I don't understand how 1 unified file
         | is unmanageable, but splitting that concern up into 2 different
         | domains magically makes it better.
        
           | jzoch wrote:
           | It can trash db performance. A relatively simple alternative
           | is to store references to immutable blobs in the db so that
           | everything remains transactionally consistent (and easily
           | restorable to any point-in-time). This is what many large
           | companies do. I happen to work at a company that runs the
           | largest + most "shall not be named" databases and has
           | struggled for years to evict all the blobs people shoved in
           | em.
        
             | aembleton wrote:
             | If you index on the file_name in a table where you have
             | file_name and a blob then why would it trash performance?
             | SQLite can quickly locate where to start reading the blob.
        
       | chrismorgan wrote:
       | How would SQLite go as for something like Audacity projects?
       | Audacity projects are a pile of files: an XML file that declares
       | the project metadata, tracks, channels, envelopes (an example of
       | non-destructive editing), and references to the zillions of audio
       | files that it drops in a directory beside the XML file (typically
       | less than 1MB each).
       | 
       | I presume it splits all the audio up into small files so that
       | most types of edits can only need to touch a small area.
       | 
       | If you directly ported that to SQLite, would it work fairly well,
       | or would you want to restructure it somehow? Things like
       | additions or deletions, would it need to write lots of extra data
       | to the disk (would it be doing something like defragmenting, or
       | would it grow larger than it should, or are there other tricks
       | that I don't know about to delete a chunk from the middle of a
       | file without needing to rewrite all of the file beyond that
       | point)?
        
         | catalogia wrote:
         | Adobe Lightroom seems comparable (non-destructive image
         | editing.) It uses a sqlite database that points to image files
         | on the filesystem. I think precisely the same approach could be
         | taken with an audio editing application.
         | 
         | You could also consider putting the audio files into the sqlite
         | db, which _might_ work alright. I 've heard of image thumbnails
         | being stored in sqlite dbs (maybe by Lightroom iirc?) though
         | those are probably smaller than your audio clips I'm guessing.
        
       | chrisbennet wrote:
       | I used it with a different extension to save application for a
       | product I wrote.
       | 
       | Any classes that need to be saved had serialize() and
       | deserialize() functions. Serialize before saving to SQLite and if
       | read in, deserialize after reading it from the DB.
        
       | 6gvONxR4sf7o wrote:
       | I used to advocate for this, then tried opening a sqlite file I'd
       | been using on another computer and it said it was corrupted. I
       | took it back to the original computer and it opened just fine. So
       | there was something going on that made it more complex and less
       | portable than something simple like json, which is super
       | frustrating because otherwise it's a really useful format for
       | datasets, and now I can't really trust it.
        
         | curiousmindz wrote:
         | If you use certain new features of SQLite, the database file
         | gets "upgraded". At that time, you must use a version of SQLite
         | that understands these new features.
         | 
         | So, the other computer you used was probably running an older
         | version of SQLite. Just update it to make it work.
        
           | bollockitis wrote:
           | That's a good catch. I'd there any way to tell which version
           | of SQLite created the file? Aside from trial and error, I
           | mean?
        
             | lbutler wrote:
             | The version is stored in byte 96 of the header of the
             | sqlite file[1].
             | 
             | I haven't looked but there will be some sqlite command to
             | query it and I'm sure some viewer tools will display it as
             | well.
             | 
             | [1] https://www.sqlite.org/fileformat.html
        
               | pjscott wrote:
               | You can print the version with this magical incantation:
               | $ python -c 'import sys, struct;
               | print(struct.unpack(">I", open(sys.argv[1],
               | "rb").read()[96:100])[0])' foo.db
        
             | banana_giraffe wrote:
             | Recentish versions of file can show you the version:
             | $ file example.db         example.db: SQLite 3.x database,
             | last written using SQLite version 3007017
        
       | teleforce wrote:
       | Shouldn't this has 2014 inside the title? The story has also
       | appeared several times in HN before.
       | 
       | Very interesting concept but now I think perhaps application file
       | format using TileDB will be much better since it can support
       | sparse data as well [1].
       | 
       | [1] https://github.com/TileDB-Inc/TileDB
        
         | HelloNurse wrote:
         | TileDB uses a "pile of files", as the article calls it, it
         | doesn't have SQL, and it has a rather embarrassing point of
         | view on race conditions (https://docs.tiledb.com/main/basic-
         | concepts/consistency). It could be a good choice for
         | applications that rely on large and simple multidimensional
         | arrays, hardly the same "market" as application file formats.
         | 
         | Moreover, database tables are a _very_ good fit for sparse
         | multidimensional arrays.
        
         | airstrike wrote:
         | That's not at all similar it doesn't support SQL
        
       | corysama wrote:
       | "Atomic transactions" is a feature needs _formal_ support in
       | random file formats way more often than people realize. Simply
       | writing to a file at all in an guaranteed-atomic way is much
       | harder than it looks. That guarantee becomes important when your
       | app gets widely distributed. If you have a million users of your
       | free mobile app, 1 in a million events happen every day. For
       | example: random hardware shutdown midway through a file write
       | operation. How is your app going to react when it reads back that
       | garbled file?
       | 
       | I've used SQLite on mobile apps to mitigate this problem. I've
       | used LMDB on a cloud app where the server was recording a lot of
       | data, but also rebooting unexpectedly. Would recommend. I've also
       | gone through the process of crafting an atomic file write routine
       | in C. https://danluu.com/file-consistency/ It was "fun" if your
       | idea of fun is responding to the error code of fclose(), but I
       | would not recommend...
        
         | jolmg wrote:
         | > For example: random hardware shutdown midway through a file
         | write operation. How is your app going to react when it reads
         | back that garbled file?
         | 
         | Don't filesystem journals ensure that you can't get a garbled
         | file from sudden shutdowns?
        
           | espadrine wrote:
           | > _Don 't filesystem journals ensure that you can't get a
           | garbled file from sudden shutdowns?_
           | 
           | For once, I have a blog post that goes in detail into why
           | that is not true!
           | 
           | (Or rather, not true unless you implement it in a very
           | sophisticated manner rarely used in practice.)
           | 
           | https://espadrine.github.io/blog/posts/file-system-object-
           | st...
        
           | SonOfLilit wrote:
           | They ensure you don't get a garbled filesystem.
           | 
           | They also expose an API that allows you, if you're very
           | careful and really know what you're doing (like danluu or the
           | SQLite author), to write performant code that won't garble
           | files on random shutdowns. But most programmers at most times
           | would rather just let the OS make smart decisions about
           | performance at the risk of garbling the file, or if they
           | really need Durability, just use a library that provides a
           | higher level API that takes care of it, like LMDB or an RDBMS
           | like SQLite.
           | 
           | To not get your file garbled, you need to use an API that
           | knows about legal vs. illegal states of the file. So either
           | the API gets a complete memory image of the file content at a
           | legal point in time and rewrites it, or it has to know more
           | about the file format than "it's a stream of bytes you can
           | read or write with random access".
           | 
           | Popular APIs to write files are either cursor based (usually
           | with buffering at the programming language standard library
           | level, I think, which takes control of Durability away from
           | the programmer) or memory mapped (which realllly takes
           | control of Durability from the programmer).
           | 
           | SQLite uses the cursor API and is very careful about buffer
           | flushing, enabling it to promise Durability. Also, to not
           | need to rewrite the whole file for each change, it does it's
           | own Journaling inside the file* - like most RDBMSs do.
           | 
           | * Well, it has a mode where it uses a more advanced technique
           | instead to achieve the same guarantees with better
           | performance
        
             | jolmg wrote:
             | > They ensure you don't get a garbled filesystem.
             | 
             | Well, they do that, but they also protect data to
             | reasonable degrees. For example, ext3/4's default
             | journaling mode "ordered" protects against corruption when
             | appending data or creating new files. It admittedly doesn't
             | protect when doing direct overwrites (journaling mode
             | "journal" does, however), but I'm pretty sure people
             | generally avoid doing direct overwrites anyway, and instead
             | write to a new file and rename over the old one.
             | 
             | I'm not sure if it would protect files that are clobbered
             | with O_TRUNC on opening (like when using > in the shell). I
             | would imagine that using O_TRUNC causes new blocks to be
             | used and so the old data isn't overwritten and it isn't
             | discarded because the old file metadata which would
             | identify the old blocks corresponding to the file would be
             | backed up in the journal.
             | 
             | > They also expose an API that allows you, if you're very
             | careful and really know what you're doing (like danluu or
             | the SQLite author), to write performant code that won't
             | garble files on random shutdowns.
             | 
             | As far as I see for the general case, being "very careful
             | and really knowing what you're doing" consists of just
             | avoiding direct overwrites. Of course, a single file that
             | persists data by the needs of software similar to a web
             | server (small updates to a big file in a long-running
             | process) is going to want the performance benefits of
             | direct overwrites. I can totally see SQLite needing special
             | care. However, I don't think those needs apply to all
             | applications.
        
               | labawi wrote:
               | > As far as I see for the general case, being "very
               | careful and really knowing what you're doing" consists of
               | just avoiding direct overwrites.
               | 
               | That's a dangerous thing to say. There are many ways to
               | mess up your data, without directly overwriting old data.
               | 
               | If you write a new file, close, then rename, on a typical
               | linux filesystem, mounted with reasonable options, on
               | compliant hardware, I think you should have either the
               | old or new version of the file on power loss, even if you
               | don't sync the proper things in proper order, but that's
               | only because of special handling of the common pattern.
               | See e.g. xfs 0 size file debacle.
               | 
               | Not an expert.
        
           | hansvm wrote:
           | Not really. If you have a file format that requires, e.g.,
           | changes to be done in two places then it's reasonable to
           | write to one place, have the system shut down never having
           | written to the second place, and now have a corrupt file.
           | 
           | The journal ensures (helps ensure?) that individual file
           | operations either happen or don't and can improve write
           | performance, but it can't possibly know that you need to
           | write, e.g., two separate 20TB streams to have a non-corrupt
           | file.
        
             | [deleted]
        
             | jolmg wrote:
             | For a single file, I thought that write operations were
             | committed when e.g. closing the file or doing fsync, but
             | now I'm not sure. I wonder if the system is free to commit
             | immediately after a write() ends.
             | 
             | Based on your scenario, if an application-level "change"
             | involves updating 2 files, interpreting the update of only
             | one file and not the other as a corruption, you're right
             | that filesystem journaling wouldn't suffice. However, in
             | that case it wouldn't be that a single file was corrupted.
             | 
             | Still, I wonder about the other case, about when the
             | filesystem decides to commit.
        
         | hedora wrote:
         | Posix (and I think windows) guarantee this atomically and
         | durably overwrites a file:
         | 
         | tempfile = mkstemp(filename-XXXX)
         | 
         | write(tempfile)
         | 
         | fsync(tempfile)
         | 
         | close(tempfile)
         | 
         | rename(tempfile, filename)
         | 
         | sync()
         | 
         | Assume the entire write failed if any of the above return an
         | error.
         | 
         | In some systems (nfs and ext3 come to mind), you can skip the
         | fsync and/or sync, but don't do that. It doesn't make things
         | significantly faster on the systems where it's safe, but it
         | definitely will lose data on other systems.
         | 
         | The only loophole I know of is that the final sync can fail,
         | then return anyway. If that happens, the file system is
         | probably hosed anyway.
        
           | corysama wrote:
           | Here's the article I couldn't find before on the process
           | https://lwn.net/Articles/457667/
           | 
           | Which links full source for the process you described
           | https://lwn.net/Articles/457672/
        
           | networkimprov wrote:
           | You need a recovery step on startup to retry the rename if
           | tempfile is complete, or delete it if it isn't.
           | 
           | That means you need a way to verify that tempfile is
           | complete. I do that by removing filename after completing
           | tempfile. And that requires a placeholder for filename if it
           | didn't already exist (e.g. a symlink to nowhwere).
           | 
           | On crash, rename may leave both files in place.
           | 
           | This technique doesn't work if you have hardlinks to filename
           | which should refer to the new file.
        
             | jonas21 wrote:
             | Regardless of whether the tempfile is complete, you can
             | just ignore (or delete) it on startup. From the caller's
             | perspective, the save operation doesn't succeed until the
             | rename is done and written to disk.
        
           | [deleted]
        
         | dfox wrote:
         | I second the recomendation of LMDB. With one important caveat:
         | under heavy write load it is perfect demonstration of
         | brokenness of semaphore implementation on freebsd and macos.
        
           | skrebbel wrote:
           | In what way is LMDB better than eg SQLite or Redis? For what
           | kinds of use cases would you recommend it?
        
             | dfox wrote:
             | For my applications: latency. The fact that writers do not
             | block readers is just a nice bonus.
        
               | kortex wrote:
               | Do you reckon LMDB would be reasonably performant
               | compared to raw mmaped files for zero-copy passing large
               | images between processes? I want something like a ring
               | buffer but mitigate the risk of use-after-free if the
               | consumer lags. Seems like lmdb automatically re-uses
               | memory that's long been freed but is sensible to detect
               | that an incoming write needs more space.
               | 
               | Python/c++.
               | 
               | I've looked into mmap and flock but it's messy and not
               | highly portable.
        
               | dfox wrote:
               | LMDB is in fact not much more than large mmaped file and
               | clever synchronization mechanism on top of that that also
               | serves as dictionary implementation and transaction
               | mechanism. My uses of LMDB are essentially replacements
               | for mmaped file and LMDB gives me the ability to sanely
               | do partial updates.
        
             | corysama wrote:
             | It's very simple. Single C file implementation. Binary blob
             | keys : binary blob values. The end. If all you need is a
             | bunch of blobs written and read back reliably while in a
             | statistically unreliable situation, LMDB is great.
        
               | namibj wrote:
               | Well, sqlite is also just a single .c with a single .h.
        
               | tlb wrote:
               | Sqlite is dozens of files. See
               | https://www3.sqlite.org/cgi/src/dir?ci=tip&name=src
        
               | samatman wrote:
               | You're both right... kinda.
               | 
               | SQLite is dozens of files, if you're browsing it, or
               | modifying it. Which you can do, as long as you're ok with
               | contributing your changes being impossible, since it's
               | open-source but closed-contribution.
               | 
               | If you're compiling it, it is, in fact, one .c and one
               | .h:
               | 
               | https://www.sqlite.org/amalgamation.html
               | 
               | If you're choosing between LMDB and SQLite, the latter
               | sense is the relevant one, so they're identical along
               | this particular axis.
        
       ___________________________________________________________________
       (page generated 2020-06-13 23:00 UTC)