[HN Gopher] SQLite as an Application File Format (2014) ___________________________________________________________________ SQLite as an Application File Format (2014) Author : afiori Score : 418 points Date : 2020-06-13 12:28 UTC (10 hours ago) (HTM) web link (sqlite.org) (TXT) w3m dump (sqlite.org) | twarge wrote: | Quickbooks for Mac uses sqlite for its file format and this | allowed users to query the books with any SQL browser! | | Alas, they encrypted the file starting two years ago. | hexmiles wrote: | are there any other "file format oriented" data storage system?? | I see software tend to use zip file, sqlite or good old ole | structured storage. | | In my job we used a fat filesystem as a storage system and | recently switched to sqlite, and while i love it i didn't really | find any alternative. | mstade wrote: | SQLite is amazing - trivial to embed in almost every context I've | tried, rock solid, incredibly flexible, performs well enough for | most use cases and for the others it probably still performs | better than the solution I had in mind. I wouldn't dream of using | anything else for application data, unless you _need_ something | text based for human readers as well, and can 't deal with having | people use a SQLite client. And even then, really consider | whether you _need_ this. | | With SQLite embeddable in websites thanks to wasm, and the | ability to create object URLs it's also pretty trivial to make | full blown (read-only) SPAs delivered as a single HTML file. This | last bit might seem crazy - and it kind of is - but if your | clients are all on a LAN in a corporate network so bandwidth and | latency aren't really an issue it makes a bit more sense. | | I love SQLite, hands down one of my favorite tools. | em500 wrote: | A lot of well known apps are exactly doing this, using a sqlite | db as the internal file format: Whatsapp, WeChat, Google Keep, | Apple iMessage, Notes and Calendar... | | Probably tons more if you care to dig around. | juskrey wrote: | Years pass, now decades, yet SQLite is still the most pleasant DB | to work with. | TedDoesntTalk wrote: | Agree, but it also side-steps some complex topics like | sharding, replication, high-availability, etc. I'm not | suggesting those topics should be addressed by SQLite; that | simplicity is an asset. | | Has anyone used SQLite remotely over a network? | polyrand wrote: | You can use rqlite for that: | | https://github.com/rqlite/rqlite | | You can have a single over-the-network database or a cluster | of replica dbs (through Raft consensus). | rlpb wrote: | Dqlite exists - it's SQLite with some of these feature added. | https://dqlite.io/ | allenu wrote: | I am working on a flash card app and specifically did not | want to have to create a SaaS to support cross-device data | sync, so have opted instead to create a journal of the data | mutations on each device and having those (i.e. it's CRDT) | uploaded to a simple file store (like Dropbox) for each | device to share. Each device stores its own local data using | SQLite, but does a merge on all the journaled commands to | stay up to date. | | So, basically I rolled my own sync solution that just happens | to use SQLite. It's worth noting that I started this project | by storing the data in my own format (JSON, at first) and | quickly realized it was growing too large and was taking too | long to serialize/deserialize. I'm very glad to have ended up | using SQLite instead because it is super easy to use and has | been reliable. | mjw1007 wrote: | When Subversion changed their working copy format to rely on | an SQLite database, I had a number of problems using working | copies on network file systems. | | No doubt this was an indication that the network file system | was incorrectly configured, but I think the fact that it | worked in practice with the file-based format and didn't with | the SQLite format is a strike against the idea of using | SQLite for what the user sees as saving a file. | m_sahaf wrote: | SQLite recommends against sharing it on NFS for multiple | access because locking is broken on NFS for all kinds of | files, not just SQLite. | | > But use caution: this locking mechanism might not work | correctly if the database file is kept on an NFS | filesystem. This is because fcntl() file locking is broken | on many NFS implementations. You should avoid putting | SQLite database files on NFS if multiple processes might | try to access the file at the same time. | | Source: https://www.sqlite.org/faq.html#q5 | catalogia wrote: | Also WAL mode isn't going to work on network filesytems. | mjw1007 wrote: | Fair enough, but that means if someone's following the | advice in the article ("It deserves your consideration as | the standard file format on your next application | design") they need to consider "will our users want to | use our files on network filesystems?". | HelloNurse wrote: | Working copies on network file systems are a horrible idea. | You _detected_ problems because you were using SQLite, but | are you sure you didn 't have silent corruption with older | Subversion versions? | trashburger wrote: | I think the main reason SQLite doesn't have them is because | it's not a goal for the project. SQLite would be monumentally | more complex if it supported networking, sharding, etc. If | you need those use cases then perhaps it's a good idea to | switch to a more scalable DBMS. My favorite is PostgreSQL. If | I'm building something which will have user-generated content | then I always pick Postgres. Otherwise, SQLite. | easytiger wrote: | It lives in your process. It has no network features. It's | not a DBMS. | | People have built layers over it though | gwbas1c wrote: | One minor point that needs to be raised: Using SQLite as your | application file format locks you into the SQLite source code. | | (And in most cases that's not a bad thing, it's free and open | source.) | | But, if you truly want an open file format, someone needs to be | able to independently write a program that can read your file | without relying on third party dependencies. This is why the | browser vendors decided not to put SQLite into the HTML and | JavaScript standards. | | It is an awesome product! I've worked with it for over a decade | and I'm a fan! | trevor-e wrote: | While we are on the topic of file formats, I stumbled across a | cool project the other day that aims to make parsing well-known | binary formats much easier: http://kaitai.io/ | | In high school I was obsessed with the game Starcraft. It came | with its own map editor that would save to its own proprietary | format. Some smart people came along and reverse engineered that | format and allowed us to do all kinds of neat things we weren't | supposed to. I see modding communities for games are more popular | than ever, and finding this project brought back lots of great | memories. | mdaniel wrote: | I'm super-on-board with declarative binary parsing schemes | (i.e. not lua or python) but my own attempt to use kaitai was | met with the same frustration that presumably caused PDF to be | missing from the format gallery list | (http://formats.kaitai.io/): jump or offset-based structures | | At the time, I tried using a stunt like defining one huge blob | that "eats" the main file, then reaching back into it as we | learned more, but it looks like somewhere along the way they | acquired a "substream" behavior (https://github.com/kaitai- | io/kaitai_struct_doc/blob/c53060f7...) so maybe it's worth | another look | thechao wrote: | Can you explain more? I deal with a shocking number of poorly | supported binary formats, and this tool looks awesome. | Buuuuttt... a lot of these formats are "headers with offsets | to structured data". They're intermixed logs, so multiple | incompatible bi art formats are freely mixed in the same | file. | zubairq wrote: | Yeah you can use SQLite as a file format. We use it to store our | project data at https://github.com/zubairq/pilot | polyrand wrote: | I think SQLite is underrated. | | With WAL mode + increasing the page cache you can get some | excellent concurrency, even if doing reads and writes at the same | time. | | With rqlite it's easy to make it a server database and have a | cluster of SQLite databases (https://github.com/rqlite/rqlite). | | I wouldn't try to create the new Instagram with it, but I think | it'd be capable enough for many apps that are built on top of | more complex DBs. | michaelbuckbee wrote: | IIRC one of the attempts to deliver on "Github but for Design | files" used SQLite as an app file format. | gjsman-1000 wrote: | I worked with an SQLite Database on a personal C#/WinForms app. | It was a bit complicated to get started, but afterwards, what a | handy trick in comparison to using something like the | discontinued SQL Server Compact. | | I did, however, "enhance" it a bit (or "proprietarize" it) by | encrypting it with a short password and an AES algorithm with | some uncommon settings. I never shipped it that way, but the | output file looked like any other proprietary app's file - a mess | of random symbols. | networkimprov wrote: | SQLite is a fine single-file application file format, but if you | want average users to be able to inspect and perhaps fix | application data when things go wrong, a tree of JSON-format | files is more friendly. | | True, this requires care to ensure that such files are updated | reliably, but that's not quite rocket science. | tonyedgecombe wrote: | Maybe. I've used XML as a file format for all my recent projects. | It's well supported in the .Net framework, I don't need any 3rd | party assemblies. Using SQL to write out an object graph would | involve more work. | 737maxtw wrote: | Depends on the size or structure of the data. | | If you're having to do things like updates over a large | dataset, SQLITE can be nice because of the performance boosts | from indexing. | simias wrote: | A tree I could understand since after all XML uses a tree-like | structure, but for a graph you'll need to store references | anyway so flattening and denormalizing it completely doesn't | sound like a lot more work to me. | | Also XML makes me weep and I'll never willingly opt to use it | if there's an alternative, but that's my own prejudice. | gwbas1c wrote: | It depends on how large and complicated your XML file is. | | Once your XML file hits a certain size, minor updates incur a | gigantic performance penalty because you have to write out the | entire XML file every time. | | Random writes in the middle of an XML file are impossible. For | example, if you were to change an attribute so that it's one | character longer, you still have to rewrite the remainder of | the file in order to shift everything over by one character. | | That's the main reason why SQLite is so popular for | applications. | | (I know that's from personal experience. I had to support an | old version of an application that constantly wrote out XML | while the new version that we hadn't shipped used SQLite. A | customer that made heavy use of the old version basically hit | the limits of XML but the version that used SQLite wasn't ready | for them yet.) | | But, I'm going to be honest here: We had some tables that only | had a few rows, so I moved those to XML files. QE really liked | it because it was easy to diagnose issues. | barking wrote: | Are we talking about good old fashioned CRUD applications here or | is this something different? | zzo38computer wrote: | They say that Microsoft Word, PowerPoint, etc are "fully custom | formats", but actually those formats are "wrapped pile-of-files | formats" (and 7-Zip can open them), although the contents of the | files inside are custom formats (and they seem to have failed to | consider that category). | | The application ID number in the SQLite header can be used to | identify application file formats. The application ID number is a | 32-bit number, and there have been a few different ways to handle | it; I have seen the use of hexadecimal and of ASCII; I used base | thirty-six, and I have then later seen the suggestion to use | RADIX-50. Additionally, there is a document about "defense | against dark arts" in case you need to load untrusted files. | | TeXnicard uses a SQLite database file (with application ID | 1778603844) for the card database file. The version control file | (optional, and not fully implemented yet) uses a custom format | (which is fully documented), and it does support atomic | transactions. It consists of a header followed by a sequence of | frames, which are key frames and delta frames. The header of the | version control file contains two pointers: one to the beginning | of the most recently committed key frame, and the other one to | the end of the most recently committed frame (whether a key frame | or a delta frame; if all frames are fully committed, this will be | equal to the length of the file). These pointers are written only | after the rest of the file is written; if it gets interrupted, | reads will ignore the partially written data, and further writes | will overwrite the partially written data. | | ZZ Zero uses a Hamster archive of custom (but documented) formats | as its world file format. (A Hamster archive is zero or more | "lumps" concatenated together. A lump consists of a null- | terminated ASCII filename, 32-bit PDP-endian data size (measured | in bytes), and then the data of that lump. The preceding text in | these parentheses is the full definition of the Hamster archive | format; you can use this to implement your own.) | | Free Hero Mesh uses a "pile-of-wrapped-pile-of-files" format. A | puzzle set consists of four files: .class (which stores class | definitions), .xclass (which stores pictures and sounds to be | used by the class definitions), .level (which stores levels), and | .solution (which stores solutions). The .class file is a plain | text file; the other three are Hamster archives. These are four | logically distinct parts of a puzzle set; this allows you to | split them apart, to create symlinks to share class definitions | with puzzle sets, to substitute your own graphics, to work with | multiple solution sets (e.g. per user), etc. If you need to do | more than that, then you can of course extract the lumps if | needed. For class definitions, you can just copy and paste the | text. | | MegaZeux used a fully custom format before, but now it uses a ZIP | archive with the stuff inside being custom formats (one of which | is the "MegaZeux Property List" format, which I have documented | in Just Solve The File Format wiki; the authors of MegaZeux did | not seem to document this format themself anywhere, so I figured | it out and did it by myself). | | For some cases, SQLite database is a good application file | format; other times, I think other formats (such as text formats) | may be better. It depends on the application. XML is too often | used for stuff that isn't text markup stuff, and XML is | especially bad for stuff that isn't text markup stuff, I think. | | If you use SQLite though, you will get more than just the | database access. It also gives you the string builder functions, | the sqlite3_mprintf function, a page cache implementation, memory | usage statistics, and a SQL interpreter; the SQL interpreter can | be used as one way to allow user customization and user queries | (including batch operations), without having to make an entirely | new scripting language to embed. | | They mention interfaces of SQLite are available for many other | programming languages, although at least one that doesn't seem to | have a interface to SQLite is PostScript (although you can use | %pipe%, it doesn't work so well especially since it is only a one | way pipe), and I am not sure if awk has it either. | chrismorgan wrote: | .doc, .ppt, &c. were fully custom. .docx, .pptx, &c. are ZIP | files. Remember that this was written in 2014, and many people | were still using the old formats then, because many people | didn't have Office 2007 or newer or software that could cope | with this new format. (Now I think very few people still use | the old Office formats.) | nathan_f77 wrote: | I've been reading through many of the other pages on the SQLite | website. I didn't realize that SQLite is developed by a company | (Hwaci) and funded by selling support, licenses, and testing | services. I had always assumed it was an open source project | backed by a non-profit, similar to the Apache Software | Foundation. | | This is another very interesting example of an open-source | business. I would be interested to learn more about how the Hwaci | company operates (revenue, number of employees, etc.). I find | this very interesting: | | > We are a 100% engineering company. There is no sales staff. Our | goal is to provide outstanding service and honest advice without | spin or sales-talk. | | They list some "$8K-50K/year" and "$85k" price tags directly on | their "Pro Support" webpage. These would usually be behind a | "Schedule a Call" or "Get Quote" button. I've been thinking about | doing something similar with my own on-premise licenses and | support contracts. I'm not very good at sales and I don't really | want to hire a sales team, so I'd be interested to know how this | worked out for them. | | I also liked this sentence, which is very similar to Basecamp's | philosophy (and both companies were started around the same time | - 1999 vs 2000): | | > Hwaci intends to continue operating in its current form, and at | roughly its current size until at least the year 2050. | | It's interesting to think that SQLite could have raised money and | grown into a billion-dollar public company with thousands of | employees. | | I'm going to listen to this Changelog interview with Richard Hipp | now [2], and also this talk on YouTube [3]. | | [1] https://sqlite.org/prosupport.html | | [2] https://changelog.com/201/ | | [3] https://www.youtube.com/watch?v=Jib2AmRb_rk | Varriount wrote: | As someone who has had to evaluate and recommended enterprise | software for purchasing, a straightforward price tag is going | to make me come your product in a better light. | | Plus, I have to wonder how much extra profit the "call us" | route actually takes in, after you've subtracted costs for the | marketing staff it requires (especially if you try to | renegotiate the cost after the subscription/license expires). | TheRealPomax wrote: | Also important to note is that enterprise pricing is way higher | than you'd think because when an enterprise needs a solution, | something like $1000 is literally too cheap to justify the cost | of procurement (which involves not just getting a license, but | also internal documentation, which in turn needs to be entered | into whatever process certification programme is in place, | etc.) | adamnemecek wrote: | I have been thinking about this as well lately. For my app | (http://ngrid.io) I'm using a custom built Entity-Component- | System (https://en.wikipedia.org/wiki/Entity_component_system) | based UI framework in Rust. | | In the ECS paradigm, everything is stored as struct-of-arrays as | opposed to array-of-structs | (https://en.wikipedia.org/wiki/AoS_and_SoA) and as a result, your | serialization becomes trivial. You are not chasing pointers to | other objects when serializing. Apple's Core Data takes the | object graph approach and it's a real clusterfuck. | | ECS is very similar to databases in that your data is normalized | and things are referenced by offsets rather than pointers | https://floooh.github.io/2018/06/17/handles-vs-pointers.html). | | Why roll your own thing as opposed to use say SQLite? If you use | SQLite, you will probably have some OOP on top of it which | introduces a serious impedance mismatch. With ECS, you cut that | whole layer out. | | I'm guessing that I'm losing on some atomicity but for my use | case that doesn't matter that much. | lxdesk wrote: | ECS is useful for real-time, but it means that you are rolling | your own query logic. This is fine for the runtime of game | engines because they generally aren't dealing with that many | data types(for the very most complex AAA games, perhaps a few | hundred components) and the queries used at runtime are simple | and tend to demand optimization at the data structure level. | | If your purpose is an editing tool you may want to reconsider. | Editing changes the goal in a very substantial way and the | complexity of your queries goes way up, which is where SQL | syntax absolutely shines. | TeMPOraL wrote: | FWIW, a game I'm writing uses in-memory SQLite as its ECS, | mostly because I wanted to see if I could. SQLite is | surprisingly performant. My game is not realtime, but I | tested it, and I could actually redraw the screen at 60+ FPS | while querying for render data every frame, and I still had | lots of frame time to spare. | garganzol wrote: | Everything old is new again. I can recall a similar movement at | the end of 90s regarding databases as application file formats. | That time it was Microsoft Jet Database engine. | rb808 wrote: | I keep switching between csv, parquet and sqlite. I dont need | another article to encourage me to switch again! :) | waynesonfire wrote: | What is an application file format? | detaro wrote: | The first section of the article explains it. | dragly wrote: | From experience, I can also recommend using SQLite as an | application file format. I landed on SQLite after looking for | solutions for a file format for an educational app we made for | simulating biological neural networks. The app is cross-platform, | written in Qt and the simulations needed to be stored as a JSON | describing the network, a thumbnail and some metadata. It was | also intended to be extensible with more features and backwards | compatible if new versions were released. I considered creating | our own simple format, using ZIP files, HDF5, Qt resource files | or SQLite. | | I landed on SQLite for many of the reasons outlined in this | article and in particular because of how easy it was to implement | and maintain. SQLite is supported natively in QtSql, which made | it extremely easy to write the save and load functions, and later | extend these with more data fields. In addition, we did not have | to worry about cross-platform support since this was covered by | SQLite and Qt already. | bob1029 wrote: | It sounds like future schema changes may be a potential concern | for your application. One thing you can look into using is the | SQLite user_version pragma. We use this right now to roll our | own migrators and it's light years better than how migrators | work for Entity Framework, et. al. | | https://www.sqlite.org/pragma.html#pragma_user_version | radarsat1 wrote: | Interesting, I used HDF5 in a similar situation because we | needed to save a lot of same-sized rows of data (simulation | time steps), so a matrix-oriented format seemed to make sense | but it wasn't entirely without some need for cross-referencing | between tables, so it does make me wonder now if sqlite would | have been a comparable or better choice. Any reason for | rejecting HDF5 in your case? | Nihilartikel wrote: | Recently I've been using SQLite to store state for a data-syncing | CLI utility in Python where I was considering using CSV or JSON | at first. The Python 'Dataset' library, | https://dataset.readthedocs.io/en/latest/ has taken almost all of | the friction out of using SQLite for trivial state persistence. | | I don't have to manually manage schema or create tables - they're | all lazily created from Dict keys on insertion, but indexes and | SQL can still be used as desired. | | This lends itself really well to Jupyter notebook assisted | development, where functions can be quickly and interactively | iterated without having to muck around with existing tables | whenever data changes shape. | | It's been a real productivity boost, and I've been looking around | for something similar to use in Clojure. | fatboy wrote: | We transitioned the file format for the images in our climbing- | topo app from a pile of files to SQLite (the images are tiled). | | Going from having high hundreds of tiny files on disk per topo | photo to just one was an incredible boon to productivity for | things like data backup and transferring files onto the device | for testing. | wetmore wrote: | Tell me more about your app | fatboy wrote: | It acts as a digital window into our print guidebook | catalogue. The books are mostly for the uk, with the rest | spread across Europe. | | I take the finished desktop-publishing documents and extract | and package the data to put into the app, including tiling | the images so we can have very high resolution photos on low- | end devices. | | There's a semi-technical article about the topo-view | implementation here: | | https://www.ukclimbing.com/articles/features/rockfax_app_dee. | .. | me551ah wrote: | Where can you use SQLite? | | Embedded Systems: Yes | | Raspberry Pi : Yes | | Mobile Apps. : Yes | | Desktop Apps : Yes | | Browsers : No | | Servers : Yes | | Supercomputers : Yes | gwbas1c wrote: | You used to be able to use it in a browser with Google Gears. | | The reason why it was never adopted was because the browser | makers wanted to be able to independently implement their own | database instead of everyone having to use the same source | code. | ris wrote: | Amongst all the praise we're seeing for this approach here, may | we please just have a minute's silence for the various exploits | sqlite has seen when loading arbitrary database files over the | years? | | My recent favourite being | https://media.ccc.de/v/36c3-10701-select_code_execution_from... | SQLite wrote: | This concern was just raised on the SQLite Forum (probably | after showing up here). See my reply at | https://sqlite.org/forum/forumpost/8beceed68e for additional | insights into the problem and recent SQLite enhancements to | address it. | ris wrote: | Absolutely, there are many things that can be done to | mitigate this kind of thing, I'm merely advising caution over | people immediately reaching for a solution where they're | executing thousands upon thousands of lines of code over | untrusted files when they can get away with something much | simpler. | zucker42 wrote: | Is there any way to mitigate code execution issues (besides | just fixing bugs)? Because allow code execution might be a | pretty big deal breaker in certain contexts. | empthought wrote: | This is a security flaw of the system or application using | SQLite, though. It is trusting data it should not trust and | executing code provided with that data. | roelschroeven wrote: | If I use libpng or libjpeg to decode an image and that causes | a buffer overrun or another security vulnerability, that is | very much a bug in libpng or libjpeg. | | The same is true when I open a database in SQLite: if that | causes a security problem, it's a bug in that library. I | don't even see how you could validate a database file before | you hand it over to SQLite. | empthought wrote: | RTFM | | https://www.sqlite.org/security.html | skrebbel wrote: | Huh? If you're using SQLite as an application file format, | isn't opening untrusted files a key use case? What else, | never share files with people? That must be a pretty boring | application. | empthought wrote: | Sure but just like a web browser, it's the application's | responsibility to sandbox or verify the code, not the http | library's. | skrebbel wrote: | That translates to "don't use SQLite", right? | | How would an application that uses SQLite as a file | format be able to scan for malicious database files that | trigger buffer overflows in the SQLite engine? I'm really | not sure what you're suggesting. | empthought wrote: | RTFM | | https://www.sqlite.org/security.html | speedgoose wrote: | From the description, I didn't watch the video, loading | the database and querying it is enough to run the | exploit. To compare with your example, it would be like | having a remote exploit if your application queries a | specific http endpoint. | | For sure application developers could sandbox the http | library, sqlite, or stop using libraries developed in so | unsafe programming languages but it's a bit too early for | that. | [deleted] | boudewijnrempt wrote: | I was surprised, but also intrigued when I discovered that Manga | Studio (Clip Studio these days) used sqlite for the native file | format, including the pixel planes. | | I wish my predecessor at Krita had made that choice, instead of | choosing to re-use the KOffice xml-based file format that's | basically a zip file with stuff in it. It would have made adding | the animation feature so much easier. | mkesper wrote: | So isn't it possible for Krita to create an updating process | and switch? Btw. it's really fantastic Krita is on Android now! | im3w1l wrote: | What's wrong with xml + zip? | sradman wrote: | One of the famous SQLite refrains is: | | > SQLite does not compete with client/server databases. | SQLite competes with fopen(). | | This undersells SQLite somewhat. Like Berkeley DB, SQLite was | created as an alternative to dbm [1] and one of the main use | cases is safe multi-process access to a single data file, | typically CLI apps (written in C, TCL, etc.). | | Client-Server databases tackle multi-user concurrency while | embedded databases often tackle multi-process concurrency. | | This article has long been part of the SQLite documentation | found under the "Advocacy" section. There is also a short | version. [2] | | [1] https://en.wikipedia.org/wiki/DBM_(computing) | | [2] https://sqlite.org/aff_short.html | boudewijnrempt wrote: | It's really hard to update in place, we basically have to | rewrite the whole file after a change, and, of course, we've | got to do a lot of things ourselves, like inventing an xml | format for describing the image. | derefr wrote: | XML-based formats are export formats, not state-keeping | formats. To use an XML-based format for storage, need to have | a separate, canonical _in-memory_ representation of the data, | which you then snapshot and serialize into XML upon request. | You may or may not be able to get away with serializing less | than your full in-memory object graph upon save, using | techniques similar to DOM reconciliation. Either way, you 'll | still likely need your entire document/project _represented_ | in memory. | | If you're working with something analogous to a text | document, this snapshot-and-serialize approach to saving | works fine. If you're working with other types of data, | though, this approach only works for trivial projects; once | your document exceeds ~100MB, the overhead of | snapshotting+serializing your object graph becomes bad enough | that people stop saving very often (dangerous!), and it also | makes the saving process itself more fragile (since the | longer a save takes, the more likely it becomes that the | process might be killed by some natural event like a power | cut during it+.) | | And, once your project size exceeds the average computer's | memory capacity, an in-memory canonical representation | quickly becomes untenable. You start to have to resort to | hacks like forcing the user to "partition" their project, | only allowing the user to work with one pieces at a time. | | With an applicaton store-keeping format, you have none of | these concerns; the store is itself the canonical data | location. You don't have a canonical in-memory representation | of the data; the in-memory representation is simply a write- | through or write-back caching layer for the object graph on | disk, and the cache can be flushed at any time. Or you may | not have a cache at all; many systems that use SQLite as a | file-format just do SQL queries directly whenever they want | to know something, never instantiating any intermediate in- | memory representation of the data itself, only retrieving | "reports" built from it. | | + You can fix fragile saving with a WAL log, but now the WAL | log is your true application state-keeping format, with the | XML format just being a convenient rollup representation of | it. | geokon wrote: | I've never had my problem scale to the size that required a | database/SQL, but I don't quite get the advantage of your | solution. Having all your interactions with data have to go | to disk though a cache muddles things b/c it makes it much | harder to reason about performance (b/c when do you have a | cache miss? and how do you configure a cache properly?) You | introduce a lot more blackmagic variables to reason about. | | If you're editing images I'd think it'd just makes more | sense to have all of your stuff in RAM and then a saving- | to-disk is done on a separate thread. I don't quite get why | the users would stop saving in this example. | | I'm not saying you're wrong - but more asking for some more | details b/c I've never imagined using a DB on data that can | fit in RAM | HelloNurse wrote: | It's primarily a problem of inflexibility handicapping | performance, not of "cache misses" and clever algorithms. | | For example, imagine a word processing program opening a | document and showing you the first page: you could load | 50MB of kitchen sink XML and 250 embedded images from a | zip file and then start doing something with the | resulting canonical representation, or you could load the | bare minimum of metadata (e.g. page size) from the | appropriate tables and the content that goes in the first | page from carefully indexed tables of objects. Which | variant is likely to load faster? Which one is guaranteed | to load useless data? Which one can save the document | more quickly and efficiently (one paragraph instead of a | whole document or a messy update log) when you edit text? | catalogia wrote: | > _it also makes the saving process itself more fragile | (since the longer a save takes, the more likely it becomes | that the process might be killed by some natural event like | a power cut during it+.)_ | | This is one I take very seriously, after I got bit by it. I | was saving state by writing s-expressions to a text file; | it seemed a reasonable enough thing to do even with tens of | megabytes of it, until my laptop turned off in the middle | of a write. After recovering from a backup and losing | several hours of work in the process, I switched to SQLite | that evening. | 7thaccount wrote: | I don't personally know anything wrong with it, but SQL | queries are powerful powerful powerful, so although XML might | not have any huge issues, in my personal opinion it is still | sub-optimal compared to what is possible with SQL (at least | for my uses, which could be very different to yours so take | with a grain of salt). In particular, SQLite is a tiny | executable with a HUGE amount of tests and is used very | widely, so reliable for a lot of tasks as long as the data | size stays relatively small and you caste your types | properly. | | SQLite also has a lot of tutorials and books and language | support. I've used it with Python, C#, Perl, TCL, and | Powershell with no issues. You can access it via the command- | line or you can hook into it with a fully graphical SQL IDE | like DB-Visualizer (I really recommend using an IDE for | interactive SQL use). If your language doesn't have built-in | or library support, even a novice programmer like myself can | roll a few functions together to build the tables, update, | delete, and run queries to analyze the data if you can run | some system commands. It's a wonderful little technology that | I feel comfortable reaching to when I need it. | | One thing I've shied away from over the years are | technologies which require running complex installers as it | makes things more confusing and makes it harder for me to | share with colleagues that aren't as interested in | programming. Both SQLite and DB-Visualizer require zero | installation. I just put each in a folder and then run the | executable. This is really easy to use to me and easy to get | others started too. Note that this is not commercial | software, but internal business apps to help people do | complex tasks easier. So you have a script that does some | data processing pushes that data to SQLite and then the user | can bring up DB-VISUALIZER, point it to the little SQLite .db | file I created and then get to work. We have a lot of little | apps like this and since most of our engineers are really | good with SQL, they can do whatever they need efficiently. | chrisseaton wrote: | Seems very hard to seek in XML? And how would you update it | in place? Also seems like it would be massive? | boudewijnrempt wrote: | Well, we're not storing pixels in xml, but now we've got | two xml files, one with author data, one descibing the | image and a folder tree containing various kinds of binary | data, all in a zip file. And yes, updating in place is a | bother. | airstrike wrote: | Do people really pick XML as a go-to format these days? | Probably easier to list what's not wrong with it | boudewijnrempt wrote: | Well, these days 22 years have passed since the original | file format was designed -- and not by me, though I | probably would have done the same thing, 22 years ago. | | Also, it's zip + xml files + binary files, not all xml. | airstrike wrote: | Well, I was responding to the "what's wrong with xml + | zip" question, not your remark on your predecessor | wodenokoto wrote: | To my very limited knowledge it is very common for iOS and | android apps to store data in SQLite. | | Is it possible to extract or change these databases? | | I have a few (offline) apps on my phone that I'd love to append | data to | SonOfLilit wrote: | It's possible on Android if you have root access, I think. | Android is very serious about apps not accessing each other's | data. | kccqzy wrote: | For unjailbroken iOS the only way to extract these databases is | to make an unencrypted device backup in iTunes (Finder in | Catalina) and then look at the backup file. The backed up files | won't be conveniently named but you can still find it. | wodenokoto wrote: | Can I push my edits back onto my iPhone or is that a pipe | dream? | muxator wrote: | As a sibling comment says, if you are on Android and have root | access, it is very easy. | | Just have a look at /data/data/appname/. | | For example, this is what I copy to make backups of my contacts | on my Android phone: | | > # pwd | | > /data/data/com.android.providers.contacts/databases | | > | | > # ls -1 | | > calllog.db | | > contacts2.db | | > profile.db | | This is a simple way for moving data around, restoring | applications, performing backups, editing your data (if you | know your way inside the app). | | Beware of the selinux labels if you are moving files across | different devices, as recent android version now run with | selinux in enforcing mode. | | Just be sure to adjust them (useful commands if you need a | reference: ls -lZ, chcon, semanage). | booleandilemma wrote: | SQLite is a database that tries so hard to not be a database :) | ori_b wrote: | Please no. This is an incredibly complicated file format that is | not particularly well suited for file saving. It's also very slow | ((edit: correction) ~10 transactions per second) if you aim for | integrity, or unsafe if you tune for performance (~50k | transactions per second, but if your program or computer dies | half way, your file is hosed). So, keeping files up to date in | place makes your ui janky. | | You can work around it by working in memory and writing out a | whole new database from in memory structures on save and then do | the atomic rename. But if you do that, you are probably better | off with json, protobuf, or similar. The libraries around these | formats are similarly battle tested, but they fit the needs | better, supporting working in ram fully and then saving cleanly | and easily. | kbumsik wrote: | > It's also very slow (~100 transactions per second) > But if | you do that, you are probably better off with json, protobuf, | or similar. | | Are you talking about a server application? | | > or unsafe if you find for performance (you can get it up to, | IIRC, ~50k transactions per second, but if your program or | computer dies half way, your file is hosed) | | Never heard that SQLite has unsafe operations. Any source? | ori_b wrote: | > _Are you talking about a server application?_ | | I'm talking about doing a handful of transactions -- even a | single one, now that I looked at the actual numbers that | sqlite discusses -- being enough to introduce user-visible | jank. | | > _Never heard that SQLite has unsafe operations. Any | source?_ | | The sqlite docs. You can improve the performance of sqlite by | multiple orders of magnitude by messing with things like | https://www.sqlite.org/pragma.html#pragma_synchronous, at the | cost of safe atomic updates. | kbumsik wrote: | > The sqlite docs. You can improve the performance of | sqlite by multiple orders of magnitude by messing with | things like | https://www.sqlite.org/pragma.html#pragma_synchronous, at | the cost of safe atomic updates. | | Thanks for the link, but only when the OS or your computer | crashes, not even the application itself. Could you | introduce any application file format that is safe in that | case? Obviously your JSON file format is much more unsafe | than that, not to mention the huge overhead of JSON with | binary data. | ori_b wrote: | > _Thanks for the link, but only when the OS or your | computer crashes, not even the application itself._ | | Ah, right there are also the journalling knobs that you'd | tweak to get to max performance -- either in-memory or no | journalling at all make a big impact. There are a bunch | of them scattered throughout the docs. | | > _Could you introduce any application file format that | is safe in that case? Obviously your JSON file format is | much more unsafe than that, not to mention the huge | overhead of JSON with binary data._ | | Yes, any format is safe if you write to a temporary file, | and then rename(2) it to replace it. This is guaranteed | to be atomic, so it works with anything (including | sqlite, though at that point all sqlite does is cost you | performance and complexity). | tzs wrote: | > Please no. This is an incredibly complicated file format that | is not particularly well suited for file saving. It's also very | slow (~100 transactions per second) if you aim for integrity, | or unsafe if you don't. | | The kind of application files they are talking about (things | like word processor documents, spreadsheets, drawings, source | code control system data) would only be writing sporadically. | During one of those sporadic writes they might need to update | thousands of rows but those could all be done in one | transaction. | ori_b wrote: | I made a mistake -- it's an order of magnitude slower. That | means a single transaction is within the threshold for human | perception. If you block the UI thread on a transaction, | you're now dropping 6 frames at 60fps. If you block other | operations on the transaction completing, you end up with | lag. | | Reference: https://www.sqlite.org/faq.html, question 19. | (I've seen similar when testing on SSDs locally). | lalaland1125 wrote: | Why would you block the UI thread on a transaction? | ori_b wrote: | Because you've got another operation triggered by a UI | action, and you need the previous one to finish, for | example. | lostmyoldone wrote: | If you have an FPS or interactivity target that needs to be | guaranteed to be better than about 0.1-1 second - depending | on platform - you can't have any disk IO at all on the UI | thread, or any thread that's supposed to react at | interactive rates at all times. | | It has generally become much better in the last decade or | two, but one should still expect most OS's to _sometimes_ | pause for excessive amounts of time on disk IO unless the | API is specifically guaranteed to never pause. Even then | one would be wise to measure /log deviations if it's | critical for the application. OS guarantees might also be | contingent on driver/subsystem guarantees, and bad drivers | might sometimes affect what seems completely unrelated | upstream systems. | ori_b wrote: | > _If you have an FPS or interactivity target that needs | to be guaranteed to be better than about 0.1-1 second - | depending on platform - you can 't have any disk IO at | all on the UI thread, or any thread that's supposed to | react at interactive rates at all times._ | | Yes, and other file formats encourage doing things in | memory, so you don't have any disk i/o in the common | path. | | Using sqlite as a file format strongly discourages the | simple, jank-free until you press save workflow of | slurping your content, operating on it in memory, and | then outputting it all in one operation as a response to | an explicit user action. Instead, your whole application | gets small but perceptible delays across all operations | and interactions. | lxdesk wrote: | That conflates the use of the schema with the use of the | format features. You can, in fact, use your custom in- | memory structures and then "slurp it out" to BLOB when | you're ready to save. | ori_b wrote: | Yes. But if you go that way, though, sqlite is just a | slow and clunky format compared to, eg, json or protobuf. | I doubt that any program out there uses sqlite this way, | other than those which realize they painted themselves | into a performance corner and hacked around it for | compatibility. | em500 wrote: | Your concerns seem pretty theoretical. Several _very_ | widely used applications (Whatsapp, iMessage, Apple Notes | /Calendar) use sqlite as a file format without much | issue. | SQLite wrote: | Storing content in SQLite is actually _faster_ than writing the | equivalent content directly to disk, in many situations. See | https://www.sqlite.org/fasterthanfs.html for discussion, | caveats, and links to source code where you can verify these | claims for yourself. | renewiltord wrote: | GeoPKG is an open format that is just a SQLite schema. It's cool! | tracker1 wrote: | Concurrent access to SQLite is the main reason I haven't really | looked into Deno. For that matter, I really think that | WebAssembly file system implementations should have enough | flexibility to be able to work with a SQLite database file in the | host OS. | | The needs for concurrent SQLite pretty much covers the need for a | robust file system. | pvg wrote: | Previously: https://news.ycombinator.com/item?id=18754634 | rlpb wrote: | Would such an app hold the sqlite database open while the user | has the document "open", and live-write user changes back to the | database immediately? Or would it follow the traditional model of | the user choosing to "Save"? | | I worked on an app that did the former many years ago (to an | Access database, not sqlite), and it did not go well because this | broke user expectations on the usual "open/save/save as" model. | catalogia wrote: | I think the former is preferable and most modern users will | prefer it. However if your users are upset about it, you could | do the later, or you could do the former and give them a | dummy/placebo "save" option that does nothing, or maybe VACUUMs | the DB. | pmilot wrote: | The save model at the UX level is completely orthogonal to the | application file format. You could implement either model with | an SQLite DB file. | EvilTerran wrote: | I guess you could send changes to the DB on-the-fly inside a | transaction, so when the user clicks "save" it's just a matter | of running COMMIT? Not sure what effect that would have on | performance as the transaction grows, though. | mkl wrote: | I think it would depend on the app (e.g. how document-like it | is), but nowadays people will be much more used to the instant- | save model due to things like Google Docs, OneNote, and phone | apps. | bstar77 wrote: | Is anyone doing this in Javascript (via electron) and encrypting | the file in such a way that the data is protected and not an | obvious sqlite db? | | I have some ideas on how to do this, but I'm curious if there's a | "preferred" way to do it. | gregmac wrote: | Why would you want to "protect" the data in this way? It's | user-hostile, and anyone determined enough will be able to get | at it no matter what you do. | | If it's popular and valuable enough, instructions and/or code | to break it automatically will then be published, regardless of | how much money or time you invested into the protection. (For | proof, look no further than the game industry's DRM over the | past 30 or so years) | bstar77 wrote: | It's not meant to be user hostile, it's for a game. I allow | data exports, but I don't want to expose the inner workings | of the game via direct access to the DB. | lxdesk wrote: | Data mining is an old gamedev worry. Nobody who's been | doing it for a while particularly cares. If the game is | remotely popular, data mining will happen. Disassembly to | find the decrypt function is a talent possessed by many, | many people who have the reverse-engineering bug. If you | then obfuscate the code, you have made things interesting | and then increasingly talented people will try to have a go | at it. If you combine a changing obfuscating technique with | updates, you can slow down community possession of the game | with respect to modding etc., and Minecraft worked that way | in its beta phases. But it's really all a question of your | purpose at that point. | | If you haven't already, I suggest cruising through TCRF: | https://tcrf.net/The_Cutting_Room_Floor | Vendan wrote: | You do realize, exposing the "inner workings of the game" | seems to be at least somewhat responsible for massively | increasing the appeal of the game in other cases? see | Minecraft, Rimworld, KSP as a few examples... | serf wrote: | ... see Minecraft, Rimworld, KSP as a few examples... | | I mean, I agree that games should be moddable, but not | from the premise of game profitability and popularity. | | The highest-grossing video game franchise list is over- | ran by products that are , pretty much, famously | unmoddable.[0] | | [0]: https://en.wikipedia.org/wiki/List_of_highest- | grossing_video... | TeMPOraL wrote: | That list has some old and long-running franchises (e.g. | Mario), some semi-interactive movie experiences (e.g. | CoD), and some games that have in fact been modded to | ridiculous extent (Warcraft, StarCraft, The Sims). I | wouldn't read too much from it. | pilif wrote: | SQLite has encryption plugin support, or you could encrypt the | file yourself at rest. However, no matter what you are doing, | as the application needs to be able to decrypt the file, you | have to ship the key and thus the protection will be able to be | circumvented. | | It's really not worth the effort IMHO | bstar77 wrote: | That's a good point... I don't want to have the game rely on | "phoning home" to decrypt the db. | geff82 wrote: | Haha!! I thought it was only me using sqlite as my go-to file | format in apps. | TheRealPomax wrote: | The great thing about applications that use sqlite as their | application file format is that you can write secondary apps and | utilities to supplement the applications themselves. | | For example, Adobe's Lightroom uses sqlite as their application | file format, which means that it's almost trivial to write an | application to help it do things it's either really | bad/exceedingly slow at (like removing images from a catalogue | that no longer exist on disk) or literally can't do (like | generating a playlist of files-on-disk for a specific collection | or tag), without being locked into whatever janky scripting | solution exists in-app. If there even is one to begin with. | | All you need is a programming language with a sqlite connector, | and you're in the driving seat. And sure, you'll need to figure | out the table schemas you need to care about, but sqlite _comes | with sqldiff_ so it 's really easy to figure out which operations | hit which tables with only a few minutes of work. | | Good luck reverse engineering proprietary file formats in the | same way! | TheRealPomax wrote: | Heck, I wrote https://github.com/Pomax/lightroom-catalog-helper | exactly because lrcat is an open format that takes barely any | time to figure out, which means the only limit to what you can | do with "lightroom" is what your programming skills allow for. | mongol wrote: | I tried to do that with a Firefox SQLite database files but it | had an exclusive lock. | TheRealPomax wrote: | This is the case for any application that constantly writes | to their dbs, so you usually run your own application/utility | on them when the "parent" application isn't running. | simonw wrote: | You can usually work around that by creating a copy - I run | "cp /path/to/locked.db /tmp" all the time for this. | pjscott wrote: | Make sure to also copy locked.db-wal if it's present. (Is | there a better way to do all this? I feel like there really | ought to be a better way.) | petters wrote: | It's very annoying when the applications use sqlite with | encryption. That is not open source, so regular language | bindings can not be used. | TheRealPomax wrote: | I'm not sure how slqite as your application file format has | anything to do with open source, honestly. It's certainly | super easy for open source projects to use, but a closed | source proprietary application using an encrypted sqlite file | sounds perfectly sensible to me? | | We all win when folks decide to leave it accessible, but I'm | not going to hold "encrypting a file format so that people | can't easily reverse engineer it" against folks who are | trying to sell software. | labawi wrote: | > I'm not going to hold "encrypting a file format so that | people can't easily reverse engineer it" against folks who | are trying to sell software. | | I think I'd mostly disagree. Selling an application is one | thing, but the data itself is usually customers' and | holding their data hostage is not a proper thing to do. | TheRealPomax wrote: | That's a tenuous argument at best, heavily relying on the | application in question not offering any way to export | your data, which many applications do come with. Just | because the file format is locked, doesn't mean the data | in it is locked. You just need the application to unlock | it. | samatman wrote: | This is self contradictory. | | Listen to what you're saying: the data isn't locked, you | just need the... application, to... unlock it. | | That's beyond tenuous, it's invalid. | cfn wrote: | I guess he/she meant that the extension that encrypts the | database is proprietary. I believe that was the only option | for a while but now there is an open source alternative. | yellowapple wrote: | Is the open-source alternative bidirectionally | interoperable with the closed-source version (i.e. it can | decrypt the proprietary encryption and the proprietary | version can decrypt the FOSS version)? If so, then that's | handy. | wrigby wrote: | It looks like the actual library to read/write an encrypted | SQLite DB is licensed separately from the open source | SQLite[1], so even if you have the encryption key, you | can't access the encrypted format: | | 1: https://www.sqlite.org/see/doc/release/www/readme.wiki | simonw wrote: | Completely agree - I wrote about some fun I had with Apple | Photos last month: | https://simonwillison.net/2020/May/21/dogsheep-photos/ | tomduncalf wrote: | Enjoyed this post a lot and had no idea apple use SQLite u | see the hood for a lot of stuff - thanks for sharing! | TheRealPomax wrote: | It opens up so many possibilities. Nice write-up! | me551ah wrote: | The only thing stopping SQLite from becoming the primary | application file format is browsers. Every single other client | supports SQLite. Embedded, Mobile, Lightroom, Desktop Apps and | Cross platform mobile frameworks have really good support for | SQLite. | HelloNurse wrote: | On the one hand, browsers dislike access to files (regardless | of format) for security reasons. On the other hand, browsers | use SQLite a lot. For example Firefox uses SQLite at least | for IndexedDB (various files) and for history and bookmarks | (places.sqlite): https://developer.mozilla.org/en- | US/docs/Mozilla/Firefox_Ope... | mbreese wrote: | Wasn't there an effort a few years back to try to create a | W3C standard for accessing data that stalled because all of | the browsers used SQLite and no one wanted to create a | different implementation? IIRC, the standard required two | different implementations and SQLite was used by everyone? | TheRealPomax wrote: | Primary application file format for... what kind of files? | Because it's certainly not going to (nor should it ever) | replace hypertext documents. | | Being able to load a read-only .sqlite database might seem | cool, but I can't think of a single instance in which that's | smaller and/or more efficient than calling data endpoints | that use gzip/brotli for transport. | | Or are you thinking "as replacement for IndexedDB"? In which | case, hard agree but then it's an actual database, not used | as file data container. | lbutler wrote: | I have an application that uses sqlite as their file format but | they've cleared the header so you can't open it directly. | | I'd love to be able to make secondary applications like you've | described but being enterprise software they don't want to make | it too easy. | | They obviously want to keep people locked in with their $40k | per seat application! | | I guess the first step is figuring out the page size and other | bits the other meta data you set in the header [1]. | | I know I just have to sit down and understand the format better | and I will eventually figure it out... | | [1] https://www.sqlite.org/fileformat.html | TheRealPomax wrote: | That's mean, but you'll just have to live with that. If | someone takes active steps towards preventing people from | making secondary apps/utils, then they're (a) jerks, and (b) | allowed to do that. | | Might be worth actually contacting them to ask why, if you | can make the case that secondary applications will increase | the value of their app, not decrease it. | traverseda wrote: | >That's mean, but you'll just have to live with that | | I mean I know we're all on board with the idea of | intellectual property actually being a thing now, but | surely there are limits? I've seen people take the hard- | line stance that if something is your property you should | be able to dictate _exactly_ under what situation it can be | used, but there have to be limits to IP holders rights on | some level, and I feel like reverse engineering a file | format is a pretty reasonable place to draw that line. | jtbayly wrote: | I didn't take "you'll just have to live with that" as a | claim that the user couldn't reverse engineer it. It | simply means that they've made it harder for the user to | do so, and so the user has to live with it being harder. | TheRealPomax wrote: | this. | jstanley wrote: | > I know we're all on board with the idea of intellectual | property actually being a thing now | | I don't know that that's true at all! I'd say the | pendulum is swinging in the opposite direction. | matheusmoreira wrote: | > we're all on board with the idea of intellectual | property actually being a thing now | | We most certainly are not. I personally believe that | intellectual property as a whole doesn't make sense in | the 21st century and should be abolished. | | > there have to be limits to IP holders rights on some | level | | There are. The laws generally recognize fair use and | reverse engineering for interoperability. | | > I feel like reverse engineering a file format is a | pretty reasonable place to draw that line | | Absolutely. Unfortunately, in the US it seems | corporations can force people to give up their rights by | making them agree to it. Therefore, "you must not reverse | engineer our software" is a standard clause in every | contract and it's not negotiable. | kstrauser wrote: | I'd be more on board with it if it were taxed like other | property. That patent's worth $1B, you say? Better let | Accounting know. Conversely: you're suing for $1B for a | patent violation but you only paid taxes on it being | worth $300? $300 max it is, then. | kgwgk wrote: | > taxed like other property. | | Most property owned by companies is not taxed. | [deleted] | cortesoft wrote: | They aren't saying you have to live with the legal | restriction, but rather the technical restriction of | having to reverse engineer it without the headers | deathgrips wrote: | >we're all on board with the idea of intellectual | property actually being a thing now | | No. Intellectual property is not genuine property. It is | a state granted monopoly and is antithetical to free | market principles. | hamburglar wrote: | I'm surprised that nobody's pointed out that there are | actually valid reasons other than greed to obscure your | file format. It's an implementation detail, not a contract. | If customers begin relying on the implementation details, | you end up with angry customers when you change the | implementation details. A SQLite db without the header is | basically a statement saying, "we are using the obvious | file format here for our convenience, not for general | purpose access. Screw around in here at your own risk." | | If you modified their app's internal state db and screwed | it up because they have designed their software with | certain assumptions that aren't clear from just reading | their db schema, that would be a nightmare for them to | support. The easiest thing for them to do is just to try to | discourage tampering with their internal state. | | This is especially true if there's a chance that a market | for secondary apps/utils will spring up. If that's to | happen and be viable, they absolutely would want to put | thought into what their supported interfaces are for those | apps/utils, otherwise they will end up painted into a | corner and unable to change their architecture without | destroying a marketplace. | lbutler wrote: | I can understand why they do it, it's enterprise software | so the more open you make it, then the less licences they | can sell. | | They do have a module you can purchase to run API calls and | access their files/software but as you probably guessed | that's another $40k license! | | Most of my apps I build use this API, but for me to provide | to other companies they need them to also buy the API | extension. | | I'd love to cut out the middle man and I'll do it | eventually when I reverse engineer the header! | mrmonkeyman wrote: | I don't get this mentality. You haven't created this | system. You have no idea what it takes or else you would | have created it yourself. You don't like the price, but | you want it anyway. How is this, I mean come on, | different from stealing? You even say it: cutting out the | middle man. You want to remove their profits. | kccqzy wrote: | I might be wrong but I thought Dropbox used to do that as | well. All the sync state is stored in a sqlite file with the | header changed or removed. | freeopinion wrote: | At $40K/seat, that must be some special software. Simple | economics invites competition. Depending on the software, it | may not take too many seats to incent the funding of an | alternative. | | Maybe this is the kind of software that requires huge | development costs. But maybe it would be worth 20 seats' | worth of customers joining forces to fund a team of 5 people | to build you a competing app tailored to your specific | needs/wants and completely under your control. | | Granted, that could bump your costs from $800K/year to | $1.6M+/year. But only short-term. Once your software is | production-ready, you drop the costs of your current | software. So think of it more like going from $8M/10 years to | $6-10M/10 years but having complete control to add the | features you want. And perhaps having the opportunity to | recoup $millions/year by licensing to others. Or, open source | it and give others the same kind of control while benefiting | from the features they add. Spread your development costs | across more seats to further lower your $/seat. | | Or, look at the 100 employees your vendor currently has and | lose heart, then hope somebody with deep pockets funds a | competitor. | lbutler wrote: | This particular software is for water utilities to model | and simulate their water and waste water networks. | | It's mostly used so utilites can forecast growth in their | areas for the next 25+ years and see the impact on their | networks and feed into their capital work projects. | | A decently sized utility may spend up to $200M/yr on | capital works so $40k isn't even a line item! | | There is completion in the market but consultants are | forced to use what their clients pick and most utilites | aren't that price sensitive. | | There are also open source alternatives by the EPA[1][2], | and most commercial operators are just wrappers around this | public domain software. | | I'm trying to create FOSS to help view and run these | models. | | [1] https://en.m.wikipedia.org/wiki/EPANET | | [2] https://en.m.wikipedia.org/wiki/Storm_Water_Management_ | Model | praveen9920 wrote: | The article has great points. | | Main reason for existence of lot of file formats is that | enterprises don't want just about everyone access their files and | modifying them. It greatly reduces the usage of their proprietary | software hence their revenues. | | I remember the days when open office trying to render doc file | but formatting used to suck big time. | | Open source softwares should leverage this kind of file formats | for inter operating. | agumonkey wrote: | it's the old business paradox, if everything was totally lean | and solved, how do people earn a living ? you need a strong | force to overcome this hurdle and show how to reorganize the | system in the large | catalogia wrote: | > _Main reason for existence of lot of file formats is that | enterprises don 't want just about everyone access their files | and modifying them. It greatly reduces the usage of their | proprietary software hence their revenues._ | | I'm sure this does happen, but it seems more like MBA-paranoia | than a legitimate concern. For instance I sincerely doubt Adobe | has lost any revenue by using SQLite with lightroom, despite | various open source tools being able to interact with their | lrcat (sqlite) files. | catalogia wrote: | Is there any good way to use SQLite as an application file format | while also using it in WAL mode? They seem sort of mutually | exclusive; near as I can figure the way to do this would be to | execute `PRAGMA journal_mode=DELETE` when the user requests a | save file, then copy the DB to the location the user specifies. I | think this negates a few of the advantages of using sqlite as an | application file format though. | Redoubts wrote: | I think the backup api or VACUUM INTO are the preferred ways, | though you're gonna probably want to turn off wal on the | exported copy still. | GnarfGnarf wrote: | I just converted a 90's-vintage xBase app to SQLite. I just love | it. Should have done this years ago. | | The only thing I miss is the ability to jump X% into an index. | Not a deal-breaker, though. | overcast wrote: | What happened to storing blobs in a database being giant rookie | mistake? I'm currently dealing with this at the office, where now | a SQLServer database has grown to unmanageable size from helpdesk | software attachments. | JulianWasTaken wrote: | It's less of a rookie mistake than it seems these days: | https://danluu.com/file-consistency/. | adrianmonk wrote: | I don't think there is a consensus that storing blobs in the | database is always the wrong decision. Yes, there are | drawbacks, but it really depends on the circumstances, and it | might be the right choice. | | Suppose, for example, that you were going to make your own | implementation of Git. The normal implementation has a bunch of | blobs under one directory. There's not much use in manipulating | this directory separately from everything else under .git. The | blobs and most of the non-blob stuff are only useful together | anyway, so you already manage them as a single collection. | | It could create problems for certain backup tools, so that's a | disadvantage. But it also simplifies application development. | So it's not a slam dunk one way or the other. | agumonkey wrote: | attaching oversized unreadable scanned pictures of digital text | files I guess | foldr wrote: | The suggestion here is to use it as a file format, not as a | file system. | GordonS wrote: | I think that refrain usually applies to when the database is | used to store relational data, rather than when it's being | requisitioned for use as a file format. | MaxBarraclough wrote: | I've never faced this problem so perhaps I'm missing something | obvious, but what's the big difference storing binary blobs on | the DB vs, say, storing URLs on the DB and hosting the blobs | elsewhere? Is the size of the database the main concern? | raphaelj wrote: | Both approaches have their advantages. | | Storing blobs on the DB makes the data more consistent (the | database is always in sync with the file content), and allows | you to use advanced features of the DB on these files | (transactions, SQL queries, data rollback ...). You also only | have to backup the DB. | | Storing links to the objects is usually more scalable, as | your DB will not grow as fast. DBs are usually harder to | scale, and also more expensive (at least 10x per GB). | | It really depends on the project, but I'm favoring more and | more storing the data in BLOBs, as it makes backups easier | and as I can use SQL queries directly on the data. Databases | as a service also make it easy to scale DBs up to +/- 2TB. | But the cost might still be an issue. | ginko wrote: | >and also more expensive (at least 10x per GB). | | Expensive in what way? Memory/Compute? It can't be | licensing money since SQLite is public domain. | raphaelj wrote: | I'm just comparing the per stored GB cost of on demand | database services vs on demand storage services. | Tajnymag wrote: | raphaelj is probably talking about classic database | machines like PostgreSQL and MariaDB. | | In my view, storing blobs in sqlite doesn't have a huge | disadvantage tied to it. Sqlite grows pretty linearly and | as it stores the whole db in a single file, hosting | provider doesn't even have to know about you using it. | ojnabieoot wrote: | Did you read the article? This isn't about storing blobs as | data in SQLite, but rather using SQLite-formatted databases as | the file format for an application. | | As a simple example, Word documents are just zipped XML text | files (try unzipping a .docx and looking inside). Instead of | using this, you could a SQLite .db file (probably with a | different extension), translating the XML files into tables, | and folders into databases. The OpenOffice case study has more | details: https://sqlite.org/affcase1.html | overcast wrote: | Apparently I'm misreading the following paragraph. | | Any application state that can be recorded in a pile-of-files | can also be recorded in an SQLite database with a simple | key/value schema like this: CREATE TABLE | files(filename TEXT PRIMARY KEY, content BLOB); | roenxi wrote: | It is reasonable to adopt that style if it is data you | don't control. Storing blobs is probably a mistake if it is | data you have control over. | | Sometimes all that is known about data is it exists - in | that case, into the database as a blob it shall go. If it | can be decomposed it probably should be. | ojnabieoot wrote: | I agree that is unclear for an introduction, and that the | article could have been written better. The "pile-of-files" | they are referencing isn't a filesystem of arbitrary and | unstructured binary data stored for humans to look at later | (like your Helpdesk attachments), but rather structured and | specific data for an application to read in a well-defined | manner. | | Your Helpdesk example used SQLServer problematically | because a SQL database shouldn't be used as an arbitrary | file store. But if you know what the file structure is and | have a reasonable grasp for how it might scale (that each | binary blob is small, that each user only adds one row to | the database, etc), there are huge advantages to "a SQLite | table with lots of binary and text columns" versus "a | folder with lots of binary and text files." And if those | text files are just small key-value pairs then maybe they | should also go in SQLite. | wenc wrote: | The OpenOffice case study is a fascinating thought | experiment. | | We don't really create too many new file formats these days, | and if we did they're highly performance specialized | (parquet, arrow). | | Just wondering aloud, what recent file format would have | benefited from being based on sqlite? | TheRealPomax wrote: | That's the thing: this isn't "a database" anymore when it's | used as an application file format, it's now "an application | file format" that is conveniently also a sqlite db file so | everything in it is trivially accessible. | | Storing huge blobs then becomes a matter of "is this my data, | or is this general data that I'm merely also making use of". | Examples abound of both: | | - Adobe Lightroom (a non-destructive photo editor) goes with | "these are not my images" and leaves them out of its sqlite | "lrcat" files, instead maintaining a record of where files can | be found on-disk. | | - Clip Studio (a graphics program tailored for digital artists) | on the other hand stores everything, because that's what you | want: even if you imported 20 images as layers, that data is | now strongly tied to a single project, and should absolutely | all be in a single file. | | So the key point here is that, yes: sqlite files are database, | but because they're also single files on disk, their _use_ | extends far beyond what a normal database allows. As | application file format, the rules for application files apply, | not the rules for standard databases. | HelloNurse wrote: | Storing blobs is good if the blobs are what you actually want | to store; the rookie mistake is storing blobs instead of the | well-structured, meaningful, indexable and much more compact | columns and additional tables that you should be storing but | you are too rookie to to design. | bob1029 wrote: | Storing your blobs outside of the database when they are an | inherent part of some larger transactional model is the giant | rookie mistake IMO. We did this for a little while for the | reasons of "oh my gosh look how big the .db is getting" and it | ended up a complete shitshow. Simple tasks like cleaning up old | entries can quickly turn into a nightmare if you have to go | around disk and make sure you are not leaving garbage behind. | There are also other concerns outside the box like VM | snapshots. This could leave you in a very inconsistent state | regarding what blob you have on disk vs what the database is | expecting. | | Is having everything in one gigantic .db file an actual | downside? What makes a database's size "unmanageable"? | Presumably, you'd have to store that business information | somewhere anyways, right? I don't understand how 1 unified file | is unmanageable, but splitting that concern up into 2 different | domains magically makes it better. | jzoch wrote: | It can trash db performance. A relatively simple alternative | is to store references to immutable blobs in the db so that | everything remains transactionally consistent (and easily | restorable to any point-in-time). This is what many large | companies do. I happen to work at a company that runs the | largest + most "shall not be named" databases and has | struggled for years to evict all the blobs people shoved in | em. | aembleton wrote: | If you index on the file_name in a table where you have | file_name and a blob then why would it trash performance? | SQLite can quickly locate where to start reading the blob. | chrismorgan wrote: | How would SQLite go as for something like Audacity projects? | Audacity projects are a pile of files: an XML file that declares | the project metadata, tracks, channels, envelopes (an example of | non-destructive editing), and references to the zillions of audio | files that it drops in a directory beside the XML file (typically | less than 1MB each). | | I presume it splits all the audio up into small files so that | most types of edits can only need to touch a small area. | | If you directly ported that to SQLite, would it work fairly well, | or would you want to restructure it somehow? Things like | additions or deletions, would it need to write lots of extra data | to the disk (would it be doing something like defragmenting, or | would it grow larger than it should, or are there other tricks | that I don't know about to delete a chunk from the middle of a | file without needing to rewrite all of the file beyond that | point)? | catalogia wrote: | Adobe Lightroom seems comparable (non-destructive image | editing.) It uses a sqlite database that points to image files | on the filesystem. I think precisely the same approach could be | taken with an audio editing application. | | You could also consider putting the audio files into the sqlite | db, which _might_ work alright. I 've heard of image thumbnails | being stored in sqlite dbs (maybe by Lightroom iirc?) though | those are probably smaller than your audio clips I'm guessing. | chrisbennet wrote: | I used it with a different extension to save application for a | product I wrote. | | Any classes that need to be saved had serialize() and | deserialize() functions. Serialize before saving to SQLite and if | read in, deserialize after reading it from the DB. | 6gvONxR4sf7o wrote: | I used to advocate for this, then tried opening a sqlite file I'd | been using on another computer and it said it was corrupted. I | took it back to the original computer and it opened just fine. So | there was something going on that made it more complex and less | portable than something simple like json, which is super | frustrating because otherwise it's a really useful format for | datasets, and now I can't really trust it. | curiousmindz wrote: | If you use certain new features of SQLite, the database file | gets "upgraded". At that time, you must use a version of SQLite | that understands these new features. | | So, the other computer you used was probably running an older | version of SQLite. Just update it to make it work. | bollockitis wrote: | That's a good catch. I'd there any way to tell which version | of SQLite created the file? Aside from trial and error, I | mean? | lbutler wrote: | The version is stored in byte 96 of the header of the | sqlite file[1]. | | I haven't looked but there will be some sqlite command to | query it and I'm sure some viewer tools will display it as | well. | | [1] https://www.sqlite.org/fileformat.html | pjscott wrote: | You can print the version with this magical incantation: | $ python -c 'import sys, struct; | print(struct.unpack(">I", open(sys.argv[1], | "rb").read()[96:100])[0])' foo.db | banana_giraffe wrote: | Recentish versions of file can show you the version: | $ file example.db example.db: SQLite 3.x database, | last written using SQLite version 3007017 | teleforce wrote: | Shouldn't this has 2014 inside the title? The story has also | appeared several times in HN before. | | Very interesting concept but now I think perhaps application file | format using TileDB will be much better since it can support | sparse data as well [1]. | | [1] https://github.com/TileDB-Inc/TileDB | HelloNurse wrote: | TileDB uses a "pile of files", as the article calls it, it | doesn't have SQL, and it has a rather embarrassing point of | view on race conditions (https://docs.tiledb.com/main/basic- | concepts/consistency). It could be a good choice for | applications that rely on large and simple multidimensional | arrays, hardly the same "market" as application file formats. | | Moreover, database tables are a _very_ good fit for sparse | multidimensional arrays. | airstrike wrote: | That's not at all similar it doesn't support SQL | corysama wrote: | "Atomic transactions" is a feature needs _formal_ support in | random file formats way more often than people realize. Simply | writing to a file at all in an guaranteed-atomic way is much | harder than it looks. That guarantee becomes important when your | app gets widely distributed. If you have a million users of your | free mobile app, 1 in a million events happen every day. For | example: random hardware shutdown midway through a file write | operation. How is your app going to react when it reads back that | garbled file? | | I've used SQLite on mobile apps to mitigate this problem. I've | used LMDB on a cloud app where the server was recording a lot of | data, but also rebooting unexpectedly. Would recommend. I've also | gone through the process of crafting an atomic file write routine | in C. https://danluu.com/file-consistency/ It was "fun" if your | idea of fun is responding to the error code of fclose(), but I | would not recommend... | jolmg wrote: | > For example: random hardware shutdown midway through a file | write operation. How is your app going to react when it reads | back that garbled file? | | Don't filesystem journals ensure that you can't get a garbled | file from sudden shutdowns? | espadrine wrote: | > _Don 't filesystem journals ensure that you can't get a | garbled file from sudden shutdowns?_ | | For once, I have a blog post that goes in detail into why | that is not true! | | (Or rather, not true unless you implement it in a very | sophisticated manner rarely used in practice.) | | https://espadrine.github.io/blog/posts/file-system-object- | st... | SonOfLilit wrote: | They ensure you don't get a garbled filesystem. | | They also expose an API that allows you, if you're very | careful and really know what you're doing (like danluu or the | SQLite author), to write performant code that won't garble | files on random shutdowns. But most programmers at most times | would rather just let the OS make smart decisions about | performance at the risk of garbling the file, or if they | really need Durability, just use a library that provides a | higher level API that takes care of it, like LMDB or an RDBMS | like SQLite. | | To not get your file garbled, you need to use an API that | knows about legal vs. illegal states of the file. So either | the API gets a complete memory image of the file content at a | legal point in time and rewrites it, or it has to know more | about the file format than "it's a stream of bytes you can | read or write with random access". | | Popular APIs to write files are either cursor based (usually | with buffering at the programming language standard library | level, I think, which takes control of Durability away from | the programmer) or memory mapped (which realllly takes | control of Durability from the programmer). | | SQLite uses the cursor API and is very careful about buffer | flushing, enabling it to promise Durability. Also, to not | need to rewrite the whole file for each change, it does it's | own Journaling inside the file* - like most RDBMSs do. | | * Well, it has a mode where it uses a more advanced technique | instead to achieve the same guarantees with better | performance | jolmg wrote: | > They ensure you don't get a garbled filesystem. | | Well, they do that, but they also protect data to | reasonable degrees. For example, ext3/4's default | journaling mode "ordered" protects against corruption when | appending data or creating new files. It admittedly doesn't | protect when doing direct overwrites (journaling mode | "journal" does, however), but I'm pretty sure people | generally avoid doing direct overwrites anyway, and instead | write to a new file and rename over the old one. | | I'm not sure if it would protect files that are clobbered | with O_TRUNC on opening (like when using > in the shell). I | would imagine that using O_TRUNC causes new blocks to be | used and so the old data isn't overwritten and it isn't | discarded because the old file metadata which would | identify the old blocks corresponding to the file would be | backed up in the journal. | | > They also expose an API that allows you, if you're very | careful and really know what you're doing (like danluu or | the SQLite author), to write performant code that won't | garble files on random shutdowns. | | As far as I see for the general case, being "very careful | and really knowing what you're doing" consists of just | avoiding direct overwrites. Of course, a single file that | persists data by the needs of software similar to a web | server (small updates to a big file in a long-running | process) is going to want the performance benefits of | direct overwrites. I can totally see SQLite needing special | care. However, I don't think those needs apply to all | applications. | labawi wrote: | > As far as I see for the general case, being "very | careful and really knowing what you're doing" consists of | just avoiding direct overwrites. | | That's a dangerous thing to say. There are many ways to | mess up your data, without directly overwriting old data. | | If you write a new file, close, then rename, on a typical | linux filesystem, mounted with reasonable options, on | compliant hardware, I think you should have either the | old or new version of the file on power loss, even if you | don't sync the proper things in proper order, but that's | only because of special handling of the common pattern. | See e.g. xfs 0 size file debacle. | | Not an expert. | hansvm wrote: | Not really. If you have a file format that requires, e.g., | changes to be done in two places then it's reasonable to | write to one place, have the system shut down never having | written to the second place, and now have a corrupt file. | | The journal ensures (helps ensure?) that individual file | operations either happen or don't and can improve write | performance, but it can't possibly know that you need to | write, e.g., two separate 20TB streams to have a non-corrupt | file. | [deleted] | jolmg wrote: | For a single file, I thought that write operations were | committed when e.g. closing the file or doing fsync, but | now I'm not sure. I wonder if the system is free to commit | immediately after a write() ends. | | Based on your scenario, if an application-level "change" | involves updating 2 files, interpreting the update of only | one file and not the other as a corruption, you're right | that filesystem journaling wouldn't suffice. However, in | that case it wouldn't be that a single file was corrupted. | | Still, I wonder about the other case, about when the | filesystem decides to commit. | hedora wrote: | Posix (and I think windows) guarantee this atomically and | durably overwrites a file: | | tempfile = mkstemp(filename-XXXX) | | write(tempfile) | | fsync(tempfile) | | close(tempfile) | | rename(tempfile, filename) | | sync() | | Assume the entire write failed if any of the above return an | error. | | In some systems (nfs and ext3 come to mind), you can skip the | fsync and/or sync, but don't do that. It doesn't make things | significantly faster on the systems where it's safe, but it | definitely will lose data on other systems. | | The only loophole I know of is that the final sync can fail, | then return anyway. If that happens, the file system is | probably hosed anyway. | corysama wrote: | Here's the article I couldn't find before on the process | https://lwn.net/Articles/457667/ | | Which links full source for the process you described | https://lwn.net/Articles/457672/ | networkimprov wrote: | You need a recovery step on startup to retry the rename if | tempfile is complete, or delete it if it isn't. | | That means you need a way to verify that tempfile is | complete. I do that by removing filename after completing | tempfile. And that requires a placeholder for filename if it | didn't already exist (e.g. a symlink to nowhwere). | | On crash, rename may leave both files in place. | | This technique doesn't work if you have hardlinks to filename | which should refer to the new file. | jonas21 wrote: | Regardless of whether the tempfile is complete, you can | just ignore (or delete) it on startup. From the caller's | perspective, the save operation doesn't succeed until the | rename is done and written to disk. | [deleted] | dfox wrote: | I second the recomendation of LMDB. With one important caveat: | under heavy write load it is perfect demonstration of | brokenness of semaphore implementation on freebsd and macos. | skrebbel wrote: | In what way is LMDB better than eg SQLite or Redis? For what | kinds of use cases would you recommend it? | dfox wrote: | For my applications: latency. The fact that writers do not | block readers is just a nice bonus. | kortex wrote: | Do you reckon LMDB would be reasonably performant | compared to raw mmaped files for zero-copy passing large | images between processes? I want something like a ring | buffer but mitigate the risk of use-after-free if the | consumer lags. Seems like lmdb automatically re-uses | memory that's long been freed but is sensible to detect | that an incoming write needs more space. | | Python/c++. | | I've looked into mmap and flock but it's messy and not | highly portable. | dfox wrote: | LMDB is in fact not much more than large mmaped file and | clever synchronization mechanism on top of that that also | serves as dictionary implementation and transaction | mechanism. My uses of LMDB are essentially replacements | for mmaped file and LMDB gives me the ability to sanely | do partial updates. | corysama wrote: | It's very simple. Single C file implementation. Binary blob | keys : binary blob values. The end. If all you need is a | bunch of blobs written and read back reliably while in a | statistically unreliable situation, LMDB is great. | namibj wrote: | Well, sqlite is also just a single .c with a single .h. | tlb wrote: | Sqlite is dozens of files. See | https://www3.sqlite.org/cgi/src/dir?ci=tip&name=src | samatman wrote: | You're both right... kinda. | | SQLite is dozens of files, if you're browsing it, or | modifying it. Which you can do, as long as you're ok with | contributing your changes being impossible, since it's | open-source but closed-contribution. | | If you're compiling it, it is, in fact, one .c and one | .h: | | https://www.sqlite.org/amalgamation.html | | If you're choosing between LMDB and SQLite, the latter | sense is the relevant one, so they're identical along | this particular axis. ___________________________________________________________________ (page generated 2020-06-13 23:00 UTC)