[HN Gopher] TIL-Python has a built-in persistent key-value store... ___________________________________________________________________ TIL-Python has a built-in persistent key-value store (2018) Author : zora_goron Score : 227 points Date : 2022-09-15 10:10 UTC (12 hours ago) (HTM) web link (remusao.github.io) (TXT) w3m dump (remusao.github.io) | nonethewiser wrote: | Huh. I am working on a little utility to deduplicate csvs using | sqlite. Seems like this would likely be better. | | I used sqlite in the project's name. This makes me think of an | article I read recently that suggested not using descriptive | names for projects, for reasons such as this. | ALittleLight wrote: | You may want to look at pandas too (unless the issue is your | CSVs are too large to fit in memory). | ivvve wrote: | This is probably a dumb intervention but forgive me because it | made me giddy to find: the pandas drop_duplicates() method is | pretty great, if you're in a pinch. Again -- forgive me if you | know about it! | amelius wrote: | This is going to fall flat when you try to use multithreading. | The GIL will rear its head. | giraffe_lady wrote: | This is a wrapper around dbm, the unix library? Though having a | portable interface to that tool is nothing small, and most | languages don't have one in their standard library. Though I | think all languages with a stdlib as big as python's probably do. | It's not inherently a python thing though, unless I'm badly | misunderstanding. | collyw wrote: | I look at it the other way: this is the least-effort disk | persistence that I have seen in python. A fair amount of dev | stuff is reading from disk and putting it into some kind of | object / dictionary format. I am not really bothered about what | it uses under the hood (ok, it is interesting enough that I have | looked up DBM). To me the main appeal of this is that you can | use what appears to be a normal python dictionary and have it | save to disk with minimal effort.
This is a lot less friction | than using a database. | giraffe_lady wrote: | Oh yeah for sure. I don't mean to convey any disdain for dbm | itself or for having it rolled into the standard library. | Reading and writing custom files for "convenience | persistence" or configuration is one of those really common | things that you don't usually need to actually do if you know | about tools like this. | | I was just pointing out that the title and even the article | seem to associate this system with python per se, rather than | understanding it as python's interface to a common and pre-existing | system. | Someone wrote: | FTA: Why is sqlite3 so slow? | | My money is that it's mostly because it actually stores the data. I | don't think _dbm_ guarantees any of the data written makes it to | disk until you _sync_ or _close_ the database. | | This writes, guesstimating, on the order of 100 kilobytes, so | chances are the _dbm_ data never hits the disk. | akubera wrote: | There's also the shelve[0] module, which allows storing any | pickleable object in a persistent key-value store, not just | strings/bytes. I've found it very handy for caching while | developing scripts which query remote resources, without having to | worry about serialization. | | [0] https://docs.python.org/3.10/library/shelve.html | | Obligatory pickle note: one should be aware of pickle's security | implications and should not open a "Shelf" provided by untrusted | sources, or rather should treat opening a shelf (or any pickle | deserialization operation, for that matter) as running an | arbitrary Python script (which cannot be read).
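A minimal sketch of the shelve pattern described in the comment above: any pickleable object goes in under a string key and persists to disk (the path here is illustrative; shelve may create more than one file behind it).

```python
import os
import shelve
import tempfile

# Store an arbitrary pickleable object under a string key.
path = os.path.join(tempfile.mkdtemp(), "cache")

with shelve.open(path) as db:
    db["result"] = {"rows": [1, 2, 3], "source": "remote"}  # pickled transparently

# Reopen later (e.g. on the next script run) and read it back.
with shelve.open(path) as db:
    restored = db["result"]

print(restored["rows"])  # the nested object round-trips intact
```

As the comment warns, only open shelf files you created yourself: reading one is equivalent to running arbitrary code.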
| cristoperb wrote: | I once wrote a locking wrapper around the shelve module so I | could use it as a thread- and multiprocess-safe key-value cache | (including a wrapper around the requests module's `get()` to | transparently cache/validate http resources according to | headers): | | https://github.com/cristoper/shelfcache | | It works despite some cross-platform issues (flock and macos's | version of gdbm interacting to create a deadlock), but if I | were to do it again I would just use sqlite (which Python's | standard library has an interface for). | TillE wrote: | > but if I were to do it again I would just use sqlite | | Yeah, I tried to use shelve for some very simple stuff | because it seemed like a great fit, but ultimately found that | I had a much better time with tortoise-orm on top of sqlite. | | If you need _any_ kind of real feature, just use sqlite. | PartiallyTyped wrote: | I like to use `joblib.Memory` for my caching, as it catches | changes in functions, which prompts rerunning over loading from | the cache, and then overwrites the old result. | mrweasel wrote: | Can you safely "pickle" Python objects across different | architectures and Python versions (assuming we forget about | Python 2)? | NegativeLatency wrote: | There's a version of the pickle format, so you might be able | to do it across versions, but I suspect the version has | changed for security reasons over time? | Spivak wrote: | Pickle could _in theory_ be architecture dependent, | since __getstate__ and __setstate__ are user-provided | hooks. But you would have to try to do that on purpose. | | And you don't even have to forget about Python 2! If you use | format version 2 you can pickle objects from every version | from Python 2.3+, and all pickle formats are promised to be | backwards compatible. If you only care about Python 3 then | you can use version 3 and it will work for all Python 3.0+. | | https://docs.python.org/3/library/pickle.html#data-stream-fo...
| | The reason against using pickle hasn't changed, though: if you | wouldn't exec() it, don't unpickle it. If you're going to | send it over the network, use MAC use MAC use MAC. Seriously, | it's built in -- the hmac module. | mmcgaha wrote: | We had a program that was sending a pickled session state | as a cookie. We solved that by packing the cookie as a | random string, a timestamp, the object, and a MAC. We | validated the MAC, then checked the timestamp, and finally | unpickled the object. It still bothers me that we are | unpickling data passed by the client, but I ran out of arguments | against doing it. | clysm wrote: | Another problem with pickles in any sort of living code base | is when one makes modifications to the type itself - renaming, | refactoring, whatever. Pickling objects (and nested objects) | that aren't explicitly meant for data | storage/retention/transmission leads to headaches. | | It's best to use another dedicated type or library specifically | for this task. | formerly_proven wrote: | There's even an object database built around pickle, ZODB. It | even has a networked implementation with replication / | failover. Used to be part of Zope, originally written in the | 1998 time frame or so - GvR actually committed a bunch of stuff | to it. | nas wrote: | ZODB is awesome and overlooked, IMHO. I'm biased, I guess, | because I was involved in making Durus, which is inspired by | ZODB. The ZODB model is not appropriate for all applications | (optimistic concurrency control), but for the applications it | works for, it's great. Very easy to develop with (no | relational-to-OO mismatch) and performance can be great if | you design your model carefully. The client caching of data | is great. It is a bit like memcache, but the data is already | there as objects in RAM. The database server will invalidate | the cache for you, no manual invalidation needed. | mwcampbell wrote: | I developed a web application in 2004 using Quixote and | Durus.
I wonder how many developers outside of MEMS | Exchange ever used both of those packages. Somehow I had | not yet encountered a proper ORM (I didn't discover | SQLObject until later, and Django wasn't out yet), so I | liked Durus at the time. That web application is still | running, and in the years since then, I've had to handle | several escalated customer support cases, and I often | wished I had chosen a relational database so I could easily | do ad-hoc queries. So Durus probably wasn't the best choice | for my application, but that's not your fault. And one | thing I liked about Durus was that the implementation was | simple enough that I felt I could really understand how it | worked. So thanks for that. | nijave wrote: | Wonder how this compares to using the filesystem, where keys | are either filenames or file paths (and the value is stored | in the file)? | j0hnyl wrote: | I wonder if it's still faster than sqlite in 2022. | cbdumas wrote: | The script linked at the bottom of this page uses sqlite | without an index on the table. In my testing, if you add an | index, sqlite quickly outperforms dbm in terms of read | performance. | calpaterson wrote: | According to his benchmark run today on Python 3.9 on linux, | dbm is still considerably faster for random access. | | That said, which database dbm is using varies a lot by platform. | I think on linux it's usually BerkeleyDB. | | I use dbm a lot to cache pickles of objects that were slow to | create, eg pandas dataframes. | curiousgal wrote: | Sorry if it's a dumb question, but since dbm writes stuff to | disk, how is that different from loading the pickles? | dfox wrote: | You get one file for the whole store, not one per key. | Also, it tends to handle lots of keys | significantly better than filesystems handle lots of files | in one directory. | curiousgal wrote: | Thank you!
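A hedged sketch of the caching pattern described above (pickles of slow-to-build objects in a dbm file): dbm stores only bytes, so the object is pickled on the way in and unpickled on the way out. The helper names and path are illustrative, not from any library.

```python
import dbm
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "objcache")
builds = 0  # counts how often the expensive object is actually built

def expensive():
    global builds
    builds += 1
    return [n * n for n in range(10)]

def cached(key):
    # Look the key up in the dbm file; build, pickle, and store on a miss.
    with dbm.open(path, "c") as db:
        if key in db:
            return pickle.loads(db[key])
        value = expensive()
        db[key] = pickle.dumps(value)
        return value

first = cached(b"squares")   # builds and stores the object
second = cached(b"squares")  # served from disk; expensive() is not called again
print(builds)  # 1
```

The usual pickle caveat from earlier in the thread applies: only unpickle cache files your own code wrote.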
| qalmakka wrote: | I think on Linux it's often GNU DBM; Berkeley DB is a | different, better implementation by Oracle that requires a | third-party module. | gerty wrote: | It's not built-in, of course, but if you can live with the quirky | API, LMDB as an embedded key-value store will be hard to beat in | terms of speed. It can also do "one-writer, many-readers" and | transactions. | cbdumas wrote: | So I read the script used to compare 'dbm' and 'sqlite3', and in | sqlite it creates a table with no index. Hard to take that | comparison seriously. I wrote a little benchmark script of my own | just now and sqlite3 beats DBM handily if you add an index on the | key. | simonw wrote: | Yeah, I tried that myself - I took that benchmark script and | changed the CREATE TABLE lines to look like this (adding the | primary key):
|         CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)
| | Here are the results before I made that change:
|         sqlite             Took 0.032 seconds, 3.19541 microseconds / record
|         dict_open          Took 0.002 seconds, 0.20261 microseconds / record
|         dbm_open           Took 0.043 seconds, 4.26550 microseconds / record
|         sqlite3_mem_open   Took 2.240 seconds, 224.02620 microseconds / record
|         sqlite3_file_open  Took 7.119 seconds, 711.87410 microseconds / record
| | And here's what I got after adding the primary key:
|         sqlite             Took 0.040 seconds, 3.97618 microseconds / record
|         dict_open          Took 0.002 seconds, 0.19641 microseconds / record
|         dbm_open           Took 0.042 seconds, 4.18961 microseconds / record
|         sqlite3_mem_open   Took 0.116 seconds, 11.58359 microseconds / record
|         sqlite3_file_open  Took 5.571 seconds, 557.13968 microseconds / record
| | My code is here: | https://gist.github.com/simonw/019ddf08150178d49f4967cc38356... | cbdumas wrote: | Yeah, now that I dig in a little further, it looks like it's | not as clear cut as I thought.
DBM read performance is better | for me with larger test cases than I was initially using, | though I am getting some weird performance hiccups when I use | very large test cases. It looks like I'm too late to edit my | top-level response to reflect that. | divbzero wrote: | Just to spell out the results: adding the primary key appears | to improve SQLite performance but still falls short of DBM. | banana_giraffe wrote: | The index is the least of the issues with the SQLite | implementation. It's calling one INSERT per record in that | version, so the benchmark is spending something like 99.8% of | its time opening and closing transactions as it sets up the | database. | | Fixing that on my machine took the sqlite3_file_open benchmark | from 16.910 seconds to 1.033 seconds. Adding the index brought | it down to 0.040 seconds. | | Also, I've never really dug into what's going on, but the dbm | implementation is pretty slow on Windows, at least when I've | tried to use it. | avinassh wrote: | I was working on a project to insert a billion rows in SQLite | under a minute; batching the inserts made it crazy fast | compared to individual transactions. | | link: https://avi.im/blag/2021/fast-sqlite-inserts/ | Steltek wrote: | For use cases where you want a very simple key-value store, | working with single records is probably a good test? | banana_giraffe wrote: | Maybe? | | Sure, mutating data sets might be a useful use case. But | inserting thousands of items one at a time in a | tight loop, then asking for all of them, is testing an | unusual use case in my opinion. | | My point was that we're comparing apples and oranges. By | default, I think, Python's dbm implementation doesn't do | any sort of transaction or even sync after every insert, | whereas SQLite does have a pretty hefty atomic guarantee | after each INSERT, so they're quite different actions.
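The batching point above can be sketched in a few lines: `executemany` inside a single transaction replaces the implicit commit-per-INSERT that dominates the naive benchmark. The schema mirrors the benchmark's general shape, with a primary key added.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)")

rows = [(f"k{i}", f"v{i}") for i in range(1000)]

# One transaction for the whole batch, committed when the block exits,
# instead of one implicit transaction per INSERT.
with con:
    con.executemany("INSERT INTO store VALUES (?, ?)", rows)

(count,) = con.execute("SELECT count(*) FROM store").fetchone()
print(count)  # 1000
```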
| [deleted] | OGWhales wrote: | > Also, I've never really dug into what's going on, but the | dbm implementation is pretty slow on Windows, at least when | I've tried to use it. | | Seems like this would be why: | https://news.ycombinator.com/item?id=32852333 | banana_giraffe wrote: | "the slow-but-simple implementation in module dbm.dumb will | be used" | | No kidding. | | https://github.com/python/cpython/blob/main/Lib/dbm/dumb.py | | It uses a text file to store keys and pointers into a | binary file for values. It works .. most of the time .. but | yeah, that's not going to win any speed awards. | jacobr1 wrote: | Also you can use WAL | MrBuddyCasino wrote: | PHP also has support for several Berkeley-style databases: | https://www.php.net/manual/en/dba.requirements.php | pmarreck wrote: | For the record, so does Erlang/Elixir, right in their stdlib | (OTP), as either ETS, DETS, or Mnesia: | https://en.wikipedia.org/wiki/Mnesia | https://www.erlang.org/doc/man/mnesia.html | Snawoot wrote: | There is no point in using sqlite3 in its default journaling mode. I | bet results may be even better than dbm if you use
|         PRAGMA journal_mode=WAL;
| bob1029 wrote: | > There is no point in using sqlite3 in its default journaling mode. | | There are situations where you wouldn't want to, but they are | probably very uncommon outside of the embedded computing world. | Copying 3 files vs 1 is not a gigantic deal. Most of the time | you aren't even moving a SQLite database around. | quietbritishjim wrote: | There's never a need to copy 3 files anyway. If the database | is closed cleanly then the WAL file and lock file are | deleted. If not (either it's still open or it wasn't closed | cleanly), then I think any half-finished transactions will be | discarded if you copy all the files to a new location. Certainly | safest not to copy the 2 extra files in any case.
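Applying the WAL suggestion from Python is a one-liner; the pragma is persistent for file-backed databases, so it only needs to be run once per database file (the path here is illustrative).

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "db.sqlite")
con = sqlite3.connect(path)

# Switch from the default rollback journal to write-ahead logging.
# The pragma returns the journal mode actually in effect.
(mode,) = con.execute("PRAGMA journal_mode=WAL").fetchone()
print(mode)  # 'wal' for a file-backed database
con.close()
```

Note that in-memory databases ignore the setting and report their own mode, which is one reason benchmarks of `:memory:` and file-backed SQLite diverge.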
| WorldMaker wrote: | You can still copy _just_ the primary file and not the WAL or | any other files if you want to live dangerously and possibly | lose transactions if there are write operations in parallel. | | Of course, you'd be better off with WAL-based replication | tech like litestream instead of plain file copies if you are | truly worried about parallel operations during your file | copies. | epilys wrote: | You can even distribute individual WAL frames and arrange | distributed writes with raft consensus and a time lease. I | never formally modelled this, but it seemed to work | perfectly: the approach only lacked checkpointing the WAL | file and synchronising the database file across raft nodes. | nonethewiser wrote: | more info: https://sqlite.org/wal.html | kuschku wrote: | That only works if all processes accessing it share a memory | namespace (aka, are not in containers/jails/VMs and are on the same | physical machine). | singron wrote: | You are probably thinking of issues with networked | filesystems like nfs, 9p, or vboxsf where you can't mmap a | file and actually share memory. Basically any other real | filesystem that 2 processes actually open the same file on | will allow shared memory. | joppy wrote: | WAL works for different processes accessing the same SQLite | database. It's meant to improve concurrency, in fact. | kuschku wrote: | Yes, but the processes still need to share a memory | namespace. In WAL mode, SQLite creates a shared-memory file | which gets mapped into any process trying to access it | (that's why WAL mode creates additional .shm files). | | But that only works if all the given processes run on the same | physical machine, and not in containers/jails/VMs. | formerly_proven wrote: | I dunno what SQLite is doing specifically, but you can | certainly mmap a file across containers and it behaves | like you'd think.
SELinux and other hardening options | might interfere though, because shared mmaps should imho | be viewed as a very likely vector for cross-process | exploitation. | fuckstick wrote: | There isn't really a thing called a memory namespace, in | Linux or the BSDs at least. In Linux there are cgroups that | deal with accounting and you can share VM space among cloned | processes (which we then typically refer to as threads). | Neither of these affect the ability to share memory among | processes. | | The only namespace that matters here is the filesystem/mount | namespace. | | There is no reason you can't access the same shared SQLite | database on a common volume between containers on Linux for | instance. | wodenokoto wrote: | According to documentation it is not so much a db included as an | interface to one: | | > dbm is a generic interface to variants of the DBM database -- | dbm.gnu or dbm.ndbm. If none of these modules is installed, the | slow-but-simple implementation in module dbm.dumb will be used. | There is a third party interface to the Oracle Berkeley DB | | https://docs.python.org/3/library/dbm.html | tootie wrote: | Yeah, dbm is a posix database utility that's been around | forever. There's a few implementations available and there are | bindings for most languages. | | https://en.wikipedia.org/wiki/DBM_(computing) | chrismorgan wrote: | When it includes a fallback implementation so that you can rely | on it working, I think it's fair to call it "included". | cozzyd wrote: | Right but the on-disk formats between versions won't be | compatible... | qalmakka wrote: | The main issue here is that, given it relies on whatever | implementation of DBM you have installed, its performance | will vary considerably from OS to OS. | | I don't know how fast or slow the "dumb" implementation is, | but I can bet it is way slower than gdbm. 
I can see someone | using this module, considering it "fast enough" on GNU/Linux, | and then finding out that it's painfully slow on Windows, and | shipping binaries of another implementation is a massive | PITA. | tyingq wrote: | Each of which has its own set of limitations and watch-outs. I | would not use the python dbm interface for anything serious. | Ndbm, for example, has a per-item value limit that varies | between ~1-4k. | isoprophlex wrote: | There exists the 'diskcache' library too, which gives you a | persistent dictionary that stores any kind of pickle-able python | object. There's also a nice persistent memoization decorator, | with optional TTL... I've always found 'diskcache' to be very | ergonomic. | BiteCode_dev wrote: | If you can pip install something, I would also recommend | diskcache: | | - it's backed by sqlite, so the data is much more secure | | - you can access it from several processes | | - it's more portable | syncsynchalt wrote: | This is libdbm, a relatively ancient db library that you could | find on most unixes. | | In the 90s you might target dbm for your portable Unix | application because you could reasonably expect it to be | implemented on your platform. It lost a lot of share to gdbm and | sleepycat's BerkeleyDB, both of which I consider its successors. | | Of course all of this is now sqlite3. | [deleted] | blibble wrote: | the gnu version of dbm has some nice surprises | | if you go and look at the binary database you'll see random other | bits of memory in there, because it doesn't initialise most of | its buffers | ridiculous_fish wrote: | In AppleScript you can mark a variable as a property, and then | its value is automatically remembered across runs. For example, a | script can easily remember how many times it has run:
|         property runCount : 0
|         set runCount to runCount + 1
| | So natural to use! | | https://developer.apple.com/library/archive/documentation/Ap...
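A rough Python analogue of that AppleScript `property` trick, using dbm to persist a run counter across invocations (the path and function name are illustrative; a real script would use a fixed path rather than a temp directory):

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "state")

def bump_run_count():
    # Open (or create) the store, increment the persisted counter,
    # and write it back as bytes, since dbm values are bytes-only.
    with dbm.open(path, "c") as db:
        count = int(db.get(b"runCount", b"0")) + 1
        db[b"runCount"] = str(count).encode()
        return count

first_run = bump_run_count()   # 1 on a fresh store
second_run = bump_run_count()  # 2: the count survived the close/reopen
print(second_run)
```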
| nathancahill wrote: | I feel like AppleScript could have really shone if things had | played out differently in Mac OS automation. I learned to code | with AppleScript on OS 9 and it really got me into | programming. MacScripter.net was my old hangout. Good times. | im3w1l wrote: | As far as I can tell this doesn't support transactions, which | imho is table stakes for a persistence solution in 2022. | eesmith wrote: | FWIW, it was added in 1992: | https://github.com/python/cpython/commit/dd9ed839d6958f73704... | with gdbm added in 1994: | https://github.com/python/cpython/commit/4b4c664d2e93279c8d7... | im3w1l wrote: | The article makes it sound like it's something to consider | today (well, 2018); that's why I think it's fair to see how | well it satisfies modern requirements. | ok123456 wrote: | If you just need a simple key-value store that stays on | disk, it's fine. | jonnycomputer wrote: | Requirements _for what_ is the question. "Modern | requirements" doesn't mean that much. | pjc50 wrote: | DBM is the persistence solution of 1979: | https://en.wikipedia.org/wiki/DBM_(computing) | Zababa wrote: | > Why is sqlite3 so slow? Well, the benchmark is probably not | representative of the typical workload for sqlite (lots of | individual insertions and selections). If we perform the same | operations using executemany and one select of all the keys at | once, we get: | | I think this is because each time you call execute (which is | probably sqlite3_exec under the hood), your statement is prepared | again and then deleted, while with executemany it's prepared | once and then used with the data. According to the SQLite | documentation of sqlite3_exec: | | > The sqlite3_exec() interface is a convenience wrapper around | sqlite3_prepare_v2(), sqlite3_step(), and sqlite3_finalize(), | that allows an application to run multiple statements of SQL | without having to use a lot of C code.
| | I knew that when you execute a request a lot, prepared | statements are faster, but it seems that it's not exactly the case: | all statements are prepared, and the performance improvements come | from preparing (and deleting) only once. The documentation page | about the Prepared Statement Object has a good explanation of the | lifecycle of a prepared statement, which also seems to be the | lifecycle of all statements. | | [1]: https://www.sqlite.org/c3ref/exec.html | | [2]: https://www.sqlite.org/c3ref/stmt.html | quietbritishjim wrote: | The Python sqlite3 module caches prepared statements, I think | the last 100 by default, but it can be overridden. It does | require you to use proper parameters rather than string | interpolation for the cache to be effective (I didn't | check their benchmark code to see if that's the case). | | As others have pointed out, it's the lack of a suitable index | that's the real problem here. | jonnycomputer wrote: | I did not know this either. Wow. | collyw wrote: | I thought Perl's DB_File module was unique for that sort of | thing. That's probably quite useful. ___________________________________________________________________ (page generated 2022-09-15 23:00 UTC)