[HN Gopher] TIL-Python has a built-in persistent key-value store...
       ___________________________________________________________________
        
       TIL-Python has a built-in persistent key-value store (2018)
        
       Author : zora_goron
       Score  : 227 points
       Date   : 2022-09-15 10:10 UTC (12 hours ago)
        
 (HTM) web link (remusao.github.io)
 (TXT) w3m dump (remusao.github.io)
        
       | nonethewiser wrote:
        | Huh. I am working on a little utility to deduplicate CSVs using
        | sqlite. Seems like this would likely be better.
        | 
        | I used sqlite in the project's name. This makes me think of an
        | article I read recently that suggested not using descriptive
        | names for projects, for reasons such as this.
        
         | ALittleLight wrote:
         | You may want to look at pandas too (unless the issue is your
         | CSVs are too large to fit in memory).
        
         | ivvve wrote:
          | This is probably a dumb intervention, but forgive me because it
          | made me giddy to find: the pandas drop_duplicates() method is
          | pretty great if you're in a pinch. Again -- forgive me if you
          | already know about it!
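          | 
          | A minimal sketch of that approach (assuming the CSV fits in
          | memory; file and column names are illustrative):
          | 
          |     import pandas as pd
          | 
          |     # Read the CSV, drop exact-duplicate rows, write it back.
          |     df = pd.read_csv("input.csv")
          |     df = df.drop_duplicates()  # or subset=["id"] for one column
          |     df.to_csv("deduped.csv", index=False)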
        
       | amelius wrote:
       | This is going to fall flat when you try to use multithreading.
       | The GIL will rear its head.
        
       | giraffe_lady wrote:
        | This is a wrapper around dbm, the unix library? Having a
        | portable interface to that tool is nothing small, and most
        | languages don't have one in their standard library, though I
        | think all languages with a stdlib as big as Python's probably
        | do. It's not inherently a Python thing though, unless I'm badly
        | misunderstanding.
        
         | collyw wrote:
          | I look at it the other way: this is the least-effort disk
          | persistence that I have seen in Python. A fair amount of dev
          | work is reading from disk and putting the data into some kind
          | of object/dictionary format. I am not really bothered about
          | what it uses under the hood (ok, it is interesting enough that
          | I have looked up DBM). To me the main appeal of this is that
          | you can use what appears to be a normal Python dictionary and
          | have it save to disk with minimal effort. This is a lot less
          | friction than using a database.
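          | 
          | A minimal sketch of that dict-like usage (file name
          | illustrative; keys and values are stored as bytes):
          | 
          |     import dbm
          | 
          |     # "c" creates the file if needed; use it like a dict.
          |     with dbm.open("cache.db", "c") as db:
          |         db["hello"] = "world"   # str is encoded to bytes
          |         print(db["hello"])      # b'world'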
        
           | giraffe_lady wrote:
           | Oh yeah for sure. I don't mean to convey any disdain for dbm
           | itself or for having it rolled into the standard library.
           | Reading and writing custom files for "convenience
           | persistence" or configuration is one of those really common
           | things that you don't usually need to actually do if you know
           | about tools like this.
           | 
           | I was just pointing out that the title and even the article
           | seem to associate this system with python per se, rather than
           | understanding it as python's interface to a common and pre-
           | existing system.
        
       | Someone wrote:
       | FTA: Why is sqlite3 so slow?
       | 
        | My money is on this being mostly because sqlite actually stores
        | the data. I don't think _dbm_ guarantees that any of the data
        | written makes it to disk until you _sync_ or _close_ the
        | database.
       | 
       | This writes, guesstimating, on the order of 100 kilobytes, so
       | chances are the _dbm_ data never hits the disk.
        
       | akubera wrote:
        | There's also the shelve[0] module, which allows storing any
        | pickleable object in a persistent key-value store, not just
        | strings/bytes. I've found it very handy for caching while
        | developing scripts that query remote resources, without having
        | to worry about serialization.
       | 
       | [0] https://docs.python.org/3.10/library/shelve.html
       | 
       | Obligatory pickle note: one should be aware of pickle security
       | implications and should not open a "Shelf" provided by untrusted
       | sources, or rather should treat opening a shelf (or any pickle
       | deserialization operation for that matter) as running an
       | arbitrary Python script (which cannot be read).
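        | 
        | A minimal sketch of that caching pattern (file name and key are
        | illustrative; only open shelf files you created yourself):
        | 
        |     import shelve
        | 
        |     # Keys are strings; values can be any pickleable object.
        |     with shelve.open("api_cache") as db:
        |         if "user/42" not in db:
        |             db["user/42"] = {"name": "Ada", "tags": ["x"]}
        |         print(db["user/42"]["name"])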
        
         | cristoperb wrote:
         | I once wrote a locking wrapper around the shelve module so I
         | could use it as a thread- and multiprocess-safe key-value cache
          | (including a wrapper around the requests module's `get()` to
         | transparently cache/validate http resources according to
         | headers):
         | 
         | https://github.com/cristoper/shelfcache
         | 
            | It works despite some cross-platform issues (flock and
            | macOS's version of gdbm interacting to create a deadlock),
            | but if I were to do it again I would just use sqlite (which
            | Python's standard library has an interface for).
        
           | TillE wrote:
            | > but if I were to do it again I would just use sqlite
           | 
           | Yeah, I tried to use shelve for some very simple stuff
           | because it seemed like a great fit, but ultimately found that
           | I had a much better time with tortoise-orm on top of sqlite.
           | 
           | If you need _any_ kind of real feature, just use sqlite.
        
         | PartiallyTyped wrote:
          | I like to use `joblib.Memory` for my caching, as it catches
          | changes in a function's code and reruns it instead of loading
          | from the cache, then overwrites the old result.
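          | 
          | Roughly like this (cache directory name illustrative):
          | 
          |     from joblib import Memory
          | 
          |     memory = Memory("./cachedir", verbose=0)
          | 
          |     @memory.cache  # reruns if the function's code changes
          |     def slow_query(x):
          |         return x * 2  # stand-in for expensive work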
        
         | mrweasel wrote:
          | Can you safely "pickle" Python objects across different
          | architectures and Python versions (assuming we forget about
          | Python 2)?
        
           | NegativeLatency wrote:
            | The pickle format is versioned, so you might be able to do
            | it across versions, though I suspect the version has changed
            | for security reasons over time.
        
           | Spivak wrote:
            | Pickle could _in theory_ be architecture dependent, since
            | __getstate__ and __setstate__ are user-provided hooks. But
            | you would have to try to do that on purpose.
           | 
            | And you don't even have to forget about Python 2! If you use
            | format version 2 you can pickle objects in every version
            | from Python 2.3+, and all pickle formats are promised to be
            | backwards compatible. If you only care about Python 3 then
            | you can use version 3 and it will work for all Python 3.0+.
           | 
           | https://docs.python.org/3/library/pickle.html#data-stream-
           | fo...
           | 
           | The reason against using pickle hasn't changed though, if you
           | wouldn't exec() it, don't unpickle it. If you're going to
           | send it over the network use MAC use MAC use MAC. Seriously,
           | it's built in -- the hmac module.
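            | 
            | A minimal sketch of that MAC-then-unpickle pattern (key
            | handling omitted; names illustrative):
            | 
            |     import hashlib, hmac, pickle
            | 
            |     SECRET = b"shared-secret"  # from a secrets store
            | 
            |     def pack(obj):
            |         data = pickle.dumps(obj, protocol=2)
            |         mac = hmac.new(SECRET, data, hashlib.sha256).digest()
            |         return mac + data
            | 
            |     def unpack(blob):
            |         mac, data = blob[:32], blob[32:]
            |         good = hmac.new(SECRET, data, hashlib.sha256).digest()
            |         if not hmac.compare_digest(mac, good):
            |             raise ValueError("bad MAC; refusing to unpickle")
            |         return pickle.loads(data)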
        
             | mmcgaha wrote:
              | We had a program that was sending pickled session state
              | as a cookie. We solved that by packing the cookie as a
              | random string, a timestamp, the object, and a MAC. We
              | validated the MAC, then checked the timestamp, and finally
              | unpickled the object. It still bothers me that we are
              | unpickling data passed by the client, but I ran out of
              | arguments against doing it.
        
         | clysm wrote:
          | Another problem with pickles in any sort of living code base
          | is when one makes modifications to the type itself - renaming,
          | refactoring, whatever. Pickling objects (and nested objects)
          | that aren't explicitly meant for data
          | storage/retention/transmission leads to headaches.
         | 
         | It's best to use another dedicated type or library specifically
         | for this task.
        
         | formerly_proven wrote:
         | There's even an object database built around pickle, ZODB. It
         | even has a networked implementation with replication /
         | failover. Used to be part of Zope, originally written in the
         | 1998 time frame or so - GvR actually committed a bunch of stuff
         | to it.
        
           | nas wrote:
           | ZODB is awesome and overlooked, IMHO. I'm biased I guess
           | because I was involved in making Durus which is inspired by
           | ZODB. The ZODB model is not appropriate for all applications
           | (optimistic concurrency control) but for the applications it
           | works for, it's great. Very easy to develop with (no
           | relational-to-OO mismatch) and performance can be great if
           | you design your model carefully. The client caching of data
           | is great. It is a bit like memcache but the data is already
           | there as objects in RAM. The database server will invalidate
           | the cache for you, no manual invalidation needed.
        
             | mwcampbell wrote:
             | I developed a web application in 2004 using Quixote and
             | Durus. I wonder how many developers outside of MEMS
             | Exchange ever used both of those packages. Somehow I had
             | not yet encountered a proper ORM (I didn't discover
             | SQLObject until later, and Django wasn't out yet), so I
             | liked Durus at the time. That web application is still
             | running, and in the years since then, I've had to handle
             | several escalated customer support cases, and I often
             | wished I had chosen a relational database so I could easily
             | do ad-hoc queries. So Durus probably wasn't the best choice
             | for my application, but that's not your fault. And one
             | thing I liked about Durus was that the implementation was
             | simple enough that I felt I could really understand how it
             | worked. So thanks for that.
        
       | nijave wrote:
        | I wonder how this compares to using the filesystem as the
        | store, with keys as filenames or file paths and the value
        | stored in the file?
        
       | j0hnyl wrote:
       | I wonder if it's still faster than sqlite in 2022.
        
         | cbdumas wrote:
          | The script linked at the bottom of this page uses sqlite
          | without an index on the table. In my testing, if you add an
          | index, sqlite quickly outperforms dbm in terms of read
          | performance.
        
         | calpaterson wrote:
          | According to his benchmark run today on Python 3.9 on Linux,
          | dbm is still considerably faster for random access.
          | 
          | That said, which database dbm uses varies a lot by platform.
          | I think on Linux it's usually BerkeleyDB.
          | 
          | I use dbm a lot to cache pickles of objects that were slow to
          | create, e.g. pandas dataframes.
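          | 
          | That caching pattern, as a hedged sketch (names illustrative):
          | 
          |     import dbm, pickle
          | 
          |     def cached(key, build):
          |         with dbm.open("cache.db", "c") as db:
          |             if key not in db:
          |                 # slow to create, cheap to reload
          |                 db[key] = pickle.dumps(build())
          |             return pickle.loads(db[key])
          | 
          |     # df = cached("big_frame", make_expensive_dataframe)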
        
           | curiousgal wrote:
            | Sorry if it's a dumb question, but since dbm writes stuff
            | to disk, how is that different from loading the pickles?
        
             | dfox wrote:
              | You get one file for the whole store, not one per key.
              | Also, it tends to handle lots of keys significantly better
              | than filesystems handle lots of files in one directory.
        
               | curiousgal wrote:
               | Thank you!
        
           | qalmakka wrote:
            | I think on Linux it's often GNU dbm; Berkeley DB is a
            | different, better implementation, now owned by Oracle, that
            | requires a third-party module.
        
       | gerty wrote:
       | It's not built-in, of course, but if you can live with the quirky
       | API, LMDB as an embedded key-value store will be hard to beat in
       | terms of speed. It can also do "one-writer, many-readers" and
       | transactions.
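        | 
        | With the third-party lmdb package, the quirky API looks roughly
        | like this (path and map size illustrative):
        | 
        |     import lmdb  # pip install lmdb
        | 
        |     env = lmdb.open("mydb", map_size=2**30)  # pre-size the map
        |     with env.begin(write=True) as txn:       # one writer
        |         txn.put(b"key", b"value")
        |     with env.begin() as txn:                 # many readers
        |         print(txn.get(b"key"))               # b'value'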
        
       | cbdumas wrote:
       | So I read the script used to compare 'dbm' and 'sqlite3', and in
       | sqlite it creates a table with no index. Hard to take that
       | comparison seriously. I wrote a little benchmark script of my own
       | just now and sqlite3 beats DBM handily if you add an index on the
       | key.
        
         | simonw wrote:
         | Yeah I tried that myself - I took that benchmark script and
          | changed the CREATE TABLE lines to look like this (adding the
          | primary key):
          | 
          |     CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)
          | 
          | Here are the results before I made that change:
          | 
          |     sqlite             Took 0.032 seconds, 3.19541 microseconds / record
          |     dict_open          Took 0.002 seconds, 0.20261 microseconds / record
          |     dbm_open           Took 0.043 seconds, 4.26550 microseconds / record
          |     sqlite3_mem_open   Took 2.240 seconds, 224.02620 microseconds / record
          |     sqlite3_file_open  Took 7.119 seconds, 711.87410 microseconds / record
          | 
          | And here's what I got after adding the primary keys:
          | 
          |     sqlite             Took 0.040 seconds, 3.97618 microseconds / record
          |     dict_open          Took 0.002 seconds, 0.19641 microseconds / record
          |     dbm_open           Took 0.042 seconds, 4.18961 microseconds / record
          |     sqlite3_mem_open   Took 0.116 seconds, 11.58359 microseconds / record
          |     sqlite3_file_open  Took 5.571 seconds, 557.13968 microseconds / record
         | 
         | My code is here:
         | https://gist.github.com/simonw/019ddf08150178d49f4967cc38356...
        
           | cbdumas wrote:
           | Yeah now that I dig in a little further it looks like it's
           | not as clear cut as I thought. DBM read performance is better
           | for me with larger test cases than I was initially using,
           | though I am getting some weird performance hiccups when I use
           | very large test cases. It looks like I'm too late to edit my
           | top level response to reflect that.
        
           | divbzero wrote:
            | Just to spell out the results: adding the primary key
            | appears to improve SQLite performance, but it still falls
            | short of DBM.
        
         | banana_giraffe wrote:
          | The index is the least of the issues with the SQLite
          | implementation. It's calling one INSERT per record in that
          | version, so the benchmark is spending something like 99.8% of
          | its time opening and closing transactions as it sets up the
          | database.
         | 
         | Fixing that on my machine took the sqlite3_file_open benchmark
         | from 16.910 seconds to 1.033 seconds. Adding the index brought
         | it down to 0.040 seconds.
         | 
         | Also, I've never really dug into what's going on, but the dbm
         | implementation is pretty slow on Windows, at least when I've
         | tried to use it.
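          | 
          | A sketch of the batched version (table name as in the
          | benchmark; row count illustrative):
          | 
          |     import sqlite3
          | 
          |     rows = [(f"k{i}", f"v{i}") for i in range(10_000)]
          |     conn = sqlite3.connect("bench.db")
          |     with conn:  # one transaction for the whole batch
          |         conn.execute(
          |             "CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)")
          |         conn.executemany("INSERT INTO store VALUES (?, ?)", rows)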
        
           | avinassh wrote:
            | I was working on a project to insert a billion rows into
            | SQLite in under a minute; batching the inserts made it crazy
            | fast compared to individual transactions.
           | 
           | link: https://avi.im/blag/2021/fast-sqlite-inserts/
        
           | Steltek wrote:
           | For use cases where you want a very simple key-value store,
           | working with single records is probably a good test?
        
             | banana_giraffe wrote:
             | Maybe?
             | 
              | Sure, mutating data sets might be a useful use case. But
              | inserting thousands of items one at a time in a tight
              | loop, then asking for all of them, is testing an unusual
              | use case in my opinion.
              | 
              | My point was that we're comparing apples and oranges. By
              | default, I think, Python's dbm implementation doesn't do
              | any sort of transaction or even a sync after every insert,
              | whereas SQLite does have a pretty hefty atomicity
              | guarantee after each INSERT, so they're quite different
              | actions.
        
           | [deleted]
        
           | OGWhales wrote:
           | > Also, I've never really dug into what's going on, but the
           | dbm implementation is pretty slow on Windows, at least when
           | I've tried to use it.
           | 
           | Seems like this would be why:
           | https://news.ycombinator.com/item?id=32852333
        
             | banana_giraffe wrote:
             | "the slow-but-simple implementation in module dbm.dumb will
             | be used"
             | 
             | No kidding.
             | 
             | https://github.com/python/cpython/blob/main/Lib/dbm/dumb.py
             | 
             | It uses a text file to store keys and pointers into a
             | binary file for values. It works .. most of the time .. but
             | yeah, that's not going to win any speed awards.
        
           | jacobr1 wrote:
           | Also you can use WAL
        
       | MrBuddyCasino wrote:
        | PHP also has support for several Berkeley-style databases:
       | https://www.php.net/manual/en/dba.requirements.php
        
       | pmarreck wrote:
       | For the record, so does Erlang/Elixir, right in their stdlib
       | (OTP), as either ETS, DETS, or Mnesia
       | https://en.wikipedia.org/wiki/Mnesia
       | https://www.erlang.org/doc/man/mnesia.html
        
       | Snawoot wrote:
        | There is no point in using sqlite3 in the default journaling
        | mode. I bet the results may be even better than dbm's if you use
        | 
        |     PRAGMA journal_mode=WAL;
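        | 
        | In Python that's a single pragma after connecting (the setting
        | persists in the database file):
        | 
        |     import sqlite3
        | 
        |     conn = sqlite3.connect("store.db")
        |     conn.execute("PRAGMA journal_mode=WAL")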
        
         | bob1029 wrote:
          | > There is no point in using sqlite3 in the default journaling
          | > mode.
         | 
         | There are situations where you wouldn't want to, but they are
         | probably very uncommon outside of the embedded computing world.
         | Copying 3 files vs 1 is not a gigantic deal. Most of the time
         | you aren't even moving a SQLite database around.
        
           | quietbritishjim wrote:
           | There's never a need to copy 3 files anyway. If the database
           | is closed cleanly then the WAL file and lock file are
           | deleted. If not (either it's still open or not closed
           | cleanly) then I think any half-finished transactions will be
           | discarded if you copy all files to a new location. Certainly
           | safest not to copy the 2 extra files in any case.
        
           | WorldMaker wrote:
           | You can still copy _just_ the primary file and not the WAL or
           | any other files if you want to live dangerously and possibly
           | lose transactions if there are write operations in parallel.
           | 
           | Of course, you'd be better off with WAL-based replication
           | tech like litestream instead of plain file copies if you are
           | truly worried about parallel operations during your file
           | copies.
        
             | epilys wrote:
             | You can even distribute individual WAL frames and arrange
             | distributed writes with raft consensus and a time lease. I
             | never formally modelled this but it seemed to work
              | perfectly: the approach only lacked checkpointing the WAL
              | file and synchronising the database file across raft nodes.
        
         | nonethewiser wrote:
         | more info https://sqlite.org/wal.html
        
         | kuschku wrote:
         | That only works if all processes accessing it share a memory
         | namespace (aka, are not in containers/jails/VMs and on the same
         | physical machine).
        
           | singron wrote:
            | You are probably thinking of issues with networked
            | filesystems like nfs, 9p, or vboxsf, where you can't mmap a
            | file and actually share memory. On basically any real
            | filesystem, two processes that open the same file can share
            | memory.
        
           | joppy wrote:
           | WAL works for different processes accessing the same SQLite
           | database. It's meant to improve concurrency, in fact.
        
             | kuschku wrote:
             | Yes, but the processes still need to share a memory
             | namespace. In WAL mode, SQLite creates a shared memory file
             | which gets mapped into any process trying to access it
             | (that's why WAL mode creates additional .shm files).
             | 
             | But that only works if all given processes run on the same
             | physical machine, and not in containers/jails/VM.
        
               | formerly_proven wrote:
               | I dunno what SQLite is doing specifically but you can
               | certainly mmap a file across containers and it behaves
               | like you'd think. SELinux and other hardening options
               | might interfere though, because shared mmaps should imho
               | be viewed as a very likely vector for cross-process
               | exploitation.
        
           | fuckstick wrote:
           | There isn't really a thing called a memory namespace, in
           | Linux or the BSDs at least. In Linux there are cgroups that
           | deal with accounting and you can share VM space among cloned
           | processes (which we then typically refer to as threads).
           | Neither of these affect the ability to share memory among
           | processes.
           | 
           | The only namespace that matters here is the filesystem/mount
           | namespace.
           | 
           | There is no reason you can't access the same shared SQLite
           | database on a common volume between containers on Linux for
           | instance.
        
       | wodenokoto wrote:
        | According to the documentation it is not so much an included db
        | as an interface to one:
       | 
       | > dbm is a generic interface to variants of the DBM database --
       | dbm.gnu or dbm.ndbm. If none of these modules is installed, the
       | slow-but-simple implementation in module dbm.dumb will be used.
       | There is a third party interface to the Oracle Berkeley DB
       | 
       | https://docs.python.org/3/library/dbm.html
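        | 
        | You can check which backend ends up behind a given file with
        | dbm.whichdb, e.g.:
        | 
        |     import dbm
        | 
        |     with dbm.open("example.db", "c") as db:
        |         db[b"k"] = b"v"
        |     # Returns e.g. 'dbm.gnu', 'dbm.ndbm', or 'dbm.dumb'
        |     print(dbm.whichdb("example.db"))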
        
         | tootie wrote:
         | Yeah, dbm is a posix database utility that's been around
          | forever. There are a few implementations available, and there
          | are bindings for most languages.
         | 
         | https://en.wikipedia.org/wiki/DBM_(computing)
        
         | chrismorgan wrote:
         | When it includes a fallback implementation so that you can rely
         | on it working, I think it's fair to call it "included".
        
           | cozzyd wrote:
           | Right but the on-disk formats between versions won't be
           | compatible...
        
           | qalmakka wrote:
           | The main issue here is that, given it relies on whatever
           | implementation of DBM you have installed, its performance
           | will vary considerably from OS to OS.
           | 
            | I don't know how fast or slow the "dumb" implementation is,
            | but I can bet it is way slower than gdbm. I can see someone
            | using this module, considering it "fast enough" on GNU/Linux,
            | and then finding out that it's painfully slow on Windows,
            | where shipping binaries of another implementation is a
            | massive PITA.
        
         | tyingq wrote:
          | Each of which has its own set of limitations and watch-outs.
          | I would not use the Python dbm interface for anything serious.
          | Ndbm, for example, has a per-item value limit that varies
          | between ~1-4k.
        
       | isoprophlex wrote:
        | There's also the 'diskcache' library, which gives you a
        | persistent dictionary that stores any kind of pickleable Python
        | object. There's also a nice persistent memoization decorator,
        | with an optional TTL... I've always found 'diskcache' to be
        | very ergonomic.
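        | 
        | A minimal sketch (directory name and TTL illustrative):
        | 
        |     from diskcache import Cache  # pip install diskcache
        | 
        |     cache = Cache("./cachedir")
        |     cache["key"] = {"any": "pickleable object"}
        | 
        |     @cache.memoize(expire=3600)  # persistent memoization + TTL
        |     def slow(x):
        |         return x * 2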
        
         | BiteCode_dev wrote:
         | If you can pip install something, I would also recommend
         | diskcache:
         | 
          | - it's backed by sqlite, so the data is much more secure
         | 
         | - you can access it from several processes
         | 
         | - it's more portable
        
       | syncsynchalt wrote:
       | This is libdbm, a relatively ancient db library that you could
       | find on most unixes.
       | 
       | In the 90s you might target dbm for your portable Unix
       | application because you could reasonably expect it to be
        | implemented on your platform. It lost a lot of share to gdbm and
        | Sleepycat's BerkeleyDB, both of which I consider its successors.
        | 
        | Of course, all of this is now sqlite3.
        
       | [deleted]
        
       | blibble wrote:
       | the gnu version of dbm has some nice surprises
       | 
       | if you go and look at the binary database you'll see random other
       | bits of memory in there, because it doesn't initialise most of
       | its buffers
        
       | ridiculous_fish wrote:
       | In AppleScript you can mark a variable as a property, and then
       | its value is automatically remembered across runs. For example a
       | script can easily remember how many times it has run:
        | 
        |     property runCount : 0
        |     set runCount to runCount + 1
       | 
       | So natural to use!
       | 
       | https://developer.apple.com/library/archive/documentation/Ap...
        
         | nathancahill wrote:
          | I feel like AppleScript could have really shone if things had
          | played out differently in Mac OS automation. I learned to code
          | with AppleScript on OS 9 and it really got me into
          | programming. MacScripter.net was my old hangout. Good times.
        
       | im3w1l wrote:
       | As far as I can tell this doesn't support transactions, which
       | imho is table stakes for a persistence solution in 2022.
        
         | eesmith wrote:
         | FWIW, it was added in 1992.
         | https://github.com/python/cpython/commit/dd9ed839d6958f73704...
         | with gdbm added in 1994.
         | https://github.com/python/cpython/commit/4b4c664d2e93279c8d7...
         | .
        
           | im3w1l wrote:
            | The article makes it sound like something to consider today
            | (well, 2018); that's why I think it's fair to see how well
            | it satisfies modern requirements.
        
             | ok123456 wrote:
             | If you just need a simple key value store that stays on
             | disk, it's fine.
        
             | jonnycomputer wrote:
              | requirements _for what_ is the question. "Modern
              | requirements" doesn't mean that much.
        
         | pjc50 wrote:
         | DBM is the persistence solution of 1979.
         | https://en.wikipedia.org/wiki/DBM_(computing)
        
       | Zababa wrote:
       | > Why is sqlite3 so slow? Well, the benchmark is probably not
       | representative of the typical workload for sqlite (lots of
       | individual insertions and selections). If we perform the same
       | operations using executemany and one select of all the keys at
       | once, we get:
       | 
       | I think this is because each time you call execute (which is
       | probably sqlite3_exec under the hood), your statement is prepared
       | again, and then deleted, while with executemany it's prepared
       | once and then used with the data. According to the SQLite3
        | documentation of sqlite3_exec [1]:
       | 
       | > The sqlite3_exec() interface is a convenience wrapper around
       | sqlite3_prepare_v2(), sqlite3_step(), and sqlite3_finalize(),
       | that allows an application to run multiple statements of SQL
       | without having to use a lot of C code.
       | 
        | I knew that prepared statements are faster when you execute a
        | query many times, but it seems that's not exactly the full
        | story: all statements are prepared, and the performance
        | improvement comes from preparing (and deleting) the statement
        | only once. The documentation page about the prepared statement
        | object [2] has a good explanation of the lifecycle of a
        | prepared statement, which also seems to be the lifecycle of all
        | statements.
       | 
       | [1]: https://www.sqlite.org/c3ref/exec.html
       | 
       | [2]: https://www.sqlite.org/c3ref/stmt.html
        
         | quietbritishjim wrote:
          | The Python sqlite3 module caches prepared statements, I think
          | the last 100 by default, but this can be overridden. It does
          | require you to use proper parameters rather than string
          | interpolation for the cache to be effective (I didn't check
          | their benchmark code to see if that's the case).
         | 
         | As others have pointed out, it's the lack of a suitable index
         | that's the real problem here.
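          | 
          | Concretely, the cached form is the parameterized one (table
          | from the benchmark; key illustrative):
          | 
          |     import sqlite3
          | 
          |     conn = sqlite3.connect(":memory:")  # cached_statements=100
          |     conn.execute(
          |         "CREATE TABLE store(key TEXT PRIMARY KEY, value TEXT)")
          |     key = "k1"
          |     # Reuses a cached prepared statement (same SQL text):
          |     conn.execute("SELECT value FROM store WHERE key = ?", (key,))
          |     # Defeats the cache (SQL text differs for every key):
          |     conn.execute("SELECT value FROM store WHERE key = '%s'" % key)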
        
       | jonnycomputer wrote:
       | I did not know this either. Wow.
        
       | collyw wrote:
       | I thought Perl's Tie::DB_file module was unique for that sort of
       | thing. That's probably quite useful.
        
       ___________________________________________________________________
       (page generated 2022-09-15 23:00 UTC)