[HN Gopher] What's the big deal about embedded key-value databases?
       What's the big deal about embedded key-value databases?
       Author : eatonphil
       Score  : 96 points
       Date   : 2022-08-23 15:58 UTC (7 hours ago)
 (HTM) web link (notes.eatonphil.com)
 (TXT) w3m dump (notes.eatonphil.com)
       | eis wrote:
       | A few more entries that might be of interest:
       | * DynamoDB and the Dynamo KV store
       | * LMDB (embedded kv)
       | * Dgraph (distributed graph db) and its embedded kv store,
       |   BadgerDB
       | didgetmaster wrote:
       | I am building a general-purpose data management system called
       | Didgets (https://didgets.com/) that extensively uses KV stores
       | that I invented. Since it was primarily designed to be a file
       | system replacement, I used them for attaching contextual
       | metadata tags to file objects.
       | My whole container started to look like a sparsely populated
       | relational table where every row/column intersection could have
       | multiple values (e.g. a photo could have a tag for every person
       | in the picture attached). I started experimenting with using the
       | KV stores as columns to form regular relational tables.
       | It turns out that it was relatively easy and was extremely fast.
       | I started building tables with 50+ million rows and many columns
       | and performing queries against them. Benchmarking the system
       | against other databases revealed that it was very fast (and
       | didn't need separate indexes to accomplish this).
       | Here is a video showing how it does a bunch of queries 10x faster
       | than the same data stored in a highly indexed table in Postgres:
       | https://www.youtube.com/watch?v=OVICKCkWMZE
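A minimal sketch of the "KV stores as columns" idea described above. This is not Didgets' actual implementation; the `ColumnStore` class and its method names are hypothetical, and each column is modelled as an in-memory map from row id to a list of values so that one cell can hold several values.

```python
from collections import defaultdict


class ColumnStore:
    """Hypothetical sketch: one key-value store per column (row id -> values)."""

    def __init__(self):
        # column name -> {row id -> list of values in that cell}
        self.columns = defaultdict(lambda: defaultdict(list))

    def put(self, row_id, column, value):
        # A cell may hold several values, e.g. one tag per person in a photo.
        self.columns[column][row_id].append(value)

    def get(self, row_id, column):
        col = self.columns.get(column)
        if col is None:
            return []
        return list(col.get(row_id, []))

    def where(self, column, predicate):
        # A query touches only the column it filters on, not whole rows.
        col = self.columns.get(column, {})
        return sorted(
            row_id
            for row_id, values in col.items()
            if any(predicate(v) for v in values)
        )

    def select(self, row_ids, columns):
        # Reassemble relational-looking rows from the per-column stores.
        return [{c: self.get(r, c) for c in columns} for r in row_ids]
```

Usage: `store.put(1, "person", "alice")` followed by `store.where("person", lambda v: v == "alice")` returns the matching row ids, which `select` can then turn back into rows.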
       | atmin wrote:
       | No mention of SQLite as an embedded SQL database?
         | eatonphil wrote:
         | This post is about key-value stores.
         | While FoundationDB uses SQLite, I didn't otherwise think of
         | SQLite as being relevant here. :)
       | morelisp wrote:
       | "Time is a flat circle." - someone at Sleepycat, probably.
       | rad_gruchalski wrote:
       | This is a good read. By the way, Kafka Streams is also built on
       | top of RocksDB. Not strictly a database but relevant to a certain
       | extent.
       | Xeoncross wrote:
       | I highly recommend people comfortable with Go check out the
       | building blocks at https://github.com/thomasjungblut/go-sstables
       | This codebase shows how SSTables, WAL, memtables, recordio,
       | skiplists, segment files, and other storage engine components
       | work in a digestible way. Includes a demo database showing how it
       | all comes together to make a RocksDB / LevelDB competitor (not
       | really).
       | tristan957 wrote:
       | I work on a storage engine at $dayJob. We have created a
       | connector for MongoDB, although for a very ancient version. We
       | are currently working with $cloudProvider to use our storage
       | engine in their cloud DBaaS offerings.
       | This field is pretty interesting when you're talking about
       | performance vs space amp vs write amp vs read amp.
       | aviramha wrote:
       | Great article! One cool thing about RocksDB is that it's
       | actually even used in other KV databases, such as Redis on Flash:
       | https://redis.com/blog/hood-redis-enterprise-flash-database-...
         | eatonphil wrote:
         | Yup, FB's ZippyDB [0] is another example mentioned in the
         | article.
         | [0] https://engineering.fb.com/2021/08/06/core-data/zippydb/
         | Edit: I've added Redis Enterprise Flash to the list now.
         | Thanks!
         | dboreham wrote:
         | The article misses the point. All data storage and query
         | systems end up architected in layers. Upper layers deal with
         | higher abstractions (objects, rows, whatever). Lower layers
         | deal with simpler functions, closer to the hardware. The upper
         | layers are consumers of the lower layers. This is where
         | "embedded KV stores" like LevelDB, RocksDB, etc come from. They
         | began as the embedded storage layer for some bigger thing.
         | Every product you think of as a database or document store is
         | built like this, including MySQL and PostgreSQL and Oracle.
         | Such a storage layer, shipped as an independent library, is how
         | you (or anyone) builds your own database-ish thing. That's what
         | the article should say.
         | The list of examples is odd. For instance, MongoRocks is cited
         | for using RocksDB, but actual stock MongoDB uses WiredTiger,
         | which isn't mentioned.
         | Disclosure: I played a part in the late-beginning of this space
         | when Netscape funded Sleepycat to develop BerkeleyDB. dbm and
         | ndbm existed beforehand, but BerkeleyDB used in LDAP servers is
         | I think the genesis point for this pattern as it exists today.
           | eatonphil wrote:
           | If there's a difference between what you wrote and what I
           | wrote I'm missing it.
           | But you're also welcome to write your own post. :)
             | morelisp wrote:
             | I do feel like there's a historical perspective missing
             | from the article which the GP touches on. Embedded KV
             | stores aren't new (although some of the algorithms behind
             | the current crop certainly are). They used to dominate
             | "backend" software development until their popularity waned
             | as the world got obsessed with "model the domain, damn the
             | computation cost" (because all resources were doubling or
             | more yearly) followed by "we'll just distribute it".
             | The need for parallelism killed the first approach and the
             | cost of increasingly complex reduce steps killed the
             | second. Now we're back to "how much can we fit in RAM on a
             | local machine" and it turns out, if you can still bang bits
             | for smart key formats, a hell of a lot.
           | galaxyLogic wrote:
           | > Upper layers deal with higher abstractions (objects, rows,
           | whatever)
           | Right, I'm waiting for a standard for a level above relational
           | databases, which is object databases. I know there are several
           | already, and there are object-relational mapping layers.
           | I think the key point there is that object databases are a
           | level ABOVE relational databases. They are not "better", but
           | they deal with the higher level of objects rather than
           | "tables", just as relational databases can be seen as a level
           | above key-value stores.
           | I would like Object databases to become better and easier to
           | use and more standardized.
           | I think there is value in being able to see both levels: the
           | objects, and the relational data that makes up the objects.
             | morelisp wrote:
             | Neither objects nor relations are "above" the other. You
             | can map them in a vacuous mathematical sense, but it's a
             | massively leaky abstraction in either direction.
               | eatonphil wrote:
               | Some concrete examples:
               | 1. Yugabyte's relational query layer sits on top of a
               | document store (DocDB):
               | https://www.yugabyte.com/blog/how-we-built-a-high-performanc....
               | 2. You can put documents in a PostgreSQL JSON(B) column.
           | nicholasjarnold wrote:
           | > They began as the embedded storage layer for some bigger
           | thing.
           | I immediately thought of Kafka's streaming query stuff when I
           | read the headline (ksqlDB). I'm not sure if that's the origin
           | story of RocksDB, but it's the storage engine underlying that
           | streaming query tooling in Kafka's ecosystem.
       | eis wrote:
       | TiKV is not an embedded key-value store, it is distributed.
         | eatonphil wrote:
         | Thanks! Fixed and attributed you at the end.
       | x3n0ph3n3 wrote:
       | My team has a use-case that involves a precomputed RocksDB
       | database saved on an AWS EFS volume that is mounted on a lambda
       | with 100's-1000's of invocations per second. It allows for some
       | extremely fast querying of relatively static data. Another
       | process is responsible for periodically updating the database and
       | writing it back to the EFS volume.
       | samsquire wrote:
       | With Rockset's converged indexes and an SQL query optimiser you
       | can build an SQL database.
       | https://rockset.com/blog/converged-indexing-the-secret-sauce...
       | Rockset's converged indexes + denormalisation means you can have
       | fast querying.
       | mprovost wrote:
       | I feel like this is missing any mention of the history of KV
       | stores. Unix came with an embedded database (dbm) from the early
       | days (1979) [0] which was rewritten at Berkeley into the more
       | popular bdb in the 80s. [1] Sendmail was one of the more common
       | programs that used it. And then when djb built his replacement
       | for sendmail, qmail, he invented cdb. [2]
       | [0] https://en.wikipedia.org/wiki/DBM_(computing)
       | [1] https://en.wikipedia.org/wiki/Berkeley_DB
       | [2] https://cr.yp.to/cdb.html
       | LAC-Tech wrote:
       | When I read about event sourcing, my mind immediately went to how
       | that would map to a K/V database. Has anyone done this in
       | production?
       | Also - no mention of LMDB? RocksDB and LMDB feel like the ones
       | that stand out in that field - levelDB definitely had a
       | reputation for corrupting data.
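One way the event sourcing question above could map onto an ordered key-value store, as a rough sketch only. `OrderedKV` is a hypothetical in-memory stand-in for an ordered embedded store, and the key layout (stream id, a NUL separator, then a big-endian sequence number) is an assumption for illustration; it requires stream ids that contain no NUL byte.

```python
import bisect
import json
import struct


class OrderedKV:
    """Toy stand-in for an ordered embedded KV store (keys sorted bytewise)."""

    def __init__(self):
        self._keys = []    # sorted list of byte keys
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix: bytes):
        # Seek to the first key >= prefix, then read while the prefix matches.
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def event_key(stream: str, seq: int) -> bytes:
    # Big-endian encoding makes bytewise key order equal numeric order.
    return stream.encode() + b"\x00" + struct.pack(">Q", seq)


def append(kv: OrderedKV, stream: str, seq: int, event: dict) -> None:
    kv.put(event_key(stream, seq), json.dumps(event).encode())


def replay(kv: OrderedKV, stream: str) -> list:
    # One prefix scan returns the stream's events in sequence order.
    prefix = stream.encode() + b"\x00"
    return [json.loads(value) for _, value in kv.scan_prefix(prefix)]
```

Replaying a stream is then a single range scan, which is the access pattern ordered stores such as RocksDB or LMDB are built around.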
       | adammarples wrote:
       | Plug for my python dict wrapper
       | https://github.com/adammarples/rocksdbdict
       | Adiqq wrote:
       | Honestly, I'm still not sure why I would use something like
       | RocksDB instead of, or in addition to, plain
       | PostgreSQL/MongoDB/Redis instances.
       | I don't work with a lot of data, but my decisions are typically
       | based on basic factors and purpose:
       | PostgreSQL - SQL, structured data, cannot scale horizontally
       | MongoDB - NoSQL, unstructured data
       | Redis - key-value, distributed cache
       | I get that you can replace the storage engine and theoretically
       | get more performance, but in practice compatibility and
       | standardization are more important, because a lot of products
       | (including third-party ones) will already use
       | PostgreSQL/MongoDB/Redis, so it's a no-brainer to use them as
       | well for your solution.
       | However, for me to pick RocksDB or some other new, shiny
       | database/storage engine, there would have to be more compelling
       | reasons.
         | jzelinskie wrote:
         | Unless you are building a database, these embedded KV store
         | libraries are less likely to be the best solution for the job.
         | If you are considering them for an app that isn't a database,
         | you should also take a long, hard look at SQLite first.
         | What's also interesting is the trend of newer distributed
         | "database systems" like Vitess[0] or SpiceDB[1] that forego
         | embedded KV stores and instead reuse existing SQL databases as
         | their "embedded database". Vitess leverages MySQL and SpiceDB
         | leverages MySQL, PostgreSQL, CockroachDB, or Spanner. Systems
         | built this way get to leverage many high-level features from
         | existing database systems such that they can focus on
         | innovating in even higher-level functionality. In the case of
         | Vitess, it's scaling, distributing, and schema management of
         | MySQL. In the case of SpiceDB, it's building a database
         | specifically optimized for querying access control data in a
         | way that can coordinate with causality across multiple
         | services.
         | [0]: https://github.com/vitessio/vitess
         | [1]: https://github.com/authzed/spicedb
         | zarzavat wrote:
         | In your list RocksDB is most like Redis, but even faster
         | because the data doesn't have to leave the process.
         | Think of it as a high performance sports car like a Ferrari.
         | It's not good at taking the kids to school or buying groceries.
         | But if you need to prioritise performance at the expense of all
         | other considerations then it's exactly what you need.
         | Xeoncross wrote:
         | Like S3 or Redis, RocksDB is much more performant when you
         | don't need the query engine and want to have highly compact
         | storage with fast lookups and high write throughput.
         | Storage engines are different levels of complexity based on the
         | query requirements. Simple K/V stores can run circles around
         | Postgres/MySQL as long as you don't need the extra features.
       | rajko_rad wrote:
       | Two more examples to check out: Yugabyte also persists with
       | RocksDB:
       | https://www.yugabyte.com/blog/how-we-built-a-high-performanc...
       | And this is very cool, distributed SQLite with FDB:
       | https://univalence.me/posts/mvsqlite
         | eatonphil wrote:
         | Thank you, edited to include Yugabyte!
       | ramoz wrote:
       | We should see a rise in embedded KV popularity in correlation
       | with ML applications. Storing embeddings in something like
       | LevelDB, in a format such as FlatBuffers, offers a
       | high-performance solution for online prediction (i.e. for
       | mapping business values to their embedding format on the fly to
       | send off to some model for inference).
         | jupp0r wrote:
         | Would that be on mobile devices for offline usage? I'm thinking
         | that for typical backend use cases one would use a dedicated
         | key value store service, right?
           | ramoz wrote:
           | This would depend on your requirements and type of inference.
           | Say you need to compute inference across 1000's of
           | content/documents/images every second or so, out of some
           | corpus of millions-billions, then having a kv store on
           | disk/SSD (NVMe) might be far more efficient & cheaper (in
           | terms of grabbing those embeddings to conduct a downstream ML
           | task). How you update the corpus matters too -- a lot of
           | embedding spaces need to be updated in aggregate.
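A small sketch of the embedding lookup pattern discussed in this subthread, under loud assumptions: Python's standard-library `dbm.dumb` stands in for an embedded store like LevelDB, and a fixed-width `struct` layout of float32 values stands in for FlatBuffers. The key names, the dimension `DIM`, and the helper functions are all hypothetical.

```python
import dbm.dumb
import os
import struct
import tempfile

DIM = 4                # assumed embedding dimension for this sketch
FMT = f"<{DIM}f"       # little-endian float32 values, fixed width


def put_embedding(db, key: str, vector) -> None:
    # Store the vector as a compact fixed-size byte string.
    db[key.encode()] = struct.pack(FMT, *vector)


def get_embedding(db, key: str) -> list:
    # Decode the raw bytes back into a list of floats.
    return list(struct.unpack(FMT, db[key.encode()]))


def dot(a, b) -> float:
    # Placeholder for the downstream model or similarity computation.
    return sum(x * y for x, y in zip(a, b))
```

Usage: open the store with `dbm.dumb.open(path, "c")`, call `put_embedding(db, "sku:123", [...])` while building the corpus, then reopen it read-only with flag `"r"` on the serving path and call `get_embedding` per request.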
       | lacker wrote:
       | IMO it's just confusing to call both, say, RocksDB and MySQL
       | "databases". They sit at different levels of the stack and it is
       | easier to just think of them as entirely different things, your
       | "SQL database" and your "storage engine". So your stack looks
       | like
       | Application
       | |
       | MySQL
       | |
       | RocksDB
       | |
       | Filesystem
       | In general the MySQL layer is doing all the convenient stuff for
       | application developers like supporting different queries and
       | datatypes. The RocksDB layer is optimizing for performance
       | metrics like throughput and reliability and just treats data as
       | sequences of bytes.
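The layering described above can be sketched roughly as follows. This is an illustration only, not how MySQL or RocksDB are implemented: `StorageEngine` and `Table` are hypothetical names, the engine is an in-memory dict, and rows are encoded as JSON purely for readability.

```python
import json


class StorageEngine:
    """Lower layer: only knows byte keys and byte values."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        if not isinstance(key, bytes) or not isinstance(value, bytes):
            raise TypeError("the storage engine only accepts bytes")
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)


class Table:
    """Upper layer: named columns and typed values, encoded down to bytes."""

    def __init__(self, engine: StorageEngine, name: str, columns):
        self.engine = engine
        self.name = name
        self.columns = list(columns)

    def _key(self, pk) -> bytes:
        # Namespacing by table name lets many tables share one engine.
        return f"{self.name}/{pk}".encode()

    def insert(self, pk, **row) -> None:
        unknown = set(row) - set(self.columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.engine.put(self._key(pk), json.dumps(row, sort_keys=True).encode())

    def select(self, pk):
        raw = self.engine.get(self._key(pk))
        return None if raw is None else json.loads(raw)
```

The application only ever talks to `Table`; the engine underneath never learns what a column or a type is, which is why it can be swapped out or reused by an entirely different upper layer.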
         | tomhallett wrote:
         | 100% agreed. TIL that mysql uses RocksDB under the hood.
         | Here's another example of a realtime database which uses
         | RocksDB under the hood:
         | https://rockset.com/blog/how-we-use-rocksdb-at-rockset/
           | eatonphil wrote:
           | As far as I'm aware, MySQL does not use RocksDB under the
           | hood by default. MyRocks is a distribution of MySQL that uses
           | RocksDB.
             | moralestapia wrote:
             | Yeah, weird comment from GP. By the time RocksDB was born,
             | MySQL was already going to prom.
               | ruw1090 wrote:
               | Close, but in database years it was actually already in
               | its mid life crisis.
           | icelancer wrote:
           | Only if you configure it that way. Same as MyISAM/InnoDB/etc.
         | [deleted]
         | lcnPylGDnU4H9OF wrote:
         | Actually, this helps a lot. I'd never heard of RocksDB and I'm
         | barely familiar with InnoDB and hopefully I am not wrong to
         | compare the two.
         | jeffbee wrote:
         | I think the use of bare RocksDB is more common than the use of
         | MyRocks.
       | NetOpWibby wrote:
       | You should add RethinkDB! I moved to it from MongoDB years ago.
         | jeffbee wrote:
         | RethinkDB is utterly defunct as a project, has not had a
         | substantive release in years, and in my experience just flat
         | out doesn't work. And let's don't even discuss Mongo. Asking
         | yourself to choose between these is like selecting your
         | favorite brand of thumbtack to step on.
           | gqewogpdqa wrote:
           | Lol. When did you last use MongoDB and why is it a thumbtack?
           | NetOpWibby wrote:
           | RethinkDB still works well for me /shrug
         | orthecreedence wrote:
         | Are you still using it? How is the pace going on the community-
         | supported version? I stopped using it after the company folded,
         | but I do kind of miss it. Definitely one of the more
         | interesting designs, and light years beyond what MongoDB was at
         | the time.
           | NetOpWibby wrote:
           | I'm definitely still using it, via rethinkdb-ts (npm
           | package). I even forked it to make it work with Deno.
           | The built-in Data Explorer is a must-have for me and idk of
           | any other database that has something similar.
             | eis wrote:
             | There are plenty of data explorers for other databases,
             | especially SQL DBs. I don't think it being built into the
             | DB should be a make-it-or-break-it feature.
             | I used RethinkDB back in the day because it was the first
             | DB that had pretty good replication and sharding - it was
             | zero effort. I found the functional programming model
             | strange: some stuff got executed locally, other parts
             | remotely, and it was not very straightforward when things
             | didn't go as planned.
             | By the time the RethinkDB company folded, CockroachDB
             | emerged and has been my go-to distributed DB since.
         | eatonphil wrote:
         | No I don't think that's relevant. They implement their own
         | btree it seems [0].
         | They don't use a key-value store library.
         | I know it's a bit of a fine line. But I'm talking about
         | standalone libraries people embed across different
         | applications/databases. That's what RocksDB/LevelDB/Pebble are.
         | [0]
         | https://github.com/rethinkdb/rethinkdb/tree/v2.4.x/src/btree
           | tristan957 wrote:
           | HSE[0] is another storage engine to throw on the pile.
           | [0]: https://github.com/hse-project/hse
       | NonNefarious wrote:
       | The term is "key/value."
       | mdzn wrote:
       | The article says that Consul or etcd are designed to always be
       | up, but it's actually quite the opposite. They both leverage Raft
       | for maintaining consensus and thus optimize for consistency at
       | the cost of availability in case of network partitions. See CAP
       | theorem.
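The consistency-over-availability point above comes down to majority arithmetic. The following is a simplified sketch of that arithmetic only, not a Raft implementation; the function names are hypothetical, and `reachable` counts the nodes on one side of a partition, including the would-be leader.

```python
def quorum(cluster_size: int) -> int:
    # A strict majority of the full cluster membership.
    return cluster_size // 2 + 1


def can_commit(cluster_size: int, reachable: int) -> bool:
    # A side of a partition can accept writes only if it holds a majority.
    return reachable >= quorum(cluster_size)
```

For a 5-node cluster split 3/2, `can_commit(5, 3)` is true and `can_commit(5, 2)` is false, so the minority side stops accepting writes. With an even split of a 4-node cluster, neither side has a majority and writes stall on both.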
         | cloudhead wrote:
         | All distributed databases are designed to "always be up",
         | that's the point of making them distributed, otherwise a single
         | instance is fine.
           | morelisp wrote:
           | There are reasons to distribute DBs that do not need to be up
           | constantly, e.g. distributing work (transactions or queries)
           | across more resources than are available on one machine; or
           | to bring a replica closer to some other service to reduce
           | latency.
           | Kafka Streams is the first kind; the source-of-truth storage
           | is HA (as HA as the Kafka topics it's backed with at least)
           | but can only be queried with high consistency when the
           | consumer is active, and it goes down for rebalances when you
           | scale out or fail over (and in many operational setups also
           | when you upgrade).
           | For an example of the second kind, see Fly.io's Litestream
           | explanation: https://fly.io/blog/all-in-on-sqlite-litestream/
           | That being said, I think the etcd etc. examples are just
           | meant to be in contrast to stock Redis or Memcache, which
           | offer very little HA support, generally just failover with
           | minimal consistency guarantee.
       | kefir wrote:
       | Apache Ignite 3 also uses RocksDB as a pluggable storage
       | https://www.gridgain.com/resources/blog/apache-ignite-3-alph...
         | eatonphil wrote:
         | Thanks! Adding this.
       (page generated 2022-08-23 23:00 UTC)