[HN Gopher] SplinterDB: High performance embedded key-value store
___________________________________________________________________

SplinterDB: High performance embedded key-value store

Author : ridruejo
Score  : 52 points
Date   : 2022-05-26 07:44 UTC (2 days ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| bufferoverflow wrote:
| You call it "high performance" and provide no benchmarks?

| dilyevsky wrote:
| Paper has it
| https://www.usenix.org/system/files/atc20-conway.pdf but yeah
| if you check out list of limitations looks more like a research
| proj at this stage. Pretty interesting architecture overall
| though

| bufferoverflow wrote:
| The numbers look very good actually.
|
| I don't care if it's a research project. If it doesn't crash,
| doesn't corrupt data, and delivers performance, it's useful.
|
| I'd want to see performance against Redis and KeyDB.

| dilyevsky wrote:
| Well you should read the limitations... I _think_ they are
| actually cheating by not calling fsync at all which makes
| writes not durable. This is different in rocks/pebble and
| friends.
|
| > I'd want to see performance against Redis and KeyDB.
|
| I think this is apples to oranges comparison as neither of
| these provide durability by default and if you enable it
| redis had terrible performance last I checked + redis needs
| to fit a whole dataset in memory

| ajhconway wrote:
| Hi, research lead for SplinterDB here.
|
| SplinterDB does make all writes durable and in fact has
| its own user-level cache which generally performs writes
| directly to disk (using O_DIRECT for example).
|
| Like RocksDB's default behavior (no fsyncs on the log),
| it does not immediately sync writes to its log when they
| happen. It waits to sync in batches, so that writes may
| not be immediately durable, but logging is more
| efficient. This is a slightly stronger default durability
| guarantee, and we intend to make this configurable.

| otterley wrote:
| I'm a little confused. If you don't ensure data is
| committed to storage (log or otherwise) before acking the
| write request, how can you call it durable?
|
| If it's not truly 100% durable by default, it's best not
| to suggest that it is. Experience says people will use
| the default settings and then become very cross if they
| lose data. It undermines trust and is harmful to
| reputation.

| ajhconway wrote:
| With many workloads, there's a tradeoff between the
| granularity of durability and the overall performance.
|
| If a workload has many small writes (some of our product
| workloads do), then syncing each write can cause write
| amplification and massively affect overall throughput and
| latency. Suppose I do a 100B write, this causes a 4KiB
| page write to sync, which is 40x write amp. Suddenly a
| 2GiB/sec SSD can effectively only write 50MiB/sec.
| Similarly, the per-write latency goes from <5us to 10us
| (with the fastest Optane SSDs) or 150us (with flash
| SSDs).
|
| So storage systems tend to offer a range of durability
| guarantees. Some systems have a special sync operation
| for applications to ensure that all writes are durable.
|
| RocksDB offers a fairly weak guarantee by default too,
| writing to the write-ahead-log (WAL), but not performing
| fsyncs
| (https://github.com/facebook/rocksdb/wiki/WAL-Performance).
| They make a similar write amplification argument too
| (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).

| dilyevsky wrote:
| I missed the use of direct io and the comment about fsync
| threw me off, thanks. Very impressive then!

| tyingq wrote:
| Ah, that's helpful, and explains why it exists:
|
| _"Three novel ideas contribute to the high performance of
| SplinterDB: the STB-tree, a new compaction policy that
| exposes more concurrency, and a concurrent memtable and
| user-level cache that removes scalability bottlenecks. All three
| components are designed to enable the CPU to drive high IOPS
| without wasting cycles."_
|
| _"At the heart of SplinterDB is the STB-tree, a novel data
| structure that combines ideas from log-structured merge tree
| and B-trees. The STB-tree adapts the idea of size-tiering
| (also known as fragmentation) from key-value stores such as
| Cassandra and PebblesDB and applies them to B-trees to reduce
| write amplification by reducing the number of times a data
| item is re-written during compaction."_

| stingraycharles wrote:
| Yeah I would appreciate a benchmark against its main
| alternative, rocksdb. I know benchmarks are typically
| manufactured and not too representative for real world load,
| but at least a ballpark figure would be nice to know what we're
| talking about here.
|
| Their main website is at https://splinterdb.org/ by the way,
| for those interested. Also no benchmarks there. :)

| ridruejo wrote:
| The paper referenced in the other comment includes a
| benchmark against RocksDB
| https://news.ycombinator.com/item?id=31515765

| [deleted]

| killingtime74 wrote:
| Why would I pick this over SQLite?

| necubi wrote:
| Totally different use-cases. This is an embedded key value
| store, not an RDBMS. You would use this in place of e.g.,
| LevelDB or RocksDB, potentially as the storage layer of a
| database.

| axblount wrote:
| There's always the venerable:
|
|     CREATE TABLE kv (
|       k TEXT PRIMARY KEY,
|       v TEXT NOT NULL
|     );
|
| Even if sqlite is technically an RDBMS, I think it's a
| legitimate comparison. Is SplinterDB worth giving up sqlite's
| reliability and feature set?

| necubi wrote:
| This is much lower-level than sqlite. In fact, you could
| use this as the storage layer for a SQL DB. See, e.g.,
| MyRocks[0] which is a MySQL backend that uses RocksDB as
| the storage layer.
|
| In other words, you'd use this when you just need a
| persistent KV store and want to build the higher level
| semantics according to your application's needs.
|
| [0] http://myrocks.io/

| 4khilles wrote:
| > In other words, you'd use this when you just need a
| persistent KV store and want to build the higher level
| semantics according to your application's needs.
|
| Why can't you use SQLite for this use case? I believe FDB
| uses SQLite as an embedded KV store.

| tjungblut wrote:
| Check out the limitations first, no fsync and no data recovery
| makes this of very little use. I wonder what makes you write a kv
| store without this from the start.

| adra wrote:
| Like 99% of the use cases redis/memcached should be used for?

| bufferoverflow wrote:
| Redis has optional durability
|
| https://redis.io/docs/manual/persistence/

| tjungblut wrote:
| Why write anything to disk if you can just store it in an
| array?

| capableweb wrote:
| Just one example: sharing data between processes

___________________________________________________________________
(page generated 2022-05-28 23:00 UTC)
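
The batched log syncing that ajhconway describes (append writes immediately, fsync only once per batch) can be illustrated with a small sketch. This is not SplinterDB's or RocksDB's implementation: the `BatchedLog` class, its `batch_size` parameter, and the file name are hypothetical, and the only real calls used are Python's `os.open`, `os.write`, and `os.fsync`.

```python
import os
import tempfile


class BatchedLog:
    """Hypothetical append-only log that fsyncs once per batch of appends."""

    def __init__(self, path: str, batch_size: int = 64) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0     # records written since the last fsync
        self.sync_calls = 0  # number of fsyncs issued so far

    def append(self, record: bytes) -> None:
        # The write reaches the OS right away, but it is only durable
        # after the next fsync, so up to batch_size - 1 records can be
        # lost on a crash.
        os.write(self.fd, record)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self) -> None:
        # Explicit sync point: everything appended so far becomes durable.
        if self.pending:
            os.fsync(self.fd)
            self.sync_calls += 1
            self.pending = 0

    def close(self) -> None:
        self.sync()
        os.close(self.fd)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "wal.log")
    log = BatchedLog(log_path, batch_size=64)
    for _ in range(1000):
        log.append(b"x" * 100)
    log.close()
    fsyncs = log.sync_calls
    log_size = os.path.getsize(log_path)

print(f"{fsyncs} fsyncs for 1000 appends, {log_size} bytes logged")
```

Syncing after every append would issue 1000 fsyncs here; batching trades a bounded window of possible loss for far fewer sync calls, which is the tradeoff otterley and ajhconway are debating.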
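
The write-amplification figures in ajhconway's reply can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the thread (a 100-byte write, a 4 KiB page, a 2 GiB/s device); the function names are made up for illustration.

```python
# Figures quoted in the thread.
WRITE_BYTES = 100                   # size of one small write
PAGE_BYTES = 4 * 1024               # page that must be synced per write
DEVICE_BYTES_PER_SEC = 2 * 1024**3  # raw device bandwidth, 2 GiB/s


def write_amplification(payload_bytes: int, page_bytes: int) -> float:
    """Bytes physically written per payload byte when each write syncs a page."""
    return page_bytes / payload_bytes


def effective_throughput(device_bytes_per_sec: float, amplification: float) -> float:
    """Payload bytes per second that remain after paying the amplification."""
    return device_bytes_per_sec / amplification


amp = write_amplification(WRITE_BYTES, PAGE_BYTES)
mib_per_sec = effective_throughput(DEVICE_BYTES_PER_SEC, amp) / 1024**2

print(f"write amplification: {amp:.2f}x")
print(f"effective throughput: {mib_per_sec:.0f} MiB/s")
```

The exact ratio is 4096 / 100 = 40.96, which the comment rounds to 40x, and 2 GiB/s divided by that ratio comes out at 50 MiB/s, matching the figure quoted.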
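
For readers wondering what axblount's `kv` table looks like in use, here is a minimal sketch with Python's standard `sqlite3` module. The `put`, `get`, and `delete` helpers are hypothetical names, an in-memory database is assumed for brevity, and nothing here says anything about how SQLite compares with SplinterDB on performance.

```python
import sqlite3
from typing import Optional

# In-memory database for illustration; pass a file path for persistence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key: str, value: str) -> None:
    # INSERT OR REPLACE overwrites any existing row with the same key.
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(key: str) -> Optional[str]:
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


def delete(key: str) -> None:
    with conn:
        conn.execute("DELETE FROM kv WHERE k = ?", (key,))


put("colour", "red")
put("colour", "blue")  # second put replaces the first value
current = get("colour")
missing = get("no-such-key")
delete("colour")
after_delete = get("colour")
print(current, missing, after_delete)
```

This gives the basic put/get/delete surface of a key-value store on top of SQL; necubi's point is that engines like RocksDB sit a layer below this and can themselves back a SQL database.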