[HN Gopher] SplinterDB: High performance embedded key-value store
       ___________________________________________________________________
        
       SplinterDB: High performance embedded key-value store
        
       Author : ridruejo
       Score  : 52 points
       Date   : 2022-05-26 07:44 UTC (2 days ago)
        
        
       | bufferoverflow wrote:
       | You call it "high performance" and provide no benchmarks?
        
         | dilyevsky wrote:
         | Paper has it:
         | https://www.usenix.org/system/files/atc20-conway.pdf but yeah,
         | if you check out the list of limitations, it looks more like a
         | research project at this stage. Pretty interesting architecture
         | overall, though.
        
           | bufferoverflow wrote:
           | The numbers look very good actually.
           | 
           | I don't care if it's a research project. If it doesn't crash,
           | doesn't corrupt data, and delivers performance, it's useful.
           | 
           | I'd want to see performance against Redis and KeyDB.
        
             | dilyevsky wrote:
             | Well you should read the limitations... I _think_ they are
             | actually cheating by not calling fsync at all which makes
             | writes not durable. This is different in RocksDB/Pebble and
             | friends.
             | 
             | > I'd want to see performance against Redis and KeyDB.
             | 
             | I think this is an apples-to-oranges comparison, as neither
             | of these provides durability by default, and if you enable
             | it, Redis had terrible performance last I checked. Redis
             | also needs to fit the whole dataset in memory.
        
               | ajhconway wrote:
               | Hi, research lead for SplinterDB here.
               | 
               | SplinterDB does make all writes durable and in fact has
               | its own user-level cache which generally performs writes
               | directly to disk (using O_DIRECT for example).
               | 
               | Like RocksDB's default behavior (no fsyncs on the log),
               | it does not immediately sync writes to its log when they
               | happen. It waits to sync in batches, so that writes may
               | not be immediately durable, but logging is more
               | efficient. This is a slightly stronger default durability
               | guarantee, and we intend to make this configurable.
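The batched log sync described above can be illustrated with a minimal sketch. This is not SplinterDB's implementation or API: the `BatchedLog` class, its `batch_size` parameter, and the record format are hypothetical, and it uses plain buffered file writes plus `os.fsync` rather than O_DIRECT. It only shows the general idea that one sync can cover many appended writes.

```python
import os
import tempfile


class BatchedLog:
    """Hypothetical append-only log that fsyncs once per batch of writes."""

    def __init__(self, path, batch_size=64):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0      # records appended since the last fsync
        self.sync_count = 0   # number of fsync calls issued so far

    def append(self, record: bytes) -> None:
        # Length-prefixed record; it is NOT durable until the next sync().
        os.write(self.fd, len(record).to_bytes(4, "little") + record)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self) -> None:
        # One fsync makes every pending record durable at once.
        if self.pending:
            os.fsync(self.fd)
            self.sync_count += 1
            self.pending = 0

    def close(self) -> None:
        self.sync()
        os.close(self.fd)


def demo(n_writes=1000, batch_size=64):
    """Append n_writes records and return how many fsyncs were needed."""
    with tempfile.TemporaryDirectory() as d:
        log = BatchedLog(os.path.join(d, "wal.log"), batch_size)
        for i in range(n_writes):
            log.append(b"key%d=value" % i)
        log.close()
        return log.sync_count
```

With `batch_size=1` this degenerates to syncing every write, which is the fully durable but slow end of the tradeoff; an explicit `sync()` call plays the role of the "special sync operation" some systems expose.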
        
               | otterley wrote:
               | I'm a little confused. If you don't ensure data is
               | committed to storage (log or otherwise) before acking the
               | write request, how can you call it durable?
               | 
               | If it's not truly 100% durable by default, it's best not
               | to suggest that it is. Experience says people will use
               | the default settings and then become very cross if they
               | lose data. It undermines trust and is harmful to
               | reputation.
        
               | ajhconway wrote:
               | With many workloads, there's a tradeoff between the
               | granularity of durability and the overall performance.
               | 
               | If a workload has many small writes (some of our product
               | workloads do), then syncing each write can cause write
               | amplification and massively affect overall throughput and
               | latency. Suppose I do a 100B write: this causes a 4KiB
               | page write to sync, which is ~40x write amp. Suddenly a
               | 2GiB/sec SSD can effectively only write 50MiB/sec.
               | Similarly, the per-write latency goes from <5us to 10us
               | (with the fastest Optane SSDs) or 150us (with flash
               | SSDs).
               | 
               | So storage systems tend to offer a range of durability
               | guarantees. Some systems have a special sync operation
               | for applications to ensure that all writes are durable.
               | 
               | RocksDB offers a fairly weak guarantee by default too,
               | writing to the write-ahead-log (WAL), but not performing
               | fsyncs
               | (https://github.com/facebook/rocksdb/wiki/WAL-Performance).
               | They make a similar write amplification argument too
               | (https://github.com/facebook/rocksdb/wiki/WAL-Performance#wri...).
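The write amplification arithmetic in the comment above can be checked with a short sketch. The function names are made up for illustration, and the model is deliberately simple: it assumes every synced write forces whole 4 KiB pages to be written and ignores any other overhead.

```python
PAGE_SIZE = 4 * 1024  # 4 KiB page, as in the comment


def write_amplification(write_bytes, page_bytes=PAGE_SIZE):
    """Bytes written to the device per byte of user data, syncing each write."""
    pages = -(-write_bytes // page_bytes)  # ceiling division
    return pages * page_bytes / write_bytes


def effective_throughput_mib(device_gib_per_s, write_bytes, page_bytes=PAGE_SIZE):
    """User-visible MiB/s when every write is synced as whole pages."""
    device_mib_per_s = device_gib_per_s * 1024
    return device_mib_per_s / write_amplification(write_bytes, page_bytes)


if __name__ == "__main__":
    print(write_amplification(100))           # roughly 41x for a 100 B write
    print(effective_throughput_mib(2, 100))   # roughly 50 MiB/s on a 2 GiB/s SSD
```

Under these assumptions the comment's figures hold: 4096 / 100 is about 41x, and 2 GiB/s divided by that is about 50 MiB/s.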
        
               | dilyevsky wrote:
               | I missed the use of direct io and the comment about fsync
               | threw me off, thanks. Very impressive then!
        
           | tyingq wrote:
           | Ah, that's helpful, and explains why it exists:
           | 
           | _"Three novel ideas contribute to the high performance of
           | SplinterDB: the STB-tree, a new compaction policy that
           | exposes more concurrency, and a concurrent memtable and user-
           | level cache that removes scalability bottlenecks. All three
           | components are designed to enable the CPU to drive high IOPS
           | without wasting cycles."_
           | 
           | _"At the heart of SplinterDB is the STB-tree, a novel data
           | structure that combines ideas from log-structured merge tree
           | and B-trees. The STB-tree adapts the idea of size-tiering
           | (also known as fragmentation) from key-value stores such as
           | Cassandra and PebblesDB and applies them to B-trees to reduce
           | write amplification by reducing the number of times a data
           | item is re-written during compaction."_
        
         | stingraycharles wrote:
         | Yeah I would appreciate a benchmark against its main
         | alternative, rocksdb. I know benchmarks are typically
         | manufactured and not too representative of real-world load,
         | but at least a ballpark figure would be nice to know what we're
         | talking about here.
         | 
         | Their main website is at https://splinterdb.org/ by the way,
         | for those interested. Also no benchmarks there. :)
        
           | ridruejo wrote:
           | The paper referenced in the other comment includes a
           | benchmark against RocksDB
           | https://news.ycombinator.com/item?id=31515765
        
       | killingtime74 wrote:
       | Why would I pick this over SQLite?
        
         | necubi wrote:
         | Totally different use-cases. This is an embedded key value
         | store, not an RDBMS. You would use this in place of e.g.,
         | LevelDB or RocksDB, potentially as the storage layer of a
         | database.
        
           | axblount wrote:
           | There's always the venerable:
           | 
           |     CREATE TABLE kv (
           |         k TEXT PRIMARY KEY,
           |         v TEXT NOT NULL
           |     );
           | 
           | Even if sqlite is technically an RDBMS, I think it's a
           | legitimate comparison. Is SplinterDB worth giving up sqlite's
           | reliability and feature set?
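For reference, the `kv` table from the comment above can be driven as a small key-value store from Python's standard `sqlite3` module. This is only a sketch of the SQLite side of the comparison: the helper names (`open_kv`, `put`, `get`, `delete`, `scan`) are invented for illustration and say nothing about SplinterDB's own interface or relative performance.

```python
import sqlite3


def open_kv(path=":memory:"):
    """Open (or create) a SQLite database holding the kv table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
    )
    return conn


def put(conn, k, v):
    # 'with conn' wraps the statement in a transaction and commits it.
    with conn:
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (k, v))


def get(conn, k, default=None):
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
    return row[0] if row else default


def delete(conn, k):
    with conn:
        conn.execute("DELETE FROM kv WHERE k = ?", (k,))


def scan(conn, start, end):
    """Ordered range scan over keys in [start, end), using the primary key."""
    cur = conn.execute(
        "SELECT k, v FROM kv WHERE k >= ? AND k < ? ORDER BY k", (start, end)
    )
    return cur.fetchall()
```

A typical session would call `conn = open_kv("data.db")`, then `put(conn, "user:1", "alice")` and `get(conn, "user:1")`; the range scan is the operation that ordered key-value stores such as LevelDB or RocksDB also provide.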
        
             | necubi wrote:
             | This is much lower-level than sqlite. In fact, you could
             | use this as the storage layer for a SQL DB. See, e.g.,
             | MyRocks[0] which is a MySQL backend that uses RocksDB as
             | the storage layer.
             | 
             | In other words, you'd use this when you just need a
             | persistent KV store and want to build the higher level
             | semantics according to your application's needs.
             | 
             | [0] http://myrocks.io/
        
               | 4khilles wrote:
               | > In other words, you'd use this when you just need a
               | persistent KV store and want to build the higher level
               | semantics according to your application's needs.
               | 
               | Why can't you use SQLite for this use case? I believe FDB
               | uses SQLite as an embedded KV store.
        
       | tjungblut wrote:
       | Check out the limitations first: no fsync and no data recovery
       | make this of very little use. I wonder what makes you write a kv
       | store without this from the start.
        
         | adra wrote:
           | Like 99% of the use cases redis/memcached should be used for?
        
           | bufferoverflow wrote:
           | Redis has optional durability
           | 
           | https://redis.io/docs/manual/persistence/
        
           | tjungblut wrote:
           | Why write anything to disk if you can just store it in an
           | array?
        
             | capableweb wrote:
             | Just one example: sharing data between processes
        
       ___________________________________________________________________
       (page generated 2022-05-28 23:00 UTC)