[HN Gopher] Understanding LSM Trees: What Powers Write-Heavy Dat...
       ___________________________________________________________________
        
       Understanding LSM Trees: What Powers Write-Heavy Databases
        
       Author : branko_d
       Score  : 33 points
       Date   : 2022-02-09 05:37 UTC (17 hours ago)
        
 (HTM) web link (yetanotherdevblog.com)
 (TXT) w3m dump (yetanotherdevblog.com)
        
       | saurik wrote:
       | I don't understand why every distributed project has gravitated
       | to write-optimized data structures; like, if you look at all of
        | the decentralized SQL offerings, they are now all built around
        | LSM trees... does no one have extremely large transactional
        | datasets (so not bulk-data-analysis OLAP, but standard OLTP
        | work) that are more read than write (ones where raw replication
        | to balance the I/O or CPU load of queries isn't practical due
        | to the space cost, so you want the kind of autosharding these
        | solutions offer)? I guess I am just confused, as I would have
        | expected write-oriented workloads to be rarer and more
        | specialized, mostly operating in the regime of transient shopping
       | carts (Amazon's original use case for Dynamo) or click logs or
       | whatever, as I would have thought most projects (maybe a social
       | network or a chat client) would have numerous reads (maybe even
       | tens of thousands) for every write.
        
         | ot wrote:
         | LSM trees have all kinds of tricks to make reads fast too, like
         | Bloom filters to rule out levels that don't have the key.
          | RocksDB, an LSM engine, powers what is probably the largest
          | MySQL deployment in the world, for OLTP workloads [1].
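          | 
          | A rough sketch of that read path (hypothetical names, not
          | RocksDB's actual API):
          | 
          |   # Check the in-memory memtable first, then each level's
          |   # SSTables, consulting a per-table Bloom filter so tables
          |   # that definitely don't contain the key are never read
          |   # from disk.
          |   def get(key, memtable, levels):
          |       if key in memtable:
          |           return memtable[key]
          |       for level in levels:                # newest to oldest
          |           for sstable in level:
          |               if not sstable.bloom.might_contain(key):
          |                   continue                # definitely absent
          |               value = sstable.lookup(key) # index + block read
          |               if value is not None:
          |                   return value
          |       return None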
         | 
         | The real killer feature is not write efficiency, but the
         | sequential write pattern, as opposed to the small random writes
         | used by B-trees. On flash, this makes a significant difference
         | for endurance.
         | 
         | EDIT: And I forgot to mention that it is a lot easier to do
         | compression on immutable data structures (SSTables) than on
         | mutable ones, so LSMs are generally more space-efficient too.
         | 
         | [1] https://vldb.org/pvldb/vol13/p3217-matsunobu.pdf
        
           | saurik wrote:
            | Every time I look into it, though, these tricks aren't
            | making reads truly fast; they are just mitigating the read
            | penalty of LSM trees to the extent possible, while a B-tree
            | is still, in the end, both better for reads and not even
            | that much worse for writes (due to the write amplification
            | from LSM trees; with B-trees this is only an issue if you
            | are making truly random writes, which I believe is the
            | worst case for LSM trees anyway).
        
             | ot wrote:
              | If you're not doing random writes (that is, random keys),
              | an LSM has no write amplification; it essentially becomes
              | a sorted list of SSTables.
             | 
             | I can't think of any OLTP application where the writes are
             | not random (in the sense of relative position in the key
             | space). Either your primary keys are, or at least all your
             | indexes are.
             | 
             | Write amplification is not an issue if sequential writes
             | are orders of magnitude faster than random writes, which is
             | usually the case, so even with that LSMs come out ahead.
             | 
              | Also, if your application is read-heavy, you can tune the
              | LSM parameters to have fewer levels (at the cost of more
              | compaction traffic). With B-trees, you have to chase a
              | fixed depth of roughly log_{fanout}(N) pages on every
              | lookup, the fanout being determined by the page size.
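              | 
              | A back-of-the-envelope sketch of the quantities being
              | traded off here (made-up numbers, purely illustrative):
              | 
              |   import math
              | 
              |   # Leveled LSM: a key is rewritten roughly `fanout`
              |   # times per level as it is merged down, so write
              |   # amplification is on the order of fanout * levels.
              |   def lsm_write_amp(num_keys, fanout=10,
              |                     memtable_keys=10**6):
              |       ratio = num_keys / memtable_keys
              |       levels = math.ceil(math.log(ratio, fanout))
              |       return fanout * max(levels, 1)
              | 
              |   # B-tree: every point lookup descends a fixed number
              |   # of pages, roughly log_fanout(N).
              |   def btree_depth(num_keys, keys_per_page=256):
              |       return math.ceil(math.log(num_keys, keys_per_page))
              | 
              |   print(lsm_write_amp(10**9))  # ~30 with these numbers
              |   print(btree_depth(10**9))    # ~4 page reads per lookup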
        
           | bob1029 wrote:
           | > The real killer feature is not write efficiency, but the
           | sequential write pattern, as opposed to the small random
           | writes used by B-trees. On flash, this makes a significant
           | difference for endurance.
           | 
            | Sequential writes _with_ some degree of request batching
            | can add orders of magnitude to system throughput. I use a
           | ringbuffer abstraction to serialize my writers and handle
           | batches of up to 4096 transactions at a time. Everything in
           | the batch gets compressed and written as 1 chunk to an
           | append-only log. The benefits of processing multiple requests
           | at a time are very hard to overstate. Saturating the write
           | performance of virtually any storage device is trivial with
            | this approach. It also carries over to magnetic storage
            | really well, and even to tape if you were so inclined to go
            | down that path.
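            | 
            | A minimal sketch of that pattern (hypothetical names; the
            | real thing serializes writers through a ring buffer rather
            | than a queue):
            | 
            |   import os, queue, zlib
            | 
            |   MAX_BATCH = 4096
            | 
            |   def writer_loop(requests, log_file):
            |       while True:
            |           batch = [requests.get()]   # block for first item
            |           while len(batch) < MAX_BATCH:
            |               try:
            |                   batch.append(requests.get_nowait())
            |               except queue.Empty:
            |                   break
            |           # Compress the whole batch and append it as one
            |           # length-prefixed chunk, then fsync once for all
            |           # of the batched transactions.
            |           chunk = zlib.compress(b"".join(batch))
            |           header = len(chunk).to_bytes(4, "big")
            |           log_file.write(header + chunk)
            |           log_file.flush()
            |           os.fsync(log_file.fileno())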
        
             | ot wrote:
             | The batching comes implicitly in LSMs: for the WAL, the
             | writes are committed only when the WAL is synced, so the
             | amount of batching is user-controlled. When writing
              | SSTables (either from memtable flushes or from compaction),
              | the write sizes are controlled by the write buffer size.
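              | 
              | A minimal illustration of "committed only when the WAL is
              | synced" (a sketch, not any particular engine's code):
              | 
              |   import os, threading
              | 
              |   class WAL:
              |       def __init__(self, path):
              |           self.f = open(path, "ab")
              |           self.lock = threading.Lock()
              | 
              |       def append(self, record: bytes, sync: bool) -> None:
              |           with self.lock:
              |               self.f.write(record)
              |               if sync:
              |                   # The write is durable only after this
              |                   # fsync; syncing less often batches
              |                   # more commits per flush.
              |                   self.f.flush()
              |                   os.fsync(self.f.fileno())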
        
         | kragen wrote:
         | Social networks and chat clients can deal with eventually
         | consistent reads instead of consistent reads, so you can run
         | your database with a master, which handles only writes, a
         | standby master for failover, and 58 asynchronously updated
          | read replicas. There are 307 different ways to scale out read
         | performance horizontally as long as eventual consistency is
         | good enough for your application: memcached, Redis, Varnish,
         | Fastly (built on Varnish), etc. Some of these don't even have
         | cache invalidation, relying on timeouts.
         | 
         | This approach can get you pretty far before you need sharding,
         | auto- or otherwise, but the master's write bandwidth is the
         | bottleneck. Taken to the extreme this realization leads you to
         | ridiculous places like Kafka/Samza.
         | 
         | I'm trying to come up with an example of an OLTP application
         | where the ratio of reads to writes within the same transaction
         | needs to be large. I'm sure they must exist but I can't think
         | of one.
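          | 
          | The cache-with-timeouts variant, as a sketch (hypothetical
          | names; in practice the cache is memcached or Redis rather
          | than a dict):
          | 
          |   import time
          | 
          |   CACHE = {}          # key -> (value, expires_at)
          |   TTL_SECONDS = 60
          | 
          |   def read(key, read_replica):
          |       hit = CACHE.get(key)
          |       if hit and hit[1] > time.time():
          |           return hit[0]                 # possibly stale
          |       value = read_replica.get(key)     # eventually consistent
          |       CACHE[key] = (value, time.time() + TTL_SECONDS)
          |       return value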
        
       | tommiegannert wrote:
       | > An SSTable will consist of multiple sorted files called
       | segments. [---] Recall that LSM trees only perform sequential
       | writes.
       | 
        | Assuming either that you only have one table, or that file
        | systems are magic things that don't make that argument moot.
       | 
       | > Once the compaction process has written a new segment for the
       | input segments, the old segment files are deleted.
       | 
       | ... and the assumption of only a single table isn't enough! Thank
       | you, file system developers, for making SSTables sound so easy.
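        | 
        | For reference, the compaction step the article describes is
        | essentially a merge of sorted segments; a sketch, with
        | read_segment/write_segment as hypothetical helpers:
        | 
        |   import os
        | 
        |   def compact(segment_paths, read_segment, write_segment,
        |               out_path):
        |       # A segment is a list of (key, value) pairs sorted by
        |       # key; segment_paths is ordered oldest to newest.
        |       merged = {}
        |       for path in segment_paths:
        |           for key, value in read_segment(path):
        |               merged[key] = value      # newer values win
        |       # Write the new segment first, then delete the inputs,
        |       # so a crash in between leaves duplicates rather than
        |       # losing data.
        |       write_segment(out_path, sorted(merged.items()))
        |       for path in segment_paths:
        |           os.remove(path)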
       | 
       | > Bloom filter
       | 
        | I did some research a couple of weeks ago on approximate
        | membership query (AMQ) data structures. Bloom filters are from
        | 1970, quotient filters arrived in 1984, then Pagh improved
        | Bloom filters in 2004, and we got cuckoo filters in 2014 [1].
        | Has there been progress since? Did I miss any key achievement
        | in between?
       | 
       | [1] https://en.wikipedia.org/wiki/Cuckoo_filter
        
       | 0xCMP wrote:
        | The first thing I thought reading this is "there is something
        | missing here", because you obviously can't wait until the
        | memtable is full to reply OK to the requester. The RocksDB docs
        | explain that they use a WAL + manifest log, and I assume others
        | do something similar:
       | https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-...
       | 
        | I wish the article at least mentioned offhand that this clearly
        | isn't the whole solution, and that a WAL or similar is used to
        | make sure losing the memtable doesn't lose the data.
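        | 
        | The usual shape of that write path, as a sketch (not RocksDB's
        | actual code; keys and values are assumed to contain no tabs or
        | newlines):
        | 
        |   import os
        | 
        |   class MiniLSM:
        |       def __init__(self, wal_path):
        |           self.memtable = {}
        |           self.wal = open(wal_path, "ab")
        | 
        |       def put(self, key, value):
        |           # 1. Append to the WAL and sync, so the write
        |           #    survives a crash even though the memtable is
        |           #    only in memory.
        |           self.wal.write(f"{key}\t{value}\n".encode())
        |           self.wal.flush()
        |           os.fsync(self.wal.fileno())
        |           # 2. Only then apply the write to the memtable.
        |           self.memtable[key] = value
        | 
        |       @classmethod
        |       def recover(cls, wal_path):
        |           # On restart, replay the WAL to rebuild the memtable
        |           # that was lost with the process.
        |           db = cls(wal_path)
        |           with open(wal_path, "rb") as f:
        |               for line in f:
        |                   key, value = (line.decode().rstrip("\n")
        |                                 .split("\t"))
        |                   db.memtable[key] = value
        |           return db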
        
       | spullara wrote:
        | There is one nit I would like to point out where they talk
        | about a Bloom filter: they should replace "value is present"
        | with "value may be present". The worst-case scenario is a false
        | positive, and the false-positive rate is generally tuned to be
        | about 1%. Whatever that false-positive rate (R) is, though,
        | your tail latency will definitely be affected at and above
        | p(100-R).
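        | 
        | A quick way to see the tail-latency effect (made-up latency
        | numbers, just to illustrate):
        | 
        |   # With a 1% false-positive rate, about 1 in 100 reads for
        |   # absent keys pays for a wasted SSTable probe, so p99 and
        |   # beyond reflect the disk read even though the median check
        |   # is purely in-memory.
        |   fp_rate = 0.01
        |   bloom_check_us = 1        # in-memory filter check
        |   sstable_probe_us = 200    # wasted index + data block read
        |   average_us = bloom_check_us + fp_rate * sstable_probe_us
        |   print(average_us)                          # ~3 us on average
        |   print(bloom_check_us + sstable_probe_us)   # ~201 us at p99+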
        
       | bradengroom wrote:
       | Author here! Thanks for sharing, and thanks for all of your
        | suggestions in the comments. I'll try to find some time to
        | apply the feedback amid my current parental leave.
        
       ___________________________________________________________________
       (page generated 2022-02-09 23:00 UTC)