[HN Gopher] Understanding LSM Trees: What Powers Write-Heavy Databases
___________________________________________________________________

Understanding LSM Trees: What Powers Write-Heavy Databases

Author : branko_d
Score  : 33 points
Date   : 2022-02-09 05:37 UTC (17 hours ago)

(HTM) web link (yetanotherdevblog.com)
(TXT) w3m dump (yetanotherdevblog.com)

| saurik wrote:
| I don't understand why every distributed project has gravitated to write-optimized data structures; like, if you look at all of the decentralized SQL offerings, they are now all built around LSM trees... does no one have extremely large transactional (so not, like, bulk data analysis OLAP but standard OLTP work) datasets that are more read than write (ones where raw replication to balance the I/O or CPU load of queries is impractical due to the space cost, so you want the kind of autosharding these solutions offer)? I guess I am just confused, as I would have expected write-oriented workloads to be more rare and specialized, mostly operating in the regime of transient shopping carts (Amazon's original use case for Dynamo) or click logs or whatever, as I would have thought most projects (maybe a social network or a chat client) would have numerous reads (maybe even tens of thousands) for every write.

| ot wrote:
| LSM trees have all kinds of tricks to make reads fast too, like Bloom filters to rule out levels that don't have the key. RocksDB, an LSM, powers what is probably the largest MySQL deployment in the world, for OLTP workloads [1].
|
| The real killer feature is not write efficiency, but the sequential write pattern, as opposed to the small random writes used by B-trees. On flash, this makes a significant difference for endurance.
|
| EDIT: And I forgot to mention that it is a lot easier to do compression on immutable data structures (SSTables) than on mutable ones, so LSMs are generally more space-efficient too.
|
| [1] https://vldb.org/pvldb/vol13/p3217-matsunobu.pdf

| saurik wrote:
| Every time I look into it, though, these tricks aren't making reads truly fast; they are just mitigating the impact of LSM trees to the extent that is possible. A B-tree is still, in the end, both better for reads and not even that much worse for writes (due to the write amplification from LSM trees; with B-trees this is only an issue if you are making truly random writes, which I believe is the worst case for LSM trees anyway).

| ot wrote:
| If you're not doing random writes (that is, random keys), an LSM has no write amplification; it essentially becomes a sorted list of SSTables.
|
| I can't think of any OLTP application where the writes are not random (in the sense of relative position in the key space). Either your primary keys are, or at least all your indexes are.
|
| Write amplification is not an issue if sequential writes are orders of magnitude faster than random writes, which is usually the case, so even with that LSMs come out ahead.
|
| Also, if your application is read-heavy, you can tune the LSM parameters to have fewer levels (at the cost of more compaction traffic). With B-trees, you have to chase a fixed depth of log_{page size}(n).

| bob1029 wrote:
| > The real killer feature is not write efficiency, but the sequential write pattern, as opposed to the small random writes used by B-trees. On flash, this makes a significant difference for endurance.
|
| Sequential writes _with_ some degree of request batching can add orders of magnitude to system throughput. I use a ringbuffer abstraction to serialize my writers and handle batches of up to 4096 transactions at a time. Everything in the batch gets compressed and written as one chunk to an append-only log. The benefits of processing multiple requests at a time are very hard to overstate. Saturating the write performance of virtually any storage device is trivial with this approach. It also carries over to magnetic storage really well, and even tape, if you were so inclined to go down that path.

| ot wrote:
| The batching comes implicitly in LSMs: for the WAL, writes are committed only when the WAL is synced, so the amount of batching is user-controlled. When writing SSTables (either from memtables or from compaction), the write sizes are controlled by the write buffer size.
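
To make the batching point above concrete, here is a minimal sketch of the group-commit pattern bob1029 and ot describe: many writer threads enqueue records, a single committer drains the queue, appends the whole batch to an append-only log with one sequential write and one fsync, and only then acknowledges every writer in the batch. The in-memory queue, the 4096-record cap, and names like AppendOnlyLog are illustrative, not any particular database's API.

    import os
    import threading
    from queue import Queue, Empty

    MAX_BATCH = 4096  # illustrative cap, echoing the batch size mentioned above

    class AppendOnlyLog:
        def __init__(self, path):
            self.f = open(path, "ab")
            self.queue = Queue()   # (payload, done-event) pairs from writer threads
            threading.Thread(target=self._committer, daemon=True).start()

        def write(self, payload: bytes):
            """Called by many writer threads; returns once the record is durable."""
            done = threading.Event()
            self.queue.put((payload, done))
            done.wait()

        def _committer(self):
            while True:
                batch = [self.queue.get()]          # block until at least one record arrives
                while len(batch) < MAX_BATCH:
                    try:
                        batch.append(self.queue.get_nowait())
                    except Empty:
                        break
                # One sequential append and one fsync cover the whole batch.
                self.f.write(b"".join(payload + b"\n" for payload, _ in batch))
                self.f.flush()
                os.fsync(self.f.fileno())
                for _, done in batch:               # acknowledge only after the data is synced
                    done.set()

Each individual write still waits for one fsync, but under load many writes share that fsync, which is where the orders-of-magnitude throughput gain comes from.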
| kragen wrote:
| Social networks and chat clients can deal with eventually consistent reads instead of consistent reads, so you can run your database with a master, which handles only writes, a standby master for failover, and 58 asynchronously updated read slaves. There are 307 different ways to scale out read performance horizontally as long as eventual consistency is good enough for your application: memcached, Redis, Varnish, Fastly (built on Varnish), etc. Some of these don't even have cache invalidation, relying on timeouts.
|
| This approach can get you pretty far before you need sharding, auto- or otherwise, but the master's write bandwidth is the bottleneck. Taken to the extreme, this realization leads you to ridiculous places like Kafka/Samza.
|
| I'm trying to come up with an example of an OLTP application where the ratio of reads to writes within the same transaction needs to be large. I'm sure they must exist, but I can't think of one.

| tommiegannert wrote:
| > An SSTable will consist of multiple sorted files called segments. [---] Recall that LSM trees only perform sequential writes.
|
| Assuming either that you only have one table, or that file systems are magic things that don't make that argument moot.
|
| > Once the compaction process has written a new segment for the input segments, the old segment files are deleted.
|
| ... and the assumption of only a single table isn't enough! Thank you, file system developers, for making SSTables sound so easy.
|
| > Bloom filter
|
| I did some research a couple of weeks ago on approximate membership query data structures. Bloom filters are from 1970, quotient filters arrived in 1984, then Pagh improved Bloom filters in 2004, and we got Cuckoo filters in 2014 [1]. Has there been progress since? Did I miss any key achievement in between?
|
| [1] https://en.wikipedia.org/wiki/Cuckoo_filter

| 0xCMP wrote:
| The first thing I thought reading this is "there is something missing here", because you obviously can't wait until the memtable is full to reply OK to the requester. The RocksDB docs explain that they use a WAL + manifest log, and I assume others do something similar: https://github.com/facebook/rocksdb/wiki/RocksDB-Overview#3-...
|
| I wish it were at least mentioned offhand that this clearly isn't the whole solution, and that a WAL or similar is used to make sure losing the memtable doesn't lose the data.

| spullara wrote:
| There is one nit I would like to point out where they are talking about the Bloom filter: they should replace "value is present" with "value may be present". The worst-case scenario is a false positive, which is generally tuned to occur about 1% of the time. Whatever that false-positive rate R is, though, your tail latency will definitely be affected at around p(100-R).
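
To make "value may be present" concrete, here is a toy Bloom-filter-gated read path (a textbook construction, not RocksDB's or any other engine's implementation; BloomFilter, might_contain, lsm_get, and the (bloom, table) pairing are all illustrative). A negative answer is definite and lets the LSM skip an SSTable entirely; a positive answer only means the key might be there, so at roughly the filter's false-positive rate you pay for an extra SSTable read, which is exactly the tail-latency effect spullara describes.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k hash probes into a bit array of m bits."""
        def __init__(self, m_bits=8192, k=7):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8)

        def _probes(self, key: bytes):
            for i in range(self.k):
                digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, key: bytes):
            for p in self._probes(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key: bytes) -> bool:
            # False -> key is definitely absent, so the SSTable can be skipped.
            # True  -> key *may* be present; the table still has to be read, and at
            #          typical tuning roughly 1% of these answers are false positives.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

    def lsm_get(key, memtable, sstables):
        """Read-path sketch: memtable first, then SSTables newest to oldest,
        skipping any table whose filter says the key is definitely absent.
        `memtable` is dict-like; `sstables` is a list of (bloom, table) pairs
        where `table.get(key)` reads the segment file."""
        if key in memtable:
            return memtable[key]
        for bloom, table in sstables:
            if bloom.might_contain(key):       # a false positive costs one wasted read
                value = table.get(key)
                if value is not None:
                    return value
        return None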
| bradengroom wrote:
| Author here! Thanks for sharing, and thanks for all of your suggestions in the comments here. I'll try to find some time to apply the feedback amid my current parental leave.
___________________________________________________________________
(page generated 2022-02-09 23:00 UTC)