[HN Gopher] B-tree Path Hints
___________________________________________________________________

B-tree Path Hints

Author : eatonphil
Score  : 230 points
Date   : 2021-07-30 14:46 UTC (8 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| ccleve wrote:
| It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.
|
| A linear search can also be optimized to use SIMD, which is a big win.
|
| If you really want to use a path hint, you could just say "start my linear search at the point where I left off last time", and if the first element is too big, wrap around to the beginning. But that might not buy you much.
| ardit33 wrote:
| I agree.... the whole point of a b-tree is to already subdivide the dataset into smaller, manageable sets. A binary search won't be much help on those slices if the tree is constructed properly.
| jldugger wrote:
| As I understand it, the idea is that a path hint is a suggestion on which node to scan first, not a suggestion on how to binary search an individual node.
| jsnell wrote:
| No, it'll still do a scan all the way from the root node down to the leaf nodes. The path is simply saying which element to start the binary search from, in each of the nodes.
| jleahy wrote:
| The something is data dependencies.
| nextaccountic wrote:
| Isn't the binary search here taken over the whole tree, spanning multiple nodes? Each time we follow a pointer we are discarding a huge swath of the tree.
|
| In a single node, a linear search is probably better, yes.
| scottlamb wrote:
| > It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.
|
| I've heard that--and iirc Rust's BTreeMap type uses linear search--but does this depend on the type of key? Easy for me to see how that might be true for a BTreeMap<u64, V> but not a BTreeMap<String, V>. In the latter case, there's a pointer to chase for each comparison (and more/variable data to process; SIMD over multiple keys doesn't seem likely).
| ccleve wrote:
| Yes, true. If your comparisons involve pointer-chasing or string comparison or anything complex, then it's not clear which strategy is faster. I always write the simplest possible code first, which is usually a linear search, benchmark it, and then try to beat the benchmark if it's warranted.
| ntonozzi wrote:
| A similar idea is finger search (https://en.wikipedia.org/wiki/Finger_search), where you can turn a traversal of a tree from O(log(number of items in the tree)) into O(log(number of items between the finger and the key)).
| uyt wrote:
| I was going to link the same thing, except I would say he is talking about exactly that without knowing the academic name for it.
| namibj wrote:
| A related thing: exponential search. It's the sorted-array equivalent.
| mabbo wrote:
| I love little optimizations like this.
|
| What I'm reading is that the node in the b-tree remembers the most recently found index during search, and if the next search happens to be for the same value or one very close, the binary search goes much faster. This only helps in situations where searches correlated in time tend to have nearby key values, but that's probably very common.
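A minimal sketch of the hint-first search being described, assuming integer keys and a hypothetical pathHint type (illustrative only, not tidwall's implementation; per the article, the hint is owned by the caller rather than stored in the node): each level of the descent first probes the index remembered from the previous lookup and only falls back to a full binary search when the probe misses.

    package main

    import (
        "fmt"
        "sort"
    )

    // pathHint holds one remembered search index per tree level from
    // the previous lookup.
    type pathHint struct {
        idx []int
    }

    // searchNode locates key inside one node's sorted key slice. It
    // first probes the index remembered for this depth; only when that
    // probe misses does it fall back to a normal binary search, and it
    // then stores the new index so the next nearby lookup starts in
    // the right place.
    func searchNode(keys []int, key int, hint *pathHint, depth int) (pos int, found bool) {
        for len(hint.idx) <= depth {
            hint.idx = append(hint.idx, 0)
        }
        h := hint.idx[depth]
        if h < len(keys) {
            if keys[h] == key {
                return h, true // hint was exactly right
            }
            if keys[h] < key && (h+1 == len(keys) || key < keys[h+1]) {
                hint.idx[depth] = h + 1 // key sits in the gap just after the hint
                return h + 1, false
            }
        }
        pos = sort.SearchInts(keys, key) // hint missed: ordinary binary search
        found = pos < len(keys) && keys[pos] == key
        hint.idx[depth] = pos
        return pos, found
    }

    func main() {
        keys := []int{10, 20, 30, 40, 50, 60, 70, 80} // one node's keys
        var hint pathHint
        for _, k := range []int{40, 41, 40, 50} { // clustered lookups reuse the hint
            pos, ok := searchNode(keys, k, &hint, 0)
            fmt.Println(k, pos, ok)
        }
    }

In a full B-tree the same hint slice would carry one index per level of the root-to-leaf path, which is what makes repeated lookups of nearby keys nearly comparison-free.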
| anonymouse008 wrote:
| I'm not qualified to truly understand this, but when Lex first interviewed Jim Keller, Jim basically said regarding processor design -- 'yeah, if you just guess the same result as last time, you get a 50% speed increase.'
|
| First Interview (where that poorly paraphrased quote resides): https://www.youtube.com/watch?v=Nb2tebYAaOA&t=13s
|
| Second Interview: https://www.youtube.com/watch?v=G4hL5Om4IJ4
| sitkack wrote:
| The Weather Prediction Algorithm: the weather will be the same as yesterday. Only wrong on transitions, very useful when you have runs of the same state. After that you implement a Markov Chain [1]
|
| [1] https://en.wikipedia.org/wiki/Markov_chain
| Symmetry wrote:
| I'm sort of surprised that the gains are so high. I'd tend to expect the cost of a pointer dereference in a large data structure to be bigger than the comparisons, even with the attendant branch mispredictions.
|
| EDIT: Oh, duh, if the path hints are useful then cache locality means that the data you're looking for is already nearby. If the hint works, the cost of the lookup will actually be dominated by the comparisons.
| iratewizard wrote:
| I wonder how this would compare to a B-tree where you've explicitly loaded large branches of the tree into CPU cache and perform multiple comparisons in a single CPU cycle.
| warmfuzzykitten wrote:
| The "path hint" sounds a lot like a cursor. Since a lot of b-tree traversal is ordered, knowing the last position in the tree is often valuable information.
| bob1029 wrote:
| I think you would find the splay tree provides similar properties, potentially in an even more elegant way.
|
| For applications I have used it for, this technique results in the hottest data eventually being condensed into a smaller number of contiguous storage blocks.
|
| The power of this technique is that it amortizes the optimization of the tree across _all_ accesses (reads), not just inserts & deletions (or other special-purpose re-balancing operations).
| foobiekr wrote:
| I love splay trees, skew heaps, etc. -- data structures that get very little love.
|
| But I've actually tried to use splay trees in production code. The performance benefits versus a simple Patricia tree seemed totally absent in actual practice, even for things like very frequent adjacent queries. The structure is neat, but paying attention to memory layout and cache line size is far, far more important.
| zinclozenge wrote:
| One of the things that surprised me was that linear search on optimally laid out data was frequently faster than binary search with frequent pointer chasing.
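A sketch of the two in-node strategies ccleve and zinclozenge compare, assuming a small array of integer keys; the crossover point between the linear scan and the binary search depends on the key type and the hardware, so this is only meant to show the shape of each approach.

    package main

    import (
        "fmt"
        "sort"
    )

    // linearSearch scans the sorted keys left to right; for small,
    // contiguous nodes this is cache- and branch-predictor friendly
    // and straightforward to vectorize.
    func linearSearch(keys []int, key int) int {
        for i, k := range keys {
            if k >= key {
                return i
            }
        }
        return len(keys)
    }

    // binarySearch returns the same insertion position via binary search.
    func binarySearch(keys []int, key int) int {
        return sort.SearchInts(keys, key)
    }

    func main() {
        node := []int{3, 9, 14, 21, 27, 33, 41, 58} // one small B-tree node
        fmt.Println(linearSearch(node, 21), binarySearch(node, 21)) // 3 3
    }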
| bob1029 wrote:
| I use splay trees for their indirect utility. That is, the splay tree is a trivial way to calculate minimal, incremental updates to an append-only log. With most (all?) other b-tree structures, you would need to rewrite an unsustainable (or otherwise unpredictable) amount of tree data to the append-only log each time it is modified. The magic is that the splay operation redefines the root node every time you access the tree, so you just have to keep track of all the nodes you visited in each transaction batch and then write that sub-tree to the end of the log.
|
| > paying attention to memory layout and cache line size is far, far more important.
|
| Totally agree. For scenarios where I know I am going to have fewer than ~100k of something (assume an array of structs), I typically just do a linear scan over memory. If it's a bunch of objects in memory or a cold access to storage, the trees become far more practical.
| jsnell wrote:
| I don't think this is about hot-data optimization as such though, but about very short-term local clustering of keys.
|
| I.e. it's not that a:b:c and a:b:d are globally accessed more often than x:y:z, but that after a:b:c is accessed it's more likely that the next access is a:b:d than x:y:z.
| eternalban wrote:
| The tree structure is invariant across the balancing/search algorithms. Maybe you can do splay on loads and then rebalance. Q is whether this ends up being more costly over time. Now I wonder if there can be a hybrid splay with distinct sub-roots.
| crdrost wrote:
| It's also that they are different data structures used historically for different things, no?
|
| It'd be interesting to see a high-fanout splay tree; not even 100% sure how it'd work. Do you splay just the one node up to the root and have to split/combine each splay? Or do you only splay the first node in a "chunk" up? Or the whole "chunk"? Is it better to do a B+ type of thing rather than a B-tree type of thing?
| bob1029 wrote:
| In theory, this could provide dramatic performance uplift if the splay is done along lines of an aggregate value function over the node contents, and the contents of the node are selected such that their aggregate value function is well-defined & ordered relative to other nodes. I've got an implementation I might try this with. There are a ton of directions you could take this in depending on specifically what you are trying to optimize for.
|
| Looking at this from a mechanical sympathy perspective - if you are unable to batch tree operations for whatever reason, increasing fanout is a good intermediate answer since you can make your storage system do more effective work per I/O. You can't ever read or write in units smaller than 1 storage block, so maybe shoving more logical items into each block is where you need to look.
|
| Practically, this is probably not feasible in a wide number of use cases. GUID/UUID keys (non-linear) would break this scheme in my mind.
| kolbe wrote:
| I love this. In general, I've found you can do amazing things to speed up common algorithms by tailoring them to your use case. It usually feels hacky, but I've found it to be very effective.
| kache_ wrote:
| I love b-trees :)
| infogulch wrote:
| Another B-tree-related post I appreciated recently: Evolution of tree data structures for indexing - https://news.ycombinator.com/item?id=27963753
|
| I'd be interested in more posts on LSM trees.
| kazinator wrote:
| An interesting idea would be to imitate filesystems and provide an actual "current working node" concept to the application. The current working node is a hint which contains a pointer to a B-tree node. This could be the leaf node where some item was previously found, or one node higher above the leaf level.
|
| When the search for an item fails from the current working node, it falls back on earnestly searching from the root.
|
| If a node is split or merged by insertions or deletions, then if any current working node pointers to it exist, they get invalidated or updated somehow. (The B-tree object could carry a small dynamic set of current working node items which it can update when the tree mutates.)
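A small illustrative sketch of that "current working node" idea, using hypothetical node/tree types rather than any existing API: remember the leaf the previous lookup ended at, try it first when the key falls inside its range, and otherwise descend from the root. As the comment notes, a real version would also have to invalidate the remembered node on splits and merges.

    package main

    import (
        "fmt"
        "sort"
    )

    type node struct {
        keys     []int
        children []*node // nil for leaf nodes
    }

    type tree struct {
        root    *node
        working *node // "current working node": the last leaf a lookup visited
    }

    // get tries the remembered leaf first and only searches from the
    // root when the key falls outside that leaf's key range. A real
    // version must also clear t.working whenever a node is split or
    // merged.
    func (t *tree) get(key int) bool {
        if w := t.working; w != nil && len(w.keys) > 0 &&
            key >= w.keys[0] && key <= w.keys[len(w.keys)-1] {
            i := sort.SearchInts(w.keys, key)
            return i < len(w.keys) && w.keys[i] == key
        }
        n := t.root // fall back to earnestly searching from the root
        for n.children != nil {
            i := sort.SearchInts(n.keys, key)
            if i < len(n.keys) && n.keys[i] == key {
                return true
            }
            n = n.children[i]
        }
        t.working = n // remember this leaf for the next lookup
        i := sort.SearchInts(n.keys, key)
        return i < len(n.keys) && n.keys[i] == key
    }

    func main() {
        leafA := &node{keys: []int{1, 2, 3}}
        leafB := &node{keys: []int{7, 8, 9}}
        t := &tree{root: &node{keys: []int{5}, children: []*node{leafA, leafB}}}
        fmt.Println(t.get(8), t.get(7), t.get(9)) // the last two reuse the working leaf
    }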
| [deleted]
| jeremyis wrote:
| I really enjoyed this post and hope I get more like it in the future:
|
| - It gave a high-level, concise background overview of the base concept (B-trees), complete with diagrams
|
| - It talked from first principles about how things worked, without really any jargon
|
| - It explained simply a clever optimization
|
| Thanks for posting!
| srcreigh wrote:
| I'm not sure this is useful for a traditional database. There are lots of costs that are orders of magnitude more important.
|
| This is a CPU optimization; however, in a DB the major costs are 1) pulling blocks from disk 2) transferring results over a network.
|
| Even if you assume nodes are cached in memory, a B-tree node for a database probably fits in L1 cache. SQLite page size is 4096 bytes by default vs an 8KB+ L1 cache size. Shouldn't you pay 100x the cost just to load the next node from memory, if you have a large tree? In theory that's <1% faster overall.
|
| I'm curious whether the 3x improvement is overall, or just from the binary search. (According to the benchmark it is overall 2x faster - very curious.)
|
| Also curious about the size of the tree in the benchmarks and the spread factor (i.e. how many children each B-tree node has). I couldn't find this info in the repo. This could affect the % of time you can find the next node already in a cache.
| srcreigh wrote:
| I found the answers - the tree uses 256 children by default, implying ~16KB node size on a 64-bit architecture. The size of the tree in the benchmark is 1M nodes, so ~16GB tree. Pretty hefty tree! :)
|
| It's the sequential insert/get benchmarks which benefit the most from path hints. You'd only have 1 hot path down the tree in this setup (the rightmost path). Is it possible the CPU is caching multiple non-contiguous blocks (i.e. the hot path)? That would explain why memory access doesn't have as much of a factor as I originally thought.
|
| In other words - this technique may be useful in DBs! Disk and memory are both heavily cached in the CPU, so CPU optimizations can help for sequential/repeated local accesses
| tidwall wrote:
| It's 1M items, not nodes. Each item is a Go interface{}, about 16 bytes each. The benchmarks actually show that the tree is around 36MB.
| sroussey wrote:
| Back in the 1990s, I did something similar with SQL and it worked wonders.
| vvern wrote:
| It feels to me like this is just a kludge added to deal with the lack of a stateful iterator on top of the tree. In this use case, the author indicates that it is about exploiting locality in a request path. Imagine that the tree offered an iterator that maintained a stack of its path and that thing had a seek method. You can optimize that thing to only go back up to the root if it needs to. Stateful iterators make for a nice pairing with btrees. These other hints are papering over the lack of that abstraction as far as I can tell.
| anderskaseorg wrote:
| With a stateful iterator, you might need to worry about your cached pointers being invalidated by mutations from other callers.
| Ar-Curunir wrote:
| Not to sound like a member of the Rust Evangelism Strike Force, but Rust is able to offer stateful iterators over its `BTree` exactly by preventing mutation during iteration.
| oscardssmith wrote:
| Doesn't that mean that iteration requires a write lock? That sounds bad for lots of applications.
| Ar-Curunir wrote:
| The write-lock is checked entirely at compile time, so there is no performance overhead.
| jvanderbot wrote:
| In the mid 80s to 90s, there was a lot of interest around "query responsive" data structures. There still is, I'm sure, but that was what I studied and I've moved on since then.
|
| Of these, the fibbonaci and pairing heaps were the most well known products, I reckon.
|
| They essentially move "Active" parts of the data structure upwards near the top of the tree. The use case is different (being heaps), but I wouldn't be surprised if that line of R&D has continued, or if there's a huge gap in performance that can be reclaimed with "intelligent" guesses like this one.
|
| I'm rambling and conjecturing, but maybe someday branch prediction will get a huge boost to its already tangible smarts and we'll see another explosion in practical CPU throughput, combined with a little help from algs and data structures, of course.
| svnpenn wrote:
| > fibbonaci
| gopalv wrote:
| > we'll see another explosion in practical CPU throughput, combined with a little help from algs and data structures
|
| The learned index [1] is basically a variation of this sort of "hint" which relies on getting the right answer most of the time, but from learning the distribution function as a model instead of relying on the programmer to heuristic it.
|
| (from that paper)
|
| > The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records
|
| I haven't seen anything mix this sort of work with an Eytzinger traversal, but there's always diminishing returns for these.
|
| [1] - https://arxiv.org/abs/1712.01208
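The "predict the position, then correct locally" idea can be sketched in a few lines. A learned index fits a model to the key distribution; plain linear interpolation, used below with hypothetical names, is only the simplest stand-in for such a model.

    package main

    import "fmt"

    // predictPos guesses where key should sit in the sorted slice by
    // linear interpolation between the smallest and largest keys.
    func predictPos(keys []int, key int) int {
        lo, hi := keys[0], keys[len(keys)-1]
        if key <= lo {
            return 0
        }
        if key >= hi {
            return len(keys) - 1
        }
        return (key - lo) * (len(keys) - 1) / (hi - lo)
    }

    // lookup starts at the predicted position and repairs any error
    // with a short local scan; a learned index bounds this error with
    // a model trained on the actual key distribution.
    func lookup(keys []int, key int) (int, bool) {
        i := predictPos(keys, key)
        for i > 0 && keys[i] > key {
            i--
        }
        for i < len(keys)-1 && keys[i] < key {
            i++
        }
        return i, keys[i] == key
    }

    func main() {
        keys := []int{0, 10, 20, 30, 40, 50, 60, 70, 80, 90}
        fmt.Println(lookup(keys, 70)) // prediction is exact: 7 true
        fmt.Println(lookup(keys, 35)) // prediction is one off, fixed by the scan: 4 false
    }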
| nmwnmw wrote:
| My personal favorite of these is the Splay Tree [1]. I've never used it in production, but it's really simple to implement and reason about. Though my main experience was trying (and failing) to prove amortized bounds on the operations.
|
| [1] https://en.wikipedia.org/wiki/Splay_tree
| scottlamb wrote:
| Does that mean any access to these data structures needs an exclusive lock? One advantage to the "path hints" approach described in the article is that the path hints can be stored separately from the btree itself. The author writes:
|
| > For single-threaded programs, it's possible to use one shared path hint per B-tree for the life of the program.
|
| > For multi-threaded programs, I find it best to use one path hint per B-tree, per thread.
|
| > For server-client programs, one path hint per B-tree, per client should suffice.
|
| Seems likely multiple threads/clients would have different access patterns anyway, so the hints might actually be better this way in addition to more concurrent.
|
| edit: also, this approach feels a lot lighter-weight to me than mutating the tree: I can imagine the mutations involving frequent allocation/deallocation, where this hint is just a vector that grows to the maximum depth and stays there.
| bradleyjg wrote:
| Had the same thought. Making every access RW seems like a cure worse than the disease in a multithreaded situation.
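A sketch of the "one path hint per thread" pattern quoted above, with hypothetical Tree/PathHint/GetHint stand-ins rather than the real package API: the tree is shared behind a read lock, while each goroutine owns its own hint, so the hints themselves need no synchronization.

    package main

    import (
        "fmt"
        "sync"
    )

    // PathHint is per-caller scratch state; because each goroutine
    // owns its own, hints never need locking even though the tree is
    // shared.
    type PathHint struct {
        idx []int
    }

    // Tree is a stand-in for a hint-aware B-tree guarded by a RWMutex.
    type Tree struct {
        mu   sync.RWMutex
        keys []int
    }

    // GetHint reports whether key is present, using the caller-owned
    // hint. A real implementation would thread the hint through each
    // node's search; a plain scan keeps this sketch focused on the
    // ownership pattern.
    func (t *Tree) GetHint(key int, hint *PathHint) bool {
        t.mu.RLock()
        defer t.mu.RUnlock()
        for _, k := range t.keys {
            if k == key {
                return true
            }
        }
        return false
    }

    func main() {
        tr := &Tree{keys: []int{1, 2, 3, 4, 5}}
        var wg sync.WaitGroup
        for w := 0; w < 4; w++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                var hint PathHint // one hint per goroutine, never shared
                for k := 0; k < 8; k++ {
                    tr.GetHint(k, &hint)
                }
                fmt.Println("worker", id, "done")
            }(w)
        }
        wg.Wait()
    }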
| jvanderbot wrote:
| > mid 80s
| nomy99 wrote:
| I think OP is using the wrong data structure, and his fix is bringing him closer to the ideal solution. You need a hashmap function that returns a search range to direct the binary search on the B-tree. This lets you add logic on how you want data to be searched in the future. In the base case it will work as the B-tree hint.
| vecter wrote:
| In case others are unclear, since it doesn't seem to be explicitly stated in the docs: you update the path hint to be the path of the most recently found item. I think this is implied when he says "My path may be wrong, and if so please provide me with the correct path so I get it right the next time.", but you can see where `hint` is set in the implementation also.
| ot wrote:
| The C++ STL has something similar; insert operations accept an iterator as a hint:
|
| https://en.cppreference.com/w/cpp/container/map/insert
|
| In common implementations iterators are just node pointers, which is enough because nodes have parent pointers, so you can recover the whole path.
|
| In extreme cases like ordered insertion this effectively shaves off a log(n) factor.
| uyt wrote:
| I really like that the hint is optional.
|
| You don't always want the hint because if you're jumping _d_ steps ahead, you actually need to touch 2 log(d) nodes, once to walk up and another to walk down. This means anytime you're moving more than d = sqrt(n) steps away, you're better off just walking down from the root instead.
|
| For example, if n = 1000000 and the hot section is wider than d = 1000, you're better off going from the root all the time. Of course, if you access stuff that is directly adjacent you should still use in-order iteration for those (but I see that as a ++ operator, not a separate lookup).
| willvarfar wrote:
| Wow! Obvious in hindsight!
|
| Do any databases use this or similar tricks?
| jlokier wrote:
| Yes. Even better, a good database will avoid traversing the blocks and path when the key is close enough to the previous lookup's hint, making all but the first lookup in a nearly-consecutive sequence take O(1) instead of O(log N), where N is the size of the tree, and runs of n nearly-consecutive operations take O(n + log N) instead of O(n log N).
| bartlomieju wrote:
| BuntDB [0] from @tidwall uses this package as a backing data structure. And BuntDB is in turn used by Tile38 [1]
|
| [0] https://github.com/tidwall/buntdb
| [1] https://github.com/tidwall/tile38
___________________________________________________________________
(page generated 2021-07-30 23:00 UTC)