[HN Gopher] B-tree Path Hints
       ___________________________________________________________________
        
       B-tree Path Hints
        
       Author : eatonphil
       Score  : 230 points
       Date   : 2021-07-30 14:46 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ccleve wrote:
       | It's a cool idea, but you have to keep in mind that binary
       | searches are not optimal for b-tree nodes unless the node is
       | really big. If you have a smaller node, < 32 or 64 elements, a
       | simple linear search is faster. The reason has something to do
       | with processor cache, branch mis-prediction, and pipelining.
       | 
       | A linear search can also be optimized to use SIMD, which is a big
       | win.
       | 
       | If you really want to use a path hint, you could just say "start
       | my linear search at the point where I left off last time", and if
       | the first element is too big, wrap around to the beginning. But
       | that might not buy you much.
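        | 
        | For illustration, a rough Go sketch of that wrap-around linear
        | search (assuming the node keeps its keys in a sorted slice;
        | names here are made up):
        | 
        |     // linearSearchFrom looks for target in the sorted slice
        |     // keys, starting at the position remembered from the last
        |     // search. If the hinted key is already too big, it wraps
        |     // around and scans from the start. It returns the index of
        |     // the first key >= target and whether that key is target.
        |     func linearSearchFrom(keys []int, target, hint int) (int, bool) {
        |         i := 0
        |         if hint < len(keys) && keys[hint] <= target {
        |             i = hint // resume where the last search left off
        |         }
        |         for ; i < len(keys); i++ {
        |             if keys[i] >= target {
        |                 return i, keys[i] == target
        |             }
        |         }
        |         return len(keys), false
        |     }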
        
         | ardit33 wrote:
          | I agree.... the whole point of a b-tree is to already subdivide
          | the dataset into smaller, manageable sets. A binary search won't
          | be much help on those slices if the tree is constructed properly.
        
           | jldugger wrote:
           | As I understand it, the idea is that a path hint is a
           | suggestion on which node to scan first, not a suggestion on
           | how to binary search an individual node.
        
             | jsnell wrote:
             | No, it'll still do a scan all the way from the root node
             | down to the leaf nodes. The path is simply saying which
             | element to start the binary search from, in each of the
             | nodes.
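              | 
              | Roughly, the per-node lookup is something like this (a
              | minimal Go sketch of the idea, not the library's actual
              | code):
              | 
              |     // searchWithHint returns the first index whose key is
              |     // >= target. It probes the hinted index first; if the
              |     // guess is exact it returns immediately, otherwise it
              |     // narrows the binary search to one side of the hint.
              |     func searchWithHint(keys []int, target, hint int) int {
              |         lo, hi := 0, len(keys)
              |         if hint < len(keys) {
              |             if keys[hint] == target {
              |                 return hint // repeat of the last search
              |             }
              |             if keys[hint] < target {
              |                 lo = hint + 1 // answer is right of the hint
              |             } else {
              |                 hi = hint // answer is at or left of the hint
              |             }
              |         }
              |         for lo < hi {
              |             mid := (lo + hi) / 2
              |             if keys[mid] < target {
              |                 lo = mid + 1
              |             } else {
              |                 hi = mid
              |             }
              |         }
              |         return lo
              |     }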
        
         | jleahy wrote:
         | The something is data dependencies.
        
         | nextaccountic wrote:
          | Isn't the binary search here taken over the whole tree, spanning
         | multiple nodes? Each time we follow a pointer we are discarding
         | a huge swath of the tree.
         | 
         | In a single node, a linear search is probably better, yes.
        
         | scottlamb wrote:
         | > It's a cool idea, but you have to keep in mind that binary
         | searches are not optimal for b-tree nodes unless the node is
         | really big. If you have a smaller node, < 32 or 64 elements, a
         | simple linear search is faster. The reason has something to do
         | with processor cache, branch mis-prediction, and pipelining.
         | 
         | I've heard that--and iirc Rust's BTreeMap type uses linear
         | search--but does this depend on the type of key? Easy for me to
         | see how that might be true for a BTreeMap<u64, V> but not a
         | BTreeMap<String, V>. In the latter case, there's a pointer to
         | chase for each comparison (and more/variable data to process;
         | SIMD over multiple keys doesn't seem likely).
        
           | ccleve wrote:
           | Yes, true. If your comparisons involve pointer-chasing or
           | string comparison or anything complex, then it's not clear
           | which strategy is faster. I always write the simplest
           | possible code first, which is usually a linear search,
           | benchmark it, and then try to beat the benchmark if it's
           | warranted.
        
       | ntonozzi wrote:
       | A similar idea is finger search
       | (https://en.wikipedia.org/wiki/Finger_search), where you can turn
       | a traversal of a tree from O(log(number of items in the tree))
       | into O(log(number of items between the finger and the key)).
        
         | uyt wrote:
         | I was going to link the same thing except I would say he is
         | talking about exactly that without knowing the academic name
         | for it.
        
         | namibj wrote:
         | A related thing: exponential search. It's the sorted-array
         | equivalent.
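          | 
          | A quick Go sketch of that idea, stepping outward from a finger
          | position in a sorted slice (illustrative only; it assumes the
          | target is at or after the finger):
          | 
          |     import "sort"
          | 
          |     // expSearchFrom returns the first index at or after finger
          |     // whose value is >= target, taking exponentially growing
          |     // steps to bracket the target and then binary-searching
          |     // the bracket. Cost is O(log d) in the distance d from
          |     // the finger.
          |     func expSearchFrom(a []int, target, finger int) int {
          |         if finger >= len(a) {
          |             return len(a)
          |         }
          |         if a[finger] >= target {
          |             return finger
          |         }
          |         lo, step := finger, 1
          |         hi := finger + step
          |         for hi < len(a) && a[hi] < target {
          |             lo = hi
          |             step *= 2
          |             hi = finger + step
          |         }
          |         if hi > len(a) {
          |             hi = len(a)
          |         }
          |         return lo + sort.SearchInts(a[lo:hi], target)
          |     }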
        
       | mabbo wrote:
       | I love little optimizations like this.
       | 
       | What I'm reading is that the node in the b-tree remembers the
       | most recently found index during search, and if the next search
       | happens to be for the same value or one very close, the binary
       | search goes much faster. This only helps in situations where
       | searches correlated in time tend to have nearby key values, but
       | that's probably very common.
        
         | anonymouse008 wrote:
         | I'm not qualified to truly understand this, but when Lex first
         | interviewed Jim Keller, Jim basically said regarding processor
         | design -- 'yeah, if you just guess the same result as last
         | time, you get a 50% speed increase.'
         | 
         | First Interview (where that poorly paraphrased quote resides):
         | https://www.youtube.com/watch?v=Nb2tebYAaOA&t=13s
         | 
         | Second Interview: https://www.youtube.com/watch?v=G4hL5Om4IJ4
        
           | sitkack wrote:
            | The weather prediction algorithm: the weather will be the
            | same as it was yesterday. It's only wrong on transitions,
            | which makes it very useful when you have runs of the same
            | state. After that you implement a Markov chain [1].
           | 
           | https://en.wikipedia.org/wiki/Markov_chain
        
       | Symmetry wrote:
       | I'm sort of surprised that the gains are so high. I'd tend to
        | expect the cost of a pointer dereference in a large data structure
       | to be bigger than the comparisons even with the attendant branch
       | mispredictions.
       | 
       | EDIT: Oh, duh, if the path hints are useful then cache locality
       | means that the data you're looking for is already nearby. If the
       | hint works the cost of the lookup will actually be dominated by
       | the comparisons.
        
       | iratewizard wrote:
        | I wonder how this would compare to a B-Tree where you've
        | explicitly loaded large branches of the tree into CPU cache and
        | can perform multiple comparisons in a single CPU cycle.
        
       | warmfuzzykitten wrote:
       | The "path hint" sounds a lot like a cursor. Since a lot of b-tree
       | traversal is ordered, knowing the last position in the tree is
       | often valuable information.
        
       | bob1029 wrote:
       | I think you would find the splay tree provides similar
       | properties, potentially in an even more elegant way.
       | 
       | For applications I have used it for, this technique results in
       | the hottest data eventually being condensed into a smaller number
       | of contiguous storage blocks.
       | 
        | The power of this technique is that it amortizes the optimization
        | of the tree across _all_ accesses (reads), not just inserts &
        | deletions (or other special-purpose re-balancing operations).
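        | 
        | To make that concrete, here's a rough Go sketch of the
        | move-to-root heuristic behind splaying, on a plain binary search
        | tree (a real splay tree also does zig-zig/zig-zag rotations, and
        | the article's structure is a B-tree, not a BST):
        | 
        |     type node struct {
        |         key         int
        |         left, right *node
        |     }
        | 
        |     func rotateRight(n *node) *node {
        |         l := n.left
        |         n.left, l.right = l.right, n
        |         return l
        |     }
        | 
        |     func rotateLeft(n *node) *node {
        |         r := n.right
        |         n.right, r.left = r.left, n
        |         return r
        |     }
        | 
        |     // access searches for key and rotates the node it stops at
        |     // toward the root, so recently touched keys drift upward.
        |     // Call it as root = access(root, key).
        |     func access(root *node, key int) *node {
        |         if root == nil || root.key == key {
        |             return root
        |         }
        |         if key < root.key {
        |             if root.left == nil {
        |                 return root
        |             }
        |             root.left = access(root.left, key)
        |             return rotateRight(root)
        |         }
        |         if root.right == nil {
        |             return root
        |         }
        |         root.right = access(root.right, key)
        |         return rotateLeft(root)
        |     }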
        
         | foobiekr wrote:
         | I love splay trees, skew heaps, etc. data structures that get
         | very little love.
         | 
         | But I've actually tried to use splay trees in production code.
         | The performance benefits versus a simple Patricia tree seemed
         | totally absent in actual practice even for things like very
         | frequent adjacent queries. The structure is neat but paying
         | attention to memory layout and cache line size is far, far more
         | important.
        
           | zinclozenge wrote:
           | One of the things that surprised me was that linear search on
           | optimally laid out data was frequently faster than binary
           | search with frequent pointer chasing.
        
           | bob1029 wrote:
           | I use splay trees for their indirect utility. That is, the
           | splay tree is a trivial way to calculate minimal, incremental
           | updates to an append-only log. With most (all?) other b-tree
           | structures, you would need to rewrite an unsustainable (or
           | otherwise unpredictable) amount of tree data to the append-
           | only log each time it is modified. The magic is that the
           | splay operation redefines the root node every time you access
           | the tree, so you just have to keep track of all the nodes you
           | visited in each transaction batch and then write that sub-
           | tree to the end of the log.
           | 
           | >paying attention to memory layout and cache line size is
           | far, far more important.
           | 
           | Totally agree. For scenarios where I know I am going to have
           | fewer than ~100k of something (assume an array of structs), I
            | typically just do a linear scan over memory. If it's a bunch
           | of objects in memory or a cold access to storage, the trees
           | become far more practical.
        
         | jsnell wrote:
         | I don't think this is about hot-data optimization as such
         | though, but about very short-term local clustering of keys.
         | 
         | I.e. it's not that a:b:c and a:b:d are globally accessed more
         | often than x:y:z, but that after a:b:c is accessed it's more
         | likely that the next access is a:b:d than x:y:z.
        
           | eternalban wrote:
           | The tree structure is invariant across the balancing/search
           | algorithms. Maybe you can do splay on loads and then
           | rebalance. Q is whether this ends up being more costly over
            | time. Now I wonder if there can be a hybrid splay with
           | distinct sub-roots.
        
           | crdrost wrote:
           | It's also that they are different data structures used
           | historically for different things, no?
           | 
           | It'd be interesting to see a high-fanout splay tree, not even
           | 100% sure how it'd work. Do you splay just the one node up to
           | the root and have to split/combine each splay? Or do you only
           | splay the first node in a "chunk" up? Or the whole "chunk"?
           | Is it better to do a B+ type of thing rather than a B-tree
           | type of thing?
        
             | bob1029 wrote:
             | In theory, this could provide dramatic performance uplift
             | if the splay is done along lines of an aggregate value
             | function over the node contents, and the contents of the
             | node are selected such that their aggregate value function
             | is well-defined & ordered relative to other nodes. I've got
             | an implementation I might try this with. There are a ton of
             | directions you could take this in depending on specifically
             | what you are trying to optimize for.
             | 
             | Looking at this from a mechanical sympathy perspective - If
             | you are unable to batch tree operations for whatever
             | reason, increasing fanout is a good intermediate answer
             | since you can make your storage system do more effective
             | work per I/O. You can't ever read or write in units smaller
             | than 1 storage block, so maybe shoving more logical items
             | into each block is where you need to look.
             | 
              | Practically, this is probably not feasible in many use
              | cases. GUID/UUID keys (non-linear) would break this scheme
              | in my mind.
        
       | kolbe wrote:
        | I love this. In general, I've found you can do amazing things to
       | speed up common algorithms by tailoring them to your use case. It
       | usually feels hacky, but I've found it to be very effective.
        
       | kache_ wrote:
       | I love b trees :)
        
       | infogulch wrote:
       | Another B-tree-related post I appreciated recently: Evolution of
       | tree data structures for indexing -
       | https://news.ycombinator.com/item?id=27963753
       | 
       | I'd be interested in more posts on LSM trees.
        
       | kazinator wrote:
       | An interesting idea would be to imitate filesystems and provide
       | an actual "current working node" concept to the application. The
       | current working node is a hint which contains a pointer to a
       | BTree node. This could be the leaf node where some item was
       | previously found, or one node higher above the leaf level.
       | 
       | When the search for an item fails from the current working node,
       | it falls back on earnestly searching from the root.
       | 
       | If a node is split or merged by insertions or deletions, then if
       | any current working node pointers to it exist, they get
       | invalidated or updated somehow. (The B-Tree object could carry a
       | small dynamic set of current working node items which it can
       | update when the tree mutates.)
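        | 
        | A rough Go sketch of that scheme on a toy node type (the
        | invalidation bookkeeping is left out, and this is not the
        | article's library):
        | 
        |     import "sort"
        | 
        |     type item struct{ key, value int }
        | 
        |     type bnode struct {
        |         items    []item   // sorted by key
        |         children []*bnode // empty for leaves
        |     }
        | 
        |     type tree struct {
        |         root *bnode
        |         cwn  *bnode // current working node; may be stale
        |     }
        | 
        |     // search walks down from start, returning the value, whether
        |     // the key was found, and the node where the walk stopped.
        |     func search(start *bnode, key int) (int, bool, *bnode) {
        |         n := start
        |         for {
        |             i := sort.Search(len(n.items), func(i int) bool {
        |                 return n.items[i].key >= key
        |             })
        |             if i < len(n.items) && n.items[i].key == key {
        |                 return n.items[i].value, true, n
        |             }
        |             if len(n.children) == 0 {
        |                 return 0, false, n
        |             }
        |             n = n.children[i]
        |         }
        |     }
        | 
        |     // Get tries the current working node first; on a miss it
        |     // falls back to an earnest search from the root, then
        |     // remembers the node it stopped at as the new hint.
        |     func (t *tree) Get(key int) (int, bool) {
        |         if t.root == nil {
        |             return 0, false
        |         }
        |         if t.cwn != nil {
        |             if v, ok, n := search(t.cwn, key); ok {
        |                 t.cwn = n
        |                 return v, true
        |             }
        |         }
        |         v, ok, n := search(t.root, key)
        |         t.cwn = n
        |         return v, ok
        |     }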
        
       | [deleted]
        
       | jeremyis wrote:
       | I really enjoyed this post and hope I get more like it in the
       | future:
       | 
       | - It gave a high level, concise background overview of the base
       | concept (B-trees) complete with diagrams
       | 
       | - It talked from first principles about how things worked without
        | really any jargon
       | 
       | - It explained simply a clever optimization
       | 
       | Thanks for posting!
        
       | srcreigh wrote:
       | I'm not sure this is useful for a traditional database. There are
       | lots of costs that are orders of magnitude more important.
       | 
        | This is a CPU optimization; however, in a DB the major costs are
        | 1) pulling blocks from disk and 2) transferring results over the
        | network.
       | 
        | Even if you assume nodes are cached in memory, a B-Tree node for
        | a database probably fits in L1 cache. SQLite page size is 4096
        | bytes by default vs 8KB+ L1 cache size. Shouldn't you pay 100x
        | the cost just to load the next node from memory if you have a
        | large tree? In theory that would make this <1% faster overall.
       | 
        | I'm curious whether the 3x improvement is overall, or just from
        | the binary search. (According to the benchmark it is overall 2x
        | faster - very curious.)
       | 
        | Also curious about the size of the tree in the benchmarks and the
        | spread factor (i.e. how many children each B-Tree node has). I
        | couldn't find this info in the repo. This could affect the % of
        | time you can find the next node already in a cache.
        
         | srcreigh wrote:
         | I found the answers - the tree uses 256 children by default,
         | implying ~16kb node size on 64 bit architecture. The size of
         | the tree in the benchmark is 1M nodes, so ~16GB tree. Pretty
         | hefty tree! :)
         | 
          | It's the sequential insert/get benchmarks which benefit the
          | most from path hints. You'd only have 1 hot path down the tree
          | in this setup (the rightmost path). Is it possible the CPU is
          | caching multiple non-contiguous blocks (i.e. the hot path)? That
         | would explain why memory access doesn't have as much of a
         | factor as I originally thought.
         | 
         | In other words - this technique may be useful in DBs! Disk and
         | memory are both heavily cached in the CPU, so CPU optimizations
          | can help for sequential/repeated local accesses.
        
           | tidwall wrote:
            | It's 1M items, not nodes. Each item is a Go interface{},
            | about 16 bytes each. The benchmarks actually show that the
           | tree is around 36MB.
        
       | sroussey wrote:
       | Back in the 1990s, I did something similar with SQL and it worked
       | wonders.
        
       | vvern wrote:
        | It feels to me like this is just a kludge added to deal with a lack
       | of a stateful iterator on top of the tree. In this use case, the
       | author indicates that it is about exploiting locality in a
       | request path. Imagine that the tree offered an iterator that
       | maintained a stack of its path and that thing had a seek method.
       | You can optimize that thing to only go back up to the root if it
       | needs to. Stateful iterators make for a nice pairing with btrees.
       | These other hints are papering over the lack of that abstraction
       | as far as I can tell.
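        | 
        | Something like this, say (a toy Go sketch over a plain binary
        | search tree rather than a B-tree; each stack frame carries the
        | key interval its subtree can hold, so Seek knows how far back up
        | it needs to go):
        | 
        |     import "math"
        | 
        |     type node struct {
        |         key         int
        |         left, right *node
        |     }
        | 
        |     // frame is one step of the saved path: a node plus the key
        |     // interval its subtree can contain.
        |     type frame struct {
        |         n      *node
        |         lo, hi int
        |     }
        | 
        |     type iterator struct {
        |         root  *node
        |         stack []frame // path from the root to the last node
        |     }
        | 
        |     // Seek pops only the frames whose subtree cannot contain
        |     // key, then descends from there instead of from the root.
        |     func (it *iterator) Seek(key int) *node {
        |         if it.root == nil {
        |             return nil
        |         }
        |         for len(it.stack) > 0 {
        |             f := it.stack[len(it.stack)-1]
        |             if key >= f.lo && key <= f.hi {
        |                 break // this subtree can still contain key
        |             }
        |             it.stack = it.stack[:len(it.stack)-1]
        |         }
        |         var n *node
        |         lo, hi := math.MinInt, math.MaxInt
        |         if len(it.stack) == 0 {
        |             n = it.root
        |             it.stack = append(it.stack, frame{n, lo, hi})
        |         } else {
        |             top := it.stack[len(it.stack)-1]
        |             n, lo, hi = top.n, top.lo, top.hi
        |         }
        |         for n != nil {
        |             if key == n.key {
        |                 return n
        |             }
        |             if key < n.key {
        |                 hi, n = n.key-1, n.left
        |             } else {
        |                 lo, n = n.key+1, n.right
        |             }
        |             if n != nil {
        |                 it.stack = append(it.stack, frame{n, lo, hi})
        |             }
        |         }
        |         return nil // key not present
        |     }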
        
         | anderskaseorg wrote:
         | With a stateful iterator, you might need to worry about your
         | cached pointers being invalidated by mutations from other
         | callers.
        
           | Ar-Curunir wrote:
           | Not to sound like a member of the Rust Evangelism
           | StrikeForce, but Rust is able to offer stateful iterators
           | over its `BTree` exactly by preventing mutation during
           | iteration.
        
             | oscardssmith wrote:
             | Doesn't that mean that iteration requires a write lock?
             | That sounds bad for lots of applications.
        
               | Ar-Curunir wrote:
                | The write-lock is checked entirely at compile time, so
                | there's no performance overhead.
        
       | jvanderbot wrote:
       | In the mid 80s to 90s, there was a lot of interest around "query
       | responsive" data structures. There still is, I'm sure, but that
       | was what I studied and I've moved on since then.
       | 
       | Of these, the fibbonaci and pairing heaps were the most well
       | known products, I reckon.
       | 
       | They essentially move "Active" parts of the data structure
       | upwards near the top of the tree. The use case is different
       | (being heaps), but I wouldn't be surprised if that line of R&D
       | has continued, or if there's a huge gap in performance that can
       | be reclaimed with "intelligent" guesses like this one.
       | 
       | I'm rambling and conjecturing, but maybe someday branch
        | prediction will get a huge boost to its already tangible smarts
       | and we'll see another explosion in practical CPU throughput,
       | combined with a little help from algs and data structures, of
       | course.
        
         | svnpenn wrote:
         | > fibbonaci
        
         | gopalv wrote:
         | > we'll see another explosion in practical CPU throughput,
         | combined with a little help from algs and data structures
         | 
         | The learned index[1] is basically a variation of this sort of
         | "hint" which relies on getting the right answer most of the
         | time, but from learning the distribution function as a model
          | instead of relying on the programmer to hand-code a heuristic.
         | 
         | (from that paper)
         | 
         | > The key idea is that a model can learn the sort order or
         | structure of lookup keys and use this signal to effectively
         | predict the position or existence of records
         | 
          | I haven't seen anything mix this sort of work with an Eytzinger
          | traversal, but there are always diminishing returns for these.
         | 
         | [1] - https://arxiv.org/abs/1712.01208
        
         | nmwnmw wrote:
         | My personal favorite of these is the Splay Tree[1]. I've never
         | used it in production, but it's really simple to implement and
         | reason about. Though my main experience was trying (and
         | failing) to prove amortized bounds on the operations.
         | 
         | [1] https://en.wikipedia.org/wiki/Splay_tree
        
         | scottlamb wrote:
         | Does that mean any access to these data structures needs an
         | exclusive lock? One advantage to the "path hints" approach
         | described in the article is that the path hints can be stored
         | separately from the btree itself. The author writes:
         | 
         | > For single-threaded programs, it's possible to use one shared
         | path hint per B-tree for the life of the program.
         | 
         | > For multi-threaded programs, I find it best to use one path
          | hint per B-tree, per thread.
         | 
         | > For server-client programs, one path hint per B-tree, per
         | client should suffice.
         | 
         | Seems likely multiple threads/clients would have different
         | access patterns anyway, so the hints might actually be better
          | this way in addition to being more concurrent.
         | 
         | edit: also, this approach feels a lot lighter-weight to me than
         | mutating the tree: I can imagine the mutations involving
         | frequent allocation/deallocation where this hint is just a
         | vector that grows to the maximum depth and stays there.
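          | 
          | The usage pattern would be roughly this (hedged: GetHint and
          | PathHint below are stand-ins; tidwall/btree exposes hinted
          | variants along these lines, but check the repo for the exact
          | signatures):
          | 
          |     // PathHint is a stand-in for the library's per-path hint
          |     // value (per-level indexes into the tree).
          |     type PathHint struct{ /* ... */ }
          | 
          |     type HintedTree interface {
          |         GetHint(key interface{}, hint *PathHint) interface{}
          |     }
          | 
          |     // Each worker goroutine owns one hint, so the hint needs
          |     // no locking and gets trained by that worker's own access
          |     // pattern.
          |     func worker(tr HintedTree, keys <-chan interface{},
          |         out chan<- interface{}) {
          |         var hint PathHint // one hint per goroutine, reused
          |         for k := range keys {
          |             out <- tr.GetHint(k, &hint)
          |         }
          |     }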
        
           | bradleyjg wrote:
           | Had the same thought. Making every access RW seems like a
           | cure worse than the disease in a multithreaded situation.
        
             | jvanderbot wrote:
             | > mid 80s
        
       | nomy99 wrote:
        | I think OP is using the wrong data structure, and his fix is
        | bringing him closer to the ideal solution. You need a hash
        | function that returns a search range to direct the binary search
        | on the B-tree. This lets you add logic for how you want data to
        | be searched in the future. In the base case it will work like
        | the B-tree hint.
        
       | vecter wrote:
       | In case others are unclear since it doesn't seem to be explicitly
       | stated in the docs, you update the path hint to be the path of
       | the most recently found item. I think this is implied when he
       | says "My path may be wrong, and if so please provide me with the
       | correct path so I get it right the next time.", but you can see
       | where `hint` is set in the implementation also.
        
       | ot wrote:
       | The C++ STL has something similar, insert operations accept an
        | iterator as a hint:
       | 
       | https://en.cppreference.com/w/cpp/container/map/insert
       | 
       | In common implementations iterators are just node pointers, which
       | is enough because nodes have parent pointers, so you can recover
       | the whole path.
       | 
       | In extreme cases like ordered insertion this effectively shaves
       | off a log(n) factor.
        
         | uyt wrote:
         | I really like that the hint is optional.
         | 
          | You don't always want the hint because if you're jumping _d_
          | steps ahead, you actually need to touch ~2 log(d) nodes: log(d)
          | to walk up and log(d) to walk back down. This means anytime
          | you're moving more than d=sqrt(n) steps away you're better off
          | just walking down from the root instead.
         | 
          | For example with n=1000000, if the hot section is wider than
          | d=1000, you're better off going from the root all the time. Of
          | course if you access stuff that is directly adjacent you
          | should still use inorder iteration for those (but I see that as
         | a ++ operator, not a separate lookup).
        
       | willvarfar wrote:
       | Wow! Obvious in hindsight!
       | 
       | Do any databases use this or similar tricks?
        
         | jlokier wrote:
          | Yes. Even better, a good database will avoid traversing the
          | blocks and path when the key is close enough to the previous
          | lookup's hint, making all but the first lookup in a nearly-
          | consecutive sequence take O(1) instead of O(log N), where N is
          | the size of the tree, so runs of n nearly-consecutive
          | operations take O(n + log N) instead of O(n log N).
        
         | bartlomieju wrote:
         | BuntDB [0] from @tidwall uses this package as a backing data
          | structure. And BuntDB is in turn used by Tile38 [1].
         | 
         | [0] https://github.com/tidwall/buntdb [1]
         | https://github.com/tidwall/tile38
        
       ___________________________________________________________________
       (page generated 2021-07-30 23:00 UTC)