[HN Gopher] Efficiently Searching In-Memory Sorted Arrays: Reven...
       ___________________________________________________________________
        
       Efficiently Searching In-Memory Sorted Arrays: Revenge of
       Interpolation Search? [pdf]
        
       Author : greghn
       Score  : 54 points
       Date   : 2023-06-03 16:12 UTC (6 hours ago)
        
 (HTM) web link (pages.cs.wisc.edu)
 (TXT) w3m dump (pages.cs.wisc.edu)
        
       | scrubs wrote:
       | Broadly related: https://news.ycombinator.com/item?id=25899286
       | PGM Indexes: Learned indexes that match B-tree performance with
       | 83x less space
       | 
        | I might suggest, slightly pushing back on this article's PDF,
        | that binary search is not the de-facto in-memory search
        | algorithm. Cache-conscious btrees, various optimizations of
        | tries, and radix trees are better. Eytzinger arrays are a good
        | BST option if data is pre-sorted and/or needs only rare re-sort
        | passes. Which brings us back to btrees: keeping data sorted is
        | usually 51% of the problem. Search is the other 49%.
       | 
       | OK, if you don't need range search, which needs sorted keys, then
       | hashmaps are darn hard to beat.
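        | 
        | To make the Eytzinger option concrete (my sketch, not from the
        | PDF): the sorted array is permuted into BFS order, so the search
        | descends an implicit tree whose hot top levels pack into a few
        | cache lines.

```python
def eytzinger_layout(sorted_arr):
    """Permute a sorted array into Eytzinger (BFS heap) order, 1-indexed."""
    out = [None] * (len(sorted_arr) + 1)  # out[0] is unused
    it = iter(sorted_arr)

    def fill(k):
        if k <= len(sorted_arr):
            fill(2 * k)        # left subtree holds the smaller elements
            out[k] = next(it)
            fill(2 * k + 1)    # right subtree holds the larger elements

    fill(1)
    return out

def eytzinger_search(eyt, x):
    """Return the smallest element >= x, or None if every element is < x."""
    n = len(eyt) - 1
    k = 1
    while k <= n:
        # Descend: children of node k are 2k and 2k+1.
        k = 2 * k if eyt[k] >= x else 2 * k + 1
    # Undo the final right turns: shift out k's trailing 1-bits plus one
    # (the equivalent of k >>= ffs(~k) in C).
    k >>= ((k + 1) & -(k + 1)).bit_length()
    return eyt[k] if k != 0 else None
```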
       | 
        | The PDF author however makes a strong point: if FLOPS are going
        | up way faster than memory BW, the obvious thing to do is spend
        | a little more code looking into memory smartly. And the PDF
        | even gives algos to do it. So that's why my pushback on binary
        | search as the de-facto algorithm is only a small one. The paper
        | goes way further.
        
         | pvansandt wrote:
          | I (author) agree that, certainly at the higher end of element
          | counts, it would be a very strange decision not to use an
          | index. One of my favorite related papers is RadixSpline by
          | Kipf, which shows an index-based approach.
         | 
         | However, the inner loop of indexes usually is some other kind
         | of search algorithm. Graefe mentions in BTree techniques that
         | interpolation search is sometimes applicable to inner nodes in
         | btrees. One of the SIP evaluations was using SIP as part of the
         | inner loop in LevelDB, which doesn't displace the use of an LSM
         | for managing the larger data set.
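          | 
          | For reference, the textbook interpolation search that btree
          | inner nodes can use looks roughly like this (my sketch, not
          | code from the paper or from Graefe):

```python
def interpolation_search(arr, x):
    """Classic interpolation search on a sorted list; returns an index
    of x or -1. Each probe guesses x's position from a linear fit
    between the current endpoints."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= x <= arr[hi]:
        if arr[hi] == arr[lo]:   # all keys equal here; avoid div by zero
            break
        # Where would x fall if keys were uniformly distributed?
        mid = lo + (x - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[mid] == x:
            return mid
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo if lo <= hi and arr[lo] == x else -1
```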
         | 
          | If you don't need a lot of range queries, eytzinger or cache-
          | line optimized btrees do get a lot more interesting, but the
          | iteration logic gets a lot worse. SIP returns an element if
          | it's present and a sentinel if absent, which is an odd choice
          | since the index is usually what's needed. No idea why, but
          | hopefully you'll have some sympathy for revisiting your old
          | code and wondering who wrote it. :)
         | 
         | I do think that optimizing search algorithms is still
         | worthwhile in an indexed setting for two reasons: 1. The size
         | of a node within an index like a btree depends on the
         | efficiency with which that node can be searched, so if you are
         | able to decrease the cost of searching a given node in the
         | index, that might motivate larger node sizes. I know that a
         | common heuristic is to use a multiple of a page size, but that
         | is based on the assumption that the whole node will be used as
         | part of the search. 2. Pushing algorithmic decisions to just in
         | time can be more costly than the time saved from choosing a
         | better algorithm. You lose icache efficiency, stall on the
         | decision, etc. I think it's still good to steel man, as best
         | you can, that your index is actually buying you improved
         | performance for additional complexity or update overhead.
        
           | scrubs wrote:
            | Your paper is a valid add, and I'm gonna read/digest it. My
            | point above, while true, is more in the margins. Better to
            | keep the focus on the engineering know-how in the PDF.
            | Thank you for submitting it.
        
       | marginalia_nu wrote:
       | Commenting so I find this later.
        
         | ComputerGuru wrote:
         | Tip: you can favorite the submission then un-favorite it after
         | you've read it.
        
       | alwaystired wrote:
       | New leetcode problem just dropped
        
       | lucgagan wrote:
       | Wow, this sounds like a really fascinating study! Binary Search
       | has long been the go-to method for sorted, in-memory data search
       | due to its effectiveness and simplicity. However, given the
       | limitations of Binary Search in certain scenarios, it's exciting
       | to see new algorithms being proposed that could potentially
       | perform better under specific circumstances.
       | 
       | It's intriguing that Interpolation Search, despite its
       | historically lackluster performance, is being given a second life
       | through SIP and TIP. I love the fact that they're leveraging
       | hardware trends and advanced calculations to optimize search
       | performance.
       | 
       | The fact that SIP shows up to 4x faster performance on uniformly
       | distributed data and TIP is 2-3x faster on non-uniformly
       | distributed data is quite promising. It seems these approaches
       | could be really useful in practice, given that real-world data
       | often follows non-uniform distributions.
       | 
       | The introduction of a meta-algorithm to dynamically select the
       | optimal search algorithm based on the data distribution is a
       | clever touch. It helps to enhance the overall search performance
       | by not sticking to a one-size-fits-all approach.
       | 
       | I'd love to read more about these techniques and their practical
       | implementations. Do they have any benchmarks or plans to test
       | these new methods in large-scale, real-world systems?
        
         | switchbak wrote:
         | Serious question: did you use an LLM tool to craft some or all
         | of this comment?
        
           | finexplained wrote:
           | The five paragraph format will forever be suspicious.
        
           | adamisom wrote:
            | I'd be willing to bet money at 5-1 that they didn't. Since
            | that's impossible to prove, I guess I'd also bet against
            | running an LLM a lot of times and getting anything close to
            | it.
        
         | pvansandt wrote:
          | This was an interesting journey. My original target was L1-
          | sized arrays with constant-sized error, but that limits its
          | applicability to the target audience. My initial exploration
          | with spline-based indexes was at this target data size, so
          | they weren't competitive.
         | 
         | SIP is very sensitive to quality of implementation. I did my
         | best to steel man binary search, but if either is not
         | implemented carefully, then you can easily get inverted
         | results. I tried several different axes of variation for
         | implementation of binary search and probably spent more time
         | optimizing binary search than SIP or TIP, but binary search is
         | very sensitive to cache misses and how that relates to branch
          | prediction and conflict misses. Quality of linear search is
          | also critical and relies heavily on array padding. SIMD was
          | actually a major pessimization for SIP, since it's only worth
          | using if the linear search is < 100 elements or so.
         | 
          | I spent a lot of effort trying to model the performance of
          | SIP, predicting from statistics of the data how it would
          | perform.
         | You can consider the cost of each interpolation as a random
         | access cache miss, and the cost of a linear search based on its
         | distance, but the cost of a linear search has decreasing
         | marginal cost with more distance before you reach steady state,
         | which is exactly where you are using it in SIP. So, that
         | basically led nowhere, but it does point out that an L2 norm is
         | the wrong metric for identifying if interpolation would be
         | useful on your data.
         | 
         | I think the bigger insight of TIP is around its use of
         | efficient approximations for hyperbolic interpolations. There's
         | a lot of research around polynomial interpolation, but I think
         | that hyperbolas are also worth examining. I don't expect TIP to
         | be used in a database engine, and while data might be commonly
          | Zipf-distributed, I'm not sure that people are putting that
         | kind of data into an array and doing lookups on them where the
         | values at the bottom of the array have a lot of duplicate
         | values.
         | 
         | SIP, I think, is more about increasing amounts of spatial cache
         | locality, and making the most of out-of-order processors. It is
         | more a variation on interpolation-sequential search, so it
         | might be more practically used in a SIP-2 or SIP-3 variation
          | with a hard-coded, unrolled number of iterations. (SIP-1 would
         | just be interpolation sequential.) The downside of an
         | interpolation method is that pairs of sequential interpolations
         | don't surround the target, which leaves the search unbounded. I
         | don't find hybrid approaches very interesting because I don't
         | expect an implementation to be able to pipeline useful work
         | which has a time per iteration that is sufficiently fast. For
          | perspective, on billion-element arrays, I think the typical
          | number of interpolations was around 3. We would expect the
          | number of interpolations not to reach 4 until you hit a
          | quintillion elements, which suggests that the gap in
          | efficiency between log
         | N and log log N grows slowly enough that algorithmic gains have
         | a nearby, practical upper bound to potential improvement.
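          | 
          | For concreteness, "interpolation-sequential" here means one
          | interpolation probe followed by a linear fix-up scan, roughly
          | like this (my sketch, not the paper's tuned SIP):

```python
def interpolation_sequential(arr, x):
    """One interpolation probe, then a linear scan toward the target.
    A sketch of the interpolation-sequential scheme (roughly SIP-1 as
    described above), returning an index of x or -1."""
    n = len(arr)
    if n == 0 or x < arr[0] or x > arr[-1]:
        return -1
    if arr[-1] == arr[0]:                    # all keys equal
        return 0 if arr[0] == x else -1
    # Single interpolation probe under a uniform-keys assumption.
    i = (x - arr[0]) * (n - 1) // (arr[-1] - arr[0])
    # Fix up with a linear scan in whichever direction the probe missed.
    while i > 0 and arr[i] > x:
        i -= 1
    while i < n - 1 and arr[i] < x:
        i += 1
    return i if arr[i] == x else -1
```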
         | 
         | One of my favorite related papers is RadixSpline by Kipf et al.
         | The amount of additional memory across a number of scenarios
         | would be tiny, you could write a very efficient inner loop, and
          | it would have basically been tailor-fit to work perfectly on
         | the FB dataset (one of the real world distributions).
        
       | mlochbaum wrote:
       | The paper draws the analogy to continuous root-finding
       | ("Interpolation search is the regula falsi method"), nice. One
       | thing I've been wondering lately is whether the ITP method[0],
       | which is actually more recent than this paper, is a good fit for
       | sorted searching. It's an elegant method that guarantees a
       | maximum number of queries, for example one more than binary
       | search, while taking advantage of interpolation for better-
        | behaved distributions. I wrote a short section about this at
        | [1], which is likely to link to the OP soon too.
       | 
       | Since the author's here: the "Optimized binary search" algorithm
       | looks good, in that it has an obvious branchless implementation.
       | But it'll depend on how the code's written and compiled, and
       | clang in particular will turn it into a branch without -mllvm
       | --x86-cmov-converter=0. Making it branchless was the intention,
       | right? Was the output code checked to make sure it uses cmov?
       | 
       | [0] https://en.wikipedia.org/wiki/ITP_Method
       | 
       | [1]
       | https://mlochbaum.github.io/BQN/implementation/primitive/sor...
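        | 
        | For readers unfamiliar with the branchless shape being
        | discussed: the loop below has a trip count that depends only on
        | the array length, and the only data-dependent operation is the
        | conditional base update, which C/C++ compilers can lower to a
        | cmov. (A Python sketch of the general pattern, not the paper's
        | exact code.)

```python
def branchless_lower_bound(arr, x):
    """Return the index of the first element >= x (len(arr) if none).
    The loop always runs about ceil(log2 n) times; the data-dependent
    part is the conditional base update, a cmov candidate in C."""
    base, n = 0, len(arr)
    while n > 1:
        half = n // 2
        # Branch-free in spirit: 'base += half if probe < x else 0'.
        if arr[base + half - 1] < x:
            base += half
        n -= half
    if base < len(arr) and arr[base] < x:
        base += 1
    return base
```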
        
         | [deleted]
        
         | pvansandt wrote:
          | Big fan of your presentation on binary search that amortised
         | multiple concurrent searches.
         | 
         | I did several variations on binary search and always compared
         | against the best one. [0] From what I remember, some variations
         | were compiled to a branchless implementation and some were
         | compiled to implementations that used a branch. I had many
         | passes of analysis by pasting into godbolt with identical
          | compilers and flags. Power-of-2 binary search did better on
          | small arrays iirc, but is the first to hit conflict misses.
         | For larger arrays, I believe the branchy search algorithms
         | start to do better since half their branches get predicted
         | correctly.
         | 
          | I haven't previously looked at ITP, and I would need to study
          | it further to get a clearer idea of what it's aiming for.
          | Some hybrid methods, unlike ITP, require additional memory
          | accesses, but I don't think this one does, just compute. On
          | the other hand,
         | since at a billion elements you're looking at roughly 3
         | interpolations or 30 bisections, there's a pretty narrow window
         | of opportunity here, and in the batch case, there are indices
         | to contend with. I'm on the BQN discord if you wanted a
         | different form of communication.
         | 
         | [0] https://github.com/UWHustle/Efficiently-Searching-In-
         | Memory-...
        
           | mlochbaum wrote:
            | Sounds good on the basic binary search. I jumped through
            | the paper a little too quickly and missed the source code
            | link, though thanks for the godbolt confirmation as well.
           | 
           | Yeah, the big deal about ITP is not really the interpolation
           | but the graceful fallback to binary search. With of course
           | that nasty division overhead. I'll have to study your paper
           | to see which ideas there could be used.
           | 
           | Glad you liked the talk! And a small world, huh? I've been
           | wanting to get an implementation of the vector binary search
           | into CBQN for a while, so that an open source version of it
           | exists at least. Always some other thing to do. Multiple
           | searches are really a different world. Elijah Stone pointed
           | out recently[0] that with searches that all have the same
           | iteration count (sadly, not interpolation search!), several
           | of them can be run at once just by copying the loop body.
           | That was new to me at least. And for searches that don't fit
           | in cache it's possible to partition the sought values, which
           | does the accesses in a perfectly cache-friendly order for
           | some extra compute. That's described in the page I linked
           | before.
           | 
           | [0] https://news.ycombinator.com/item?id=33648924
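            | 
            | The "copy the loop body" trick works because a branchless
            | binary search runs a fixed number of iterations (about
            | ceil(log2 n)) for every query, so several searches can
            | share one loop. A sketch of the idea (mine, not Elijah
            | Stone's code):

```python
def interleaved_lower_bound(arr, xs):
    """Run several branchless lower_bound searches in lockstep.
    Because the iteration count depends only on len(arr), the loop body
    can simply be duplicated per query; here the 'copies' are a loop
    over the queries."""
    bases = [0] * len(xs)
    n = len(arr)
    while n > 1:
        half = n // 2
        for j, x in enumerate(xs):   # the copied loop bodies
            if arr[bases[j] + half - 1] < x:
                bases[j] += half
        n -= half
    # Final fix-up per query, as in the single-search version.
    return [b + (b < len(arr) and arr[b] < x) for b, x in zip(bases, xs)]
```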
        
       ___________________________________________________________________
       (page generated 2023-06-03 23:00 UTC)