[HN Gopher] Efficiently Searching In-Memory Sorted Arrays: Reven...
___________________________________________________________________

Efficiently Searching In-Memory Sorted Arrays: Revenge of
Interpolation Search? [pdf]

Author : greghn
Score  : 54 points
Date   : 2023-06-03 16:12 UTC (6 hours ago)

(HTM) web link (pages.cs.wisc.edu)
(TXT) w3m dump (pages.cs.wisc.edu)

| scrubs wrote:
| Broadly related: https://news.ycombinator.com/item?id=25899286
| PGM Indexes: Learned indexes that match B-tree performance with
| 83x less space
|
| I might suggest, slightly pushing back on this article's PDF,
| that binary search is not the de-facto in-memory search
| algorithm. Cache-conscious B-trees and various optimizations of
| tries or radix trees are better. Eytzinger arrays are a good
| BST option if data is pre-sorted and/or requires only rare
| resort passes. Which brings us back to B-trees: keeping data
| sorted is usually 51% of the problem. Search is the other 49%.
|
| OK, if you don't need range search, which needs sorted keys,
| then hashmaps are darn hard to beat.
|
| The PDF author however makes a strong point: if FLOPS are going
| up way faster than memory BW, the obvious thing to do is spend
| a little more code looking into memory smartly. And the PDF
| even gives algos to do it. So my saying that binary search
| isn't really the de-facto algorithm is only a small push back.
| The paper goes way further.
| pvansandt wrote:
| I (author) agree that certainly at the higher end of element
| counts, it would be a very strange decision to not use an
| index. One of my favorite related papers is RadixSpline by
| Kipf, which shows an index-based approach.
|
| However, the inner loop of an index is usually some other kind
| of search algorithm. Graefe mentions in B-Tree Techniques that
| interpolation search is sometimes applicable to inner nodes in
| B-trees. One of the SIP evaluations used SIP as part of the
| inner loop in LevelDB, which doesn't displace the use of an
| LSM for managing the larger data set.
|
| If you don't need a lot of range queries, Eytzinger or cache-
| line-optimized B-trees do get a lot more interesting, but the
| iteration logic gets a lot worse. SIP returns an element if
| it's present and a sentinel if absent, which is an odd choice
| since the index is usually what's needed. No clue why, but
| hopefully you'll have some sympathy for revisiting your old
| code and wondering who wrote it. :)
|
| I do think that optimizing search algorithms is still
| worthwhile in an indexed setting, for two reasons. 1. The size
| of a node within an index like a B-tree depends on the
| efficiency with which that node can be searched, so if you are
| able to decrease the cost of searching a given node in the
| index, that might motivate larger node sizes. I know that a
| common heuristic is to use a multiple of a page size, but that
| is based on the assumption that the whole node will be used as
| part of the search. 2. Pushing algorithmic decisions to just-
| in-time can be more costly than the time saved from choosing a
| better algorithm: you lose icache efficiency, stall on the
| decision, etc. I think it's still good to steel-man, as best
| you can, the claim that your index is actually buying you
| improved performance for its additional complexity or update
| overhead.
| scrubs wrote:
| Your paper is a valid add, and I'm gonna read/digest it. My
| point above, while true, is more in the margins. Better to
| keep the focus on the engineering know-how in the PDF. Thank
| you for submitting it.
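For reference, the textbook algorithm the thread is discussing, as
a minimal C++ sketch (illustrative only, not the paper's code; per
the thread, SIP and TIP layer slope reuse, guarded linear scans,
and hyperbolic interpolation on top of this basic idea):

    #include <cstdint>
    #include <vector>

    // Textbook interpolation search: guess the target's slot by
    // assuming keys are roughly uniformly distributed between a[lo]
    // and a[hi]. Expected O(log log n) probes on uniform data, O(n)
    // worst case. Returns the index of `key`, or -1 (a sentinel)
    // if absent.
    int64_t interpolation_search(const std::vector<int64_t>& a,
                                 int64_t key) {
        int64_t lo = 0, hi = static_cast<int64_t>(a.size()) - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[hi] == a[lo])  // flat run: avoid dividing by zero
                return a[lo] == key ? lo : -1;
            // Expected slot under the uniformity assumption. Note:
            // production code needs an overflow guard on this
            // multiply.
            int64_t mid =
                lo + (key - a[lo]) * (hi - lo) / (a[hi] - a[lo]);
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;
    }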
| marginalia_nu wrote:
| Commenting so I find this later.
| ComputerGuru wrote:
| Tip: you can favorite the submission, then un-favorite it after
| you've read it.
| alwaystired wrote:
| New leetcode problem just dropped
| lucgagan wrote:
| Wow, this sounds like a really fascinating study! Binary search
| has long been the go-to method for sorted, in-memory data
| search due to its effectiveness and simplicity. However, given
| the limitations of binary search in certain scenarios, it's
| exciting to see new algorithms being proposed that could
| potentially perform better under specific circumstances.
|
| It's intriguing that interpolation search, despite its
| historically lackluster performance, is being given a second
| life through SIP and TIP. I love the fact that they're
| leveraging hardware trends and advanced calculations to
| optimize search performance.
|
| The fact that SIP shows up to 4x faster performance on
| uniformly distributed data and TIP is 2-3x faster on non-
| uniformly distributed data is quite promising. It seems these
| approaches could be really useful in practice, given that
| real-world data often follows non-uniform distributions.
|
| The introduction of a meta-algorithm to dynamically select the
| optimal search algorithm based on the data distribution is a
| clever touch. It helps to enhance the overall search
| performance by not sticking to a one-size-fits-all approach.
|
| I'd love to read more about these techniques and their
| practical implementations. Do they have any benchmarks or
| plans to test these new methods in large-scale, real-world
| systems?
| switchbak wrote:
| Serious question: did you use an LLM tool to craft some or all
| of this comment?
| finexplained wrote:
| The five-paragraph format will forever be suspicious.
| adamisom wrote:
| I'd be willing to bet money at 5-1 against the claim that they
| didn't. Since that's impossible to prove, I guess I'd also bet
| on running an LLM a lot of times and getting something close
| to it.
| pvansandt wrote:
| This was an interesting journey. My original target was
| L1-sized arrays with constant-sized error, but that limits its
| applicability to the target audience. My initial explorations
| with spline-based indexes were at this target data size, so
| they weren't competitive.
|
| SIP is very sensitive to quality of implementation. I did my
| best to steel-man binary search, but if either is not
| implemented carefully, then you can easily get inverted
| results. I tried several different axes of variation for the
| implementation of binary search, and probably spent more time
| optimizing binary search than SIP or TIP; binary search is
| very sensitive to cache misses and how those relate to branch
| prediction and conflict misses. Quality of the linear search
| is also critical, and relies heavily on array padding. SIMD
| was actually a major pessimization, since for SIP it's only
| worth using if the linear search is < 100 elements or so.
|
| I spent a lot of effort trying to model the performance of
| SIP, to predict from statistics of the data how it would
| perform. You can count the cost of each interpolation as a
| random-access cache miss, and the cost of a linear search
| based on its distance, but linear search has decreasing
| marginal cost with distance before it reaches a steady state,
| which is exactly the regime where SIP uses it.
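A rough C++ sketch of the interpolate-then-scan shape whose cost is
being modeled here (illustrative only, not SIP itself; SIP
re-interpolates with a reused slope and relies on array padding
rather than the bounds checks shown):

    #include <cstdint>
    #include <vector>

    // One interpolation to land near the target (typically one
    // random-access cache miss), then a linear scan whose cost
    // grows with the interpolation error.
    int64_t interpolate_then_scan(const std::vector<int64_t>& a,
                                  int64_t key) {
        const size_t n = a.size();
        if (n == 0 || key < a.front() || key > a.back()) return -1;
        const double slope =
            static_cast<double>(n - 1) /
            (static_cast<double>(a.back() - a.front()) + 1.0);
        size_t i = static_cast<size_t>(
            slope * static_cast<double>(key - a.front()));
        if (a[i] < key) { while (i + 1 < n && a[i] < key) ++i; }
        else            { while (i > 0 && a[i] > key) --i; }
        return a[i] == key ? static_cast<int64_t>(i) : -1;
    }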
| So, that modeling effort basically led nowhere, but it did
| point out that an L2 norm is the wrong metric for deciding
| whether interpolation would be useful on your data.
|
| I think the bigger insight of TIP is its use of efficient
| approximations for hyperbolic interpolation. There's a lot of
| research around polynomial interpolation, but I think that
| hyperbolas are also worth examining. I don't expect TIP to be
| used in a database engine, and while data might commonly be
| Zipf-distributed, I'm not sure that people are putting that
| kind of data into an array and doing lookups on it where the
| values at the bottom of the array have a lot of duplicates.
|
| SIP, I think, is more about increasing spatial cache locality
| and making the most of out-of-order processors. It is more a
| variation on interpolation-sequential search, so it might be
| more practically used in a SIP-2 or SIP-3 variation with a
| hard-coded, unrolled number of iterations. (SIP-1 would just
| be interpolation-sequential.) The downside of an interpolation
| method is that pairs of sequential interpolations don't
| bracket the target, which leaves the search unbounded. I don't
| find hybrid approaches very interesting, because I don't
| expect an implementation to be able to pipeline useful work
| while keeping the time per iteration sufficiently fast. For
| perspective, on billion-element arrays the typical number of
| interpolations was around 3. We would expect the number of
| interpolations not to reach 4 until you hit a quintillion
| elements, which suggests that the gap in efficiency between
| log N and log log N grows slowly enough that algorithmic gains
| have a nearby, practical upper bound on potential improvement.
|
| One of my favorite related papers is RadixSpline by Kipf et
| al. The amount of additional memory across a number of
| scenarios would be tiny, you could write a very efficient
| inner loop, and it would basically be tailor-fit to work
| perfectly on the FB dataset (one of the real-world
| distributions).
| mlochbaum wrote:
| The paper draws the analogy to continuous root-finding
| ("Interpolation search is the regula falsi method"); nice. One
| thing I've been wondering lately is whether the ITP method
| [0], which is actually more recent than this paper, is a good
| fit for sorted searching. It's an elegant method that
| guarantees a maximum number of queries, for example one more
| than binary search, while taking advantage of interpolation
| for better-behaved distributions. I wrote a short section
| about this at [1], which is likely to link to the OP soon too.
|
| Since the author's here: the "Optimized binary search"
| algorithm looks good, in that it has an obvious branchless
| implementation. But it'll depend on how the code's written and
| compiled, and clang in particular will turn it into a branch
| without -mllvm --x86-cmov-converter=0. Making it branchless
| was the intention, right? Was the output code checked to make
| sure it uses cmov?
|
| [0] https://en.wikipedia.org/wiki/ITP_Method
|
| [1]
| https://mlochbaum.github.io/BQN/implementation/primitive/sor...
| [deleted]
| pvansandt wrote:
| Big fan of your presentation on binary search that amortised
| multiple concurrent searches.
|
| I did several variations on binary search and always compared
| against the best one. [0] From what I remember, some
| variations compiled to a branchless implementation and some
| compiled to implementations that used a branch.
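The usual branchless shape being contrasted here, as a minimal C++
sketch (not the paper's "Optimized binary search"; whether the
select below becomes a cmov is up to the compiler and flags, as
noted above):

    #include <cstddef>
    #include <cstdint>

    // Branchless binary search (lower bound): the loop body has no
    // data-dependent branch, so the compiler *can* emit a
    // conditional move rather than a predicted branch. Returns the
    // index of the first element >= key.
    size_t lower_bound_branchless(const int64_t* a, size_t n,
                                  int64_t key) {
        if (n == 0) return 0;
        const int64_t* base = a;
        while (n > 1) {
            const size_t half = n / 2;
            // Data-dependent select: ideally a cmov, not a branch.
            base = (base[half] < key) ? base + half : base;
            n -= half;
        }
        return static_cast<size_t>(base - a) +
               (base[0] < key ? 1 : 0);
    }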
| I had many passes of analysis by pasting into godbolt with
| identical compilers and flags. Power-of-2 binary search did
| better on small arrays, IIRC, but is the first to hit conflict
| misses. For larger arrays, I believe the branchy search
| algorithms start to do better, since half their branches get
| predicted correctly.
|
| I haven't previously looked at ITP, and I would need to study
| it further to get a clearer idea of what it's aiming for. Some
| hybrid methods, unlike ITP, require additional memory
| accesses, but I don't think this one does, just compute. On
| the other hand, since at a billion elements you're looking at
| roughly 3 interpolations or 30 bisections, there's a pretty
| narrow window of opportunity here, and in the batch case there
| are indices to contend with. I'm on the BQN discord if you
| want a different form of communication.
|
| [0] https://github.com/UWHustle/Efficiently-Searching-In-
| Memory-...
| mlochbaum wrote:
| Sounds good on the basic binary search. I jumped through the
| paper a little too quickly and missed the source-code link,
| though thanks for the godbolt confirmation as well.
|
| Yeah, the big deal about ITP is not really the interpolation
| but the graceful fallback to binary search. With, of course,
| that nasty division overhead. I'll have to study your paper to
| see which ideas there could be used.
|
| Glad you liked the talk! And a small world, huh? I've been
| wanting to get an implementation of the vector binary search
| into CBQN for a while, so that an open-source version of it
| exists at least. Always some other thing to do. Multiple
| searches are really a different world. Elijah Stone pointed
| out recently [0] that searches which all have the same
| iteration count (sadly, not interpolation search!) can be run
| several at a time just by copying the loop body. That was new
| to me, at least. And for searches that don't fit in cache,
| it's possible to partition the sought values, which does the
| accesses in a perfectly cache-friendly order for some extra
| compute. That's described in the page I linked before.
|
| [0] https://news.ycombinator.com/item?id=33648924
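A minimal C++ sketch of the interleaving trick described above:
because a branchless lower bound over a fixed n runs the same
number of iterations for every key, K independent searches can
share one loop body, keeping K cache misses in flight at once
(illustrative only, not CBQN's or Elijah Stone's actual code):

    #include <cstddef>
    #include <cstdint>

    // Run K binary searches in lockstep; the K loads in each
    // iteration are independent, so their memory latencies overlap.
    template <int K>
    void lower_bound_batch(const int64_t* a, size_t n,
                           const int64_t* keys, size_t* out) {
        if (n == 0) {
            for (int k = 0; k < K; ++k) out[k] = 0;
            return;
        }
        const int64_t* base[K];
        for (int k = 0; k < K; ++k) base[k] = a;
        size_t len = n;
        while (len > 1) {
            const size_t half = len / 2;
            for (int k = 0; k < K; ++k)
                base[k] = (base[k][half] < keys[k]) ? base[k] + half
                                                    : base[k];
            len -= half;
        }
        for (int k = 0; k < K; ++k)
            out[k] = static_cast<size_t>(base[k] - a) +
                     (base[k][0] < keys[k] ? 1 : 0);
    }

In practice K would be chosen to roughly match how many outstanding
misses the core can sustain; a small unrolled constant is the usual
starting point.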