[HN Gopher] Binary Trees are optimal except when they're not
       ___________________________________________________________________
        
       Binary Trees are optimal except when they're not
        
       Author : jjgreen
       Score  : 20 points
       Date   : 2021-07-21 07:57 UTC (15 hours ago)
        
 (HTM) web link (hbfs.wordpress.com)
 (TXT) w3m dump (hbfs.wordpress.com)
        
       | cakoose wrote:
       | Yup, ignoring the costs of memory/storage access (cache, main
       | memory, disk) can hurt an algorithm's real-world performance.
       | Laying your data out in chunks that match the memory/storage
        | block sizes (cache line size, disk block size) can yield much
       | better performance.
       | 
       | And while having the algorithm be aware of the chunk sizes in
       | every layer of your memory hierarchy is optimal, it can be
       | inconvenient to get all that set up. But what's cool is that you
       | can often lay your data out in a fractal-like pattern, and get
       | pretty good performance no matter what the various chunk sizes
       | are: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
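        | 
        | A minimal sketch of the chunk-matching idea in C (the 64-byte
        | cache-line size and the field layout are assumptions for
        | illustration, not a prescription):
        | 
        |     /* One tree node packed to fill exactly one 64-byte cache
        |        line, so touching a node costs a single line fetch.
        |        Child links are left out to keep the sketch short. */
        |     #include <stdint.h>
        | 
        |     #define CACHE_LINE 64
        | 
        |     struct packed_node {
        |         uint32_t nkeys;    /* keys currently in use     */
        |         uint32_t pad;      /* keep keys 8-byte aligned  */
        |         int64_t  keys[(CACHE_LINE - 8) / sizeof(int64_t)];
        |     } __attribute__((aligned(CACHE_LINE)));
        | 
        | Repeating the same trick at each level (cache line, page, disk
        | block) is what the cache-oblivious layouts approximate without
        | knowing the sizes.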
        
       | riggsdk wrote:
        | Basically, algorithmic complexity analysis often ignores the
        | cost of accessing data because the underlying storage is
        | unknown in advance - be it in memory, on a spinning disk, or
        | on magnetic tape.
       | 
        | Unfortunately that gives suboptimal algorithms, because that
        | basic practicality is ignored. You then end up with a bunch of
        | algorithms that look worse "on paper" but still outperform the
        | theoretically optimal ones.
       | 
        | Often the number of items is also ignored. If you want to
        | implement a set-like feature but you'll be using fewer than X
        | elements, you can often get better performance with a linear
        | search through an array than with a full-fledged hash set.
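        | 
        | A minimal sketch of the small-N case in C (the capacity of 32
        | is an illustrative assumption; the real crossover point
        | depends on workload and hardware):
        | 
        |     /* A "set" backed by a flat array and linear search. For
        |        small counts this often beats a hash set: the array
        |        sits in a few cache lines and there is no hashing or
        |        pointer chasing. */
        |     #include <stdbool.h>
        |     #include <stddef.h>
        | 
        |     struct tiny_set {
        |         int    items[32];  /* illustrative capacity */
        |         size_t count;
        |     };
        | 
        |     bool tiny_set_contains(const struct tiny_set *s, int key)
        |     {
        |         for (size_t i = 0; i < s->count; i++)
        |             if (s->items[i] == key)
        |                 return true;
        |         return false;
        |     }
        | 
        |     bool tiny_set_add(struct tiny_set *s, int key)
        |     {
        |         if (tiny_set_contains(s, key))
        |             return false;  /* already present */
        |         if (s->count == 32)
        |             return false;  /* full; a real one would grow */
        |         s->items[s->count++] = key;
        |         return true;
        |     }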
        
         | jsmith45 wrote:
         | Agreed. There are plenty of cases where an asymptotically
         | "worse" algorithm performs better than a "better" one for all
         | practical sizes.
         | 
          | It is very much possible that the asymptotically "better"
          | algorithm has coefficients so high that reaching the
          | break-even point takes a week or more of computation. Not
          | many programs have users who will tolerate feeding in so
          | much data that operations take that long.
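          | 
          | A toy computation of such a break-even point (all constants
          | invented for illustration):
          | 
          |     #include <stdio.h>
          | 
          |     int main(void)
          |     {
          |         /* "fast" costs 2.5e7 * n ops, "slow" costs n * n. */
          |         double c_fast = 2.5e7, ops_per_sec = 1e9;
          |         double n   = c_fast;      /* break-even: c*n = n*n */
          |         double ops = c_fast * n;  /* work done at that n   */
          |         printf("break-even n = %.3g, time = %.1f days\n",
          |                n, ops / ops_per_sec / 86400.0);
          |         return 0;
          |     }
          | 
          | This prints a break-even size of 2.5e7 items and roughly a
          | week of compute to reach it, at which point the "fast"
          | algorithm has only just caught up.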
        
       | valbaca wrote:
       | Author seems to have just backed into rediscovering B-trees
       | 
       | https://en.wikipedia.org/wiki/B-tree
       | 
        | > To conclude, binary trees are optimal when we ignore the cost
        | of accessing a node, but they aren't when accessing a node
        | becomes costly. When we access nodes on disk, at high cost, it
        | becomes worthwhile to bundle many keys into a node, and we gave
        | the optimal solution. In practice, however, the problem is often
        | solved more directly: we just fit as many keys as possible into
        | an I/O block. Disks operate on small blocks of (typically) 512
        | bytes, but file systems operate on somewhat larger, fixed-size
        | blocks of 4 or 8 KB. A good strategy is therefore to fit as many
        | keys as possible into such a block, since even a large number of
        | comparisons within the block is still cheaper than bringing
        | another block into main memory.
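        | 
        | A minimal sketch of that "fill an I/O block" strategy in C
        | (the 4096-byte block size and the on-disk layout are
        | assumptions for illustration):
        | 
        |     /* One B-tree node per 4 KB block: 255 keys plus 256 child
        |        offsets fill the block exactly, so a million-key tree
        |        is only about three disk reads deep. */
        |     #include <stddef.h>
        |     #include <stdint.h>
        | 
        |     #define BLOCK_SIZE 4096
        |     #define ORDER 255   /* 8 + 255*8 + 256*8 == 4096 */
        | 
        |     struct btree_node {
        |         uint32_t nkeys;               /* keys in use       */
        |         uint32_t is_leaf;
        |         uint64_t keys[ORDER];         /* sorted            */
        |         uint64_t children[ORDER + 1]; /* disk offsets      */
        |     };
        | 
        |     /* Index of the first key >= k: the slot to descend
        |        into. In-node search is cheap next to the disk read
        |        that fetched the block. */
        |     size_t node_lower_bound(const struct btree_node *n,
        |                             uint64_t k)
        |     {
        |         size_t lo = 0, hi = n->nkeys;
        |         while (lo < hi) {
        |             size_t mid = lo + (hi - lo) / 2;
        |             if (n->keys[mid] < k) lo = mid + 1;
        |             else                  hi = mid;
        |         }
        |         return lo;
        |     }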
        
       | bob1029 wrote:
       | Skiplists are my favorite data structure. There are a few use
       | cases that they can't satisfy, but for the typical memory-bound
        | key-value store application, they're a very simple & effective
        | approach.
       | 
        | If you have to go to disk, a combination of a splay tree,
        | batching, and an append-only log structure is your best bet.
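        | 
        | For reference, a minimal skiplist search in C (insert and
        | delete omitted; MAX_LEVEL and the layout are illustrative):
        | 
        |     /* Each node carries a tower of forward pointers; search
        |        starts at the top level and drops down, giving
        |        expected O(log n) with very little code. */
        |     #include <stddef.h>
        |     #include <stdint.h>
        | 
        |     #define MAX_LEVEL 16
        | 
        |     struct skip_node {
        |         int64_t key;
        |         struct skip_node *next[MAX_LEVEL]; /* per level */
        |     };
        | 
        |     struct skiplist {
        |         struct skip_node head; /* sentinel; key unused */
        |         int level;             /* highest level in use */
        |     };
        | 
        |     /* Node with the given key, or NULL if absent. */
        |     struct skip_node *skip_find(struct skiplist *sl,
        |                                 int64_t key)
        |     {
        |         struct skip_node *x = &sl->head;
        |         for (int i = sl->level - 1; i >= 0; i--)
        |             while (x->next[i] && x->next[i]->key < key)
        |                 x = x->next[i];
        |         x = x->next[0];
        |         return (x && x->key == key) ? x : NULL;
        |     }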
        
       ___________________________________________________________________
       (page generated 2021-07-21 23:00 UTC)