[HN Gopher] Optimizing Open Addressing
       ___________________________________________________________________
        
       Optimizing Open Addressing
        
       Author : TheNumbat
       Score  : 108 points
       Date   : 2023-04-02 17:16 UTC (5 hours ago)
        
 (HTM) web link (thenumb.at)
 (TXT) w3m dump (thenumb.at)
        
       | mgaunard wrote:
       | I find that there is too much of an emphasis on open-addressing
       | in blog articles and such.
       | 
       | There are a lot of real and common data structures where chaining
       | is much better suited.
       | 
        | For example, whenever you have data that is part of one or
        | more other node-based data structures, and you layer an index
        | on top of that.
       | 
       | Chaining also tends to waste a lot less memory.
        
         | dragontamer wrote:
          | In my experience / tests, it's way easier to write a high-
          | performance open-addressing hash table than a high-
          | performance chaining one.
          | 
          | That being said, a basic chaining table is easier to write.
         | 
         | > Chaining also tends to waste a lot less memory.
         | 
          | How so? A typical 32-bit integer or 32-bit float uses 4
          | bytes, but a typical pointer is 8 bytes, and there's also the
          | internal malloc/new chunk header to consider.
          | 
          | That means that to store a 4-byte integer in a chained hash
          | table, you need 8 bytes (pointer) + 8 bytes (glibc malloc
          | chunk header) + 4 bytes (integer/float), or 20 bytes total in
          | practice.
         | 
          | If you have, say, 50 integers inside the hashmap, then
          | you'll use (table-size * 8) + (50 integers * 20 bytes each)
          | == 1000+ bytes for the nodes alone, plus more for the table
          | itself. (ex: a table of size 50 needs 50 * 8-byte bucket
          | pointers even when empty, for a total of 1000 bytes + 400
          | bytes == 1400 bytes or so)
         | 
          | In contrast, each slot of an open-addressing hash table of
          | 4-byte values takes up just 4 bytes. So 50 integers in a
          | hashmap at a ~50% load factor (ie: size the table at 128 or
          | so) will be 128 * 4 == 512 bytes.
         | 
          | EDIT: It's because of this huge practical difference in
          | memory use that I'm pretty sure open addressing is so high
          | performance in comparison. When your data structures are
          | half the size (or less) of the competition, it's easier to
          | stay in L1 cache and reap other such size benefits.
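          | 
          | To make the arithmetic concrete, a minimal sketch (the
          | 8-byte chunk header is the glibc assumption from above;
          | other allocators differ):
          | 
          | #include <cstddef>
          | #include <cstdio>
          | 
          | int main() {
          |     // Chaining: one heap node per entry
          |     // (next pointer + chunk header + payload).
          |     std::size_t chain_node = 8 + 8 + 4;  // = 20 bytes
          |     std::size_t chain_total =
          |         50 * 8 + 50 * chain_node;        // buckets + nodes
          | 
          |     // Open addressing: just the slot array, sized for a
          |     // ~50% load factor.
          |     std::size_t open_total = 128 * 4;
          | 
          |     std::printf("chaining: %zu, open: %zu\n",
          |                 chain_total, open_total);  // 1400 vs 512
          | }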
        
           | mgaunard wrote:
           | Nodes don't have to be allocated with malloc (that is
           | actually the worst thing you could possibly do).
           | 
           | An open-addressing hash table only performs well up to 50-75%
           | bucket usage. Chaining doesn't care and performs the same up
           | to 100%.
        
             | a1369209993 wrote:
             | > (that is actually the worst thing you could possibly do)
             | 
             | Nonsense, there are many worse things you could do. For
             | example, you could allocate nodes with mmap, at 4KB per
              | node. (Which is a thing I've actually _done_ in the context
              | of cursed bootstrap code where malloc isn't available and
              | having more than a handful of (in that case doubly-linked-
              | list) nodes means the environment is deranged anyway.)
        
             | dragontamer wrote:
             | > Nodes don't have to be allocated with malloc (that is
             | actually the worst thing you could possibly do).
             | 
              | I mean... the most obvious place to put nodes is...
              | inside the table itself. Also known as Open Addressing.
              | 
              | Or what, are you going to implement a second, custom heap
              | algorithm to manage your nodes? I guess there are buddy
              | allocators, but those aren't exactly free either.
              | Whatever method you're using to keep track of nodes is
              | going to have some level of inefficiency.
              | 
              | Whatever allocation you're doing to get these nodes
              | (above and beyond generic malloc) is time you could have
              | spent writing open addressing instead.
             | 
             | > An open-addressing hash table only performs well up to
             | 50-75% bucket usage. Chaining doesn't care and performs the
             | same up to 100%.
             | 
              | Hardly an apples-to-apples comparison. As I pointed out,
              | a 50% load factor on 50x uint32 open addressing is only
              | ~400 bytes used.
              | 
              | While a 100% load factor on 50x uint32 chaining is 1400
              | bytes used (plus additional overhead in practice due to
              | malloc fragmentation).
        
               | mgaunard wrote:
               | I already explained. Objects in non-trivial data
               | structures are part of several data structures
               | concurrently.
        
               | dragontamer wrote:
               | Perhaps we need to start talking with actual code? Here's
               | just a simple idea that's in my brain right now.
                | 
                | template <class Data>
                | struct HashNode {
                |     // Probably should be shared_ptr<HashNode<Data>>,
                |     // but that's now adding even more inefficiencies
                |     // like a ref_count per element
                |     HashNode* nextPointerChain;
                |     Data d;  // Maybe Data* d if you're sharing it
                |              // with other data structures?
                | };
                | 
                | template <class Data, int size>
                | struct ChainHashTable {
                |     HashNode<Data> theTable[size];
                | };
                | 
                | template <class Data, int size>
                | struct LinearProbingHashTable {
                |     Data theTable[size];
                |     std::vector<bool> occupied;  // sized to "size":
                |                                  // 0 = empty, 1 = occupied
                | };
               | 
                | nextPointerChain takes up 8 bytes, even if it's nullptr.
                | No other data structure needs to carry the
                | nextPointerChain. If we use shared_ptr<HashNode<Data>>
                | as per typical modern C++, there are even more
                | inefficiencies that I forgot about. EDIT: You probably
                | can get away with unique_ptr, now that I think of it.
               | 
               | ----------
               | 
                | Pointers aren't free, btw. If you're sharing the pointer
                | with many different parts of your program, that's
                | definitely one of those things you'll want to start
                | getting rid of. Not only does a pointer cost 8 bytes,
                | but it's also _SIGNIFICANTLY_ slower and cache-
                | unfriendly in practice.
               | 
                | L1 cache works by fetching the data in and around
                | anything you access. If you're only using 8 bytes of a
                | 64-byte cache line for a pointer indirection (8/64),
                | you're wasting 56 bytes of that fetch.
               | 
                | Linear probing is extremely fast because it tends to
                | blitz through L1 cache thanks to locality of data.
                | Simple linear probing (or Robin Hood augmented probing)
                | is probably the fastest, and simplest, approach for
                | modern CPUs.
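                | 
                | For reference, the heart of a linear-probing lookup is
                | just a forward scan from the home slot. A minimal
                | sketch, reusing the LinearProbingHashTable sketch from
                | above and assuming size is a power of two:
                | 
                | #include <cstddef>
                | #include <functional>
                | 
                | template <class Data, int size>
                | Data* find(LinearProbingHashTable<Data, size>& t,
                |            const Data& key) {
                |     std::size_t i = std::hash<Data>{}(key) & (size - 1);
                |     while (t.occupied[i]) {  // empty slot ends the probe
                |         if (t.theTable[i] == key)
                |             return &t.theTable[i];  // contiguous scan
                |         i = (i + 1) & (size - 1);   // wrap around
                |     }
                |     return nullptr;  // not found
                | }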
        
               | mgaunard wrote:
               | You still fail to understand what being part of multiple
               | data structures means.
               | 
                | Shared pointers, a separate container to check
                | occupancy, suggesting a container of pointers: all
                | these things suggest you have no idea how to do hybrid
                | data structures.
               | 
                | The nodes contain the data along with multiple pointers
                | that chain them into the various data structures they
                | are part of (trees, lists, hash tables, etc.). The
                | object in question contains state that is several cache
                | lines long.
               | 
                | You cannot just put the nodes inside your buckets, as
                | then you don't get to control where the nodes live, and
                | you cannot afford to invalidate the other data
                | structures. Though I suppose it's a possibility if you
                | know ahead of time how many nodes you're going to need,
                | but in that case you may be able to go for perfect
                | hashing to begin with (unless you're building some kind
                | of fixed-size cache).
               | 
               | A given object for example can be part of several hash
               | maps, based on the various pieces of state it is useful
               | to index by.
               | 
               | In practice you'd use a pool to allocate all your nodes
               | in blocks. There is no per-node overhead for this
               | allocation strategy.
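                | 
                | To sketch what I mean (names are hypothetical; the
                | point is the intrusive links plus block allocation,
                | with no per-node malloc header and node addresses that
                | never move):
                | 
                | #include <memory>
                | #include <vector>
                | 
                | // One node, part of several data structures at once.
                | struct Order {
                |     Order* hash_next;  // intrusive hash table chain
                |     Order* list_prev;  // intrusive doubly-linked list
                |     Order* list_next;
                |     Order* tree_left;  // intrusive tree, another index
                |     Order* tree_right;
                |     // ... several cache lines of actual state ...
                | };
                | 
                | // Pool: hand out nodes from pre-allocated blocks.
                | struct OrderPool {
                |     std::vector<std::unique_ptr<Order[]>> blocks;
                |     Order* free_list = nullptr;
                | 
                |     Order* alloc() {
                |         if (!free_list) {
                |             blocks.push_back(
                |                 std::make_unique<Order[]>(1024));
                |             for (int i = 0; i < 1024; i++) {
                |                 // thread the block onto the free list
                |                 blocks.back()[i].hash_next = free_list;
                |                 free_list = &blocks.back()[i];
                |             }
                |         }
                |         Order* n = free_list;
                |         free_list = n->hash_next;
                |         return n;
                |     }
                |     void release(Order* n) {
                |         n->hash_next = free_list;
                |         free_list = n;
                |     }
                | };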
        
               | TheNumbat wrote:
                | Agreed - I do mention in the post that open addressing
                | is impractical when using intrusive linked lists, which
                | are common in low-level data structures.
        
               | dragontamer wrote:
                | So you've at best removed just 8 bytes, in exchange for
                | rather inconvenient programming, for...
                | 
                |     LinearProbingHashTable<FooBar*, size * 2>
                |     ChainHashTable<FooBar*, size>
                | 
                | Note that LinearProbingHashTable only uses 8 bytes per
                | pointer, while ChainHashTable still needs
                | nextPointerChain, taking up 16 bytes per entry.
                | 
                | The * 2 is for a 50% load factor on the linear table.
                | But we still win with linear if we're at, say, a 75%
                | (simple linear) or 95% load factor (which is doable
                | with Robin Hood).
        
             | kevin_thibedeau wrote:
             | > only performs well up to 50-75% bucket usage.
             | 
              | Robin Hood performs well into the high 90s.
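              | 
              | The trick, roughly: on insert, if the incoming element is
              | further from its home slot than the element already
              | there, swap them. That keeps probe lengths short and
              | uniform even at high load. A minimal insertion sketch
              | (assuming power-of-two capacity and a stored per-slot
              | probe distance):
              | 
              | #include <cstddef>
              | #include <cstdint>
              | #include <utility>
              | 
              | struct Slot { uint64_t key; uint8_t dist; bool used; };
              | 
              | void insert(Slot* table, std::size_t mask,  // capacity-1
              |             uint64_t key, std::size_t home) {
              |     Slot cur{key, 0, true};
              |     std::size_t i = home & mask;
              |     while (table[i].used) {
              |         if (table[i].dist < cur.dist)  // resident is
              |             std::swap(table[i], cur);  // "richer": rob it
              |         i = (i + 1) & mask;
              |         ++cur.dist;
              |     }
              |     table[i] = cur;  // first empty slot ends the probe
              | }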
        
               | anarazel wrote:
                | One situation where I painfully learned that it doesn't
                | is when you iterate over one hashtable to fill another.
                | To defend against that, one needs to add some per-
                | hashtable randomized state into the hash IV.
                | 
                | Through bad experience I also learned that you need to
                | grow not just due to fill factor, but also due to
                | disproportionate chain length, even with the above
                | defense in place.
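                | 
                | i.e. something along these lines (a sketch; the mixing
                | step is MurmurHash3's 64-bit finalizer, used here just
                | for illustration):
                | 
                | #include <cstdint>
                | #include <random>
                | 
                | struct Table {
                |     uint64_t seed;  // per-hashtable randomized state
                |     Table() {
                |         std::random_device rd;
                |         seed = (uint64_t(rd()) << 32) | rd();
                |     }
                |     uint64_t hash(uint64_t key) const {
                |         uint64_t h = key ^ seed;  // mix in the IV
                |         h *= 0xff51afd7ed558ccdULL;
                |         h ^= h >> 33;
                |         return h;
                |     }
                | };
                | 
                | With a distinct seed per table, iterating table A to
                | fill table B no longer replays A's probe order into B.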
        
               | 10000truths wrote:
               | > To defend against that one needs to add some per-
               | hashtable randomized state into the hash IV.
               | 
                | You should almost always be doing this anyway, since
                | most hash tables use user-provided data as keys, which
                | opens up a well-known avenue for denial-of-service
                | attacks:
               | 
               | https://lwn.net/Articles/474912/
        
               | anarazel wrote:
                | I think that's often fixed using a _per run_ random
                | seed, rather than a per-table seed. The patch in the
                | Python bug linked from that article does seem to do
                | that, for example.
        
         | grive wrote:
         | I would tend to agree.
         | 
          | This article conveniently designs its benchmark around
          | exactly the case where open addressing works well.
         | 
          | In almost all of the map implementations I have used so far,
          | chaining was necessary, and intrusive linked lists were used.
         | 
          | An open-addressing scheme (cuckoo) was, however, preferable
          | for a read-mostly map allowing concurrent reads and locked
          | writes.
         | 
         | I would be more interested in an article exploring the
         | compromises involved there, e.g. related to paying the cost of
         | an additional indirection when accessing the value.
         | 
          | Another criticism I would add is that the many tables
          | throughout reference different baselines. A single baseline
          | would be much preferable, to keep a sense of scale for each
          | change.
        
         | [deleted]
        
         | tialaramex wrote:
         | I'm not convinced there are "a lot" of these real and common
         | structures.
         | 
          | They certainly do arise, and particularly if you're also
          | doing concurrency, linked lists are attractive because a
          | compare-exchange type mechanism is ideally suited to such
          | structures and has been implemented cheaply in hardware. But
          | this article was about the _default_, and we don't want
          | defaults tailored to niche uses. The _default_ T-shirt size
          | shouldn't be big enough for The Rock. The _default_ ceiling
          | height in new buildings shouldn't be 1.5 metres. The
          | _default_ beverage offered at your restaurant shouldn't be
          | Mountain Dew Code Red. It's OK that these things are options,
          | that somebody who wants them can have them, but they're bad
          | defaults.
         | 
          | I'd want to see some measurements from real world projects to
          | believe chaining should be the default. If I take, say, the
          | Chromium codebase, and the LibreOffice codebase, and I look
          | at the places they use any type of hash table, how often are
          | they going to be for stuff that'd be "better suited" to
          | chaining? Am I going to consistently find 50%? 10%? 1%?
        
           | mgaunard wrote:
            | Pretty much 95% of all hybrid data structures.
            | 
            | Now, hybrid data structures are only used in systems
            | programming, because they require you to be particularly
            | careful about object lifetimes and such.
        
         | eternalban wrote:
          | Choice-of-two + bias (pick left on ties) + fixed-size bin
          | slices of the backing arrays = very high loading, strictly 2
          | ops (max) per R/W, trivial to write.
          | 
          | I typically just use a very fast cryptographic hash (Blake3
          | is awesome) and have plenty of bits for h1, h2 (you can have
          | additional choices, but it's a fast-diminishing return).
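          | 
          | Roughly, the insert path (a sketch; the bin size and names
          | are illustrative, and h1/h2 would be two slices of the
          | digest):
          | 
          | #include <cstddef>
          | #include <cstdint>
          | 
          | constexpr std::size_t BIN = 8;  // fixed-size bin (slice)
          | 
          | // Two-choice insert with a left bias: put the key in the
          | // less-loaded of its two candidate bins; ties go left.
          | bool insert(uint64_t* slots, uint8_t* counts,
          |             std::size_t nbins, uint64_t key,
          |             uint64_t h1, uint64_t h2) {
          |     std::size_t b1 = h1 % nbins, b2 = h2 % nbins;
          |     std::size_t b = (counts[b2] < counts[b1]) ? b2 : b1;
          |     if (counts[b] == BIN) return false;  // full: rehash
          |     slots[b * BIN + counts[b]++] = key;  // append in bin
          |     return true;
          | }
          | 
          | Lookups likewise touch at most the two candidate bins, hence
          | the strict 2-ops bound.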
        
       | utopcell wrote:
        | Reads like a great summary of the hash map ideas that work in
        | practice. I would have loved to see flat_hash_map<> thrown
        | into the benchmark mix.
        
         | TheNumbat wrote:
          | I just benchmarked absl::flat_hash_map and got results
          | comparable to Robin Hood with a load factor between 75% and
          | 90%, which makes sense. It's also faster for looking up
          | missing keys, so it seems like a good option. I didn't
          | benchmark the maximum probe lengths, though, so I'm not sure
          | on that front.
        
           | camel-cdr wrote:
            | I tried adding the maps to the [1] benchmark, but I wasn't
            | able to, since they aren't type-generic yet. You may want
            | to benchmark against [2] and [3], which are the best
            | performing ones in the above benchmark.
           | 
           | [1] https://github.com/martinus/map_benchmark/
           | 
            | [2] https://github.com/martinus/unordered_dense/blob/main/includ...
           | 
           | [3] https://github.com/ktprime/emhash
           | 
           | Edit: I had the wrong link for [2] and [3], somehow...
        
       | infinet wrote:
        | I learned about open addressing a long time ago when I was
        | optimizing Dnsmasq. The scenario involved looking up a very
        | large list of domain names, either to block them or to combine
        | them with ipset in order to customize routing based on domain
        | names. Open addressing worked very well on these large
        | immutable lists.
        
       | camel-cdr wrote:
        | Sometimes a very basic fixed-size chaining hash table is
        | actually the best you can do.
        | 
        | Take, for example, tcc, which is a very fast C compiler.
        | 
        | They just use a basic chaining hash table: `Node *nodes[16384];`.
        | 
        | Since most translation units have far fewer than 16384 tokens,
        | most lookups result in a direct hit.
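        | 
        | In spirit, something like this (a sketch, not tcc's actual
        | code; tcc's real table stores TokenSym entries):
        | 
        | #include <cstring>
        | 
        | constexpr unsigned TABLE_SIZE = 16384;  // fixed, never grows
        | 
        | struct Node {
        |     Node* next;        // chain link, usually never followed
        |     const char* name;
        | };
        | 
        | Node* nodes[TABLE_SIZE];  // the whole hash table
        | 
        | Node* lookup_or_insert(const char* name, unsigned hash) {
        |     Node*& bucket = nodes[hash % TABLE_SIZE];
        |     for (Node* n = bucket; n; n = n->next)  // 0-1 hops mostly
        |         if (std::strcmp(n->name, name) == 0)
        |             return n;
        |     Node* n = new Node{bucket, name};  // push onto the chain
        |     bucket = n;
        |     return n;
        | }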
        
         | tialaramex wrote:
          | _Is_ that the best TCC could have done here? It's easy to
          | write this in C, which is one obvious reason for TCC to do
          | it, but it's not at all clear this is the best way.
         | 
         | What TCC is actually storing in TokenSym (which you've named
         | "Node" in this context) is four pointers to other information
         | about symbols, a unique ID, and a string.
         | 
          | If we used a compact inline string (like CompactString), we
          | could squeeze all of this into 60 bytes, unless the string is
          | more than 24 bytes in length (such as TCC's own
          | warn_implicit_function_declaration or
          | is_compatible_unqualified_types), in which case the string
          | itself would need a heap allocation, like it has today.
          | 
          | If we do _that_, we can build a linear-probing open-addressed
          | hash table and avoid paying for the extra indirection any
          | time the symbol name is 24 bytes or shorter. I'd expect this
          | to be markedly better.
         | 
         | But it's a lot of work in C, so it isn't a surprise TCC doesn't
         | do it.
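          | 
          | Concretely, the layout could look something like this (a
          | sketch; the field names are guesses at TCC's TokenSym, and
          | padding rounds the ~60 bytes up to 64 on a typical 64-bit
          | ABI):
          | 
          | #include <cstdint>
          | 
          | struct CompactTokenSym {
          |     void* sym_define;  // four pointers to other
          |     void* sym_label;   // information about symbols
          |     void* sym_struct;
          |     void* sym_identifier;
          |     int tok;           // unique ID
          |     uint8_t len;       // name length
          |     union {
          |         char inline_buf[24];  // common case: no indirection
          |         char* heap;           // rare long names allocate
          |     } name;
          | };
          | static_assert(sizeof(CompactTokenSym) == 64, "cache line");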
        
         | mananaysiempre wrote:
         | TCC has the advantage of being a batch process, so it can
         | insert things using open addressing without ever considering
         | how it intends to delete them, then just drop the whole thing
         | on the floor at the end. Of course, if you are also a batch
         | process, you absolutely should do that too.
        
           | citizen_friend wrote:
            | This is an advantage of C-style programming. If you
            | understand the problem and tailor a solution, you can
            | exploit simplifying assumptions that make the code simpler
            | and faster.
           | 
            | General libraries have to handle all kinds of cases you
            | might not care about. The most egregious example is
            | assuming every piece of code might be threaded, so
            | everything needs to be protected by locks. (I would guess >
            | 75% of dictionaries never remove a key, and simply throw
            | away the whole thing at the end.)
        
       | pornel wrote:
        | Modern CPUs keep making optimal algorithms weirder. Speculative
        | superscalar execution and the colossal gap between CPU and
        | memory speeds mean that a brute-force solution that fits in a
        | cache line often wins over solutions that would feel more
        | elegant or efficient.
        
       | sporadicity wrote:
        | Have you tried absl::flat_hash_map? It uses SIMD in a
        | different way than described in this article, and Google
        | claims that it saves them a lot of memory because it still
        | works pretty well at 90-95% occupancy.
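        | 
        | For context, the different use of SIMD is probing a group of
        | one-byte metadata tags with a single SSE2 comparison, along
        | these lines (a simplified sketch of the Swiss-table group
        | match, not Abseil's actual code):
        | 
        | #include <cstdint>
        | #include <emmintrin.h>  // SSE2
        | 
        | // 16 one-byte tags per group: 7 bits of the hash for full
        | // slots, a sentinel value for empty/deleted. One compare
        | // scans 16 slots at once.
        | uint32_t match(const int8_t* tags, int8_t h2) {
        |     __m128i group = _mm_loadu_si128((const __m128i*)tags);
        |     __m128i hits = _mm_cmpeq_epi8(group, _mm_set1_epi8(h2));
        |     // bit i set => slot i's tag matches h2
        |     return (uint32_t)_mm_movemask_epi8(hits);
        | }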
        
         | anonymoushn wrote:
         | I've benchmarked swiss tables and found that (for hit-heavy
         | workloads) a minimum of 2 loads per lookup is expensive
         | compared to 1.
        
       ___________________________________________________________________
       (page generated 2023-04-02 23:00 UTC)