[HN Gopher] Optimizing Open Addressing
___________________________________________________________________

Optimizing Open Addressing

Author : TheNumbat
Score  : 108 points
Date   : 2023-04-02 17:16 UTC (5 hours ago)

(HTM) web link (thenumb.at)
(TXT) w3m dump (thenumb.at)

| mgaunard wrote:
| I find that there is too much of an emphasis on open addressing
| in blog articles and such.
|
| There are a lot of real and common data structures where
| chaining is much better suited, for example whenever the data is
| part of one or more other node-based data structures and you
| layer an index on top of that.
|
| Chaining also tends to waste a lot less memory.
|
| dragontamer wrote:
| In my experience / tests, it's far easier to write a high-
| performance open-addressing hash table than a high-performance
| chaining one.
|
| That being said, chaining is easier to write.
|
| > Chaining also tends to waste a lot less memory.
|
| How so? A typical 32-bit integer or 32-bit float is 4 bytes, but
| a typical pointer is 8 bytes, and there's also the internal
| malloc/new chunk header to consider.
|
| That means that to store a 4-byte integer in a chained hash
| table, you need 8 bytes (pointer) + 8 bytes (glibc malloc chunk
| header) + 4 bytes (integer/float), or 20 bytes total in
| practice.
|
| If you have, say, 50 integers in the hashmap, you'll use 50
| integers * 20 bytes each == 1000+ bytes for the pointers/data
| alone, plus even more for the table itself. (E.g. a table of
| size 50 would need 50 * 8-byte pointers even when empty, for a
| total of 1000 bytes + 400 bytes == 1400 bytes or so.)
|
| In contrast, each slot of an open-addressing hash table of
| 4-byte values only takes up 4 bytes. So 50 integers in a hashmap
| at ~50% load factor (i.e. sizing the table to 128 slots or so)
| is 128 * 4 == 512 bytes.
|
| EDIT: It's because of this huge practical difference in memory
| use that I'm pretty sure open addressing is so high-performance
| in comparison. When your data structure is half the size of the
| competition or smaller, it's easier to stay in L1 cache, among
| other size benefits.
|
| mgaunard wrote:
| Nodes don't have to be allocated with malloc (that is actually
| the worst thing you could possibly do).
|
| An open-addressing hash table only performs well up to 50-75%
| bucket usage. Chaining doesn't care and performs the same up to
| 100%.
|
| a1369209993 wrote:
| > (that is actually the worst thing you could possibly do)
|
| Nonsense, there are many worse things you could do. For example,
| you could allocate nodes with mmap, at 4KB per node. (Which is a
| thing I've actually _done_ in the context of cursed bootstrap
| code where malloc isn't available and having more than a handful
| of (in that case doubly-linked-list) nodes means the environment
| is deranged anyway.)
|
| dragontamer wrote:
| > Nodes don't have to be allocated with malloc (that is actually
| > the worst thing you could possibly do).
|
| I mean... the most obvious place to put the nodes is... inside
| the table itself. Also known as open addressing.
|
| Or what, are you going to implement a second, custom heap
| algorithm to manage your nodes? I guess there are "buddy
| allocators", but buddy allocators aren't exactly free either.
| Whatever method you use to keep track of nodes is going to have
| some level of inefficiency.
|
| Whatever allocation you're doing to get these nodes (above and
| beyond generic malloc) is time you could have spent writing open
| addressing instead.
|
| > An open-addressing hash table only performs well up to 50-75%
| > bucket usage. Chaining doesn't care and performs the same up
| > to 100%.
|
| Hardly an apples-to-apples comparison. As I pointed out, 50%
| load factor on 50x uint32 open addressing is only ~400 bytes
| used.
|
| While 100% load factor on 50x uint32 is 1400 bytes used with
| chaining (plus additional overhead in practice due to malloc
| fragmentation).
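As a rough illustration of the arithmetic in the thread above, here
is a minimal sketch (assuming a 64-bit platform and an 8-byte glibc
chunk header; alignment padding actually pushes the chained figure
slightly above the 1400 bytes quoted):

    #include <cstdint>
    #include <cstdio>

    // One chained entry: a heap node reached through an 8-byte
    // bucket pointer, with a malloc chunk header per node.
    struct ChainNode {
        ChainNode* next;   // 8 bytes
        uint32_t   value;  // 4 bytes, padded to 16 total
    };

    int main() {
        const size_t n = 50;
        const size_t chunk_header = 8;  // glibc bookkeeping per allocation
        size_t chained = n * (chunk_header + sizeof(ChainNode))  // nodes
                       + n * sizeof(ChainNode*);                 // buckets
        size_t open = 128 * sizeof(uint32_t);  // ~50% load factor
        std::printf("chaining ~%zu bytes, open addressing ~%zu bytes\n",
                    chained, open);
    }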
| mgaunard wrote:
| I already explained. Objects in non-trivial data structures are
| part of several data structures concurrently.
|
| dragontamer wrote:
| Perhaps we need to start talking with actual code? Here's just a
| simple idea that's in my brain right now.
|
|     #include <vector>
|
|     template <class Data>
|     struct HashNode {
|         // Probably should be shared_ptr<HashNode<Data>>, but
|         // that's now adding even more inefficiencies, like a
|         // ref_count per element
|         HashNode<Data>* nextPointerChain;
|         Data d; // Maybe Data* d if you're sharing it with
|                 // other data structures?
|     };
|
|     template <class Data, int size>
|     struct ChainHashTable {
|         HashNode<Data> theTable[size];
|     };
|
|     template <class Data, int size>
|     struct LinearProbingHashTable {
|         Data theTable[size];
|         std::vector<bool> occupied; // sized to "size"; 0 means
|                                     // empty, 1 means occupied
|     };
|
| nextPointerChain takes up 8 bytes, even if it's nullptr. No
| other data structure needs to have the nextPointerChain. If we
| use shared_ptr<HashNode<Data>> as per typical modern C++,
| there's even more inefficiency that I forgot about. EDIT: You
| probably can get away with unique_ptr, now that I think of it.
|
| ----------
|
| Pointers aren't free, btw. If you're sharing a pointer with many
| different parts of your program, that's definitely one of those
| things you'll want to start getting rid of. Not only does a
| pointer cost 8 bytes, but it's also _SIGNIFICANTLY_ slower and
| cache-unfriendly in practice.
|
| L1 cache works by reading data in and around anything you
| access. If you're only using 8 bytes of a cache line for pointer
| indirection (8/64), you're wasting 56 bytes of your fetch.
|
| Linear probing is extremely fast because it tends to blitz
| through L1 cache thanks to locality of data. Simple linear
| probing (or Robin Hood augmented probing) is probably the
| fastest, and simplest, approach for modern CPUs.
|
| mgaunard wrote:
| You still fail to understand what being part of multiple data
| structures means.
|
| Shared pointers, a separate container to check occupancy,
| suggesting making a container of pointers: all these things
| suggest you have no idea how to do hybrid data structures.
|
| The nodes contain the data along with multiple pointers that
| chain them into the various data structures they are part of
| (trees, lists, hash tables, etc.). The object in question
| contains state that is several cache lines long.
|
| You cannot just put the nodes inside your buckets, as you don't
| get to control where the nodes are and cannot invalidate the
| other data structures. Though I suppose it's a possibility if
| you know ahead of time how many nodes you're going to need, but
| in that case you may be able to go for perfect hashing to begin
| with (unless you're building some kind of fixed-size cache).
|
| A given object can, for example, be part of several hash maps,
| based on the various pieces of state it is useful to index by.
|
| In practice you'd use a pool to allocate all your nodes in
| blocks. There is no per-node overhead with this allocation
| strategy.
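To make the hybrid-structure argument concrete, here is a hedged
sketch (all names, sizes, and the pool layout are hypothetical, not
from any codebase in the thread): the links live inside the object,
the chained hash index merely points at nodes it does not own, and
a block pool avoids per-node malloc headers.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical object indexed by several structures at once.
    // The links are intrusive: no container owns or moves the node.
    struct Order {
        uint64_t id;
        uint64_t price;
        Order*   hash_next;  // chain link for an id -> Order hash index
        Order*   list_prev;  // links for a doubly linked insertion list
        Order*   list_next;
        Order*   tree_left;  // links for a price-ordered binary tree
        Order*   tree_right;
        // ... potentially several more cache lines of state ...
    };

    // Fixed-size chained hash index over nodes that live elsewhere.
    struct OrderById {
        static constexpr size_t kBuckets = 4096;  // power of two
        Order* buckets[kBuckets] = {};

        void insert(Order* o) {
            size_t b = o->id & (kBuckets - 1);
            o->hash_next = buckets[b];
            buckets[b] = o;
        }
        Order* find(uint64_t id) const {
            for (Order* o = buckets[id & (kBuckets - 1)]; o;
                 o = o->hash_next)
                if (o->id == id) return o;
            return nullptr;
        }
    };

    // Pool allocation in fixed blocks: no per-node malloc header,
    // and node addresses stay stable, so the pointers held by the
    // other data structures are never invalidated.
    struct OrderPool {
        std::vector<std::vector<Order>> blocks;
        size_t used_in_last = 0;

        Order* alloc() {
            if (blocks.empty() || used_in_last == blocks.back().size()) {
                blocks.emplace_back(1024);  // one block of 1024 nodes
                used_in_last = 0;
            }
            return &blocks.back()[used_in_last++];
        }
    };

Erasing such a node from one index rewires only that index's links;
the node itself stays put, so the other structures remain valid.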
| TheNumbat wrote:
| Agreed - I do mention in the post that open addressing is
| impractical when using intrusive linked lists, which are common
| in low-level data structures.
|
| dragontamer wrote:
| So you've at best removed just 8 bytes, for rather inconvenient
| programming, for...
|
|     LinearProbingHashTable<FooBar*, size*2>
|     ChainHashTable<FooBar*, size>
|
| Note that LinearProbingHashTable only uses 8 bytes per pointer,
| while ChainHashTable's node still needs nextPtr, taking up 16
| bytes.
|
| The *2 is for 50% load factor on the linear table. But we still
| win on linear if we're at, say, 75% (simple linear) or 95% load
| factor (which is doable with Robin Hood).
|
| kevin_thibedeau wrote:
| > only performs well up to 50-75% bucket usage.
|
| Robin Hood performs well into the high 90s.
|
| anarazel wrote:
| One situation where I painfully learned that it doesn't is when
| you iterate over one hashtable to fill another. To defend
| against that, one needs to add some per-hashtable randomized
| state into the hash IV.
|
| Through bad experience I also learned that you need to grow not
| just due to fill factor, but also due to disproportionate chain
| length, even with the above defense in place.
|
| 10000truths wrote:
| > To defend against that, one needs to add some per-hashtable
| > randomized state into the hash IV.
|
| You should almost always be doing this anyway, since most hash
| tables use user-provided data as keys, which opens up a well-
| known avenue for denial-of-service attacks:
|
| https://lwn.net/Articles/474912/
|
| anarazel wrote:
| I think that's often fixed using a _per run_ random seed, rather
| than a per-table seed. The patch linked in the Python bug linked
| from that article does seem to do that, for example.
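A small sketch of the per-table seeding described above (the mixer
and all names are placeholders, not any particular library's
scheme): each table instance draws random state at construction, so
two tables never agree on bucket order and attacker-chosen keys
cannot be precomputed to collide.

    #include <cstddef>
    #include <cstdint>
    #include <random>

    // Placeholder 64-bit mixer (a real table would use its own hash).
    static uint64_t mix(uint64_t x) {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }

    struct SeededTable {
        uint64_t seed;

        SeededTable() {
            // Per-table random state: iterating one table to fill
            // another no longer inserts in that table's bucket order.
            std::random_device rd;
            seed = (uint64_t(rd()) << 32) | rd();
        }

        // nbuckets is assumed to be a power of two.
        size_t bucket_of(uint64_t key, size_t nbuckets) const {
            return mix(key ^ seed) & (nbuckets - 1);
        }
        // ... insert/find/grow as usual, always via bucket_of() ...
    };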
| grive wrote:
| I would tend to agree.
|
| This article conveniently designs its benchmark to be exactly
| the case where open addressing works well.
|
| In almost all of the map implementations I have used so far,
| chaining was necessary and intrusive linked lists were used.
|
| An open-addressing scheme was however preferable (cuckoo) for a
| read-mostly map allowing concurrent reads and locked writes.
|
| I would be more interested in an article exploring the
| compromises involved there, e.g. related to paying the cost of
| an additional indirection when accessing the value.
|
| Another criticism of this article I would add is that the many
| tables throughout reference different baselines. A single
| baseline would be much preferable, to keep a sense of scale for
| each change.
|
| [deleted]
|
| tialaramex wrote:
| I'm not convinced there are "a lot" of these real and common
| structures.
|
| They certainly do arise, and particularly if you're also doing
| concurrency, linked lists are attractive because a compare-
| exchange type mechanism is ideally suited to such structures and
| has been implemented cheaply in hardware. But this article was
| about the _default_, and we don't want defaults tailored to
| niche uses. The _default_ T-shirt size shouldn't be big enough
| for The Rock. The _default_ ceiling height in new buildings
| shouldn't be 1.5 metres. The _default_ beverage offered at your
| restaurant shouldn't be Mountain Dew Code Red. It's OK that
| these things are options, that somebody who wants them can have
| them, but they're bad defaults.
|
| I'd want to see some measurements from real-world projects to
| believe chaining should be the default. If I take, say, the
| Chromium codebase and the LibreOffice codebase, and I look at
| the places they use any type of hash table, how often are they
| going to be for stuff that'd be "better suited" to chaining? Am
| I going to consistently find 50%? 10%? 1%?
|
| mgaunard wrote:
| Pretty much 95% of all hybrid data structures.
|
| Now, hybrid data structures are only used in systems
| programming, because they require you to be particularly careful
| about object lifetime and such.
|
| eternalban wrote:
| Choice-of-two + bias (pick L) + fixed-size bin slices of the
| backing arrays = very high loading, strictly 2 ops (max) per
| R/W, trivial to write.
|
| I typically just use a very fast cryptographic hash (Blake3 is
| awesome) and have plenty of bits for h1, h2 (you can have
| additional choices, but it is a fast diminishing return).
|
| utopcell wrote:
| Reads like a great summary of the hash map ideas that work in
| practice. I would have loved to see flat_hash_map<> thrown into
| the benchmark mix.
|
| TheNumbat wrote:
| I just benchmarked absl::flat_hash_map and got results
| comparable to Robin Hood with a load factor between 75% and 90%,
| which makes sense. It's also faster for looking up missing keys,
| so it seems like a good option. I didn't benchmark the maximum
| probe lengths, though, so I'm not sure on that front.
|
| camel-cdr wrote:
| I tried adding the maps to the [1] benchmark, but I wasn't able
| to, since they aren't type-generic yet. You may want to
| benchmark against [2] and [3], which are the best-performing
| ones in the above benchmark.
|
| [1] https://github.com/martinus/map_benchmark/
|
| [2] https://github.com/martinus/unordered_dense/blob/main/includ...
|
| [3] https://github.com/ktprime/emhash
|
| Edit: I had the wrong links for [2] and [3], somehow...
|
| infinet wrote:
| I learned about open addressing a long time ago when I was
| optimizing Dnsmasq. The scenario involved looking up a very
| large list of domain names to block, or combining it with ipset
| in order to customize routing based on domain names. Open
| addressing worked very well on these large immutable lists.
|
| camel-cdr wrote:
| Sometimes a very basic fixed-size chaining hash table is
| actually the best you can do.
|
| Take, for example, tcc, which is a very fast C compiler.
|
| It just uses a basic chaining hash table: `Node *nodes[16384];`.
|
| Since most translation units have far fewer than 16384 tokens,
| most lookups result in a direct hit.
|
| tialaramex wrote:
| _Is_ that the best TCC could have done here? It's easy to write
| this in C, which is one obvious reason for TCC to do it, but
| it's not at all clear this is the best way.
|
| What TCC is actually storing in TokenSym (which you've named
| "Node" in this context) is four pointers to other information
| about symbols, a unique ID, and a string.
|
| If we used a compact inline string (like CompactString), we
| could squeeze all of this into 60 bytes, unless the string is
| more than 24 bytes in length, such as TCC's own
| warn_implicit_function_declaration or
| is_compatible_unqualified_types, which would need a heap
| allocation like they have today, for the string itself.
|
| If we do _that_, we can build a linear-probed, open-addressed
| hash table and avoid paying for the extra indirection any time
| the symbol names are 24 bytes or shorter; I'd expect this to be
| markedly better.
|
| But it's a lot of work in C, so it isn't a surprise that TCC
| doesn't do it.
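For reference, a stripped-down sketch of the fixed-size chaining
scheme camel-cdr describes, written here in C++ (the field names
and hash are simplified guesses, not TCC's actual TokenSym code):

    #include <cstddef>
    #include <cstring>

    // Simplified stand-in for TCC's TokenSym: the real struct holds
    // four pointers of symbol state, a unique id, and the string.
    struct Node {
        Node*       next;  // chain link, usually null
        unsigned    tok;   // unique token id
        const char* name;  // identifier string
    };

    constexpr size_t kTableSize = 16384;  // far above typical symbol counts
    static Node* nodes[kTableSize];

    static size_t hash_str(const char* s) {
        size_t h = 1;
        for (; *s; ++s) h = h * 263 + (unsigned char)*s;
        return h & (kTableSize - 1);
    }

    static Node* lookup(const char* name) {
        // With a few thousand symbols spread over 16384 buckets, the
        // first probe is almost always a direct hit or an empty slot.
        for (Node* n = nodes[hash_str(name)]; n; n = n->next)
            if (std::strcmp(n->name, name) == 0) return n;
        return nullptr;
    }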
| mananaysiempre wrote:
| TCC has the advantage of being a batch process, so it can insert
| things using open addressing without ever considering how it
| intends to delete them, then just drop the whole thing on the
| floor at the end. Of course, if you are also a batch process,
| you absolutely should do that too.
|
| citizen_friend wrote:
| This is an advantage of C-style programming. If you understand
| the problem and tailor a solution, you can use simplifying
| assumptions which make the code easier and faster.
|
| General libraries have to handle all kinds of cases you might
| not care about. The most egregious example is assuming every
| piece of code might be threaded, so everything needs to be
| protected by locks. (I would guess > 75% of dictionaries never
| remove a key, and simply throw away the whole thing at the end.)
|
| pornel wrote:
| Modern CPUs keep making optimal algorithms weirder. Speculative
| superscalar execution and the colossal gap between CPU and
| memory speed mean that a brute-force solution that fits in a
| cache line often wins over solutions that would feel more
| elegant or efficient.
|
| sporadicity wrote:
| Have you tried absl::flat_hash_map? It uses SIMD in a different
| way than described in this article, and Google claims that it
| saves them a lot of memory because it still works pretty well at
| 90-95% occupancy.
|
| anonymoushn wrote:
| I've benchmarked Swiss tables and found that (for hit-heavy
| workloads) a minimum of 2 loads per lookup is expensive compared
| to 1.
___________________________________________________________________
(page generated 2023-04-02 23:00 UTC)