[HN Gopher] A simple hash table in C
       ___________________________________________________________________
        
       A simple hash table in C
        
       Author : todsacerdoti
       Score  : 42 points
       Date   : 2023-06-12 22:58 UTC (1 days ago)
        
 (HTM) web link (theleo.zone)
 (TXT) w3m dump (theleo.zone)
        
       | samsquire wrote:
       | Thank you for this article.
       | 
       | I have been working on trying to design a contiguous hash table
       | data structure that has these two mutual requirements:
       | 
       | a) allows deep clones (due to contiguous memory copyable with
       | memcpy)
       | 
       | b) supports arbitrarily nestable hash tables
       | 
       | The closest library I know about is smolworld (
       | https://github.com/snej/smol_world ) but I don't know how easy it
       | can be cloned.
       | 
       | These requirements rules out multiple mallocs: I do one single
       | malloc and expect that to be enough for the entire hashmap.
       | 
       | I'm not sure if these constraints might also force a fixed number
       | of buckets and a fixed capacity.
       | 
       | The clonable property requires interior mutability be limited,
       | because if you memcpy pointers they would cause structure
       | sharing, that I'm trying to avoid, hence a deep clone.
       | 
       | Rationale: these are the use cases I have for such a data
       | structure. The first is cheap copy on write. I also have a left-
       | right concurrency control hashmap, but I feel the properties of
       | this data structure are even better. The first is a sharding in
       | multithreading design: do a cheap memcpy for each thread and
       | shard the processing and merge at the end, without any
       | synchronization cost while the threads are working. I know that
       | deep cloning is slow in Java and presumably C if you do it with
       | loops rather than memcpy. Another use case is efficient
       | serialization for network.
       | 
       | In other words, a protobuf but easily deep clonable without
       | loops.
        
         | teo_zero wrote:
         | If you pre-allocate a large enough buffer, you can keep a
         | pointer to mark the boundary between the part where you have
         | already stored data and the part still empty. Then re-implement
         | malloc() to just move such pointer. Of course you must check
         | for running out of allocated memory, etc. This simple approach
         | assumes that you seldom delete entries, because it never
         | reclaims the room occupied by deleted entries.
        
       | jmacjmac wrote:
       | Nice article! I think eventually you need to move to macros to
       | support multiple key/value types in C. Just leaving some macro
       | implementations for reference:
       | 
       | - https://github.com/attractivechaos/klib/blob/master/khash.h
       | 
       | - https://github.com/tezc/sc/tree/master/map
       | 
       | - https://github.com/troydhanson/uthash
        
         | hot_gril wrote:
         | I think uthash is the one I used long ago. Fun and easy.
        
       | coffeeri wrote:
       | I've had lately a look at QEMUs internals and saw their thread
       | safe implementation of a hash table, capable of concurrent reads:
       | qht [0].
       | 
       | If the author sees this, you might want to take a look at it.
       | 
       | [0] https://github.com/qemu/qemu/blob/master/util/qht.c
        
         | torstenvl wrote:
         | OP doesn't include a license, but OP author's GitHub is mostly
         | MIT or CC-BY.
         | 
         | Since QEMU's implementation is GPL, it's probably best if OP
         | author does _not_ take a look at it.
        
           | Retr0id wrote:
           | That's not how software licenses work.
        
       | zabzonk wrote:
       | if the initial array is going to be of fixed size, why not:
       | typedef struct HashTable {             Entry *buckets[4];
       | int nBuckets;         } HashTable;
       | 
       | which removes one layer of dynamic memory management.
        
         | the-smug-one wrote:
         | >resize the buckets array once it gets too full
         | 
         | Because of that, basically.
        
           | zabzonk wrote:
           | i don't see where in his code this is done. possibly me being
           | stupid.
        
             | the-smug-one wrote:
             | Oh sorry, it's from "future improvements" :), I assumed
             | that was the train of thought. You're entirely correct,
             | however.
        
       | jandrese wrote:
       | I was surprised recently when looking at different hash tables
       | that have been implemented in C to discover that the standard
       | library includes its own hash table. They are even part of POSIX.
       | There is a reason you have never heard of it, or if you have you
       | have never used it. In true POSIX fashion they are close to
       | useless. The implementation doesn't allow you to modify the table
       | after it has been created, you have to pass in all the data when
       | you create the table. There is no add or delete and you can't
       | change any value you pull from the table. It also stores the data
       | in a static area local to the function so you can only use a
       | single table in a program at a time. It doggedly commits every C
       | stdlib sin possible.
        
         | zabzonk wrote:
         | there are no hash tables in the Standard C Library - POSIX does
         | not define Standard C.
         | 
         | more info: https://stackoverflow.com/questions/6118539/why-are-
         | there-no...
        
       | dvh wrote:
       | 1. Implement simple hash table in C
       | 
       | 2. Implement bloom filter
       | 
       | 3. Download list of Bitcoin wallets with non-zero amount of BTC
       | 
       | 4. Generate long list of random key pairs
       | 
       | 5. ???
       | 
       | 6. Profit
        
       | version_five wrote:
       | I was looking for something like this a few months ago and found
       | this which also may be of interest:
       | 
       | https://news.ycombinator.com/item?id=26590234
       | 
       | How to implement a hash table in C (benhoyt.com) 302 points by
       | benhoyt on March 26, 2021, 158 comments
        
         | benhoyt wrote:
         | Thanks for sharing! I do like the OP's article, but the more
         | the merrier! My "How to implement a hash table in C" article
         | [1] is for whatever reason one of my most popular articles. If
         | only C had a slightly better standard library. :-)
         | 
         | In my article I use "open addressing" [2] instead of linked
         | lists for collisions, as 1) it tends to be simpler, as you only
         | have one data structure to manage (array, rather than array and
         | linked list), and 2) it tends to be faster, as linked lists are
         | slow on modern CPUs because they require jumping around in
         | memory.
         | 
         | [1] https://benhoyt.com/writings/hash-table-in-c/
         | 
         | [2] https://en.wikipedia.org/wiki/Open_addressing
        
           | version_five wrote:
           | Hey, thanks for writing it up! Like I said, I was looking for
           | exactly this a few months ago, and between the explanation
           | and the code it was a big help.
        
       | [deleted]
        
       | the-smug-one wrote:
       | This hashtable sure mallocs a lot!
       | 
       | Look here for example:
       | 
       | Entry *newVal = malloc(sizeof(Entry)); newVal->key = strdup(key);
       | 
       | Why not:
       | 
       | Entry* e = malloc(sizeof(Entry) + strlen(key));
       | 
       | Now you can also skip the pointer to the key, saving 8 bytes for
       | each entry. Also, just return the hashtable by value, it's so
       | small.
       | 
       | I think that if we're coding in C, C++ or Zig, then it can be fun
       | to actually use the lower level power to our advantage :).
        
         | aloisklink wrote:
         | Even better, you can use [C99 Flexible array members][1], e.g.
         | something like:                   struct KeyedEntry {
         | struct Entry entry;           char key[]; // flexible array
         | member         };
         | 
         | It's not too much more useful when just using `char`, but for
         | other data types, it's a bit cleaner, since it handles
         | alignment/padding better.
         | 
         | [1]: https://en.wikipedia.org/wiki/Flexible_array_member, or
         | https://beej.us/guide/bgc/html/split/structs-ii-more-fun-wit...
         | 
         | _Edit_: Something like this might not be suitable for the
         | author, since they did mention how they wanted each `struct` to
         | have a fixed size.
        
       ___________________________________________________________________
       (page generated 2023-06-13 23:01 UTC)