[HN Gopher] How to implement a hash table in C
       ___________________________________________________________________
        
       How to implement a hash table in C
        
       Author : benhoyt
       Score  : 239 points
       Date   : 2021-03-26 09:39 UTC (13 hours ago)
        
 (HTM) web link (benhoyt.com)
 (TXT) w3m dump (benhoyt.com)
        
       | sobriquet9 wrote:
       | Will this expression correctly handle overflow?
       | 
       | size_t mid = (low + high) / 2;
        
         | mywittyname wrote:
         | Interesting question.
         | 
         | Since this is integer division, I think you can do:
          | size_t low_s = low >> 1;
          | size_t high_s = high >> 1;
          | size_t mid = low_s + high_s;
         | 
         | ...With no overflow worries. (expanded for clarity)
         | 
         | That would effectively be distributing the division. So it's
         | equivalent to low / 2 + high / 2.
        
           | saagarjha wrote:
           | This gives a different result when both low and high are odd.
        
         | chrchang523 wrote:
         | Not an issue since the following is at the top of the function:
          |     if (size + size < size) {
          |         return NULL;  // size too big; avoid overflow
          |     }
        
         | kt66 wrote:
          | what about (low >> 1) + (high >> 1) + (1 & low & high)
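
The two overflow-safe midpoint variants discussed in this subthread can be sketched as follows (illustrative names, not from the thread's code). The subtraction form is the usual fix; the shift form needs the extra carry term because halving both operands separately loses the 1 that `low + high` would carry when both are odd.

```c
#include <stddef.h>
#include <assert.h>

/* Overflow-safe midpoints for size_t, assuming low <= high. */
static size_t mid_sub(size_t low, size_t high) {
    /* high - low cannot overflow, and adding half of it back to low
       stays within [low, high] */
    return low + (high - low) / 2;
}

static size_t mid_shift(size_t low, size_t high) {
    /* halve each operand, then restore the carry lost when both are odd */
    return (low >> 1) + (high >> 1) + (low & high & 1);
}
```

Both agree with floor((low + high) / 2) even when `low + high` would wrap around.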
        
       | thamer wrote:
       | > When the hash table gets too full, we need to allocate a larger
       | array and move the items over. This is absolutely required when
       | the number of items in the hash table has reached the size of the
       | array, but usually you want to do it when the table is half or
       | three-quarters full.
       | 
       | Yes, but _how_ you resize is important too: if you have a
        | threshold size (like 3/4 full) at which you block and
        | redistribute all elements into a new array, you will incur a
       | significant pause when this happens, e.g.
       | https://log.kv.io/post/2009/05/15/leaky-hashtables
       | 
       | Instead, when you reach the threshold amount you can create the
       | new array and then gradually migrate the keys in small batches
       | either with each operation or with a background thread. So on
       | `get` if we're in the process of resizing: first check the new
       | table, then the old one, and before returning migrate N keys from
       | the old table to the new one. Free the old array once all the
       | keys are migrated.
       | 
       | I wrote a small hash table implementation with gradual re-hashing
       | a while back, search for DICT_MAX_LOAD and dict_rehash here:
       | https://github.com/nicolasff/ht/blob/master/dict.c#L193
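
A minimal sketch of that gradual migration, as my own illustrative code rather than anything from the linked dict.c: while a resize is in flight, every get/set first migrates a couple of old slots, writes go to the new array, and reads check the new table before the old one.

```c
#include <stdlib.h>
#include <assert.h>

/* Illustrative gradual-rehashing dict: linear probing, int keys/values. */
typedef struct { int k, v, used; } entry;

typedef struct {
    entry *tab, *newtab;   /* newtab is non-NULL while a resize is underway */
    size_t cap, newcap;
    size_t pos;            /* next slot of tab left to migrate */
} dict;

static void raw_put(entry *t, size_t cap, int k, int v) {
    size_t b = (size_t)(unsigned)k % cap;                /* trivial hash */
    while (t[b].used && t[b].k != k) b = (b + 1) % cap;  /* linear probing */
    t[b].k = k; t[b].v = v; t[b].used = 1;
}

static entry *raw_get(entry *t, size_t cap, int k) {
    size_t b = (size_t)(unsigned)k % cap;
    for (size_t n = 0; n < cap && t[b].used; n++, b = (b + 1) % cap)
        if (t[b].k == k) return &t[b];
    return NULL;
}

static void dict_init(dict *d, size_t cap) {
    d->tab = calloc(cap, sizeof *d->tab);
    d->cap = cap;
    d->newtab = NULL; d->newcap = 0; d->pos = 0;
}

static void begin_resize(dict *d, size_t newcap) {
    d->newtab = calloc(newcap, sizeof *d->newtab);
    d->newcap = newcap;
    d->pos = 0;
}

/* migrate up to `batch` old slots; piggybacked on every get/set */
static void step(dict *d, size_t batch) {
    while (d->newtab && batch--) {
        if (d->pos == d->cap) {              /* migration done: swap arrays */
            free(d->tab);
            d->tab = d->newtab; d->cap = d->newcap; d->newtab = NULL;
            return;
        }
        entry *e = &d->tab[d->pos++];
        /* don't clobber a newer value already written to the new table */
        if (e->used && !raw_get(d->newtab, d->newcap, e->k))
            raw_put(d->newtab, d->newcap, e->k, e->v);
    }
}

static void dict_set(dict *d, int k, int v) {
    step(d, 2);
    if (d->newtab) raw_put(d->newtab, d->newcap, k, v);
    else raw_put(d->tab, d->cap, k, v);
}

static entry *dict_get(dict *d, int k) {
    step(d, 2);
    entry *e = d->newtab ? raw_get(d->newtab, d->newcap, k) : NULL;
    return e ? e : raw_get(d->tab, d->cap, k);
}
```

The old array is freed only once every slot has been walked, so no single operation pays for the whole rehash.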
        
         | moistbar wrote:
         | I'm not a particularly good programmer, so I may be missing
         | something here, but I'd think that moving the items should be
         | its own function that happens in between gets and sets, no?
         | Besides adding a decidedly non-read-only operation to what
         | should be read-only, shouldn't the movement happen when you're
         | _not_ trying to do anything with the hash table?
        
           | jeffffff wrote:
           | typically rehashing is done during insert. if you were going
           | to do it while not doing anything else with the hash table
           | you'd have to use a background thread, and then you'd have to
           | do some form of synchronization, so there would be
           | significant additional complexity and overhead.
        
         | OskarS wrote:
         | It's usually not the right call to use a hash table like you're
         | describing. You're not saving any work (all keys have to be
         | migrated sooner or later) and by doing it like this you incur a
         | non-trivial amount of overhead on most operations.
         | 
         | Hash tables generally work on the assumption that operations
         | are O(1) in the amortized sense, that individual inserts might
         | be expensive because of rehashing, but in aggregate they are
         | very cheap. In practice, this is almost always a fine
         | assumption, and it is how the vast majority of hash tables in
         | all programming languages work.
         | 
         | The kinds of hash tables you are talking about have their uses
         | in places where you have soft real-time constraints, but that
         | is honestly a pretty niche use case.
        
           | ModernMech wrote:
           | > Hash tables generally work on the assumption that
           | operations are O(1) in the amortized sense, that individual
           | inserts might be expensive because of rehashing, but in
           | aggregate they are very cheap.
           | 
           | Hash tables aren't O(1) in an amortized sense, they are O(1)
           | in a statistical sense. The O(1) complexity of hash tables is
           | dependent entirely on the size of the underlying array, the
           | method of collision resolution, and the quality of the
           | (pre)hash function in relation to the distribution of data
           | you're storing in the table. Whether you insert 1 item or 1
           | million items into the table, you're still getting O(1)
           | computational complexity if the table is designed well. If
           | not, you could be getting O(n) complexity in the worst case.
           | 
           | While the fixed cost of hashing is irrelevant in the O(1)
           | nature of hash table operations, it does mean that a hash
           | table in practice could be slower (in terms of wall time)
           | than a linked list for insert/find/remove operations. Despite
           | a linked list being O(n) for those operations, a linked list
           | doesn't have to pay the fixed cost of hashing. So for low
           | values of n, the linked list might be a better choice. I
           | think this is what you mean by "O(1) in the amortized sense"
           | but it's not what we mean when we say the computational
           | complexity of these operations on a hash table is O(1).
           | 
           | What the OP is talking about is an example of amortization of
           | resize(), essentially adding a fixed cost to insert() and
           | remove() so you don't have to pay a huge sum when you call
           | resize(). But that doesn't mean insert() and remove() are no
           | longer O(1).
        
             | justinpombrio wrote:
             | Amortized cost is a technical term in complexity analysis:
             | https://en.wikipedia.org/wiki/Amortized_analysis
             | 
             | If you use the terminology precisely, the worst case cost
             | of a regular hash table insertion is O(n) because of the
             | potential resize, but its amortized cost is O(1) because
             | that's what it "averages out to".
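
That "averages out" claim is easy to check numerically with a toy model of capacity doubling, counting only element copies (my sketch, not code from the thread):

```c
#include <stddef.h>
#include <assert.h>

/* Toy model of doubling resizes: count how many element copies n
 * appends trigger in total.  The sum 1 + 2 + 4 + ... stays below 2n,
 * which is exactly the amortized-O(1)-per-insert argument. */
static size_t total_copies(size_t n) {
    size_t cap = 1, len = 0, copies = 0;
    while (len < n) {
        if (len == cap) {      /* table full: copy everything, double */
            copies += len;
            cap *= 2;
        }
        len++;
    }
    return copies;
}
```

So although one unlucky insert costs O(n), the total over n inserts is under 2n copies.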
        
           | sgerenser wrote:
           | Agreed this seems like major overkill, but then again so does
           | writing your own hash table from scratch in C. :shrug:
        
           | legulere wrote:
           | If your application is not negatively affected by such jitter
           | why not use a garbage collected language?
        
           | jeffffff wrote:
           | iirc memcached either uses or used to use
           | https://en.wikipedia.org/wiki/Linear_hashing to avoid
           | rehashing induced pauses. it definitely makes sense for that
           | use case but you're correct that it's not the right call for
           | a general purpose hash table implementation.
        
           | pjam wrote:
           | The progressive rehashing described in the article is very
           | similar to what Redis does [1].
           | 
            | Just thought I'd share an example of a use case where the
           | incremental rehashing logic makes sense.
           | 
           | [1]: https://github.com/redis/redis/blob/unstable/src/dict.c
        
         | vladf wrote:
         | What happens if you have a sequence of puts that triggers an
         | expansion and some keys get migrated, but all of a sudden you
         | get a series of removes, such that you need to perform a shrink
         | to keep memory usage linear?
         | 
         | Do you now end up with three live arrays? You can probably make
         | things even more pathological...
        
           | coliveira wrote:
           | Hash tables are not great for data structures that change
           | dramatically in size. If you have this situation, the best
           | thing to do is to copy one hash table into another and
           | discard the original data structure.
        
         | chasd00 wrote:
         | can you do hashtable sharding? like have a hashtable point to
         | another hashtable by using something appended to the key? just
         | a thought on a procrastination filled Friday afternoon.
        
         | stephc_int13 wrote:
         | You can also allocate a small hash table at first, and
          | reallocate it in a large virtual memory region once it's about
         | the size of a page (4k) so you won't have to reallocate it
         | later.
        
         | kbenson wrote:
         | > So on `get` if we're in the process of resizing
         | 
         | Not just _get_ , but _set_ also, right? Otherwise I would think
         | a large group of set operations that overflowed more than once
          | might be problematic to deal with. Or maybe I'm missing
          | something that makes that a non-issue?
        
           | tkhattra wrote:
            | yes, _set_ also. go's map implementation uses this technique
           | - allocate a new array of buckets and gradually migrate from
           | old to new buckets (i think it does the migration during
           | _set_ and _delete_ only, not _get_ ).
        
       | tengbretson wrote:
       | Unless you need the specific memory characteristics of a hash
       | table, a trie data structure is a pretty good solution to a lot
       | of the same problems and in my opinion is conceptually easier to
       | understand.
        
         | taeric wrote:
            | I have never seen a trie described as easier than a hash table.
         | Curious what makes you say they are easier?
        
           | tengbretson wrote:
           | I find their performance characteristics easier to naively
           | understand without resorting to profiling with production
           | data.
        
             | taeric wrote:
             | I should have worded my question more as "do you have a
             | recommendation to describe tries?" In particular,
             | construction is tough.
        
               | tengbretson wrote:
               | The C# implementation here is pretty approachable for
               | just about any kind of programming background.
               | 
               | https://www.geeksforgeeks.org/trie-insert-and-search/
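
For reference, a minimal trie over lowercase ASCII looks something like this in C (my own sketch, not the linked page's code):

```c
#include <stdlib.h>
#include <assert.h>

/* Minimal trie over 'a'..'z': one node per prefix, plus a flag on the
 * nodes where an inserted word ends. */
typedef struct trie {
    struct trie *child[26];
    int is_word;
} trie;

static trie *trie_new(void) { return calloc(1, sizeof(trie)); }

static void trie_insert(trie *root, const char *word) {
    for (; *word; word++) {
        int i = *word - 'a';
        if (!root->child[i]) root->child[i] = trie_new();
        root = root->child[i];
    }
    root->is_word = 1;
}

static int trie_search(const trie *root, const char *word) {
    for (; *word && root; word++)
        root = root->child[*word - 'a'];
    return root != NULL && root->is_word;
}
```

The predictability mentioned above falls out directly: a lookup touches exactly one node per character, regardless of how many keys are stored.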
        
         | [deleted]
        
       | bluedino wrote:
       | What if this became the new "invert a binary tree on a
       | whiteboard"?
        
         | Google234 wrote:
            | This is pretty straightforward stuff.
        
           | 0xbadcafebee wrote:
           | Hash tables are deceptively complicated, like sorting
           | algorithms.
        
             | cogman10 wrote:
             | Like sorting algorithms, they can be as complicated as you
             | want them to be.
             | 
             | A merge sort isn't the fastest sort but it's pretty easy
             | for anyone to implement. Something like a timsort is a bit
              | more complex but is what a lot of languages use nowadays.
             | 
             | Hash tables are in the same boat. A basic hashtable is
             | pretty easy to implement. It's only when you start looking
             | at things like ideal item placement and optimal filling
             | that things start to get more complex.
        
               | kragen wrote:
               | Dunno, I wrote
               | https://news.ycombinator.com/item?id=26595520 just now
               | and my super dumb hash table has two bugs in its 21 lines
               | of code:
               | 
               | 1. It doesn't update--attempts to change the value
               | associated with an existing key are silently ignored,
               | although they do consume buckets.
               | 
               | 2. It loops forever if the table is full _and the key it
               | was looking for is negative_ , because that invokes
               | conversion of a negative signed value to an unsigned
               | value, which I think is UB in C.
               | 
               | Maybe it has more bugs I haven't found yet.
               | 
               | Additionally the reason it took me 15 minutes to write 21
               | lines of code was that I changed a bunch of things in
               | midstream, before posting the comment:
               | 
               | 1. At first there was no "full" field in the hash table
               | bucket, so there was no way to distinguish full buckets
               | from empty buckets.
               | 
               | 2. And then when I added it, I added it as "dead", with
               | the opposite Boolean sense of "full", so I would have
               | needed a separate initialization routine (to set all the
               | dead fields to something nonzero). Instead I swapped the
               | sense.
               | 
               | 3. I was trying to detect the case where we'd looped back
               | around to our starting point, indicating that the table
               | was full and the search was unsuccessful, and I realized
               | that the condition I wanted was not b == orig, which
               | would always fail on the first iteration, but (b + 1) %
               | table_size == orig. But doing that on every iteration
               | seemed obscene, so I incremented b (mod table_size) to
               | start out with instead.
               | 
               | 4. At some point I think I realized I'd forgotten to set
               | the .full field on newly inserted items.
               | 
               | 5. Also I realized I'd forgotten to test the table[b].k
               | == k condition in `get` as well, which would have made
               | every search unsuccessful.
               | 
               | So I think it's reasonable to say that even a pretty
               | limited hash-table implementation has plenty of
               | opportunities to write bugs, more than I expected. But
               | maybe I'm just running on low blood sugar this afternoon
               | or something.
        
         | SavantIdiot wrote:
         | This shouldn't be a mystery to someone who's had a
         | datastructures course. It's a fundamental data structure. Yes,
         | optimization always goes down a rabbit-hole, but Knuth's
         | "Algorithms in C" was mandatory in most engineering schools' CS
          | degrees in the early 90's, and it's essentially what this
          | website describes, but with more rigor.
        
           | [deleted]
        
           | kragen wrote:
           | _Algorithms in C_ , which I strongly recommend, is actually
           | by Knuth's student Sedgewick.
        
           | oldmanhorton wrote:
           | Ok, but school was years ago. I've had years of relevant
           | experience that would make me a good fit for some job since,
           | but I've forgotten how to implement a hash table in C within
           | 30 minutes. Is that really the gatekeeper we want?
        
             | kragen wrote:
             | "Is that really the gatekeeper we want?" Yes, I think so? I
             | mean implementing a hash table in C is really easy. What is
             | there to "forget"? If you can't do it, either you don't
             | know what a hash table is, or you don't know how to program
              | in C.
              | 
              |     enum { table_size = 1024 };
              |     typedef struct item { int k, v, full; } item;
              |     static item table[table_size];
              | 
              |     void put(item kv)
              |     {
              |         size_t orig = kv.k % table_size;  // dumbest possible hash function
              |         size_t b = (orig + 1) % table_size;
              |         // use linear probing for collisions
              |         while (table[b].full && b != orig) b = (b + 1) % table_size;
              |         if (b == orig) abort();  // table full
              |         table[b] = kv;
              |         table[b].full = 1;
              |     }
              | 
              |     item *get(int k)
              |     {
              |         size_t orig = k % table_size;
              |         size_t b = (orig + 1) % table_size;
              |         while (table[b].full && table[b].k != k && b != orig) {
              |             b = (b + 1) % table_size;
              |         }
              |         return (table[b].full && table[b].k == k) ? &table[b] : NULL;
              |     }
             | 
             | I haven't tested that (just as I wouldn't in a whiteboard
             | interview) so I don't know if it works. I did briefly skim
             | the article we're talking about, but I concluded I didn't
             | have anything to learn from it, so I didn't read it. I
             | found a bunch of bugs in my implementation as I was writing
             | it. And of course it's a pretty dumb hash table: linear
             | probing is very suboptimal, it uses a statically-allocated
             | non-resizable hash table, it's specialized for a certain
             | type of keys, and so on.
             | 
             | But are you saying you can't even do _that_? Probably it
             | would be bad to hire you for a C programming job, then,
             | unless it was an entry-level position for you to learn C
             | in.
             | 
             | ***
             | 
             | Evidently the above comment took me 15 minutes to write.
             | Oh, and now I see another bug or at least lacuna: if you
             | try to update the value of an existing key, it doesn't
             | fail, but it also doesn't update, instead adding a new
             | entry that will never be read. And then I did write a
             | minimal smoke test and run it, and unsurprisingly, it does
              | work:
              | 
              |     int main()
              |     {
              |         put((item) { 4, 7 });
              |         put((item) { 1028, 9 });
              |         put((item) { 3, 25 });
              |         printf("%d %d %d %p %p\n",
              |                get(4)->v, get(1028)->v, get(3)->v, get(5), get(3));
              |         return 0;
              |     }
             | 
             | Writing the test and the above additional commentary, and
             | running the test, took another 8 minutes.
             | 
             | Oh, and there's another bug where it will loop infinitely
             | when the hash table is full and the key is negative.
             | (Actually it invokes UB but typically the result will just
             | be an infinite loop.)
             | 
             | If it takes you 60 or 100 minutes to write this, or to
             | write something better, maybe you still know C and know
             | what hash tables are. But if you throw up your hands and
             | say "I don't remember, it's been a long time since I was in
             | school"? Either you don't know C, or you don't know what a
             | hash table is, which probably means you've never written
             | anything in C but toy programs. (This is a toy program, of
             | course, but many toy programs in C don't need hash tables.)
             | 
             | The compilable and fixed toy program in question is at
             | http://canonical.org/~kragen/sw/dev3/dumbhash.c if you want
             | to probe it for bugs.
             | 
             | It occurs to me that you might have been talking about
             | something that isn't a C programming job. For example, we
             | might be talking about a job programming an HTML database
             | front-end in Python using Django. And of course it would be
             | silly to reject someone for a job like that because they
             | didn't know C, unless they claimed to know C on their
             | resume. And it's totally reasonable for a Python
             | programmer, or a Python/Perl/JS/bash programmer, to not
             | understand hash tables.
        
             | jstanley wrote:
             | Knowing how to implement a hash table in C doesn't strike
             | me as something you can "forget". If you know how to write
             | C and you know how a hash table works, what is there to
             | forget?
        
             | 908B64B197 wrote:
             | Honestly it's such a simple concept it shouldn't be hard to
             | remember.
        
           | philzook wrote:
           | Sedgewick's?
        
         | seanwilson wrote:
         | I wouldn't ask someone to whiteboard it but I think "what is a
         | hash table?" is a brilliant interview question. The depth
         | someone can go into tells you a lot and can bring up lots of
         | other topics to discuss. It tells you a lot if a senior
         | developer doesn't know what a hash table is or only knows "it's
         | a class from Java" too.
        
           | 908B64B197 wrote:
           | > I wouldn't ask someone to whiteboard it
           | 
           | I did. Works very well as an interview question.
        
         | anaerobicover wrote:
         | Who said anything about interviews? This is just in the context
         | of "something interesting to read with your morning cuppa".
        
         | thrower123 wrote:
         | Hashtables are probably more directly useful than all the
         | binary tree gymnastics.
         | 
         | The K&R C Programming Language book has nearly the same example
         | (although I recall it using linked-lists for hash collisions
         | and not probing).
         | 
         | I would be mildly suspicious of anyone claiming to have a
         | recent CS degree who wasn't familiar with the ideas.
        
         | [deleted]
        
       | vram22 wrote:
       | IIRC, a very simple implementation of a hash table was shown in
       | the classic Kernighan & Ritchie C book, "The C Programming
       | Language" (which was one of the first good programming books that
       | I bought and read, early in my programming career).
       | 
       | It was in either the 1st (pre-ANSI C) or the 2nd (ANSI C)
       | edition. Just a handful of lines of C code, and easy enough for
       | even beginners to understand. Of course, it was not a production-
       | quality hash table implementation, nor was it meant to be. I
       | still remember my excitement at understanding the logic and its
       | usefulness at the time, even though I did not know about the
       | importance of data structures and algorithms then.
       | 
       | Edit: Even though it was simple, IIRC, it handled at least
       | collisions.
        
         | benhoyt wrote:
         | Nice. Yeah, looks like it's in section "6.6 Table Lookup", and
         | it implements a simple, fixed-size hash table (array size 101)
         | of char* to char*, and it handles collisions via linked lists.
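
That scheme, reconstructed from memory rather than quoted verbatim from K&R, is roughly:

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* Sketch in the spirit of K&R section 6.6: a fixed array of 101
 * buckets of name/definition pairs, collisions chained in lists. */
#define HASHSIZE 101

struct nlist {
    struct nlist *next;       /* next entry in this bucket's chain */
    char *name, *defn;
};
static struct nlist *hashtab[HASHSIZE];

static char *dupstr(const char *s) {      /* portable strdup */
    char *p = malloc(strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

static unsigned hash(const char *s) {
    unsigned h = 0;
    for (; *s; s++) h = (unsigned char)*s + 31 * h;
    return h % HASHSIZE;
}

static struct nlist *lookup(const char *s) {
    for (struct nlist *np = hashtab[hash(s)]; np; np = np->next)
        if (strcmp(s, np->name) == 0) return np;
    return NULL;
}

static struct nlist *install(const char *name, const char *defn) {
    struct nlist *np = lookup(name);
    if (np == NULL) {                 /* new name: push onto its chain */
        np = malloc(sizeof *np);
        if (!np) return NULL;
        np->name = dupstr(name);
        unsigned h = hash(name);
        np->next = hashtab[h];
        hashtab[h] = np;
    } else {
        free(np->defn);               /* existing name: replace the defn */
    }
    np->defn = dupstr(defn);
    return np;
}
```

Chaining keeps both operations a dozen lines each, which is why it makes such a good introductory example.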
        
           | vram22 wrote:
           | Cool, thanks for looking it up :)
        
       | daneelsan wrote:
        | This might not be the best place to ask but: I'm wondering what
        | the best data structure is for these needs:
        | 
        |  * there are fields and values, similar to a struct
        |  * I know at initialization time what fields there are (assume
        |    a string)
        |  * at run time I want to be able to get these values by
        |    accessing via the key/field
        |  * I also want to update those values
        |  * what I don't care about is inserting/deleting fields,
        |    because that's not allowed
        | 
        | I guess what I want is a struct-like data structure, but in an
        | interpreted language.
        | 
        | Might've typed too fast...
        
         | whateveracct wrote:
         | if the fields are different types, how do you know the type of
         | key access's return?
        
           | daneelsan wrote:
           | I guess that can be solved by creating an Expression type
        
         | daneelsan wrote:
         | Sorry if I wasn't clear enough. Yes I want something like an
         | object in javascript, like a struct in C but at runtime. But I
         | want to implement this myself. At first I thought a hash table
         | was the right choice. But I don't need the insert delete
         | increase capacity that this data structure typically needs to
         | implement. I want to implement this in Mathematica. The
         | language doesn't have native objects given that the main idea
         | is for things to be immutable. But I still can program data
         | structures, so I'm looking for one...
        
         | mywittyname wrote:
          | A class? After all, a class is a struct with some additional
         | features to handle functions.
         | 
         | You can use a static instance if you just need a single object.
        
         | ben509 wrote:
         | You're looking for something similar to Python's namedtuple.[1]
          | (Though modern Python should use dataclasses from the stdlib
          | or the attrs package.)
         | 
         | Essentially, you have an array/list/vector under the hood, and
         | a mapping of field names to array indices to govern field
         | access.
         | 
         | If you're trying to implement this at a lower level, in the
         | interpreted language itself, take a look at the slots
         | mechanism[2] in Python.
         | 
         | [1]:
         | https://docs.python.org/3/library/collections.html#collectio...
         | 
         | [2]:
         | https://docs.python.org/3/reference/datamodel.html#object.__...
        
           | daneelsan wrote:
           | That seems like the thing I'm looking for! Thanks for the
           | info
        
         | sgtnoodle wrote:
         | It depends on what you want its footprint in memory to look
         | like. Since you don't need to mutate the structure, a densely
         | packed structure seems optimal. For small structures, linearly
         | searching through an array would be fast. For larger
         | structures, a sorted array could be binary searched
         | efficiently.
         | 
         | It also depends on how you can represent the string and their
         | typical size. If the strings are small, then allocating fixed
         | size chunks to store them inline might make sense. If the
         | strings are long, it might make sense to go crazy and use a
         | "trie" data structure to minimize comparisons.
         | 
         | I believe python uses hash maps for classes by default, but you
         | can alternatively use "slots", which are less mutable and
         | presumably more densely packed.
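
A dense fixed-field record of the kind described can be as simple as two parallel arrays (illustrative C; the field names are made up):

```c
#include <string.h>
#include <assert.h>

/* Fixed set of fields known at initialization: names in one array,
 * values in a parallel array; a field access is a linear scan for the
 * name, then an index into the values. */
static const char *field_names[] = { "x", "y", "z" };
enum { NFIELDS = sizeof field_names / sizeof field_names[0] };
static double field_values[NFIELDS];

/* return a pointer to the value slot for `name`, or NULL if unknown */
static double *field_ref(const char *name) {
    for (size_t i = 0; i < NFIELDS; i++)
        if (strcmp(field_names[i], name) == 0) return &field_values[i];
    return NULL;
}
```

For a handful of fields the linear scan beats hashing; keeping the names sorted and binary-searching them covers the larger case mentioned above.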
        
           | daneelsan wrote:
            | Yes. At some point I encountered this trie with payloads
            | (which would act as the field values).
        
         | tyingq wrote:
         | _" I guess what I want is a struct like data structure, but in
         | an interpreted language"_
         | 
         | Python ffi is an option:
         | https://cffi.readthedocs.io/en/latest/using.html#working-wit...
         | 
         | Though I don't know what advantage that route would have over
         | most interpreted language's already existing associative arrays
         | / hashmaps / dicts.
         | 
         | There may be some module already built that serves your needs.
         | DiscoDB is a good example..a very fast hashmap that uses
         | mmap(). https://discodb.readthedocs.io/en/latest/
        
         | ectopod wrote:
         | Use an object?
        
         | brundolf wrote:
         | Sounds like you want a struct except you want to be able to
         | reflect on the property keys (which you can't do normally in
         | C/C++/Rust)?
         | 
         | I got thrown off by "but in an interpreted language" so I'm not
         | sure whether you're requesting this for C or for another
         | language. In JS, at least the V8 engine does some of this for
         | you: if you have a group of objects that always have the same
         | set of properties, V8 will detect this in many cases and
         | optimize them under the hood as a struct or class (it must also
         | store the string keys somewhere since those can be iterated in
         | JS).
         | 
         | In C/C++ you'd have to store the keys as strings somehow, and
         | also add some way to index the struct via those keys (since you
         | can't even do that with a normal struct). You may be stuck with
         | a hashmap one way or another.
        
           | nicoburns wrote:
           | Rust allows you to do this with a macro. The serde macros are
           | pretty close to what you need, but a slight variant could
           | give you a much more ergonomic API.
        
             | brundolf wrote:
             | Right, I thought about suggesting a macro for C/C++ but I
             | don't know enough about their macro systems to know for
             | sure that it would work
        
               | nicoburns wrote:
               | I believe C++ has "Runtime Type Information" (RTTI) that
               | allows you to do this. Although I also hear that it's not
               | great.
        
               | MaxBarraclough wrote:
               | To expand on _not great_ : RTTI is one of the few
               | features of C++ where you pay for it even if you don't
               | use it, so it's not all that popular. It can be disabled
               | with compiler flags to shrink the binary. LLVM's coding
               | standard prohibits use of RTTI, they instead use their
               | own 'hand-rolled' solution. [0] RTTI is also banned by
               | the Google C++ coding standard. [1]
               | 
               | [0] https://llvm.org/docs/CodingStandards.html#do-not-
               | use-rtti-o...
               | 
               | [1]
               | https://google.github.io/styleguide/cppguide.html#Run-
               | Time_T...
        
       | frankenst1 wrote:
       | JavaScript also lacks a proper hash table. I wonder how much
       | faster a native implementation could possibly be over say `Map()`
        | (which keeps insertion order by maintaining a linked list and
        | does not have constant-time operations).
        
         | perlgeek wrote:
         | What's wrong with using objects as hashes?
        
           | tyingq wrote:
           | There's a table at the bottom of
           | https://developer.mozilla.org/en-
           | US/docs/Web/JavaScript/Refe... that shows some reasons why.
        
           | frankenst1 wrote:
           | They can be faster than `Map` but:
           | 
           | 1) Slower than hash tables
           | 
           | 2) `{}` is not empty but has default keys
           | 
           | 3) `{}` has multiple read-only keys
           | 
            | 2 and 3 are caused by the prototype chain (properties not
            | present on the initial object will be attempted to be
            | resolved via the prototype hierarchy), which can be
            | avoided via `Object.create(null)`.
        
       | winrid wrote:
        | Wouldn't it be faster to, instead of using a backing array, use
        | a pool of memory that you just point into with pointers
        | addressed by the hash?
       | 
       | Nasty stuff that you can do in a language like C :)
        
         | ufo wrote:
         | As far as the CPU is concerned, there isn't a big difference if
         | you access a block of memory via pointers or via an array.
         | Converting the hash result to a memory address is not the
         | bottleneck. The main bottlenecks are the hash function itself
          | and the logic for dealing with collisions.
        
       | selimnairb wrote:
       | And depending on your use case, you may not need to implement
       | growing the hash table, or deletions, making it even simpler/less
       | daunting to implement in C (since this avoids a lot of the memory
       | management parts of the job).
        
         | benhoyt wrote:
         | Yeah, good point. And then you can allocate the hash table
         | struct and all your buckets in one call to malloc().
        
       | tirrex wrote:
       | You skipped hs_remove but depending on implementation, hs_remove
       | strategy may change your hs_set, e.g using tombstones. I've
       | recently learnt that you don't need tombstones at all in linear
       | hashmaps, you just do backshift removals, it may be a simple
       | addition to your map. Check out these repos for the example
       | implementation:
       | 
       | (C) https://github.com/tezc/sc/tree/master/map
       | 
       | (C++) https://github.com/rigtorp/HashMap
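To make the idea concrete, here is a minimal sketch of backward-shift removal for a linear-probing table in C. The `slot`, `hash_str`, `ht_put`, `ht_get`, and `ht_remove` names are hypothetical helpers invented for this sketch (not from the article or the linked repos); the shifting logic follows the standard linear-probing deletion pseudocode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CAP 16  /* table capacity, power of two */

/* Hypothetical slot type for this sketch. */
typedef struct {
    const char *key;   /* NULL means the slot is empty */
    int value;
} slot;

static size_t hash_str(const char *s) {
    size_t h = 2166136261u;            /* 32-bit FNV-1a */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

static void ht_put(slot *tab, const char *key, int value) {
    size_t i = hash_str(key) & (CAP - 1);
    while (tab[i].key != NULL && strcmp(tab[i].key, key) != 0)
        i = (i + 1) & (CAP - 1);       /* linear probing */
    tab[i].key = key;
    tab[i].value = value;
}

static int *ht_get(slot *tab, const char *key) {
    size_t i = hash_str(key) & (CAP - 1);
    while (tab[i].key != NULL) {
        if (strcmp(tab[i].key, key) == 0)
            return &tab[i].value;
        i = (i + 1) & (CAP - 1);
    }
    return NULL;
}

/* Backward-shift deletion: open a hole, then walk the probe chain
 * and move back any entry that could have occupied the hole, so no
 * tombstones are ever needed. */
static void ht_remove(slot *tab, const char *key) {
    size_t i = hash_str(key) & (CAP - 1);
    while (tab[i].key != NULL && strcmp(tab[i].key, key) != 0)
        i = (i + 1) & (CAP - 1);
    if (tab[i].key == NULL)
        return;                        /* key not present */
    tab[i].key = NULL;                 /* hole at index i */
    size_t j = i;
    for (;;) {
        j = (j + 1) & (CAP - 1);
        if (tab[j].key == NULL)
            break;                     /* end of the probe chain */
        size_t home = hash_str(tab[j].key) & (CAP - 1);
        /* tab[j] must stay put only if its home position lies
         * cyclically within (i, j]; otherwise shift it back. */
        int stays = (i <= j) ? (i < home && home <= j)
                             : (i < home || home <= j);
        if (!stays) {
            tab[i] = tab[j];
            tab[j].key = NULL;
            i = j;
        }
    }
}
```

The "stays" test is the only subtle part: an entry further along the probe chain may only move into the hole if doing so does not place it before its home bucket.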
        
         | speps wrote:
         | A more detailed explanation of "backward shift deletion":
         | https://codecapsule.com/2013/11/17/robin-hood-hashing-backwa...
        
         | chrchang523 wrote:
         | Well, he did say "simple hash table". In my experience, many
         | applications have no need for single-element deletion.
        
         | attractivechaos wrote:
         | I wrote a blog post [1] on this a couple of years ago. This is
         | a quite old technique [2] but was not widely used until
         | recently. I was surprised that I hadn't found this earlier.
         | 
         | [1] https://attractivechaos.wordpress.com/2019/12/28/deletion-
         | fr...
         | 
         | [2] https://en.wikipedia.org/wiki/Linear_probing#Deletion
        
           | tirrex wrote:
            | I can't find it right now, but I saw an older version of
            | the wiki page that had a C code example of how to do
            | deletion with this. This is how I discovered it. I knew
            | your blog but apparently I hadn't visited it recently.
           | 
           | > I think the new deletion algorithm without tombstones is
           | overall better than traditional algorithms. It should become
           | the preferred way to implement hash tables
           | 
           | I agree with you 100%. Tombstone has no advantage over this.
        
             | annilt wrote:
             | https://en.wikipedia.org/wiki/Open_addressing
             | 
             | Maybe it was open addressing page? It has pseudo code for
             | that.
        
           | benhoyt wrote:
           | Thanks for this (and parent reply for mentioning it), that's
           | simple and clever!
           | 
           | > The basic idea is not complex: when we delete an element,
           | we move the next element to the deleted location if it is
           | pushed away by the deleted element due to hash collision; we
           | repeat this process until we come to an empty bucket.
           | 
           | I wonder, is it still _amortized_ constant time? My gut
           | (which is not very good at computer science) tells me that it
            | still may be, as with an average probe length of ~1.4 (which
            | I was seeing), you won't have to walk more than 1 or 2
            | elements on average.
        
             | hnfong wrote:
             | It shouldn't take more operations (asymptotically) to
             | delete than to probe/insert so as long as the original list
             | is still amortized constant time then the delete operation
             | still is.
        
           | [deleted]
        
       | e12e wrote:
        | I guess the idea is to contrast DIY bsearch vs DIY hash table -
        | but it's a bit odd to mention bsearch is in the standard lib,
        | then not use it?
        | 
        | > Here's how you'd do it in C (with and without bsearch).
        | 
        | Judging by a random hit on the net, it would've been quite
        | straightforward to use the stdlib version:
       | 
       | https://www.tutorialspoint.com/c_standard_library/c_function...
        
         | st_goliath wrote:
         | POSIX actually specifies a hash table interface in the standard
         | library as well: https://man7.org/linux/man-
         | pages/man3/hsearch.3.html
         | 
         | In typical Unix fashion, however, it is designed to use
         | internal, global state, so you can only have _one_ table at a
         | time.
         | 
         | There is a GNU extension that provides the same functions with
         | an `_r` suffix that allow you to work with more than one table.
        
           | Tyrubias wrote:
           | I've never really understood why standard Unix functions rely
           | on internal global state. For example, strtok():
           | 
           | https://en.cppreference.com/w/c/string/byte/strtok
           | 
           | Was there some reason it was optimal to write these functions
           | this way back in the early days of Unix?
        
             | jandrese wrote:
             | It probably saved a few bytes of memory or disk. A PDP11
             | had 64k of memory and had to be shared by multiple
             | simultaneous users. A lot of the Unix tradeoffs can be
             | traced back to limitations of 1970s hardware.
             | 
             | Plus you weren't going to need more than one anyway, your
             | program wasn't going to be big enough.
        
             | Zambyte wrote:
             | Because they were written with the expectation that they
             | would be used in composable programs that each do one thing
             | well.
        
             | kragen wrote:
             | Well, if you wanted multiple dbm files or hash tables or
             | whatever, it was easy enough to run multiple processes.
             | Unix processes on the PDP-11 could only address 64 KiB of
             | memory--PDP-11 addresses are 16-bit, and C strongly
             | encourages a von Neumann model where your program and data
             | are in the same address space, so you couldn't even use 64
             | KiB for your program and another 64 KiB for your data,
             | though the PDP-11 hardware was capable of doing this. (On
             | things like the AVR, this is handled by copying your data
             | into RAM at startup unless you mark it PROGMEM, in which
             | case you can't pass pointers to it to ordinary functions.
             | PDP-11 Unix didn't have anything like this.)
             | 
             | As long as you don't have multiple threads (and you
             | shouldn't, and on Unix you couldn't) the static-variable
             | interface in these cases was usually more convenient,
             | though more bug-prone.
             | 
             | There's also a micro-efficiency question.
             | 
             | Global variables are at a known memory address. If you put
             | them in a struct and pass around a pointer to it, they are
             | at a dynamically computed memory address. Every time you
             | want to access them, you have to add the field offset to
             | the struct's base pointer. This is a small amount of extra
             | computation, and how much it actually costs you depends on
             | the hardware--on the 8080 it's a mess that probably
             | requires you to spill a couple of registers onto the stack,
             | and on modern implementations of amd64 like the ones from
             | Intel, it probably takes literally zero extra time, because
             | you probably aren't keeping all of your addition-capable
             | functional units busy anyway.
             | 
             | I don't know the PDP-11's instruction set well enough to
             | say, but I suspect the cost of such indexed addressing was
             | intermediate between these extremes: much like the 8086, it
             | had an "index" addressing mode, mode 6, which added an
             | immediate 16-bit offset to a base register
             | https://pages.cpsc.ucalgary.ca/~dsb/PDP11/AddrModes.html,
             | and it doesn't seem to have had a pure immediate addressing
             | mode where the address of the variable is just an immediate
             | word without any indexing. But if you didn't need a base
             | register for your variables, you could use PC-relative
             | addressing, thus saving a register and decreasing register
             | pressure (the PDP-11 only had 8 general-purpose registers,
             | and one of them was PC, so it really only had 7).
             | 
             | I don't have any idea what actual PDP-11 compilers did,
             | though, and that matters a lot here.
        
             | tyingq wrote:
             | A fair amount of the timeline for Unix was when it
             | supported multiple processes, but not multiple threads.
             | SunOS/Solaris didn't have threads until 1992, for example,
             | and it still didn't support SMP, so global state wasn't an
             | issue initially. Strtok() predates 1992. They did add
             | strtok_r() once Posix threads were a thing.
        
           | ludocode wrote:
           | hcreate_r() can be used on some of the BSDs as well, but of
           | course they're not 100% compatible with GNU. I've implemented
           | a platform-independent POSIX hsearch as a wrapper to a
           | templated hashtable in Pottery:
           | 
           | https://github.com/ludocode/pottery/tree/master/examples/pot.
           | ..
           | 
           | Be warned, hsearch is terrible. You should probably only use
           | this if you have some existing code that depends on hsearch
           | and you want to make it portable.
        
         | smcameron wrote:
         | The article actually does both (uses bsearch, and also the home
         | made one) presumably just to demonstrate both.
        
           | smcameron wrote:
           | (I noticed this because I was wondering why the cmp()
           | function was taking void pointers only to cast them to
           | "item*", and it occurred to me it would make sense if it were
           | passed as a function pointer to a generic library function
           | with no knowledge of "item*"... as it is when bsearch is
           | used.)
        
           | e12e wrote:
           | Heh, I missed that - just skimming the code on mobile. Maybe
           | there should be a slight edit in the text and/or the custom
           | binary search could be dropped (as it's not the main focus of
           | the article).
        
       | SavantIdiot wrote:
       | How are keys removed so that the table doesn't overflow? Are you
       | really going to recommend growing the table by duplicating it?
       | 
       | > Beware: this is a teaching tool and not a library, so it's not
       | very well tested.
       | 
       | Dear OP: This really bugs me. If you're not going to put the
       | effort in, and instead crank out some content that might be
       | buggy, then please don't pollute the web with more half-assed
       | "look what I can do" pages.
        
         | kljsandjb wrote:
         | 100% agree with U.
        
         | Tyrubias wrote:
         | Yes, I believe you are intended to duplicate the table once a
         | certain load factor (fraction of elements filled) of your
         | choosing is reached.
         | 
         | For linear probing, this is recommended to keep your probe
         | lengths logarithmic:
         | 
         | https://www.sciencedirect.com/science/article/abs/pii/019667...
         | 
         | Removing elements for a hash table using linear probing is
         | fairly non-trivial. Without using so-called "tombstones" to
         | flag an element as deleted, you essentially have to shift
         | elements up in the table.
         | 
         | Also, your rudeness is really quite unwarranted. If everyone
         | took that attitude, where would the teaching examples ever come
         | from?
        
         | humantorso wrote:
         | This is the worst attitude I have this on HN. Look out people -
         | its the internet's gatekeeper - making sure only
         | legitimate/fully thought out things are published to the
         | internet. By golly, we wouldn't want to waste bits and bytes on
         | just random things being available on the internet.
        
         | st_goliath wrote:
         | > Are you really going to recommend growing the table by
         | duplicating it?
         | 
         | Um... yes.
         | 
         | You want the data linearly in memory in the first place so you
         | can have constant time access by index. If you want to grow a
         | dynamic array, doing a realloc can easily cause the same effect
         | as the article does manually (alloc new, copy data, free old),
         | because the memory immediately after may be occupied by
         | something else if you did _any_ other malloc /calloc. Copying
         | over the data is linear in time, thus degrading your constant
         | time insert to linear time as soon as the table is full if you
         | do a realloc for single items instead.
         | 
         | You may find the choice of growing _when half full_ as in the
         | article questionable, because then there will always be an
          | unused half, but pretty much every remotely practical dynamic
         | array implementation in C (and probably a ton of other
         | languages) does that (grow by capacity * 2 _when full_ ) to
         | maintain _amortized_ constant time when appending.
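As a concrete sketch of what that growth step looks like for an open-addressing table: double the capacity, then re-insert every live entry. The `entry`/`table` types and helper names below are hypothetical, invented for this sketch rather than taken from the article.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for this sketch. */
typedef struct {
    const char *key;    /* NULL means empty */
    int value;
} entry;

typedef struct {
    entry *entries;
    size_t capacity;    /* always a power of two */
    size_t length;
} table;

static size_t hash_key(const char *s) {
    size_t h = 2166136261u;             /* 32-bit FNV-1a */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Insert into a given entries array (no growth check here). */
static void place(entry *entries, size_t cap, const char *key, int value) {
    size_t i = hash_key(key) & (cap - 1);
    while (entries[i].key != NULL && strcmp(entries[i].key, key) != 0)
        i = (i + 1) & (cap - 1);        /* linear probing */
    entries[i].key = key;
    entries[i].value = value;
}

static entry *table_find(table *t, const char *key) {
    size_t i = hash_key(key) & (t->capacity - 1);
    while (t->entries[i].key != NULL) {
        if (strcmp(t->entries[i].key, key) == 0)
            return &t->entries[i];
        i = (i + 1) & (t->capacity - 1);
    }
    return NULL;
}

/* Allocate a table of twice the capacity and re-insert every live
 * entry. The copy is O(n), but since capacity doubles each time,
 * each element is copied O(1) times on average -- that's what keeps
 * inserts amortized constant time. Returns 0 on allocation failure. */
static int table_expand(table *t) {
    size_t new_cap = t->capacity * 2;
    entry *new_entries = calloc(new_cap, sizeof(entry));
    if (new_entries == NULL)
        return 0;
    for (size_t i = 0; i < t->capacity; i++)
        if (t->entries[i].key != NULL)
            place(new_entries, new_cap,
                  t->entries[i].key, t->entries[i].value);
    free(t->entries);
    t->entries = new_entries;
    t->capacity = new_cap;
    return 1;
}
```

Note that every entry must be re-placed rather than memcpy'd, because each key's bucket index depends on the (new) capacity.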
        
           | tyingq wrote:
           | It's also how many popular, in production, hashtable
           | implementations work.
           | 
           | From the uthash user guide, for example:
           | 
           |  _" Normal expansion - This hash attempts to keep fewer than
           | 10 items in each bucket. When an item is added that would
           | cause a bucket to exceed this number, the number of buckets
           | in the hash is doubled and the items are redistributed into
           | the new buckets. In an ideal world, each bucket will then
           | contain half as many items as it did before."_
           | 
           | So, the person questioning "why" should be instead explaining
           | "why not".
        
         | solids wrote:
         | The problem with your way of thinking, is who draws the line to
         | define what's "half-assed"? Because everything can be improved.
         | You should take ideas from this kind of articles, not fully
         | working code.
        
         | the-dude wrote:
         | This comment bugs me.
        
         | tyingq wrote:
         | I suppose it's a personal preference thing, but I like over
         | simplified implementations when the purpose is educational.
         | 
         | I am more bugged by the accusation that the author did this
         | because _" look what I can do"_. I don't see any evidence of
         | that.
        
         | ffdeadbeef wrote:
         | do we really need more hostile, sneering, condescending, toxic
         | pricks like you policing the web?
        
         | kstenerud wrote:
         | A teaching tool is deliberately simplified so as not to
         | overwhelm students with details they don't need to worry about
         | yet. This is absolutely the right way to go about it.
        
         | le-mark wrote:
         | >Are you really going to recommend growing the table by
         | duplicating it?
         | 
         | Setting a reasonable initial size then doubling it and copying
         | when (if!) it needs to grow is a pragmatic, reasonable approach
         | that's quite common!
        
           | knome wrote:
           | https://github.com/python/cpython/blob/master/Objects/dictob.
           | ..
        
       | ape4 wrote:
       | This is a fun exercise, but if anyone is thinking about doing
       | this for a project - don't - use a library.
        
         | philwelch wrote:
         | "But Doctor, I am Pagliacci!"
        
         | kevin_thibedeau wrote:
         | Existing libraries make decisions that aren't necessarily
         | suitable for all applications. Most hash table code makes
         | excessive use of malloc/realloc or has power of two growth
         | which is problematic in memory constrained systems. If you want
         | to minimize waste by using high load factors you're going to
         | want Robin Hood hash which is still relatively uncommon in lib
         | code.
        
           | attractivechaos wrote:
           | Although Robin Hood hashing has good performance under high
           | load, it needs to record the probe length in each bucket.
            | This partly cancels its advantage, especially for small keys.
           | In my evaluation, Robin Hood hashing doesn't use less memory
           | in comparison to others.
           | 
           | > _Most hash table code makes excessive use of malloc
           | /realloc or has power of two growth_
           | 
           | Libraries based on open addressing only call malloc when
           | there is not enough room. That is not "excessive". Also, if
            | you know the maximum size beforehand, many libraries allow
           | you to preallocate buckets such that you never trigger
           | rehashing in this case.
        
       | ModernMech wrote:
       | Small nit: in table 2 with the columns labeled "Key, Hash, Hash
       | modulo 16", the thing called "hash" is actually known as a
       | "prehash", while the "hash modulo 16" is the hash. In the context
       | of hash tables, pre-hashing is when you take something like a
       | string or a data structure and turn it into a number. Hashing is
       | the mapping from that number to one of the hash table bins. The
       | distinction is often lost in practice, but if we're talking about
       | hash table design and implementation it's important to make this
       | clear at least once.
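In code terms, the two steps look like the sketch below (hypothetical names, chosen for this comment rather than taken from the article):

```c
#include <assert.h>
#include <stddef.h>

/* The "prehash" turns an arbitrary key into a number; it does not
 * depend on the table at all (32-bit FNV-1a here). */
static size_t prehash(const char *s) {
    size_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* The "hash" in the textbook sense maps that number to a bucket
 * index, so it changes whenever the table is resized. */
static size_t bucket_index(size_t pre, size_t nbuckets) {
    return pre % nbuckets;
}
```

The separation matters because the prehash can be cached per key, while the bucket index must be recomputed on every resize.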
        
         | joosters wrote:
         | Is that always true? It depends whether or not you think of the
         | hash as a property of the hashtable, or a property of the key
         | itself. e.g. you may have string objects that store their hash
         | values internally - a useful optimisation if you are dealing
         | with lots of long read-only strings, given that the hash
         | function will have a cost of at least O(n) for the string
         | length. Inserting a string into a hashtable wouldn't change its
         | hash (what if the string is inserted into two tables of
         | different sizes?)
        
           | ModernMech wrote:
           | Right, it does depend on this perspective, which is why I say
           | the distinction is often lost in practice. But from the
           | perspective of hash table designers (at least historically),
           | the hash can in fact change and is dependent on the size of
           | the table. The distinction is important because the two
           | operations are separable. I guess people didn't want to call
           | one thing a "hash" and the other "hash mod k" because there
           | are ways to do this other than mod. They could have called
           | one "hash" and the other "post hash" but it is what it is.
           | 
           | All I'm trying to say is if you want to know more about hash
           | tables, and you bother to venture into the literature or some
           | textbooks, you're going to run into this distinction sooner
           | rather than later, so I wanted to point that out here.
        
       | varajelle wrote:
       | > Even while writing this I realized I'd used malloc instead of
       | calloc to allocate the entries array, which meant the keys may
       | not have been initialized to NULL.
       | 
       | I think this is still UB to allocate a struct containing pointer
       | with calloc without explicitly set these pointers to 0 (or NULL).
       | The C standard doesn't specify the binary representation of the
       | null pointer value. In practice, that will work in virtually all
       | current platforms, but it's still UB.
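A strictly portable version sidesteps the question by assigning NULL explicitly instead of relying on all-bits-zero. A sketch, with a hypothetical `entry` type:

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    const char *key;
    void *value;
} entry;

/* Allocate n entries with every pointer explicitly set to NULL.
 * Unlike relying on calloc's all-bits-zero, this is well-defined
 * even on a (hypothetical) platform where null pointers are not
 * represented as all-zero bits. */
static entry *alloc_entries(size_t n) {
    entry *e = malloc(n * sizeof *e);
    if (e == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        e[i].key = NULL;
        e[i].value = NULL;
    }
    return e;
}
```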
        
         | ajross wrote:
         | I think that's true only in the narrowest sense of the C
         | language standard. At the ABI and runtime linkage level,
         | platforms (yes, every single one of them) document data in the
         | .bss section to be zero, and C compilers place pointers there
         | that are defined to be initialized to NULL per the standard.
         | 
         | So, no, in practice it's well-defined, fully specified behavior
         | on any given platform.
        
         | ncmncm wrote:
         | Any particular platform can declare it to be defined, by fiat.
         | "With our compiler, all zeroed raw storage treated as pointers
         | are null."
         | 
         | Anywhere that they technically could, you can treat it as if
         | they actually had so declared, and just didn't publicize it
         | enough.
        
         | dpedu wrote:
         | NULL is guaranteed to equal integer 0 by standard, and
         | converting 0 to a pointer is guaranteed to be a NULL pointer.
         | 
         | > An integer constant expression with the value 0, or such an
         | expression cast to type void *, is called a null pointer
         | constant. If a null pointer constant is converted to a pointer
         | type, the resulting pointer, called a null pointer, is
         | guaranteed to compare unequal to a pointer to any object or
         | function.
        
           | layer8 wrote:
           | An integer constant expression is a syntactical construct,
           | not a runtime entity. The C standard provides no guarantee
           | that bytes (chars) with value zero can be read from memory as
           | a null pointer.
        
             | adamrezich wrote:
             | wait what? my C is a bit rusty but when would reading 0
             | from memory & casting to a (void?) pointer type ever !=
             | NULL?
        
               | layer8 wrote:
               | C implementations are allowed to use a not-all-bits-zero
               | representation for null pointer values. They are merely
               | required to convert integer constant expressions with
               | value zero to a null pointer value -- regardless of how
               | the resulting null pointer value is represented in
               | memory.
               | 
               | See also these FAQs:
               | 
               | http://c-faq.com/null/machnon0.html
               | 
               | http://c-faq.com/null/machexamp.html
        
         | cygx wrote:
         | Implementation-defined as far as I'm aware. If pointers with
         | all bits zero actually represent null pointers, things will be
         | fine.
         | 
         |  _edit:_
         | 
         | I haven't found any smoking gun quotation in the C11 standard
         | on short notice, but there is footnote 296 (p.348) on calloc()
         | zeroing all bits:
         | 
         |  _> Note that this need not be the same as the representation
         | of floating-point zero or a null pointer constant._
         | 
         | Nothing about this being UB in cases where it _does_ represent
         | a null pointer constant.
        
           | benhoyt wrote:
           | Wow, thanks, I did not know that. Though that would be a
           | kinda crazy platform or implementation! I'm betting that a
           | lot of people depend on this.
        
       | yudlejoza wrote:
       | 99% of the effort is 'what hash function?'
        
       | varjag wrote:
       | FWIW there's a standard hash table API in POSIX.
        
         | jandrese wrote:
         | Yeah, but it's pretty lame since you have to specify the size
         | of the table up front and can't change it later. They didn't
         | implement the hard stuff!
        
           | queuebert wrote:
           | This might sound stupid, but could you store N-1 entries, and
           | then store a pointer to the next hash as the Nth entry? So
           | you dynamically add fixed tables as you run out of space?
           | 
           | Yes, I realize the lookup would get complicated.
        
             | banana_giraffe wrote:
             | Nope. The POSIX API lets you have one table.
             | 
             | There is a GNU extension that lets you have more than one
             | table.
        
           | ludocode wrote:
           | There are many reasons why it's lame! Here's a short list:
           | 
           | - There is only one global hash table for the whole process!
           | If you want individual hash tables you need to use
           | implementation-specific extensions (hcreate_r().)
           | 
           | - There is no way to remove a key from a hash table. No
           | implementation I know of provides an extension to do it. If
           | you want to truly remove a key you must destroy and rebuild
           | the table.
           | 
           | - There is no requirement for a means to grow the table. On
           | some platforms it's possible to run out of space. If you want
           | to truly grow you must destroy and rebuild the table.
           | 
           | - There is no standard way to free keys or data stored in the
           | table. Destroying the table leaks all contents; you must keep
           | track of keys and data externally and free them yourself.
           | Only OpenBSD frees keys by default, and only NetBSD has
           | extensions to call callbacks to free keys and data.
           | 
           | - Keys must be strings. You cannot provide custom types or
           | void pointers as keys. There is no way to specify custom hash
           | or comparison functions. The quality of the hash function is
           | unspecified.
           | 
           | - When using ENTER, you may not want to allocate the key
           | unless it is necessary to insert a new entry. Since the given
           | key is inserted directly, it's not necessarily safe to use a
           | stack-allocated buffer for lookups. It's awkward to replace
           | the key with an allocation after insertion and it's unclear
           | whether it's allowed.
           | 
           | This doesn't even get into incompatibilities between the
           | various implementations. You will encounter all of the above
           | flaws even if your only target is glibc.
           | 
           | No one should ever be encouraged to use POSIX hash tables.
           | They should be deprecated and we should all just pretend they
           | don't exist.
        
             | jandrese wrote:
             | IMHO this is one area where the C committee could improve
             | the situation by mandating a sane and useful hashtable
             | implementation for the stdlib.
             | 
             | Or at the very least just deprecate the old one. I can't
             | remember ever seeing it used in code.
        
       | TonyTrapp wrote:
       | If you'd have to implement the hash table every time you want to
       | use one, you'd probably realize that there is a better
       | alternative data structure. Quite often you data is not as
       | randomly distributed as you may initially think it is.
        
         | kragen wrote:
         | One of the cool nuances of hash tables is that you can exploit
         | the nonrandomness of your data to get a more uniform
         | distribution of keys over hash buckets than you would get with
         | an ideal random-oracle hash function. For example, it's very
         | common to use hash functions which will produce four
         | consecutive output values (possibly shuffled) for string keys
         | "r0", "r1", "r2", and "r3", which guarantees that those keys
         | will not collide with each other. Hashpjw is one such widely-
         | used hash function; FNV is another. (Better not use linear
         | probing with such a hash function, though! And it's probably
         | better not to use them in cases where the keys might be chosen
         | adversarially to produce a DoS.)
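The claim is easy to demonstrate with 32-bit FNV-1 (the multiply-then-XOR variant; FNV-1a applies the operations in the other order and doesn't behave this way for the final byte). Because the last character is XORed in last, keys differing only in their final character differ only in a few low bits of the hash and land in distinct buckets of a small power-of-two table:

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit FNV-1: multiply first, XOR the byte in last. Keys that
 * differ only in their final character therefore differ only in a
 * few low bits of the hash. */
static uint32_t fnv1(const char *s) {
    uint32_t h = 2166136261u;           /* FNV offset basis */
    while (*s) {
        h *= 16777619u;                 /* FNV prime */
        h ^= (unsigned char)*s++;
    }
    return h;
}
```

For example, "r0" through "r3" end in '0'..'3', whose codes differ only in the low two bits, so their FNV-1 hashes occupy four distinct buckets even in a 4-bucket table.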
        
         | jolux wrote:
         | What?
        
           | selljamhere wrote:
           | I think the parent comment is highlighting that a hash table
           | is just one implementation of a "Map" abstract data type.
            | Depending on the use case, other implementations (such as
            | the tree-based std::map in the C++ STL) may have better
            | performance.
        
             | wffurr wrote:
             | Or for the example of word frequencies in the article, a
             | prefix trie.
        
             | jolux wrote:
             | True, but as far as I know it's rarely the worst choice you
             | can make, and it's plenty fast for a lot of applications. I
             | don't know that there's a single data structure that's
             | always better, which is what the GP seems to be implying.
        
         | iforgotpassword wrote:
         | > Quite often you data is not as randomly distributed as you
         | may initially think it is.
         | 
         | That's why it's important to pick the right hash function. A
         | hash table usually goes a long way before it becomes a
         | performance issue.
        
       ___________________________________________________________________
       (page generated 2021-03-26 23:01 UTC)