[HN Gopher] How to implement a hash table in C ___________________________________________________________________ How to implement a hash table in C Author : benhoyt Score : 239 points Date : 2021-03-26 09:39 UTC (13 hours ago) (HTM) web link (benhoyt.com) (TXT) w3m dump (benhoyt.com) | sobriquet9 wrote: | Will this expression correctly handle overflow? | | size_t mid = (low + high) / 2; | mywittyname wrote: | Interesting question. | | Since this is integer division, I think you can do: | size_t low_s = low >> 1; size_t high_s = high >> 1; | size_t mid = low_s + high_s; | | ...With no overflow worries. (expanded for clarity) | | That would effectively be distributing the division. So it's | equivalent to low / 2 + high / 2. | saagarjha wrote: | This gives a different result when both low and high are odd. | chrchang523 wrote: | Not an issue since the following is at the top of the function: | if (size + size < size) { return NULL; // size too big; | avoid overflow } | kt66 wrote: | what about (low >> 1) + (high >> 1) + (1 & low & high) | thamer wrote: | > When the hash table gets too full, we need to allocate a larger | array and move the items over. This is absolutely required when | the number of items in the hash table has reached the size of the | array, but usually you want to do it when the table is half or | three-quarters full. | | Yes, but _how_ you resize is important too: if you have a | threshold size (like 3/4 full) at which you block and | redistribute all elements into a new array, you will incur a | significant pause when this happens, e.g. | https://log.kv.io/post/2009/05/15/leaky-hashtables | | Instead, when you reach the threshold amount you can create the | new array and then gradually migrate the keys in small batches | either with each operation or with a background thread. So on | `get` if we're in the process of resizing: first check the new | table, then the old one, and before returning migrate N keys from | the old table to the new one.
Free the old array once all the | keys are migrated. | | I wrote a small hash table implementation with gradual re-hashing | a while back, search for DICT_MAX_LOAD and dict_rehash here: | https://github.com/nicolasff/ht/blob/master/dict.c#L193 | moistbar wrote: | I'm not a particularly good programmer, so I may be missing | something here, but I'd think that moving the items should be | its own function that happens in between gets and sets, no? | Besides adding a decidedly non-read-only operation to what | should be read-only, shouldn't the movement happen when you're | _not_ trying to do anything with the hash table? | jeffffff wrote: | typically rehashing is done during insert. if you were going | to do it while not doing anything else with the hash table | you'd have to use a background thread, and then you'd have to | do some form of synchronization, so there would be | significant additional complexity and overhead. | OskarS wrote: | It's usually not the right call to use a hash table like you're | describing. You're not saving any work (all keys have to be | migrated sooner or later) and by doing it like this you incur a | non-trivial amount of overhead on most operations. | | Hash tables generally work on the assumption that operations | are O(1) in the amortized sense, that individual inserts might | be expensive because of rehashing, but in aggregate they are | very cheap. In practice, this is almost always a fine | assumption, and it is how the vast majority of hash tables in | all programming languages work. | | The kinds of hash tables you are talking about have their uses | in places where you have soft real-time constraints, but that | is honestly a pretty niche use case. | ModernMech wrote: | > Hash tables generally work on the assumption that | operations are O(1) in the amortized sense, that individual | inserts might be expensive because of rehashing, but in | aggregate they are very cheap. 
| | Hash tables aren't O(1) in an amortized sense, they are O(1) | in a statistical sense. The O(1) complexity of hash tables is | dependent entirely on the size of the underlying array, the | method of collision resolution, and the quality of the | (pre)hash function in relation to the distribution of data | you're storing in the table. Whether you insert 1 item or 1 | million items into the table, you're still getting O(1) | computational complexity if the table is designed well. If | not, you could be getting O(n) complexity in the worst case. | | While the fixed cost of hashing is irrelevant in the O(1) | nature of hash table operations, it does mean that a hash | table in practice could be slower (in terms of wall time) | than a linked list for insert/find/remove operations. Despite | a linked list being O(n) for those operations, a linked list | doesn't have to pay the fixed cost of hashing. So for low | values of n, the linked list might be a better choice. I | think this is what you mean by "O(1) in the amortized sense" | but it's not what we mean when we say the computational | complexity of these operations on a hash table is O(1). | | What the OP is talking about is an example of amortization of | resize(), essentially adding a fixed cost to insert() and | remove() so you don't have to pay a huge sum when you call | resize(). But that doesn't mean insert() and remove() are no | longer O(1). | justinpombrio wrote: | Amortized cost is a technical term in complexity analysis: | https://en.wikipedia.org/wiki/Amortized_analysis | | If you use the terminology precisely, the worst case cost | of a regular hash table insertion is O(n) because of the | potential resize, but its amortized cost is O(1) because | that's what it "averages out to". | sgerenser wrote: | Agreed this seems like major overkill, but then again so does | writing your own hash table from scratch in C. 
:shrug: | legulere wrote: | If your application is not negatively affected by such jitter, | why not use a garbage collected language? | jeffffff wrote: | iirc memcached either uses or used to use | https://en.wikipedia.org/wiki/Linear_hashing to avoid | rehashing-induced pauses. it definitely makes sense for that | use case but you're correct that it's not the right call for | a general purpose hash table implementation. | pjam wrote: | The progressive rehashing described in the article is very | similar to what Redis does [1]. | | Just thought I'd share an example of a use case where the | incremental rehashing logic makes sense. | | [1]: https://github.com/redis/redis/blob/unstable/src/dict.c | vladf wrote: | What happens if you have a sequence of puts that triggers an | expansion and some keys get migrated, but all of a sudden you | get a series of removes, such that you need to perform a shrink | to keep memory usage linear? | | Do you now end up with three live arrays? You can probably make | things even more pathological... | coliveira wrote: | Hash tables are not great for data structures that change | dramatically in size. If you have this situation, the best | thing to do is to copy one hash table into another and | discard the original data structure. | chasd00 wrote: | can you do hashtable sharding? like have a hashtable point to | another hashtable by using something appended to the key? just | a thought on a procrastination-filled Friday afternoon. | stephc_int13 wrote: | You can also allocate a small hash table at first, and | reallocate it in a large virtual memory region once it's about | the size of a page (4k) so you won't have to reallocate it | later. | kbenson wrote: | > So on `get` if we're in the process of resizing | | Not just _get_, but _set_ also, right? Otherwise I would think | a large group of set operations that overflowed more than once | might be problematic to deal with. Or maybe I'm missing | something that makes that a non-issue?
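The gradual-migration scheme thamer describes (and kbenson's point that _set_ has to participate too) can be sketched in plain C. Everything below is made up for illustration -- the `ht_*` names, the batch size of 4, the 3/4 load factor -- and it handles only int keys, with no deletion and no full-table check (the load factor keeps the arrays from ever filling up completely):

```c
#include <stdlib.h>

/* Sketch of gradual rehashing with int keys and linear probing.
   Single-threaded; all names and constants here are hypothetical. */

typedef struct { int k, v, full; } slot;

typedef struct {
    slot *old_slots;
    slot *new_slots;      /* non-NULL only while a resize is in progress */
    size_t old_cap, new_cap;
    size_t migrated;      /* index of the next old slot to examine */
    size_t count;         /* approximate number of live keys */
} ht;

static int raw_put(slot *a, size_t cap, int k, int v) {
    size_t i = (size_t)(unsigned)k % cap;
    while (a[i].full && a[i].k != k) i = (i + 1) % cap;
    int fresh = !a[i].full;
    a[i].k = k; a[i].v = v; a[i].full = 1;
    return fresh;         /* 1 if this was a new key */
}

static slot *raw_get(slot *a, size_t cap, int k) {
    for (size_t i = (size_t)(unsigned)k % cap; a[i].full; i = (i + 1) % cap)
        if (a[i].k == k) return &a[i];
    return NULL;
}

/* Move a small batch of entries old -> new; free the old array when done. */
static void ht_step(ht *h) {
    if (h->new_slots == NULL) return;
    for (int n = 0; n < 4 && h->migrated < h->old_cap; h->migrated++) {
        slot *s = &h->old_slots[h->migrated];
        if (!s->full) continue;
        /* don't clobber a value that was written to the new table mid-resize */
        if (raw_get(h->new_slots, h->new_cap, s->k) == NULL)
            raw_put(h->new_slots, h->new_cap, s->k, s->v);
        n++;
    }
    if (h->migrated == h->old_cap) {
        free(h->old_slots);
        h->old_slots = h->new_slots;
        h->old_cap = h->new_cap;
        h->new_slots = NULL;
    }
}

ht ht_new(size_t cap) {
    ht h = {0};
    h.old_slots = calloc(cap, sizeof(slot));
    h.old_cap = cap;
    return h;
}

void ht_set(ht *h, int k, int v) {
    ht_step(h);
    if (h->new_slots != NULL) {           /* resizing: write to new table only */
        if (raw_put(h->new_slots, h->new_cap, k, v)) h->count++;
        return;
    }
    if (raw_put(h->old_slots, h->old_cap, k, v)) h->count++;
    if (h->count * 4 >= h->old_cap * 3) { /* 3/4 full: begin a resize */
        h->new_cap = h->old_cap * 2;
        h->new_slots = calloc(h->new_cap, sizeof(slot));
        h->migrated = 0;
    }
}

int *ht_get(ht *h, int k) {
    ht_step(h);
    slot *s = h->new_slots ? raw_get(h->new_slots, h->new_cap, k) : NULL;
    if (s == NULL) s = raw_get(h->old_slots, h->old_cap, k);
    return s ? &s->v : NULL;
}
```

Note that `count` can slightly overcount a key that is updated mid-resize before its old copy is migrated; in this sketch that only makes the next resize trigger a little early.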
| tkhattra wrote: | yes, _set_ also. Go's map implementation uses this technique | - allocate a new array of buckets and gradually migrate from | old to new buckets (I think it does the migration during | _set_ and _delete_ only, not _get_). | tengbretson wrote: | Unless you need the specific memory characteristics of a hash | table, a trie data structure is a pretty good solution to a lot | of the same problems and in my opinion is conceptually easier to | understand. | taeric wrote: | I have never seen a trie described as easier than a hash table. | Curious what makes you say they are easier? | tengbretson wrote: | I find their performance characteristics easier to naively | understand without resorting to profiling with production | data. | taeric wrote: | I should have worded my question more as "do you have a | recommendation to describe tries?" In particular, | construction is tough. | tengbretson wrote: | The C# implementation here is pretty approachable for | just about any kind of programming background. | | https://www.geeksforgeeks.org/trie-insert-and-search/ | [deleted] | bluedino wrote: | What if this became the new "invert a binary tree on a | whiteboard"? | Google234 wrote: | This is pretty straightforward stuff. | 0xbadcafebee wrote: | Hash tables are deceptively complicated, like sorting | algorithms. | cogman10 wrote: | Like sorting algorithms, they can be as complicated as you | want them to be. | | A merge sort isn't the fastest sort but it's pretty easy | for anyone to implement. Something like a timsort is a bit | more complex but is what a lot of languages use nowadays. | | Hash tables are in the same boat. A basic hashtable is | pretty easy to implement. It's only when you start looking | at things like ideal item placement and optimal filling | that things start to get more complex.
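cogman10's point about merge sort being easy to write holds up. Here is one possible top-down version (mine, not from the article or the thread), using an O(n) scratch buffer and the overflow-safe midpoint discussed at the top of the thread:

```c
#include <stdlib.h>
#include <string.h>

/* Top-down merge sort on ints: sorts a[lo..hi) using tmp as scratch. */
static void merge_sort_rec(int *a, int *tmp, size_t lo, size_t hi) {
    if (hi - lo < 2) return;                 /* 0 or 1 elements: already sorted */
    size_t mid = lo + (hi - lo) / 2;         /* overflow-safe midpoint */
    merge_sort_rec(a, tmp, lo, mid);
    merge_sort_rec(a, tmp, mid, hi);
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)                /* merge the two sorted halves */
        tmp[k++] = a[i] <= a[j] ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Returns 0 on success, -1 if the scratch allocation fails. */
int merge_sort(int *a, size_t n) {
    int *tmp = malloc(n * sizeof *a);
    if (tmp == NULL && n > 0) return -1;
    merge_sort_rec(a, tmp, 0, n);
    free(tmp);
    return 0;
}
```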
| kragen wrote: | Dunno, I wrote | https://news.ycombinator.com/item?id=26595520 just now | and my super dumb hash table has two bugs in its 21 lines | of code: | | 1. It doesn't update--attempts to change the value | associated with an existing key are silently ignored, | although they do consume buckets. | | 2. It loops forever if the table is full _and the key it | was looking for is negative_ , because that invokes | conversion of a negative signed value to an unsigned | value, which I think is UB in C. | | Maybe it has more bugs I haven't found yet. | | Additionally the reason it took me 15 minutes to write 21 | lines of code was that I changed a bunch of things in | midstream, before posting the comment: | | 1. At first there was no "full" field in the hash table | bucket, so there was no way to distinguish full buckets | from empty buckets. | | 2. And then when I added it, I added it as "dead", with | the opposite Boolean sense of "full", so I would have | needed a separate initialization routine (to set all the | dead fields to something nonzero). Instead I swapped the | sense. | | 3. I was trying to detect the case where we'd looped back | around to our starting point, indicating that the table | was full and the search was unsuccessful, and I realized | that the condition I wanted was not b == orig, which | would always fail on the first iteration, but (b + 1) % | table_size == orig. But doing that on every iteration | seemed obscene, so I incremented b (mod table_size) to | start out with instead. | | 4. At some point I think I realized I'd forgotten to set | the .full field on newly inserted items. | | 5. Also I realized I'd forgotten to test the table[b].k | == k condition in `get` as well, which would have made | every search unsuccessful. | | So I think it's reasonable to say that even a pretty | limited hash-table implementation has plenty of | opportunities to write bugs, more than I expected. 
But | maybe I'm just running on low blood sugar this afternoon | or something.
 | SavantIdiot wrote:
 | This shouldn't be a mystery to someone who's had a data | structures course. It's a fundamental data structure. Yes, | optimization always goes down a rabbit-hole, but Knuth's | "Algorithms in C" was mandatory in most engineering schools' CS | degrees in the early 90's, and it is essentially what this website | describes, but with more rigor.
 | [deleted]
 | kragen wrote:
 | _Algorithms in C_, which I strongly recommend, is actually | by Knuth's student Sedgewick.
 | oldmanhorton wrote:
 | Ok, but school was years ago. I've had years of relevant | experience that would make me a good fit for some job since, | but I've forgotten how to implement a hash table in C within | 30 minutes. Is that really the gatekeeper we want?
 | kragen wrote:
 | "Is that really the gatekeeper we want?" Yes, I think so? I | mean implementing a hash table in C is really easy. What is | there to "forget"? If you can't do it, either you don't | know what a hash table is, or you don't know how to program | in C.
 | 
 |     enum { table_size = 1024 };
 |     typedef struct item { int k, v, full; } item;
 |     static item table[table_size];
 | 
 |     void put(item kv) {
 |         size_t orig = kv.k % table_size;     // dumbest possible hash function
 |         size_t b = (orig + 1) % table_size;  // use linear probing for collisions
 |         while (table[b].full && b != orig) b = (b + 1) % table_size;
 |         if (b == orig) abort();              // table full
 |         table[b] = kv;
 |         table[b].full = 1;
 |     }
 | 
 |     item *get(int k) {
 |         size_t orig = k % table_size;
 |         size_t b = (orig + 1) % table_size;
 |         while (table[b].full && table[b].k != k && b != orig) {
 |             b = (b + 1) % table_size;
 |         }
 |         return (table[b].full && table[b].k == k) ? &table[b] : NULL;
 |     }
 | 
 | I haven't tested that (just as I wouldn't in a whiteboard | interview) so I don't know if it works. I did briefly skim | the article we're talking about, but I concluded I didn't | have anything to learn from it, so I didn't read it.
I | found a bunch of bugs in my implementation as I was writing | it. And of course it's a pretty dumb hash table: linear | probing is very suboptimal, it uses a statically-allocated | non-resizable hash table, it's specialized for a certain | type of keys, and so on.
 | 
 | But are you saying you can't even do _that_? Probably it | would be bad to hire you for a C programming job, then, | unless it was an entry-level position for you to learn C | in.
 | 
 | ***
 | 
 | Evidently the above comment took me 15 minutes to write. | Oh, and now I see another bug or at least lacuna: if you | try to update the value of an existing key, it doesn't | fail, but it also doesn't update, instead adding a new | entry that will never be read. And then I did write a | minimal smoke test and run it, and unsurprisingly, it does | work:
 | 
 |     int main() {
 |         put((item) { 4, 7 });
 |         put((item) { 1028, 9 });
 |         put((item) { 3, 25 });
 |         printf("%d %d %d %p %p\n",
 |                get(4)->v, get(1028)->v, get(3)->v, get(5), get(3));
 |         return 0;
 |     }
 | 
 | Writing the test and the above additional commentary, and | running the test, took another 8 minutes.
 | 
 | Oh, and there's another bug where it will loop infinitely | when the hash table is full and the key is negative. | (Actually it invokes UB but typically the result will just | be an infinite loop.)
 | 
 | If it takes you 60 or 100 minutes to write this, or to | write something better, maybe you still know C and know | what hash tables are. But if you throw up your hands and | say "I don't remember, it's been a long time since I was in | school"? Either you don't know C, or you don't know what a | hash table is, which probably means you've never written | anything in C but toy programs. (This is a toy program, of | course, but many toy programs in C don't need hash tables.)
 | 
 | The compilable and fixed toy program in question is at | http://canonical.org/~kragen/sw/dev3/dumbhash.c if you want | to probe it for bugs.
| | It occurs to me that you might have been talking about | something that isn't a C programming job. For example, we | might be talking about a job programming an HTML database | front-end in Python using Django. And of course it would be | silly to reject someone for a job like that because they | didn't know C, unless they claimed to know C on their | resume. And it's totally reasonable for a Python | programmer, or a Python/Perl/JS/bash programmer, to not | understand hash tables. | jstanley wrote: | Knowing how to implement a hash table in C doesn't strike | me as something you can "forget". If you know how to write | C and you know how a hash table works, what is there to | forget? | 908B64B197 wrote: | Honestly it's such a simple concept it shouldn't be hard to | remember. | philzook wrote: | Sedgewick's? | seanwilson wrote: | I wouldn't ask someone to whiteboard it but I think "what is a | hash table?" is a brilliant interview question. The depth | someone can go into tells you a lot and can bring up lots of | other topics to discuss. It tells you a lot if a senior | developer doesn't know what a hash table is or only knows "it's | a class from Java" too. | 908B64B197 wrote: | > I wouldn't ask someone to whiteboard it | | I did. Works very well as an interview question. | anaerobicover wrote: | Who said anything about interviews? This is just in the context | of "something interesting to read with your morning cuppa". | thrower123 wrote: | Hashtables are probably more directly useful than all the | binary tree gymnastics. | | The K&R C Programming Language book has nearly the same example | (although I recall it using linked-lists for hash collisions | and not probing). | | I would be mildly suspicious of anyone claiming to have a | recent CS degree who wasn't familiar with the ideas. 
| [deleted] | vram22 wrote: | IIRC, a very simple implementation of a hash table was shown in | the classic Kernighan & Ritchie C book, "The C Programming | Language" (which was one of the first good programming books that | I bought and read, early in my programming career). | | It was in either the 1st (pre-ANSI C) or the 2nd (ANSI C) | edition. Just a handful of lines of C code, and easy enough for | even beginners to understand. Of course, it was not a production-quality | hash table implementation, nor was it meant to be. I | still remember my excitement at understanding the logic and its | usefulness at the time, even though I did not know about the | importance of data structures and algorithms then. | | Edit: Even though it was simple, IIRC, it at least handled | collisions. | benhoyt wrote: | Nice. Yeah, looks like it's in section "6.6 Table Lookup", and | it implements a simple, fixed-size hash table (array size 101) | of char* to char*, and it handles collisions via linked lists. | vram22 wrote: | Cool, thanks for looking it up :) | daneelsan wrote: | This might not be the best place to ask, but I'm wondering what | the best data structure is for these needs:
 |   * there are fields and values, similar to a struct
 |   * I know at initialization time what fields there are (assume a string)
 |   * at run time I want to be able to get these values by accessing via the key/field
 |   * also I want to update those values
 |   * what I don't care about is inserting/deleting fields, because that's not allowed
 | 
 | I guess what I want is a struct-like data structure, but in an | interpreted language.
 | 
 | Might've typed too fast... | whateveracct wrote: | if the fields are different types, how do you know the type of | key access's return? | daneelsan wrote: | I guess that can be solved by creating an Expression type | daneelsan wrote: | Sorry if I wasn't clear enough. Yes, I want something like an | object in JavaScript, like a struct in C but at runtime.
But I | want to implement this myself. At first I thought a hash table | was the right choice. But I don't need the insert/delete/ | increase-capacity operations that this data structure typically | needs to implement. I want to implement this in Mathematica. The | language doesn't have native objects, given that the main idea | is for things to be immutable. But I can still program data | structures, so I'm looking for one... | mywittyname wrote: | A class? After all, a class is a struct with some additional | features to handle functions. | | You can use a static instance if you just need a single object. | ben509 wrote: | You're looking for something similar to Python's namedtuple.[1] | (Though modern Python should use dataclasses from the stdlib or | the attrs package.) | | Essentially, you have an array/list/vector under the hood, and | a mapping of field names to array indices to govern field | access. | | If you're trying to implement this at a lower level, in the | interpreted language itself, take a look at the slots | mechanism[2] in Python. | | [1]: | https://docs.python.org/3/library/collections.html#collectio... | | [2]: | https://docs.python.org/3/reference/datamodel.html#object.__... | daneelsan wrote: | That seems like the thing I'm looking for! Thanks for the | info | sgtnoodle wrote: | It depends on what you want its footprint in memory to look | like. Since you don't need to mutate the structure, a densely | packed structure seems optimal. For small structures, linearly | searching through an array would be fast. For larger | structures, a sorted array could be binary searched | efficiently. | | It also depends on how you can represent the strings and their | typical size. If the strings are small, then allocating fixed-size | chunks to store them inline might make sense. If the | strings are long, it might make sense to go crazy and use a | "trie" data structure to minimize comparisons.
| | I believe Python uses hash maps for classes by default, but you | can alternatively use "slots", which are less mutable and | presumably more densely packed. | daneelsan wrote: | Yes. At some point I encountered this trie with payloads | (which would act as the field values). | tyingq wrote: | _"I guess what I want is a struct-like data structure, but in | an interpreted language"_ | | Python FFI is an option: | https://cffi.readthedocs.io/en/latest/using.html#working-wit... | | Though I don't know what advantage that route would have over | most interpreted languages' already existing associative arrays | / hashmaps / dicts. | | There may be some module already built that serves your needs. | DiscoDB is a good example: a very fast hashmap that uses | mmap(). https://discodb.readthedocs.io/en/latest/ | ectopod wrote: | Use an object? | brundolf wrote: | Sounds like you want a struct except you want to be able to | reflect on the property keys (which you can't do normally in | C/C++/Rust)? | | I got thrown off by "but in an interpreted language" so I'm not | sure whether you're requesting this for C or for another | language. In JS, at least the V8 engine does some of this for | you: if you have a group of objects that always have the same | set of properties, V8 will detect this in many cases and | optimize them under the hood as a struct or class (it must also | store the string keys somewhere since those can be iterated in | JS). | | In C/C++ you'd have to store the keys as strings somehow, and | also add some way to index the struct via those keys (since you | can't even do that with a normal struct). You may be stuck with | a hashmap one way or another. | nicoburns wrote: | Rust allows you to do this with a macro. The serde macros are | pretty close to what you need, but a slight variant could | give you a much more ergonomic API.
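Setting the macro question aside, the array-plus-name-index layout ben509 describes above is easy to sketch directly in C. All the names here (`record_new`, `record_set`, and so on) are hypothetical, and the linear field lookup assumes the struct-like handful of fields sgtnoodle mentions:

```c
#include <stdlib.h>
#include <string.h>

/* A struct-like record with runtime field access: field names are
   fixed at init and mapped to slots in a plain array. */

typedef struct {
    const char **names;   /* field names, fixed for the record's lifetime */
    size_t nfields;
    double *values;       /* one value slot per field */
} record;

/* Linear scan; fine for a few fields. A sorted name array plus
   binary search would scale to larger field sets. */
static int field_index(const record *r, const char *name) {
    for (size_t i = 0; i < r->nfields; i++)
        if (strcmp(r->names[i], name) == 0) return (int)i;
    return -1;   /* unknown field: adding/removing fields is not allowed */
}

record record_new(const char **names, size_t nfields) {
    record r = { names, nfields, calloc(nfields, sizeof(double)) };
    return r;
}

int record_set(record *r, const char *name, double v) {
    int i = field_index(r, name);
    if (i < 0) return -1;
    r->values[i] = v;
    return 0;
}

int record_get(const record *r, const char *name, double *out) {
    int i = field_index(r, name);
    if (i < 0) return -1;
    *out = r->values[i];
    return 0;
}
```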
| brundolf wrote: | Right, I thought about suggesting a macro for C/C++ but I | don't know enough about their macro systems to know for | sure that it would work | nicoburns wrote: | I believe C++ has "Runtime Type Information" (RTTI) that | allows you to do this. Although I also hear that it's not | great. | MaxBarraclough wrote: | To expand on _not great_ : RTTI is one of the few | features of C++ where you pay for it even if you don't | use it, so it's not all that popular. It can be disabled | with compiler flags to shrink the binary. LLVM's coding | standard prohibits use of RTTI, they instead use their | own 'hand-rolled' solution. [0] RTTI is also banned by | the Google C++ coding standard. [1] | | [0] https://llvm.org/docs/CodingStandards.html#do-not- | use-rtti-o... | | [1] | https://google.github.io/styleguide/cppguide.html#Run- | Time_T... | frankenst1 wrote: | JavaScript also lacks a proper hash table. I wonder how much | faster a native implementation could possibly be over say `Map()` | (which keeps insertion order by maintaining linked lists and does | not have constant op). | perlgeek wrote: | What's wrong with using objects as hashes? | tyingq wrote: | There's a table at the bottom of | https://developer.mozilla.org/en- | US/docs/Web/JavaScript/Refe... that shows some reasons why. | frankenst1 wrote: | They can be faster than `Map` but: | | 1) Slower than hash tables | | 2) `{}` is not empty but has default keys | | 3) `{}` has multiple read-only keys | | 2 and 3 are caused by the prototype chain (properties not | present on the initial object will be attempted to be | resolved via the prototype hierarchy) which can be avoided | via `Object.create(null)` | winrid wrote: | Isn't a faster way to, instead of using a backing array, to use a | pool of memory that you just point into with pointers addressed | with the hash? 
| | Nasty stuff that you can do in a language like C :) | ufo wrote: | As far as the CPU is concerned, there isn't a big difference if | you access a block of memory via pointers or via an array. | Converting the hash result to a memory address is not the | bottleneck. The main bottlenecks are the hash function itself | and the logic for dealing with collisions. | selimnairb wrote: | And depending on your use case, you may not need to implement | growing the hash table, or deletions, making it even simpler/less | daunting to implement in C (since this avoids a lot of the memory | management parts of the job). | benhoyt wrote: | Yeah, good point. And then you can allocate the hash table | struct and all your buckets in one call to malloc(). | tirrex wrote: | You skipped hs_remove, but depending on implementation, hs_remove | strategy may change your hs_set, e.g. using tombstones. I've | recently learnt that you don't need tombstones at all in linear | hashmaps, you just do backshift removals; it may be a simple | addition to your map. Check out these repos for example | implementations: | | (C) https://github.com/tezc/sc/tree/master/map | | (C++) https://github.com/rigtorp/HashMap | speps wrote: | A more detailed explanation of "backward shift deletion": | https://codecapsule.com/2013/11/17/robin-hood-hashing-backwa... | chrchang523 wrote: | Well, he did say "simple hash table". In my experience, many | applications have no need for single-element deletion. | attractivechaos wrote: | I wrote a blog post [1] on this a couple of years ago. This is | a quite old technique [2] but was not widely used until | recently. I was surprised that I hadn't found this earlier. | | [1] https://attractivechaos.wordpress.com/2019/12/28/deletion- | fr... | | [2] https://en.wikipedia.org/wiki/Linear_probing#Deletion | tirrex wrote: | I can't find it right now, but I saw an older version of the wiki | page that had a C code example of how to do deletion with this. | This is how I discovered it.
I knew your blog but apparently | I hadn't visited recently. | | > I think the new deletion algorithm without tombstones is | overall better than traditional algorithms. It should become | the preferred way to implement hash tables | | I agree with you 100%. Tombstones have no advantage over this. | annilt wrote: | https://en.wikipedia.org/wiki/Open_addressing | | Maybe it was the open addressing page? It has pseudocode for | that. | benhoyt wrote: | Thanks for this (and parent reply for mentioning it), that's | simple and clever! | | > The basic idea is not complex: when we delete an element, | we move the next element to the deleted location if it is | pushed away by the deleted element due to hash collision; we | repeat this process until we come to an empty bucket. | | I wonder, is it still _amortized_ constant time? My gut | (which is not very good at computer science) tells me that it | still may be, as with an average probe length of ~1.4 (which | I was seeing), you won't have to walk more than 1 or 2 | elements on average. | hnfong wrote: | It shouldn't take more operations (asymptotically) to | delete than to probe/insert, so as long as the original list | is still amortized constant time then the delete operation | still is. | [deleted] | e12e wrote: | I guess the idea is to contrast diy bsearch vs diy hash table - | but it's a bit odd to mention bsearch is in the standard lib, | then not use it? | | > Here's how you'd do it in C (with and without bsearch). | | Judging by a random hit on the net, it would've been quite | straightforward to use the stdlib version? | | https://www.tutorialspoint.com/c_standard_library/c_function... | st_goliath wrote: | POSIX actually specifies a hash table interface in the standard | library as well: https://man7.org/linux/man-pages/man3/hsearch.3.html | | In typical Unix fashion, however, it is designed to use | internal, global state, so you can only have _one_ table at a | time.
| | There is a GNU extension that provides the same functions with | an `_r` suffix that allow you to work with more than one table. | Tyrubias wrote: | I've never really understood why standard Unix functions rely | on internal global state. For example, strtok(): | | https://en.cppreference.com/w/c/string/byte/strtok | | Was there some reason it was optimal to write these functions | this way back in the early days of Unix? | jandrese wrote: | It probably saved a few bytes of memory or disk. A PDP11 | had 64k of memory and had to be shared by multiple | simultaneous users. A lot of the Unix tradeoffs can be | traced back to limitations of 1970s hardware. | | Plus you weren't going to need more than one anyway, your | program wasn't going to be big enough. | Zambyte wrote: | Because they were written with the expectation that they | would be used in composable programs that each do one thing | well. | kragen wrote: | Well, if you wanted multiple dbm files or hash tables or | whatever, it was easy enough to run multiple processes. | Unix processes on the PDP-11 could only address 64 KiB of | memory--PDP-11 addresses are 16-bit, and C strongly | encourages a von Neumann model where your program and data | are in the same address space, so you couldn't even use 64 | KiB for your program and another 64 KiB for your data, | though the PDP-11 hardware was capable of doing this. (On | things like the AVR, this is handled by copying your data | into RAM at startup unless you mark it PROGMEM, in which | case you can't pass pointers to it to ordinary functions. | PDP-11 Unix didn't have anything like this.) | | As long as you don't have multiple threads (and you | shouldn't, and on Unix you couldn't) the static-variable | interface in these cases was usually more convenient, | though more bug-prone. | | There's also a micro-efficiency question. | | Global variables are at a known memory address. 
If you put | them in a struct and pass around a pointer to it, they are | at a dynamically computed memory address. Every time you | want to access them, you have to add the field offset to | the struct's base pointer. This is a small amount of extra | computation, and how much it actually costs you depends on | the hardware--on the 8080 it's a mess that probably | requires you to spill a couple of registers onto the stack, | and on modern implementations of amd64 like the ones from | Intel, it probably takes literally zero extra time, because | you probably aren't keeping all of your addition-capable | functional units busy anyway. | | I don't know the PDP-11's instruction set well enough to | say, but I suspect the cost of such indexed addressing was | intermediate between these extremes: much like the 8086, it | had an "index" addressing mode, mode 6, which added an | immediate 16-bit offset to a base register | https://pages.cpsc.ucalgary.ca/~dsb/PDP11/AddrModes.html, | and it doesn't seem to have had a pure immediate addressing | mode where the address of the variable is just an immediate | word without any indexing. But if you didn't need a base | register for your variables, you could use PC-relative | addressing, thus saving a register and decreasing register | pressure (the PDP-11 only had 8 general-purpose registers, | and one of them was PC, so it really only had 7). | | I don't have any idea what actual PDP-11 compilers did, | though, and that matters a lot here. | tyingq wrote: | A fair amount of the timeline for Unix was when it | supported multiple processes, but not multiple threads. | SunOS/Solaris didn't have threads until 1992, for example, | and it still didn't support SMP, so global state wasn't an | issue initially. Strtok() predates 1992. They did add | strtok_r() once Posix threads were a thing. | ludocode wrote: | hcreate_r() can be used on some of the BSDs as well, but of | course they're not 100% compatible with GNU. 
I've implemented | a platform-independent POSIX hsearch as a wrapper to a | templated hashtable in Pottery: | | https://github.com/ludocode/pottery/tree/master/examples/pot. | .. | | Be warned, hsearch is terrible. You should probably only use | this if you have some existing code that depends on hsearch | and you want to make it portable. | smcameron wrote: | The article actually does both (uses bsearch, and also the home | made one) presumably just to demonstrate both. | smcameron wrote: | (I noticed this because I was wondering why the cmp() | function was taking void pointers only to cast them to | "item*", and it occurred to me it would make sense if it were | passed as a function pointer to a generic library function | with no knowledge of "item*"... as it is when bsearch is | used.) | e12e wrote: | Heh, I missed that - just skimming the code on mobile. Maybe | there should be a slight edit in the text and/or the custom | binary search could be dropped (as it's not the main focus of | the article). | SavantIdiot wrote: | How are keys removed so that the table doesn't overflow? Are you | really going to recommend growing the table by duplicating it? | | > Beware: this is a teaching tool and not a library, so it's not | very well tested. | | Dear OP: This really bugs me. If you're not going to put the | effort in, and instead crank out some content that might be | buggy, then please don't pollute the web with more half-assed | "look what I can do" pages. | kljsandjb wrote: | 100% agree with U. | Tyrubias wrote: | Yes, I believe you are intended to duplicate the table once a | certain load factor (fraction of elements filled) of your | choosing is reached. | | For linear probing, this is recommended to keep your probe | lengths logarithmic: | | https://www.sciencedirect.com/science/article/abs/pii/019667... | | Removing elements for a hash table using linear probing is | fairly non-trivial. 
Without using so-called "tombstones" to | flag an element as deleted, you essentially have to shift | elements up in the table. | | Also, your rudeness is really quite unwarranted. If everyone | took that attitude, where would the teaching examples ever come | from? | humantorso wrote: | This is the worst attitude I have seen on HN. Look out people - | it's the internet's gatekeeper - making sure only | legitimate/fully-thought-out things are published to the | internet. By golly, we wouldn't want to waste bits and bytes on | just random things being available on the internet. | st_goliath wrote: | > Are you really going to recommend growing the table by | duplicating it? | | Um... yes. | | You want the data linearly in memory in the first place so you | can have constant-time access by index. If you want to grow a | dynamic array, doing a realloc can easily cause the same effect | as the article does manually (alloc new, copy data, free old), | because the memory immediately after may be occupied by | something else if you did _any_ other malloc/calloc. Copying | over the data is linear in time, thus degrading your | constant-time insert to linear time as soon as the table is | full if you do a realloc for single items instead. | | You may find the choice of growing _when half full_ as in the | article questionable, because then there will always be an | unused half, but pretty much every remotely practical dynamic | array implementation in C (and probably a ton of other | languages) does that (grow by capacity * 2 _when full_) to | maintain _amortized_ constant time when appending. | tyingq wrote: | It's also how many popular, in-production hashtable | implementations work. | | From the uthash user guide, for example: | | _"Normal expansion - This hash attempts to keep fewer than | 10 items in each bucket.
When an item is added that would | cause a bucket to exceed this number, the number of buckets | in the hash is doubled and the items are redistributed into | the new buckets. In an ideal world, each bucket will then | contain half as many items as it did before."_ | | So, the person questioning "why" should instead be explaining | "why not". | solids wrote: | The problem with your way of thinking is: who draws the line | to define what's "half-assed"? Because everything can be | improved. You should take ideas from this kind of article, | not fully working code. | the-dude wrote: | This comment bugs me. | tyingq wrote: | I suppose it's a personal preference thing, but I like | oversimplified implementations when the purpose is educational. | | I am more bugged by the accusation that the author did this | because _"look what I can do"_. I don't see any evidence of | that. | ffdeadbeef wrote: | do we really need more hostile, sneering, condescending, toxic | pricks like you policing the web? | kstenerud wrote: | A teaching tool is deliberately simplified so as not to | overwhelm students with details they don't need to worry about | yet. This is absolutely the right way to go about it. | le-mark wrote: | >Are you really going to recommend growing the table by | duplicating it? | | Setting a reasonable initial size then doubling it and copying | when (if!) it needs to grow is a pragmatic, reasonable approach | that's quite common! | knome wrote: | https://github.com/python/cpython/blob/master/Objects/dictob. | .. | ape4 wrote: | This is a fun exercise, but if anyone is thinking about doing | this for a project - don't - use a library. | philwelch wrote: | "But Doctor, I am Pagliacci!" | kevin_thibedeau wrote: | Existing libraries make decisions that aren't necessarily | suitable for all applications. Most hash table code makes | excessive use of malloc/realloc or has power-of-two growth, | which is problematic in memory-constrained systems.
If you want | to minimize waste by using high load factors you're going to | want Robin Hood hash which is still relatively uncommon in lib | code. | attractivechaos wrote: | Although Robin Hood hashing has good performance under high | load, it needs to record the probe length in each bucket. | This partly cancels its advantage especially for small keys. | In my evaluation, Robin Hood hashing doesn't use less memory | in comparison to others. | | > _Most hash table code makes excessive use of malloc | /realloc or has power of two growth_ | | Libraries based on open addressing only call malloc when | there is not enough room. That is not "excessive". Also, if | you know the maximum size before hand, many libraries allow | you to preallocate buckets such that you never trigger | rehashing in this case. | ModernMech wrote: | Small nit: in table 2 with the columns labeled "Key, Hash, Hash | modulo 16", the thing called "hash" is actually known as a | "prehash", while the "hash modulo 16" is the hash. In the context | of hash tables, pre-hashing is when you take something like a | string or a data structure and turn it into a number. Hashing is | the mapping from that number to one of the hash table bins. The | distinction is often lost in practice, but if we're talking about | hash table design and implementation it's important to make this | clear at least once. | joosters wrote: | Is that always true? It depends whether or not you think of the | hash as a property of the hashtable, or a property of the key | itself. e.g. you may have string objects that store their hash | values internally - a useful optimisation if you are dealing | with lots of long read-only strings, given that the hash | function will have a cost of at least O(n) for the string | length. Inserting a string into a hashtable wouldn't change its | hash (what if the string is inserted into two tables of | different sizes?) 
| ModernMech wrote: | Right, it does depend on this perspective, which is why I say | the distinction is often lost in practice. But from the | perspective of hash table designers (at least historically), | the hash can in fact change and is dependent on the size of | the table. The distinction is important because the two | operations are separable. I guess people didn't want to call | one thing a "hash" and the other "hash mod k" because there | are ways to do this other than mod. They could have called | one "hash" and the other "post hash" but it is what it is. | | All I'm trying to say is if you want to know more about hash | tables, and you bother to venture into the literature or some | textbooks, you're going to run into this distinction sooner | rather than later, so I wanted to point that out here. | varajelle wrote: | > Even while writing this I realized I'd used malloc instead of | calloc to allocate the entries array, which meant the keys may | not have been initialized to NULL. | | I think it is still UB to allocate a struct containing pointers | with calloc without explicitly setting those pointers to 0 (or NULL). | The C standard doesn't specify the binary representation of the | null pointer value. In practice, this will work on virtually all | current platforms, but it's still UB. | ajross wrote: | I think that's true only in the narrowest sense of the C | language standard. At the ABI and runtime linkage level, | platforms (yes, every single one of them) document data in the | .bss section to be zero, and C compilers place pointers there | that are defined to be initialized to NULL per the standard. | | So, no, in practice it's well-defined, fully specified behavior | on any given platform. | ncmncm wrote: | Any particular platform can declare it to be defined, by fiat. | "With our compiler, all zeroed raw storage treated as pointers | are null."
| | Anywhere that they technically could, you can treat it as if | they actually had so declared, and just didn't publicize it | enough. | dpedu wrote: | NULL is guaranteed to equal integer 0 by standard, and | converting 0 to a pointer is guaranteed to be a NULL pointer. | | > An integer constant expression with the value 0, or such an | expression cast to type void *, is called a null pointer | constant. If a null pointer constant is converted to a pointer | type, the resulting pointer, called a null pointer, is | guaranteed to compare unequal to a pointer to any object or | function. | layer8 wrote: | An integer constant expression is a syntactical construct, | not a runtime entity. The C standard provides no guarantee | that bytes (chars) with value zero can be read from memory as | a null pointer. | adamrezich wrote: | wait what? my C is a bit rusty but when would reading 0 | from memory & casting to a (void?) pointer type ever != | NULL? | layer8 wrote: | C implementations are allowed to use a not-all-bits-zero | representation for null pointer values. They are merely | required to convert integer constant expressions with | value zero to a null pointer value -- regardless of how | the resulting null pointer value is represented in | memory. | | See also these FAQs: | | http://c-faq.com/null/machnon0.html | | http://c-faq.com/null/machexamp.html | cygx wrote: | Implementation-defined as far as I'm aware. If pointers with | all bits zero actually represent null pointers, things will be | fine. | | _edit:_ | | I haven't found any smoking gun quotation in the C11 standard | on short notice, but there is footnote 296 (p.348) on calloc() | zeroing all bits: | | _> Note that this need not be the same as the representation | of floating-point zero or a null pointer constant._ | | Nothing about this being UB in cases where it _does_ represent | a null pointer constant. | benhoyt wrote: | Wow, thanks, I did not know that. 
Though that would be a | kinda crazy platform or implementation! I'm betting that a | lot of people depend on this. | yudlejoza wrote: | 99% of the effort is 'what hash function?' | varjag wrote: | FWIW there's a standard hash table API in POSIX. | jandrese wrote: | Yeah, but it's pretty lame since you have to specify the size | of the table up front and can't change it later. They didn't | implement the hard stuff! | queuebert wrote: | This might sound stupid, but could you store N-1 entries, and | then store a pointer to the next hash as the Nth entry? So | you dynamically add fixed tables as you run out of space? | | Yes, I realize the lookup would get complicated. | banana_giraffe wrote: | Nope. The POSIX API lets you have one table. | | There is a GNU extension that lets you have more than one | table. | ludocode wrote: | There are many reasons why it's lame! Here's a short list: | | - There is only one global hash table for the whole process! | If you want individual hash tables you need to use | implementation-specific extensions (hcreate_r().) | | - There is no way to remove a key from a hash table. No | implementation I know of provides an extension to do it. If | you want to truly remove a key you must destroy and rebuild | the table. | | - There is no requirement for a means to grow the table. On | some platforms it's possible to run out of space. If you want | to truly grow you must destroy and rebuild the table. | | - There is no standard way to free keys or data stored in the | table. Destroying the table leaks all contents; you must keep | track of keys and data externally and free them yourself. | Only OpenBSD frees keys by default, and only NetBSD has | extensions to call callbacks to free keys and data. | | - Keys must be strings. You cannot provide custom types or | void pointers as keys. There is no way to specify custom hash | or comparison functions. The quality of the hash function is | unspecified. 
| | When using ENTER, you may not want to allocate the key | unless it is necessary to insert a new entry. Since the given | key is inserted directly, it's not necessarily safe to use a | stack-allocated buffer for lookups. It's awkward to replace | the key with an allocation after insertion and it's unclear | whether it's allowed. | | This doesn't even get into incompatibilities between the | various implementations. You will encounter all of the above | flaws even if your only target is glibc. | | No one should ever be encouraged to use POSIX hash tables. | They should be deprecated and we should all just pretend they | don't exist. | jandrese wrote: | IMHO this is one area where the C committee could improve | the situation by mandating a sane and useful hashtable | implementation for the stdlib. | | Or at the very least just deprecate the old one. I can't | remember ever seeing it used in code. | TonyTrapp wrote: | If you had to implement the hash table yourself every time you | wanted to use one, you'd probably realize that there is a better | alternative data structure. Quite often your data is not as | randomly distributed as you may initially think it is. | kragen wrote: | One of the cool nuances of hash tables is that you can exploit | the nonrandomness of your data to get a more uniform | distribution of keys over hash buckets than you would get with | an ideal random-oracle hash function. For example, it's very | common to use hash functions which will produce four | consecutive output values (possibly shuffled) for string keys | "r0", "r1", "r2", and "r3", which guarantees that those keys | will not collide with each other. Hashpjw is one such | widely-used hash function; FNV is another. (Better not use linear | probing with such a hash function, though! And it's probably | better not to use them in cases where the keys might be chosen | adversarially to produce a DoS.) | jolux wrote: | What?
| selljamhere wrote: | I think the parent comment is highlighting that a hash table | is just one implementation of a "Map" abstract data type. | Depending on the use case, other implementations (such as the | balanced search tree behind the C++ STL's std::map) may have | better performance. | wffurr wrote: | Or for the example of word frequencies in the article, a | prefix trie. | jolux wrote: | True, but as far as I know it's rarely the worst choice you | can make, and it's plenty fast for a lot of applications. I | don't know that there's a single data structure that's | always better, which is what the GP seems to be implying. | iforgotpassword wrote: | > Quite often your data is not as randomly distributed as you | may initially think it is. | | That's why it's important to pick the right hash function. A | hash table usually goes a long way before it becomes a | performance issue. ___________________________________________________________________ (page generated 2021-03-26 23:01 UTC)