[HN Gopher] An efficient way to make hash maps with insertion or... ___________________________________________________________________ An efficient way to make hash maps with insertion ordering Author : eatonphil Score : 83 points Date : 2021-07-01 11:21 UTC (11 hours ago) (HTM) web link (blog.toit.io) (TXT) w3m dump (blog.toit.io) | usefulcat wrote: | First off, WTF is up with disabling copying? That is seriously | annoying. Now on to what I came here to say.. | | "When there are enough of these tombstones, the hash map gets | rebuilt, but that's OK. Due to the magic of amortized constant | time, this doesn't slow you down on average." | | If amortized time is all you care about, that's fine. But there | are some applications where worst case time is more important, | and unfortunately I've seen some hash map implementations that | implicitly assume that the worst case never matters. | prutschman wrote: | In some browsers you can get around this by selecting the text, | keeping the mouse button down, hitting the keyboard shortcut | for 'copy', then letting go. The selection will still be | emptied, but the text is already in your clipboard. | peterkelly wrote: | Rust has HashMap and BTreeMap. You use the latter when you care | about ordering. There's also a couple of options for dequeues. A | good overview here explaining which one you'd choose depending on | what oyu want to do: https://doc.rust- | lang.org/std/collections/index.html | | Use the right data structure for the job. Maps are an interface, | not an implementation. | edflsafoiewq wrote: | The indexmap crate provides a hashmap that iterates in | insertion order. | floitsch wrote: | Rust has hash maps that hate you. | | The default hash map visits the keys in arbitrary order | (https://doc.rust- | lang.org/std/collections/struct.HashMap.htm...). | | You could use a BTreeMap, but that's not insertion order and | requires the user to come up with some sort of ordering. | IshKebab wrote: | They don't hate you. That's how it's supposed to work. | Otherwise you get issues like the one discovered in the | article. | | Would you say Rust has Vecs that hate you because prepend | isn't O(1)? Or does JavaScript have Maps that hate you | because they aren't sorted? | | There's no perfect container. It's all trade-offs. | ErikCorry wrote: | The article also explains how the issue discovered can be | fixed. | | And JS maps are insertion ordered. | ecshafer wrote: | This is a weird article. It creates the idea that hash maps are | uncommon? surprising? I am not really sure. Then makes the claim | that you would expect them to be ordered by insertion order, and | not just randomly which is the common usage of hash maps. Ordered | Hash Maps exist, and isn't a insertion order hash map, basically | just a list? But Hash Maps are so common and used absolutely | everywhere its almost like its aiming at people who have used | them but never any other data structures. | EamonnMR wrote: | When I was first taught Java in college I didn't find out about | hash maps for a while because it was considered too advanced | for beginners (ohh scary templates!) | myphs wrote: | This is quite interesting. I wonder, why this was decided to | be too advanced. In my uni we teach it from the very | beginning. | pydry wrote: | Python switching all dicts to being ordered is my favorite new | feature from the last five years. | | It's useful all over the damn place. | rob74 wrote: | Hash maps are a fairly recent addition to the standard data | structures "library", because they are relatively performance- | intensive. They are of course very useful and therefore | commonplace now, but I think that only came about after the | rise of dynamic languages which have built-in hash maps | (Javascript objects, PHP "associative arrays" etc.). So | nowadays any new programming language which wants to attract | web developers (e.g. Go) needs a built-in hash map. | boardwaalk wrote: | Ehh, hash tables were invented in the 50s and were and are | used wherever they are useful. I'm pretty sure a running joke | is that every decently sized C program contains a half dozen | hash table implementations. They're not recent. | mypalmike wrote: | Hash maps have been commonly used in C since just about | forever. | | "isn't a insertion order hash map, basically just a list?" | | No. They are different in the computational complexity of | random access operations. | benibela wrote: | But the standard C++ std::unordered_map only came with | C++11 | jerf wrote: | "The other feature you quickly find yourself needing is the | ability to use arbitrary objects as keys in a hash map. In | languages that don't support that (Perl, Python, Go, JavaScript | until recently)" | | I don't think that's right. Python and Go do effectively support | that. Each has their own take on the problem of being unable to | use mutable keys, but I index by some sort of object all the | time. Makes iteration over hashes substantially more useful when | you get objects for both hashes and keys. | | Back in the Python 2.0 era, before it grew all the other features | that were advantages over Perl, before Python even had | generators/iterators, this was one of the big reasons I preferred | Python over Perl. Only being able to index by strings wasn't | necessarily a huge stopper in theory, but in practice it was very | dangerous, because you really need to _encode_ your keys. It was | extremely common to see print($hash{$keyPart1 . | "|" . $keyPart2}); | | when what you need is | print($hash{pipeEncode($keyPart1) . "|" . | pipeEncode($keyPart2)}); | | to be safe, where pipeEncode can be as simple as a routine to | backslash encode pipe and backslash characters (although no | simpler than that; you must do both of them to be safe, that's | the minimum). This could result in non-trivial bugs and even | security issues when you start encoding things like $object . "|" | $permission and hackers start sticking pipes in the names of | objects. | | But between the inconvenience of that if you know what you're | doing, and the pile of code you'll encounter from developers who | don't know why that's important (and may even argue in code | review), it was definitely the sort of thing that grinds you down | over time. | | In Python it was safe to print(hash[(keyPart1, | keyPart2)]) | | and it was consistent and safe. Or use an object, etc. | ErikCorry wrote: | It looks like your "pipeEncode" is generating a string that | reflects your desired key equality function. I mention this | approach in the blog post, but I consider it a poor substitute | for being able to use arbitrary objects and specify the | equality function on the map or set. | | The issue of mutable keys is a slightly different one. If you | mutate any of the properties of your object that are used by | the map you are going to have a bad time, so don't do that. And | I guess if your maps are sufficiently simple (eg JS's object- | identity maps) then the user can't make that mistake, but at | what cost? | | If they are generating strings as keys and they mutate the | object after creating the string then this will also break so | they haven't even really avoided the problem. | jerf wrote: | "I consider it a poor substitute for being able to use | arbitrary objects and specify the equality function on the | map or set" | | My point is that it's an even _worse_ substitute than most | programmers realize, because to use it properly you have to | understand how to encode parameters. The thing that people | usually use, string concatenation with some delimiter, is | fundamentally flawed. | | (My favorite... and, alas, I've inherited a system that uses | this, though fortunately it hasn't surprised me yet... is | using "underscore" as a delimiter, for values whose names | routinely include underscores! Fortunately, nothing ever | tries to extract the original values from the key | programmatically, and it's not really in a place hackers are | going to attack. But still... yeesh.) | | The one exception I've sometimes made is that if you happen | to be in an environment where you _know_ a certain value will | never be used, you can use that as the delimiter; I 've used | ASCII NUL for that a few times. But you have to be sure that | it's not just a "weird" value that "nobody would ever use", | but something truly _excluded_ by the context, something that | regardless of what is input by someone somewhere is | completely impossible to ever get to your code. Generally, | the characters you can type on a keyboard are not a good | idea. | nemo1618 wrote: | Go does not support using some types as map keys, including | slices, channels, functions, and other maps. Slices are the | most annoying, and I have seen plenty of code that uses | fmt.Sprint(s) as a workaround. Fortunately the compiler now | recognizes when you convert a []byte to a string for use as a | map key, and will not allocate a new string. | jerf wrote: | Python either. I correct my post above to switch to indexing | by a tuple, because what I originally had is wrong: | >>> d = {} >>> d[[1,2]] = 10 Traceback (most | recent call last): File "<stdin>", line 1, in | <module> TypeError: unhashable type: 'list' | | This is why I qualified my statement with "Each has their own | take on the problem of being unable to use mutable keys,". | | Go can be consistent in a simple way because of its type | system, it can see if any part of a key has something in it | that can't be hashed: https://play.golang.org/p/rf8IqPb76Em | | Python, as befits Python, has default behavior for instances | that I believe is "is" equivalency, but you can override that | with various double-underscore methods to do whatever. | vharuck wrote: | >Python, as befits Python, has default behavior for | instances that I believe is "is" equivalency | | Python dicts consider two keys the same if they have the | same hash value and are "=="-equal. So __eq__ and __hash__ | are the dunder methods to finagle. Python's sets are the | same way. | | A useful example is with the pathlib library. | from pathlib import Path p = Path('a.txt') | q = Path('a.txt').absolute() p is q # False | {p, q} # Only one element | | "q" carries some different info than "p", but it refers to | the same file location. So not considering them distinct | values is a good decision for this package. | joshuamorton wrote: | Note that this is an oversimplification, as | p = Path('a.txt') q = Path('a.txt') | p is q # False | | since p and q are different objects and "is" equality | checks if the objects are the same (this can be | interpreted approximately as "have the same memory | address"), so it'll almost never be true. And in the | cases where it is ( 2 is 2), you shouldn't rely on it, as | most of them are optimizations. | arp242 wrote: | In Go it depends if the type is "comparable"; i.e. if "==" | works. | | You can use arrays in Go; e.g. map[[2]int]string, and you can | also use channels; although I'm not sure what the rules for | comparing channels are exactly off-hand (I'm struggling to | come up with a scenario when this would be useful off-hand | actually). | | The big problem with slices and maps is that they can be | modified. That is, what happens if you modify a slice after | you used it as a map key? In slices this is worse than with | maps because the backing array can change if you run out of | cap space. And also, do you compare by value or identity? And | again, what happens if either changes? | | I'm not sure if it's possible to come up with a set of rules | that wouldn't take people by surprise in at least some cases. | patrickthebold wrote: | This reminds me of one bug I introduced a long time ago: | | This was a node backend, and there was some code that was | recreating an object be reinserting all the key value pairs in a | precise order so that that's how it would serialize in the json | and then the UI depended on that order. But to me, it was just | recreating the same object so I deleted that code, causing the | bug. | | In retrospect, I should have at least asked the committer what | they were trying to do. | | In my defense, this was before javascript guaranteed the | insertion order would be the order kept. | arsome wrote: | Is there a proper way to use custom objects as keys in hashmaps | in JS (well, ideally TypeScript compliant)? Coming out of the C++ | world where this is trivial and something I use quite often this | seems pretty infuriating. Maybe a general data structure package | that's worth pulling in? | | EDIT: After doing a bit of research I found "tstl", looks | interesting, their hashmap appears to support custom hashers and | equality comparisons: | | https://tstl.dev/api/classes/std.hashmap.html | | I'm curious about experiences with it if anyone's had the chance? | ErikCorry wrote: | A quick look at the source implies that that iteration order is | not predictable. | upzylon wrote: | If the problem is just that you want to use non-primitive keys, | that's what Maps are for. | | https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe... | arsome wrote: | I'm aware of them, but the problem with that is you can only | compare an object by instance if I recall correctly. There's | no way to override the hashing and equality check that I | could find, meaning even for simple objects you're basically | forced to serialize your keys in some way still, which is | disgustingly inefficient and slow for no good reason. | doteka wrote: | So much this. I remember being so excited that Map and Set | were added to the language, only to find that they were | worse than useless for objects. At least Set can be handy | in a pinch, but I have never seen an actual usecase for Map | that isn't solved better with plain old JS objects. Curious | if anyone else has? | upzylon wrote: | There definitely are valid use cases for a Map, and even | more now that we have WeakMaps (e.g. associating data DOM | nodes). Or mapping functions to values, e.g. something | like this: const returnCache = new Map<() | => any, any>() const memoized = <T extends () | => any>(f: T): ReturnType<T> => { if | (returnCache.has(f)) return returnCache.get(f) | const v = f() returnCache.set(f, v) | return v } | gildas wrote: | > I have never seen an actual usecase for Map that isn't | solved better with plain old JS objects. Curious if | anyone else has? | | With a Map, you can easily and efficiently associate data | with DOM nodes without changing them. | mediumdeviation wrote: | Create a wrapper class that takes a stringifying function that | converts the key objects into a primitive and use a Map<string, | [K, V]> as the backing store. fast-json-stable-stringify can be | used as the default stringifier, and if the objects are large | or have a natural key a custom stringifier can also be | supplied. Since we are using TS we can have our custom class | implement Map<K, V>. | adrianmonk wrote: | Having hashmaps be ordered is not always desirable. It's not | necessarily meaningful for every application[1]. Also, it can | have overhead. | | If the problem is that the developer doesn't get what they | expect, often the solution is to require the developer to be | explicit about what they ask for. | | I think that could work well here. Instead of having map.keys(), | the API could have map.orderedKeys() and map.unorderedKeys(). (Or | make it okeys() and ukeys() to be more concise.) | | Yes, it's slightly more typing, but to me that's a small price to | pay to ensure you get what you want. | | --- | | [1] For example, maybe the order that's meaningful to me has | something to do with the values rather than the keys. If the map | represents git commits, maybe the order that's meaningful to me | is a topological sort, and the map isn't going to provide that. | stickfigure wrote: | This would be genuinely weird. The in-memory representation | will make one or the other approach more efficient. If you want | ordered, pick an ordered implementation (eg Java's | _LinkedHashMap_ ). If you don't care, pick an implementation | that doesn't care (eg _HashMap_ ). If you want some special | ordering, pick something that can use a Comparator (eg | _TreeMap_ ) or just sort the data when fetching. | | A Frankenstein _LinkedRandomTreeMap_ would pay the overhead of | all approaches for an extremely exotic use case. | adrianmonk wrote: | I suppose these ideas can be combined. An implementation that | doesn't support ordering could be of a type that only has an | unorderedKeys() and lacks orderedKeys(). Which is more | explicit than having a keys() function that behaves | differently than other types' keys() functions. | | Also, I guess you could define unorderedKeys() to mean "I'm | not picky about the order" (rather than "it must not be | ordered"). Then all types could have an unorderedKeys() | function. | stickfigure wrote: | If _unrderedKeys()_ isn 't picky about the order, then | there's no reason to have a separate method. Just _keys()_ | will do, and different implementations can have different | behavior. | | Also note that relaxing the ordering requirement when | obtaining keys does not necessarily improve performance; if | you've already done the work of maintaining an ordered list | (ie _LinkedHashMap_ ), it's actually faster to iterate in | order. | | Furthermore, what exactly does _orderedKeys()_ mean anyway? | Insertion order? Does reinsertion change order? If some | sort of natural ordering (eg, topological) then how do you | configure that? | | I think the "state of the art" is basically correct here: | An abstract _Map_ interface that provides basic operations, | and a variety of implementations all with different | behaviors and performance profiles. Sorry. | dragontamer wrote: | There's all sorts of application-specific assumptions in this | post, that it isn't funny. Ex: Use of hash-tables with insertion | ordering to handle a insert-once task queue (preventing "double | tasking" if a task is ever somehow inserted twice?) | | Solvable with a Linked-list, circular buffer queue + hash table | combined together (hash-table to prevent double-insertions. | Linked-list / circular buffer queue for the FIFO functionality). | | If O(log(n)) insert/removes are sufficient and you have | priorities associated with each element, a Heap data-structure | works (all "repeats" will be next to each other, because max- | heaps always keep the maximum element on the top) | | The goal isn't supposed to be to make a data-structure that | magically works in all situations. The goal is to define standard | library "components" that can be combined together in ways that | matches the performance demands of a particular application. | | --------------- | | And this is all still single-threaded land. If you want a highly | performant multithreaded data-structure, you're probably going to | have to write your own from scratch (instead of using building | blocks from the standard library). | | There's just too many performance assumptions that goes into a | multithreaded data structure. Is it read-mostly / write-rarely? | Read/write balanced? Single-producer / multi-consumer? Multi- | producer / single-consumer? Multi-producer / multi-consumer? Is | the data-structure able to "count the threads" that may access it | and use thread-barriers (thread-pools may practically limit a | data-structure to say 64-threads, which means a 64-thread barrier | / semaphore can be a highly-performant solution to | synchronization) ? Or is it a fully dynamic design where an | unbounded number of threads may exist (unbounded number of | threads means the sempahore method won't work). | | -------- | | I mean... most people don't care about performance. So a standard | library data-structure with a mutex is just fine for well over | 95% of applications. But if you care about performance, then | suddenly things get complicated. | chris_st wrote: | The surprising/interesting thing about this article is the notion | that insertion order is the order that you want to iterate | through a hash map. | | In my experience, I'm doing either (A) using them as associative | memory (I want the value for a key) and I don't ever iterate, or | (B) I don't care about order when I iterate (I'm summing values, | say) or (C) I want the keys in some _sorted_ order, that probably | has nothing whatsoever to do with insertion order (counting | keywords, then showing them in alphabetical order). | | So I'm curious why the insertion order is preferred, in your | experience. | __s wrote: | A result of insertion order is that when I print {b:2,a:1} I | get "{b:2,a:1}". This can be nice for being able to store | something like an HTML element's attributes as a dictionary | while still serializing it back to the order it was declared | with | | Sorted order also requires sortable keys-- not something that's | required when hash maps only require hashing & equality | | A stable ordering is nice, without the ability to sort | insertion order meets this ask. | https://mail.python.org/pipermail/python-dev/2012-December/1... | shows how in Python this useful property also comes as a side | effect of an optimization | dheera wrote: | There are also some use of dictionaries where the ordering | matters (e.g. some mongoDB commands) and you can't use JSON | or BSON and you have to use goddamn SON. | uyt wrote: | Compact dict which preserves insertion order are now the | default implementation of dicts in python 3.7. They gave a | talk about it in https://www.youtube.com/watch?v=npw4s1QTmPg | cabalamat wrote: | > A result of insertion order is that when I print {b:2,a:1} | I get "{b:2,a:1}". | | This implies that {b:2,a:1} is a different data structure | than {a:1,b:2}, which disagrees with my intuitions about how | associative arrays ought to work. | | If I wanted a data structure that cares about order, I could | use [(b,2),(a,1)] and [(a,1),(b,2)] | OskarS wrote: | This is for, like, debug purposes. Makes logs and things | like JSON requests easier to read because the output | matches the source. Logically though, they are still | unordered for things like equivalence checks, so those two | examples you had there still compare equal. | | Since order is unspecified, but when serializing the hash | table you have to pick one, you might as well go for | insertion order and make it easier for the programmer to | read. Especially since this kind of hash table gets that | "for free" | macspoofing wrote: | >So I'm curious why the insertion order is preferred, in your | experience. | | It's a little bit like asking what's the use of a linked list. | A map which can iterate through keys in insertion order is just | a data structure. | | A recent use-case for me was using timestamps as keys that map | to captured state at that time. | chris_st wrote: | > * It's a little bit like asking what's the use of a linked | list.* | | Well, no, it's more like asking why you'd use data structure | X instead of Y. | | I was asking _why_ you 'd want that property as part of your | data structure, and pointing out that (AFAIR) I've never | needed it. | | For your use case, I guess you wanted to be able to retrieve | the data for specific time stamps? Otherwise I'd think you'd | just put it into an array. | recursive wrote: | > A recent use-case for me was using timestamps as keys that | map to captured state at that time. | | That sounds like sorted key order would suffice just fine. | metalliqaz wrote: | When working with data that has been scanned/parsed from a | file, it is often very useful to me that I retrieve the data in | the order it was found in the file. This gives humans some kind | of continuity between the input and the output of the program. | foobarian wrote: | While we can sort the keys before iterating a plain hashmap, | getting insertion order is particularly cumbersome because we | need to keep some extra information about the original ordering | of the elements. So I guess I'm kinda glad that this is the | default implementation, and getting other sort orders is easy | with just having access to the keys. | ErikCorry wrote: | The most important thing is that there is some defined order. | This makes it easier to debug, and reproducibility is just a | very nice property. See | https://en.wikipedia.org/wiki/Reproducible_builds Simple stuff | like making golden output files for your tests becomes easier. | | Certainly, having an ordering on the keys themselves (eg. | alphabetical order) is an alternative to insertion order. I'm | fine with that. But eg. the built-in maps in Go, Perl or JS | don't support that. | | So often you end up getting the keys, and then immediately | sorting them before iterating. That's taking something that | should be O(n) and making it O(n log n). I remember doing this | all the time in Perl. | | And often there's no order, so you have to create one for your | key type. That's extra work, and it's just easier to use a data | structure that doesn't require it. | | I would be interested to see performance comparisons for maps | where the keys are ordered without having to sort on each | iteration. Often they have tree structures internally, which | means cache-unfriendly pointer chasing, but probably you can | remove most of this (and the log-n access time) with wider | trees like B-trees. | sgeisenh wrote: | > The most important thing is that there is some defined | order. This makes it easier to debug, and reproducibility is | just a very nice property. See | https://en.wikipedia.org/wiki/Reproducible_builds Simple | stuff like making golden output files for your tests becomes | easier. | | Relying on this type of behavior when it isn't required is | somewhat of an anti-pattern. And can often lead to mindlessly | copying the results of an implementation in tests. | | A quick aside: any sort of generic ordered container that | supports O(n) iteration must necessarily require O(n log n) | work to construct in the average case. So the asymptotic | complexity is necessarily the same as sorting the keys of a | hash map with undefined iteration order to artificially | create a defined order. | | It isn't too hard to simulate the behavior of insertion order | iteration by maintaining an array alongside the hash map. And | surprisingly enough, space and runtime complexity is | practically identical to containers that implement insertion | order iteration ;-) | ErikCorry wrote: | > any sort of generic ordered container that supports O(n) | iteration must necessarily require O(n log n) work to | construct in the average case | | This doesn't apply if the order you want is insertion | order. | | These hash maps are constructed in O(n) on average, just | like any other hash maps. | | Java's LinkedHashMap is also constructed in O(n). | simcop2387 wrote: | For perl yoy can get this but you have to explicitly ask for | it outside the perl interpreter. This is dome by setting up | the hash seed and tell it not to add turbulence during | runtime. | | export PERL_HASH_SEED=0; export PERL_PERTURB_KEYS=0; | | The article doesn't mention why python and perl made the | choice to add more randomness but ti's because there were | attacks out there where people could determine which bucket | keys would be put in ahead of time and calculate a set of | keys to cause pathological behavior in anything that made a | hash from user input, this was commonly url parameters or | POST requests. This meant that a simple web request with the | right information in it could cause your application to | continually reallocate memory and expand the hash size to | accomadate the inputs. Usually this just burned cpu time and | made other requests time out but it could also result in | excessive memory usage and the OOM killer ccoming into play. | | The above environment variables work to set a known seed and | then not mix any new randomness in, this doesn't get you an | jnsertion order but it will make the hash itself as | reproducible as possible (if you don't put the keys in in the | same order your results will vary still). I uwe this in a now | defunct (because of second/third/fourth system syndrome) | fuzzer I had going to detect problems in the prerelease | versions of perl. | ErikCorry wrote: | I don't think compact dictionaries are necessarily | vulnerable to this issue. You could randomize the hash seed | without affecting the iteration order. | _old_dude_ wrote: | When you want to output values in the same order as the user | give it to you e.g. when you read then write JSON or TOML files | and those files can be written by a real human. Changing the | order feel arbitrary. | | When you have a language that support keyword arguments | (kwargs) like Ruby or Python more recently. | arsome wrote: | > The surprising/interesting thing about this article is the | notion that insertion order is the order that you want to | iterate through a hash map. | | I agree, this is what java calls a "LinkedHashMap" and what | python calls an "OrderedDictionary" - there are times when it's | quite useful, but often it's simply not necessary, uses extra | memory and shouldn't be the default when you just need a | hashmap. | oweiler wrote: | In Kotlin, calling mapOf() gives you a LinkedHashmap. Which | is a weird choice IMHO. | stickfigure wrote: | Or you could look at it as "relaxed ordering is a performance | optimization" and shouldn't be taken prematurely. | | In Java, there is no default; you ask for a HashMap or a | LinkedHashMap explicitly. In dynamic languages like Python or | JS I prefer default insertion ordered maps because if I | wanted tightly optimized code, I wouldn't use those languages | in the first place. | ErikCorry wrote: | The nice thing is the compact dictionary doesn't use extra | memory, so why not make it the default? | tialaramex wrote: | It may (perhaps, I can't tell) not use "extra memory" | compared to the data structure they had before, but it | definitely is using memory it wouldn't need if it didn't | preserve order. | | Otherwise it would be more or less a violation of the | pigeonhole principle. The insertion order is information, | if you don't need to store that anywhere but you promise to | give it back anyway that's a hole in your information | theory. | floitsch wrote: | With that reasoning a sorted list would need more memory | than an unsorted list. | tialaramex wrote: | How so? Sorting _changes_ the order of the list, to the | extent it 's adding information (the sort order) it's | also destroying the same amount of information (the | previous order). | | It's easy enough (including by accident) to deliver | insertion ordered iterator behaviour for small hash maps | that have never had any items removed, but after that | things get sticky very quickly. | floitsch wrote: | You were stating that "[...] it definitely is using | memory it wouldn't need if it didn't preserve order.". | | There is no reason for that. A sorted list doesn't need | more memory than an unsorted list. And an insertion- | sorted map doesn't need to use more memory than a map | that doesn't keep the order. | | Independently, some properties we deem valuable are often | side-products of the way the data-structure works. For | example, an efficient binary search tree will naturally | tend to be balanced. In the case of efficient hash maps, | it is now common to add the key/value pair at the end of | an internal vector. Maintaining the insertion order this | way doesn't add any cost. | tialaramex wrote: | It actually _does_ have a cost. In fact I 'm pretty sure | it has two separate types of cost, so let's talk about | both of them. | | Firstly, and this is orthogonal to my original point, | whenever people say "Cost" they actually silently mean | "... that I care about" and so we're involuntarily | obliged to agree to stop caring about anything else. | Because of Hyrum's law even if you _don 't_ promise | ordering (as Python's "Compact Dict" developer proposes | at one point) your users will rely upon it and you're | stuck with whatever you implemented. This is a real cost | for a language or standard library. Look at the sad state | of C++ standard collections. | | But my main point is that the requirement for ordering | forces you to dispose of otherwise permissible | optimisations which could destroy the order. You actually | ran head first into this in the article, when repeatedly | removing and adding elements blows up by clogging your | data structure with tombstones. | | Swiss Tables wouldn't clog up with tombstones. Since they | don't care about ordering they only need tombstones at | all in uncommon special cases (specifically, if the | entire rest of the group is full or tombstones), in most | cases Swiss Tables get to just set removed (key, value) | pairs as empty. But this optimisation isn't open to you. | joshuamorton wrote: | Sticking tombstones in the backing array doesn't actually | cost you anything though. The backing array still needs | to be the same size. Like maybe you end up having to re- | hash more often? | | >Because of Hyrum's law even if you don't promise | ordering (as Python's "Compact Dict" developer proposes | at one point) your users will rely upon it and you're | stuck with whatever you implemented. This is a real cost | for a language or standard library. Look at the sad state | of C++ standard collections. | | Yes but this cost goes away if you commit to it, and the | alternative (like what go does) to force the order to | change all the time also comes with a complexity cost vs. | doing "nothing" and having the order be arbitrary and | sometimes, but not actually, reliable. | deathanatos wrote: | It does, though. See the diagram in the article. There's | two allocated arrays there: one for the real hash table (on | top) and one for the "backing array" which then stores the | actual items. The "extra" bit is the indexes in the top | allocation: those indexes would be omitted in a normal | (unsorted) hash map. (That is, a normal hash map allocates | a single array, which is the storage for both the hash | table and the items it contains.1) | | (There's also some extra _time_ requirements, too, to chase | that additional pointer from the top table to the bottom | one.) | | 1N.b. in some languages those items will be references, but | that doesn't matter for the comparison here. | ErikCorry wrote: | Note that the bottom array can be densely packed, whereas | the top one must have gaps to avoid excessive probing. If | you only had one array you would have to put the gaps in | that array, and it is at least 2 words per entry (much | worse if you are inlining objects). So it's not so clear. | If you want to save space you can also omit the hash bits | stored in the index and reduce the width of the index | entries to 16 or 8 bits on small maps. | | In practice the compact dict does very well on space. | Much better than Java's unordered bucket-based HashMap | for example. | | It's true that there's an extra pointer chase. | eximius wrote: | note: it has been the default on python for some time, then | made official (instead of an implementation detail) after | some 3.x release. 3.7 maybe? | chris_st wrote: | Turns out it's the default implementation in Ruby as of 2.7 | or so as well. | majormajor wrote: | As of 1.9 :) | https://www.igvita.com/2009/02/04/ruby-19-internals- | ordered-... | [deleted] | sorokod wrote: | Other then having a deterministic order without the need to | specify additional ordering constraints, it is usefull when you | are populating the map while linearly processing some data. | mypalmike wrote: | It's useful for when you are in a job interview and are asked | to build an LRU cache. :-) | paavohtl wrote: | I've relied on the order of keys for URL routing in JavaScript | / TypeScript. Each route of the web application is unique, and | the order matters - it's surprisingly easy to end up with | otherwise ambiguous routes like /products/:ean and | /products/all in real world applications. One benefit of using | an object and not just an array is that I can validate at | compile time with TypeScript that each route has exactly one | handler. | | An insertion-ordered hashmap works really well for this without | requiring any extra syntax or data beyond object literals. | spuz wrote: | Imagine you want to write a JSON parser and prettyfier. Then | writing the keys out in the same order as they were inserted | would be very important. Yes, it's true that key order doesn't | matter to the computers that consume JSON but humans consume | JSON too and key order often matters to them. | chris_st wrote: | You may be able to tell from this reply that I've never built | a JSON parser and prettyfier :-) but I'd guess that in this | case, I'd never have any reason to ask for a bit of the JSON | "out of order", so I'd just use arrays...? | noneeeed wrote: | Curious to read this and not see a reference to Ruby. | | In Ruby hashes are ordered by insertion, can have any object as | keys, but the hash key value can be defined on a class by class | basis (by overriding the default `hash` method on the class). | | The ordering is one of those features that I need very | infrequently, but when I do it significantly simplifies that bit | of code. | | I've always found the usefulness and flexibilty of Ruby's hashes | to be a double-edged sword though: they are extremely useful, but | this leads people to use them in situations where defining a | proper class might be better for long-term maintenance. | weatherlight wrote: | I've used this feature of Ruby Hashes to create something that | behaves like a a LRU cache :) | AtlasBarfed wrote: | Java is treated as the "prior art" like it is some sort of | ancient script? That is amusing. | | Since Java was the first mainstream GC language with sufficient | man-hours of engineering, and it was an opportunity after almost | two decades of C/C++ libraries to start anew with the standard | library, but its standard library always seemed simply | evolutionary/"best practice" (not my favorite phrase, sorry) from | what came before it. | | Is the hashmap impl in the JDK actually novel for its time? | | But | ErikCorry wrote: | As I understood it, there was an implementation of "compact | dictionaries" in Java, but it was not the implementation of any | of the collections that were built into the language. It is | "prior art" in that it was done before compact dictionaries | were reinvented for Python. | abecedarius wrote: | IIRC it was for the collections library of the E language, | whose first implementation was in Java (erights.org). The E | designers had a self-imposed constraint that everything | computational had to be deterministic. The standard way of | doing iterable hashtables lacked this property. | | I tried to check my memory by following the link to the | Raymond Hettinger talk, but he doesn't say just what prior | work he was talking about, only that it was in Java. | bullen wrote: | I wish JavaScript had linked hash maps as default so that JSON | would not require arrays to keep the order of objects! | | Hierarchies would not have to look like this: | http://root.rupy.se/meta/user/task/563541707719927980 | ajanuary wrote: | Javascript already does retain insertion order (except for | special casing of things like integer keys) | | JSON objects, however, have unordered keys. | | JSON != Javascript. ___________________________________________________________________ (page generated 2021-07-01 23:01 UTC)