[HN Gopher] Ask HN: What are some cool but obscure data structur...
       ___________________________________________________________________
        
       Ask HN: What are some cool but obscure data structures you know
       about?
        
       I'm very interested in what kinds of interesting data structures
       are out there, HN. Totally your preference.  I'll start: Bloom
       filters. They let you test if a value is definitely NOT in a list
       of pre-stored values (or POSSIBLY in the list - with an
       adjustable false-positive probability that trades off against
       storage.)  Good use-case: routing. Say you have a list of 1
       million blacklisted IPs. A trivial algorithm would be to compare
       a given IP against every element of the set, so the time
       complexity grows with the number of elements. Not so with a bloom
       filter! A bloom filter is one of the few data structures whose
       time complexity does not grow with the number of elements,
       because the 'keys' themselves are never stored ('search' and
       'insert' cost depends only on the number of hash functions.)
       Bonus section: Golomb Coded Sets are similar to bloom filters but
       use much less storage space. Worse performance though.
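To make the mechanics concrete, here's a minimal Bloom filter sketch in Python. The bit-array size, hash count, and the way k positions are derived from one SHA-256 digest are all illustrative choices, not a tuned implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions from one cryptographic hash per probe.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

blocklist = BloomFilter()
blocklist.add("203.0.113.7")
assert "203.0.113.7" in blocklist    # no false negatives, ever
```

Note that `add` and `__contains__` do k hash evaluations regardless of how many items have been inserted, which is the constant-time property described above.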
        
       Author : Uptrenda
       Score  : 1763 points
        Date   : 2022-07-21 22:50 UTC (1 day ago)
        
       | _trackno5 wrote:
        | Not really obscure, but I had never heard of it till recently
        | (and most people I've talked to had no idea about it):
        | Difference Arrays.
       | 
       | Great for when you need to do ranged updates in constant time.
       | 
       | More info: https://codeforces.com/blog/entry/78762
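A difference array fits in a few lines of Python; this is a sketch of the idea from the linked post: store deltas instead of values, do each ranged update in O(1), and materialize the real array with one prefix-sum pass:

```python
def range_add(diff, lo, hi, val):
    # Add val to every element of the conceptual array in [lo, hi], O(1).
    diff[lo] += val
    diff[hi + 1] -= val

n = 8
diff = [0] * (n + 1)          # one extra slot so hi + 1 never overflows
range_add(diff, 2, 5, 3)
range_add(diff, 0, 3, 1)

arr, total = [], 0
for i in range(n):            # a single prefix-sum pass recovers the array
    total += diff[i]
    arr.append(total)
assert arr == [1, 1, 4, 4, 3, 3, 0, 0]
```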
        
       | magicalhippo wrote:
       | Not sure you could call it a data structure as such, but Latent
       | Semantic Indexing[1] kinda blew my mind when I learned about it.
       | 
       | The fact that you could perform linear algebra on documents and
       | words, to get out a number that somehow quantified how close a
        | document was to another really surprised me back in the day. It
        | gave me a new appreciation for what exactly those vectors in
        | linear algebra could represent.
       | 
       | Not in the field so don't know how obscure it is though.
       | 
       | [1]:
       | https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...
        
       | tialaramex wrote:
       | Hazard Pointers are an interesting concurrent data structure.
       | 
        | Suppose we've got a _lot_ of Doodads, let's say there's a Graph
       | of ten million Doodads, and a whole bunch (dozens? hundreds?) of
       | threads are poking around in this same graph, maybe looking at
       | Doodads and sometimes (but not often) removing them from the
       | Graph.
       | 
       | What happens if my thread is looking at a Doodad, and meanwhile a
       | different thread removes it from the Graph? Is the Doodad
       | destroyed while I was looking at it? That sounds very bad. What
       | if while I'm looking at it, the Doodad is destroyed, but, a
       | nearly identical Doodad appears in its place. How would I know?
       | Not acceptable.
       | 
       | We could reference count the Doodads, we'd need atomic reference
       | counts, something like C++ shared_ptr -- but this would work.
       | However now our performance is terrible, because we've got 100
       | threads increasing and decreasing these reference counts just to
       | look at Doodads and each such change causes a write that must be
       | observed on other threads.
       | 
       | So instead Hazard pointers go like this:
       | 
       | Every thread who wants to look at Doodads asks for a Hazard
       | Pointer, usually these are never deleted once made because it's
       | much easier. A thread sets this Hazard Pointer to point at a
       | Doodad when it wants to look at one, it checks this was
       | successful (if not the Doodad is gone now), then contemplates the
       | Doodad (it can be in this state for an arbitrarily long period of
       | time) and then it clears the hazard pointer once it is done. It
        | can only point one Hazard Pointer at one Doodad (some schemes
        | allow a thread to have, say, two or some other finite number of
        | Hazard Pointers each).
       | 
       | Somewhere a Linked List of _all_ these hazard pointers is kept.
       | When you want to remove a Doodad, you take it out of the graph,
       | then you check the List of Hazard Pointers. If any of them was
       | pointing at the Doodad you removed, you put the Doodad on a
        | special pile since it can't be deleted yet, otherwise you can
       | delete it immediately because nobody else needs it any more. This
       | is a relatively expensive operation, but you're only doing it for
       | removal, not just reads.
       | 
       | Finally there needs to be some mopping up of that pile of
       | "couldn't delete yet" Doodads by trying to delete them again
       | (checking the Hazard Pointers again of course). This can be
       | periodic, it can happen every time somebody tries to delete one,
       | or whatever, the best policy may vary between applications.
       | 
       | This approach means reading Doodads costs only two local writes,
       | no waiting around for the cache. Removing them is more expensive,
       | but that's OK because it was rare.
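The protocol above can be sketched in Python, though only as a teaching toy: Python's GIL and garbage collector make real hazard pointers unnecessary, and a production version would use per-thread atomic slots rather than a dict guarded by one global lock:

```python
import threading

# Toy, single-lock sketch of the hazard-pointer protocol.
hazards = {}       # thread id -> object it is currently reading (or None)
retired = []       # (obj, free_fn): unlinked nodes that couldn't be freed yet
_lock = threading.Lock()

def protect(obj):
    # Reader: publish a hazard pointer before touching obj (a real reader
    # must then re-check that obj is still reachable).
    hazards[threading.get_ident()] = obj

def release():
    hazards[threading.get_ident()] = None

def retire(obj, free_fn):
    # Remover: obj is already unlinked; free it now unless someone is reading.
    with _lock:
        if obj in hazards.values():
            retired.append((obj, free_fn))    # defer to the "can't yet" pile
        else:
            free_fn(obj)

def scan():
    # Periodic mop-up of the deferred pile.
    with _lock:
        pending, retired[:] = retired[:], []
        for obj, free_fn in pending:
            if obj in hazards.values():
                retired.append((obj, free_fn))
            else:
                free_fn(obj)

# Single-threaded walk through the lifecycle:
log = []
node = object()
protect(node)                 # a reader is looking at node
retire(node, log.append)      # remover unlinks it; must defer the free
assert log == []
release()                     # reader finishes
scan()                        # the sweep reclaims it now
assert log == [node]
```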
        
         | robertsdionne wrote:
         | Seems to me to be quite similar to the virtual pages in Bw-
         | Trees: https://www.microsoft.com/en-
         | us/research/publication/the-bw-...
        
         | jez wrote:
         | Have you used this before? What was the domain?
        
           | throwaway17_17 wrote:
            | I have seen this general pattern in several decently large
            | game object systems, although I can't think of which ones
            | off hand. If I'm recalling correctly, it tends to be used
            | in the longer-lived (aka 10s of frames) object
            | status/object existence checks, and often in AI subsystems:
            | pathing, general decision-making queries, etc.
        
             | moonchild wrote:
             | Seems like games are an 'easy' case for this kind of thing,
             | since you can synchronise every frame, and arbitrate
             | destruction then; no?
        
               | throwaway17_17 wrote:
               | For the most part. I think most of the usage was to allow
               | more dynamic 'objects' and queries on them to be used in
               | a low friction way without worrying about generational
               | validity checks or null checks when reading data from
                | those objects. It's a lot like implementing a
                | deterministic GC for a small set of memory objects.
        
           | pianoben wrote:
            | These are not commonly used in application code, but are
            | "commonly" used to implement memory reclamation for
            | concurrent mutable data structures in languages without
            | automatic memory management.
        
       | remorses wrote:
        | Struct of arrays (also called MultiArrayList in Zig): instead
        | of storing big structs in an array, you store each field in a
        | separate array, and if you need the full struct you reconstruct
        | it from the arrays.
       | 
        | The benefit is that each array's items are smaller and have no
        | padding; it also improves cache locality.
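A tiny illustration of the layout transform in plain Python (in practice you'd use typed arrays so the columns are actually contiguous in memory, which is where the cache benefit comes from):

```python
def aos_to_soa(records):
    # One column (list) per field instead of one struct per element.
    fields = records[0].keys()
    return {f: [r[f] for r in records] for f in fields}

def soa_row(soa, i):
    # Reconstruct the full "struct" for element i on demand.
    return {f: col[i] for f, col in soa.items()}

aos = [{"time": 1000, "price": 20, "volume": 50},
       {"time": 1001, "price": 25, "volume": 15}]
soa = aos_to_soa(aos)
assert soa["price"] == [20, 25]          # a field scan touches one list
assert soa_row(soa, 1) == {"time": 1001, "price": 25, "volume": 15}
```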
        
         | FridgeSeal wrote:
          | Now I remember reading a blog post recently, about a library
          | (one of the Rust ones, maybe?) that was doing both:
          | array-of-structs-of-arrays.
         | 
         | The idea was to split the main arrays up into "chunks" that
         | were structs, themselves containing rows of data (still array
         | oriented).
         | 
         | The idea was that you retain the cache/layout/performance
         | benefits of SoA layout, except each chunk is already packaged
         | up into lengths ready for pushing through SIMD. Quite a clever
         | idea.
         | 
         | Found it: https://www.rustsim.org/blog/2020/03/23/simd-aosoa-
         | in-nalgeb...
        
         | samanator wrote:
         | Sounds like a primitive of columnar databases.
        
         | fb03 wrote:
         | Entity-Component Systems do exactly this for better cache
         | locality
        
         | celeritascelery wrote:
          | Wouldn't that hurt locality? Since now you need to do
          | multiple accesses across the entire heap to reconstruct one
          | object.
        
           | adwn wrote:
           | You wouldn't typically use this when you need lots of random
           | accesses, but when you process the data sequentially.
           | 
           | It also simplifies and speeds up SIMD loads and stores,
           | because you can load/store entire SIMD registers with a
           | continuous, non-strided memory access.
        
           | DoubleFree wrote:
           | It depends. For operations or aggregates on a single field,
           | it improves cache locality, whereas if operations act on all,
           | or most, fields of the struct, it hurts cache locality. The
           | exact same tradeoff differentiates OLTP (transactional/row
           | stored) databases and OLAP (analytics/column stored)
           | databases.
        
         | cmpolis wrote:
          | this is a great way to store structured data in JS since it
          | saves the memory cost of having repeated keys. e.g.:
          | 
          |     records = {
          |       time: [1000, 1001],
          |       price: [20, 25],
          |       volume: [50, 15]
          |     }
          | 
          |     records = [
          |       { time: 1000, price: 20, volume: 50 },
          |       { time: 1001, price: 25, volume: 15 }
          |     ]
          |     // not a big difference with 2 records, but for xxxx records...
        
       | kcbanner wrote:
        | "This is the story of a clever trick that's been around for at
        | least 35 years, in which array values can be left uninitialized
        | and then read during normal operations, yet the code behaves
        | correctly no matter what garbage is sitting in the array. Like
        | the best programming tricks, this one is the right tool for the
        | job in certain situations. The sleaziness of uninitialized data
        | access is offset by performance improvements: some important
        | operations change from linear to constant time."
       | 
       | https://research.swtch.com/sparse
        
         | once_inc wrote:
         | Requires numbers inserted to be unsigned, and to have a maximum
         | value within reason, and doubles memory usage.
         | 
         | Very nifty trick though!
        
         | Tao3300 wrote:
         | > The _sleaziness_ of uninitialized data access
         | 
          | That's the perfect word for this kind of thing. It's not
          | _wrong_, but it's sleazy alright.
        
         | decebalus1 wrote:
         | I remember this from Programming Pearls and I was going to post
         | about it after reading the comments. Great stuff.
        
         | 1024core wrote:
         | I was going to mention Sparse Arrays, thanks for mentioning it.
         | One of my favorite data structures.
        
         | djmips wrote:
         | I've used this type of data structure in the past for things
         | like managing sprite DMA on old consoles. It's very friendly to
         | old systems with simple memory access patterns.
        
       | nvarsj wrote:
       | Probabilistic data structures are pretty cool.
       | 
       | Count-min sketch [1] is another one. It gives a reasonably
       | accurate count of different events, even millions of unique
       | events, in a fixed memory size. It shows up a bunch in high
        | volume stream processing, and, like the bloom filter, it's
        | easy to understand.
       | 
       | Another cool data structure is HeavyKeeper [2], which was built
       | as an improvement on count-min sketch for one of its use cases:
       | ranking the most frequent events (like for a leaderboard). It can
        | get four nines of accuracy with a small fixed size even for
        | enormous data sets. Redis's TOPK is built on it.
       | 
       | [1]: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
       | 
       | [2]:
       | https://www.usenix.org/system/files/conference/atc18/atc18-g...
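A compact count-min sketch in Python; the width and depth here are arbitrary demo values, and the salted-`hash` scheme is just one easy way to get independent row hashes. The key property: estimates can overshoot (due to collisions) but never undershoot:

```python
import random

class CountMinSketch:
    # A depth x width grid of counters; each row hashes independently.
    def __init__(self, width=2048, depth=4, seed=1):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in zip(self.rows, self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            row[col] += count

    def estimate(self, item):
        # The least-inflated counter: never under-counts, and rarely
        # over-counts by much.
        return min(row[col] for row, col in self._cells(item))

cms = CountMinSketch()
cms.add("GET /index", 10)
cms.add("GET /about", 3)
assert cms.estimate("GET /index") >= 10    # estimates never undershoot
```

Memory is fixed at width x depth counters no matter how many distinct events flow through, which is the whole point for stream processing.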
        
       | Blahah wrote:
       | 1. Probabilistic filtering and matching:
       | 
       | Since you mentioned bloom filters - other probabilistic data
       | structures like count-min sketches (roughly, streaming bloom
       | filters) are super useful. Approximate kmer methods like minhash
       | and w-shingling use them in really cool ways. Rolling hash
       | methods like Rabin chunking also work really nicely with
       | probabilistic/streaming hash tables - splitting the data stream
       | into chunks that can be consistently filtered or matched is
       | useful in many circumstances! Think NSA-level data harvesting, or
       | realtime genome/geospatial data classification.
       | 
       | 2. Checking for, locating and counting subsequences in giant
       | string datasets:
       | 
       | Wavelet trees, FM-indices, suffix arrays more generally - and
       | structures that allow extreme performance in very niche
       | situations with arbitrary encodings like bitvectors and
       | compressed integer vectors. If you have a million books, or all
       | the genomes ever sequenced, you use the first structures to find
       | where a specific phrase can be found, even if you don't quite
       | spell it right. You can use the latter ones to do super fast
       | comparisons of giant datasets - given a million viruses, how
       | similar is each one to the human genome, and where specifically
       | do they most closely match?
       | 
       | 3. Reconstructing structured information from (potentially noisy)
       | fragments:
       | 
        | De Bruijn graphs (and related graphs like string graphs,
        | colored de Bruijns). This is a way to find all the overlap
        | connections
       | between fragments of information, and then weight the possible
       | traversals of those connections by the evidence. These can be
       | represented using the data structures from #1 (FM-indices for
       | example), and efficiently used in some circumstances with those
       | from #2 to enable some kinds of graph algorithms. If you have a
       | shredded set of documents, or a billion short DNA reads from a
       | genome sequencing experiment, this is how you reconstruct the
       | original.
       | 
       | 4. Decentralised coordination structures. Merkle-DAGs and
       | Kademlia DHTs in particular. Being able to compare any trees by
       | root-first hashes, and being able to request the content of the
       | subtree for any node hash from an arbitrary set of peers - these
       | structures enable the p2p web infrastructure. Everything from
       | Limewire to bittorrent, IPFS and blockchains, and, most
       | importantly, Sci-Hub.
       | 
       | 1, 2 and 3 together are some of the fundamentals of computational
       | biology. If you're interested in understanding them,
       | https://rosalind.info/problems/list-view/ is a great starting
       | place.
        
         | dundarious wrote:
         | Funny you should mention bloom filters and that use case, I
         | just re-read this post again this morning:
         | https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
         | 
         | Basically, it makes a similar point to
         | https://news.ycombinator.com/item?id=32186837 -- i.e., cache
         | effects are not modeled in big O analysis, even though most
         | problems that involve large amounts of data (so most problems)
         | are memory bound rather than CPU bound. The blog post makes a
         | good case for a very simple hash table as a better fit for this
         | problem.
        
           | vlmutolo wrote:
           | That Cloudflare article is a little frustrating.
           | 
           | > While we could think of more sophisticated data structures
           | like Cuckoo filter, maybe we can be simpler
           | 
           | Yes, standard Bloom filters fail for large filter sizes
           | and/or very small false-positive rates. But we've known this
           | for decades, and _tons_ of other probabilistic filters have
           | come out since then to address the problem. Cuckoo filters in
           | particular are incredible.
           | 
           | Was there really no easy way to bring in an open-source
           | Cuckoo filter implementation? They're not _that_ complicated.
            | Maybe I'm just used to modern languages having good package
           | managers.
           | 
           | Plus the author's solution is basically the idea behind
           | Cuckoo filters: "what if we just use a hash table, but allow
           | for collisions?" Cuckoo hashing is just clever about
           | guaranteeing O(1) worst-case lookup.
           | 
           | And for absolute dead-simple filters, a blocked Bloom filter
           | is basically just as easy as a Bloom filter. It's like two or
           | three more lines of code. It's just changing the indexing
           | operation to be modulo some large "block" size. That said, I
           | don't think they'd work too well in this case, since the
           | author has a pretty small false-positive rate (0.000019), and
            | in my experience small Bloom filters (which are what
            | comprise blocked Bloom filters) don't handle small FP rates
            | well.
            | 
            | But guess what excels at small FP rates... cuckoo filters.
        
             | [deleted]
        
             | dundarious wrote:
             | A linear probing hash table is simpler. That's the trade
             | off they were going for at that time. I don't think that's
             | the most efficient solution given the hardware they were
             | using, and I don't think the blog author would either, but
             | it's certainly easier to write such a hash table -- and
             | it's well written, but still mostly interview level stuff.
             | 
             | To me the blog post is not about cuckoo or bloom filters or
             | hash tables at all. It's about profiling and highlighting
             | that a naive reading of the literature can easily lead you
             | astray (worse performance and complexity). In school they
             | don't teach you that mov is the biggest cycle-eater of them
             | all -- at least it's not the lesson people remember.
        
               | xxs wrote:
                | Cuckoo is terrible from a constant-cost point of view;
                | linear probing is where it's at, indeed.
               | 
               | >"mov" is the biggest cycle-eater of them all.
               | 
               | The R part in 'RAM' is so wrong nowadays.
        
           | pjscott wrote:
           | If you're willing to use moderately more memory, you can get
           | much better cache locality using a _blocked_ bloom filter.
           | The basic idea is to use an array of bloom filters that each
           | fit into a cache line, and use some bits of hash to determine
           | which sub-filter each item goes into. Suddenly each lookup or
           | insertion has at most one cache miss!
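A sketch of that blocked variant: one hash selects a cache-line-sized block, and all k probe bits land inside that block, so a query touches a single cache line. Block count and k are illustrative:

```python
import hashlib

BLOCK_BITS = 512                     # one 64-byte cache line per block

class BlockedBloom:
    def __init__(self, num_blocks=1024, k=4):
        self.blocks = [0] * num_blocks   # each int models one cache line
        self.k = k

    def _bits(self, item):
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest(), "big")
        block = h % len(self.blocks)     # one slice of hash picks the block
        h //= len(self.blocks)
        for _ in range(self.k):          # the rest index bits inside it
            yield block, h % BLOCK_BITS
            h //= BLOCK_BITS

    def add(self, item):
        for b, bit in self._bits(item):
            self.blocks[b] |= 1 << bit

    def __contains__(self, item):
        return all(self.blocks[b] >> bit & 1 for b, bit in self._bits(item))

bb = BlockedBloom()
for ip in ["203.0.113.7", "198.51.100.23"]:
    bb.add(ip)
assert all(ip in bb for ip in ["203.0.113.7", "198.51.100.23"])
```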
        
       | ur-whale wrote:
       | Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing
       | 
       | Hash tables that come with a parameter that lets you control
       | density vs. speed.
        
       | ImJasonH wrote:
       | Nested Set Model: a way to model nested sets of entities and
       | query them in a SQL database. I haven't used it for anything but
       | it seems fun.
       | 
       | https://en.m.wikipedia.org/wiki/Nested_set_model
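The encoding itself is easy to sketch outside SQL: number every node with a (left, right) pair via DFS, and "descendants of X" becomes a simple range predicate, which is exactly what the SQL query checks:

```python
def nested_set(tree):
    # DFS numbering: node Y is a descendant of X exactly when
    # X.left < Y.left and Y.right < X.right.
    out = {}
    def visit(node, children, n):
        left = n
        for child, grandchildren in children.items():
            n = visit(child, grandchildren, n + 1)
        out[node] = (left, n + 1)
        return n + 1
    root, kids = next(iter(tree.items()))
    visit(root, kids, 1)
    return out

def descendants(ns, node):
    l, r = ns[node]
    return {k for k, (a, b) in ns.items() if l < a and b < r}

tree = {"clothing": {"mens": {"suits": {}},
                     "womens": {"dresses": {}, "skirts": {}}}}
ns = nested_set(tree)
assert ns["clothing"] == (1, 12)
assert descendants(ns, "womens") == {"dresses", "skirts"}
```

The appeal in SQL is that the descendants query needs no recursion, just a `WHERE left > :l AND right < :r` range scan.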
        
         | NicoJuicy wrote:
         | Used it for tag hierarchy, liked it a lot!
        
       | alfiedotwtf wrote:
       | The XOR Linked List blew my mind when I first read about it.
       | Clever!
       | 
       | https://en.wikipedia.org/wiki/XOR_linked_list
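Python has no raw pointers, but the trick can be simulated with array indices: each node stores prev XOR next in a single link field, so one word serves traversal in both directions (index 0 plays the role of NULL here):

```python
class XorList:
    def __init__(self):
        self.vals = [None]            # slot 0 reserved as the null node
        self.links = [0]
        self.head = self.tail = 0

    def append(self, v):
        idx = len(self.vals)
        self.vals.append(v)
        self.links.append(self.tail ^ 0)      # prev = old tail, next = null
        if self.tail:
            self.links[self.tail] ^= idx      # fold the new neighbour in
        else:
            self.head = idx
        self.tail = idx

    def traverse(self):
        # You need the previous index to peel the next one out of the XOR,
        # which is why you can't start walking from an arbitrary node.
        prev, cur, out = 0, self.head, []
        while cur:
            out.append(self.vals[cur])
            prev, cur = cur, self.links[cur] ^ prev
        return out

xl = XorList()
for v in [1, 2, 3]:
    xl.append(v)
assert xl.traverse() == [1, 2, 3]
```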
        
         | freemint wrote:
          | Horrible for today's hardware, but I also came to say this.
        
       | myaccount80 wrote:
       | Dancing links https://en.m.wikipedia.org/wiki/Dancing_Links
       | 
       | " In computer science, dancing links (DLX) is a technique for
       | adding and deleting a node from a circular doubly linked list. It
       | is particularly useful for efficiently implementing backtracking
       | algorithms, such as Donald Knuth's Algorithm X for the exact
       | cover problem.[1] Algorithm X is a recursive, nondeterministic,
       | depth-first, backtracking algorithm that finds all solutions to
       | the exact cover problem. Some of the better-known exact cover
       | problems include tiling, the n queens problem, and Sudoku."
        
       | rockwotj wrote:
       | Interval trees. Really cool trees that allow fast lookups of all
       | the ranges that contain a given point.
       | 
       | https://en.wikipedia.org/wiki/Interval_tree
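A centered interval tree, sketched: intervals overlapping the node's center point live at the node, the rest recurse left or right; a stabbing query then skips every subtree that can't contain the point:

```python
class IntervalTree:
    def __init__(self, intervals):
        if not intervals:
            self.center = None
            return
        pts = sorted(p for iv in intervals for p in iv)
        self.center = pts[len(pts) // 2]   # median endpoint as the center
        self.overlap = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.left = IntervalTree(left) if left else None
        self.right = IntervalTree(right) if right else None

    def stab(self, x):
        # All stored intervals containing the point x.
        if self.center is None:
            return []
        hits = [iv for iv in self.overlap if iv[0] <= x <= iv[1]]
        if x < self.center and self.left:
            hits += self.left.stab(x)
        elif x > self.center and self.right:
            hits += self.right.stab(x)
        return hits

tree = IntervalTree([(1, 5), (3, 7), (6, 9)])
assert sorted(tree.stab(4)) == [(1, 5), (3, 7)]
assert tree.stab(8) == [(6, 9)]
```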
        
         | maxFlow wrote:
         | Found out about this DS in AoC 2021 Day 20+ problems if I
         | remember correctly. Pretty interesting.
        
       | swizzler wrote:
       | I like consistent hashing
        | (https://en.m.wikipedia.org/wiki/Consistent_hashing). When a
        | hash table (or load-balancing pool) needs to be resized,
        | consistent hashing minimizes the number of keys (clients) that
        | need to be remapped.
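A minimal hash ring with virtual nodes; MD5 is used only as a cheap, stable hash, and 100 vnodes per node is an arbitrary choice. Adding a node only captures the keys that now fall on its points; every other key keeps its old owner:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((self._h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))

    @staticmethod
    def _h(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def lookup(self, key):
        # Walk clockwise to the first vnode at or after the key's hash.
        i = bisect.bisect(self.ring, (self._h(key),))
        return self.ring[i % len(self.ring)][1]

before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
keys = [f"key{i}" for i in range(500)]
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
# Every key that moved went to the new node; nothing else was reshuffled.
assert all(after.lookup(k) == "d" for k in moved)
```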
        
         | pianoben wrote:
         | Even simpler is Weighted Rendezvous Hashing (https://www.snia.o
         | rg/sites/default/files/SDC15_presentations...).
         | 
         | It's quite a bit easier to implement and verify than consistent
         | hashing, and carries the same benefits (minimal reshuffling on
         | ring resizes, etc)
        
           | eru wrote:
           | I recently updated the Python example on the Wikipedia page
           | for Weighted Rendezvous Hashing. (I'm amazed my edit seems to
           | be still there.)
           | 
           | https://en.wikipedia.org/wiki/Rendezvous_hashing
        
           | feurio wrote:
           | In a similar vein, Maglev hashing has a more uniform
           | distribution and decent reshuffling. I couldn't get my head
            | around the explanations that I had seen until I read this
            | one:
           | 
           | https://www.usenix.org/sites/default/files/conference/protec.
           | ..
        
       | nailer wrote:
       | > Lets you test if a value is definitely NOT in a list of pre-
       | stored values
       | 
        | wouldn't a hashmap / dict / object do this in O(1)? Input goes
        | in, is hashed, and the key either exists or doesn't?
        
         | tomsmeding wrote:
          | The point of a Bloom filter is that it takes far less memory
          | than actually storing the keys. (The tradeoff is the
          | possibility of false positives, of course.)
        
       | strawhatguy wrote:
        | I actually got to use a k-d tree once. I was naively brute
        | forcing tiles together according to their 2D positions (the
        | tiles being part of a larger image), and it was perfect.
        | 
        | The old approach would run out of memory on a 96 GB machine;
        | afterwards it could process the tiles as fast as they came in
        | from the camera, with only a couple held in memory at a time.
       | 
       | When you have the right data structure, man it's like night and
       | day performance!
        
       | mijoharas wrote:
       | My favorite one that I've had the chance to use professionally is
       | the Marisa trie[0].
       | 
       | Context is a data scientist had written a service that
       | essentially was just lookups on a trie. He'd left and the service
       | was starting to have memory problems so I dug in and swapped the
        | implementation. IIRC, swapping the trie implementation changed
        | memory usage from 8 GB to 100 MB and sped everything up as
        | well.
       | 
        | The actual data structure is equivalent to a trie, but cannot
        | be modified once it's been built (I think it may be the same as
        | a LOUDS trie; I don't remember the specifics).
       | 
       | [0] https://github.com/s-yata/marisa-trie
        
       | nemo1618 wrote:
       | My contribution: the Binary Numeral Tree:
       | https://eprint.iacr.org/2021/038
       | 
       | This is a tree-like structure where the structure of the tree is
       | fully determined by the number of leaves N. Specifically, each
       | 1-bit in the binary representation of N corresponds to one of the
       | tree's perfect binary subtrees. For that reason, I think of it
       | more as "structure imposed upon a list" rather than a typical
       | mutable tree with pointers and such.
       | 
       | What are the applications of this data structure? I am only aware
       | of one: computing the Merkle root of a large amount of data in
       | streaming fashion. The funny thing is that it's been discovered
       | independently multiple times: in BLAKE3, Sia, Utreexo, and more.
       | Maybe someday, someone will find another use for it. :)
        
         | zachrip wrote:
         | Can you dumb this down for me?
        
       | nathas wrote:
       | Hashed time wheels: https://blog.acolyer.org/2015/11/23/hashed-
       | and-hierarchical-...
       | 
       | Great for short-lived values in a cache that are frequently
       | accessed. If all of your values always have the same expiration
       | period (i.e. expires at insert time + N period) then it's super
       | efficient.
       | 
       | More or less you have an array with a head and tail. Every "tick"
       | (usually each second but you can customize it) you move the read
       | pointer ahead by 1 array slot. The tail is then evicted.
       | 
       | If you have a highly accessed part of your code, you can be lazy
       | and avoid using an extra thread and just tick ahead the read
       | pointer when the time changes.
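With a fixed TTL the wheel degenerates to a ring of buckets, a cursor, and one bucket eviction per tick. A sketch (TTL measured in abstract ticks; tick granularity is up to you):

```python
class TimerWheel:
    # Fixed-TTL wheel: every entry expires exactly `ttl` ticks after
    # insertion, so one ring of buckets is all the bookkeeping needed.
    def __init__(self, ttl):
        self.slots = [set() for _ in range(ttl + 1)]
        self.ttl = ttl
        self.cur = 0

    def add(self, item):
        # Drop the item into the bucket the cursor reaches in `ttl` ticks.
        self.slots[(self.cur + self.ttl) % len(self.slots)].add(item)

    def tick(self):
        # Advance one tick; everything in the bucket we land on is expired.
        self.cur = (self.cur + 1) % len(self.slots)
        expired, self.slots[self.cur] = self.slots[self.cur], set()
        return expired

w = TimerWheel(ttl=3)
w.add("session-1")
assert w.tick() == set() and w.tick() == set()
assert w.tick() == {"session-1"}      # expires on the third tick
```

Insertion and expiry are O(1) per item regardless of how many timers are pending, which is what makes it attractive for hot paths.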
        
       | DeathArrow wrote:
       | >Good use-case: routing. Say you have a list of 1 million IPs
       | that are black listed. A trivial algorithm would be to compare
       | every element of the set with a given IP. The time complexity
       | grows with the number of elements. Not so with a bloom filter! A
       | bloom filter is one of the few data structures whose time
       | complexity does not grow with the number of elements due to the
       | 'keys' not needing to be stored ('search' and 'insert' is based
       | on the number of hash functions.)
       | 
        | The easy and efficient way to test whether a value is in a set
        | is to use a hash set or dictionary. Lookup is O(1) on average.
        
         | kushan2020 wrote:
         | OP is probably mentioning bloom filter because it is space
          | efficient. A good filter will use less space than the entire
          | hash set, making it a good candidate to be kept in RAM to
          | filter out requests and reduce expensive calls to read disk.
        
       | beeforpork wrote:
       | Cuckoo hashing is really nice: fast O(1) access, amortized O(1)
       | insert. And really space efficient, and it can be parameterized
       | easily for a space efficiency vs. access time trade-off (it is
       | always O(1), but with larger constants for better space
       | efficiency). It is also very simple: basically just an array
       | (plus an algorithm). Unexpectedly simple in my opinion for its
       | good time and space performance.
       | 
       | https://en.wikipedia.org/wiki/Cuckoo_hashing
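A bare-bones cuckoo table to show the shape of it: two arrays, two hash functions, and inserts that evict ("kick") the incumbent to its alternate slot. Lookups probe at most two slots, hence the worst-case O(1) access. This sketch assumes distinct keys and uses a crude grow-on-cycle policy:

```python
class CuckooHash:
    def __init__(self, cap=16):
        self.cap = cap
        self.t1 = [None] * cap
        self.t2 = [None] * cap

    def _h1(self, k): return hash(("h1", k)) % self.cap
    def _h2(self, k): return hash(("h2", k)) % self.cap

    def get(self, k):
        # At most two probes, so lookup is worst-case O(1).
        for t, h in ((self.t1, self._h1(k)), (self.t2, self._h2(k))):
            if t[h] is not None and t[h][0] == k:
                return t[h][1]
        return None

    def put(self, k, v):
        entry = (k, v)
        for _ in range(32):                   # bounded kick chain
            h = self._h1(entry[0])
            entry, self.t1[h] = self.t1[h], entry
            if entry is None:
                return
            h = self._h2(entry[0])            # evicted item tries its
            entry, self.t2[h] = self.t2[h], entry   # alternate slot
            if entry is None:
                return
        self._grow(entry)                     # likely a cycle: rebuild

    def _grow(self, pending):
        items = [e for e in self.t1 + self.t2 if e] + [pending]
        self.__init__(self.cap * 2)
        for k, v in items:
            self.put(k, v)

table = CuckooHash()
for i in range(50):
    table.put(i, i * i)
assert all(table.get(i) == i * i for i in range(50))
```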
        
       | ben__bradley wrote:
       | Bitboards for chess board representation, the fastest and most
       | complex such representation:
       | https://www.chessprogramming.org/Bitboards Other structures for
       | chess are easier to follow and understand:
       | https://www.chessprogramming.org/Board_Representation
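The flavor of bitboard code, sketched in Python with 64-bit ints (square 0 = a1, ..., 63 = h8): one shift-and-mask per direction, so generating all king moves takes a handful of bitwise ops with no loops over squares:

```python
# 64-bit bitboards: bit i represents square i, with a1 = 0 ... h8 = 63.
FILE_A = 0x0101010101010101
FILE_H = FILE_A << 7
FULL = (1 << 64) - 1

def north(bb): return (bb << 8) & FULL
def south(bb): return bb >> 8
def east(bb):  return (bb << 1) & ~FILE_A & FULL   # mask stops file wrap
def west(bb):  return (bb >> 1) & ~FILE_H

def king_moves(bb):
    # Classic trick: widen to a 3-wide row, then shift the row up and down.
    attacks = east(bb) | west(bb)
    row = bb | attacks
    return attacks | north(row) | south(row)

e4 = 1 << 28
assert bin(king_moves(e4)).count("1") == 8    # king mid-board: 8 squares
a1 = 1 << 0
assert bin(king_moves(a1)).count("1") == 3    # corner king: only 3
```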
        
       | karsinkk wrote:
       | Lazy Adaptive Trees :
       | https://dl.acm.org/doi/10.14778/1687627.1687669
       | 
       | It is a B-Tree where updates to the tree are buffered in the
       | branch nodes and are propagated down to the leaf nodes at a later
       | point in time.
        
       | nathell wrote:
       | Directed acyclic word graphs (DAWGs)!
       | 
       | They're like tries, but with identical subtrees glued together.
       | They're capable of encoding real-world dictionaries of millions
       | of words into ~1 byte per word, when stored cleverly (see [0]).
        | They let you do things like efficiently finding all the words
        | in a dictionary whose Levenshtein distance to a given word is
        | less than x. Their cousins, GADDAGs, are _the_ data structure
        | of choice in Scrabble-playing engines.
       | 
       | [0]: http://www.jandaciuk.pl/fsa.html
        
         | danieldk wrote:
         | Also related: Levenshtein automata -- automata for words that
         | match every word within a given Levenshtein distance. The
         | intersection of a Levenshtein automaton of a word and a DAWG
         | gives you an automaton of all words within the given edit
         | distance.
         | 
         | I haven't done any Java in years, but I made a Java package in
         | 2013 that supports: DAWGs, Levenshtein automata and perfect
         | hash automata:
         | 
         | https://github.com/danieldk/dictomaton
        
       | cdelahousse wrote:
       | Min-Max Heaps
       | 
        | Think of them as double-ended priority queues: O(n)
        | construction, O(log n) mutations, and O(1) access to both the
        | min and the max.
        | 
        | The paper is short and the data structure is elegant. I read it
        | a few years ago at uni, and it made me appreciate how something
        | greatly useful can be implemented very simply.
       | 
       | https://dl.acm.org/doi/pdf/10.1145/6617.6621
        
       | dahart wrote:
       | The Suffix Array is surprisingly cool and useful
       | https://en.wikipedia.org/wiki/Suffix_array
       | 
       | The Alias Table for sampling a discrete distribution in O(1) time
       | is a very clever idea https://www.keithschwarz.com/darts-dice-
       | coins/
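A suffix array is just the indices of a string's suffixes in sorted order; even the naive construction plus a binary search gives substring queries in O(m log n). (Linear-time constructions like SA-IS exist but are far more involved.)

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction; fine for illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])

def contains(s, sa, pat):
    # Binary search for the first suffix >= pat, then check its prefix.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(pat)] < pat:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(pat)] == pat

sa = suffix_array("banana")
assert sa == [5, 3, 1, 0, 4, 2]     # a, ana, anana, banana, na, nana
assert contains("banana", sa, "nan")
assert not contains("banana", sa, "nab")
```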
        
         | dinobones wrote:
         | The fact that suffix trees/suffix arrays can be built in O(n)
         | is IMO the most surprising and "magical" result in computer
         | science. To me, more surprising than median of medians/quick
         | select.
         | 
         | There might be more "magical"/surprising things that are
         | obscure and I am unaware of, but to me, this is it. :)
        
           | adgjlsfhk1 wrote:
           | what about randomized O(n) minimum spanning trees?
        
       | kragen wrote:
       | Here's a few fun ones: the Burrows-Wheeler transform, suffix
       | arrays in general, Kanerva's fully distributed representation,
       | reduced-affine-arithmetic numbers, LSM trees, binary array sets,
       | sparse array sets, and gap buffers. (I'm not sure how nobody had
       | mentioned gap buffers yet on the thread, but if they did I didn't
       | see it.)
       | 
       | -- *** --
       | 
       | The Burrows-Wheeler transform is the _second_ character of each
       | suffix in a (cyclic) suffix array, and there turns out to be a
       | simple algorithm to recover the original text from this
       | transform; live demo: http://canonical.org/~kragen/sw/bwt. But
       | the new text is very easily compressible, which is the basis for
       | bzip2. As mentioned by Blahah and dahart, this is only one of
       | many interesting things you can do with suffix arrays; they can
       | also do efficient regular expression search, for example. This is
       | more interesting since three practical linear-time and -space
       | suffix array construction algorithms were discovered in the early
       | 02000s.
       | 
       | -- *** --
       | 
       | Pentti Kanerva devised a "fully distributed representation" in
       | the 01990s, which has some features in common with bloom filters
       | -- every association stored in an FDR record is stored to an
       | equal extent (probabilistically) in every bit of the record, so
       | erasing some of the bits erases none of the associations but
       | increases the error probability for all of them. You assign a
       | random or pseudorandom bitvector (of, say, 16384 bits) to every
       | atomic value to be stored, represent an association of two (or
       | more) atoms as their XOR, and then take the bitwise majority rule
       | of all the associations to be stored as the record. To retrieve
       | an association, you XOR the record with an atom, then compute the
       | correlation of the XOR result with all possible atoms; the
       | correlation with the correct atoms will be many standard
       | deviations away from the mean, unless the number of associations
       | stored in the record is high enough that conventional data
       | structures wouldn't have been able to store it either.
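Kanerva's scheme is compact enough to sketch end to end. Below is a toy illustration (not Kanerva's code; the dimensionality, names, and example record are invented): atoms are random bitvectors, pairs are bound with XOR, the record is a bitwise majority vote, and retrieval unbinds and picks the most correlated atom.

```python
import random

D = 4096  # dimensionality of the bitvectors
random.seed(0)

def atom():
    # a random D-bit vector, represented as a Python int
    return random.getrandbits(D)

def majority(vs):
    # bitwise majority vote across an odd number of vectors
    out = 0
    for b in range(D):
        if sum((v >> b) & 1 for v in vs) * 2 > len(vs):
            out |= 1 << b
    return out

def similarity(a, b):
    return D - bin(a ^ b).count("1")  # number of matching bits

# store the associations color=red, shape=circle, size=big in one record
names = ["color", "red", "shape", "circle", "size", "big"]
A = {n: atom() for n in names}
record = majority([A["color"] ^ A["red"],
                   A["shape"] ^ A["circle"],
                   A["size"] ^ A["big"]])

# retrieval: unbind with "shape"; the nearest atom should be "circle",
# many standard deviations above the ~D/2 similarity of unrelated atoms
query = record ^ A["shape"]
best = max(names, key=lambda n: similarity(query, A[n]))
```

With three stored associations, each bound pair agrees with the majority record in roughly 75% of bits, so the correct atom stands far out from the ~50% agreement of unrelated atoms.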
       | 
       | This and some related work has blossomed into a field called
       | "hyperdimensional computing" and "vector symbolic architectures".
       | But the problem of finding the nearest atom or atoms seems, on
       | its face, to make this impractical: if you have 16384 possible
       | atoms and 16384 bits in your bitvector, you would seem to need at
       | least 16777216 bit operations to compute all the associations.
       | 
       | I realized the other night that if you use the rows of a Walsh-
       | Hadamard matrix as the "pseudorandom" atom representations, you
       | can compute all the correlations in linearithmic time with the
       | fast Walsh-Hadamard transform. Moreover, Cohn and Lempel's result
       | from 01977 "on fast M-sequence transforms" is that after a simple
       | transposition you can also use the FWHT to compute all the
       | correlations with the complete output of a maximal LFSR (a so-
       | called "M-sequence") in linearithmic time, so you can also use
       | the cyclic shifts of an LFSR output for your atoms without losing
       | efficiency. Probably someone else has already figured this out
       | previously, but I hadn't found it in the literature.
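For reference, the FWHT itself is tiny; a minimal in-place sketch (length must be a power of two), which is its own inverse up to a factor of n:

```python
def fwht(a):
    # in-place fast Walsh-Hadamard transform; len(a) must be a power of 2
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                # butterfly: sum and difference of paired elements
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a
```

Applying it twice returns the input scaled by n, e.g. `fwht(fwht([1, 2, 3, 4]))` gives `[4, 8, 12, 16]`.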
       | 
       | -- *** --
       | 
        | I'm not sure if the quantities used in reduced affine arithmetic
        | count as a "data structure"? They're a data _type_, at least, but
        | they're not a _container_. The idea is that you execute some
        | arbitrary numerical algorithm, not merely on floating-point
        | numbers, or even on dual numbers as in forward-mode automatic
        | differentiation, nor even yet on (min, max) intervals as in
        | ordinary interval arithmetic, but on a vector of coefficients
        | [a_0, a_1, a_2, ..., a_n], which represents a linear function
        | a_0 x_0 + a_1 x_1 + a_2 x_2 + ... + a_n x_n, where x_0 = 1 and
        | each of the other x_i ∈ [-1, 1], so this function is really a
        | representation of an interval. Then you can assign each of the
        | x_i (except x_0 and x_n) to be the unknown error in one of the
        | input parameters to your algorithm, whose magnitude is determined
        | by the a_i in the value of that parameter; and when you introduce
        | rounding error, you increase a_n enough to account for it. (The
        | non-reduced kind of affine arithmetic just increases n without
        | limit to account for new roundoff errors.)
       | 
        | Reduced affine arithmetic has two advantages over the usual kind
        | of interval arithmetic. First, it cancels errors correctly; if
        | you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with
        | some uncertainty in the value of y, you usually get an
        | unnecessarily loose bound on the result, because interval
        | arithmetic assumes that the uncertainties in the numerator and
        | the denominator are uncorrelated, when in fact for many values of
        | y they partly cancel out. (You can't apply this cancelation to
        | a_n, the error stemming from all the roundoffs in your algorithm,
        | just the other a_i.) But the more interesting result is that RAA
        | gives you a linear function that approximates the calculation
        | result _as a linear function of the inputs_, which is pretty
        | similar to calculating the gradient -- but with error bounds on
        | the approximation!
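The cancellation property is easy to demonstrate for addition and subtraction, where affine forms are exact. A minimal sketch (invented class name; multiplication omitted, since it needs a fresh noise term for the nonlinear part):

```python
class Affine:
    # value = c + sum(coeff * eps for each noise symbol), eps in [-1, 1]
    def __init__(self, center, terms=None):
        self.c = center
        self.terms = dict(terms or {})  # noise symbol -> coefficient

    def __add__(self, o):
        t = dict(self.terms)
        for k, v in o.terms.items():
            t[k] = t.get(k, 0) + v
        return Affine(self.c + o.c, t)

    def __sub__(self, o):
        t = dict(self.terms)
        for k, v in o.terms.items():
            t[k] = t.get(k, 0) - v
        return Affine(self.c - o.c, t)

    def interval(self):
        r = sum(abs(v) for v in self.terms.values())
        return (self.c - r, self.c + r)

# y = 3 +/- 1, sharing one noise symbol; (y + 1) - y is exactly 1,
# where naive interval arithmetic would report [0, 2]
y = Affine(3, {"e1": 1})
one = Affine(1)
print(((y + one) - y).interval())  # -> (1, 1)
```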
       | 
       | -- *** --
       | 
       | LSM trees are only about as "obscure" as bloom filters, since
       | they underlie several prominent filesystems, LevelDB, and most of
       | the world's search engines, but if you don't know about bloom
       | filters, you might not know about LSM trees either. I wrote a
       | tutorial explanation of LSM trees in
       | https://github.com/kragen/500lines/tree/master/search-engine,
       | though the name "LSM tree" hadn't yet gone mainstream.
       | 
       | Binary array sets, as described in
       | https://www.nayuki.io/page/binary-array-set, are closely related
       | to LSM trees, to the point that you could describe them as an
       | application of LSM trees. They are a simple data structure for
       | the exact version of the problem bloom filters solve
       | approximately: maintaining a set S of items supporting the
       | operations S.add(item) and S.contains?(item). Insertion (.add) is
       | amortized constant time, and membership testing (.contains?) is
        | O(log^2 N).
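A sketch of the idea in Python, following the nayuki.io description (this is illustrative code, not nayuki's): keep sorted runs whose lengths are distinct powers of two, and merge equal-sized runs on insert, like carries in binary addition.

```python
import bisect

class BinaryArraySet:
    def __init__(self):
        # runs[k] is None or a sorted list of length 2**k
        self.runs = []

    def add(self, item):
        carry = [item]
        k = 0
        while True:
            if k == len(self.runs):
                self.runs.append(carry)
                return
            if self.runs[k] is None:
                self.runs[k] = carry
                return
            # merge two runs of length 2**k into one of length 2**(k+1),
            # like a carry in binary addition (could be a linear merge)
            carry = sorted(carry + self.runs[k])
            self.runs[k] = None
            k += 1

    def __contains__(self, item):
        # binary-search each of the O(log N) runs
        for run in self.runs:
            if run is not None:
                i = bisect.bisect_left(run, item)
                if i < len(run) and run[i] == item:
                    return True
        return False
```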
       | 
       | -- *** --
       | 
        | A different and more efficient data structure for the exact set
        | membership problem, limited to the case where the items to be
        | stored are up to M integers in the range [0, N), supports
        | _constant-time_ creation, clearing, insertion, size measurement,
        | and membership testing, and linear-time enumeration, but I think
        | cannot be implemented in modern standard C because of its rules
        | about reading uninitialized data. (By allowing creation to be
        | linear time you can implement it in standard C.) It uses two
        | arrays, A[M] and B[N], and a variable n, and i is a member of
        | the set iff B[i] < n ∧ A[B[i]] == i, from which it is
        | straightforward to work out the other methods.
       | 
       | I don't remember what this data structure is called, but I
       | definitely didn't invent it. I described it and some uses for it
       | in https://dercuano.github.io/notes/constant-time-sets-for-
       | pixe.... You can associate additional values with this structure
       | to use it as a sparse array.
       | 
       | Oh, I see kcbanner posted this one too, with a link to Russ Cox's
       | writeup, which is better than mine and explains that it was
       | published by Briggs and Torczon in 01993, but probably not
       | originally invented because it was an exercise in an Aho,
       | Hopcroft, and Ullman textbook in 01974 and in _Programming
       | Pearls_ : https://research.swtch.com/sparse
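For concreteness, a sketch of the Briggs-Torczon structure in Python (where lists must be initialized anyway, so creation is O(N) here; the O(1)-creation trick depends on leaving B uninitialized):

```python
class SparseSet:
    # stores a subset of the integers [0, N) with O(1) clear/add/contains
    def __init__(self, capacity):
        self.A = [0] * capacity  # the members, packed at the front
        self.B = [0] * capacity  # claimed index of each value in A
        self.n = 0               # current number of members

    def __contains__(self, i):
        # B[i] is trustworthy only if A confirms it points back at i
        j = self.B[i]
        return j < self.n and self.A[j] == i

    def add(self, i):
        if i not in self:
            self.A[self.n] = i
            self.B[i] = self.n
            self.n += 1

    def __len__(self):
        return self.n

    def clear(self):
        self.n = 0  # O(1): stale A/B entries are harmless

    def __iter__(self):
        return iter(self.A[:self.n])  # linear-time enumeration
```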
       | 
       | -- *** --
       | 
       | A gap buffer is a data structure sort of similar to Huet's zipper
       | but without the dynamic allocation; Emacs famously uses them for
        | its editing buffers. It consists of an array containing a
       | leading chunk of text, followed by a gap, followed by a trailing
       | chunk of text. Edits are always made in the gap; to start editing
       | in a different position, you have to move some text from the end
       | of the leading chunk to the beginning of the trailing chunk or
       | vice versa.
       | 
       | Suppose you want to edit a 531-byte file; you allocate a larger
       | buffer, say 1024 bytes, and read those 531 bytes into the
       | beginning of the buffer. If the user starts inserting text after
       | byte 200, you begin by moving bytes 201-530 (330 bytes) to the
       | end of the 1024-byte buffer (positions 694-1024); then you can
       | insert the user's text at positions 201, 202, 203, etc., without
       | having to do any more work to move text. If the user moves up to
       | byte 190 and inserts more text, you only have to move bytes
       | 191-203 or whatever to the end of the buffer.
       | 
       | A gap buffer is very fast in the usual case of making a sequence
       | of small edits at the same location or close-together locations,
       | and because the constant factor of copying a large block of bytes
       | from one place to another is so small, even moving the gap from
       | one end of a buffer to another is generally plenty fast enough
       | for interactive use. Moreover, operations like string search are
       | very fast in a gap buffer, while they tend to be very slow in
       | fancier data structures with better asymptotic complexity, such
       | as piece tables and ropes; and it uses a predictable and
       | relatively small amount of extra memory.
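A minimal sketch of a gap buffer (illustrative, fixed capacity; real implementations grow the buffer when the gap fills):

```python
class GapBuffer:
    def __init__(self, text="", capacity=1024):
        self.buf = list(text) + [None] * (capacity - len(text))
        self.gap_start = len(text)   # first free slot
        self.gap_end = capacity      # one past the last free slot

    def move_cursor(self, pos):
        # shift text across the gap so the gap starts at `pos`
        while self.gap_start > pos:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, s):
        # inserting at the cursor just fills the gap: O(1) per character
        for ch in s:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```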
       | 
       | -- *** --
       | 
       | However, like FORTH, fancy data structures are a bit of an
       | attractive nuisance: they tempt you to get into trouble. When I
       | first came across Sedgewick's _Algorithms in C_ it was a
       | revelation to me: here were so many clever ways of breaking down
       | problems that made apparently impossible problems easy. (I had
        | come across sorting algorithms before but I didn't really
       | understand them.) I immediately and erroneously concluded that
       | this was the knowledge I had been missing to program all the
       | things that I didn't know how to program, and I began my
       | counterproductive data-structure-fetishism phase.
       | 
       | Much better advice is found in _The Practice of Programming_ :
       | almost all programs can be written without any data structures
       | but arrays, hash tables, linked lists, and, for things like
       | parsing, symbolic algebra, or filesystems, trees. So just use
       | those if you can.
       | 
       | Occasionally, a program is only possible, or is dramatically
       | better, using fancy data structures like LSM-trees, bloom
       | filters, suffix arrays, skip lists, HAMTs, union find, zippers,
       | treaps, BDDs, ropes, gap buffers, Fenwick trees, splay trees, and
       | so on. And that can be super fun! But this situation is pretty
       | rare, and it's important not to get yourself into the position
       | where you're implementing a red-black tree or HAMTs when it would
       | have been a lot easier to write your program in a brute-force
       | way. Or where your successors are fixing bugs in your Fenwick
       | tree implementation.
       | 
       | Also, it's important to be aware that, quite aside from the
       | difficulty of writing and understanding the code, fancy data
       | structures often have extra costs _at runtime_ which can outweigh
        | their benefits. It's easy to discover that your splay tree runs
       | _slower_ than sequential search over an unsorted array, for
       | example, and of course it uses much more RAM.
        
         | eru wrote:
         | > Much better advice is found in The Practice of Programming:
         | almost all programs can be written without any data structures
         | but arrays, hash tables, linked lists, and, for things like
         | parsing, symbolic algebra, or filesystems, trees. So just use
         | those if you can.
         | 
         | In general, as much as possible use stuff you already have good
         | libraries for.
         | 
         | Often, you can get away with much simpler data structures with
         | a bit of clever pre-sorting (or sometimes: random shuffling).
         | Relatedly, batching your operations can also often make your
         | data structure needs simpler.
        
           | kragen wrote:
           | Those are good tips.
           | 
           | I'm not that enthusiastic about libraries. Yes, using a good
           | LSM-tree library will keep you and your successors from
           | spending holiday weekends debugging your LSM-tree
           | implementation -- probably, because "good" doesn't mean
           | "perfect". But it won't be a tenth as fast as an in-RAM hash
           | table, if an in-RAM hash table can do the job. And CPython's
           | standard hash table usually won't be a tenth as fast as a
           | domain-specific hash table optimized for your needs.
           | 
           | That doesn't always matter very much, but it does always
           | matter. Especially now with the advent of good property-based
           | testing libraries like Hypothesis, if you can run a function
           | call in 0.1 ms instead of 1.0 ms, you can do ten times as
           | much testing with a given amount of resources.
        
       | vnorilo wrote:
       | One of my favourites is treap [1], which doubles as a great
        | excuse to forget how to balance binary trees. It is "probably
        | efficient" in a similar way to HAMT, mentioned in other comments;
        | both require a well-behaved distribution from the hash function.
       | 
       | 1: https://en.m.wikipedia.org/wiki/Treap
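A sketch of treap insertion (random priorities standing in for the hash-derived ones; rotations restore the heap property while preserving BST order):

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.pri = random.random()   # random heap priority
        self.left = self.right = None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.pri > root.pri:   # heap violated: rotate right
            l = root.left
            root.left, l.right = l.right, root
            return l
    else:
        root.right = insert(root.right, key)
        if root.right.pri > root.pri:  # heap violated: rotate left
            r = root.right
            root.right, r.left = r.left, root
            return r
    return root

def inorder(root):
    # in-order traversal yields the keys in sorted order (BST property)
    return inorder(root.left) + [root.key] + inorder(root.right) if root else []
```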
        
       | ffhhj wrote:
       | Latent Dirichlet Allocation (LDA) structure:
       | 
       | "LDA represents documents as mixtures of topics that spit out
       | words with certain probabilities."
       | 
       | "LDA assumes that each document in a corpus contains a mix of
       | topics that are found throughout the entire corpus. The topic
       | structure is hidden - we can only observe the documents and
       | words, not the topics themselves. Because the structure is hidden
       | (also known as latent), this method seeks to infer the topic
       | structure given the known words and documents."
       | 
       | https://cfss.uchicago.edu/notes/topic-modeling/
       | 
       | I'm making a next-gen search engine.
        
       | Dwedit wrote:
       | Here's one I don't know if I've ever seen documented anywhere. If
       | anyone knows a proper name for this one, let me know!
       | 
       | Imagine it's for a text editor, and you want to map Line Numbers
       | to Byte Positions. But then you want to insert a byte somewhere,
       | and you need to add 1 to all your Byte Position values.
       | 
       | Instead of actually keeping a big array of Byte Position values,
       | you have a hierarchical array. The convention is that whenever
       | you want to read the value, you add the Parent's value, and that
       | parent's value, etc. Number of parents would be logarithmic.
       | 
       | So if you want to add 1 to everything past a certain point, you
       | just apply it to some leaf nodes, then the next internal nodes,
       | rather than applying it to every single leaf node.
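What's described here is close to a Fenwick (binary indexed) tree used for range-update/point-query: recording a delta at one index implicitly applies it to every position from there onward, and reading a value sums O(log n) ancestors. A sketch with invented names:

```python
class SuffixAdder:
    # Fenwick tree used "in reverse": add_from(i, d) adds d to every
    # position >= i; value(i) sums O(log n) tree entries
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-indexed internal nodes

    def add_from(self, i, delta):
        """Add delta to every position >= i (0-indexed)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i   # jump to the next covering node

    def value(self, i):
        """Current value at position i (0-indexed)."""
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i   # walk up through the parents
        return s
```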
        
         | progval wrote:
         | https://en.wikipedia.org/wiki/Rope_(data_structure)
        
           | Dwedit wrote:
           | That's good for representing the text, but I'm more focused
           | on the part about being able to add X to everything past a
           | particular point very quickly by manipulating parent values.
        
             | progval wrote:
             | Ropes do this too; each node is labeled the size of its
             | subtree. If you want to compute line numbers as well, add
             | them as another label
        
               | kragen wrote:
               | That kind of thing is a common extension to ropes,
               | because it makes them indexable containers like arrays,
               | but it is not inherent to the definition of ropes. I
               | don't think it's mentioned in the original Cedar rope
               | paper.
        
         | ant6n wrote:
         | An annotated tree?
        
         | compressedgas wrote:
         | Ted Nelson named them Enfilades. Guy Steele called them Monoid
         | Cached Trees.
        
           | twic wrote:
            | Enfilade trees were going to be my suggestion for this page!
           | Really nice structures. Such a shame the writing describing
           | them is so hermetic in style - dispative and widdative
           | properties, really, come on now.
        
           | kragen wrote:
           | Actually I think it was Roger Gregory or Mark Miller who came
           | up with the term "enfilade", and I don't think enfilade wids
           | and dsps have to be monoids necessarily. I'm not sure,
           | though.
        
         | ly3xqhl8g9 wrote:
          | Somewhat along these lines, I have formed a concept somewhat
          | forcedly called a "differentially composable string", or
          | "deposed string", or more precisely a "poor man's git".
         | 
         | The intended use case is to obtain a compact representation of
         | all the historic text entered into an input field (notes,
         | comments, maybe long-form): all the stages of the text, where a
         | stage is a tuple [add/remove, start_index, text/end_index].
         | Once you get the stages from the deposed string as JSON, you
         | could transform them however you want then load them into a new
         | deposed string.
         | 
         | You can read more on GitHub: https://github.com/plurid/plurid-
         | data-structures-typescript#... or play around on my note-taking
         | app implementing deposed strings and more:
         | https://denote.plurid.com
        
       | groby_b wrote:
       | All probabilistic structures are fascinating to me. I'm most
       | enamored with HyperLogLog - estimating the number of distinct
       | elements in extremely large sets of data without having to build
       | a set of seen elements - but the entire field is pretty awesome.
       | 
       | The canonical use case for HLL is counting unique visitors. But
       | any large dataset where you only need a good approximation is a
       | valid target :)
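A toy version of the estimator is surprisingly small (the constants follow the Flajolet et al. paper; the hash choice and small-range correction here are simplified):

```python
import hashlib, math

class HyperLogLog:
    # m = 2**b registers; each tracks the longest run of leading zeros
    # seen in the hashed items routed to it
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h >> (64 - self.b)                     # register index: top b bits
        rest = h & ((1 << (64 - self.b)) - 1)
        rho = 64 - self.b - rest.bit_length() + 1  # leading-zero rank
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        e = alpha * self.m * self.m * z            # raw harmonic-mean estimate
        if e <= 2.5 * self.m:                      # small-range correction
            v = self.registers.count(0)
            if v:
                e = self.m * math.log(self.m / v)
        return e
```

With b=10 (1024 registers, ~1 KB), the standard error is about 1.04/sqrt(1024) ≈ 3%.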
        
         | drinkcocacola wrote:
          | I used to work for a mobile analytics company, and HLL made
          | things SO much easier and faster that we changed all our
          | queries to be based on HLL. At the time, the ability to sum
          | uniques was like black magic, and to this day I am still quite
          | impressed by HyperLogLog.
        
       | weinzierl wrote:
       | Nearly 300 comments and no one said
       | 
       | Rainbow Tables.
       | 
        | Sure, they're not used anymore and are not particularly useful
       | nowadays, but I find the idea behind them very beautiful.
        
       | kqr wrote:
       | Not particularly innovative, but extremely useful: HDRHistograms
       | let you very efficiently create, combine and query empirical
       | probability distributions.
       | 
       | If I'm doing anything with random or random-looking variation, I
       | prefer to store the full distribution in a HDRHistogram rather
       | than just the last value, last few values, mean value, or
       | whatever. Opens up so many possibilities for analysis later down
       | the road.
        
       | FridgeSeal wrote:
       | Succinct data structures are one of my favourites.
       | 
        | The idea that we can very cleverly pack a collection of
        | information into a much smaller form, yet still query the
        | compressed form with good computational complexity, and without
        | unpacking the data we need to read, is _amazing_.
        
         | ignoramous wrote:
         | See Steve Hanov's classic article on the topic:
         | http://stevehanov.ca/blog/index.php?id=120
        
       | zarzavat wrote:
       | Spatial hashing.
       | 
       | Say that you have data that is identified with points in 2D or 3D
       | space. The standard way that CS students learn in school to
       | represent this is via a quadtree or octree. However, these tree
       | data structures tend to have a lot of "fluff" (needless
       | allocations and pointer chasing) and need a lot of work to be
       | made efficient.
       | 
       | Spatial hashing is the stupidly simple solution of just rounding
       | coordinates to a grid and storing nearby objects together in a
       | hash table. As long as your data is reasonably well behaved then
       | you get many of the advantages of a quadtree, and it's completely
       | trivial to implement.
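A minimal 2D spatial hash sketch (names invented): bucket each point by its grid cell, and answer radius queries by scanning only the 3x3 block of cells around the query.

```python
from collections import defaultdict
from math import dist

class SpatialHash:
    def __init__(self, cell_size):
        self.cell = cell_size
        self.grid = defaultdict(list)  # (cell_x, cell_y) -> points

    def _key(self, x, y):
        # round coordinates down to their grid cell
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, obj):
        self.grid[self._key(x, y)].append((x, y, obj))

    def query(self, x, y, radius):
        # radius must be <= cell_size for the 3x3 neighborhood to suffice
        cx, cy = self._key(x, y)
        hits = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for (px, py, obj) in self.grid.get((cx + dx, cy + dy), ()):
                    if dist((x, y), (px, py)) <= radius:
                        hits.append(obj)
        return hits
```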
        
         | de6u99er wrote:
         | Ever heard of geospatial hierarchical Indices?
         | 
         | E.g. Uber's H3 index or Google's S2 index.
        
         | exDM69 wrote:
         | Couple some spatial hashing with Morton codes and fast parallel
         | Radix sorting and some binary search derivative and you can do
         | all sorts of fancy queries for collision detection and graphics
         | stuff.
        
         | manholio wrote:
         | It should be noted that this solution completely fails when the
         | data is anything like real world location data which tends to
         | be clustered in cities, stadiums etc. The strength of quad
         | trees is precisely the fact that the "Alaska" quadrant can have
         | a single child for that crazy outdoors-man using your gadget,
          | while New York can have a million child subtrees going down to
          | every house on the street.
        
         | atemerev wrote:
         | Yeah, just reinvented it a few weeks ago and was like "whoa,
         | that's it?" It worked perfectly for my fast clustering use
         | case.
        
         | phkahler wrote:
          | I have a very optimized octree, but one thing I cannot use it
          | for is points in space. Mine is built with the assumption that
         | every object has a finite size, which is used to determine what
         | level of the tree it lives in. Points fail because they have
         | zero size.
         | 
         | I'm still looking for a good spatial index for points. I'll
         | think about yours.
        
         | pacaro wrote:
          | TBH even quadkeys are a fun answer to OP's question; many people
         | aren't aware of them.
         | 
         | Simple explanation:
         | 
         | If you have data with x y coordinates and you know the bounds.
         | 
         | To compute the quad key for a point:
         | 
         | 1. The key starts as the empty string. (I've also seen it start
         | with "Z" to handle points outside the bounds)
         | 
         | 2. Divide the space into 4 quadrants
         | 
         | 3. determine which quadrant the point falls in, append a letter
         | (A-D depending on the quadrant) to the key
         | 
         | 4. Repeat step 3 using the quadrant bounds (i.e. recursively
         | smaller area) until you have desired accuracy
         | 
         | This can then be used to efficiently find points within
         | rectangular bounds.
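The steps above can be sketched directly (the letter assigned to each quadrant is arbitrary; this follows the A-D convention described):

```python
def quadkey(x, y, bounds, depth):
    # bounds = (x0, y0, x1, y1); one letter A-D is appended per level
    (x0, y0, x1, y1) = bounds
    key = ""
    for _ in range(depth):
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        if y < my:
            if x < mx:
                key += "A"; x1, y1 = mx, my  # lower-left quadrant
            else:
                key += "B"; x0, y1 = mx, my  # lower-right quadrant
        else:
            if x < mx:
                key += "C"; x1, y0 = mx, my  # upper-left quadrant
            else:
                key += "D"; x0, y0 = mx, my  # upper-right quadrant
    return key
```

Nearby points share key prefixes, which is what makes rectangular range queries efficient.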
        
           | Gravityloss wrote:
           | Hmm... the quad keys end up looking very much like
           | coordinates. In the regular coordinates, the most significant
           | bit divides the space in half, then the next bit divides
           | those spaces in halves etc...
           | 
           | You could get a single "key" by just having a sequence of
           | bits with most significant bit of x coordinate, most
            | significant bit of y coordinate, second most significant
           | bit of x coordinate, second most significant bit of y
           | coordinate, third....
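That bit-interleaving is exactly a Morton (Z-order) code; a small sketch:

```python
def morton2(x, y, bits=16):
    # interleave the bits of x and y: x bits land in even positions,
    # y bits in odd positions
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code
```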
        
           | blueblob wrote:
            | Coming from basic binary search, this is very intuitive.
            | Your explanation was well written.
        
           | zbobet2012 wrote:
           | A lot of people don't know about the related segment and
           | interval trees. Both efficiently allow you to compute all
           | time ranges that overlap a point. This can be generalized to
           | higher dimensional spaces.
           | 
           | For example if you have a set of calendar invites with a
           | start and stop time (say tens of millions of them) and you
           | want to find all invites which _span_ 12pm (that is start
           | before 12 and end after 12) you want an interval tree.
           | 
           | https://en.wikipedia.org/wiki/Interval_tree
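A compact sketch of a centered interval tree for stabbing queries (illustrative; production versions also answer interval-overlap queries and choose centers more carefully):

```python
class IntervalTree:
    # centered interval tree: each node keeps the intervals that straddle
    # its center point; a stabbing query descends only one side
    def __init__(self, intervals):
        self.center = None
        self.here = []
        self.left = self.right = None
        if not intervals:
            return
        pts = sorted(p for iv in intervals for p in iv)
        self.center = pts[len(pts) // 2]  # median endpoint
        left, right = [], []
        for lo, hi in intervals:
            if hi < self.center:
                left.append((lo, hi))
            elif lo > self.center:
                right.append((lo, hi))
            else:
                self.here.append((lo, hi))  # straddles the center
        if left:
            self.left = IntervalTree(left)
        if right:
            self.right = IntervalTree(right)

    def stabbing(self, x):
        """All stored intervals (lo, hi) with lo <= x <= hi."""
        out = [iv for iv in self.here if iv[0] <= x <= iv[1]]
        if self.center is not None:
            if x < self.center and self.left:
                out += self.left.stabbing(x)
            elif x > self.center and self.right:
                out += self.right.stabbing(x)
        return out
```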
        
           | yellowflash wrote:
           | That's how mongodb geospatial indexes work IIRC
        
             | pacaro wrote:
             | It's tried and tested for sure, I first encountered them in
             | 94 but I assume they're much older.
             | 
             | A little sleuthing shows that (TIL) they are an application
             | of Z-order curves, which date back to 1906
        
             | soren1 wrote:
             | Was about to mention this. If I recall correctly, the
             | 2d-sphere index rounds geospatial coordinates to 5
             | decimals. Very occasionally, I found it would distort
             | polygon geometries just enough to cause them to become
             | invalid (e.g. overlapping geometries), which causes the
             | index build to fail.
             | 
             | In my recent experience working with collections containing
              | millions of documents, each containing a geoJSON-style
             | polygon/multipolygon representing a property (i.e. a block
             | of land), I found invalid geometries to occur for about 1
             | document in 1 million. For a while, I suspected the data-
             | vendor was the cause, however it became more puzzling when
             | other geospatial software confirmed they were valid.
             | Eventually we traced the issue to the 2d-sphere index.
             | 
             | A very clever workaround was suggested by a colleague of
             | mine, inspired by [1]. It preserved the original
             | geometries. In each document, we added a new field
             | containing the geometry's extent. A 2d-sphere index was
             | then built on the extent field instead of the original
             | geometry field. Invalid geometries were no longer an issue
             | since we were dealing with much simpler geometries that
             | were substantially larger than the max precision of the
             | index.
             | 
             | When running geoIntersects queries on our collection of
             | millions of documents, we did so in 2 steps (aggregation
             | queries):
             | 
             | 1. GeoIntersects on the extent field (uses the index).
             | 
             | 2. On the result set from the last step, perform
             | geoIntersects on the original geometry field (operates on a
             | much smaller set of records compared to querying the
             | collection directly)
             | 
             | [1] https://www.mongodb.com/docs/manual/tutorial/create-
             | queries-...
        
               | hubadu wrote:
                | Seems exactly like broad phase and narrow phase in a
                | game physics engine.
        
               | djmips wrote:
               | The same things get invented over and over again and
               | named different things depending on the field. Sometimes
               | it's not immediately clear that they are the same things
               | mathematically.
        
         | cpgxiii wrote:
         | A variation of this are dynamic spatially-hashed voxel grids,
         | where the outer grid is spatially hashed and the grid elements
         | are stored (as needed) as dense voxel grids. The advantages of
         | this are that you get voxel grid-like behavior over essentially
         | unbounded size, so long as the data is sparse (otherwise you
         | would need to allocate a significant number of grid elements as
         | voxel grids and the memory savings go away).
         | 
         | These data structures have been used in 3D reconstruction
         | methods like CHISEL where they can often outperform octrees due
         | to better memory access behavior.
        
         | chatmasta wrote:
         | I've always found Morton encoding to be a cool approach.
         | There's an old but good blog post about this:
         | https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...
         | 
         | This looks like a more modern implementation:
         | https://crates.io/crates/lindel
        
         | klysm wrote:
         | > reasonably well behaved
         | 
         | Curious what this means in this context. No hot spots? Not
         | sparse?
        
           | evouga wrote:
           | Performance degrades when you get many hash collisions, which
           | will happen if you have a lot of points clumped together in
           | space.
           | 
           | Sparsity is fine.
           | 
           | For points you expect to be unevenly distributed the more
           | advanced spatial partitioning data structures (kd tree, BVH)
           | perform better.
        
         | TheFragenTaken wrote:
         | I didn't think that would be an obscure data structure, but
         | Locality-sensitive hashing [1] helps in so many cases. Like
         | nearest neighbour search, collision detection, etc.
         | 
         | [1]: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
        
       | ur-whale wrote:
        | The Alias method: O(n) space, O(1) biased random number generation
       | 
       | https://en.wikipedia.org/wiki/Alias_method
       | 
       | Mind was utterly blown when I first understood the way this works
       | in my late teens.
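The construction (Vose's variant, sketched here with invented names) is the clever part: scale the weights so the average column height is 1, then repeatedly pair an under-full column with an over-full one.

```python
import random

def build_alias(weights):
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]  # scaled so the mean is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1]
    large = [i for i, p in enumerate(prob) if p >= 1]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                  # l's excess mass fills s's column
        prob[l] -= 1 - prob[s]
        (small if prob[l] < 1 else large).append(l)
    return prob, alias

def sample(prob, alias):
    # O(1): pick a column uniformly, then flip a biased coin
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```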
        
       | O__________O wrote:
       | Wikipedia's list of algorithms and data structures is pretty
       | extensive:
       | 
       | https://en.m.wikipedia.org/wiki/List_of_algorithms
       | 
       | https://en.m.wikipedia.org/wiki/List_of_data_structures
       | 
       | __________________
       | 
       | As for a specific algorithm, this paper covers analysis of
       | HyperLogLog (HLL):
       | 
       | http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
       | 
       | TLDR: Calculating the exact cardinality of a multiset requires an
       | amount of memory proportional to the cardinality, which is
       | impractical for very large data sets. Probabilistic cardinality
       | estimators, such as the HyperLogLog algorithm, use significantly
       | less memory than this, at the cost of obtaining only an
       | approximation of the cardinality. The HyperLogLog algorithm is
       | able to estimate cardinalities of > 10^9 with a typical accuracy
       | (standard error) of 2%, using 1.5 kB of memory.
       | 
       | Source:
       | 
       | https://en.wikipedia.org/wiki/HyperLogLog
       | 
       | Recent HN coverage of HyperLogLog:
       | 
       | https://news.ycombinator.com/item?id=32138790
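The core of the algorithm fits in a short sketch: hash each item, use the first p bits to pick a register, and keep the maximum number of leading zeros (plus one) seen in the remaining bits. A toy Python version, assuming string inputs and a 64-bit hash:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch (assumptions: string inputs, 64-bit hash)."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        x = int.from_bytes(
            hashlib.blake2b(value.encode(), digest_size=8).digest(), "big")
        idx = x >> (64 - self.p)         # top p bits choose a register
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias correction
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:              # small-range correction
            return self.m * math.log(self.m / zeros)   # linear counting
        return raw
```

With p=10 (1024 registers, ~1 KB of state) the standard error is roughly 1.04/sqrt(1024), about 3%.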
        
       | zeroonetwothree wrote:
       | Fibonacci Heap
       | 
       | Famous as the data structure that achieved optimum runtime in
        | Dijkstra's algorithm. In practice it isn't often used because
       | simpler heaps outperform it except in very specific
       | circumstances.
       | 
       | https://en.wikipedia.org/wiki/Fibonacci_heap
       | 
       | Soft heap
       | 
       | As far as I know this hasn't really been implemented. It's used
       | for the best known minimum spanning tree algorithm. Pretty crazy.
       | https://en.wikipedia.org/wiki/Soft_heap
        
       | bell-cot wrote:
       | > ...bloom filters... Good use-case: routing. Say you have a list
       | of 1 million IPs that are black listed...
       | 
       | Seems-needed clarification: Bloom filters suffer from false
       | positives, but _not_ false negatives. So - if
       | BloomFilterBlacklistCheck( $IP ) returns false, your router can
       | safely conclude  "not blacklisted". But if it returns true, then
        | you'll need to perform further checks. (And you can't just use a
        | second Bloom filter of known false positives - that will _also_
        | suffer from false positives, potentially letting in traffic from
        | blacklisted IPs.)
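A minimal Python sketch of the filter itself, to make the FALSE/MAYBE asymmetry concrete (sizes and hash choice are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes over an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), digest_size=8,
                                salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means definitely absent; True only means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```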
        
       | zbobet2012 wrote:
       | Here is one not on the list so far:
       | 
       | Set Sketches. They allow you compute the difference between two
       | sets (for example to see if data has been replicated between two
       | nodes) WITHOUT transmitting all the keys in one set to another.
       | 
       | Say you have two sets of the numbers [1, ..., 1million] all 32
       | bit integers, and you know one set is missing 2 random numbers.
       | Set sketches allow you to send a "set checksum" that is only _64
       | BITS_ which allows the other party to compute those missing
       | numbers. A naive algorithm would require 4MB of data be
       | transferred to calculate the same thing.
       | 
        | (In particular, PinSketch: https://github.com/sipa/minisketch)
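The single-difference case can be illustrated with plain XOR (PinSketch generalizes this to d differences using BCH codes); a toy with a smaller set than the example above:

```python
from functools import reduce

def xor_sketch(values):
    """A constant-size 'sketch' for the 1-difference case: XOR of all
    elements. Shared elements cancel when two sketches are combined."""
    return reduce(lambda acc, v: acc ^ v, values, 0)

# Alice has the full set; Bob is missing one element.
alice = range(1, 100_001)
bob = (v for v in alice if v != 54321)

# XOR of the two sketches cancels every shared element, leaving
# exactly the element Bob is missing - no keys transmitted.
missing = xor_sketch(alice) ^ xor_sketch(bob)
```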
        
         | FridgeSeal wrote:
         | Thank you!
         | 
         | I always wondered about this problem and never knew what to
         | look up to read more about it!
        
       | tsujp wrote:
       | Verkle Tree
       | 
       | https://vitalik.ca/general/2021/06/18/verkle.html
       | 
       | https://blog.ethereum.org/2021/12/02/verkle-tree-structure/
        
       | ChrisMarshallNY wrote:
       | Ole Begemann and Chris Eidhof wrote _Advanced Swift_ , and, in
       | there, described a really efficient FIFO queue.
       | 
       | I implemented a variant of it, in my Generic Swift Toolbox
       | Package[0]. It's lightning fast.
       | 
       | [0]
       | https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...
        
       | jillesvangurp wrote:
       | String tries are pretty fun if you need to implement auto
       | completion. I spent some time implementing my own on one project
       | and open sourced it. And then brought it in on a few more
       | projects.
       | 
       | I've used bloom filters with in memory caches a few times.
        
       | demindiro wrote:
       | This isn't a data structure technically but I find it clever:
       | ZFS' space maps.
       | 
       | It keeps track of which space is allocated by logging every
       | allocation and deallocation. When initializing an allocator,
       | which can be implemented in any way, this log is replayed.
       | 
       | If the map becomes too big it is rewritten so it is equivalent to
       | the old map but with no redundant entries.
        
       | sgtnoodle wrote:
       | Monotonic stacks are neat.
       | 
       | I made up a data structure once, consisting of a pyramid of
       | deques. It lets you efficiently compute any associative function
       | over a streaming window of data.
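For the non-invertible function max, the classic single-deque version already shows the flavor: a monotonically decreasing deque of candidate indices yields every window's maximum in O(n) total. A short Python sketch:

```python
from collections import deque

def sliding_window_max(values, w):
    """Max of every length-w window in O(n) total, via a monotonically
    decreasing deque of candidate indices."""
    dq, out = deque(), []
    for i, v in enumerate(values):
        while dq and values[dq[-1]] <= v:   # drop dominated candidates
            dq.pop()
        dq.append(i)
        if dq[0] <= i - w:                  # front fell out of the window
            dq.popleft()
        if i >= w - 1:
            out.append(values[dq[0]])
    return out
```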
        
         | jacamera wrote:
         | I recently learned about monotonic stacks thanks to this
         | LeetCode problem: https://leetcode.com/problems/sum-of-total-
         | strength-of-wizar...
         | 
          | I feel like they're quite possibly the most deceptively simple
          | data structure I've encountered yet. That is to say that for
         | me at least there was/is a wide gulf between simply
         | understanding what the data structure is (doesn't get much
         | simpler than a stack!) and when/how to actually apply it.
        
           | sgtnoodle wrote:
           | Do folk genuinely get asked problems like that in technical
           | interviews these days? I can't think of any plausible reason
           | why anyone would want to do a calculation as contrived as
           | that. I love the required modulo value too.
           | 
           | I assume the trick is to somehow leverage the distributive
           | property of multiplication to refactor the data to no longer
           | require as much iteration? Is there supposed to be a linear
           | time solution?
        
             | jacamera wrote:
             | > Do folk genuinely get asked problems like that in
             | technical interviews these days?
             | 
             | No idea! I'm still in interview prep mode though so I'm
             | doing the LeetCode contests every weekend. Even if
             | practicing problems like these is over-preparing I think
             | it's probably a good idea for me because I imagine I'll be
             | much more nervous and less performant during an actual live
             | interview.
             | 
             | > Is there supposed to be a linear time solution?
             | 
             | Yes! Using a monotonic stack to find the "pivot points" of
             | the contiguous subarray groups as well as both a prefix-sum
             | array and prefix-product array to calculate the sum of all
             | subarrays in each group in O(1) time after initially
             | building the arrays in linear time.
             | 
             | There are some great explanations in the discussion
             | section. This is one of the few problems that I got stuck
             | on and wasn't able to solve on my own after a couple days
             | of trying. The LeetCode community is awesome though and
             | getting to see how other people solved the problem is a
             | great educational resource.
        
         | scott_s wrote:
          | You may be interested in the work my coauthors and I have
          | done on data structures and algorithms for streaming
          | aggregation. The GitHub repo for the code, with high-level
          | descriptions of their properties and pointers to papers:
          | https://github.com/IBM/sliding-window-aggregators
        
           | sgtnoodle wrote:
           | That is indeed very interesting!
        
       | peterhunt wrote:
       | Sketching data structures: count min sketch, hyperloglog, sliding
       | window hyperloglog, t digest
       | 
       | Skiplists
       | 
       | Locality sensitive hashing
       | 
       | Log structured storage (bitcask, lsm)
        
       | aliceryhl wrote:
       | Link cut trees. They can be used to query information about paths
       | in a forest, while also allowing you to add and remove edges in
       | the forest.
       | 
       | For example, you could use them to store the minimum spanning
       | tree of a graph, while supporting updates if edges are added to
       | the graph.
       | 
       | They can also be used to speed up max flow algorithms by finding
       | the augmenting path in logarithmic time.
        
       | jupiter909 wrote:
        | o Fractal Tree Indexes are rather nifty, as used in the TokuDB
        | engine. Basic gist is that you get the speed of a B-tree but
        | your updates are less painful.
       | 
       | https://en.wikipedia.org/wiki/Fractal_tree_index http://www-
       | db.deis.unibo.it/courses/TBD/Lezioni/mysqluc-2010...
       | 
       | o k-d tree for space partitioning, handy real world applications
       | in mapping and gaming.
       | 
       | o Bloom Filters as you mention are great, always a handy one to
       | know.
        
       | sidcool wrote:
        | Timsort.
        
       | smokel wrote:
       | Bitboards, although perhaps not so obscure, allow for some very
       | nice optimization tricks.
       | 
       | I've used them to solve Sudoku puzzles really fast.
        
       | [deleted]
        
       | berendka wrote:
       | Nested Sets
       | 
       | https://en.wikipedia.org/wiki/Nested_set_model
       | 
        | You can use it to store tree structures in relational databases
        | for faster querying of things like "all nodes below node x",
        | "how many nodes to the top node", and so on.
       | 
       | Querying is easy, but comes at the cost that changes like
       | inserting and deleting are more expensive.
       | 
        | I used this for an auth service with roles, users and org
        | units. Pretty query-heavy; it worked very well.
        
       | sam0x17 wrote:
       | Years ago as part of my thesis research I created a rather
       | obscure family of data structures called DFI Filters or DFilters
       | for short. The work centers around indexing typed trees such as
       | abstract syntax trees or a web browser DOM such that you can
       | answer arbitrary queries of the form "give me the set of all
       | descendants of node P that are of type T" in something that in
       | practice collapses to O(1) in the number of tree nodes. There are
       | several versions of the data structure, but the general setup is
       | a sort of "tree of trees" where the main tree is a binary search
       | tree organized based on the depth-first indices of the nodes in
       | the tree being indexed, and there are links into each type-
       | specific variant of the main tree (in other words, simplified
       | versions of the main tree containing only one type). The key
       | intuition behind the whole thing, and the reason I called them
       | DFI Filters, is the fact that if you are at some particular node
       | in a tree, and you know ahead of time the depth first indices of
        | your node and its siblings, you actually already know exactly
        | how many descendants your node has based on the difference in the
       | DFI number between your node and its depth-first successor. Then
       | you can build a variety of indexing approaches and data
       | structures based on this insight -- I came up with ones that do
       | it in truly constant time but are expensive to update as an
       | undergrad, and in grad school I was able to build a version that
        | updates on the fly but still retains amortized O(1) for querying
        | and amortized O(m) for updating where m is the size of the
       | update.
       | 
       | By the way if you're curious how I can get constant time when I'm
       | returning a set of descendants, it's because I can give you the
       | size and pointers to the first and last elements right off the
       | bat.
       | 
       | The link to the original paper seems to be down (looking into
       | it), but you can find the master's thesis covering all variants
       | here:
       | https://github.com/sam0x17/hierarch_old/blob/master/Master's...
       | 
       | I've been working on a rust implementation lately as well. There
       | are a number of applications for DFilters including fast file-
       | system searching, DOM traversal, very fast static analysis on
       | abstract syntax trees, and more.
        
       | allanrbo wrote:
       | Frugal streaming.
       | 
       | For estimating a median or percentile in a stream, using constant
       | memory.
       | 
       | It's very simple: if the current value is higher than our
       | estimate, then increase our estimate by one. Else decrease by
       | one. It will converge over long enough time. This is called the
       | Frugal-1U algorithm.
       | 
       | The Frugal-2U is a slightly more advanced version that modifies
       | the step size along the way, plus other optimizations, to
       | converge faster and not oscillate too much.
       | 
       | https://pastebin.com/wAcCS8WZ
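The Frugal-1U median estimator really is this small; a sketch in Python:

```python
def frugal_1u_median(stream):
    """Frugal-1U: one word of state. Nudge the estimate up when a sample
    exceeds it, down when a sample falls below it; over a long enough
    stream the estimate hovers around the median."""
    est = 0
    for x in stream:
        if x > est:
            est += 1
        elif x < est:
            est -= 1
    return est
```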
        
         | Lichtso wrote:
         | Isn't that essentially a low-pass filter?
        
         | jjgreen wrote:
         | An alternative streaming method due to Arandjelovic, Pham,
         | Venkatesh based on maximum entropy:
         | https://ieeexplore.ieee.org/document/6971097 and implementation
         | in C: https://github.com/dressipi/apv-median
        
       | fatih-erikli wrote:
        | - Prefix tree (also known as a trie)
        | 
        | - Splay tree (self-balancing tree optimized for fast queries on
        | frequently accessed nodes)
        
       | boredbot wrote:
        | Arrays are pretty underground, they're like lists but better.
        | C++ offers arrays but unfortunately Python doesn't :(
        
         | progval wrote:
         | It does: what Python calls lists are actually (dynamic) arrays.
         | 
         | There is also this module, which provides unboxed arrays:
         | https://docs.python.org/3/library/array.html
        
       | darksaints wrote:
       | The Rete algorithm / graph network, and its variants.
       | 
        | You can evaluate a huge number of rules, with automatic
        | incremental updates (no re-evaluation if not necessary), with
        | O(1) time complexity per update. The tradeoff is that you have
        | to hold a very large graph representing your rule set in
        | memory.
        
       | mkindahl wrote:
       | Tree automata and the BURS (Bottom Up Rewrite System) algorithm
       | are pretty cool.
        
       | yeputons wrote:
       | Eertree. As one can guess from the name, it stores all sub-
       | palindromes of a given string and does stuff to them. Somewhat
       | similar to a suffix tree in spirit, but was only invented in
       | 2015.
       | 
       | See https://en.wikipedia.org/wiki/Palindrome_tree and the
       | original paper
        
       | glouwbug wrote:
       | Static arrays - when was the last time you used a static array?
        
         | jhallenworld wrote:
         | I find I use fixed-sized arrays a lot on microcontrollers. You
         | want to avoid the heap (for reliability) and you don't have a
         | lot of RAM, so you must carefully size your fixed-size arrays.
        
         | dahart wrote:
         | What do you mean by "static" exactly? It could be referring to
         | an array that doesn't change size. Or the contents are read-
         | only. Or do you mean C/C++ static linking? Or something else?
         | 
         | In C, C++ and CUDA I use constant fixed-size arrays with lookup
         | tables of specialized data all the time. I also use small
         | arrays to hold temporary results that need to get first
         | collected, and then sorted or processed together. Seems like
         | very common in C++, especially embedded, game, and GPU
         | programming. I'd love to hear about why static arrays seem
         | uncommon in your world.
        
           | glouwbug wrote:
           | Static arrays as in constant fixed sized arrays. With a
            | static array, O(N) is faster than O(log N) and in some
            | cases even O(1) _ahem, hashes_ when N is small. A
            | contiguous cache is
           | king, and even with modern dynamic languages, the
           | generational GC can even be completely bypassed or highly
           | simplified if the JIT learns that memory access is just
           | dependent on offsets.
           | 
           | Simple code runs fast.
           | 
           | See Pike on Complexity - Rule 3:
           | https://www.lysator.liu.se/c/pikestyle.html
        
       | Fiahil wrote:
       | Append-Log. A grow-only linked list where you can only append
       | elements to it.
       | 
       | Good use-case: Lock-free concurrent reads are allowed, because
       | there is no way committed content could change. Writes need to be
       | Linearizable (so locking is required). Given this property, this
       | data structure provides faster reads than a Mutex<List<T>> and
        | similar write performance.
       | 
        | Bonus section: Provides broadcast capabilities if used as a
        | channel.
        
         | charleslmunger wrote:
         | I think this can be done completely lock free with almost same
         | number of memory barriers as a mutex when writing - read the
         | tail pointer (acquire), append your node using CAS(release),
         | then update the tail pointer with CAS (release).
         | 
         | For reads, start with an acquire read of the tail pointer,
         | which will make all preceding writes to the list visible to
         | you. Then you can read the list all you want up until you hit
         | the node you read from the tail pointer without
         | synchronization.
        
           | Fiahil wrote:
           | > read the tail pointer (acquire)
           | 
           | > append your node using CAS(release)
           | 
           | > update the tail pointer with CAS (release).
           | 
           | I thought as well, but, when I wrote that impl there was no
           | way to shortcut the 2 releases into a single atomic
            | operation. That ended up creating forks or losing blocks.
           | 
           | I tested that model and a few variations with loom
           | (https://docs.rs/loom/latest/loom/), so I'm confident that it
           | didn't work. However, I also think that a working design
           | might exist. I just haven't found it yet :)
        
       | guuggye wrote:
       | Rope (or Cord)
       | 
       | https://en.m.wikipedia.org/wiki/Rope_(data_structure)
        
       | synalx wrote:
       | PR Quadtrees (https://opendsa-
       | server.cs.vt.edu/OpenDSA/Books/CS3/html/PRqu...) got me my
       | current job at Google. One of my interviewers asked basically
       | "how would you design Google Maps?" and I based my answer on this
       | data structure.
        
         | plank wrote:
         | Looked to see what these things were. Realised it might be the
         | reason I made a career in IT: I used this structure (I called
         | it a box) to calculate residues
         | (https://en.wikipedia.org/wiki/Residue_(complex_analysis)),
          | which I figured would be simpler using an object-oriented
          | language, hence I taught myself C++. This was during my PhD
          | (theoretical physics).
          | 
          | Of course, I never used C++ again (I learned an object-
          | oriented language 'properly' using Smalltalk, which I also
          | never used again in real life). Now, 25 years later, I see
          | for the first time that I was using an (adaptation of) PR
          | Quadtrees.
        
         | mkl wrote:
         | Why subdivide so far? I find it better to have leaf nodes
         | contain an array of up to some small number of points, e.g. 20.
         | That reduces the pointer chasing significantly, and a linear
         | search through such a small array is very fast.
        
       | 1024core wrote:
       | > Say you have a list of 1 million IPs that are black listed. A
       | trivial algorithm would be to compare every element of the set
       | with a given IP. The time complexity grows with the number of
       | elements.
       | 
       | In the case of IPs, you just have to make 4 comparisons (for
       | IPv4), if you store them in a tree.
        
       | anmalhot wrote:
       | Since you mentioned Bloom filters, here are ribbon filters:
       | https://engineering.fb.com/2021/07/09/data-infrastructure/ri...
       | 
       | Edit: another that I enjoyed working with 'Burst Tries':
       | https://dl.acm.org/doi/10.1145/506309.506312
        
       | raydiatian wrote:
       | I've bookmarked this thread. I have a new reading list! Thanks
       | OP!
        
       | Dwedit wrote:
       | Here's another one I was thinking about. It's an idea for a
       | hierarchical Free Space Bitmap. I don't know if anything like
       | this has appeared in a real file system or not.
       | 
       | At the lowest level, let's have one bit represent if a 4K chunk
       | is free space or occupied. 64 bits at this level tells you the
       | usage of 64 * 4096 bytes (256KB)
       | 
       | Then Level 2. Two arrays. One bit at this level represents 64
       | bits in the level 1 array. One array for All-Bits-Set in level 1,
       | one array for Any-Bit-Set in level 1. 64 bits at Level 2 tells
       | you usage of 64 * 64 * 4096 bytes (16MB).
       | 
       | Then Level 3. Two arrays. One bit at this level represents 64
       | bits in the level 2 array. One array for All-Bits-Set in level 2,
       | one array for Any-Bit-Set in level 2. 64 bits at Level 3 tells
       | you the usage of 1GB.
       | 
        | I'd imagine there would be ways to use the hierarchical tables
        | to quickly find a hole of a particular size without scanning
        | the whole lowest-level bitmap.
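A two-level toy in Python shows how a summary array lets allocation skip 64 chunks at a time (only the All-Bits-Set array is modeled; names are illustrative):

```python
class TwoLevelBitmap:
    """Sketch of the two lowest levels: level 1 is the real bitmap
    (1 bit per 4K chunk, packed into 64-bit words); level 2 keeps one
    All-Bits-Set summary bit per level-1 word."""
    def __init__(self, n_words):
        self.l1 = [0] * n_words          # 0 bit = free chunk
        self.full = [False] * n_words    # level-2 All-Bits-Set summary

    def allocate(self):
        """Return the index of a free chunk and mark it occupied."""
        for w, is_full in enumerate(self.full):
            if is_full:
                continue                 # summary skips 64 chunks at once
            word = self.l1[w]
            bit = (~word & (word + 1)).bit_length() - 1   # lowest clear bit
            self.l1[w] = word | (1 << bit)
            if self.l1[w] == (1 << 64) - 1:
                self.full[w] = True      # propagate to the summary level
            return w * 64 + bit
        raise MemoryError("no free chunks")
```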
        
         | zbobet2012 wrote:
         | I think you're describing a Van Emde Boas tree (or trie):
         | https://en.wikipedia.org/wiki/Van_Emde_Boas_tree
        
         | compressedgas wrote:
         | This reminds me of libswift's binmaps:
         | https://github.com/gritzko/swift/blob/master/doc/binmaps-ale...
        
       | manuelabeledo wrote:
       | Concurrent tries with non-blocking snapshots [0]
       | 
       | Say that you have a dataset that needs to be ordered, easily
       | searchable, but is also updated quite frequently. Fast accesses
       | are a pain if you decide to use traditional read-write locks.
       | 
       | Ctries are entirely lock-free, thus there is no waiting for your
       | read operations when an update is happening, i.e. you run lookups
       | on snapshots while updates happen.
       | 
       | They are also a lot of fun to implement, especially if you aren't
       | familiar with lock-free algorithms! I did learn a lot doing it
       | myself [1]
       | 
       | [0] http://aleksandar-prokopec.com/resources/docs/ctries-
       | snapsho...
       | 
       | [1] https://github.com/mabeledo/ctrie-java
        
         | nly wrote:
         | Sounds like if you want a version of this that doesn't leak
         | memory you need a garbage collected language?
        
           | manuelabeledo wrote:
           | Yeah, either that or you implement your own. Doing so with
           | reference counts might not be too hard, though.
        
             | nly wrote:
             | The reference count on each element would presumably
             | require atomic operations that could race with those used
             | to update the actual data structure, and therefore you'd
             | lose all your concurrency guarantees?
        
               | manuelabeledo wrote:
               | Generally speaking, this shouldn't be an issue if you
               | perform batch collections.
               | 
               | Either way, any new mechanism to release memory would
               | need to take into account both the generation and the
               | reference count, and the atomic release should be
               | performed in the same GCAS fashion.
        
       | chrsig wrote:
       | I'll add: the count-min sketch[0]. Used for streaming top-k,
       | where the set of keys may be too large to fit in memory. It's
       | essentially a composite of a fixed size (k) heap and a counting
       | bloom filter.
       | 
       | Instead of using a bit field for membership, use an integer field
       | for tracking count.
       | 
       | When querying: take the min of each value in the field location
       | given by the hash functions.
       | 
       | When updating: increment each field location given by the hash
       | functions, take the min of new values. if the min is greater than
       | the least value in the heap, push the updated key into the heap
       | 
       | [0] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
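The counting part (without the top-k heap) is only a few lines; a toy Python sketch with illustrative sizes:

```python
import hashlib

class CountMinSketch:
    """Toy count-min sketch: depth rows of width counters, one salted
    hash per row. Estimates overcount, never undercount."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=bytes([row])).digest()
            yield int.from_bytes(h, "big") % self.width

    def add(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the min over rows
        # is the tightest (over-)estimate available.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(key)))
```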
        
       | valbaca wrote:
       | (EDIT: using "mail" because I was using "letter" too much in my
       | description)
       | 
       | The way I think of a bloom filter is like at an office mailroom
       | or at post office mailbox where mail is grouped by the letters of
       | people's name.
       | 
        | Given someone's name, you can glance and see "Uptrenda? No
        | letters for 'U'" very quickly. This is when a bloom filter
        | returns FALSE. But if you glance and see mail in the "U" box,
        | then it returns MAYBE. After a MAYBE, to see if you _actually_
        | have any mail, you need to go through the box (i.e. resolve a
        | possible bloom filter false positive).
       | 
        | Mine are a couple that are present in the Java Collections but
        | I'm always disappointed that other languages don't include them:
       | 
        | - TreeMap/Set: A red-black tree. It keeps keys ordered, either
        | naturally or by using the provided Comparator, but has log(n)
        | time on basic ops.
        | https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
        | 
        | - LinkedHashMap/Set: keeps keys ordered by insert order via a
        | linked list, but can be configured to use access order, making
        | it easy to use as an LRU cache. It keeps O(1) time but is less
        | efficient than HashMap:
        | https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html
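Python's closest stand-in is collections.OrderedDict; a sketch of LinkedHashMap's accessOrder=true + removeEldestEntry idiom:

```python
from collections import OrderedDict

class LRUCache:
    """Access-ordered bounded map, mirroring LinkedHashMap's
    accessOrder=true + removeEldestEntry LRU idiom."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.d = OrderedDict()

    def get(self, key, default=None):
        if key not in self.d:
            return default
        self.d.move_to_end(key)          # mark as most recently used
        return self.d[key]

    def put(self, key, value):
        if key in self.d:
            self.d.move_to_end(key)
        self.d[key] = value
        if len(self.d) > self.capacity:
            self.d.popitem(last=False)   # evict least recently used
```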
        
         | dskloet wrote:
         | > other language's don't include them
         | 
          | std::map and std::set in C++ are actually a tree map and tree
          | set.
        
       | lproven wrote:
       | From a prior post:
       | 
       | The piece table. https://news.ycombinator.com/item?id=15387672
       | 
       | Mirror of original:
       | https://web.archive.org/web/20160308183811/http://1017.songt...
        
         | lproven wrote:
         | After that: maybe the Enfilade, from the Xanadu project:
         | 
         | https://xanadu.com/tech/
         | 
         | https://en.wikipedia.org/wiki/Enfilade_(Xanadu)
        
       | vivegi wrote:
       | 1. Merkle-trees. Helps with content-based-retrieval, efficient
       | add/modify/delete/move node operations. Useful in scenarios where
       | you are building a data-serialization format that needs to
       | support efficient updates.
       | 
       | 2. Not necessarily a data structure, but SMT solvers (like Z3)
       | provide a very useful abstraction for certain kinds of problems.
       | They also use some cool data structures like disjoint-sets among
       | other things.
       | 
       | 3. Lock-free datastructures
        
       | freen wrote:
       | Fountain Codes: reconstruct data from random selection of
       | packets.
       | 
       | Was deeply patent encumbered until recently.
       | 
       | https://en.m.wikipedia.org/wiki/Fountain_code
        
         | noman-land wrote:
         | I searched this thread far and wide for fountain codes.
          | Surprised they're so low on the list. I find them amazing. Also,
         | the name is really good. The fact that you can collect any
         | random packets in any order invokes the idea of putting a cup
         | under a fountain and letting it fill up with water. Love it.
        
       | petethepig wrote:
       | Tries (or prefix trees).
       | 
       | We use them a lot at Pyroscope for compressing strings that have
        | common prefixes. They are also used in databases (e.g. indexes
        | in Mongo) or file formats (e.g. debug symbols in macOS/iOS Mach-O
       | format are compressed using tries).
       | 
       | We have an article with some animations that illustrate the
       | concept in case anyone's interested [0].
       | 
       | [0] https://github.com/pyroscope-
       | io/pyroscope/blob/main/docs/sto...
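A bare-bones trie with the prefix query that autocomplete needs (an illustration, not Pyroscope's implementation):

```python
class Trie:
    """Minimal prefix tree: one node per character, dict of children."""
    def __init__(self):
        self.children = {}
        self.terminal = False            # True if a word ends here

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def with_prefix(self, prefix):
        """All stored words starting with `prefix` (the autocomplete query)."""
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def walk(n, acc):
            if n.terminal:
                out.append(prefix + acc)
            for ch, child in n.children.items():
                walk(child, acc + ch)
        walk(node, "")
        return out
```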
        
         | btschaegg wrote:
         | Indeed, but it took a while for me to appreciate Tries.
         | 
          | Originally, I thought "nice idea", but I rarely ever
         | encountered a problem where I really needed a prefix tree.
         | 
         | But while reading into HAMT and Clojure's datastructure
         | implementations, it dawned on me that prefix trees aren't only
         | useful for strings (or arbitrary byte sequences): Hashes in a
         | dictionary, for example, are sequences of bits, too. And the
         | design of a HAMT is exactly such that it doesn't concern itself
          | with bytes but any blocks of N bits (determined by the
          | branching factor; i.e. a factor of 32 implies 5-bit blocks).
         | And if you know the length of each sequence (hash), you know
         | the maximum Trie depth, which also works _for_ you, not against
         | you.
         | 
         | That was rather eye opening for me at the time.
        
         | rockwotj wrote:
         | Do you have any documentation about how tries are used in mongo
         | indexes? Or could you point to the source?
        
           | petethepig wrote:
           | There's not much about it online. They do mention it in the
           | docs, they call this "prefix compression" [0], so you can
           | search for that. This article describes it [1], although it
           | is somewhat high level.
           | 
           | They only use this for indexes. It works really well for
           | internal mongo ids (they all start with timestamps if I
           | remember correctly). It also works well for compound indexes.
           | Pro tip: for compound indexes always go from low-cardinality
           | attribute to high-cardinality attribute to get the best
           | results from prefix compression (it's also good for speed of
           | queries).
           | 
           | [0] https://www.mongodb.com/docs/manual/reference/glossary/#s
           | td-...
           | 
           | [1] https://medium.com/swlh/mongodb-indexes-deep-dive-
           | understand...
        
         | chrismarlow9 wrote:
         | Even though a trie is pretty standard and expected (to be
         | known) these days, I will vote for this. It was my first deep
         | dive into more exotic data structures after an interview
         | question about implementing T9 that stumped me many years ago.
         | 
         | https://github.com/Azeem112/tries-T9-Prediction
        
         | NSMutableSet wrote:
         | I wouldn't consider tries to be obscure tbh. They are the
         | recommended solution for many leetcode-style interview problems
         | involving string searching. I think anyone who has done serious
         | interview prep has encountered them.
         | 
         | https://leetcode.com/discuss/general-discussion/931977/begin...
        
           | EddySchauHai wrote:
           | I got stung by this for a test engineering role. Apparently
           | implementing them from scratch is a pre-req to building out
           | test infrastructure!
        
           | OJFord wrote:
           | The starting example was Bloom filters.
        
             | NSMutableSet wrote:
             | I'm not sure what the point of your post is. While bloom
             | filters are heavily used in actual production code
             | throughout the industry, it is very rare for anyone to need
             | to code their own, or make changes to a prior legacy
             | implementation.
             | 
             | Not all educational programs will cover bloom filters, and
             | for those that do, there's no guarantee that the students
             | will retain the information, and be able to recall it.
             | 
             | I don't know of any common interview question patterns
             | where a bloom filter would be the only optimal solution. If
             | it does share optimality with any other data structures,
             | you would be better off choosing something else unless you
             | are _extremely_ confident in your ability to implement one
             | and explain it clearly and in-depth, since the odds of your
             | interviewer not being familiar with them are high in
             | comparison to other solutions.
             | 
             | I only know what a bloom filter is because of HN, and
             | previous submissions like this one that have made it to the
             | front page throughout the years. It's nowhere near as
             | common knowledge as tries, especially if you are sampling
             | younger cohorts.
        
               | jefftk wrote:
               | I was with you until the last seven words ;)
               | 
               | Trees were a huge part of CS practice and education
               | historically, but have been replaced by hash-based
               | methods in many cases. For example, in C++ std::map is
               | generally a tree, while in a more recent language the
               | standard Map data structure will be a hashmap. My
               | impression is that the instruction time devoted to Bloom
               | filters (and other hash-based methods) vs trees has
               | shifted towards the former over the last ~20y.
        
               | sicp-enjoyer wrote:
               | C++ STL uses trees because the generic requirement for
               | them to work is a < operator. Designing generics around
               | hashing is much more difficult, and most languages
               | struggle with that.
               | 
                | Ironically, due to caches, sorting and then using
                | algorithms that rely on order tends to be superior to
                | most hashing implementations, even though it is
                | theoretically worse due to the log(n) factor. So in a
                | way, C++ algorithms are actually more modern.
        
               | Sesse__ wrote:
               | > Ironically, due to caches, sorting and then using
               | algorithms that rely on order tend to be superior than
               | most hashing implementations
               | 
               | This doesn't match my experience at all. C++ trees are
               | not cache-friendly; they're pointer-chasing (and there's
               | no arena implementation in the STL). Second, any sort of
               | ordering structure (be it through a tree or through
               | sorting + binary search) is notoriously prone to branch
               | mispredictions. They look great in a microbenchmark if
               | you search for the same element all the time and the
               | trees are small, but in a practical application, it can
               | be incredibly painful. Pretty much by design, every
               | choice is 50-50 and unpredictable.
               | 
               | std::unordered_map from most STLs isn't a fantastic hash
               | table, but in my experience it absolutely destroys
               | std::map (and is also comfortably ahead of something like
               | std::lower_bound on a sorted array). All of this is
               | assuming large-ish N, of course; for small N (e.g. below
               | 30, usually), the best by far is a simple unsorted array
               | and just going through it linearly.
        
               | sicp-enjoyer wrote:
               | > C++ trees are not cache-friendly
               | 
               | Agreed, they should use a B-tree to get cache locality
               | and easy generics, but there is legacy code there.
               | 
                | I was referring to the performance of algorithms, for
                | example `std::unique`, `std::lower_bound`, etc. Many of
                | these assume sorted ranges, whereas most other
                | languages' standard libraries use hashing for the same
                | tasks.
               | 
               | > is also comfortably ahead of something like
               | std::lower_bound on a sorted array
               | 
               | I would be interested to learn more about when that's the
               | case. But, it's also not very flexible. You can put an
               | `int` in it, great. Can you put `std::pair<int, int>` in
               | it? Does it work as well?
        
               | Sesse__ wrote:
                | B-trees don't work well for CPU caches; 64-byte lines
                | are typically too small to gain much. (OK, the M1 has
                | 128-byte lines, but still.) And you still get the branch
               | mispredicts, and a significant increase in code
               | complexity, so hashing is still significantly better.
               | (I've seen C++ projects where the main structure was a
               | B+-tree because the lower levels could be residing on
               | disk, but where they maintained a separate hash table in
               | memory to skip the top first few B-tree levels!)
               | 
               | > I would be interested to learn more about when that's
               | the case. But, it's also not very flexible. You can put
               | an `int` in it, great. Can you put `std::pair<int, int>`
               | in it? Does it work as well?
               | 
               | If by "it" you're talking about std::unordered_map (as
               | key); yes, you can, but you'd have to supply your own
               | hash function, because std::pair<A, B> does not have a
               | std::hash specialization. (It's a bit sad, but the
               | problem is highly specific to std::pair/std::tuple.)
               | Likewise, you can use any arbitrary type you create
               | yourself, as long as it has operator== and you've written
               | a hash function for it.
        
               | umanwizard wrote:
               | To your point, C++ got hash structures in the standard
               | library (unordered_map and unordered_set) in C++11 (so
               | only about a decade ago).
        
               | adgjlsfhk1 wrote:
                | A lot of this is that, other than heaps, most
                | tree-based algorithms involve O(log n) random-access
                | pointer lookups, which makes them relatively slow for
                | in-memory data structures.
        
               | ant6n wrote:
               | Trie != tree
        
               | OJFord wrote:
               | I just didn't think it was necessary to be
               | judgemental/'gate-keep' about what's obscure or
               | interesting 'enough' - better to let anyone share what's
               | interested them and why.
               | 
               | To your points, personally I do fall in a 'younger
               | cohort' I imagine; have come across bloom filters on HN
               | _way_ more frequently, and I dimly recall tries from
                | university but I knew them as prefix trees. Without
                | submissions like this one I wouldn't have learned what
                | a trie is / made that alias in my head.
        
       | b20000 wrote:
       | the BroTree. it's the tree you don't know about but will
       | inevitably be queried about during a google interview with a
       | googler on a mission to prove supreme smartness.
        
       | benreesman wrote:
       | https://github.com/facebook/folly/blob/main/folly/docs/Packe...
       | 
       | is pretty awesome and I've personally gotten big wins from it.
       | 
       | The parent mentioned Bloom Filters, HyperLogLog-type stuff is on
       | that family tree and also very interesting and very useful.
       | 
       | But the coolest thing lately by far? Swiss table (and the largely
       | equivalent F14).
       | 
       | That thing is a game changer in certain scenarios.
        
       | ignoramous wrote:
       | > _Good use-case: routing. Say you have a list of 1 million IPs
       | that are [deny listed]._
       | 
       | Apparently, bloom filters make for lousy IP membership checks,
       | read: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
       | 
       | CritBit Trie [0] and possibly Allotment Routing Table (ART) are
       | better suited for IPs [1].
       | 
       | [0] https://github.com/agl/critbit
       | 
       | [1]
       | https://web.archive.org/web/20210720162224/https://www.harig...
        
         | arein3 wrote:
          | Why not a Map? It will have O(1) search complexity.
          | 
          | If it should take IP blocks into account, then just put the
          | ranges in a sorted index and you get logarithmic search
          | complexity (if speed is important and the blacklist is not
          | huge, you can also expand the blacklisted ranges into
          | individual IPs and put those in a Map, so that you get O(1)
          | complexity for IP range search)
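          | 
          | The sorted-ranges idea can be sketched with Python's bisect
          | (the ranges here are made-up integer IPs):

```python
import bisect

# Deny list as non-overlapping (start, end) ranges over integer IPs,
# kept sorted by start address.
ranges = [(10, 20), (50, 60), (100, 200)]
starts = [r[0] for r in ranges]

def is_denied(ip):
    # Binary search: find the last range starting at or before ip,
    # then check whether ip falls inside it.
    i = bisect.bisect_right(starts, ip) - 1
    return i >= 0 and ranges[i][0] <= ip <= ranges[i][1]
```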
        
         | zasdffaa wrote:
         | That over-simplifies what the linked article says, plus there
         | are things like blocked bloom filters and other tricks to speed
         | things up.
         | 
          | Plus if he's allocating 128MB, well, just do it as a direct
          | array, 1 bit per IPv4 address (which can be optimised to
          | remove a few special blocks I guess), and skip any hashing.
        
       | llimos wrote:
       | The Israeli queue.
       | 
       | Like a regular queue, but if something new that comes in sees its
       | friend in the queue already, it can jump the queue and go and
       | stand next to her.
       | 
       | Useful for when something has a big overhead on top of its own
       | processing, but the overhead can be shared between several
       | similar entries. If you're doing it anyway for the one already in
       | the queue, you get to make the most of it for its friends too.
       | 
       | I chose it because I live here and it's totally true :)
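        | 
        | A minimal sketch of the batching behavior, assuming the
        | "friend" relation is a shared group key (names are mine):

```python
from collections import OrderedDict

class IsraeliQueue:
    """Items with the same group key 'stand next to their friend':
    a newcomer joins its group's existing spot instead of the back."""

    def __init__(self):
        self._groups = OrderedDict()  # group key -> items, arrival order

    def enqueue(self, group, item):
        # If the group is already queued, the item joins it in place;
        # otherwise the group goes to the back of the queue.
        self._groups.setdefault(group, []).append(item)

    def dequeue(self):
        # Serve the whole front group at once, so the shared
        # per-group overhead is paid only once.
        group, items = next(iter(self._groups.items()))
        del self._groups[group]
        return group, items
```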
        
         | kgeist wrote:
         | Sounds like a priority queue
        
           | vosper wrote:
           | Is it the same thing? I'm trying to imagine how this would
           | work with a priority queue.
           | 
            | I think you need a flexible number of priorities, equal to
            | the queue length, where the priority values come from an
            | infinite counter that only goes up. Higher values = lower
            | priority.
           | 
           | When a unique item comes into the queue you increment the
           | counter and give it that value as a priority. This puts it on
           | the end of the queue.
           | 
           | But if the new item is a duplicate of an existing item then
           | you give it the priority of the existing item.
        
           | [deleted]
        
           | Tmpod wrote:
           | Not really. In a priority queue, an item with higher priority
           | gets pulled out before one with lower priority, so you can
           | have items that skip parts of the queue (or all of it) even
           | if there are no items of the same priority in there already.
           | 
           | This structure seems to behave differently, in that an item
           | may skip ahead in the queue if and only if another item with
           | some matching characteristic is already contained. This makes
           | it so you can never skip the whole structure, for example,
           | assuming you get put after your "friend".
        
         | Dave_Rosenthal wrote:
         | There is a great (always busy) sandwich shop in Boston called
         | Darwin's that I used to go to a lot that worked that way. When
         | a customer placed an order they would call down the long line,
          | "anyone else for an X?", then make the sandwiches with the
         | inner loop over sandwich and the outer loop over ingredient.
         | 
         | I took much delight in the efficiency as well as the incentive
         | structure: If you are not picky and you can help us increase
         | throughput then we'll reduce your latency :)
        
         | [deleted]
        
         | [deleted]
        
         | valbaca wrote:
         | Sounds like a priority queue with a sidecar hashmap or some
         | kind of notify mechanism for when "order's up!"
        
         | abirch wrote:
         | How do you implement it? Do you do a scan with every insert or
         | do you hash or use a tree for indexing?
        
         | teraflop wrote:
         | Sounds similar to the behavior of Guava's LoadingCache. If you
         | try to fetch a key that's not in the cache, the supplied
         | callback is used to fetch it. While the value is being loaded,
         | other threads can access the cache concurrently, but if a
         | second thread asks for the same key, it will wait for the first
         | load to complete instead of redundantly fetching the same
         | value.
        
       | rjh29 wrote:
        | For routing you can also use a sparse radix tree, also known as
        | a Patricia tree. This allows fast matching against a set of
        | prefixes (which is used for IP routing) without taking up a lot
        | of space.
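        | 
        | A Patricia tree also collapses single-child chains; this plain
        | binary-trie sketch of my own only shows the longest-prefix-
        | match part:

```python
class PrefixTrie:
    """Binary trie over address bits for longest-prefix matching."""

    def __init__(self, bits=32):
        self.bits = bits
        self.root = {}  # node: {0: child, 1: child, 'nh': next hop}

    def insert(self, prefix, plen, next_hop):
        node = self.root
        for i in range(plen):
            bit = (prefix >> (self.bits - 1 - i)) & 1
            node = node.setdefault(bit, {})
        node['nh'] = next_hop

    def lookup(self, addr):
        # Walk the address bits, remembering the most specific
        # (longest) prefix seen so far.
        node, best = self.root, None
        for i in range(self.bits):
            if 'nh' in node:
                best = node['nh']
            bit = (addr >> (self.bits - 1 - i)) & 1
            if bit not in node:
                return best
            node = node[bit]
        return node.get('nh', best)
```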
        
       | carapace wrote:
       | DLX https://en.wikipedia.org/wiki/Dancing_links
       | 
       | Stanford Lecture: Don Knuth -- "Dancing Links" (2018)
       | https://www.youtube.com/watch?v=_cR9zDlvP88
        
       | elromulous wrote:
       | Somewhat straddling pure cs and cryptography/coding theory: one
       | of my absolute favorites: Merkle tree. It's used for easily
       | confirming the integrity of data. E.g. a filesystem.
       | 
       | https://en.m.wikipedia.org/wiki/Merkle_tree
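        | 
        | A minimal Merkle-root sketch with hashlib (duplicating the
        | last node on odd-sized levels is just one of several
        | conventions; real systems also keep the inner nodes so they
        | can serve inclusion proofs):

```python
import hashlib

def merkle_root(leaves):
    # Hash each leaf, then fold pairwise up to a single root hash.
    # Any change to any leaf changes the root. Assumes at least one
    # leaf.
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```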
        
       | apgwoz wrote:
       | Good old Calendar Queues:
       | https://en.m.wikipedia.org/wiki/Calendar_queue
       | 
       | These things are common in discrete event simulation, but, are
       | useful anytime you have a time based queue.
        
       | Hippocrates wrote:
       | Log-structured Merge (LSM) Trees:
       | https://en.wikipedia.org/wiki/Log-structured_merge-tree
        
       | adontz wrote:
       | Succinct Trie http://stevehanov.ca/blog/?id=120
       | 
       | Super effective way to store a searchable list of items, like a
       | dictionary of words.
        
       | contingencies wrote:
       | The LIFO buffer. Also surfaces as the messy desk, the million
       | email inbox, or ten-thousand browser tabs. It really works at
       | progressively sorting frequently accessed items to the top, even
       | accounting for changing or seasonal habits and preferences.
        
       | pklausler wrote:
       | Briggs-Torczon sets of small integers (0 <= x < N) deserve to be
       | better known.
       | 
       | (Set up two N-element arrays "member" and "index" of ints and an
       | elementCount; set elementCount to zero. You don't have to
       | initialize the arrays if you can live with some valgrind
       | grumbling. The leading "elementCount" entries in "member" are the
       | current members of the set, unordered, and for each of those
        | values x, index[x] is its index in "member".
       | 
       | Then isInSet(x) = index[x] >= 0 && index[x] < elementCount &&
       | member[index[x]] == x and addToSet(x) = isInSet(x) ||
       | (member[index[x] = elementCount++] = x)
       | 
       | and rmFromSet() is left as an exercise.)
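        | 
        | The same scheme rendered in Python (Python won't let us skip
        | initializing the arrays, so the valgrind caveat disappears;
        | removal is still left as an exercise):

```python
class SparseSet:
    """Briggs-Torczon sparse set over integers 0 <= x < n.
    O(1) insert, membership test, and clear."""

    def __init__(self, n):
        self.member = [0] * n  # first 'count' slots hold the elements
        self.index = [0] * n   # index[x] = position of x in member
        self.count = 0

    def __contains__(self, x):
        i = self.index[x]
        return 0 <= i < self.count and self.member[i] == x

    def add(self, x):
        if x not in self:
            self.member[self.count] = x
            self.index[x] = self.count
            self.count += 1

    def clear(self):
        # O(1): just forget how many elements there are.
        self.count = 0
```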
        
       | carlopi wrote:
        | Not sure whether they qualify as obscure, but I haven't seen
        | these cited yet:
       | 
       | - Dominator tree
       | (https://en.wikipedia.org/wiki/Dominator_(graph_theory))
       | 
       | - Single-Connected-Components-Graph
       | 
        | - Deterministic data structures (e.g. a set that behaves
        | deterministically even though addresses may be randomly
        | assigned; very useful for ensuring reproducibility)
       | 
       | Already cited, but it's clearly among the most elegant:
       | 
       | - union-find (!!!!)
       | 
       | and as a bonus one that is easily overlooked:
       | 
       | -std::deque, that when restricted to push_back() or push_front()
       | guarantees not to ever move objects around.
        
       | vuonghv wrote:
        | I like the Ring Buffer; it plays nicely with CPU cache lines
        | and can be used to store an event queue.
        
       | dskloet wrote:
       | Invertible Bloom Lookup Tables. They're like Bloom filters but
       | you can also remove elements and even reconstruct elements in
       | certain cases. Useful for syncing 2 databases which are almost
       | in-sync, by sending only a small amount of data between them.
       | It's used by Bitcoin nodes to communicate the contents of newly
       | mined blocks.
        
       | mkindahl wrote:
       | Also, the algorithms around using Reduced Order Binary Decision
       | Diagrams (ROBDD) to represent sets efficiently. They are used in
       | verification algorithms with very large state spaces, such as
       | "Symbolic Model Checking: 10^20 states and beyond" by Burch,
       | Clarke, and McMillan
       | <http://www.cse.chalmers.se/edu/year/2012/course/TDA956/Paper...>
        
       | filoeleven wrote:
       | HAMT: Hash Array Mapped Trie. This data structure makes efficient
       | immutable data possible. You can update a list of a million
       | items, and keep a reference to the original list, by changing 3
       | or 4 references and some bytes.
       | 
       | This should replace copy-on-write for scripting languages. I
       | really want to see it in a JS spec soon. There are libraries that
       | can do it, but they add translation penalties and extra steps.
       | I'd complain less if I could use Clojurescript professionally,
       | because it and Clojure are built around them.
       | 
       | https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie
        
         | ksbrooksjr wrote:
         | If the Record & Tuple proposal advances to stage 3 we'll
         | finally have native immutable data structures in JS [1].
         | 
         | [1] https://github.com/tc39/proposal-record-tuple
        
           | jitl wrote:
            | I have been waiting aeons for this, and suspect I will
            | need to continue waiting effectively forever :(
        
         | mkirchner wrote:
         | This. Roughly a year ago I got interested in efficient
         | immutability for my write-from-scratch-in-C Lisp [0] and
         | started to write a HAMT implementation in C [1], along with a
         | (somewhat hacky, you have been warned) benchmarking suite [2].
         | 
         | The docs are only 70% done (in particular the "putting it all
         | together" part is missing) but it has been a really interesting
         | and enlightening journey so far and can only recommend
         | embarking on this path to everyone.
         | 
         | [0]: https://github.com/mkirchner/stutter [1]:
         | https://github.com/mkirchner/hamt [2]:
         | https://github.com/mkirchner/hamt-bench
        
           | ignoramous wrote:
           | See also Matt Bierner's HAMT impl in JavaScript:
           | https://blog.mattbierner.com/hash-array-mapped-tries-in-
           | java...
        
         | miclill wrote:
         | also Ctrie: "a concurrent thread-safe lock-free implementation
         | of a hash array mapped trie":
         | https://en.wikipedia.org/wiki/Ctrie
         | 
         | So basically like HAMT but lock free.
        
         | jahewson wrote:
         | That's what Immutable.js used under the hood.
        
           | filoeleven wrote:
           | Yes, I evaluated it. The complexity of knowing where to
           | convert from/to plain JS, plus the extra library syntax to
           | learn, plus the performance cost of toJS, made it a poor fit
           | for my particular use case. Nearly as much of a hard sell at
           | work as saying "let's rebuild the UI in Clojurescript," and
           | without providing as much benefit.
           | 
           | My use case is pretty atypical though, and it's worth
           | checking out if you have more reliable data shapes/update
           | paths.
        
             | bhl wrote:
              | Does immer have the same drawbacks as immutable? It uses
              | proxies as opposed to hash array mapped tries.
        
               | filoeleven wrote:
               | The key design difference between the two is that
               | immutable.js wants you to keep your stuff in its format
               | (and operate on your stuff with its functions) until you
               | hit some piece of code that needs a plain JS object.
               | Whereas immer is scoped (mostly?) to update functions,
               | and always returns plain JS. Immer seems to be designed
               | to simplify the problem of only changing references to
               | changed objects--and it looks like it does a good job of
               | it.
               | 
               | So with immutable.js, you get more powerful data
               | manipulation throughout the app, at the cost of having to
               | know when and where to convert things back to plain JS.
               | With immer, you get tightly-scoped immutable updates to
               | objects, and inside the update function you can treat
               | them as if they're mutable, letting you write more
                | idiomatic-looking JS. Instead of spreading 4 layers of
                | objects to update one state flag, immer lets you say:
                | 
                |     const nextState = produce(state, draft => {
                |       draft.view.marketing.annoyingSubscribePopup = false
                |     })
               | 
               | Every object reference in that path will be updated, but
               | the rest of "state", "state.view", etc. will be
               | unchanged.
               | 
               | If you can keep everything inside of immutable.js, it is
               | the fastest thing out there. As soon as you have to drop
               | back to plain JS, it gets slower. See this performance
               | graph: https://immerjs.github.io/immer/performance/
               | 
               | Thanks for reminding me of this. We finally dropped IE11
               | support, so may be able to get some benefits from
               | introducing immer either by itself or by bringing in
               | Redux Toolkit.
        
         | dan-robertson wrote:
         | Relatedly, RRB trees for immutable vectors with good constant
         | factors.
        
         | robalni wrote:
         | What I like about HAMTs is that they can be super simple to
         | implement if you make them one bit per level. They are like a
         | combination of a binary search tree and hash table but without
         | any of their annoyances.
         | 
         | * In binary search trees, you need to balance the tree every
         | time you insert something because the tree will be linear if
         | you insert the nodes in order. In a HAMT the positions are
         | determined by the hash, so the positions will be random-like
         | and not depend on the order you insert the nodes in.
         | 
         | * In binary search trees, you need to perform some complicated
         | steps to remove a node. In a HAMT, you can either just mark the
         | node as deleted, or if you want to fill the gap, find a leaf
         | node that has the removed node on its path and just put it
         | there.
         | 
         | * In hash tables, when it starts to get full you need to rehash
         | it or put the added item in the "wrong" slot or something. A
         | HAMT never gets full because it's a tree so you just add
         | another child node.
         | 
         | Here I have written an explanation of a simple HAMT:
         | https://www.robalni.org/posts/20220507-a-hash-table-you-dont...
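          | 
          | A rough Python rendering of the one-bit-per-level idea
          | described above (mutable and minimal, purely my own sketch;
          | see the linked post for the full treatment):

```python
class BinaryHamt:
    """One bit of the key's hash per level picks the left/right
    child, so the tree stays balanced regardless of insert order."""

    def __init__(self):
        self.root = None  # node: {'key', 'val', 'kids': [left, right]}

    def put(self, key, val):
        h = hash(key)
        if self.root is None:
            self.root = {'key': key, 'val': val, 'kids': [None, None]}
            return
        node = self.root
        while True:
            if node['key'] == key:
                node['val'] = val  # overwrite existing key
                return
            bit = h & 1
            h >>= 1
            if node['kids'][bit] is None:
                node['kids'][bit] = {'key': key, 'val': val,
                                     'kids': [None, None]}
                return
            node = node['kids'][bit]

    def get(self, key):
        # Follow the same hash-bit path, comparing keys at each node.
        h = hash(key)
        node = self.root
        while node is not None:
            if node['key'] == key:
                return node['val']
            node = node['kids'][h & 1]
            h >>= 1
        return None
```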
        
           | robalni wrote:
           | Here is an implementation of a binary HAMT that allows you to
           | add and find nodes:
           | 
           | https://gist.github.com/robalni/311afd0756f25c4f234b2ae332cd.
           | ..
        
         | WithinReason wrote:
         | Isn't this slow because of pointer chasing in the trie?
        
         | mklarmann wrote:
         | I do recall Asami using a HAMT and it is written in Clojure.
         | [1]: https://github.com/threatgrid/asami
        
         | WatchDog wrote:
         | I imagine careless use of such a structure would be an easy way
         | to create a memory leak. Is it possible to create persistent
         | collections in js, that will free data no longer directly
         | referenced?
        
           | filoeleven wrote:
           | The careless usage of shallow copies of objects (with JS
           | spreading) presents the same issue. Properly scoping assigned
           | constants and variables is still important.
           | 
           | Persistent collections are available today via immutable.js,
           | so it can be done. The catch is that you have to use a
           | library for it, and transform them back into plain JS when
           | something expects an object or an array. The language itself
           | could make this transparent and lower-cost by implementing it
           | at the engine level.
           | 
           | Persistent collections are primarily useful in functional
           | programming paradigms. JS is a multi-paradigmatic language,
           | so it doesn't make sense to use them as the default. It would
           | sure be nice to be able to opt into using them in FP-aligned
           | frameworks like the current React ecosystem though.
        
         | sfvisser wrote:
         | Another interesting approach to copy-on-write for immutable
         | collections (for example arrays) is where you actively mutate
         | the array in place, but leave the original as a lazy
         | description/view of how to get back to the before state.
         | 
         | From the outside the effect is the same, but the performance is
         | optimized for accessing the updated collection and only
         | possibly using the old value.
         | 
         | Great for cases where you want the immutability guarantee, but
         | where it might be unlikely the old value is actually going to
         | be used in the general case.
        
           | ignoramous wrote:
           | See also this cppcon talk by Juan Pedro
           | https://youtu.be/sPhpelUfu8Q
        
           | seunosewa wrote:
           | Databases do that.
        
         | CSMastermind wrote:
         | Isn't this what Sublime Text uses under the hood to give it
         | such good performance?
        
       | m12k wrote:
       | The Bag. Also known as a Multiset. I can't believe it took me so
       | many years to learn the name of the most basic data structure
       | imaginable - it basically makes no promises whatsoever about the
       | structure. Think of it like a Set that allows duplicates - you
       | can put stuff in and take stuff out, just like a bag - what more
       | could you want?
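        | 
        | In Python, collections.Counter is a ready-made bag/multiset:

```python
from collections import Counter

# A bag: order-free, duplicates allowed. Counter tracks multiplicity.
bag = Counter()
bag.update(['apple', 'apple', 'pear'])
bag['apple'] -= 1  # take one apple back out

# Unlike lists, bags compare equal regardless of insertion order.
same = Counter(['a', 'b', 'a']) == Counter(['b', 'a', 'a'])
```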
        
         | Gregam3 wrote:
         | I love the Bag! It's so simple I use it constantly
        
         | chirau wrote:
         | How is that any different from a list?
        
           | Gregam3 wrote:
            | Hopefully this Java can illustrate the point even if
            | you're not a Java lover.
            | 
            | The bag essentially works like a Map<K, Collection<V>>,
            | but looks like a Bag<K, V>.
            | 
            | I use my own simple "BagSet", which works like a Map<K,
            | Set<V>>.
        
           | joshlemer wrote:
           | Two lists are equal only if they have the same contents and
           | in the same order. Bags don't have an order, they're more
           | like sets, so bags are equal so long as they have the same
           | contents.
        
       | hansvm wrote:
       | Succinct data structures in general are interesting. They use
       | close to the information theoretic bound for a given problem and
       | still support fast operations -- a decent mental model is that
       | they're able to jump to a roughly correct place and (de)compress
       | the data locally.
       | 
        | A good example is a wavelet tree. Suppose you have a text of
        | length N drawn from an alphabet of C characters and wish to
        | execute full-text pattern searches for patterns of length P.
        | Once the tree is created, you can do your full-text search in
        | O(P log C) time, using less space than (1 + e) times the
        | information-theoretic lower bound, for any constant e > 0.
        | Note that the runtime is independent of the original string's
        | length.
        
         | PartiallyTyped wrote:
          | Another example is distance oracles. A distance oracle is a
          | two-stage object: one stage builds a cache of distances, and
          | the other answers queries against that cache. The oracle
          | gives an approximate distance between any two nodes while
          | reducing the quadratic memory requirement to something on
          | the order of n log k, with k query steps.
        
       | mikewarot wrote:
       | Trivial, but very useful - phase accumulator / numerically
       | controlled oscillator.
       | 
       | With two unsigned integers, freq and phase, you get amazingly
       | fine control over timing. You only use the top N bits of phase
       | for output. It's saved my bacon in embedded systems more than
       | once.
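        | 
        | A sketch of the idea with 32-bit arithmetic (the mask emulates
        | unsigned overflow; names are mine):

```python
MASK = 0xFFFFFFFF  # wrap phase to 32 bits, like unsigned overflow

def nco(freq, ticks, out_bits=1):
    # Numerically controlled oscillator: each tick adds freq to phase
    # (mod 2^32); only the top out_bits of phase are the output.
    # Output frequency = freq / 2^32 * tick rate.
    phase, out = 0, []
    for _ in range(ticks):
        phase = (phase + freq) & MASK
        out.append(phase >> (32 - out_bits))
    return out

# freq = 2^31 toggles the top bit every tick: a square wave at half
# the tick rate.
```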
        
       | ablaba wrote:
       | Succinct Data Structures
       | https://courses.csail.mit.edu/6.851/spring12/scribe/L17.pdf
        
       | imachine1980_ wrote:
        | Bitboards: using bit arrays instead of matrices is an
        | efficient way to handle board games like Tetris or chess. "A
        | bitboard is a specialized bit array data structure commonly
        | used in computer systems that play board games, where each bit
        | corresponds to a game board space or piece. This allows
        | parallel bitwise operations to set or query the game state, or
        | determine moves or plays in the game"
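        | 
        | A small sketch of the idea (the file/rank numbering is my own
        | choice):

```python
# Bitboard: an 8x8 board packed into one 64-bit integer, one bit per
# square.
def sq(file, rank):
    return rank * 8 + file

board = 0
board |= 1 << sq(4, 0)  # place a piece on e1
board |= 1 << sq(4, 7)  # and one on e8

# Precomputed mask for the whole e-file.
E_FILE = sum(1 << sq(4, r) for r in range(8))

# A single AND answers "any pieces on the e-file?" for the whole
# board at once; popcount gives how many.
on_e_file = bin(board & E_FILE).count('1')
```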
        
       | mtlmtlmtlmtl wrote:
        | N-dimensional arrays are pretty cool, and fun to implement. I
        | did one in C for an Advent of Code problem that switched from 3
        | to 4 dimensions between the two parts.
        
       | 6510 wrote:
        | I did an append-only list of URLs with RAW files where each bit
        | says something is true/false about the URL at that offset.
       | 
       | For example 26x26 RAW files asking if the page contains the
       | letter combination "aa" or "ab" all the way up to "zy" and "zz".
       | 
        | When one types a search query, after 2 letters a file is
        | pulled; at the 3rd letter we have a second 2-letter
        | combination, so we AND the two files together.
       | 
       | It is much like a bloom filter and tells you what is not in the
       | set.
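       | A rough Python sketch of the scheme (the URLs are made up, and
       | Python ints stand in for the RAW bit files):

```python
# One bitset per 2-letter combination; bit i says whether the URL at
# offset i contains that bigram. Queries AND the relevant bitsets, so,
# like a Bloom filter, the result can only rule URLs out.
from collections import defaultdict

urls = ["example.com/aardvark", "example.com/zebra", "example.com/abbey"]

bitmaps = defaultdict(int)            # bigram -> bitset over URL offsets
for i, url in enumerate(urls):
    for a, b in zip(url, url[1:]):
        bitmaps[a + b] |= 1 << i

def candidates(query):
    """URLs that may contain the query: no false negatives."""
    acc = (1 << len(urls)) - 1
    for a, b in zip(query, query[1:]):
        acc &= bitmaps[a + b]         # the AND operation from the comment
    return [u for i, u in enumerate(urls) if acc >> i & 1]
```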
        
         | compressedgas wrote:
         | That is a form of q-gram indexing.
        
       | markdown wrote:
       | > I'll start
       | 
       | You're supposed to post yours as a comment, so it can be voted on
       | like every other type of structure suggested here.
        
       | Philip-J-Fry wrote:
       | I want to write a content based pub/sub system. Where a
       | subscriber says "I want messages where field X = A, y = B and z >
       | C".
       | 
       | I've looked around and I've seen R+ Trees are potentially a
       | solution but there's very little info on how they're actually
       | created in code.
       | 
       | Does anyone know of any nice data structures to do this?
       | Obviously the most basic way is just indexing each field and that
       | works for simple equality but gets harder when you're trying to
       | do other comparators like < and >.
        
       | xyproto wrote:
       | Cuckoo filter
       | 
       | "A space-efficient probabilistic data structure that is used to
       | test whether an element is a member of a set, like a Bloom filter
       | does. False positive matches are possible, but false negatives
       | are not - in other words, a query returns either "possibly in
       | set" or "definitely not in set". A cuckoo filter can also delete
       | existing items, which is not supported by Bloom filters. In
       | addition, for applications that store many items and target
       | moderately low false positive rates, cuckoo filters can achieve
       | lower space overhead than space-optimized Bloom filters."
       | 
       | https://en.m.wikipedia.org/wiki/Cuckoo_filter
        
       | actually_a_dog wrote:
       | Previously on HN:
       | 
       | https://news.ycombinator.com/item?id=1370847 (2010)
       | 
       | https://news.ycombinator.com/item?id=2363522 (2011)
       | 
       | https://news.ycombinator.com/item?id=3369454 (2011) -- Although
       | not much is in the discussion, it points to a Stack Overflow post
       | with some examples.
       | 
       | https://news.ycombinator.com/item?id=7079427 (2014)
       | 
       | https://news.ycombinator.com/item?id=28758106 (2021) -- Not
       | exactly a data structure, but an interesting algorithm.
        
       | agilob wrote:
       | Not super obscure, but I remember that one specific time when
       | circular-linked-list made a lot of sense to use, well I wanted to
       | use it so I used it.
       | 
       | I had a bunch of API keys with poor expiry documentation and
       | implementation, so to find out if a key expired it had to be
       | used. I put it in a main "keys.pop loop" and all methods below
       | tried to use the key. If the HTTP response was (some other
       | obscure HTTP status code like) 505, I simply called `continue;` in the
       | loop to jump to another, without caring at all where I was
       | before.
       | 
       | https://github.com/atedja/gring
        
       | valbaca wrote:
       | FYI: https://en.wikipedia.org/wiki/List_of_data_structures
        
       | kaymanb wrote:
       | Disjoint-Sets have a very cool implementation whose amortized
       | time complexity is extremely slow growing. It is not quite
       | constant, but even for a disjoint-set with as many elements as
       | there are particles in the universe, the amortized cost of an
       | operation will be less than or equal to 4.
       | 
       | https://en.wikipedia.org/wiki/Disjoint-set_data_structure
        
         | EnderShadow8 wrote:
         | I've been told to consider it constant for all practical
         | purposes
        
           | pjscott wrote:
           | The inverse Ackermann function is one of the slowest-growing
           | functions that you're ever likely to encounter in the wild.
           | To say that it's "constant for all practical purposes" is
           | 100% true but doesn't do justice to how amazingly slow this
           | function is to grow.
           | 
           | https://en.m.wikipedia.org/wiki/Ackermann_function
        
         | kragen wrote:
         | aka "union find" -- some other comments in the thread call it
         | that
        
         | andreareina wrote:
         | Used the idea[1] at work to make large NP-complete problems
         | more tractable by breaking the problem up into disjoint subsets.
         | 
         | [1] doesn't implement the inverse-Ackermann algorithm, but is
         | still implemented as union/find.
        
       | jedberg wrote:
       | I saw this and bloom filter was my very first thought, so I'm
       | glad you had it.
       | 
       | We used it to check if you had voted on a discussion on reddit.
       | In almost all cases the user had _not_ voted, so it was a lot
       | quicker to check that you hadn't voted than that you had.
        
       | icsa wrote:
       | [Ordered] Minimal Perfect Hash Functions with O(N) construction
       | time.
       | 
       | They have been around for 30 years and I don't see them used in
       | practice.
        
         | HelloNurse wrote:
         | Fixed key sets are uncommon.
        
       | felixr wrote:
       | The SO question "What are the lesser known but useful data
       | structures?" has a good list
       | 
       | https://stackoverflow.com/questions/500607/what-are-the-less...
       | 
       | [edit: the link above has been discussed on HN multiple times
       | https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...]
        
       | loxias wrote:
       | (Fantastic post idea OP. One of the best I've ever seen :D)
       | 
       | Related to bloom filters, xor filters are faster and more memory
       | efficient, but immutable.
       | 
       | HyperLogLog is an efficient way to estimate cardinality.
       | 
       | Coolest thing I've learned recently was Y-fast trie. If your
       | dataset M is bounded integers (say, the set of all 128 bit
       | numbers), you get membership, predecessor, or successor queries
       | in log log time, not log, like in a normal tree.
       | 
       | see:
       | https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu...
       | (6.851 Advanced Data Structures, Erik Demaine)
       | 
       | Would love to learn more "stupidly fast at the cost of conceptual
       | complexity" things.
       | 
       | edit:
       | 
       | (adding) I don't know a name for it, because it's not one thing
       | but a combination, but one can:
       | 
       | Use the first N bits from a very quick hash of a key from an
       | unknown distribution (say a file path, or a variable name, or an
       | address, or a binary blob,..) as a way to "shard" this key across
       | M other fast data structures (like a tree) for search/add/remove.
       | By changing M, you can tune the size of the terms in the
       | O(1)+O(log) equation for running time.
       | 
       | Trees getting too deep for fast search? Every increase of N moves
       | the computation from the log search of the tree to the space
       | tradeoff of having more trees.
       | 
       | Added benefit is you can _scale to multiple threads easily_.
       | Instead of locking the whole tree, you lock a tiny sub-tree.
       | 
       | Very clever. (I first saw in the Aerospike key value store)
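       | A rough sketch of that sharding trick in Python (dicts stand in
       | for the per-shard trees; the hash choice and bit counts are
       | illustrative):

```python
# Use the top bits of a fast hash to pick one of M independent
# structures; growing M_BITS trades memory for shallower per-shard trees.
import hashlib

M_BITS = 4                               # 2^4 = 16 shards
shards = [dict() for _ in range(1 << M_BITS)]

def shard_for(key):
    h = int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")
    return shards[h >> (64 - M_BITS)]    # top M_BITS bits select the shard

def insert(key, value):
    shard_for(key)[key] = value          # a threaded version locks only this shard

def lookup(key):
    return shard_for(key).get(key)
```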
        
         | tresil wrote:
         | Big fan of HLL
         | 
         | Apache foundation has a fantastic DataSketches library that
         | includes HLL and many other powerful data analytics algorithms:
         | https://datasketches.apache.org/
         | 
         | Lee Rhodes has done an excellent introduction to this library -
         | explaining some of the use cases, advantages, and things to be
         | aware of when using these techniques:
         | https://www.youtube.com/watch?v=nO9pauS-mGQ
        
           | twic wrote:
           | On sketches, there is a genre of structure for estimating
           | histogram-like statistics (median, 99th centile, etc) in
            | fixed space, which I really like. Two examples:
           | 
           | t-digest https://github.com/tdunning/t-digest
           | 
           | DDSketch https://github.com/DataDog/sketches-java
        
             | arunk-s wrote:
             | In the same spirit there is also Circllhist from Circonus.
             | https://arxiv.org/abs/2001.06561
             | 
             | There was a good discussion on approaches to store
             | histogram here:
             | https://news.ycombinator.com/item?id=31379383
        
         | lorenzhs wrote:
         | If you enjoyed XOR filters, you might also like ribbon filters,
         | something that I had the pleasure of working on last year. They
         | share the basic idea of using a system of linear equations, but
         | instead of considering 3 random positions per key, the
         | positions to probe are narrowly concentrated along a ribbon
         | with a typical width of 64. This makes them far more cache-
         | efficient to construct and query.
         | 
         | By purposefully overloading the data structure by a few per
         | cent and _bumping_ those items that cannot be inserted as a
         | result of this overloading to the next layer (making this a
         | recursive data structure), we can achieve almost arbitrarily
         | small space overheads:  <1% is no problem for the fast
         | configurations, and <0.1% overhead with around 50% extra
         | runtime cost. This compares to around 10% for XOR filters and
         | >= 44% for Bloom filters.
         | 
         | In fact, I'm going to present them at a conference on Monday -
         | the paper is already out:
         | https://drops.dagstuhl.de/opus/volltexte/2022/16538/pdf/LIPI...
         | and the implementation is at https://github.com/lorenzhs/BuRR/.
         | I hope this isn't too much self-promotion for HN, but I'm super
         | hyped about this :)
        
         | hendler wrote:
         | I like https://en.wikipedia.org/wiki/Cuckoo_filter which allows
         | for delete, unlike Bloom
        
         | blueblob wrote:
         | Sounds a bit like radix sort?
        
           | loxias wrote:
           | AIUI Radix sort "kinda ish" related, yeah :) They both
           | involve looking at your data at a 90 degree angle and slicing
           | on bitplanes.
        
         | bogomipz wrote:
         | I forgot about Aerospike. They basically built a NAND optimized
         | key, value store right? I remember reading about how they used
          | the FTL and thinking they were pretty clever. I can't for the
         | life of me find the article now. I think they were really big
         | in the ad tech space? Is that still the case?
        
           | loxias wrote:
           | "NAND optimized key value store" doesn't do it justice ;-)
           | The fact that it's SSD optimized has nothing to do with key
           | sharding across trees, the latter is what gets you the
           | absurdly low latency and near infinite scale out. This link
           | gives an overview:
           | https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf And it's
           | open source...
        
         | bglazer wrote:
         | I think what you're describing in the last example is the
         | MinHash algorithm
         | 
         | edit: actually I'm not sure, sounds related though
        
         | champtar wrote:
         | A fantastic thing about HyperLogLog is that it can be merged,
         | so you can split your data between multiple servers, precompute
         | HLL for all IPs every minute, and then ask "how many unique IPs
         | were there yesterday".
         | 
         | Discovered HLL because it's used in ClickHouse, which employs a
         | ton of cool but obscure data structures.
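         | The merge itself is just an elementwise max over the two
         | sketches' registers, which is why it composes across servers and
         | time windows. A sketch (registers are illustrative; a real HLL
         | also needs the cardinality estimator on top):

```python
# Merging two HyperLogLog register arrays: each register holds the max
# leading-zero count seen, so the union of two sketches is the
# elementwise max. The operation is commutative and associative.
def merge_hll(regs_a, regs_b):
    return [max(a, b) for a, b in zip(regs_a, regs_b)]
```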
        
           | Xorlev wrote:
           | Works well in analytics cubes since they can be combined.
           | 
           | You can retain them across time too, such that you can ask
           | questions like "how many unique users were there over the
           | last N days?" without needing the source data. Great for
           | privacy-aware analytics solutions.
        
             | tarun_anand wrote:
             | Love DataSketches but I was wondering if there is a way to
             | compute datasketches across time for e.g. I want to compute
             | the users who did X and then Y in that order. Since
             | intersection is commutative it doesnt give an answer for
             | time ordering.
             | 
             | Nonetheless the best data structure I have read over last
             | 10 years.
        
         | zbobet2012 wrote:
         | Y-fast tries are some of my favorites. I think they are heavily
         | underutilized in modern terms. They sat by the wayside for a
         | long time because datasets were relatively small, large RAM
         | didn't exist at the time they were created, and bitwise
         | operations were inefficient, along with many other constant
         | factors.
         | 
         | Today, however, a lot of people have datasets on the order of
         | 2^16 or 2^32 keys they need to maintain. And efficient bitwise
         | operations (on up to 512 bits) are the norm. Y-fast tries are
         | faster than B-trees or other data structures. Also, because
         | they divide _space_, not the _data_, they are very amenable to
         | multi-threaded and distributed algorithms. For example, the
         | hash tables in a y-fast trie can actually be a rendezvous hash
         | pointing to a database node. Once in that node you can hash on
         | cores again to get to a local process, for example.
        
           | FridgeSeal wrote:
           | > Also because they divide _space_ not the _data_ they are
           | very amenable to multi-threaded and distributed algorithms.
           | 
           | Ha, I was thinking about this idea the other day and couldn't
           | figure out a good way to search for anything already written
           | on the topic. I suspect there's quite a lot of ground to be
           | gained in the area.
        
           | loxias wrote:
           | I want to hear more, esp about the distributed applications,
           | do you have any useful links, or can I buy you an "e coffee"
           | to pick your brain for a few minutes?
        
       | beached_whale wrote:
       | I find the R-tree beautiful, though it's not in as much use
       | these days.
        
       | kaylynb wrote:
       | Sparse sets.
       | 
       | They're often used in Entity Component System architectures since
       | you have O(1) add/remove/find, & O(n) iteration. Iterations are
       | also packed, which is a bonus for cache efficiency.
       | 
       | Can be difficult to implement well, but the concept is simple and
       | a neat example of a useful specialty data structure. Take a look
       | at https://github.com/skypjack/entt
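       | A minimal sketch of the classic sparse-set layout (assumes
       | capacity-bounded integer keys):

```python
# Sparse set: O(1) add/remove/contains, with iteration over a packed
# dense array for cache efficiency.
class SparseSet:
    def __init__(self, capacity):
        self.sparse = [0] * capacity   # value -> position in dense
        self.dense = []                # packed values; iterate over this

    def contains(self, v):
        i = self.sparse[v]
        return i < len(self.dense) and self.dense[i] == v

    def add(self, v):
        if not self.contains(v):
            self.sparse[v] = len(self.dense)
            self.dense.append(v)

    def remove(self, v):
        if self.contains(v):
            last = self.dense.pop()
            if last != v:              # swap the last element into the hole
                i = self.sparse[v]
                self.dense[i] = last
                self.sparse[last] = i
```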
        
         | hgs3 wrote:
         | You can apply the same idea to hash tables: store hashes and
         | dense-array indices in your sparse array; during lookup, use
         | robin hood hashing to cache-efficiently probe the sparse array
         | for a matching hash value, and if a match is found use the
         | dense-array index to look up the data in the packed/dense
         | array. This approach is both cache efficient and space
         | efficient, as the dense array can grow independently of the
         | sparse array.
        
       | andrenotgiant wrote:
       | Can I describe a data queueing problem that I feel like there is
       | a specific data (or queue) structure for, but that I don't know
       | the name is?
       | 
       | Let's say you are trying to "synchronize" a secondary data store
       | with a primary data store. Changes in the primary data store are
       | very "bursty", one row will not change for days, then it'll
       | change 300 times in a minute. You are willing to trade a bit of
       | latency (say 10 seconds) to reduce total message throughput. You
       | don't care about capturing every change to the primary, you just
       | want to keep it within 10 seconds.
       | 
       | It feels like there should be a clever way to "debounce" an
       | update when another update overrides it 500ms later. I know
       | debounce from the world of front-end UI where you wait a little
       | when doing autocomplete search based on keyboard input so as not
       | to overwhelm the search.
        
         | dabears wrote:
         | Ideally you have a reliable Change Data Capture (CDC) mechanism
         | like a Binlog Reader. Debezium, for example, can write directly
         | to a queue like Kafka. A Kafka consumer picks up the events and
         | writes to your secondary datastore. Something like that can
         | probably handle all of your events without bucketing them, but
         | if you want to cut down the number of messages written to the
         | queue you can add that logic into the Binlog Reader so it emits
         | a burst every 5 seconds or so. During those 5 seconds it
         | buffers the messages in the processes memory or externally in
         | something like Redis using a key so only the latest message is
         | stored for a given record.
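         | A minimal sketch of that keep-only-the-latest buffering (the
         | interval and the plumbing around it are illustrative):

```python
# Collapse bursty per-record updates to the most recent one, flushing
# the whole batch on an interval.
import time

class ChangeBuffer:
    def __init__(self, flush_every=5.0):
        self.latest = {}               # record key -> most recent change
        self.flush_every = flush_every
        self.last_flush = time.monotonic()

    def record(self, key, change):
        self.latest[key] = change      # 300 bursty updates collapse to one

    def maybe_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_flush < self.flush_every:
            return None
        batch, self.latest = self.latest, {}
        self.last_flush = now
        return batch                   # write this to the secondary store
```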
        
         | 8note wrote:
         | I've seen this done with kinesis streams.
         | 
         | Basically just update a cache, and forward the results every so
         | often.
         | 
         | If you put things in a stack, you can keep using the most
         | recent for the update. Compare times and you won't add a bad
         | value
        
         | js2 wrote:
         | Sentry used Redis to buffer writes for a similar use case:
         | 
         | https://blog.sentry.io/2016/02/23/buffering-sql-writes-with-...
        
         | nl wrote:
         | If you want everything to be within 10 seconds, then you build
         | a state-change tracker (which only tracks all state changes
         | since last update) and then you send the updates every 10
         | seconds.
         | 
         | Don't worry about debouncing - the state tracker should handle
         | representing the 300 updates as a single state change, and if
         | there are more then they just get sent in the next update 10
         | seconds later.
        
         | zbobet2012 wrote:
         | There are a variety of solutions to this but CRDTs are a very
         | good one (CRDTs solve your problem
         | (https://en.wikipedia.org/wiki/Conflict-
         | free_replicated_data_...). If the operations you're doing
         | commute (that is a * b = b * a, e.g. the order in which you
         | apply the operations doesn't matter) then one could apply all
         | operations in parallel, and only send the final result of doing
         | them all. Casandra uses LWW-Element-Set CRDTS to solve this
         | exact problem (https://cassandra.apache.org/doc/latest/cassandr
         | a/architectu...).
        
       | philosopher1234 wrote:
       | Linked list, but where each entry in the list is a fixed-size
       | array. O(1) inserts into the array. If the array fills, you split
       | it into two in the linked list. Amortizes the cost of traversing
       | list pointers.
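       | This is commonly known as an unrolled linked list; a minimal
       | Python sketch (the capacity and split policy are illustrative):

```python
# Unrolled linked list: each node holds a small fixed-capacity array,
# split in half when it overflows.
CAP = 4

class Node:
    def __init__(self, items=None):
        self.items = items or []
        self.next = None

class UnrolledList:
    def __init__(self):
        self.head = Node()

    def insert(self, node, i, value):
        """Insert into node's array; split the node when it overflows."""
        node.items.insert(i, value)
        if len(node.items) > CAP:          # split into two linked nodes
            mid = len(node.items) // 2
            right = Node(node.items[mid:])
            right.next = node.next
            node.items = node.items[:mid]
            node.next = right

    def __iter__(self):
        node = self.head
        while node:
            yield from node.items
            node = node.next
```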
        
       | CreRecombinase wrote:
       | Roaring bitmaps are compressed bitmaps and they're pretty great.
       | https://roaringbitmap.org/about/
        
       | didip wrote:
       | Do you know about:
       | 
       | * HyperLogLog? Convenient for doing approximate counting across
       | huge datasets.
       | 
       | * SkipList is another probabilistic data structure that allows
       | you to skip ahead N elements in 1 step. I believe ClickHouse uses
       | it.
       | 
       | * Bitmap index organizes database pointers into a matrix of bits.
       | The bitmap scans simply skip over the zeros. This type of index
       | gets in trouble if you have high cardinality, though.
        
         | pjscott wrote:
         | Lots of companies use skiplists without realizing it because
         | Redis uses them as the implementation of sorted sets. For
         | example, last time I checked, every time someone looks at a
         | playlist on PornHub the software is doing a partial traversal
         | of a skiplist behind the scenes.
        
       | winrid wrote:
       | Structures good for Geospatial information like rtrees,
       | quadtrees.
       | 
       | https://en.m.wikipedia.org/wiki/R-tree
       | 
       | Also concurrent data structures.
       | 
       | https://youtu.be/jcqGSehrMGU
        
         | winrid wrote:
         | Also LSM trees!
        
         | bragr wrote:
         | I've used quadtrees for some cool stuff I can't really talk
         | about. All the things you think binary trees are good at but
         | with x,y coordinates, or any n-tuple of coordinates as you
         | suggest.
        
         | TrackerFF wrote:
         | Yes, R-tree turned out to be a godsend when working with larger
         | geospatial data and calculating intersections.
        
           | winrid wrote:
           | Yep, that's what I used it for! Intersections. Hard to make
           | concurrent though. Ended up with concurrent map backed
           | geohash.
        
       | mamcx wrote:
       | I made https://github.com/mamcx/tree-flat, a flattened tree
       | stored in pre-order that allows very fast iteration, even for
       | child/parent queries. It's based on APL, so not that novel.
       | 
       | I also really like the relational model; it's not that well
       | represented, so I'm making a language on top of it:
       | https://tablam.org.
        
         | kragen wrote:
         | Nice! I'm curious to see where you take TablaM. I agree that
         | the relational model is not nearly as well represented as a
         | basis for programming language semantics as one might hope.
        
         | leeoniya wrote:
         | https://en.m.wikipedia.org/wiki/Nested_set_model
         | 
         | ?
        
           | mamcx wrote:
            | Yeah, that one is good, but I bet it is possible to encode
            | trees efficiently as relations without indirections.
            | 
            | That is part of my aim with `tree-flat`, and I wanna base
            | trees on that for my lang.
        
       | nl wrote:
       | I came here to say Golomb compressed sets except now I see it's
       | part of the question!
       | 
       | They are used by default in the OpenMined implementation of
       | Private Set Intersection[1] - a multi-party computation
       | technique.
       | 
       | [1]
       | https://github.com/OpenMined/PSI/blob/master/private_set_int...
        
       | jtolmar wrote:
       | The union-find data structure / algorithm is useful and a lot of
       | fun.
       | 
       | The goal is a data structure where you can perform operations
       | like "a and b are in the same set", "b and c are in the same set"
       | and then get answers to questions like "are a and c in the same
       | set?" (yes, in this example.)
       | 
       | The implementation starts out pretty obvious - a tree where every
       | element either points at itself or some thing it was merged with.
       | To check if two elements are in the same set, check if they have
       | the same parent. Without analyzing it, it sounds like you'll
       | average findRoot() performance of O(log(n)), worst-case O(n).
       | 
       | There are a couple of simple optimizations you can do to this
       | structure, the type of things that seem like they shouldn't end
       | up affecting asymptotic runtime all that much. The first is that,
       | whenever you find a root, you can re-parent all the nodes you
       | visited on the way to that root, so they'll all be quicker to
       | look up next time. The other is that you keep track of the size
       | of sets, and always make the larger set be the parent of the
       | smaller.
       | 
       | And neither of those actually do anything impressive alone, but
       | if you use both, the algorithm suddenly becomes incredibly fast,
       | with the slowest-growing (non-constant) complexity I've ever
       | heard of: O(the inverse of the Ackermann function(n)). Or, for
       | any reasonable N, O(4 or less).
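       | Both optimizations fit in a few lines of Python (a sketch, not
       | tuned):

```python
# Disjoint-set / union-find with path compression and union by size.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:   # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:      # path compression: re-parent
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # larger set becomes the parent
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```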
        
         | rag-hav wrote:
         | I wrote a beginner's guide on this topic a while ago
         | https://medium.com/nybles/efficient-operations-on-disjoint-s...
        
         | contificate wrote:
         | Agreed, union-find is great.
         | 
         | My favourite usage of it in practice is for solving first-order
         | (syntactic) unification problems. Destructive unification by
         | means of rewriting unbound variables to be links to other
         | elements is a very pretty solution - especially when you
         | consider the earlier solutions such as that of Robinson's
         | unification which, when applied, often involves somewhat error-
         | prone compositions of substitutions (and is much slower).
         | 
         | The fact that the forest of disjoint sets can be encoded in the
         | program values you're manipulating is great (as opposed to,
         | say, the array-based encoding often taught first): makes it
         | very natural to union two elements, chase up the tree to the
         | set representative, and only requires minor adjustment(s) to
         | the representation of program values.
         | 
         | Destructive unification by means of union-find for solving
         | equality constraints (by rewriting unbound variables into links
         | and, thus, implicitly rewriting terms - without explicit
         | substitution) makes for very easy little unification algorithms
         | that require minimal extension to the datatypes you intend to
         | use and removing evidence of the union-find artefacts can be
         | achieved in the most natural way: replace each link with its
         | representative by using find(l) (a process known as "zonking"
         | in the context of type inference algorithms).
         | 
         | Basic demonstration of what I'm talking about (the incremental
         | refinement of the inferred types is just basic union-find
         | usage): https://streamable.com/1jyzg8
        
         | IanCal wrote:
         | This sounds super useful thanks! I think I'll have something
         | this can be used for, with individual claims of whether two
         | elements are in the same set, then easily getting the largest
         | set of equal items.
        
         | JamesUtah07 wrote:
         | Is there a name for the data structure/algorithm?
        
           | lenocinor wrote:
           | https://en.m.wikipedia.org/wiki/Disjoint-set_data_structure
        
           | whoisburbansky wrote:
           | It's called union-find, but also disjoint-set sometimes. The
           | process of reparenting nodes is called path compression, and
           | the optimization of using the larger set of the two as the
           | parent when merging is usually just called "union by rank."
           | Any of those terms should give you more details upon search.
        
         | kizer wrote:
         | If at some point you have a cycle, you can then reparent and
         | remove the cycle, right? This structure in general then can
         | encode transitive relations effectively?
         | 
         | Something like "a and b ..." "b and c ..." "c and a ...".
        
           | thaumasiotes wrote:
           | > If at some point you have a cycle, you can then reparent
           | and remove the cycle, right?
           | 
           | You'd never have a cycle. The way a cycle would theoretically
           | arise would have to be joining something to its own child.
           | But you don't naively join the two objects you're given - you
           | find their root objects, and join those. If - as would be
           | necessary for a cycle to form - they have the same root, you
           | can return without doing anything.
           | 
            | If we call
            | 
            |     unite(a,b)
            |     unite(b,c)
            |     unite(c,a)
            | 
            | then we should end up with this structure:
            | 
            |       c
            |      / \
            |     b   a
           | 
           | With parents on the right, the structure will be a-b after
           | the first call and a-b-c after the second call. The reason we
           | end up with a shallower tree after the third call is that
           | when a is passed to unite, we call find on it and it gets
           | reparented directly to c. Then, c (the representative for c)
           | and c (the representative for a) are already related, so we
           | don't adjust the structure further.
           | 
           | > This structure in general then can encode transitive
           | relations effectively?
           | 
           | Well... no. This structure embodies the concept of an
           | equivalence relation. You wouldn't be able to use it to
           | express a general transitive relation, only a transitive
           | relation that is also symmetric and reflexive.
        
         | kragen wrote:
         | My writeup of union find is in
         | https://dercuano.github.io/notes/incremental-union-find.html. I
         | did not in fact find a way to make it efficiently support
         | incremental edge deletion, which is what I was looking for.
        
           | thaumasiotes wrote:
           | > I did not in fact find a way to make it efficiently support
           | incremental edge deletion, which is what I was looking for.
           | 
           | I don't understand this goal. The interior connections aren't
           | relevant to a union-find structure; ideally you have a bunch
           | of trees of depth 2 (assuming the root is at depth 1), but
           | the internal structure could be anything. That fact
           | immediately means that the consequence of removing an edge is
           | not well-defined - it would separate the two objects joined
           | by the edge, and it would also randomly divide the original
           | connected set into two connected sets, each containing one of
           | those two objects. Which object each "third party" object
           | ended up being associated with would be an artifact of the
           | interior structure of the union-find, which is not known and
           | is constantly subject to change as you use the union-find.
           | 
            | If all you want is to be able to remove _a single object_
            | from a connected set, on the assumption that your union-find
            | always has the ideal flat structure, that's very easy to do
           | - call find on the object you want to remove, which will link
           | it directly to the root of its structure, and then erase that
           | link.
        
             | kragen wrote:
              | Union find takes as input some undirected graph <N, E>
              | and internally constructs (and progressively mutates) a
              | directed graph <N, E'> which it uses to efficiently
              | answer queries about whether two nodes n1, n2 ∈ N are in
              | the same connected component of <N, E>. It additionally
              | supports incrementally adding edges to E.
             | 
              | My quest was to find a way to incrementally delete edges
              | from E, not E'. You're talking about deleting edges from
              | E', which I agree is not generally a useful thing to do.
             | 
              | For example, your N might be the cells of a maze map,
              | and your E might be the connections between adjacent
              | cells that are not separated by a wall. In that case you
              | can tear down a wall and add a corresponding edge to E.
              | But it would be nice in some cases to rebuild a wall,
              | which might separate two previously connected parts of
              | the maze. I was looking for an efficient way to handle
              | that, but I didn't find one.
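The maze example above maps directly onto a textbook union-find. As a reference point for the add-edge direction the structure does support, here is a minimal Python sketch (illustrative code with path halving and union by size, not code from the post):

```python
# Minimal union-find: cells are nodes, tearing down a wall is union().
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # union by size
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def connected(self, a, b):
        return self.find(a) == self.find(b)

# 2x2 maze, cells 0..3; tear down the walls 0-1 and 1-3
uf = UnionFind(4)
uf.union(0, 1)
uf.union(1, 3)
print(uf.connected(0, 3))  # True
print(uf.connected(0, 2))  # False
```

Rebuilding the wall 1-3 is exactly the deletion from E that this structure, as written, cannot undo cheaply.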
        
               | thaumasiotes wrote:
               | > Union find takes as input some undirected graph <N, E>
               | and internally constructs (and progressively mutates) a
               | directed graph <N, E'> which it uses to efficiently
                | answer queries about whether two nodes n1, n2 ∈ N are
               | in the same connected component of <N, E>.
               | 
               | I don't think this is a valuable way to think about the
               | structure. That's not what it's for.
               | 
               | A union-find is a direct representation of the
               | mathematical concept of an equivalence relation. The
               | input is a bunch of statements that two things are equal.
               | It will then let you query whether any two things are or
               | aren't equal to each other.
               | 
               | This is captured well by the more formalized name
               | "disjoint-set structure". You have a set of objects. You
               | can easily remove an element from the set. But it makes
               | no conceptual sense to try to "separate two of the
               | elements in a set". They're not connected, other than
               | conceptually through the fact that they are both members
               | of the same set.
               | 
               | A union-find is a good tool for answering whether two
               | vertices are in the same connected component of a graph
               | _because being in the same connected component of a graph
                | is an equivalence relation_, not because the union-find
               | is a graph-related concept. It isn't, and there is no
               | coherent way to apply graph-theoretical concepts to it.
               | 
               | Putting things another way:
               | 
               | You describe a union-find as something used to
               | "efficiently answer queries about whether two nodes n1,
                | n2 ∈ N are in the same connected component of [a
               | graph]".
               | 
               | But you focused on the wrong part of that statement. What
               | makes a union-find appropriate for that problem is the
               | words "the same", not the words "connected component".
        
               | kragen wrote:
               | You are welcome to think about the union-find structure
               | however you prefer to think about it, but I was
               | describing the problem I was trying to solve, for which
               | the correct description of union find I gave above is
               | optimal.
               | 
               | If your way of thinking about union find makes it hard
               | for you to understand the problem I was trying to solve,
               | maybe it isn't the best way to think about it for the
               | purpose of this conversation, even if it is the best way
               | to think about it in some other context.
               | 
               | I'm not claiming my description is the _only_ correct
                | description of union find, just that it's the one that
               | most closely maps to the problem I was trying to solve.
               | 
                | The description can equally well be formulated in the
                | relational terms you prefer: given a relation R, union
                | find can efficiently tell you whether, for any given x
                | and y, (x, y) ∈ R', where R' is the symmetric
                | transitive reflexive closure of R (and is thus the
                | smallest equivalence relation containing R). It
                | efficiently supports incrementally extending R by
                | adding new pairs (a, b) to it.
               | 
                | This is equivalent to my graph-theoretic explanation
                | above except that it omits any mention of E'. However,
               | it is longer, and the relationship to the maze-
               | construction application is less clear. Perhaps you
               | nevertheless find the description clearer.
               | 
               | What I was looking for, in these terms, is a variant of
               | union find that also efficiently supports incrementally
                | _removing_ pairs from R.
               | 
               | Does that help you understand the problem better?
        
         | divan wrote:
         | I remember Ravelin shared their story of using it in card fraud
         | detection code.
         | 
          | Basically they managed to replace a full-fledged graph
          | database with a small piece of code using union-find in Go.
          | It was a great talk.
         | 
         | https://skillsmatter.com/skillscasts/8355-london-go-usergrou...
        
         | MichaelMoser123 wrote:
         | Can you give some examples where union-find is applied to great
         | benefit?
        
           | jobigoud wrote:
           | I used it to find and track islands of connected polygons in
           | 3D meshes.
        
           | fumblebee wrote:
           | A while back I had to perform an entity resolution task on
           | several million entities.
           | 
           | We arrived at a point in the algorithm where we had a long
           | list of connected components by ids that needed to be reduced
           | into a mutually independent set of components, e.g. ({112,
           | 531, 25}, {25, 238, 39, 901}, {43, 111}, ...)
           | 
            | After much head-banging about working out a way to do this that
           | wouldn't lead to an out-of-memory error, we found the
           | Disjoint Set Forest Data Structure and our prayers were
           | answered.
           | 
           | I was very grateful for it!
        
             | BeetleB wrote:
             | Exactly the same for me. We had a defect management system,
             | and a defect can be connected to others as dupes. We'd have
             | whole chains of dupes. We used union find to identify them.
        
           | superjan wrote:
            | One example is connected component analysis on images. In
            | one linear scan of an image (2D or 3D) you can find all
            | connected blobs.
        
         | mentater wrote:
         | Wouldn't that be extremely useful and a polynomial solution for
         | 3-SAT, which is NP complete?
        
           | HelloNurse wrote:
            | SAT's complexity comes from blowing up a reasonable number
            | of variables/clauses (very rarely equivalent to others) in
            | the formula into an enormous number of variable
            | assignments (too many to store, and with only three
            | classes -- formula true, false, or not yet evaluated --
            | which are not useful equivalences). What disjoint sets are
            | you thinking of tracking?
        
           | jcranmer wrote:
           | I think I understand why you're confused here. To the extent
           | that union-find is similar to SAT, it's similar to 2-SAT: the
           | constraints you're reasoning about are defined only on pairs
           | of elements, whereas 3-SAT is fundamentally reasoning about
           | triplets (at least one of these three variables is true).
           | Notably, 2-SAT is a variant of SAT that _is_ polynomial in
           | time, unlike 3-SAT.
        
           | xigoi wrote:
           | How is it a solution to 3-SAT?
        
         | endgame wrote:
         | I have heard stories about annotating the edges with elements
         | from a group rather than using unlabelled edges. Have you heard
         | anything about that?
        
           | eutectic wrote:
           | You can efficiently maintain count/max/sum/hash/etc. for each
           | component. It's occasionally useful.
        
         | anewpersonality wrote:
         | I remember getting this in an interview without ever coming
         | across it.. basically, fuck union find.
        
         | [deleted]
        
         | baby wrote:
         | We use it to wire different outputs and inputs to gates in a
         | zero-knowledge proof circuit!
        
         | hexomancer wrote:
         | I have always wanted to really understand this data structure.
         | Sure, I can follow the analysis with the potential functions
         | and all, but I never really understood how Tarjan came up with
         | the functions in the first place. Does anybody have a resource
         | which intuitively explains the analysis?
        
           | sn9 wrote:
           | Robert Sedgewick has a great explanation in his _Algorithms_
           | book. He has gorgeous diagrams alongside his explanations.
        
             | hexomancer wrote:
             | Is it in an old edition? I just downloaded an extremely
                | legitimate version of Algorithms 4th edition but it
                | has no section on the union-find data structure.
        
               | sn9 wrote:
               | It should be in Chapter 1 Section 5 of the 4th edition.
               | It's even in the table of contents.
        
           | hashingroll wrote:
            | This might be a long read, but it is a well-written,
            | reader-friendly analysis of Tarjan's proof, with the full
            | chain of reasoning.
           | https://codeforces.com/blog/entry/98275
        
           | simonlindholm wrote:
           | Here's a writeup I made a while ago:
           | https://www.overleaf.com/read/hjbshsqmjwpg
        
         | zimpenfish wrote:
         | Oh, huh, this is serendipitously handy because I need to be
         | doing something like that this weekend (translating from Go to
         | Rust - I already have a reparenting solution using hashes but
         | hopefully this is easier / better.)
        
       | FullyFunctional wrote:
        | My absolute favorite, Cuckoo hashing, has been mentioned
        | already, but so far (239 comments in) nobody has mentioned
        | this approach (does it have a name?) for storing an array A[]
        | of N-bit numbers, most of which are zero or small: use N sets
        | of numbers S[0..N-1], such that for each A[I] you store I in
        | S[J] for every bit position J where A[I] has a set bit. In
        | other words, S[J] holds the indices of the elements of A that
        | have a set bit in position J.
       | 
       | The representation of S can be anything, and even dynamically
       | adapting to the data. I've seen trees of bitmaps for example.
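A quick illustrative sketch of that per-bit index-set layout in Python (the function names are mine; as noted, the sets could equally be bitmaps or trees):

```python
# Per-bit index sets: S[j] holds the indices i whose A[i] has bit j set.
# Zero entries of A cost nothing; small values touch few sets.
N_BITS = 8

def encode(a):
    s = [set() for _ in range(N_BITS)]
    for i, v in enumerate(a):
        for j in range(N_BITS):
            if v >> j & 1:
                s[j].add(i)
    return s

def decode_at(s, i):
    # Reconstruct A[i] by membership tests across the bit positions.
    return sum(1 << j for j in range(N_BITS) if i in s[j])

a = [0, 5, 0, 0, 3, 0]
s = encode(a)
assert all(decode_at(s, i) == v for i, v in enumerate(a))
print([sorted(x) for x in s[:3]])  # [[1, 4], [4], [1]]
```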
        
         | xxs wrote:
          | In practice Cuckoo hashing suffers, because reading from an
          | unknown index in an array is not a constant-cost operation.
          | It has an upper bound of course, but the average cost varies
          | a lot depending on whether you hit L1, or L2... or take a
          | full cache miss.
         | 
         | "Cache misses" is what dominates performance nowadays.
        
           | FullyFunctional wrote:
           | Where I go there are no cache misses (it's in hardware)
        
       | kelseyfrog wrote:
       | I am enamored by data structures in the
       | sketch/summary/probabilistic family: t-digest[1], q-digest[2],
       | count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-
       | Gries sketch[7], top-k/spacesaving sketch[8], &c.
       | 
       | What I like about them is that they give me a set of engineering
       | tradeoffs that I typically don't have access to: accuracy-
       | speed[9] or accuracy-space. There have been too many times that
       | I've had to say, "I wish I could do this, but it would take too
       | much time/space to compute." Most of these problems still work
        | even if the accuracy is not 100%. Furthermore, many (if not
        | all) of these can tune accuracy by parameter adjustment
        | anyway. They also tend to have favorable algebraic properties,
        | i.e. they form monoids or semigroups under merge operations.
        | In short, they gave me the ability to solve problems I
        | couldn't solve before.
       | 
       | I hope they are as useful or intriguing to you as they are to me.
       | 
       | 1. https://github.com/tdunning/t-digest
       | 
       | 2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html
       | 
       | 3. https://florian.github.io/count-min-sketch/
       | 
       | 4.
       | https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...
       | 
       | 5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf
       | 
       | 6.
       | https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...
       | 
       | 7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf
       | 
       | 8.
       | https://www.sciencedirect.com/science/article/abs/pii/S00200...
       | 
       | 9. It may better be described as error-speed and error-space, but
       | I've avoided the term error because the term for programming
       | audiences typically evokes the idea of logic errors and what I
       | mean is statistical error.
        
       | nextstepguy wrote:
       | https://en.wikipedia.org/wiki/ZigZag_(software)
        
       | cl0rkster wrote:
       | How about the current DATE formats based on the EPOCH. Hell, they
       | knew there was nothing to it, but some backstreet performances
       | inspired my Y2K formative years where my normal, kind, religious
       | mother became a hoarder that destroyed everything she touched.
       | Why can Java still not do simple date processing when even M$$
       | nailed that formula generations ago? Not letting devs communicate
       | cross platform without Linux level religious arguments is
       | ridiculous. Git this, Linus, is the EPOCH the final destination
       | of the post-Christian measure of time (i.e. the common era)
        
         | TristanBall wrote:
          | I think you managed to lose me about 3 times in there...
          | what point are you trying to make, and what do you think
          | Linus should do?
        
       | cratermoon wrote:
       | Skip lists.
        
       | mgraczyk wrote:
       | Fractional Cascading. A simple and very cool way to speed up
       | searching for the same key in multiple lists. Instead of K binary
       | searches taking time Klog(N), you can do it in log(N) time
        | without using asymptotically more space.
       | 
       | https://en.m.wikipedia.org/wiki/Fractional_cascading
       | 
       | I wrote a simple demo in Rust a while back to help myself learn
       | the language.
       | 
       | https://github.com/mgraczyk/fractional_cascading
       | 
       | I also think pairing heaps are neat.
        
       | pgt wrote:
       | Not technically a data structure, but Reservoir Sampling is an
       | interesting probabilistic sampling technique used to choose a
       | random sample from a population of unknown size N in a single
       | pass, so useful for a data structure that does not fit in memory.
       | 
       | https://en.wikipedia.org/wiki/Reservoir_sampling
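Algorithm R, the classic single-pass version, fits in a few lines of Python:

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length
    (Algorithm R): keep the first k, then replace a random slot with
    decreasing probability k/(i+1)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
print(sample)  # 5 uniformly chosen items; varies per run
```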
        
       | foobiekr wrote:
       | Skew heaps. They aren't useful but their utter simplicity is
       | really fun.
        
         | eru wrote:
         | Any data structure based on skew binary numbers is also fun.
        
       | noname120 wrote:
       | HyperLogLog[1] -- approximately count distinct elements from an
        | extremely large set (e.g. 10^9 elements) super efficiently.
       | 
       | - Distinct count time complexity[2]: O(1)
       | 
        | See Redis's PFCOUNT[3].
       | 
       | [1] https://en.wikipedia.org/wiki/HyperLogLog
       | 
       | [2] For a fixed number of registers (e.g. Redis's implementation)
       | 
       | [3] https://redis.io/commands/pfcount/
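A toy HyperLogLog sketch in Python, roughly following the structure described on the Wikipedia page (register count and constants are the usual textbook choices here, not tuned, and real implementations add bias corrections):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HLL: the top b hash bits pick a register; the register keeps
    the maximum leading-zero rank seen in the remaining bits."""
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # standard HLL constant
        est = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * m:  # small-range correction (linear counting)
            zeros = self.registers.count(0)
            if zeros:
                est = m * math.log(m / zeros)
        return est

hll = HyperLogLog()
for i in range(50_000):
    hll.add(i)
print(round(hll.count()))  # roughly 50,000, within a few percent
```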
        
       | fho wrote:
       | Due to data immutability, a lot of the standard data structures
       | in Haskell are quite interesting.
       | 
       | Eg `Data.Sequence`
       | (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...)
       | is based on Finger Trees
       | (http://staff.city.ac.uk/~ross/papers/FingerTree.html) which
        | allow O(log(min(i,n-i))) inserts (at i) and O(1) inserts at
        | the beginning or end even though the data structure is
        | basically copy-on-write.
       | 
        | In general Haskell has a lot of specialized data structures.
        | One of my favorites is `Data.IntMap`
       | (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...)
       | which is a specialized map for integer keys (based on Patricia
       | trees iirc).
       | 
       | (Man I miss working in Haskell :-( )
        
       | mcqueenjordan wrote:
       | A Patricia Merkle tree. Sometimes affectionately referred to as a
       | Perkle tree in some large companies.
        
       | harporoeder wrote:
       | The entire space of probabilistic and sublinear data structures
       | is fascinating. From hyperloglog for high cardinality uniqueness
       | counting, to Cuckoo filters (similar to bloom filters), or Count-
       | min sketch for frequency counting.
       | 
       | https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
        
       | efortis wrote:
       | The humble array but with a twist for accessing its indices in a
       | hardware multiplexer way with 'shift left' and 'or' bitwise
        | operations.
        | 
        |     /* Bits: selected, hovered */
        |     const colors = [
        |       grey,  // 00
        |       green, // 01
        |       blueA, // 10
        |       blueB  // 11
        |     ]
        |     const color = colors[selected << 1 | hovered]
       | 
       | https://blog.uidrafter.com/bitwise-table-lookup
        
         | pjscott wrote:
         | No hardware multipliers here: left shifts are handled by much
         | cheaper hardware [1], and are almost part of the basic
         | arithmetic logic unit taught in school -- it can do addition,
         | subtraction, bitwise operations, shifts by one, and _maybe_
         | shifts by any number less than their word size.
         | 
         | [1] https://en.wikipedia.org/wiki/Barrel_shifter
        
           | efortis wrote:
           | Multiplexer (not multiplier)
        
             | pjscott wrote:
             | My mistake! They look so similar.
        
       | Gehinnn wrote:
       | Finger trees. They work on an arbitrary monoid and maintain a
       | list of items. You can insert and remove items in logarithmic
       | time, as well as ask for the sum of items in a given range, also
       | in logarithmic time.
       | 
       | This is useful for example when you want to convert between
       | offsets and line/column numbers efficiently, while you also want
       | to efficiently update/insert/replace lines.
        
       | slaymaker1907 wrote:
       | I really like the splay tree. It optimizes itself during reads
       | and writes such that it is optimal for a fixed probability
       | distribution of accesses.
        
       | qwerty3344 wrote:
       | rope - https://en.wikipedia.org/wiki/Rope_%28data_structure%29
       | skip list - https://en.wikipedia.org/wiki/Skip_list
        
         | 2OEH8eoCRo0 wrote:
         | Never heard of this one until it came up in a google interview.
         | Totally botched it. I couldn't think of any way to do it to
         | their liking that doesn't require a dynamic cast.
        
       | odipar wrote:
       | SeqHash
       | 
       | This data structure is an immutable, uniquely represented
       | Sequence ADS (Authenticated Data Structure) with various
       | interesting use cases.
       | 
       | It is based on a novel hashing scheme that was first introduced
       | with SeqHash.
       | 
        | See http://www.bu.edu/hic/files/2015/01/versum-ccs14.pdf by
        | Jelle van den Hooff, a brilliant young researcher back then.
       | 
       | SeqHashes are uniquely represented Merkle Trees that also
        | represent a sequence. Concatenation of two SeqHashes is
       | O(log(n)). In turn, the cryptographic signing of SeqHashes is
       | O(1) as each tree node carries a cryptographic hash.
       | 
       | Of course, for each node to carry a hash incurs a hefty overhead
       | but that can be alleviated by (lazily) grouping nodes into
        | buckets (turning it into some kind of BTree).
       | 
       | SeqHashes also don't support splitting in O(log(n)) like for
       | example AVL trees.
       | 
        | I've created a BTree version of SeqHash that also allows
        | splitting SeqHashes in O(log(n)), called SplitHash.
       | 
       | See:
       | https://github.com/odipar/spread/blob/master/src/main/scala/...
        
       | perlgeek wrote:
       | Skip lists.
       | 
       | They are useful for checking in O(log(n)) if a number (or string)
       | is in any of n ranges of numbers.
       | 
        | You just keep a sorted list of the endpoints of the ranges,
        | and when you do a lookup, you binary-search whether the value
        | is in the list, and if not, between which two indices it
        | falls. If it's between an even and an odd index, it's in.
       | 
       | Very useful if you remember that IP addresses are also integers,
       | and with IPv6 networks become way too huge to enumerate.
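The endpoint trick reads naturally as a few lines of Python with `bisect` (a sketch assuming disjoint half-open ranges [start, end); closed ranges need a slightly different convention):

```python
from bisect import bisect_right

# Sorted endpoints of disjoint half-open ranges [start, end).
# A point is inside some range iff its insertion index is odd.
endpoints = []

def add_range(start, end):
    # assumes the new range does not overlap existing ones
    i = bisect_right(endpoints, start)
    endpoints[i:i] = [start, end]

def contains(x):
    return bisect_right(endpoints, x) % 2 == 1

add_range(10, 20)
add_range(40, 45)
print(contains(10), contains(19), contains(20), contains(42), contains(5))
# True True False True False
```

The same test works for IP ranges once addresses are converted to integers (e.g. with the stdlib `ipaddress` module).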
        
       | dicroce wrote:
        | I came up with a clever one recently. So, I had a stream of
        | fixed-size records, receiving exactly one per second. I
        | needed to keep approximately two weeks of this data. I
       | would also need to be able to query the records by time. Anyway,
       | the idea is: use modulus on the file creation time to calculate a
       | write position. Use the location of the current write position to
       | locate query starts and stops. The file then has zero space
       | overhead. Is this something known or did I actually invent
       | something here? If this is my invention I'm open to hearing
       | suggestions for a cool name!
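A sketch of the scheme in Python (RECORD_SIZE and WINDOW are illustrative values, not from the post):

```python
# Fixed-rate ring file: one fixed-size record per second, kept for a
# two-week window. A record's slot is its timestamp modulo the window,
# so the file has zero space overhead and never grows.
RECORD_SIZE = 64                # bytes per record (illustrative)
WINDOW = 14 * 24 * 3600         # two weeks of one-per-second slots

def slot_offset(timestamp):
    """Byte offset in the file for the record written at `timestamp`."""
    return (timestamp % WINDOW) * RECORD_SIZE

def query_offsets(start_ts, end_ts):
    """Offsets of records in [start_ts, end_ts], oldest first."""
    assert end_ts - start_ts < WINDOW  # older data has been overwritten
    return [slot_offset(t) for t in range(start_ts, end_ts + 1)]

print(slot_offset(0), slot_offset(WINDOW))  # 0 0 (same slot, one window apart)
print(query_offsets(100, 103))              # [6400, 6464, 6528, 6592]
```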
        
         | michaelmior wrote:
         | Sounds like a ring buffer stored in a file.
        
       | kqr wrote:
       | Suffix trees are known to anyone who's done text search, but very
       | neat! Fairly easy idea too: in order to allow quick substring
       | search, build a mapping from all suffixes to documents. This is
       | compressed as a tree, and prefix matching is done by partial
       | traversal.
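A naive suffix-array rendition of the same idea in Python (a sketch only: as written, construction is O(n^2 log n), whereas real suffix trees/arrays can be built in linear time):

```python
# Sort all suffix start positions, then prefix-match by binary search.
def build_suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def find_substring(s, sa, pattern):
    """All positions where `pattern` occurs in `s`."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix >= pattern
        mid = (lo + hi) // 2
        if s[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in sa[lo:]:  # matching suffixes are contiguous from here
        if not s.startswith(pattern, i):
            break
        hits.append(i)
    return sorted(hits)

s = "banana"
sa = build_suffix_array(s)
print(find_substring(s, sa, "ana"))  # [1, 3]
```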
        
       | brightball wrote:
       | Actually came here to say bloom filter but you beat me to it.
        
       | whycombagator wrote:
       | I really like this thread, but now I am concerned which of these
       | will show up in a future interview.
       | 
       | Edit: forgot to add mine. Not obscure by any means but Fenwick
       | Trees are very cool IMO
       | 
       | https://en.m.wikipedia.org/wiki/Fenwick_tree
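For reference, a minimal Fenwick (binary indexed) tree sketch in Python; point update and prefix sum are both O(log n):

```python
class Fenwick:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-indexed internally

    def add(self, i, delta):
        """Add delta to element i (0-indexed)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next responsible node

    def prefix_sum(self, i):
        """Sum of elements in [0, i)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return s

f = Fenwick(8)
for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    f.add(i, v)
print(f.prefix_sum(4))                     # 3+1+4+1 = 9
print(f.prefix_sum(8) - f.prefix_sum(4))   # 5+9+2+6 = 22
```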
        
       | CharanSriram wrote:
        | This isn't so much a data structure as an implementation of
        | one, but map-represented trees are great.
       | 
       | For context, they were used by Figma to represent its scene-graph
       | (https://www.figma.com/blog/how-figmas-multiplayer-
       | technology...).
       | 
       | Essentially, it's a way to represent a normal tree as a map; keys
        | of the map represent individual nodes in the tree, and one key
       | "ROOT" points to the rootmost node.
       | 
       | Each node is connected via the following relationship: parents
       | have an unordered set of their children's keys (not the actual
       | node reference) and children know their parent's key. Children
       | also maintain a property to track their index using a technique
       | called fractional indexing.
       | 
       | What's great about this data structure is that you have O(1)
       | transformations for most parts. If you need to change a node's
       | property, height for instance, (assuming this is a design tool or
        | something), all you need is that node's key. If you need to
       | reposition a node under a new parent, simply remove the key from
       | the parent's set, change the node's parent key, and update the
       | index. This structure plays well with collaborative apps too as
       | most edits to the tree are O(1)-- edits are swift and can be
       | communicated to other collaborative instances quite simply with
       | small payloads.
       | 
       | Last thing I'll touch on is fractional indexing. Very crudely
       | put, it forces children nodes to track their position in a list
       | via an "index" that can be a normal float. To reposition a child
       | node, you first find its neighbors and average their indices to
       | find a new index. Ordering the list is a sort from smallest
       | indexes to largest.
       | 
       | In an example, say we had 3 nodes A, B, and C. A's index is 0.1,
       | B's is 0.3, and C's is 0.5. To insert an element D between A and
       | B, I average A and B's index (0.2), set that as the index of D
       | and insert it into the parent node's children set. At render, the
       | set is ordered and we find A, D, B, C.
       | 
       | Shameless plug, but I'm using these techniques to build a multi-
       | framework mobile app builder at phazia.com :). I love nerding out
       | about these things so thanks for the chance to do so @op.
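The A/B/C/D example above can be sketched in a few lines of Python (the function names are mine, and real systems use strings or rationals to avoid float precision running out after many inserts):

```python
# Fractional indexing: each child stores a float index; inserting
# between two neighbors takes the average of their indices.
def insert_between(children, key, before, after):
    """children: {key: index}. Insert `key` between `before` and
    `after`; either may be None to insert at an end of the list."""
    lo = children[before] if before is not None else 0.0
    hi = children[after] if after is not None else lo + 1.0
    children[key] = (lo + hi) / 2

def ordered(children):
    # render order = sort by index, smallest first
    return sorted(children, key=children.get)

children = {"A": 0.1, "B": 0.3, "C": 0.5}
insert_between(children, "D", "A", "B")  # the post's example
print(children["D"])      # 0.2
print(ordered(children))  # ['A', 'D', 'B', 'C']
```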
        
       | winReInstall wrote:
        | Input/output unions: basically an input struct for an
        | algorithm, laid out in such a way that the algorithm's output
        | only ever overwrites input that is no longer needed. If done
        | well, this allows hot loops that basically go over one array
        | for reads and write-backs. After all, it's all just memory,
        | and using what you have in situ is the fastest way to go.
        | 
        | No pointers to dereference and wait on, just the input,
        | computed, stored back into the same cache line, to be used by
        | the next part of the hot loop. If the hot loop is multi-staged
        | and the output continuously decreases in size, one can even
        | "compress" the data by packing an array of
        | origMaxSize/outputSize subunits into the unit.
        
         | TristanBall wrote:
         | Do you have any references for this? I'd like to understand
         | better
        
       | didip wrote:
       | Have you heard of Geohash? My mind was blown the first time I
       | learned it. https://en.m.wikipedia.org/wiki/Geohash
        
         | centur wrote:
          | This one is my favourite concept; the bloom filter comes
          | after it. A cool thing about Geohash is that you can expand
          | its concept into 3 or more dimensions, and the whole idea
          | just makes you think a bit differently about coding and
          | operating over small tokens of data. Fantastic stuff.
        
       | 3wolf wrote:
       | A minor quibble with your use-case explanation: The advantage of
       | a bloom filter isn't strictly time complexity. For example, a
       | hash table would also have constant lookup time (best case), and
       | would give a definitive answer on set membership. However, to
       | store 1 million IPv6 addresses would take 16 MB. You can see very
       | quickly that this would not scale very well to, say, a billion
       | addresses stored in-memory on a laptop. With a bloom filter, we
       | can shrink the amount of storage space required* while
       | maintaining an acceptable, calculable false positive rate.
       | 
       | * IP addresses actually aren't a great use case for basic bloom
       | filters, as they're fairly storage efficient to begin with, as
       | opposed to a url for example. Taking your example, say we need to
       | store 1 million IP addresses in our bloom filter and we're okay
       | with a ~1% false positive rate. Well then, if we use a bloom
       | filter with 2^23 bits (1 MB), the optimal number of hash
       | functions is (2^23)/(10^6)*ln(2) = 6, yielding a false positive
       | rate of (1 - exp(-6* 10^6 /2^23))^6 = ~1.8%. So we're using 6% of
       | the storage space, but with a nearly 2% false positive rate.
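For completeness, a textbook Bloom filter sketch in Python using the parameters from the comment above (2^23 bits, 6 positions; deriving all 6 positions from one SHA-256 digest is an illustrative shortcut, not how any particular library does it):

```python
import hashlib

class BloomFilter:
    """Textbook Bloom filter: k bit positions per item; membership means
    all k bits are set, so false positives but never false negatives."""
    def __init__(self, m_bits=1 << 23, k=6):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)  # 1 MB for 2^23 bits

    def _positions(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):  # 6 x 4 bytes out of the 32-byte digest
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] >> (p & 7) & 1
                   for p in self._positions(item))

bf = BloomFilter()
for i in range(10_000):
    bf.add(("ip", i))
print(("ip", 123) in bf)      # True (no false negatives)
print(("ip", 999_999) in bf)  # almost certainly False at this load
```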
        
       | api wrote:
        | I love the set reconciliation structures like the IBLT
        | (Invertible Bloom Lookup Table) and BCH set digests like
        | minisketch.
       | 
       | https://github.com/sipa/minisketch
       | 
        | Let's say you have a set of a billion items. Someone else has
       | mostly the same set but they differ by 10 items. These let you
       | exchange messages that would fit in one UDP packet to reconcile
       | the sets.
       | 
       | Minisketch is more costly in CPU but has deterministic
       | guarantees. IBLTs are very fast but have a chance of failure
       | requiring a randomized iterative procedure.
        
       | tealpod wrote:
        | The Bloom filter is a very clever and simple data structure,
        | thank you.
        
       | gpestana wrote:
        | A conflict-free replicated data type (CRDT) is a super interesting
       | data type (and algorithm), especially when it is implemented for
       | complex data structures like JSON: A JSON CRDT is "[...] an
       | algorithm and formal semantics for a JSON data structure that
       | automatically resolves concurrent modifications such that no
       | updates are lost, and such that all replicas converge towards the
       | same state (a conflict-free replicated datatype or CRDT)." [1].
       | 
       | [1] A Conflict-Free Replicated JSON Datatype
        | (https://arxiv.org/abs/1608.03960) by Martin Kleppmann and
       | Alastair R. Beresford.
        
       | JohnDeHope wrote:
       | I always thought the relational calculus was awesome. There's
       | only one place I've seen it in a real world use case, outside of
       | college, and that's in a database called Suneido.
        
       | s1k3 wrote:
       | There's this one called a binary search tree. I've only heard
       | about it a handful of times- during engineering interviews, but
        | is otherwise reserved for the most esoteric of problem spaces.
        
       | goddstream wrote:
       | Piece tables used in text editors
       | 
       | https://www.averylaird.com/programming/the%20text%20editor/2...
        
       | scary-size wrote:
       | Count-min sketch comes to mind [1]. It's a probabilistic data
       | structure similar to the bloom filter. Learnt about it in this
       | system design video "Top K Problem" [2].
       | 
       | [1] https://en.wikipedia.org/wiki/Count-min_sketch
       | 
       | [2] https://www.youtube.com/watch?v=kx-XDoPjoHw
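
The core of a count-min sketch fits in a few lines. Here is a rough Python illustration; the width, depth, and SHA-256-derived row hashing are arbitrary choices for the sketch, not taken from the linked material:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in fixed space (width x depth counters)."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _col(self, item, row):
        # one independent-ish hash per row, derived from SHA-256
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._col(item, row)] += count

    def estimate(self, item):
        # min over rows: collisions can only inflate a counter, never
        # deflate it, so the estimate is an upper bound on the true count
        return min(self.table[row][self._col(item, row)]
                   for row in range(self.depth))
```

Like a Bloom filter, the space is fixed up front regardless of how many items stream through; it is a common building block for "top K" / heavy-hitter queries.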
        
       | eru wrote:
       | Soft heaps are pretty neat. They are like normal min-heaps, but
       | support all operations you care about in O(1) instead of O(log n)
       | time. The catch is that they are allowed to make a configurable
       | proportion of errors.
       | 
       | They are basically only useful in theory, not in practice. They
       | allow construction of some asymptotically optimal algorithms.
        
       | jmartrican wrote:
       | Not super obscure, but also not well known... LinkedHashMap. You
       | get the O(1) lookups and ordered iteration.
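
For readers outside the JVM: Python's collections.OrderedDict is a rough analogue of Java's LinkedHashMap (a hash table threaded with a linked list, so you get hashed lookup plus insertion-order iteration):

```python
from collections import OrderedDict

d = OrderedDict()
d["b"] = 2
d["a"] = 1
d["c"] = 3

# O(1) hashed lookup, but iteration follows insertion order, not key order
assert d["a"] == 1
assert list(d) == ["b", "a", "c"]

# move_to_end gives access-ordering, the building block of an LRU cache
d.move_to_end("b")
assert list(d) == ["a", "c", "b"]
```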
        
       | amenghra wrote:
       | Purely Functional Data Structures[1] by Chris Okasaki is worth
       | reading. There's a book version if you prefer vs reading a
       | thesis.
       | 
       | Even though the domain application is functional programming,
       | these datastructures can come in handy when you want to enable
       | state sharing / keeping old versions around without having to
       | copy data.
       | 
       | [1] https://www.cs.cmu.edu/~rwh/students/okasaki.pdf
        
         | dllthomas wrote:
         | I loved the "Numerical Representations" chapter, on deriving
         | new data structures by analogy to number bases.
        
       | mzaks wrote:
       | Not a data structure, but a nice concept for very fast search
       | on an immutable pre-sorted array: Eytzinger order. I did some
       | experiments; it is about 2-3x faster than binary search.
       | https://medium.com/swlh/binary-search-vs-eytzinger-order-301...
        
       | motoboi wrote:
       | If you like Bloom filters, you'll love the Cuckoo filter, which
       | allows key deletion.
        
       | Arainach wrote:
       | Deterministic acyclic finite state automata. Essentially a
       | compressed trie. Significantly more memory efficient for real
       | world dictionaries.
       | 
       | https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...
        
       | paskozdilar wrote:
       | I've had a situation where I needed to stream blocks of data from
       | a remote computer in soft-realtime, but still needed to share the
       | same data with many different consumers.
       | 
       | The code was simple, but effective (Go):
       | 
       |     import "sync"
       | 
       |     type Muxer[T any] struct {
       |         ret T
       |         mut sync.RWMutex
       |     }
       | 
       |     func (c *Muxer[T]) Multiplex(fn func() T) (ret T) {
       |         if c.mut.TryLock() {
       |             defer c.mut.Unlock()
       |             ret = fn()
       |             c.ret = ret
       |         } else {
       |             c.mut.RLock()
       |             defer c.mut.RUnlock()
       |             ret = c.ret
       |         }
       |         return ret
       |     }
       | 
       | The way to use it is to write a non-concurrent nullary _get()_
       | function, then pass it to the _Multiplex()_ method in as many
       | goroutines as needed:
       | 
       |     get := func() DataType { ... }
       |     muxer := Muxer[DataType]{}
       | 
       |     // in other goroutines
       |     value := muxer.Multiplex(get)
       | 
       | The _value_ will be shared across all goroutines, so we get
       | minimum latency with minimum copying.
        
         | sph wrote:
         | I don't get it. It seems overengineered to me, but I can't
         | formulate my reasoning well enough.
         | 
         | Why isn't the original value that you're returning protected by
         | a RWLock, and all goroutines will just need to acquire a read
         | lock, instead of using a write lock for what is basically a
         | getter function?
         | 
         | Yeah, I don't get it.
        
           | paskozdilar wrote:
           | The Multiplex() function serves a double purpose:
           | 
           | 1) If no other goroutine is currently executing the
           | Multiplex() function, then Multiplex() acquires a write lock,
           | executes the getter and caches the result.
           | 
           | 2) If some other goroutine _is_ currently executing the
           | Multiplex() function, then Multiplex() waits to acquire the
           | read lock, then reads the cached value.
           | 
           | So in case of a single goroutine executing the Multiplex()
           | function in a loop, it will be nothing more than a wrapper.
           | But when many goroutines are executing the same Multiplex()
           | function concurrently, only one goroutine will actually
           | _execute_ the getter, and all others will just wait for the
           | first one to finish, then take its cached result.
           | 
           | The point is that you don't have to think about who executes
           | the getter and who copies the cached values, how often to
           | execute the getter, how much time will getter take... You
           | just call it like an ordinary function, and every goroutine
           | will get the most recent value it can get.
           | 
            | Hopefully this clears things up. I'm willing to elaborate
           | further if you have any more questions.
        
       | gb wrote:
       | The half-edge data structure, used in computational geometry.
       | 
       | I remember when I first discovered it, it drove home the point
       | that choice of data structure can really matter - it made a bunch
       | of problems I'd been working on much easier to solve, as it gave
       | a different way of looking at things.
        
       | Chris_Newton wrote:
       | Piece tables: a persistent data structure used to represent the
       | current document with relatively efficient editing operations in
       | various text editors, word processors, etc.
       | 
       | https://en.wikipedia.org/wiki/Piece_table
        
       | samorozco wrote:
       | It's not a data structure but a really cool algorithm. Locality
       | Sensitive Hashing. It allows like items to be hashed to the same
       | value. So unlike typical hash functions, which try to avoid
       | collisions, this algorithm optimizes for collisions.
       | 
       | https://en.wikipedia.org/wiki/Locality-sensitive_hashing
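
One classic LSH family is random-hyperplane SimHash: similar vectors land on the same side of most random hyperplanes, so their bit signatures collide in most positions. A small sketch; the vectors and parameters below are invented for illustration:

```python
import random

def simhash(vec, planes):
    # sign of the dot product with each random hyperplane -> one bit per plane
    bits = 0
    for i, plane in enumerate(planes):
        if sum(v * p for v, p in zip(vec, plane)) >= 0:
            bits |= 1 << i
    return bits

def hamming(x, y):
    # number of differing signature bits ~ angle between the vectors
    return bin(x ^ y).count("1")

random.seed(42)
DIM, NBITS = 8, 16
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NBITS)]

a = [1.0, 2.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.5]
b = [1.0, 2.1, 0.0, 0.9, 0.0, 0.0, 1.0, 0.5]    # nearly identical to a
c = [-1.0, 0.0, 3.0, -2.0, 1.0, 0.0, -1.0, 2.0]  # very different from a
```

Bucketing items by (pieces of) the signature is what turns this into sub-linear nearest-neighbor lookup.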
        
       | sairahul82 wrote:
       | The FST (finite-state transducer) is an underappreciated data
       | structure. It's used in places like speech recognition and
       | synthesis, machine translation, optical character recognition,
       | pattern matching, string processing, machine learning, and
       | information extraction. If you know about it, you know exactly
       | where to use it.
        
         | vvrm wrote:
         | +1 if your input and output are both sequences, FSTs can be
         | your best friend...Unless you prefer the company of the flaky
         | cool kids from deep learning (CTC, RNN-T, ...)
        
       | RiverBucketShoe wrote:
       | Splay trees: https://en.wikipedia.org/wiki/Splay_tree recently
       | searched items are always near the top.
        
         | jonstewart wrote:
         | Nuts! I was going to say splay trees. Completely silly these
         | days, like most trees, given caching hierarchies and bursty
         | memory accesses, but a really neat data structure all the same.
        
       | zasdffaa wrote:
       | > Golomb Coded Sets are similar to bloom filters but the storage
       | space is much smaller. Worse performance though.
       | 
       | Humm. A link to these would have been good[1], but anyway I
       | understand that an optimally filled bloom filter takes ~1.44
       | times more space than the optimal theoretic value (which I
       | understand Golomb encoding gives you?), so 'much smaller' is not
       | a helpful measure. Likewise the 'worse performance' is not a
       | helpful phrase, I believe a linear lookup time is needed for
       | decoding, but a small amount of space for an extra indexing
       | structure can speed things up lots. Bloom filters are updatable
       | (addition only in the simplest bloom filter), golomb coded sets
       | practically can't be updated except by rebuilding AIUI.
       | 
       | I suppose binary heaps should get a mention
       | <https://en.wikipedia.org/wiki/Binary_heap> cos they barely seem
       | to figure in other comments here. The neat bit is they are trees
       | but implicit (stored in an array) not explicit (which would use
       | pointers) and are always balanced giving you log access times in
       | a compact form.
       | 
       | [1] <https://en.wikipedia.org/wiki/Golomb_coding> but there are
       | simpler explanations on the web
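
The implicit layout really is just index arithmetic over a flat array; a quick illustration with Python's standard heapq module:

```python
import heapq

def parent(i):
    return (i - 1) // 2

def children(i):
    return 2 * i + 1, 2 * i + 2

h = []
for x in [5, 1, 9, 3, 7]:
    heapq.heappush(h, x)

# the "tree" is just the array: h[parent(i)] <= h[i] for every node,
# with no pointers and perfect balance for free
assert all(h[parent(i)] <= h[i] for i in range(1, len(h)))
assert [heapq.heappop(h) for _ in range(5)] == [1, 3, 5, 7, 9]
```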
        
       | binarymax wrote:
       | HNSW, or Hierarchical Navigable Small World is a graph data
       | structure for approximate nearest neighbor search of vectors.
       | 
       | https://arxiv.org/abs/1603.09320
       | 
       | The problem space of ANN is one of those really deep holes you
       | can go down. It's a game of balancing time and space, and it's
       | got plenty of fascinating algorithms and datastructures.
       | 
       | Check out http://ann-benchmarks.com/ for a comparison. HNSW is
       | not "the best" but it's easy to understand and is quite
       | effective.
        
         | israrkhan wrote:
         | I second this.. very useful for vector NLP and other ML tasks.
        
       | zeugmasyllepsis wrote:
       | A GADDAG[1] is a combination pre-/suffix trie that's used for
       | move generation in Scrabble. Unlike a traditional dictionary to
       | find anagrams for words, a GADDAG makes it easier to identify
       | plays based off of available letters already on the board. I made
       | a little visualization of how they work for a data visualization
       | class a few years back.[2]
       | 
       | [1] https://en.wikipedia.org/wiki/GADDAG [2]
       | https://www.axioms.me/gaddag/#?page-5
        
       | earthnail wrote:
       | Vantage-point trees. Great structure for nearest neighbour
       | searches.
       | 
       | Bonus tip: When you have a fast distance function (like a hamming
       | distance) and write your tree from scratch in C (which almost
       | automatically means you'll optimise it for your use case), they
       | can become absurdly fast - much faster than the general-purpose
       | nearest neighbour libraries I'm aware of.
        
       | XCSme wrote:
       | Not a data structure, and not really obscure, but I am still
       | amazed by the concept of solving DP (dynamic-programming)
       | problems in O(log N) instead of O(N) time by writing the solution
       | as a Fast Matrix Exponentiation equation:
       | https://www.geeksforgeeks.org/matrix-exponentiation/
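
The classic toy example is Fibonacci: F(n) falls out of the n-th power of [[1,1],[1,0]], computed with O(log n) matrix multiplications by square-and-multiply (a minimal sketch, no modulus):

```python
def mat_mult(a, b):
    # 2x2 matrix product
    return [[a[0][0]*b[0][0] + a[0][1]*b[1][0], a[0][0]*b[0][1] + a[0][1]*b[1][1]],
            [a[1][0]*b[0][0] + a[1][1]*b[1][0], a[1][0]*b[0][1] + a[1][1]*b[1][1]]]

def mat_pow(m, e):
    # square-and-multiply: O(log e) matrix multiplications
    result = [[1, 0], [0, 1]]  # identity
    while e:
        if e & 1:
            result = mat_mult(result, m)
        m = mat_mult(m, m)
        e >>= 1
    return result

def fib(n):
    # [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]
    return 0 if n == 0 else mat_pow([[1, 1], [1, 0]], n)[0][1]
```

The same trick works for any DP whose recurrence is a fixed linear combination of the previous k states.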
        
       | ww520 wrote:
       | Extendible hashing allows space efficient growable hashtable on
       | disk, with a maximum of two reads to reach the data page, and an
       | extremely simple and efficient way to grow the hashtable.
        
       | silverlake wrote:
       | Most of the data structures posted here are taught in CS classes.
       | Here's an interesting list of more obscure ones:
       | https://web.stanford.edu/class/cs166/handouts/090%20Suggeste...
        
         | rramadass wrote:
         | This is great!
         | 
         | Exactly what this thread is about; everybody should go through
         | this.
        
         | [deleted]
        
         | TristanBall wrote:
         | Not all of us got (cough any) CS degrees though, and while
         | technically the information is all out there for me to find, I
         | wouldn't necessarily have a reason to look, or on those rare
         | occasions where my sysadmin/DBA projects do throw up a problem
         | that one of these algorithms/structures are suited for - I
         | wouldn't know _to_ look, or even how.
         | 
         | That's possibly one of the better arguments for doing a CS
         | degree actually, at least for me. But 20 years ago when I might
         | have cared, CS was always advertised as "you'll learn data
         | structures and algorithms" and that's it. ( 20 yr old me: "I
         | know loops and recursion, linked lists, some basic trees and oh
         | look this "perl" thing has hashes, meh I'll be fine" )
         | 
         | If they'd listed out the sort of descriptions I've seen here..
         | well, I might have had a very different life!
         | 
         | So I'm absolutely loving this thread!
         | 
         | And thankyou for your link too, added to my list.
        
       | epelesis wrote:
       | BW Trees are pretty neat [1][2]. Typically BTrees in database
       | engines need latches to prevent concurrent writes messing up the
       | tree but BW Trees conveniently sidestep that requirement by
       | appending all node updates to a linked list (called a Delta
       | Chain), and having each reader apply the delta of updates by
       | traversing the linked list until it hits the final node in the
       | list. Periodically, these linked lists are compacted back into
       | single nodes.
       | 
       | If you like these kinds of data structures, the Database
       | Internals [3] book has a good introduction and explanation of a
       | handful of BTree optimizations like this.
       | 
       | 1: https://www.microsoft.com/en-us/research/wp-
       | content/uploads/...
       | 
       | 2:
       | https://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pd...
       | 
       | 3: https://www.databass.dev/
        
       | JKCalhoun wrote:
       | A certain ePub reader on MacOS used to use a bloom filter for
       | text search. You would decompose an ePub into pages of text and
       | generate a bit array for the text on each page. As the user types
       | in the search field you could often reject large portions of the
       | document -- leaving perhaps only a handful of pages where the
       | code would only then have to drop down to the more tedious string
       | matching algorithm.
       | 
       | Very cool.
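
The scheme described above might look roughly like this sketch: one small Bloom filter per page, used only to rule pages out before the slow exact search. The filter size and SHA-256-based hashing are invented for illustration:

```python
import hashlib

class PageFilter:
    """A tiny Bloom filter over the words of one page."""
    def __init__(self, words, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0
        for word in words:
            for pos in self._positions(word):
                self.bits |= 1 << pos

    def _positions(self, word):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def might_contain(self, word):
        # False means definitely absent; True means "string-match this page"
        return all(self.bits >> pos & 1 for pos in self._positions(word))

def candidate_pages(filters, word):
    # only these pages need the slow exact string search
    return [i for i, f in enumerate(filters) if f.might_contain(word)]
```

Because Bloom filters never produce false negatives, a page containing the query word is always in the candidate list; false positives just cost an extra exact scan.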
        
       | flobosg wrote:
       | Another use case for Bloom filters: Bioinformatics (see
       | https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatic...)
        
       | untererdisch wrote:
       | Three of my favorite probabilistic data structures are: Skip
       | Lists, Bloom Filters, and SimHashes.
        
       | pavelboyko wrote:
       | OBDD: Ordered Binary Decision Diagrams. This data structure
       | allows efficient symbolic operations on boolean functions. Widely
       | used in electronic design automation software.
       | 
       | [1] https://en.wikipedia.org/wiki/Binary_decision_diagram [2]
       | https://apps.dtic.mil/sti/pdfs/ADA470446.pdf
        
       | hamilyon2 wrote:
       | My attention seems to be attracted to perfect hash functions.
       | They are typically used for things like keyword detection in
       | parsers, but they are an interesting concept that should be
       | possible to extend to much more.
        
       | mungoman2 wrote:
       | Not really a pure data structure, but I like hash tables where
       | the entries expire after some time. This is very useful in
       | interactive situations like in games or chat bots, where for
       | example you want the AI/bot to remember a conversation or piece
       | of information for some time.
       | 
       | Simplest to implement is to just wrap a hash table and insert
       | time checks in the accessors.
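
That wrapper might look something like this minimal sketch (one fixed TTL per table, expiry checked lazily in the accessors):

```python
import time

class ExpiringDict:
    """Hash table whose entries expire ttl seconds after insertion."""
    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}  # key -> (value, expiry timestamp)

    def __setitem__(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)

    def __getitem__(self, key):
        value, expires = self._data[key]
        if time.monotonic() >= expires:
            del self._data[key]       # lazy eviction on access
            raise KeyError(key)
        return value

    def __contains__(self, key):
        try:
            self[key]
            return True
        except KeyError:
            return False
```

A production version would also sweep stale entries periodically so memory doesn't grow with keys that are never read again.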
        
       | anon291 wrote:
       | Finger trees allow you to do amazing things. In essence they let
       | you build an index, or multiple indices, for your dataset and
       | then store them in a single structure.
       | 
       | When I go from Haskell back to imperative land I find myself
       | greatly missing this ability. Sure I can make multiple hashmaps
       | or trees or whatever but being able to stuff it all in one data
       | structure is amazing.
       | 
       | One structure I built with them that is much more difficult with
       | typical structures is a "block list" structure.
       | 
       | In this structure each block has a particular width and they're
       | all stacked side by side.
       | 
       | I want to perform a query, "which box is at position X". So if my
       | boxes are of widths 7,20,10, then the lookup for 2 yields the
       | first box, the lookup for 12 yields the second, etc.
       | 
       | More interestingly, if I add a new box between the second and
       | third, the positions covered by the last box all shift.
       | 
       | With finger trees you use a sum monoid. This is easy. In other
       | languages you have to roll your own structure, either using a
       | list (with O(n) lookup) or a tree with O(log n) lookup but O(n)
       | inserts (to translate the indices of later blocks).
       | 
       | https://andrew.gibiansky.com/blog/haskell/finger-trees/
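
The essence of the trick, caching a monoid in each node (here total width, plus a box count as a second monoid), can be sketched in Python with an ordinary balanced tree standing in for the finger tree (real finger trees additionally give cheap insertion, which this sketch omits):

```python
class Leaf:
    def __init__(self, width):
        self.width = width  # the box's own width (the monoid measure)
        self.size = 1       # number of boxes below (a second monoid)

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.width = left.width + right.width  # cached sum monoid
        self.size = left.size + right.size

def build(widths):
    # balanced tree over the boxes, standing in for a finger tree
    if len(widths) == 1:
        return Leaf(widths[0])
    mid = len(widths) // 2
    return Node(build(widths[:mid]), build(widths[mid:]))

def box_at(tree, x):
    """Index of the box covering position x, in O(log n) steps."""
    index = 0
    while isinstance(tree, Node):
        if x < tree.left.width:
            tree = tree.left
        else:
            x -= tree.left.width     # skip the whole left subtree...
            index += tree.left.size  # ...using only its cached summaries
            tree = tree.right
    return index
```

Inserting a box only touches one root-to-leaf path of cached sums, which is why later boxes' positions shift "for free" instead of needing an O(n) renumbering.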
        
         | pjscott wrote:
         | You can do this with any tree, finger or otherwise, and the
         | big-O stuff applies as long as the tree is within a constant
         | factor of balanced. For some bizarre reason, though, the
         | concept of caching a monoid in each node doesn't seem to have
         | spread to the standard libraries of most languages. It's a
         | pity, since that would be _incredibly_ useful.
        
         | scott_s wrote:
         | You might enjoy a paper I'm a coauthor on which combines finger
         | trees with B-trees to build a data structure for optimal
         | updates to sliding windows: Optimal and General Out-of-Order
         | Sliding-Window Aggregation, https://www.scott-
         | a-s.com/files/vldb2019_fiba.pdf
         | 
         | Slides from the conference talk: https://www.scott-
         | a-s.com/files/vldb2019_fiba_slides.pdf
        
           | yellowflash wrote:
           | Very interested in your idea. How do concurrent updates
           | work with the augmented trees? I have been thinking about
           | the same for a while: something like CouchDB, but with
           | segment aggregations (monoid) augmented so we can do
           | interval queries over time too. The only thing bothering me
           | is that augmented trees are notorious for contention on
           | concurrent updates. Do you have any ideas on merging back
           | if we have multiple versions of augmented trees?
        
         | endgame wrote:
         | Finger Trees Explained Anew is a great derivation of the data
         | structure: https://www.youtube.com/watch?v=ip92VMpf_-A
        
         | throwamon wrote:
         | If I'm not mistaken, Clojure's data structures are (or used to
         | be) implemented using finger trees.
        
         | jacamera wrote:
         | Very interesting! Sounds similar to a segment tree:
         | https://en.wikipedia.org/wiki/Segment_tree
        
       | simonpure wrote:
       | The BIP buffer.
       | 
       | It's a ring buffer that always hands out contiguous memory
       | regions, so it works with APIs that don't specifically support
       | wrap-around.
       | 
       | It's a great fit for any network I/O and I believe Android uses
       | it under the hood.
       | 
       | https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The...
        
       | stevefan1999 wrote:
       | https://en.wikipedia.org/wiki/Fibonacci_heap?wprov=sfla1
       | 
       | Fibonacci heap is theoretically better than Binary heap but the
       | cache behavior is terrible
       | 
       | https://en.wikipedia.org/wiki/Van_Emde_Boas_tree?wprov=sfla1
       | 
       | Well, I saw this in CLRS tho. Very clever way of abusing bit
       | patterns for quick range query in O(log log M) where M is the
       | integer size.
       | 
       | https://en.wikipedia.org/wiki/Suffix_array?wprov=sfla1
       | 
       | A simpler way to do text matching since suffix array is the
       | preorder traversal of a suffix trie constructed using the same
       | string. Not heavily used in practice, but I do know bioinfo
       | people used it a lot.
       | 
       | https://study.com/academy/lesson/robin-hood-hashing-concepts...
       | 
       | Not a data structure, but still very underrated. Robin Hood
       | hashing tracks how far each entry has been pushed from its
       | ideal slot; on insertion, if the incumbent entry is closer to
       | home than the one being inserted, the two swap, "stealing from
       | the rich" to even out probe lengths. Rust helped popularize it
       | again: the standard library's original HashMap used Robin Hood
       | hashing (the current implementation, hashbrown, is a SwissTable
       | port).
       | 
       | https://en.wikipedia.org/wiki/Skip_list?wprov=sfla1
       | 
       | An ordered linked list with a fast lane at each of its ~log(n)
       | levels. Not used very much, but one place I know uses it is
       | Redis.
       | https://en.wikipedia.org/wiki/Double-ended_priority_queue?wp...
       | 
       | Basically one priority queue that can pop from both ends. Can
       | ironically be implemented with two priority queues (the dual).
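
That two-queue ("dual") implementation can be sketched as a min-heap and a max-heap over the same items, with tombstones so an element popped from one side is skipped by the other (unique ids handle duplicate keys):

```python
import heapq

class DoubleEndedPQ:
    """DEPQ from a min-heap and a max-heap with lazy deletion."""
    def __init__(self):
        self._min, self._max = [], []  # entries are (key, id) pairs
        self._dead = set()             # ids popped from the other heap
        self._next_id = 0

    def push(self, x):
        entry_id = self._next_id
        self._next_id += 1
        heapq.heappush(self._min, (x, entry_id))
        heapq.heappush(self._max, (-x, entry_id))

    def _prune(self, heap):
        # discard entries whose twin was already popped from the other heap
        while heap and heap[0][1] in self._dead:
            heapq.heappop(heap)

    def pop_min(self):
        self._prune(self._min)
        x, entry_id = heapq.heappop(self._min)
        self._dead.add(entry_id)  # tombstone so the max-heap skips it
        return x

    def pop_max(self):
        self._prune(self._max)
        neg_x, entry_id = heapq.heappop(self._max)
        self._dead.add(entry_id)
        return -neg_x
```

A production version would also shrink the tombstone set periodically; this sketch just lets it grow.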
        
         | hcs wrote:
         | A suffix array is useful as a part of binary diff, finding a
         | longest matching substring to copy from an old file. Bsdiff
         | makes heavy use of this, it's used in Courgette for patching
         | after preprocessing.
         | 
         | And once you have a suffix array, you're on the way to the
         | Burrows-Wheeler Transform for e.g. bzip2 compression!
         | https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transf...
        
       | fulmicoton wrote:
       | Pure algorithm & perf stuff...
       | 
       | - Exponential unrolled linked list: I don't know what they are
       | called, so I called them that way.
       | https://fulmicoton.com/posts/tantivy-stacker/
       | 
       | - Radix heap. I actually had to use that data structure in a
       | real use case.
       | 
       | - Radix tree. Actually super useful.
       | 
       | - HashMap<String, V> where the string and the value are
       | contiguous in memory. Weirdly, I've never seen that one... The
       | hashmap can store the full hash in its hash table anyway, so
       | false positives are super rare. It will have to check the
       | string anyway... You can improve your locality by keeping the
       | key and the value in the same place in RAM.
       | 
       | Just practical stuff: in Python, I often use a magic dict
       | defined as follows.
       | 
       |     from collections import defaultdict
       | 
       |     def MagicDict():
       |         return defaultdict(MagicDict)
       | 
       | Then you can use your MagicDict as a weird JSON-like map.
       | 
       |     c = MagicDict()
       |     c[3]["aaa"][2] = 5
        
       | RF_Savage wrote:
       | Low-density parity-check (LDPC) codes [1].
       | 
       | Forward error correction so good that until a few years ago it
       | was all under patents.
       | 
       | Gold codes [2] are also very cool in CDMA systems.
       | 
       | [1] https://en.m.wikipedia.org/wiki/Low-density_parity-
       | check_cod...
       | 
       | [2] https://en.m.wikipedia.org/wiki/Gold_code
        
       | mysterydip wrote:
       | Dijkstra maps. This page describes it better than I can:
       | http://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized but
       | it can be a real simple way to add emergent behavior to an AI in
       | a game (my interest in it), among other things.
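
For uniform step costs, building the whole map is a single BFS flood fill outward from the goal cells; monsters then just walk downhill. A rough grid sketch (the grid encoding is invented for illustration):

```python
from collections import deque

def dijkstra_map(grid, goals):
    """BFS flood fill: each floor cell gets its distance to the nearest goal.

    grid: list of strings, '#' = wall, anything else = floor
    goals: list of (x, y) cells with distance 0
    """
    dist = {(x, y): 0 for x, y in goals}
    queue = deque(goals)
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                    and grid[ny][nx] != '#' and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return dist
```

With non-uniform costs you would swap the deque for a priority queue (true Dijkstra); the roguebasin tricks, like combining several maps with weights, work the same either way.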
        
       | anigbrowl wrote:
       | Quadrilateral Simmelian backbones
       | 
       | Count quadrangle structures in graphs to derive an embeddedness
       | coefficient for each edge and expand the graph along the most
       | embedded edges, yielding nice clusters.
       | 
       | https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...
        
       | vbezhenar wrote:
       | Ropes. Something between array and tree.
        
       | nullc wrote:
       | How about a pinsketch:
       | 
       | A pinsketch is a set that takes a specified amount of memory and
       | into which you can insert and remove set members or even add
       | whole sets in time O(memory size). You can insert an unbounded
       | number of entries, and at any time that it has equal or fewer
       | entries than the size you can decode the list of members.
       | 
       | For an example usage, say I have a list of ten million IP
       | addresses of people who have DOS attacked my systems recently. I
       | want to send my list to you over an expensive pay as you go
       | iridium connection, so I don't want to just send the 40MiB list
       | since that would cost about $700. (Or 12.7 MiB if you use an
       | optimal set encoding, or somewhat in between if you use the
       | Golomb coded set mentioned in the OP).
       | 
       | Fortunately you've been making your own observations (and maybe
       | have stale data from me), and we don't expect our lists to differ
       | by more than 1000 entries. So I make and maintain a pinsketch
       | with size 1000 which takes 4000 bytes (1000 * 4bytes because IP
       | addresses are 32-bits). Then to send you an update I just send it
       | over. You maintain your own pinsketch of addresses, you subtract
       | yours from the one I sent and then you decode it. If the number
       | of entries in the resulting set (the number different between us)
       | is 1000 or fewer you're guaranteed to learn the difference
       | (otherwise the decode will fail, or give a false decode with odds
       | ~= 1/2^(1000)).
       | 
       | Bonus: We don't need to know in advance how different our sets
       | are-- I can send the sketch in units as small as one word at a
       | time (32-bits in this case) and stop sending once you've got
       | enough to decode. Implemented that way I could send you exactly
       | the amount of data you're missing without even knowing which
       | entries you're missing.
       | 
       | The sketch itself is simple: To insert an element you raise it to
       | a sequence of odd powers (1,3,5,7...) in a finite field and add
       | it to an accumulator for each power. The tricky part is decoding
       | it. :P
       | 
       | Here is an implementation I contributed to:
       | https://github.com/sipa/minisketch/
       | 
       | There is a application related data-structure called an inverted
       | bloom lookup table (IBLT) that accomplishes the same task. Its
       | encoding and especially decoding is much faster, and it has
       | asymptotically the same communications efficiency. However, the
       | constant factors on the communications efficiency are poor so for
       | 'small' in set difference (like the 1000 above) it has a rather
       | high overhead factor, and it can't guarantee decoding. I think
       | that makes it much less magical, though it may be the right tool
       | for some applications.
       | 
       | IBLT also has the benefit that its decoder is a fun bit of code
       | golf to implement.
       | 
       | The encoding for IBLT is simple: take your entries, append a
       | checkvalue (can't be a plain CRC). Then hash them with three
       | hash-functions to obtain three locations in a hash table and xor
       | them into those locations (removal works the same way, due to
       | xor's self-inverse property). Decoding an IBLT works by finding
       | entries whose check value passes (which will be true when they are
       | the only entry in their position) and subtracting them from the
       | table (hash them to get their other two positions, and xor them
       | in all three). Decoding is successful when the data structure is
       | all zeros. When the tables are big enough and have a modest
       | amount of slack space the decoding algorithm will be successful
       | (for it to fail there has to be an overlapping cycle, which
       | becomes rare in sufficiently large random graphs). (this
       | description is probably enough for a reader to make a working
       | IBLT implementation though it omits some improvements)
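
Following that recipe, here is a sketch of an IBLT in Python. One deviation to note: each cell also carries a count field (as in the published IBLT papers), which makes "pure cell" detection and set subtraction easier than the pure-xor description above; the hash choices are arbitrary:

```python
import hashlib

K = 3  # hash positions per entry

def _pos(key, i, m):
    digest = hashlib.sha256(b"%d:%d" % (i, key)).digest()
    return int.from_bytes(digest[:8], "big") % m

def _check(key):
    # check value derived from the key (deliberately not a plain CRC)
    return int.from_bytes(hashlib.sha256(b"chk:%d" % key).digest()[:8], "big")

class IBLT:
    def __init__(self, m):
        self.cells = [[0, 0, 0] for _ in range(m)]  # [count, keyXor, checkXor]

    def _update(self, key, delta):
        for i in range(K):
            cell = self.cells[_pos(key, i, len(self.cells))]
            cell[0] += delta
            cell[1] ^= key            # xor is its own inverse, so removal
            cell[2] ^= _check(key)    # is the same operation with delta=-1

    def insert(self, key):
        self._update(key, 1)

    def subtract(self, other):
        diff = IBLT(len(self.cells))
        for d, a, b in zip(diff.cells, self.cells, other.cells):
            d[0], d[1], d[2] = a[0] - b[0], a[1] ^ b[1], a[2] ^ b[2]
        return diff

    def decode(self):
        # peel "pure" cells: count +-1 with a matching check value
        only_mine, only_theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for cell in self.cells:
                if cell[0] in (1, -1) and _check(cell[1]) == cell[2]:
                    (only_mine if cell[0] == 1 else only_theirs).add(cell[1])
                    self._update(cell[1], -cell[0])
                    progress = True
        ok = all(cell == [0, 0, 0] for cell in self.cells)
        return only_mine, only_theirs, ok
```

With enough slack cells relative to the set difference, peeling almost always terminates with an all-zero table; when it doesn't, `ok` comes back False, matching the "chance of failure" caveat above.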
        
       | Loranubi wrote:
       | I am trying to compile a list of data structures (and algorithms
       | and other stuff) into a kind of knowledge base:
       | https://github.com/Dobatymo/data-algos/blob/master/data-stru....
       | There are a few cool and obscure data structures already. Please
       | feel free to help extend it!
        
       | jiggawatts wrote:
       | Cache-Oblivious Data Structures:
       | https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf
       | 
       | A vaguely related notion is that naive analysis of big-O
       | complexity in typical CS texts glosses over the increasing
       | latency/cost of data access as the data size grows. This can't be
       | ignored, no matter how much we would like to hand-wave it away,
       | because physics gets in the way.
       | 
       | A way to think about it is that a CPU core is like a central
       | point with "rings" of data arranged around it in a more-or-less a
       | flat plane. The L1 cache is a tiny ring, then L2 is a bit further
       | out physically and _has a larger area_, then L3 is even bigger
       | and further away, etc... all the way out to permanent storage
       | that's potentially across the building somewhere in a disk array.
       | 
       | In essence, as data size 'n' grows, the random access time grows
       | as sqrt(n), because that's the radius of the growing circle with
       | area 'n'.
       | 
       | Hence, a lot of algorithms that _on paper_ have identical
       | performance don't in reality, because one of the two may have an
       | extra sqrt(n) factor in there.
       | 
       | This is why streaming and array-based data structures and
       | algorithms tend to be faster than random-access, even if the
       | latter is theoretically more efficient. So for example merge join
       | is faster than nested loop join, even though they have the same
       | performance in theory.
        
         | robalni wrote:
         | I was experimenting a while ago with something that I think is
         | related to this.
         | 
         | If you have a quadratic algorithm where you need to compare
         | each pair of objects in a large array of objects, you might
         | use a loop in a loop like this:
         | 
         |     for (int i = 0; i < size; i++) {
         |         for (int j = 0; j < size; j++) {
         |             do_something(arr[i], arr[j]);
         |         }
         |     }
         | 
         | When this algorithm runs, it will access every cache line or
         | memory page in the array for each item, because j goes through
         | the whole array for each i.
         | 
         | I thought that a better algorithm might do all possible
         | comparisons inside the first block of memory before accessing
         | any other memory. I wanted this to work independently of the
         | size of cache lines or pages, so the solution would be that for
         | each power of 2, we access all items in a position lower than
         | that power of 2 before moving to the second half of the next
         | power of 2.
         | 
         | This means that we need to first compare indexes (0,0) and then
         | (1,0),(1,1),(0,1) and then all pairs of {0,1,2,3} that haven't
         | yet been compared and so on.
         | 
         | It turns out that the sequence (0,0),(1,0),(1,1),(0,1) that we
         | went through (bottom-left, bottom-right, top-right, top-left in
         | a coordinate system) can be recursively repeated by assuming
         | that the whole sequence that we have done so far is the first
         | item in the next iteration of the same sequence. So the next
         | step is to do (0,0),(1,0),(1,1),(0,1) again (but backwards this
         | time) starting on the (2,0) block so it becomes
         | (2,1),(3,1),(3,0),(2,0) and then we do it again starting on the
         | (2,2) block and then on the (0,2) block. Now we have done a new
         | complete block that we can repeat again starting on the (4,0)
         | block and then (4,4) and then (0,4). Then we continue like this
         | through the whole array.
         | 
         | As you can see, every time we move, only one bit in one index
         | number is changed. What we have here is actually two halves of
         | a gray code that is counting up, half of it in the x coordinate
         | and the other half in y. This means that we can iterate through
         | the whole array in the described way by making only one loop
         | that goes up to the square of the size and then we get the two
         | indexes (x and y) by calculating the gray code of i and then
         | de-interleaving the result so that bits 0,2,4... make the x
         | coordinate and bits 1,3,5... make y. Then we compare indexes x
         | and y:
         | 
         |     for (int i = 0; i < size * size; i++) {
         |         get_gray_xy(i, &x, &y);
         |         do_something(arr[x], arr[y]);
         |     }
         | 
         | The result of all this was that the number of cache line/page
         | switches went down by orders of magnitude but the algorithm
         | didn't get faster. This is probably for two reasons:
         | incremental linear memory access is very fast on modern
         | hardware and calculating the gray code and all that added an
         | overhead.
        
           | phkahler wrote:
           | While reading your post I kept thinking "de-interleave the
           | bits of a single counter" since I have used that before to
           | great benefit. It does suffer from issues if your sizes are
           | not powers of two, so your grey code modification seems
           | interesting to me. I'll be looking in to that.
           | 
           | FYI this also extends to higher dimensions. I once used it to
           | split a single loop variable into 3 for a "normal" matrix
           | multiplication to get a huge speedup. Unrolling the inner 3
           | bits is possible too but a bit painful.
           | 
           | Also used it in ray tracing to scan pixels in Z-order, which
           | gave about 10 percent performance improvement at the time.
           | 
           | But yes, I must look at this grey code step to see if it
           | makes sense to me and understand why it's not helping you even
           | with reduced cache misses.
        
           | jve wrote:
           | > but the algorithm didn't get faster
           | 
           | Could it be branch predictor at play?
           | https://stackoverflow.com/questions/11227809/why-is-
           | processi...
        
           | samsquire wrote:
           | This is similar to what I'm working on.
           | 
           | I am working on machine sympathetic systems. One of my ideas
           | is concurrent loops.
           | 
           | Nested loops are equivalent to the Nth product of the loop x
           | loop x loop or the Cartesian product.
           | 
           |     for letter in letters:
           |         for number in numbers:
           |             for symbol in symbols:
           |                 print(letter + number + symbol)
           | 
           | If len(letters) == 3, len(numbers) == 3, len(symbols) == 3.
           | If you think of this as loop indexes "i", "j" and "k" and
           | they begin at 000 and count up in base 3: 001, 002, 010,
           | 011, 012, 020, 021, 022, 100 and so on.
           | 
           | The formula for "i", "j" and "k" is
           | 
           |     N = self.N
           |     for index, item in enumerate(self.sets):
           |         N, r = divmod(N, len(item))
           |         combo.append(r)
           | 
           | So the self.N is the product number of the nested loops, you
           | can increment this by 1 each time to get each product of the
           | nested loops.
           | 
           | We can load balance the loops and schedule the N to not
           | starve the outer loops. We can independently prioritize each
           | inner loop. If you represent your GUI or backend as extremely
           | nested loops, you can write easy to understand code with one
           | abstraction, by acting as if loops were independent
           | processes.
           | 
           | So rather than that sequence, we can generate 000, 111, 222,
           | 010, 230, 013 and so on.
           | 
           | Load balanced loops means you can progress on multiple items
           | simultaneously, concurrently. I combined concurrent loops
           | with parallel loops with multithreading. I plan it to be a
           | pattern to create incredibly reactive frontends and backends
           | with low latency to processing. Many frontends lag when
           | encrypting, compressing or resizing images; it doesn't need
           | to be that slow. IntelliJ doesn't need to be slow with
           | concurrent loops and multithreading.
           | 
           | See https://github.com/samsquire/multiversion-concurrency-
           | contro...
           | 
           | Loops are rarely first class citizens in most languages I
           | have used and I plan to change that.
           | 
           | I think I can combine your inspiration of gray codes with
           | this idea.
           | 
           | I think memory layout and data structure and algorithm can be
           | 3 separate decisions. I am yet to see any developers talk of
           | this. Most of the time the CPU is idling waiting for memory,
           | disk or network. We can arrange and iterate data to be
           | efficient.
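The divmod formula above can be made self-contained (the function and variable names here are mine): one flat counter decoded into one index per nested loop, with the first set's index varying fastest.

```python
# Decode a flat counter n into one index per nested loop (mixed-radix
# decomposition): "nested loops" become a single loop over the product
# of the sizes.

def decode(n, sizes):
    combo = []
    for size in sizes:
        n, r = divmod(n, size)
        combo.append(r)
    return combo  # combo[0] varies fastest

sizes = [2, 3, 4]
seen = {tuple(decode(n, sizes)) for n in range(2 * 3 * 4)}
assert len(seen) == 24  # every (i, j, k) combination exactly once
```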
        
             | samsquire wrote:
             | Link to my concurrent loop repository.
             | 
             | https://github.com/samsquire/multiversion-concurrency-
             | contro...
        
           | samsquire wrote:
           | I decided to try implementing get_gray_xy. I am wondering how
           | you can generalise it to any number of loops and apply it to
           | nested loops of varying sizes, that is if you have three sets
           | and the sizes are 8, 12, 81.
           | 
           | Wikipedia says graycodes can be found by num ^ (num >> 1)
           | a = [1, 2, 3, 4]
           | b = [2, 4, 8, 16]
           | indexes = set()
           | correct = set()
           | 
           | print("graycode loop indexes")
           | for index in range(0, len(a) * len(b)):
           |     code = index ^ (index >> 1)
           |     left = code & 0x0003
           |     right = code >> 0x0002 & 0x0003
           |     print(left, right)
           |     indexes.add((left, right))
           | 
           | assert len(indexes) == 16
           | 
           | print("regular nested loop indexes")
           | for x in range(0, len(a)):
           |     for y in range(0, len(b)):
           |         correct.add((x, y))
           |         print(x, y)
           | 
           | assert correct == indexes
           | 
           | This produces the sequence:
           | 
           |     graycode loop indexes
           |     0 0   1 0   3 0   2 0
           |     2 1   3 1   1 1   0 1
           |     0 3   1 3   3 3   2 3
           |     2 2   3 2   1 2   0 2
           | 
           |     regular nested loop indexes
           |     0 0   0 1   0 2   0 3
           |     1 0   1 1   1 2   1 3
           |     2 0   2 1   2 2   2 3
           |     3 0   3 1   3 2   3 3
        
             | robalni wrote:
             | Here is the implementation of get_gray_xy that I used:
             | static uint64_t deinterleave(uint64_t x) {
             |     x = x & 0x5555555555555555;
             |     x = (x | (x >>  1)) & 0x3333333333333333;
             |     x = (x | (x >>  2)) & 0x0f0f0f0f0f0f0f0f;
             |     x = (x | (x >>  4)) & 0x00ff00ff00ff00ff;
             |     x = (x | (x >>  8)) & 0x0000ffff0000ffff;
             |     x = (x | (x >> 16)) & 0x00000000ffffffff;
             |     return x;
             | }
             | 
             | static void get_gray_xy(uint64_t n, uint64_t *x, uint64_t *y) {
             |     uint64_t gray = n ^ (n >> 1);
             |     *x = deinterleave(gray);
             |     *y = deinterleave(gray >> 1);
             | }
             | 
             | I don't think the gray code solution can work well for sets
             | of different sizes because the area of indexes that you go
             | through grows as a square.
             | 
             | But it does work for different number of dimensions or
             | different number of nested loops. For 3 dimensions each
             | coordinate is constructed from every 3rd bit instead of
             | 2nd.
             | 
             | So if you have index 14, then gray code is 9 (1001) which
             | means in two dimensions x=1,y=2 and in three dimensions,
             | x=3,y=0,z=0.
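The index-14 example checks out; a quick Python mirror of the scheme above (the helper names are mine): x takes bits 0, 2, 4, ... of the gray code, y takes bits 1, 3, 5, ...

```python
# Extract every second bit of x, packed into a single integer.
def deinterleave(x):
    out, pos = 0, 0
    while x:
        out |= (x & 1) << pos  # keep this bit, skip the interleaved one
        x >>= 2
        pos += 1
    return out

def get_gray_xy(n):
    gray = n ^ (n >> 1)
    return deinterleave(gray), deinterleave(gray >> 1)

assert 14 ^ (14 >> 1) == 9          # gray code of 14 is 0b1001
assert get_gray_xy(14) == (1, 2)    # matches the example above
# every (x, y) pair of a 4x4 grid is visited exactly once:
assert len({get_gray_xy(i) for i in range(16)}) == 16
```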
        
               | phkahler wrote:
               | deinterleave doesn't actually produce gray code does it?
               | The GP said to convert to gray code before doing the
               | deinterleave.
        
               | samsquire wrote:
               | Thank you for sharing your idea and your code and
               | thoughts and thanks for your reply.
               | 
                | I would like to generalise it for sets of different
                | sizes. Maybe it's something different from the gray
                | codes you taught me today; it feels like it should be
                | simple.
               | 
               | Maybe someone kind can step in and help me or I can ask
               | it on Stackoverflow.
               | 
               | I need to think it through.
               | 
                | At one point I was reading a section on Intel's
                | website about cache optimisation, with a GUI tool that
                | shows you the CPU's cache behaviour for column- vs
                | row-major iteration. I found it very interesting but I
                | cannot seem to find it now. I did find a Medium
                | article about Cachediff.
        
         | umanwizard wrote:
         | Do we mean different things by "merge join" and "nested loop
         | join" ? For me "merge join" is O(n) (but requires the data to
         | be sorted by key) whereas "nested loop join" is O(n^2).
        
           | mountainreason wrote:
           | Wouldn't you at least be looking at nlog(n) for the sort in
           | the merge join?
        
             | umanwizard wrote:
             | Yes, if the data is not already sorted. Thus it's O(n) for
             | already sorted data and O(n log(n) + n) -- which simplifies
             | to O(n log(n)) -- for arbitrary data.
        
               | mountainreason wrote:
               | Yeah. I know. Why would the data already be sorted?
        
               | umanwizard wrote:
               | In a database you might have already built an index on
               | that data.
        
         | vvanders wrote:
         | Realtime collision detection[1] has a fantastic chapter in this
         | with some really good practical examples if I remember right.
         | 
         | Great book, I used to refer to it as 3D "data structures" book
         | which is very much in theme with this thread.
         | 
         | [1] https://www.amazon.com/Real-Time-Collision-Detection-
         | Interac...
        
           | thegeomaster wrote:
           | It's a stellar book. I work on a commercial realtime physics
           | engine, the "orange book" is practically required reading
           | here.
        
           | Gene_Parmesan wrote:
           | Seconding the RTCD recommendation. Handy code examples, and
           | my favorite part is that the book is real-world-performance-
           | conscious (hence the "real-time" in the title).
        
           | trebcon wrote:
           | The implicit grid data structure from there is a personal
           | favorite of mine. I used it in a game once and it performed
           | incredibly well for our use case.
           | 
           | It's a bit too complicated to totally summarize here, but it
           | uses a bit per object in the scene. Then bit-wise operations
           | are used to perform quick set operations on objects.
           | 
           | This data structure got me generally interested in algorithms
           | that use bits for set operations. I found the Roaring Bitmap
           | github page has a lot of useful information and references
           | wrt this topic:
           | https://github.com/RoaringBitmap/RoaringBitmap
        
             | phkahler wrote:
             | I spent a lot of time optimizing an octree for ray tracing.
             | It is my favorite "spatial index" because it's the only one
             | I know that can be dynamically updated and always results
             | in the same structure for given object placement.
        
           | SotCodeLaureate wrote:
           | I often recommend this book, RTCD, as a "better intro to data
           | structures and algorithms" (sometimes as "the best"). The
           | advice often falls on deaf ears, unfortunately.
        
         | cvoss wrote:
         | > In essence, as data size 'n' grows, the random access time
         | grows as sqrt(n), because that's the radius of the growing
         | circle with area 'n'.
         | 
         | I was about to write a comment suggesting that if we made
         | better use of three dimensional space in constructing our
         | computers and data storage devices, we could get this extra
         | latency factor down to the cube root of n.
         | 
         | But then, I decided to imagine an absurdly large computer. For
         | that, one has to take into account a curious fact in black hole
         | physics: the radius of the event horizon is directly
         | proportional to the mass, rather than, as one might expect, the
         | cube root of the mass [1]. In other words, as your storage
         | medium gets really _really_ big, you also have to spread it out
         | thinner and thinner to keep it from collapsing. No fixed
         | density is safe. So, in fact, I think the extra factor for
         | latency in galactic data sets is neither the square root nor
         | cube root, but n itself!
         | 
         | [1] https://en.wikipedia.org/wiki/Schwarzschild_radius
        
           | chriswarbo wrote:
           | This series of blog posts explains this sqrt behaviour quite
           | nicely, covering both theory (including black holes) and
           | practice:
           | 
           | http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
        
           | samus wrote:
           | The main difficulty with growing computing devices in three
           | dimensions is cooling. All that heat has to be somehow
           | carried away. It's already difficult with 2D devices. One
           | would have to incorporate channels for cooling liquids into
           | the design, which takes a bite out of the gains from 3D.
           | 
           | https://www.anandtech.com/show/12454/analyzing-
           | threadripper-...
        
           | l33t2328 wrote:
           | How do you accurately include those sqrt(n)'s in your
           | analysis?
        
           | 6510 wrote:
            | If it's stored far away enough, re-doing the calculation is
           | faster. Depending on how much accuracy you need you might as
           | well load a 2022-07-22 07:26:34 earth and have that guy on HN
           | evaluate his thoughts all over again.
        
         | cratermoon wrote:
         | > the increasing latency/cost of data access as the data size
         | grows
         | 
         | Latency Numbers Everyone Should Know
         | https://static.googleusercontent.com/media/sre.google/en//st...
        
           | NavinF wrote:
           | The ratios are mostly correct, but the absolute numbers are
           | wrong these days. For example, memory read latency can be
           | 45ns (not 100) with a decent DDR4 kit and data center latency
           | can be 10us (not 500us) with infiniband.
        
         | physicsguy wrote:
         | I spent a long time looking at an algorithm for long range
         | force calculations called the Fast Multipole Method which is
         | O(N) and found that practically it couldn't compete against an
         | O(N log N) method we used that involved FFTs because the
         | coefficient was so large that you'd need to simulate systems
         | way bigger than is feasible in order for it to pay off because
         | of cache locality, etc.
        
           | jsmith45 wrote:
           | Honestly, that is not too uncommon. For top level algorithm
           | choice analysis, "rounding" O(log(n)) to O(1), and
           | O(n*log(n)) to O(n), is often pretty reasonable for algorithm
           | comparison, even though it is strictly incorrect.
           | 
           | It is not uncommon for a simple O(n*log(n)) algorithm to beat
           | out a more complex O(n) algorithm for realistic data sizes.
           | If you "round" them to both O(n), then picking the simpler
           | algorithm because the obvious choice, and can easily turn out
           | to have better perf.
        
             | physicsguy wrote:
             | To be fair, in the general case (particles placed in
             | arbitrary positions), it worked very well. The issue was
             | that in the area of electromagnetics I worked on at the
             | time, we make an assumption about things being placed on a
             | regular grid, and that allows you to do the force
             | calculation by convolving a kernel with the charges/dipole
             | strength in Fourier space.
        
           | xigoi wrote:
           | https://en.wikipedia.org/wiki/Galactic_algorithm
        
           | mmorse1217 wrote:
           | This implementation and associated paper improve the cache
           | locality to make the algorithm more competitive
           | https://github.com/dmalhotra/pvfmm
           | 
           | They compare solving toy problems with different methods here
           | https://arxiv.org/pdf/1408.6497.pdf
           | 
           | But as you mention, this is just fiddling with constants, and
           | it still may be too costly for your problem
        
         | robotresearcher wrote:
         | Doing a decent job of analysis would include the sqrt(n) memory
         | access factor. Asymptotic analysis ignores constant factors,
         | not factors that depend on data size. It's not so much 'on
         | paper' as 'I decided not to add a significant factor to my
         | paper model'.
         | 
         | Cache oblivious algorithms attempt to make the memory access
         | term insignificant so you could continue to ignore it. They are
         | super neat.
        
         | itamarst wrote:
         | Does anyone actually use cache-oblivious data structure in
         | practice? Not, like, "yes I know there's a cache I will write a
         | datastructure for that", that's common, but specifically cache-
         | oblivious data structures?
         | 
         | People mention them a lot but I've never heard anyone say they
         | actually used them.
        
           | artemisart wrote:
           | Cache oblivious data structures are absolutely used in every
           | serious database.
           | 
           | From what I remember/can find online (don't take this as a
           | reference): B-tree: relational DB, SQLite, Postgres, MySQL,
           | MongoDB. B+-tree: LMDB, filesystems, maybe some relational
           | DBs. LSM trees (Log-structured merge tree, very cool) for
           | high write performance: LevelDB, Bigtable, RocksDB,
           | Cassandra, ScyllaDB, TiKV/TiDB. B-epsilon tree: TokuDB, not
           | sure if it exists anymore. COLA (Cache-oblivious lookahead
           | array): I don't know where it's used.
           | 
           | Maybe modern dict implementations can qualify too, e.g.
           | Python.
        
           | jiggawatts wrote:
           | To an extent, the B-Tree data structure (and its variants)
           | are cache oblivious. They smoothly improve in performance
           | with more and more cache memory, and can scale down to just
           | megabytes of cache over terabytes of disk.
           | 
           | The issue is that the last tree level tends to break this
           | model because with some caching models it is "all or nothing"
           | and is the biggest chunk of the data by far.
           | 
           | There are workarounds which make it more truly cache
           | oblivious. How often this is used in production software? Who
           | knows...
        
             | jfim wrote:
             | There are also Judy arrays:
             | https://en.wikipedia.org/wiki/Judy_array
        
               | bruce343434 wrote:
               | This sounded promising but then I saw a performance
               | benchmark [1] that was done on a pentium(!) with a very
               | suboptimal hash table design compared to [2] and [3].
               | 
               | The Judy site itself [4] is full of phrases like "may be
               | out of date", when the entire site was last updated in
               | 2004. If it was already out of date in 2004...
               | 
               | It seems that Judy arrays are just kind of a forgotten
               | datastructure, and it doesn't look promising enough for
               | me to put in effort to revive it. Maybe someone else
               | will?
               | 
               | [1] http://www.nothings.org/computer/judy/
               | 
               | [2] https://probablydance.com/2017/02/26/i-wrote-the-
               | fastest-has...
               | 
               | [3] https://probablydance.com/2018/05/28/a-new-fast-hash-
               | table-i...
               | 
               | [4] http://judy.sourceforge.net/
        
               | fanf2 wrote:
               | Judy arrays are (under the covers) a kind of radix tree,
               | so they are comparable to things like ART (adaptive radix
               | tree), HAMT (hashed array-mapped trie), or my qp-trie.
               | Judy arrays had some weird issues with patents and lack
               | of source availability in the early days. I think they
               | are somewhat overcomplicated for what they do.
               | 
                | Regarding cache-friendliness, a neat thing I discovered
               | when implementing my qp-trie is that it worked amazingly
               | well with explicit prefetching. The received wisdom is
               | that prefetching is worthless in linked data structures,
               | but the structure of a qp-trie makes it possible to
               | overlap a memory fetch and calculating the next child,
               | with great speed benefits. Newer CPUs are clever enough
               | to work this out for themselves so the explicit prefetch
               | is less necessary than it was.
               | https://dotat.at/prog/qp/blog-2015-10-11.html
               | 
               | I guess I partly got the idea for prefetching from some
               | of the claims about how Judy arrays work, but I read
               | about them way back in the 1990s at least 15 years before
               | I came up with my qp-trie, and I don't know if Judy
               | arrays actually use or benefit from prefetch.
        
               | efreak wrote:
                | It seems to me that the project is still active:
               | https://sourceforge.net/p/judy/bugs/29/
        
           | pdpi wrote:
          | Cache-obliviousness is an important part of the whole array-
          | of-structs vs structs-of-arrays question. One of the
          | advantages of the struct-of-arrays strategy is that it is
          | cache-oblivious.
        
         | JackFr wrote:
         | I saw an amazing presentation on Purely Functional Data
         | structures with a really approachable and understandable proof
          | on how a particular structure was asymptotically constant or
         | linear time (I believe), and then followed up by saying that in
         | practice none of it worked as well as a mutable version because
         | of caching and data locality.
         | 
         | Oh well.
        
           | qwrshr wrote:
           | Is there by any chance a publicly available recording of this
           | presentation on purely functional data structures?
        
             | JackFr wrote:
             | NE Scala Symposium Feb 2011
             | 
             | http://vimeo.com/20262239 Extreme Cleverness: Functional
             | Data Structures in Scala
             | 
             | or maybe
             | 
             | http://vimeo.com/20293743 The Guerrilla Guide to Pure
             | Functional Programming
             | 
             | (At work, so I can't check the video)
             | 
             | Both of them were excellent speakers
        
       | justahuman74 wrote:
       | I'll just sit here hoping that all of the above datatypes get a
       | nice rust crate :)
        
       | benhoyt wrote:
       | The BK-Tree, which allows fast querying of "close" matches, such
       | as Hamming distance (number of bits different).
       | http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...
       | 
       | I wrote a Python library implementing them a number of years ago:
       | https://github.com/benhoyt/pybktree
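A minimal BK-tree sketch (my own, not the pybktree API), using Hamming distance: search prunes any subtree whose edge distance falls outside [d - tol, d + tol], which the triangle inequality guarantees cannot contain matches.

```python
# Hamming distance between two integers: number of differing bits.
def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (value, {edge_distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.dist(item, node[0])
            if d in node[1]:
                node = node[1][d]       # descend along the same distance
            else:
                node[1][d] = (item, {})
                return

    def find(self, item, tol):
        results, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = self.dist(item, value)
            if d <= tol:
                results.append((d, value))
            # Only edges with distance in [d - tol, d + tol] can match.
            stack.extend(child for dd, child in children.items()
                         if d - tol <= dd <= d + tol)
        return sorted(results)

tree = BKTree(hamming)
for x in [0b0000, 0b0001, 0b0011, 0b1111]:
    tree.add(x)
assert tree.find(0b0000, 1) == [(0, 0b0000), (1, 0b0001)]
```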
        
       | edem wrote:
       | It must be [CHAMP](https://blog.acolyer.org/2015/11/27/hamt/#:~:t
       | ext=CHAMP%20st....). Acronym for Compressed Hash-Array Mapped
       | Prefix-tree. It is a current state of the art persistent
       | hashtable. I'm surprised that no one mentioned this before,
       | because it is super useful!
        
       | slazaro wrote:
       | I don't know whether it already exists or if it has a name, but
       | internally I call it Virtual List.
       | 
       | - Reasoning: It's used in cases where you'd ideally use an array
       | because you want contiguous memory (because you'll usually
       | iterate through them in order), but you don't know beforehand how
       | many elements you'll insert. But you can't use a resizeable
       | version like std::vector, because it invalidates any pointers to
       | the elements you've already allocated, and you need to store
       | those pointers.
       | 
       | - Implementation: As a black box, it's used like a list: you
       | index elements with an integer, and you can append to the end.
       | But internally it uses different lists. Like a std::vector, it
       | starts with an allocated size, and when it needs a bigger size,
       | it allocates newer blocks that are "added" to the end, but
       | doesn't move the already allocated elements.
       | 
       | - Downsides: Data is not all contiguous, at some point there are
       | jumps between elements that should be close together. For
       | performance when iterating, it's negligible since those cache
       | misses happen "rarely" (depends on the block size, bigger block
       | size means better cache performance but more potential wasted
       | allocated memory), and most of the time you're iterating from
       | start to end. But this means that you can't use this structure
       | with algorithms that use "pointer + size", since internally it
       | doesn't work like that. And since internally you can't shrink the
       | lists and invalidate pointers, there's no way to delete elements,
       | but in my case it's not a problem since I'd reuse indices
       | afterwards with an added structure.
       | 
       | - Applications: I used it when coding games, when you have a big
       | list of entities/components but you don't know how many
       | beforehand. You access them by index, you might store pointers to
       | elements, and most of the time you'll iterate through the whole
       | list (when updating, drawing, etc.).
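A sketch of the structure described above (the class name, block size, and helpers are my own; a C++ version would hand out stable pointers, while here the point is the index-to-block mapping): fixed-size blocks are allocated on demand, so existing elements never move, and an index maps to (block, offset) with a shift and mask when the block size is a power of two.

```python
class VirtualList:
    def __init__(self, block_size=256):
        assert block_size & (block_size - 1) == 0, "power of two required"
        self.block_size = block_size
        self.shift = block_size.bit_length() - 1  # log2(block_size)
        self.mask = block_size - 1
        self.blocks = []
        self.length = 0

    def append(self, value):
        if self.length & self.mask == 0:          # current block is full
            self.blocks.append([None] * self.block_size)
        self.blocks[self.length >> self.shift][self.length & self.mask] = value
        self.length += 1

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.blocks[i >> self.shift][i & self.mask]

vl = VirtualList(block_size=4)
for i in range(10):
    vl.append(i * i)
assert vl[9] == 81 and len(vl.blocks) == 3  # 10 items -> three 4-slot blocks
```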
        
         | twic wrote:
         | I think you're describing an unrolled linked list:
         | https://en.wikipedia.org/wiki/Unrolled_linked_list
         | 
         | Although specifically, in an unrolled linked list, the blocks
         | are connected by a chain of pointers from one to the next. You
         | can alternatively just keep pointers to each block in a top-
         | level array. it's not clear from your description which you're
         | doing.
         | 
         | The JDK has a twist on the latter, which it calls a spined
         | buffer, where the blocks double in size every time one is
         | added, which makes sense if you have absolutely no idea how
         | many elements there will be when you start.
        
           | slazaro wrote:
           | Yeah that's pretty similar. I keep a list of pointers to
           | blocks and they're all the same size, that way you can easily
           | address elements via a linear index. I tried doing the
           | doubling size thing like std::vector, but the addressing
           | became more complex and I didn't find a need for it, but I
           | might consider it in the future. But yeah, it seems like it's
           | almost the same kind of structure. Thanks for the pointer!
        
         | nly wrote:
         | So like a std::deque? If the block size varies then lookup will
         | be O(log B) where B is the number of blocks.
        
           | slazaro wrote:
           | I remember checking whether I would use std::deque, and I
           | don't remember exactly why but I decided to implement this
           | other system instead. I think it was because you can
           | invalidate pointers with it if you delete somewhere in the
           | middle, either manually or by calling an algorithm with it.
           | 
           | Of course you have the option to just not do that in your
           | code, but it's nice to have a hard guarantee that this can
           | never happen, even by accident. I could have used a deque as
           | a backend and wrapped it in the class, though.
        
       | jonathan-kosgei wrote:
       | Time complexity doesn't seem like a great reason to use a bloom
       | filter. Why not just check if the IP is in a set? Set membership
       | is supposed to be O(1).
       | 
       | The reason to use a bloom filter would be to save space.
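Back-of-envelope for the space argument, using the standard bloom filter sizing formula m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p (not tied to any particular implementation):

```python
import math

# Bits needed for a bloom filter holding n items at false-positive rate p.
def bloom_bits(n, p):
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

n = 1_000_000          # e.g. blacklisted IPv4 addresses
exact = n * 4 * 8      # exact set: at least 4 bytes per address, in bits
bloom = bloom_bits(n, 0.01)
assert bloom < exact   # ~9.6 bits/item vs 32+ bits/item
print(f"{bloom / n:.1f} bits per item at 1% false positives")
```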
        
       | lscharen wrote:
       | CDAWGs. Compressed Directed Acyclic Word Graphs.
       | 
       | They are like tries but share prefixes and suffixes of a word
       | corpus, can be built in linear time, and since they are graphs,
       | one can define an inner product between two CDAWGs and use them
       | for kernel-based machine learning algorithms.
       | 
       | https://hal-upec-upem.archives-ouvertes.fr/hal-00620006/docu...
        
       | sebastianconcpt wrote:
        | Good! I've always been fascinated by non-boolean (probabilistic)
        | indexing.
        | 
        | I remember implementing an OOP indexed version of U.C.
        | Berkeley's Cheshire-II in 2007 with Dolphin Smalltalk.
        | 
        | I was using BTrees tho, hence O(log N). The non-boolean part was
        | in the values of the keys, which were probabilities of a good
        | match against your search target.
        
       | Gravityloss wrote:
        | Don't know if it fits the criteria, but here's a very basic
        | concept: Circular arrays.
       | 
       | This is great for live updated time series data: you never have
       | to expand your arrays, allocate memory etc. It is extremely fast,
       | predictable and resource efficient.
        
         | Retr0id wrote:
         | Also known as ring buffers
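The fixed-capacity, no-reallocation behavior described above can be sketched in a few lines; `RingBuffer` and its methods are hypothetical names:

```python
class RingBuffer:
    """Fixed-capacity buffer: once full, appends overwrite the oldest
    entry, so memory use is constant - handy for live time-series data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.count = 0  # total number of items ever written

    def append(self, value):
        # Write position wraps around modulo the capacity.
        self.data[self.count % self.capacity] = value
        self.count += 1

    def to_list(self):
        # Return the retained items, oldest first.
        if self.count <= self.capacity:
            return self.data[:self.count]
        start = self.count % self.capacity
        return self.data[start:] + self.data[:start]
```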
        
       | zbentley wrote:
        | HyperLogLogs! They're extremely useful for cardinality
        | estimation, and support set operations.
       | 
       | https://en.m.wikipedia.org/wiki/HyperLogLog
        
         | thebigspacefuck wrote:
         | Reddit's engineering blog has a good article of how they put
         | HyperLogLog to use https://www.redditinc.com/blog/view-
         | counting-at-reddit
        
       | z3t4 wrote:
        | A sorted list. It's often cheaper to re-sort the list after
        | each change than to search the whole list. Especially when you
        | do collision detection in 3D... (maybe not so obscure, but at
        | least underestimated.) When I pair my socks I first place them
        | in color order :)
        
         | pjscott wrote:
         | Note that re-sorting an array after making a constant number of
         | changes can be done in O(n) and will usually be very fast in
         | practice. Insertion sort will do it, as will Timsort and
         | similar algorithms.
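The point above is easy to see with insertion sort, whose cost is O(n) plus the total distance the out-of-place elements have to move; a minimal sketch:

```python
def insertion_sort(a):
    """In-place insertion sort: near-linear on nearly-sorted input,
    since each displaced element only shifts a short distance."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # Shift larger elements right until x's slot is found.
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a
```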
        
         | jedimastert wrote:
         | I was reading up on C++'s standard library and found out that
         | maps and sets are generally implemented as continually sorted
         | lists
        
       | dqpb wrote:
       | Type->Value Lattice
       | 
       | https://cuelang.org/docs/concepts/logic/
        
       | johncalvinyoung wrote:
       | Not crazy obscure, but half-edge mesh data structures are super
       | cool to me.
        
       | robertsdionne wrote:
       | Posit numbers (a better floating point representation):
       | https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit...
       | 
       | https://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf
        
         | robertsdionne wrote:
         | Also, Learned Indexes Structures:
         | 
         | The Case for Learned Index Structures:
         | https://arxiv.org/abs/1712.01208
         | 
         | Learned Indexes for a Google-scale Disk-based Database:
         | https://arxiv.org/abs/2012.12501
         | 
         | And Noms/Dolt Prolly Trees:
         | 
         | How to Chunk Your Database into a Merkle Tree:
         | https://dolthub.com/blog/2022-06-27-prolly-chunker/
        
       | jhallenworld wrote:
       | Some ones I've used recently:
       | 
        | The "golden section search" to find the minimum (or maximum) of
        | a unimodal function. An actual real-world use case for the
        | golden ratio.
        | 
        | Exponentially Weighted Moving Average filters, or how to have a
        | moving average without saving any data points.
       | 
       | Some of my classic favorites:
       | 
        | Skiplists: they behave like sorted trees, but the algorithms
        | are low complexity, which is nice.
        | 
        | Boyer-Moore string search is awesome.
       | 
       | Bit twiddling algorithms from Hackers Delight: for example (x
       | &-x) isolates the least significant set bit: basically use the
       | ALU's carry chain to determine priority.
       | 
        | Another is compress from Hacker's Delight. It very quickly
        | packs selected bits (given a bitmask) to the right end of the
        | word. It's useful for building certain hash functions.
       | 
        | The humble circular queue. If your circular queue is a power of
        | 2 in size, use log2(size) + 1 bits for the indexes. The + 1 is
        | key: it lets you distinguish between completely full and
        | completely empty with no effort. So:
        | 
        |     Empty:  (write_index == read_index)
        |     Full:   (write_index == read_index + size)
        |     Count:  (write_index - read_index)
        |     Insert: queue[write_index++ & (size - 1)] = data
        |     Remove: data = queue[read_index++ & (size - 1)]
       | 
       | All lossless data compression algorithms are amazing.
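A sketch of the golden section search mentioned above, assuming a unimodal function on a bracketing interval; function and parameter names are illustrative (a tuned version reuses one probe value per iteration, which this sketch skips for clarity):

```python
def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Minimize a unimodal f on [lo, hi]: each iteration shrinks the
    bracket by a factor of 1/phi."""
    invphi = (5 ** 0.5 - 1) / 2  # 1/phi, about 0.618
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)  # left probe point
        d = a + invphi * (b - a)  # right probe point
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2
```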
        
         | jongjong wrote:
         | In case anyone is interested, I wrote a skip list
         | implementation for Node.js here:
         | https://www.npmjs.com/package/proper-skip-list
         | 
         | I provided the time complexity of each operation as part of the
         | README.
        
         | chrchang523 wrote:
         | Regarding the bit "compress" operation, this is supported in
         | hardware (PEXT instruction) by Haswell and newer processors,
         | though unfortunately most AMD processors have poor
         | implementations. I've found it to be quite handy when
         | implementing subsetting operations.
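For readers without BMI2 hardware, the compress/PEXT operation can be emulated bit by bit; note the sketch reuses the `x & -x` trick from the parent comment:

```python
def pext(value, mask):
    """Software emulation of the PEXT instruction: gather the bits of
    `value` selected by `mask` and pack them contiguously at the low end."""
    result = 0
    out_pos = 0
    while mask:
        low_bit = mask & -mask  # isolate lowest set bit (the x & -x trick)
        if value & low_bit:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1        # clear the lowest set bit
    return result
```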
        
       | crazybob123 wrote:
        | FST: Finite State Transducer. Widely used inside search engines
        | (Lucene) and databases to power prefix and fuzzy search.
        
       | lizardactivist wrote:
       | The Trie is pretty cool. I think it's a bit obscure because the
       | memory use can grow quite fast.
        
         | pjscott wrote:
         | That depends very much on how the trie is represented in
         | memory. Something like a burst trie or HAT trie is usually very
         | compact!
        
       | sterlind wrote:
        | Skip lists! Super wacky, super fun to implement. O(log n)
        | insertion/deletion/search, kept sorted. It's probabilistic: you
        | have a number of linked-list levels, you flip coins to pick a
        | level k for each node, and insert it into the lowest k linked
        | lists.
       | 
       | also, radix heaps are underappreciated.
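A compact sketch of the coin-flip construction described above (search and insert only; deletion omitted for brevity, and the node representation is an arbitrary choice):

```python
import random

class SkipList:
    """Each node is [value, forwards]; a node of level k is linked into
    the k lowest lists. Expected O(log n) search and insert."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = [None, [None] * self.MAX_LEVEL]  # sentinel node

    def _random_level(self):
        # Flip coins: each extra level has probability 1/2.
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def contains(self, value):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node[1][lvl] is not None and node[1][lvl][0] < value:
                node = node[1][lvl]
        candidate = node[1][0]
        return candidate is not None and candidate[0] == value

    def insert(self, value):
        # Track the last node before `value` at every level.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node[1][lvl] is not None and node[1][lvl][0] < value:
                node = node[1][lvl]
            update[lvl] = node
        new = [value, [None] * self.MAX_LEVEL]
        for lvl in range(self._random_level()):
            new[1][lvl] = update[lvl][1][lvl]
            update[lvl][1][lvl] = new
```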
        
       | SahAssar wrote:
       | https://en.wikipedia.org/wiki/Macaroons_(computer_science) are
       | very interesting to me.
       | 
       | Think of it as a JWT that you can narrow down the authorizations
       | for without needing to communicate with a server, so if you have
       | read permissions to all your photos you can add a caveat saying
       | `photoid=123456` and share it, and the recipient can only read
       | the photo 123456. The caveats can be anything, including
       | requiring third party authentication via third-party caveats.
       | I've implemented systems with it being validated by a lua module
       | in nginx which worked well, but the whole concept never took off.
       | 
       | I still think it seems like one of the most interesting authZ
       | strategies around.
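The core trick behind macaroons is an HMAC chain: each caveat is folded into the signature, so anyone holding the token can append caveats offline, but only the holder of the root key can verify. A much-simplified sketch that omits third-party caveats and caveat evaluation (function names are illustrative):

```python
import hashlib
import hmac

def _chain(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()

def mint(root_key, identifier):
    """Server mints a token: signature = HMAC(root_key, identifier)."""
    return {"id": identifier, "caveats": [], "sig": _chain(root_key, identifier)}

def attenuate(macaroon, caveat):
    """Anyone can narrow the token offline by chaining the HMAC; a caveat
    can be appended but never removed without breaking the chain."""
    return {
        "id": macaroon["id"],
        "caveats": macaroon["caveats"] + [caveat],
        "sig": _chain(macaroon["sig"], caveat),
    }

def verify(root_key, macaroon):
    """Only the root key holder can recompute and check the chain.
    (A real verifier would also evaluate each caveat's predicate.)"""
    sig = _chain(root_key, macaroon["id"])
    for caveat in macaroon["caveats"]:
        sig = _chain(sig, caveat)
    return hmac.compare_digest(sig, macaroon["sig"])
```

So sharing `attenuate(token, b"photoid=123456")` grants strictly narrower access than the original token, with no server round-trip.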
        
         | FridgeSeal wrote:
         | I've heard of these!
         | 
         | The fly.io blog has a _really_ cool write-up about them.
         | 
         | Definitely an under appreciated concept.
         | 
         | IIUC, you can even validate the "sub issued" macaroons offline,
         | provided you know the validity of one of its ancestors up the
         | chain. Is this correct, or am I misunderstanding?
        
           | doerig wrote:
           | here is the mentioned blog post for anyone interested (and
           | lazy): https://fly.io/blog/api-tokens-a-tedious-survey/
        
           | SahAssar wrote:
           | I think that's correct since each added caveat builds on the
           | previous one.
           | 
           | I don't have the formal knowledge to understand the full
           | underpinnings of it, but it always seemed to me (from
           | implementing validation of them and looking at many authZ
           | systems) that many of the sharing and delegation features of
           | current apps would be so much neater in a macaroon based
           | system.
        
       | turkeywelder wrote:
       | Might not be very obscure but it was new to me. Ended up being
       | really useful for finding out whether dates intersect a
       | particular period and ended up using it quite a bit at work.
       | 
       | Range Tree:
       | 
       | https://en.wikipedia.org/wiki/Range_tree
       | 
       | You give it a range of dates and it'll tell you if any intersect
       | so if you're looking for "how many people are absent in this time
       | period" you can really quickly find out by using a range tree.
        
       | jcadam wrote:
       | How about the R-Tree, used for accessing/searching spatial data:
       | https://en.wikipedia.org/wiki/R-tree
       | 
       | Maybe not so obscure if you've worked in the geospatial realm.
       | 
       | I actually had a job interview once that asked me to implement an
       | R-Tree from scratch in C++ as a homework assignment - I didn't
       | get that job :)
        
       | nicwolff wrote:
       | The Aho-Corasick automaton [0]. You have a bunch of strings, and
       | you want to know if and where they appear in a longer string. You
       | build a search trie from the strings, but add links back from
       | each node to the node representing the longest suffix of the
       | string it represents that is also in the trie, so you can skip
       | backtracking as you scan, yielding linear time complexity in
       | O(length of string being searched + number of results found).
       | 
       | [0] https://en.wikipedia.org/wiki/Aho-Corasick_algorithm
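A compact sketch of the construction described above, with the suffix ("failure") links computed by BFS over a plain dict-based trie (not optimized; names are illustrative):

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie, then compute suffix links level by level."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # suffix link: longest proper suffix that is in the trie
    out = [[]]    # patterns recognized at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def search(text, patterns):
    """Return (start, pattern) for every occurrence, scanning text once."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]  # follow suffix links instead of backtracking
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```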
        
         | burntsushi wrote:
         | The aho-corasick Rust crate[1] also provides a way to build the
         | automaton in a slightly different way to provide leftmost first
         | or "preference first" match semantics. This matches how popular
         | backtracking regex engines implement alternations of literals,
         | for example. (It also provides leftmost longest semantics,
         | which matches how POSIX regexes implement alternations.)
         | 
         | [1]: https://docs.rs/aho-corasick
        
       | Dowwie wrote:
        | Slotmaps seem like a superior data structure to adjacency lists
        | backed by Vec, but they still haven't replaced them in graph
        | libraries: https://docs.rs/slotmap/latest/slotmap/
        
       | gordaco wrote:
        | Fenwick Trees (which, despite the name, are implemented over an
        | array) allow both querying prefix sums and updating elements in
        | O(log n) time. Very useful when _n_ is on the order of millions.
        | I have used them a few times in Project Euler problems.
       | 
       | https://en.wikipedia.org/wiki/Fenwick_tree
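A minimal sketch of the array-backed representation (point update plus prefix-sum query; class and method names are illustrative):

```python
class FenwickTree:
    """Fenwick (binary indexed) tree: slot i of the internal 1-based
    array covers a range whose length is the lowest set bit of i."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, i, delta):
        """Add delta to element i (0-based), in O(log n)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # climb to the next covering slot

    def prefix_sum(self, i):
        """Sum of elements 0..i-1, in O(log n)."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total
```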
        
         | nyanpasu64 wrote:
         | Is it possible to implement a Fenwick tree using a tree, and
         | support quickly adding and removing items as well as quickly
         | shifting the positions of suffixes?
        
           | pjscott wrote:
           | Yes! This is both possible and fairly easy.
        
           | kragen wrote:
           | Yes.
        
         | EnderShadow8 wrote:
         | Segment trees are objectively superior in all ways except
         | implementation length
        
           | almog wrote:
           | Another advantage to Fenwick Tree: while it shares asymptotic
           | space complexity with Segment Tree, it has better constant
           | factors, which can be useful in very restricted environments.
        
         | almog wrote:
          | Was going to mention Fenwick Trees here as well, but since
          | you already did, I'll just add that this is IMO a great
         | introduction to Fenwick Trees:
         | https://www.youtube.com/watch?v=kPaJfAUwViY
        
       | skrebbel wrote:
       | Not a very deep CS-y one, but still one of my favourite data
       | structures: Promise Maps.
       | 
       | It only works in languages where promises/futures/tasks are a
       | first-class citizen. Eg JavaScript.
       | 
       | When caching the result of an expensive computation or a network
       | call, don't actually cache the result, but cache the promise that
        | awaits the result. I.e. don't make a
        | 
        |     Map<Key, Result>
        | 
        | but a
        | 
        |     Map<Key, Promise<Result>>
       | 
       | This way, if a new, uncached key gets requested twice in rapid
       | succession, ie faster than the computation takes, you avoid
       | computing/fetching the same value twice. This trick works
       | because:
       | 
       | - promises hold on to their result value indefinitely (until
       | they're GC'ed)
       | 
       | - you can await (or .then()) an existing promise as many times as
       | you want
       | 
       | - awaiting an already-resolved promise is a very low-overhead
       | operation.
       | 
       | In other words, the promise acts as a mutex around the
       | computation, and the resulting code is understandable even by
       | people unfamiliar with mutexes, locks and so on.
        
         | cauthon wrote:
         | What is a promise in this context?
        
           | cpburns2009 wrote:
           | A promise is an object representing an asynchronous result.
           | Some languages, such as Python, call them futures. A promise
           | object will typically have methods for determining if result
           | is available, getting the result if it is, and registering
           | callback functions to be called with the result when it's
           | available.
        
         | polyrand wrote:
         | This is only tangentially related to:
         | 
         | > you avoid computing/fetching the same value twice
         | 
         | But this problem comes up in reverse proxies too. In Nginx, you
         | can set the `proxy_cache_lock`[0] directive to achieve the same
         | effect (avoiding double requests).
         | 
         | [0]:
         | https://nginx.org/en/docs/http/ngx_http_proxy_module.html#pr...
        
           | talos2110 wrote:
           | That right there is Cache Stampede[0] prevention.
           | 
           | [0]: https://en.wikipedia.org/wiki/Cache_stampede
        
             | psytrx wrote:
             | Interesting! This is an issue I had with an external API
             | which I intended to cache on my serverless workers infra.
             | 
             | I hit the API's rate limiter when the workers invalidated
             | their cache, even though I staggered the lifetime of the
             | cache keys for each replicated instance. This is how I
             | found out how many Cloudflare workers run on a single edge
             | instance. Hint: It's many.
             | 
             | Didn't know it had a name. I'm delighted, thanks!
        
               | talos2110 wrote:
               | You're welcome, happy to help. If you are in the .NET
               | space I suggest you to take a look at FusionCache. It has
               | cache stampede protection built in, plus some other nice
               | features like a fail-safe mechanism and soft/hard
               | timeouts
               | https://github.com/jodydonetti/ZiggyCreatures.FusionCache
        
         | acoye wrote:
         | I wonder if Swift's AsyncAwait could be used in such a way.
        
           | Ryuuke5 wrote:
            | Sure, it's possible: we await the Task if it already exists
            | in the dictionary. For example, we could write something
            | like this inside an actor to make it thread-safe.
            | 
            |     private var tasksDictionary: Dictionary<String, Task<Data, Error>> = [:]
            | 
            |     func getData(at urlString: String) async throws -> Data {
            |         if let currentTask = tasksDictionary[urlString] {
            |             return try await currentTask.value
            |         }
            |         let currentTask = Task {
            |             // data(from:) returns (Data, URLResponse); keep the Data
            |             try await URLSession.shared.data(from: URL(string: urlString)!).0
            |         }
            |         tasksDictionary[urlString] = currentTask
            |         let result = try await currentTask.value
            |         tasksDictionary[urlString] = nil
            |         return result
            |     }
        
             | afro88 wrote:
             | Great! So much nicer than the Combine (or Rx) version.
        
             | ninkendo wrote:
             | Hmm, why set `taskDictionary[urlString] = nil` at the
             | bottom there? If the whole point is to cache the result,
             | isn't the point to leave the value there for other requests
             | to pick it up?
        
               | Ryuuke5 wrote:
               | Yup, nice catch, no need to reset it to nil if you wanna
               | keep it in memory indefinitely.
               | 
                | I guess making it nil can also be used if you don't
                | wanna make the same request when there is one already in
                | flight: say you get a 403 error and need to request a
                | refresh token - you don't wanna make two simultaneous
                | requests, but you also don't wanna cache it indefinitely
                | either.
        
         | lalaithion wrote:
         | What do you mean by first class citizen, here? I'm pretty sure
         | this works in all languages with promises, but I might be
         | misunderstanding something.
        
         | dan-robertson wrote:
          | Be careful with this data structure. If the language allows
          | async exceptions, or you have a bug where the promise won't
          | become determined, there are a lot of edge cases. Examples of
          | edge cases:
         | 
         | - if the promise never becomes determined (eg bug, async
         | exception) your app will wait forever
         | 
         | - if the promise has high tail latency things can be bad
         | 
         | - if the language eagerly binds on determined promises (ie it
         | doesn't schedule your .then function) you can get weird
         | semantic bugs.
         | 
         | - changing the keys of the table incrementally instead of just
         | using for memoisation can be really messy
        
           | wxnx wrote:
           | I wrote some Python code recently that uses a similar data
           | structure (Futures instead of Promises, without knowing
           | necessarily about the data structure's use in JavaScript). It
           | wasn't really for caching.
           | 
           | I modeled mine on the Django ASGI reference library's server
           | implementation, which uses the data structure for maintaining
           | references to stateful event consumers. Exception handling is
           | done with a pre-scheduled long-running coroutine that looks
           | at the map.
           | 
           | I'm curious about your second point -- why exactly do things
           | get bad with high tail latency? Is it only a weakness of the
           | data structure when used for caching? I'm having trouble
           | picturing that.
        
           | hinkley wrote:
           | I have a couple pieces of code where we had to add rate
           | limiting because it was just too easy to make too many async
           | calls all at once, and things only got 'worse' as I fixed
           | performance issues in the traversal code that were creating
           | an ersatz backpressure situation.
           | 
           | Philosophically, the main problem is that promise caches are
           | a primary enabler of the whole cache anti-pattern situation.
           | People make the mistake of thinking that 'dynamic
           | programming' means memoization and 'memoization' means
           | caching. "Everything you said is wrong." Trying to explain
           | this to people has been a recurring challenge, because it's
           | such a common, shallow but strongly-held misunderstanding.
           | 
           | Youtube recently suggested this video to me, which does a
           | pretty good job of explaining beginning and intermediate DP:
           | 
           | https://www.youtube.com/watch?v=oBt53YbR9Kk
           | 
           | What I love most about this video is that it introduces
           | memoization about 10 minutes in, but by 20% of the way
           | through it has already abandoned it for something better:
            | tabulation. Tabulation is an iterative traversal of the
            | problem that gathers the important data not only in one place
            | but within a single stack frame. It telegraphs an information
            | architecture, while recursive calls represent emergent
           | behavior. The reliance on cache to function not only obscures
           | the information architecture, it does so by introducing
           | global shared state. Global shared state and emergent
           | behavior are both poison to the long-term health of a
           | project, and here we have a single data structure that
           | represents both in spades.
           | 
           | We are supposed to be fighting accidental coupling, and
           | especially global variables, not engaging in word games to
           | hide what we are doing.
           | 
           | So I think my answer to OP's question is 'tabulation', which
           | is part data structure and part algorithm.
        
           | feoren wrote:
           | - if the language eagerly binds on determined promises (ie it
           | doesn't schedule your .then function) you can get weird
           | semantic bugs.
           | 
           | What would be an example of this? If the promise has been
           | determined, why not just immediately run the .then function?
        
           | bloudermilk wrote:
           | Best practice in my experience is to use a timeout on all
           | async operations to handle edge cases 1 & 2 above. The third
           | case isn't possible with JavaScript AFAIK.
        
         | hmm-interesting wrote:
         | can you please explain this a bit more how this avoid fetching
         | the same value twice? maybe with example.
        
           | kube-system wrote:
           | It sounds like, if you have a process that takes a long time
           | to do, you check to see if you already have initiated that
           | process (see if the key is in the list of promises), rather
           | than firing off another promise for the same request.
           | 
           | If you get 10 requests to fire function
           | lookUpSomethingThatTakesAWhile(id) that returns a promise,
           | rather than firing off 10 promises, you fire it off once,
           | look up the ID in your map for the next nine, and return that
           | same promise for the next 9.
        
         | maest wrote:
         | More generally, I believe this is a monad structure.
        
           | pdpi wrote:
           | Promises are not monads, for one simple reason: they're not
           | referentially transparent. The whole point of OP's example is
           | to double down and take advantage of that.
        
             | epolanski wrote:
             | Referential transparency is a property of expressions, not
             | type classes. Referential transparency is nowhere in the
             | definition of a monad because a monad is obviously not an
             | expression but a type class.
             | 
             | It is definitely true that Promise is a data type that
             | _could_ admit a monad instance. It has:
             | 
             | - a data type M a which is Promise a
             | 
             | - a function pure with the signature a => M a, which is x
             | => Promise.resolve(x)
             | 
              | - a bind function with the signature M a => (a => M b) =>
              | M b, which is essentially `then`
             | 
             | But a monad requires three properties to hold true: Right
             | identity, Left identity and associativity. Right identity
             | holds for every function `f: a => M b`, but left identity
             | and associativity do not hold.
        
               | brewmarche wrote:
               | Maybe comonads fit better
        
         | Shorel wrote:
         | I know this as memoization with lazy evaluation, nothing new,
         | but at the same time very useful, and I would argue, it is very
         | CS.
        
           | hn_throwaway_99 wrote:
            | Not really _lazy_ evaluation, though, at least in
            | JavaScript. Promises begin execution as soon as they are
            | created, and there is no way to delay that.
        
             | marcosdumay wrote:
             | Promises are an implementation of lazy evaluation for
             | Javascript. This is exactly lazy evaluation.
             | 
             | By the way, lazy evaluation is the one that offers no
             | guarantees about when your code will be executed. If you
             | can delay the execution, it's not lazy.
        
               | anyfoo wrote:
               | Wait. I thought lazy evaluation is defined as evaluation
               | _at the time a value is needed_. After that I think it
               | will then be in Weak Head Normal Form, which can be
               | thought of as  "evaluated at its top level"... but I'm a
               | bit rusty. Basically, an expression gets used somewhere
               | (e.g. pattern matched, in the ADT sense). If it's in
               | WHNF, cool, it's evaluated already (subexpressions within
               | may not yet, but that's their problem--that's exactly
               | what WHNF says). If not, we evaluate it down to WHNF.
               | 
               | Wikipedia states it as: "When using delayed evaluation,
               | an expression is not evaluated as soon as it gets bound
               | to a variable, but when the evaluator is forced to
               | produce the expression's value."
               | 
               | So if evaluation begins effectively at the time the
               | promise is issued, not at the time the promise is
               | awaited, then I would not call that lazy evaluation, and
               | what hn_throwaway_99 said sounds correct to me.
               | 
               | Is my rusty old PL understanding wrong?
        
               | Sohcahtoa82 wrote:
               | > Wait. I thought lazy evaluation is defined as
               | evaluation at the time a value is needed.
               | 
               | Correct.
               | 
               | You've got it right, GP comment has it wrong.
               | 
               | Lazy does not simply mean "evaluated later". From what I
               | understand, in JS, if you call an async function without
               | `await`, it essentially gets added to a queue of
               | functions to execute. Once the calling function stack
               | completes, the async function will be executed.
               | 
               | A queue of functions to execute does not constitute "lazy
               | evaluation" on its own.
        
               | [deleted]
        
               | cstrahan wrote:
               | Definition on Wikipedia says otherwise:
               | 
               | "In programming language theory, lazy evaluation, or
               | call-by-need,[1] is an evaluation strategy which delays
               | the evaluation of an expression until its value is needed
               | (non-strict evaluation) and which also avoids repeated
               | evaluations (sharing)."
               | 
               | Haskell would be an apt example here.
        
           | sethammons wrote:
           | I've always called it request piggy backing (atop
           | memoization). Our db abstraction at my last gig absolutely
           | required it for scaling purposes. Great tool for the tool
           | belt.
        
         | abuckenheimer wrote:
         | This is clever, otherwise you have to coordinate your cache
         | setter and your async function call between potentially many
         | concurrent calls. There are ways to do this coordination
          | though; one implementation I've borrowed in practice in Python
          | is mode's stampede decorator:
         | https://github.com/ask/mode/blob/master/mode/utils/futures.p...
        
         | ralusek wrote:
         | I have written many tools that do something like this, very
         | useful.
         | 
         | Just gotta be careful with memory leaks, so (for a TS example)
         | you might want to do something like this:
          |     const promiseMap = new Map<string, Promise<any>>();
          | 
          |     async function keyedDebounce<R>(key: string, fn: () => Promise<R>) {
          |       const existingPromise = promiseMap.get(key);
          |       if (existingPromise) return existingPromise;
          | 
          |       const promise = (async () => {
          |         const result = await fn(); // ignore lack of error handling
          |         // avoid a memory leak by cleaning up
          |         promiseMap.delete(key);
          |         // this resolves the .then or awaited promises everywhere
          |         return result;
          |       })();
          |       promiseMap.set(key, promise);
          | 
          |       return promise;
          |     }
         | 
         | So that the promise sticks around for that key while the
         | function is active, but then it clears it so that you're not
         | just building up keys in a map.
        
           | jdthedisciple wrote:
           | You check for if(promiseMap.get(key)) and in the NO case you
           | do promiseMap.delete(key)?
           | 
           | Could you explain why that's necessary? (sorry probs a stupid
           | question)
        
             | aidos wrote:
              | In the no case, the promise is created, and _then_ deleted
              | after the function is resolved.
              | 
              | I'm not entirely sure of the use of this pattern where
              | you're using a map and deleting the keys as you go - seems
              | like you'd end up doing more work kinda randomly depending
              | on how the keys were added. I'd just release the whole map
              | when I was done with it.
        
             | ralusek wrote:
             | So if the promise still exists, that means that there is an
             | active call out for a promise using this key already, so
             | we'll just return it. We know this because when the
             | function that is executing finishes, we delete the promise
             | from the map.
             | 
             | A purpose of this might be, let's say you rebuild some
             | structure very frequently, like hundreds of times per
             | second or whatever. You can call this function for a given
             | key as many times as you want while that async process is
             | executing, and it will only execute it once, always
             | returning the same promise.
        
           | blue_pants wrote:
            | Please correct me if I'm wrong, but you'd better avoid
            | creating extra promises for performance reasons:
            | 
            |     const promiseMap = new Map<string, Promise<any>>();
            | 
            |     async function keyedDebounce<R>(key: string, fn: () => Promise<R>) {
            |       const existingPromise = promiseMap.get(key);
            |       if (existingPromise) return existingPromise;
            | 
            |       const promise = fn().then(result => {
            |         promiseMap.delete(key);
            |         return result;
            |       });
            |       promiseMap.set(key, promise);
            | 
            |       return promise;
            |     }
        
           | lolinder wrote:
           | This is a great solution to the thundering herd problem, but
           | isn't OP explicitly trying to cache the results for later? In
           | that case, the promise map is a clever way to make sure the
           | cachable value is fetched at most once, and you _want_ the
            | keys to build up in the map, so GC'ing them is
           | counterproductive.
        
             | kevinforrestcon wrote:
             | If the result of the async function is not necessarily the
             | same value every time you call it, you may want to
             | recompute it
        
             | ralusek wrote:
             | Ya it obviously depends on the intended goal, but caching
             | items in a map indefinitely better be done on a set of
             | values that is known to be limited to a certain size over
             | the lifetime of the server or it'll lead to a memory leak.
        
           | withinboredom wrote:
           | Won't JS garbage collect orphaned references? Why is this
           | necessary?
        
             | deckard1 wrote:
             | there is some confusion here.
             | 
             | OP is intentionally caching results of, presumably,
             | expensive computations or network calls. It's a cache.
             | There is no memory leak. OP just did not detail how cache
             | invalidation/replacement happens. The 2nd comment adds a
             | rather rudimentary mechanism of removing items from cache
             | as soon as they are ready. You get the benefit of batching
             | requests that occur _before_ the resolve, but you get the
             | downside of requests coming in right after the resolve
              | hitting the expensive process again. Requests don't neatly
             | line up at the door and stop coming in as soon as your
             | database query returns.
             | 
             | Typically you would use an LRU mechanism. All caches
             | (memcached, redis, etc.) have a memory limit (whether fixed
             | or unbounded, i.e. all RAM) and a cache replacement policy.
        
             | stickupkid wrote:
             | It does, but the map has a reference to it, so it will
             | "leak" (in gc languages an unwanted reference is considered
             | a leak). If this map got rather large, you could end up
              | with a rather large heap and it would be non-obvious why at
             | first.
        
               | withinboredom wrote:
               | Do Promises hold a reference to the chain of functions
               | that end in the result? If so, that seems like a bug.
        
               | epolanski wrote:
               | A promise is just an object, it does not contain any
               | reference to the chained functions.
        
           | andrewingram wrote:
           | Could you use a WeakMap for this instead?
        
             | skrebbel wrote:
             | If the key is a type that you expect to be GC'ed, yes,
             | totally. If the key is a simple value type eg a string then
             | it behaves the same as a regular Map or an object.
        
               | andrewingram wrote:
               | Ah yeah, good point.
               | 
               | I've mostly used this overall pattern with something like
               | Dataloader, where you throw away the entire cache on
               | every request, so the GC problem doesn't crop up.
        
         | Cthulhu_ wrote:
         | I didn't know this was a formal design pattern; I do recall
         | having implemented this myself once, as a means to avoid double
         | requests from different components.
         | 
         | Later on I switched to using react-query which has something
         | like this built-in, it's been really good.
        
         | deltasevennine wrote:
         | Can someone explain to me why the second example is better.
         | 
         | To me it seems to be the same thing. Replace result with int
         | and I literally do not see a problem with the first one.
         | 
         | Also why is a mutex or lock needed for Result in javascript? As
         | far as I know... In a single threaded application, mutexes and
         | locks are only needed for memory operations on two or more
         | results.
         | 
         | With a single value, say an int, within javascript you are
         | completely safe in terms of concurrent access. Only 2 or more
         | variables can cause a race condition. Or am I wrong here?
         | 
         | --edit:
         | 
         | Thanks for the replies. I see now. The mutex concept is not
         | around memory access or memory safety. It's solely around the
         | computation itself. When the Promise is "pending". It has
         | nothing to do with safety. The parent did mention this, but I
         | completely glossed over it.
        
           | deltasevennine wrote:
           | Wait a minute.
           | 
           | Under Javascript, The behavior induced by this pattern is
           | EXACTLY equivalent to that of CACHING the result directly and
           | a BLOCKING call. The pattern is completely pointless if you
           | view it from that angle. Might as well use blocking calls and
           | a regular cache rather than async calls and cached promises.
           | 
           | So then the benefit of this pattern is essentially it's a way
           | for you to turn your async functions into cached non-async
           | functions.
           | 
           | Is this general intuition correct?
           | 
           | --edit: It's not correct. Different async functions will
           | still be in parallel. It's only all blocking and serial under
           | the multiple calls to the SAME function.
        
           | ohgodplsno wrote:
           | Say you have three elements in your UI that need to fetch
           | some user info. You have two major options:
           | 
           | 1/ Have a higher level source of truth that will fetch it
           | once from your repository (how does it know it needs to
           | fetch? How is cache handled ?) and distribute it to those
           | three elements. Complex, makes components more independent
           | but also more dumb. It's fine to have pure elements, but
           | sometimes you just want to write <UserInfoHeader/> and let it
           | handle its stuff.
           | 
            | 2/ Your repository keeps this Map<Key, Promise<T>>, and every
            | time you call getUserInfo(), it checks the map at key
            | "userinfo" and either returns the promise (which might be
            | ongoing, or already resolved) or sees that it's not there and
            | does the call, writing the promise back into the map. This
            | way, your three components can just call getUserInfo()
            | without giving a damn about any other ones. The first one
            | that calls it pre-resolves it for the others.
           | 
           | As to why a promise instead of just the raw result: one can
           | potentially return null (and you need to call again later to
           | refresh, or straight up blocks during the entire call), the
           | other one just gives you back promises and you can just
           | listen to them and update your UI whenever it's ready (which
           | might be in 5 seconds, or right now because the promise has
           | already been resolved)
           | 
           | It's a bad implementation of a cached repository (because it
           | ignores TTL and invalidation as well as many problems that
           | need to be handled) that any junior developer could figure
           | out (so it's everything but obscure), but sometimes, eh, you
           | don't need much more.
        
           | Marazan wrote:
           | Because the computation of Result is _expensive_ and _async_.
           | 
           | So if you get two requests for the computation that computes
           | Result in close succession the Map doesn't have the result in
           | place for the second call and as a result you get no caching
           | effect and make 2 expensive requests.
           | 
           | By caching the Promise instead the second request is caught
           | by the cache and simply awaits the Promise resolving so now
           | you only get 1 expensive request.
        
           | diputsmonro wrote:
           | I believe the idea is that if you stored the result instead,
           | you would have to wait for the promise to resolve the first
           | time. If you made two requests in quick succession, the
           | second request wouldn't see anything in the result map for
           | the first one (as it hasn't resolved yet) and would then need
           | to make its own new request.
           | 
           | If you store the promise, then not only does the second
           | attempt know that it doesn't need to make a new request, but
           | it can also await on the promise it finds in the map
           | directly.
        
           | [deleted]
        
           | JimDabell wrote:
           | You can put the promise into the cache immediately but you
           | can only put the result from the promise into the cache once
           | the promise resolves. So if an identical request comes in a
           | second time before the promise has been resolved, then if you
           | are caching the promise you have a cache hit but if you are
           | caching the result then you have a cache miss and you end up
           | doing the work twice.
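A minimal JavaScript sketch of the difference described above (names are illustrative): because the promise goes into the map synchronously, a second request made before resolution gets the very same promise back instead of triggering a second round of work.

```javascript
// Hypothetical sketch: cache the in-flight promise, not the resolved value.
const inFlight = new Map();
let calls = 0;

function cachedFetch(key, fn) {
  // The promise is stored synchronously, so a second request arriving
  // before resolution is a cache hit rather than a miss.
  if (inFlight.has(key)) return inFlight.get(key);
  const promise = Promise.resolve().then(fn);
  inFlight.set(key, promise);
  return promise;
}

const expensive = () => { calls += 1; return 42; };
const first = cachedFetch('user:1', expensive);
const second = cachedFetch('user:1', expensive); // same promise object
```

Both callers end up awaiting one shared promise, so `expensive` runs exactly once no matter how many requests arrive while it is in flight.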
        
             | epolanski wrote:
             | I am still not understanding the purpose of this as I
             | believe it is grounded on the wrong assumption.
             | 
             | Pretty much every single asynchronous operation other than
             | some `Promise.resolve(foo)` where foo is a static value can
             | fail. Reading from the file system, calling an api,
             | connecting to some database, etc.
             | 
             | If the original promise fails you're gonna return a cached
             | failure.
             | 
             | Mind you, I'm not stating this might be completely useless,
             | but at the end of the day you will be forced to add code
             | logic to check all of those asynchronous computation
              | results which will eventually outweigh the cost of only
             | saving the resolved data.
        
               | lucideer wrote:
                | > _If the original promise fails you're gonna return a
               | cached failure._
               | 
               | This is usually a good thing, and even in cases where it
               | isn't, it's often a worthwhile trade-off.
               | 
               | In the common case a failure will either be persistent,
               | or - if load/traffic related - will benefit from a
               | reduction in requests (waiting a while to try again). In
               | both of these cases, where your first key request fails,
               | you want the immediately-following cases to fail fast:
               | "caching" the promise caches the failure for the duration
               | of one request (presumably the cache is emptied once the
               | promise resolves, allowing subsequent key accesses to
               | retry).
               | 
               | The less common case where the above isn't true is where
               | you have very unstable key access (frequent one-off
               | failures). In those cases you might want a cache miss on
               | the second key request, but successful key retrieval
               | usually isn't as critical in such systems which makes the
               | trade off very worthwhile.
        
               | balls187 wrote:
               | It's not the cost of saving the resolved data.
               | 
               | If I understand the pattern correctly, it is to avoid
               | multiple asynchronous requests to a resource that has yet
               | to be cached.
        
               | gorjusborg wrote:
               | Yeah, that's my understanding too.
               | 
               | It seems like an optimization to prevent lots of
               | downstream requests that occur in rapid succession before
               | the first request would have finished. I'd also suspect
               | that the entry would be removed from the map on failure.
        
               | dragonwriter wrote:
               | > If the original promise fails you're gonna return a
               | cached failure.
               | 
               | In many cases, another near-in-time request would also
               | fail, so returning a cached failure rather than failing
               | separately is probably a good idea (if you need retry
               | logic, you do it within the promise, and you still need
               | only a single instance.)
               | 
               | (If you are in a system with async and parallel
               | computation both available, you can also use this for
               | expensive to compute pure functions of the key.)
        
               | deltasevennine wrote:
               | It's a tiny optimization.
               | 
               | When the VERY FIRST ASYNC operation is inflight the cache
               | is immediately loaded with a Promise, which blocks all
               | other calls while the FIRST async operation is in flight.
               | This is only relevant to the very FIRST call. That's it.
               | After that the promise is pointless.
               | 
               | As for the Promise failure you can just think of that as
               | equivalent of the value not existing in the cache. The
               | logic should interpret the promise this way.
        
               | lightbendover wrote:
               | It's not always a tiny optimization. If you have an
               | expensive query or operation, this prevents potentially
               | many duplicative calls.
               | 
               | A practical example of this was an analytics dashboard I
               | was working on years ago -- the UI would trigger a few
               | waves of requests as parts of it loaded (batching was
               | used, but would not be used across the entire page). It
               | was likely that for a given load, there would be four or
               | more of these requests in-flight at once. Each request
               | needed the result of an expensive (~3s+) query and
               | computation. Promise caching allowed operations past the
               | first to trivially reuse the cached promise.
               | 
               | There are certainly other approaches that can be taken,
               | but this works very well as a mostly-drop-in replacement
               | for a traditional cache that shouldn't cause much of a
               | ripple to the codebase.
        
               | cerved wrote:
               | You do one IO operation instead of two.
               | 
               | It's unlikely that always doing two is going to be more
               | successful than just trying once and dealing with failure
        
               | zbentley wrote:
               | Yep, I've been bitten by exactly this failure mode.
               | 
               | You have to invalidate/bust the cache when a failure is
               | returned (which is racy, but since cache busting is on
               | the sad path it's a fine place to protect with a plain
               | ol' mutex, assuming you even have real
               | parallelism/preemption in the mix).
               | 
               | Alternatively you can cache promises to functions that
               | will never return failure and will instead internally
               | retry until they succeed. This approach generalizes less
               | well to arbitrary promises, but is more friendly to
               | implementing the custom back off/retry scheme of your
               | choice.
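A sketch of that invalidation in JavaScript (hypothetical names): the entry is evicted from the map when its promise rejects, so later callers retry, while in-flight callers still observe the original failure.

```javascript
const inFlight = new Map();

function cachedCall(key, fn) {
  if (inFlight.has(key)) return inFlight.get(key);
  const promise = Promise.resolve().then(fn);
  // Side branch: on rejection, bust the cache entry so later calls retry.
  // The returned promise itself still rejects for its callers to handle.
  promise.catch(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

Callers that arrived while the doomed promise was in flight all share the same rejection; only calls arriving after the eviction trigger a fresh attempt.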
        
               | zamalek wrote:
               | > If the original promise fails you're gonna return a
               | cached failure.
               | 
               | In the scenario where that's an issue, you would need to
               | add (extremely trivial 3-5 lines) logic to handle
               | retrying a cached failure. The underlying data structure
               | would continue to be a promise map.
        
         | mperreux wrote:
         | This is great. I'm actually running into this problem but
         | across distributed clients. Is there a distributed way to
         | achieve something like this with, say, Redis?
        
         | [deleted]
        
         | yla92 wrote:
         | Anybody knows if there is similar pattern for Python's asyncio?
        
           | ActorNightly wrote:
           | https://docs.python.org/3.9/library/asyncio-
           | future.html#asyn...
        
           | abrichr wrote:
           | How about a dict whose values are asyncio.Future?
           | 
           | https://docs.python.org/3/library/asyncio-
           | future.html#asynci...
        
         | wingspan wrote:
         | I love this approach and have used it many times in JavaScript.
         | I often end up adding an additional map in front with the
         | resolved values to check first, because awaiting or then'ing a
         | Promise always means you will wait until the next microtask for
         | the value, instead of getting it immediately. With a framework
         | like React, this means you'll have a flash of missing content
         | even when it is already cached.
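A sketch of that two-layer arrangement (hypothetical names): a synchronous map of resolved values is consulted first, so a warm cache hit returns the value immediately with no microtask delay, while cold and in-flight lookups fall back to the promise map.

```javascript
const resolved = new Map(); // synchronous hits: plain values
const pending = new Map();  // in-flight work: promises

function cachedGet(key, fn) {
  if (resolved.has(key)) return resolved.get(key); // no microtask delay
  if (pending.has(key)) return pending.get(key);
  const promise = Promise.resolve().then(fn).then(value => {
    resolved.set(key, value); // promote to the synchronous layer
    pending.delete(key);
    return value;
  });
  pending.set(key, promise);
  return promise;
}
```

The trade-off is that the caller must be prepared for either a plain value or a promise coming back, which is exactly what lets UI code skip the flash of missing content on a warm hit.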
        
           | skrebbel wrote:
           | This surprises me. I had expected the microtask to be
           | executed right after the current one, ie before any layouting
            | etc - isn't that the whole point of the "micro" aspect?
        
             | chmod775 wrote:
             | The event loop shouldn't continue the circle until the
             | microtask queue is empty, from what I remember.
             | 
             | Browsers probably used to squeeze layouting and such things
             | in when the event loop completed an iteration or became
              | idle. However nowadays it's entirely possible that we have,
             | will have, browsers that do even work in parallel while the
             | javascript event loop is still running, possibly even
             | beginning to paint while javascript is still running
             | synchronously (haven't seen that yet).
             | 
             | I've seen instances where browsers seem to be like "nah,
             | I'll render this part later" and only would render some
             | parts of the page, but certain layers (scrolling
             | containers) would not be filled on the first paint -
             | despite everything being changed in one synchronous
             | operation!* And I've seen them do that at least 5 years
             | ago.
             | 
             | It seems to me the situation is complicated.
             | 
             | * There was a loading overlay and a list being filled with
             | data (possibly thousands of elements). The loading overlay
              | would be removed _after the list was filled_, revealing
             | it. What often happened was that the overlay would
             | disappear, but the list still looking empty for a frame or
             | two.
        
             | wingspan wrote:
             | I was able to reproduce this in a simple example [1]. If
             | you refresh it a few times you will be able to see it flash
             | (at least I did in Chrome on Mac). It is probably more
             | noticeable if you set it up as an SPA with a page
             | transition, but I wanted to keep the example simple.
             | 
             | [1] https://codesandbox.io/s/recursing-
             | glitter-h4c83u?file=/src/...
        
           | pflanze wrote:
           | Replacing the promise with the result in the map (using a sum
           | type as the map value type) might be more efficient than
           | maintaining two maps?
        
         | moonchrome wrote:
         | Just make sure Map access is thread safe - awaiting promises is
         | - but updating/reading map keys usually isn't.
        
           | skrebbel wrote:
           | Yep, in multithreaded langs you just want to use a
           | ConcurrentMap or whatever the stdlib has for this. The write
           | is very brief because you only write the promise and not eg
           | keep a proper lock on the key until the result has arrived,
           | so no need for a fancier datastructure.
        
           | unrealhoang wrote:
           | js is single-threaded so no problem there.
        
             | moonchrome wrote:
             | Since OP mentioned Mutexes I assumed he's dealing with
             | multithreaded code, but with JS this works as advertised :)
        
         | MobiusHorizons wrote:
          | I have a similar pattern when using an rxjs pipeline to do
          | async tasks in response to changes in other values. The typical
          | pattern would be
          | 
          |     readonly foo = bar.pipe(
          |       switchMap(v => doTask(v)),
          |       shareReplay(1),
          |     );
          | 
          | where doTask returns a promise or an AsyncSubject. The
          | shareReplay allows foo to cache the last value so any late
          | subscriber will get the most recent value. This has the problem
          | of returning cached values while new work is happening (e.g.
          | while a network request is in progress), so instead I like to
          | write
          | 
          |     readonly foo = bar.pipe(
          |       map(v => doTask(v)),
          |       shareReplay(1),
          |       switchMap(v => v),
          |     );
          | 
          | This way the shareReplay is caching the promise instead of the
          | result. With this pattern I can interact with the pipeline from
          | code like so:
          | 
          |     async someEventHandler(userInput) {
          |       this.bar.next(userInput);
          |       const result = await firstValueFrom(this.foo);
          |       ...
          |     }
          | 
          | Without the pattern, this code would get the old result if
          | there was ever any cached value.
        
         | bschaeffer wrote:
         | I wrote a version of this with Elixir:
         | https://github.com/bschaeffer/til/tree/master/elixir/gen_ser...
         | 
         | Didn't know what to call it but PromiseMaps is nice name for
         | it.
        
           | lightbendover wrote:
           | Promise Caching has a nicer name to it since that's what it
           | is if well-implemented.
        
         | asadawadia wrote:
         | Yeah, saves you from thundering herd problems
         | 
         | https://github.com/ben-manes/caffeine cache does something
         | similar by caching the future that gets returned on look-ups if
         | the value is still being computed
        
         | idontwantthis wrote:
         | It's also the clearest and least buggy way to iterate over the
         | results: just await Promise.all() over the map's values.
        
           | unsupp0rted wrote:
           | Partly off-topic, but in JS I tend to avoid iterating over
           | promises with maps because it's always confusing what
           | will/won't work, so I use a basic FOR loop because
           | async/await works in there.
           | 
           | How bad is this? Should I switch to Promise.all instead?
        
             | seedboot wrote:
             | Promise.all/allSettled/etc will allow you to resolve your
             | promises concurrently rather than awaiting each result in
             | the for loop. Depends what your use case is, really.
        
             | mromnia wrote:
             | Iterating with async/await means you wait for every result
             | before making another call. You basically execute the async
             | calls one by one.
             | 
             | Promise.all runs them all simultaneously and waits until
             | they are all complete. It returns an array of results in
             | the same order as the calls, so it's usually pretty
             | straightforward to use.
             | 
             | Both approaches have a purpose, so it's not like you
             | "should" strictly use one. But you should be aware of both
             | and be able to use the one that fits the need.
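The two approaches side by side, as a sketch (fetchOne is a hypothetical stand-in for a real async call):

```javascript
async function fetchOne(id) {
  return id * 2; // stand-in for a network request
}

// Sequential: each await completes before the next call even starts.
async function sequential(ids) {
  const results = [];
  for (const id of ids) {
    results.push(await fetchOne(id));
  }
  return results;
}

// Concurrent: all calls start immediately; Promise.all preserves input order.
function concurrent(ids) {
  return Promise.all(ids.map(id => fetchOne(id)));
}
```

Both produce the same array; they differ only in whether the calls overlap in time, which is why the sequential form suits rate-limited or order-dependent work and Promise.all suits independent calls.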
        
               | dmix wrote:
               | The third major strategy being Promise.any():
               | 
               | - as soon as any of them resolve (any)
               | 
               | - when all of them resolve (all)
               | 
               | - resolve each in order (for)
        
         | samatman wrote:
         | promise map?
         | 
          | do you mean
          | 
          |     setmetatable({}, {__index = function(tab, key)
          |         local co = coroutine.running()
          |         uv.read(fd, function(err, result)
          |             tab[key] = result
          |             return coroutine.resume(co, result)
          |         end)
          |         return coroutine.yield()
          |     end})
         | 
         | This is an incomplete implementation, clearly: but I've never
         | missed Promises with coroutines and libuv.
        
         | jtwigg wrote:
         | FYI, if you plan to do this in ASP netcore, combine with
         | AsyncLazy for the most optimal results
         | https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...
        
           | [deleted]
        
           | alexchro93 wrote:
           | Neat page from David Fowler. I didn't know about that one.
           | Thank you.
           | 
           | To further explain why you want to use an AsyncLazy (or Lazy
           | for synchronous programming) when working with
           | ConcurrentDictionary I find this article to be helpful:
           | https://andrewlock.net/making-getoradd-on-
           | concurrentdictiona...
        
       | ajjenkins wrote:
       | I recently learned about deques in Python (other languages may
       | have it too) which is like a linked list but it keeps pointers
       | for the first and last elements. This allows you to
       | insert/read/remove elements from the beginning or end of the list
       | in constant time. I've never used it, but I could imagine it
       | being useful in certain situations.
       | 
       | https://docs.python.org/3/library/collections.html
        
         | sgtnoodle wrote:
         | A deque may be implemented as a linked list, but it's actually
         | more common to be implemented on top of arrays.
         | 
         | Python in particular uses arrays under the hood for almost
         | everything because modern computer hardware is so blazingly
         | fast manipulating them.
        
           | foolswisdom wrote:
           | Arrays aren't efficient when you want to add/remove to the
           | head. Deque in python exists so there is a data structure
           | with constant time pop/push to the head. And it is in fact
           | implemented as a doubly linked list, with various
           | optimizations: https://github.com/python/cpython/blob/v3.8.1/
           | Modules/_colle....
        
             | sgtnoodle wrote:
             | Neat, it's always best to look at the source. To be clear,
             | a deque can be implemented on top of an array and still
             | have constant time operations on the head.
        
               | sicp-enjoyer wrote:
               | If it's not constant time on both ends, it's not a deque.
               | You can already insert items at any index in an array
               | with linear cost.
        
               | sgtnoodle wrote:
               | I invite you to think about the problem for a while, and
               | try to think of a way to implement a performant deque on
               | top of a resize-able array. I promise you it's possible.
        
               | eru wrote:
               | I don't think the comment you replied to disagreed with
               | that.
        
             | moonchild wrote:
             | It is not commonly done, but there is no reason why you
             | can't have an array with amortised constant-time insertion
             | at both the head and the tail.
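One way to see this: a ring buffer over a plain array, where the head index wraps around modulo the capacity, giving O(1) pushes and pops at both ends. A minimal JavaScript sketch (hypothetical class; growing the buffer on overflow is elided for brevity):

```javascript
// Fixed-capacity deque backed by a single contiguous array.
class RingDeque {
  constructor(capacity = 8) {
    this.buf = new Array(capacity);
    this.head = 0; // index of the first element
    this.size = 0;
  }
  pushBack(x) {
    this.buf[(this.head + this.size) % this.buf.length] = x;
    this.size += 1;
  }
  pushFront(x) {
    // Step head backwards, wrapping around the end of the array.
    this.head = (this.head - 1 + this.buf.length) % this.buf.length;
    this.buf[this.head] = x;
    this.size += 1;
  }
  popFront() {
    const x = this.buf[this.head];
    this.head = (this.head + 1) % this.buf.length;
    this.size -= 1;
    return x;
  }
  popBack() {
    this.size -= 1;
    return this.buf[(this.head + this.size) % this.buf.length];
  }
}
```

A production version doubles the array and re-copies when full, which is what makes the operations amortised rather than strictly constant time.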
        
               | sgtnoodle wrote:
               | I think it's more common than not to implement deques on
               | top of arrays. There's a meaningful performance
               | difference at the hardware level due to the spatial
               | locality in memory. Linked lists in general cause the CPU
               | to jump all over the place in memory, causing pressure on
               | the cache and whatnot.
               | 
               | I believe the standard C++ deque implementation consists
               | of a vector of vectors. The purpose of the two layers is
               | at least two-fold. It guarantees that elements stored
               | within the deque remain at a stable memory location at
               | all times, even during a reallocation. Also, the memory
               | getting shifted around during a reallocation consists of
               | relatively tiny vector metadata rather than the
               | application data.
               | 
               | To speculate, I suspect the reason it's a linked list in
               | python is due to the age of the module. It was likely
               | written relatively early on in python's development, and
               | the linked list implementation was simple and elegant.
               | Now enough stuff uses it that it would be painful to
               | change.
               | 
               | Dicts are hashmaps in python rather than binary trees for
               | similar performance reasons. You can't even find any
               | binary tree based containers in python's standard
               | library. You can find containers with the same
               | performance characteristics and semantics of binary
               | trees, but under the hood they're implemented on top of
               | arrays because it better maps to hardware.
        
       | okennedy wrote:
       | Linked Hash/Tree Maps, simple, but elegant. A Map with its nodes
       | connected in a linked list so you can traverse them in insertion
       | order (and O(n) time). Very useful for window queries over
       | sequential data and other cases where you want FIFO access, but
       | also quick access by a field of the data.
        
         | xxs wrote:
         | Why would Linked Hash Map be a obscure one? It's been a part of
         | java collection framework for over 20y. It's a standard for LRU
         | caches as well.
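JavaScript's built-in Map also iterates in insertion order, so the same LRU trick works there. A hypothetical sketch: delete-and-reinsert on access keeps the most recently used key at the back, and the front key is always the eviction candidate.

```javascript
class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    // Re-insert to move the key to the "most recently used" end.
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

Java's LinkedHashMap makes this even shorter: constructed with accessOrder=true, overriding removeEldestEntry is the entire LRU policy.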
        
           | okennedy wrote:
            | Hence the "simple, but elegant". It's not really obscure, but
           | also not on the common path. I've lost track of the number of
           | times I've pointed this one out to a colleague or student.
        
         | H8crilA wrote:
         | You forgot the most important feature over normal hash maps:
         | they offer a deterministic iteration order without incurring
         | the cost of a tree-based ordered map.
         | 
         | (If you don't know why this is important then maybe you haven't
         | worked on large systems that undergo rigorous evaluation)
        
           | okennedy wrote:
           | Modulo the snark, you're completely right. Iteration order =
           | insertion order is super valuable for a bunch of reasons
           | (this one included).
        
           | andyferris wrote:
           | You can do this without a linked list and it can be quite
           | performant, like python's newest dictionary implementation
           | (appeared around Python 3.6 or 3.7 from memory).
        
           | ztorkelson wrote:
           | Is (non-)determinism really the right concern here? I'm aware
           | that most hash tables do not have generally predictable
           | iteration orders, but I nevertheless understood them to be
           | deterministic.
        
             | andyferris wrote:
             | Well, hard/impossible to predict perhaps. Iteration order
             | can depend on the order things were inserted and deleted
             | and may differ from computer to computer (for example in
             | Julia the hash-based dictionary ordering differs between 32
             | and 64 bit systems, and might change between versions of
             | Julia, etc - you'd see the same thing with C++ unordered
             | maps, etc, etc).
        
               | okennedy wrote:
                | The same thing was a huge issue in Python until somewhere
                | around 3.6, when they made iteration order match insertion
                | order.
        
             | xxs wrote:
             | >Is (non-)determinism really the right concern here?
             | 
             | A massive one - there is a lot of code that implicitly
             | depends on the order of iteration (think
             | configurations/initialization). The issue might be
             | invisible, and reproduce rarely in production only. The
             | iteration order depends on the hash of the keys, plus the
             | capacity of the underlying array.
             | 
             | In Java I advise to never use the standard HashMap in favor
             | of LinkedHashMap. In cases where data density is important
             | both are a terrible fit having nodes instead of a plain
             | object array (and linear probe/search).
        
             | ilammy wrote:
             | They are deterministic in the sense that provided
             | everything else is the same, iteration order would be the
             | same after each insertion.
             | 
             | However, it can change after insertion (if the hash bucket
             | count changes). If the hash function is random-keyed,
             | iteration order would be different for different instances
             | of the hash table even if you insert the same items in the
             | same order.
             | 
             | Sometimes you want the order to be consistent every time.
             | Be it insertion order, or some natural order of the items.
        
       | vojtah wrote:
       | Fenwick Tree: https://cp-
       | algorithms.com/data_structures/fenwick.html
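[Ed.: since the comment is link-only, here is a hedged thumbnail of what a Fenwick (binary indexed) tree does - point updates and prefix sums in O(log n) via the lowest-set-bit trick:]

```python
class Fenwick:
    """Binary indexed tree over n values, 1-based indexing."""
    def __init__(self, n):
        self.t = [0] * (n + 1)

    def update(self, i, delta):
        while i < len(self.t):
            self.t[i] += delta
            i += i & -i        # jump to the next responsible node

    def query(self, i):
        s = 0                  # sum of values 1..i
        while i > 0:
            s += self.t[i]
            i -= i & -i        # strip the lowest set bit
        return s
```

Range sums fall out as `query(r) - query(l - 1)`.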
        
       | romes wrote:
       | Equality graphs (e-graphs) for theorem proving and equality
       | saturation and other equality-related things.
       | 
       | They're awesome data structures that efficiently maintain a
       | congruence relation over many expressions
       | 
       | > At a high level, e-graphs extend union-find to compactly
       | represent equivalence classes of expressions while maintaining a
       | key invariant: the equivalence relation is closed under
       | congruence.
       | 
       | e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and
       | then said "x == y" (merged "x" and "y" in the e-graph), then the
       | e-graph, by congruence, would be able to tell me that "f(x) ==
       | f(y)"
       | 
       | e.g. If I were to represent "a*(2/2)", in the e-graph, then say
       | "2/2 == 1", and "x*1 == x", by congruence the e-graph would know
       | "a*(2/2) == a" !
       | 
       | The most recent description of e-graphs with an added insight on
       | implementation is https://arxiv.org/pdf/2004.03082.pdf to the
       | best of my knowledge.
       | 
       | P.S: I'm currently implementing them in Haskell
       | https://github.com/alt-romes/hegg
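[Ed.: a toy rendition of the congruence behaviour described above - union-find plus a hashcons table that is re-canonicalized after merges. This is nothing like egg's optimized rebuilding; all names are invented for illustration:]

```python
class EGraph:
    """Minimal e-graph sketch: e-classes via union-find, e-nodes
    hashconsed on (op, canonical child class ids)."""
    def __init__(self):
        self.parent = {}   # union-find over e-class ids
        self.table = {}    # (op, child class ids) -> e-class id

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, op, *children):
        key = (op, tuple(self.find(c) for c in children))
        if key not in self.table:
            cid = len(self.parent)
            self.parent[cid] = cid
            self.table[key] = cid
        return self.table[key]

    def merge(self, a, b):
        if self.find(a) != self.find(b):
            self.parent[self.find(a)] = self.find(b)
            self._rebuild()

    def _rebuild(self):
        # Re-canonicalize until no new congruences appear: nodes whose
        # children became equal collapse into the same e-class.
        changed = True
        while changed:
            changed = False
            new = {}
            for (op, ch), cid in self.table.items():
                key = (op, tuple(self.find(c) for c in ch))
                if key in new and self.find(new[key]) != self.find(cid):
                    self.parent[self.find(cid)] = self.find(new[key])
                    changed = True
                new[key] = self.find(cid)
            self.table = new
```

This reproduces the f(x)/f(y) example from the comment: merging x and y makes f(x) and f(y) congruent, so they land in the same e-class.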
        
         | FnuGk wrote:
          | Isn't that one of the building blocks of logic-based
          | programming languages like Prolog?
        
           | Twisol wrote:
           | I had a similar thought when I learned about e-graphs. I'm
           | not sure yet, but I think congruence closure is a slightly
           | different problem from the one Prolog unification solves. In
           | particular, if you tell an e-graph that `f(x) = g(y)`, it
           | will take you at your word -- while Prolog would give a
           | unification error. (An e-graph should also handle `X = f(X)`,
           | while this would fail Prolog's admittedly often-ignored
           | occurs check.)
        
           | superlopuh wrote:
            | Yes, at least these egg-based projects are using this idea,
           | but it looks like it's just one of the options:
           | 
           | https://souffle-lang.github.io/docs.html
           | http://www.philipzucker.com/souffle-egg4/
           | https://arxiv.org/pdf/2004.03082.pdf
        
         | gopiandcode wrote:
         | Shameless plug, I also maintain an OCaml implementation of
         | egraphs (named ego) at https://github.com/verse-lab/ego
         | 
         | While the most popular implementation at the moment seems to be
         | egg in Rust, I find that OCaml serves as a much more ergonomic
         | environment for quickly prototyping out uses of egraphs in
         | practice. As a bonus, ego also shares the same logical
         | interface as egg itself, so once you've finalised your designs,
         | you shouldn't have much trouble porting them to egg if you need
         | the performance gains.
        
         | smasher164 wrote:
         | I wonder if you could apply e-graphs to situations where the
         | union-find data structure would be used, i.e. if there are any
         | additional benefits gotten from the congruence relation.
        
           | HelloNurse wrote:
           | More pointers and more processing in exchange for
           | representing equivalence classes of many, often infinitely
           | many graphs compactly and enumerating them efficiently.
           | Usually not relevant when equivalence of simple nodes is the
           | object.
        
       | throwaway45631 wrote:
       | Not obscure by itself, but gifted with obscurity due to its
       | language's success: JavaScript's Map type.
       | 
        | Because the type was introduced very late to the language, and
        | another very successful method is named identically, it's almost
        | impossible to google your way out of any situation.
       | 
       | You have to rely on core documentation and link lists. Just by
       | using "new Map()" in JS, you're suddenly _ehem_ mapped 30 years
       | back in time!
        
       | fulafel wrote:
       | This post:
       | http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
       | 
       | "Scalability! But at what COST?"
       | 
       | In it Frank McSherry uses a handful of data structure tricks to
       | make graph algorithms like connectivity and PageRank run
       | progressively faster on his laptop, beating distributed cluster
       | implementations. It's actually a version of their HotOS paper
       | with Isard and Murray.
       | 
       | Admittedly it's the opposite of obscure, being kind of a myth
       | busting piece about distributed processing being faster and how
       | even largish data sets can fit on a single node (and even your
       | laptop).
        
         | fulafel wrote:
          | Addendum.
         | 
         | Link to paper:
         | https://www.usenix.org/system/files/conference/hotos15/hotos...
         | 
         | COST acronym: "Configuration that Outperforms a Single Thread"
        
       | jll29 wrote:
       | _Associative memory?_ Store data in them and they will never
       | complain they are full, but they may  "forget" instead. Querying
       | them with slightly wrong keys, they'll auto-correct the keys as
       | long as they are not overloaded.
       | 
       |  _Binary fuse filters?_ (Xor|Bloom) filters on steroids.
       | 
       |  _Non-deterministic finite state transducers?_ Reversible, more
       | elegant and more powerful version of non-deterministic finite
       | state recognizers (transitions are labelled with two outputs,
       | which can be interpreted as inputs versus outputs, but these
       | roles can be swapped as they are symmetric).
       | 
       |  _Ropes?_ What, you are still using strings? Dude, please stop
       | writing that newbie editor.
       | 
       |  _Caches with Time aware least recently used (TLRU) or Segmented
       | LRU (SLRU) cache replacement strategies_ (Oh, you thought Least
       | recently used (LRU) was cool?)
        
       | jordanmorgan10 wrote:
       | So it's not snazzy, but I also wrote Objective-C a long time
       | without knowing about NSPointerArray.
       | 
       | Among other things, it can store nil values, which if you've
       | written any Cocoa apps, is an all too common way to crash your
       | software.
        
       | almog wrote:
        | Hierarchical timing wheels are an efficient data
        | structure/algorithm for managing timers (event scheduling) when:
        | 1. The timer durations vary widely. 2. Timers are likely to be
        | cancelled. 3. A fixed (configurable) precision is acceptable.
        | 
        | This talk provides a nice overview of different timing wheel
        | implementations, including hierarchical and hashed timing wheels.
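[Ed.: a single-level (hashed) wheel can be sketched in a few lines; a hierarchical wheel cascades several of these at coarser resolutions. Names here are invented for illustration:]

```python
class TimingWheel:
    """Toy hashed timing wheel: O(1) schedule and cancel; timers fire
    as the current-time cursor sweeps over their slot."""
    def __init__(self, slots=64):
        self.slots = [[] for _ in range(slots)]
        self.now = 0

    def schedule(self, ticks, callback):
        deadline = self.now + ticks
        entry = {"deadline": deadline, "fn": callback, "cancelled": False}
        self.slots[deadline % len(self.slots)].append(entry)
        return entry

    def cancel(self, entry):
        entry["cancelled"] = True   # O(1): lazily dropped when swept

    def advance(self):              # one clock tick
        self.now += 1
        slot = self.slots[self.now % len(self.slots)]
        due = [e for e in slot
               if e["deadline"] <= self.now and not e["cancelled"]]
        slot[:] = [e for e in slot if e["deadline"] > self.now]
        for e in due:
            e["fn"]()
```

The deadline check matters because a wheel slot can hold timers more than one revolution away.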
        
         | beyonddream wrote:
         | I think you had not posted the link to the talk!
        
         | almog wrote:
         | Too late to edit but I forgot to paste the link to the timing
         | wheels overview talk that I mentioned, so here it is:
         | 
         | https://www.youtube.com/watch?v=AftX7rqx-Uc
        
           | beyonddream wrote:
           | Nm, thanks:)
        
       | benou wrote:
       | Very good idea!
       | 
        | And I love bloom filters too (and their more modern successor,
        | the cuckoo filter), but I'd challenge the use case you mention:
       | 1 million IPv4 is 4MB, and 16MB for IPv6. That's tiny, you're
       | better off using some kind of hashtable, unless you have a fast
       | and small memory, and then a slow and big memory (say embedded
       | processor with small CPU cache and some DRAM).
       | 
       | Bloom filters are useful when your working set cannot fit in your
       | fast memory, to allow you to go to the slow memory only for a
       | small number of requests.
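[Ed.: for reference, the structure being discussed fits in a few lines. This sketch uses salted SHA-256 purely for illustration; real implementations pick cheaper hashes and size m and k from the target false-positive rate:]

```python
import hashlib

class BloomFilter:
    """m-bit filter with k hash functions: no false negatives,
    tunable false-positive rate."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Note that neither insert nor lookup touches the stored keys, which is why the cost depends only on k, not on the number of elements.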
        
         | IanCal wrote:
         | The other use is when the cost of checking the set is high due
         | to something like accessing a remote system. You can pull a
         | small representation of the full set and only do a network
         | request after when it's likely to succeed.
        
       | haroldl wrote:
       | In a Scrabble AI coding contest I discovered the GADDAG which is
       | like a trie but indexes all of the words forwards and backwards
       | from any starting point within the words.
       | 
       | https://en.wikipedia.org/wiki/GADDAG#:~:text=A%20GADDAG%20is....
        
       | unsupp0rted wrote:
       | If somebody made a Youtube channel going over the data structures
       | mentioned in this thread with clear examples of pros/cons, I'd
       | watch the heck out of it and support it on Patreon.
        
       | stevenjgarner wrote:
       | Crit-bit trees [0]. Also known as TRIES, with an interesting use
       | described in "TRASH - A dynamic LC-trie and hash data structure",
       | which combines a trie with a hash function [1].
       | 
       | Also ROPE, a string allowing for prepends, substrings, middle
       | insertions and appends [2]
       | 
       | [0] http://cr.yp.to/critbit.html
       | 
       | [1]
       | https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96....
       | 
       | [2] https://en.wikipedia.org/wiki/Rope_%28data_structure%29
        
       | bjoli wrote:
       | We all know about the immutable vectors of clojure (tries,
       | basically). I always preferred the RRB-tree which is the same
       | thing, but relaxes the leftwise dense requirement. This means you
       | have a bit-partitioned trie until you do a split, merge or
       | insert, after which you get a slightly slower operations, but
       | still very competitive.
       | 
       | It is actually a quite trivial change until you come to the merge
       | algorithm which is finicky in all the wrong ways. The benefits
       | are (in clojure terminology) O(1)ish splits, inserts and
       | concatenations instead of O(n).
       | 
       | I don't know if it can be called obscure anymore, since Scala's
       | vectors are RRB-trees and clojure actually has an implementation
       | of them, but I often see people talk about ropes in situations
       | where an RRB tree should be a no-brainer.
        
       | samsquire wrote:
       | I'm trying to create a first class citizen of loops and trying to
       | improve software responsiveness.
       | 
       | There's two ideas: concurrent loops and userspace scheduling and
       | preemption.
       | 
        | Nested loops can be independently scheduled. I wrote an M:N
       | userspace scheduler[2] and it does something simple. You can
       | interrupt a thread that is in a hot loop by changing its limit
       | variable to the end. You don't need to slow down a hot loop by
       | placing an if statement or checking if X number of statements
       | executed. (Many runtimes do this to implement scheduling and you
       | don't need to do it this way)
       | 
       | Golang schedules away from a goroutine by checking during stack
       | expansion if the thread has used its timeslice.
       | 
       | I think this simple observation is really powerful. The critical
       | insight is that frontend performance can be drastically improved
       | by reducing latency from the user's perspective. Did you ever
       | notice that the cancel button rarely cancels immediately? Or
       | resizing an image is slow and gives no feedback or encrypting
       | gives no feedback?
       | 
       | By having a watching thread interrogate the output buffer, we can
       | create GUIs that visually do things and provide observability to
       | what the computer is doing. Imagine watching a resize operation
       | occur in real time. Or encryption. Or network communication.
       | 
       | One of my ideas of concurrent loops is "cancellation trees"[1],
       | if you model every behaviour of a piece of software as a tree of
       | concurrent loops that are reentrant - as in they never block and
       | each call is a loop iteration of a different index, you can
       | create really responsive low latency software. If you cancel a
       | branch of the tree, you can cancel all the loops that are related
       | to that branch. It's a bit like a tree of processes from PID 0 or
       | loops as lightweight green threads/goroutines.
       | 
       | So as you're moving around in your text editor or browser, if you
       | invalidate the current action - such as press ESCAPE while
       | Intellisense is trying to find related code, you can interrupt
       | all the loops that fanned out from that GUI operation.
       | 
       | Telling the computer to stop doing what it is doing and
       | observability are two key weaknesses of modern computing. GUIs do
       | not always accurately communicate what the computer is doing and
       | you usually need to wait for the computer to finish before it
       | shows you what it is doing. I am sure the cancellation token
       | could be extended to follow this idea.
       | 
       | If you liked this comment, check out my profile to links to my
       | ideas documents.
       | 
       | 1:
       | https://github.com/samsquire/ideas4/blob/main/README.md#120-...
       | 
       | 2: https://github.com/samsquire/preemptible-thread
        
       | sgtnoodle wrote:
       | Disruptor queues are also fun. They are lock free multi-producer
       | multi-consumer circular buffers. I figured out the basics on my
       | own in a vacuum out of need, then discovered a relevant white
       | paper.
       | 
       | I used one to implement a fast shared memory pub-sub message bus
       | for communicating across dozens of processes in a simulation
       | framework.
        
         | xxs wrote:
          | I don't quite see how that's obscure; it's a standard circular
         | buffer. The fast/low latency communication part comes from the
         | busy wait (and L3 cache communication).
        
           | sgtnoodle wrote:
           | A standard circular buffer is only single-producer single-
           | consumer. The producer manipulates the head and the consumer
           | manipulates the tail.
           | 
           | Extending a circular buffer to allow multiple consumers is
           | relatively straight forward; you just give each consumer its
           | own tail and accept the loss of back pressure.
           | 
           | Extending it to allow multiple producers without introducing
           | locks is where the complexity shoots up drastically.
           | 
           | But yes, the reason why you would want something like this is
           | to have the semantics of message passing with the performance
           | of direct IPC between userspace processes without going
           | through the kernel.
        
             | xxs wrote:
             | >Extending it to allow multiple producers without
             | introducing locks is where the complexity shoots up
             | drastically.
             | 
              | You can have a single tail with an atomic add, and that's
              | pretty much it. The consumer needs to know if the data is
              | available, so there has to be a serialization point that
              | each producer has to wait on - effectively a locking
              | mechanism... or the consumer has to check each producer's
              | progress.
             | 
              | It doesn't feel harder. The disruptor's fame mostly came
              | from it not having to allocate new memory.
              | 
              | For example, writing a lock-free, capacity-bounded FIFO
              | requires a single long (64 bit) in each node, with the
              | producers reading the head first, then the tail. Same idea
              | - 64 bits are too many to overflow when incremented
              | one by one.
             | 
             | It has been awhile since the time I wrote lock free stuff
             | (occasional CAS and COW do not count). My current job
             | doesn't really need it.
        
               | sgtnoodle wrote:
               | > You can have a single tail with an atomic add, and
               | that's pretty much it.
               | 
               | That's the primary difference. In a disruptor queue,
               | there's two tails. The first one is used for producers to
               | allocate space in the buffer to write to, and the second
               | one is used to commit it so that it's available to
               | consumers.
               | 
               | It's true that there is a small amount of necessary
               | synchronization across producers, because a producer
               | can't commit its data before a concurrent producer
               | earlier in the buffer does, and can end up spinning on an
               | atomic compare-and-exchange. They can both independently
               | write to their allocated parts of the buffer freely
               | before that point, though, so in practice it's not a
               | contention point.
        
               | xxs wrote:
                | > there's two tails
                | 
                | My point was about the sync part w/ the producers: the
                | consumers won't be ready to read from a producer before
                | it has committed. However, another option would be NOT
                | using a contention point of a shared tail, but marking
                | each record as done - that needs a write/write memory
                | fence, so even if the consumers start reading, they can
                | bail out if the writing producer has not committed.
               | 
               | > the buffer freely before that point, though, so in
               | practice it's not a contention point.
               | 
                | Likely false sharing between the producers, unless the
                | impl. is careful enough to allocate cache-line-sized
                | records.
        
         | bmitc wrote:
         | Do you have a link to the white paper you found?
        
           | sgtnoodle wrote:
           | This doesn't look like I remember, but this sounds right.
           | 
           | https://lmax-
           | exchange.github.io/disruptor/disruptor.html#_de...
           | 
           | The basic idea is that you maintain two atomically
           | incremented counters. You increment one to allocate some
           | buffer space to write your message into, and you increment
           | the other once the message is written in. There's some tricky
           | details to get right when multiple producers have messages
           | staged up at the same time.
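[Ed.: a toy rendition of the two-counter idea described above (claim a slot, write, then commit in claim order). This is illustrative only - nothing like LMAX's cache-conscious, busy-waiting implementation - and the class name is invented:]

```python
import itertools
import threading

class ToyDisruptor:
    """Producers claim a slot with one counter, write their message,
    then commit in claim order with a second counter."""
    def __init__(self, size=1024):
        self.buf = [None] * size
        self.size = size
        self.claim = itertools.count()   # slot allocation (atomic under the GIL)
        self.commit = 0                  # slots visible to consumers so far
        self.cv = threading.Condition()

    def publish(self, item):
        seq = next(self.claim)           # 1) allocate a slot
        self.buf[seq % self.size] = item # 2) write outside any lock
        with self.cv:                    # 3) commit strictly in claim order
            while self.commit != seq:
                self.cv.wait()
            self.commit = seq + 1
            self.cv.notify_all()

    def read_all(self):
        return self.buf[:self.commit]    # committed prefix (no wrap handling)
```

The key property survives even in this sketch: concurrent producers only contend at the commit step, never while writing their slots.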
        
             | bmitc wrote:
             | Thank you!
        
       | cr__ wrote:
       | Huet's zipper.
       | https://en.wikipedia.org/wiki/Zipper_(data_structure).
       | 
       | One zipper value represents a regular tree of data, but from the
       | perspective of the "current position". There are traversal
       | operations that take a zipper value and return the same tree from
       | the perspective of an element above/beside/below the current
       | position, and there are operations that return the same structure
       | but with the current element changed, or items deleted or
       | inserted.
       | 
       | Huet's paper is an easy read if you've been exposed to OCaml or
       | similar languages: https://www.st.cs.uni-
       | saarland.de/edu/seminare/2005/advanced... . His "returned glove"
       | metaphor is what made it click for me.
       | 
       | Clojure includes an implementation of Huet's zippers
       | https://github.com/clojure/clojure/blob/master/src/clj/cloju...
       | that is <300 lines of code. It's very, very clever and broadly
       | useful for dealing with XML or other nested data structures.
        
       | hackarama wrote:
       | Quadruply linked lists
        
       | chriswarbo wrote:
       | The Zipper acts like a linked-list with a cursor, or "focused
       | element"; it's implemented as a pair of lists in opposite orders;
       | or, equivalently but more symmetric, as triple of (List[A], A,
       | List[A])
       | 
       | Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're
       | focusing on the 3. In code this will look like:
       | ([2, 1, 0], 3, [4, 5])
       | 
       | Where [a, b, c] denotes a singly-linked list, with O(1) head
       | (returning a) and tail (returning [b, c]). Notice that the first
       | list is in reverse order.
       | 
       | To focus on the next element, we put the currently focused
       | element on the first list, and pull the head off the second list:
       | ([3, 2, 1, 0], 4, [5])
       | 
       | And vice versa to focus on the previous element:
       | ([1, 0], 2, [3, 4, 5])
       | 
       | I like to imagine this as a string of beads, with the focused
       | element held in our fingers and the rest hanging down:
       |               3
       |              / \
       |             2   4
       |             |   |
       |             1   5
       |             |
       |             0
       | 
       | To move the focus forwards and backwards, we move our grip to the
       | next/previous bead.
       | 
       | This works nicely as an immutable datastructure, since the tails
       | of both lists can be shared (i.e. we don't need to copy the whole
       | list, just the wrap/unwrap an element on to the head)
       | 
       | Zippers were first applied to lists, but have since been
       | generalised:
       | 
       | - To trees: focusing on one node, and being able to move the
       | focus up, to the left-child, or to the right child
       | 
       | - To having more than one focus
       | 
       | - To datastructures more generally, by treating them as
       | 'derivatives' (as in calculus)
       | 
       | https://en.wikipedia.org/wiki/Zipper_(data_structure)
       | 
       | As an example, the XMonad window manager uses a zipper to keep
       | track of the currently-focused window.
       | 
       | I've also used Zippers for window management, e.g. having a list
       | of Displays, with one of them focused (accepting keyboard input);
       | each Display containing a list of Workspaces, with one of them
       | focused (currently shown); and each Workspace containing a list
       | of Windows, with one of them focused (the active window).
       | Keyboard shortcuts could shift between the previous/next Window,
       | Workspace, or Display.
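[Ed.: the operations described above, sketched in Python. Plain lists stand in for the singly-linked lists, so the O(1) sharing argument doesn't carry over - this only shows the mechanics:]

```python
def zipper(xs):
    """Focus on the first element: (reversed prefix, focus, suffix)."""
    head, *tail = xs
    return ([], head, tail)

def right(z):
    before, focus, after = z
    head, *tail = after
    return ([focus] + before, head, tail)   # focused element joins the prefix

def left(z):
    before, focus, after = z
    head, *tail = before
    return (tail, head, [focus] + after)

def replace(z, x):
    before, _, after = z
    return (before, x, after)               # O(1) change at the focus
```

Moving right three times from `zipper([0, 1, 2, 3, 4, 5])` yields exactly the `([2, 1, 0], 3, [4, 5])` shape used in the comment.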
        
         | mastermedo wrote:
         | I don't understand why wouldn't you just use a list and an
         | index. You can always access list[index+1] or list[index-1] or
         | list[0] or list[list.length-1].
         | 
         | What is the benefit here?
        
           | Skeime wrote:
           | The benefit is the data sharing between the tails. You can
           | very cheaply get a copy of zipper with the data at (or
           | around) the focused element changed while also keeping the
           | original data unchanged. Admittedly, this is something people
           | in functional programming languages probably care much more
           | about.
        
           | chriswarbo wrote:
           | Firstly, accessing an arbitrary list index is O(N), whilst a
           | Zipper can access its focus in O(1) (shifting the focus is
           | O(1) for both Zippers and list+index)
           | 
           | Secondly, using an index forces us to do bounds checks, keep
           | track of the length (or re-calculate it at O(N) cost), etc.
           | whereas a Zipper is "correct by construction"; i.e. every
           | value of type (List[A], A, List[A]) makes sense as a zipper;
           | whereas many values of type (List[A], Nat) don't make sense
           | as zippers; e.g. ([], 42) doesn't make sense. Our logic also
           | becomes easy for pattern-matching, e.g. in Scala:
           |             myZipper match {
           |               case Zipper(xs, x, y::ys) => Zipper(x::xs, y, ys)
           |               case z => z
           |             }
           | 
           | Also, your example of 'list[index-1]' is more complicated
           | than it first appears. In particular, what is the type of
           | 'index'? If it's an integer, then at least half of the
           | possible values are meaningless (negative numbers); if its a
           | natural (AKA unsigned integer), then it's not closed under
           | subtraction (i.e. we have to worry about underflow)!
           | 
           | Thirdly, Zippers can be unbounded in both directions (i.e.
           | the lists can be "infinite"/co-inductive). For example,
           | they're great for implementing Turing machines, starting with
           | an infinite blank tape in both directions.
           | https://en.wikipedia.org/wiki/Coinduction
           | 
           | Fourthly, Zippers have more sharing. Here's an arbitrary
           | example: say we have a zipper like this:
           | ([c, b, a, ...], elem, [x, y, z, ...])
           | 
           | If we shift right, replace the focused element with
           | 'newElem', then shift right again, we'll get this zipper:
           | ([newElem, elem, c, b, a, ...], y, [z, ...])
           | 
           | Those operations take O(1) time and O(1) memory (and produce
           | O(1) garbage, if we're no longer using the old zipper). In
           | particular, since the tails of the lists ([c, b, a, ...] and
           | [z, ...]) are unchanged, the same pointers will appear in
           | both the old and new zippers; there has been no copying. This
           | is important, since those lists could be arbitrarily long (or
           | infinite!).
           | 
           | In contrast, if we did these operations on a list+index,
           | everything _before_ the replaced element would need to be
           | reallocated; i.e. [..., a, b, c, elem, _] would need to be
           | copied, since _ has changed from  'y, z, ...' to 'newElem, z,
           | ...'. That requires O(N) memory allocation, takes O(N) time
           | to do the copying (and produces O(N) garbage, if we don't
           | want the old zipper).
        
             | funcDropShadow wrote:
             | And they can be generalized to work on any kind of trees
             | instead of just lists.
        
             | mathgenius wrote:
             | > accessing an arbitrary list index is O(N)
             | 
             | I think you mean linked-list here. Parent is talking about
             | "arrays".
        
               | chriswarbo wrote:
               | > I think you mean linked-list here
               | 
               | Yes, I specified this explicitly, e.g. "The Zipper acts
               | like a linked-list with a cursor" and "Where [a, b, c]
               | denotes a singly-linked list".
               | 
               | > Parent is talking about "arrays"
               | 
               | You're using quotation marks, but I don't see what you're
               | quoting? The parent explicitly says: "a list and an
               | index"
               | 
               | In any case, arrays come with most of the same problems I
               | mentioned for lists:
               | 
               | - They have invalid states, e.g. ([], 42)
               | 
               | - They require bounds-checking, a consistent notion of
               | subtraction on the index type, etc.
               | 
               | - They can't be infinite
               | 
               | - They require O(N) time, O(N) memory (and O(N) garbage
               | if we're discarding the old value) to replace the focused
               | element
               | 
               | - etc.
               | 
               | Plus, arrays come with all sorts of extra gotchas, like
               | buffer-overflows, pre-allocation/re-allocation decisions,
               | over/under-provisioning, etc.
        
           | zachrose wrote:
           | A list and an index creates the possibility that the index is
           | out of bounds, and then your code probably has to check for
           | that scenario and figure out how to handle it.
           | 
           | With a zipper such a scenario isn't possible.
        
           | mathgenius wrote:
            | This zipper stuff is for immutable data structures... For us
            | normal folk that just use & abuse arrays we don't need
            | zippers. :-) Just kidding. I am dying to find a use for
            | zippers though, I'm not a functional programmer.
        
         | nemo1618 wrote:
         | I encountered this when I tried implementing a Brainfuck
         | interpreter in Haskell. Representing the memory array using a
         | List + index would be way too slow; but a Zipper is perfect!
        
       | dwaite wrote:
       | Splay trees:
       | 
       | These are binary search trees, but when you 'splay' the tree
       | (such as by search) it rebalances the tree so that the search
        | result is on top. This means that, while operations are
        | amortized O(log n) as in other self-balancing trees, it
        | optimizes tree depth in the face of non-random access so that
        | recently accessed items are fewer steps into the tree.
       | 
       | Piece tables:
       | 
       | A bit more common for text editors, where you need to represent a
       | sequence of characters in a form that allows efficient memory use
       | and fast insertions/removals (so editing the first line isn't
       | moving the 10MB of data that follows it). You create a series of
       | references to spans (or pieces) of the document, possibly
       | starting with a single span pointing to a mmap() version. Edits
       | are done by appending/prepending pieces, which are potentially
       | just references to subsequences of items created in fresh memory,
       | appended into a buffer. Saving can create a single sequence (and
       | a single span).
       | 
       | This has interesting variations:
       | 
       | - Put attributes on the pieces for formatting, such as indicating
       | text should be rendered a different color or bolded.
       | 
       | - Create a hierarchy of pieces-of-pieces. With formatting
       | attributes you are then dangerously close to a DOM.
       | 
       | - Retain old copies of a piece table - since your original mmap()
       | file hasn't changed and your changes are in an append-only
       | buffer, those piece table copies provide undo/redo state.
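A toy sketch of the piece-table scheme described above (hypothetical API; a real editor would keep the pieces in a tree and point the original buffer at an mmap'd file rather than a Python string):

```python
# Piece table sketch: the original text stays read-only, edits append
# to an add-buffer, and the document is a list of (buffer, start,
# length) pieces. Inserting splits a piece; it never moves existing text.
class PieceTable:
    def __init__(self, text):
        self.original = text
        self.add = ""
        self.pieces = [("orig", 0, len(text))] if text else []

    def _buf(self, which):
        return self.original if which == "orig" else self.add

    def insert(self, pos, text):
        new_piece = ("add", len(self.add), len(text))
        self.add += text                      # append-only edit buffer
        out, offset, placed = [], 0, False
        for which, start, length in self.pieces:
            if not placed and offset <= pos <= offset + length:
                cut = pos - offset            # split the piece at pos
                if cut > 0:
                    out.append((which, start, cut))
                out.append(new_piece)
                if length - cut > 0:
                    out.append((which, start + cut, length - cut))
                placed = True
            else:
                out.append((which, start, length))
            offset += length
        if not placed:                        # insert at/after the end
            out.append(new_piece)
        self.pieces = out

    def text(self):
        """Materialize the document (what 'saving' would write out)."""
        return "".join(self._buf(w)[s:s + n] for w, s, n in self.pieces)
```

Since `original` and `add` are never mutated in place, snapshotting `self.pieces` at any point gives the undo/redo states mentioned above for free.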
        
         | bob1029 wrote:
         | Came here for splay tree. I've found some really powerful
         | applications for this in low level database engine work.
         | 
         | Being able to incrementally rebalance a tree and keep the set
         | of deltas small each time is really powerful when you are
         | dealing with append only IO abstractions.
        
       | scottcodie wrote:
       | Reservoir sampling is a statistical technique used to randomly
       | select a finite number of elements from a population. The
       | elements are chosen such that each element has an equal
       | probability of being selected. This technique is often used when
       | it is impractical to select a random sample of elements from a
       | very large population.
       | 
       | To do reservoir sampling, you first need to decide how many items
       | you want in your sample. This number is called the size of the
       | reservoir. Once you have the size of the reservoir, you can fill
       | it by selecting items from the population at random.
       | 
        | The code is beautifully simple (assuming the reservoir already
        | holds the first k items of the population):
        | 
        |     for i = k to population.length - 1
        |         j = random integer between 0 and i, inclusive
        |         if j < k
        |             reservoir[j] = population[i]
        |     return reservoir
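A runnable version of the sketch above (names are illustrative), including the fill phase for the first k items:

```python
import random

def reservoir_sample(population, k, rng=random):
    """Uniform random sample of up to k items from an iterable of
    unknown length, using O(k) memory and a single pass."""
    reservoir = []
    for i, item in enumerate(population):
        if i < k:
            reservoir.append(item)        # fill phase: keep first k items
        else:
            j = rng.randint(0, i)         # uniform over 0..i inclusive
            if j < k:                     # happens with probability k/(i+1)
                reservoir[j] = item
    return reservoir
```

The invariant is that after processing item i, every item seen so far is in the reservoir with probability k/(i+1).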
        
         | bhawks wrote:
          | Reservoir sampling is really cool - a slightly optimized
          | version (Algorithm L) lets you skip over every record that
          | will not be sampled and it is still pretty simple. If your
          | records are fixed size this can be an awesome speedup.
         | 
          | (* S has items to sample, R will contain the result *)
          | 
          |     ReservoirSample(S[1..n], R[1..k])
          |         // fill the reservoir array
          |         for i = 1 to k
          |             R[i] := S[i]
          |         (* random() generates a uniform (0,1) random number *)
          |         W := exp(log(random())/k)
          |         while i <= n
          |             i := i + floor(log(random())/log(1-W)) + 1
          |             if i <= n
          |                 (* replace a random item of the reservoir with item i *)
          |                 R[randomInteger(1,k)] := S[i]  // random index between 1 and k, inclusive
          |                 W := W * exp(log(random())/k)
          | 
          | https://en.m.wikipedia.org/wiki/Reservoir_sampling
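The same pseudocode, transcribed to runnable (0-indexed) Python; names are illustrative:

```python
import math
import random

def reservoir_sample_L(seq, k, rng=random):
    """Algorithm L: like plain reservoir sampling, but jumps over the
    geometrically distributed runs of items that would be rejected,
    so the number of rng calls is O(k log(n/k)) instead of O(n)."""
    reservoir = list(seq[:k])
    if len(reservoir) < k:            # fewer than k items: keep them all
        return reservoir
    w = math.exp(math.log(rng.random()) / k)
    i = k - 1
    n = len(seq)
    while True:
        # skip ahead past items that would not enter the reservoir
        i += math.floor(math.log(rng.random()) / math.log(1 - w)) + 1
        if i >= n:
            break
        reservoir[rng.randrange(k)] = seq[i]   # evict a random slot
        w *= math.exp(math.log(rng.random()) / k)
    return reservoir
```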
        
           | lorenzhs wrote:
           | Indeed! Funny enough, I rewrote that article a few years ago,
           | it previously contained an approximation of something like
           | Algorithm L that some person with a blog came up with, having
           | no idea that an even simpler and provably correct algorithm
           | was published as early as 1994 :) Though others have improved
           | the article a lot since then, adding explanations for how/why
           | it works. Couldn't be happier to see it cited in this list!
        
         | benou wrote:
         | I was going to mention it too! The really cool thing about
          | reservoir sampling is that it can be done "online" (i.e. process
         | input incrementally) which makes it super useful when you want
         | to compute statistical properties of something in the field
         | without blowing up your cpu and memory.
         | 
         | For example, let's say I have a server serving queries. I want
         | to measure min/max/avg/stdev/99p you name it. You can do it
         | cheaply with reservoir sampling, without having to save all
         | data points.
        
       | nonbirithm wrote:
       | Colonies/hives, which have fast insertion/erasure/iteration and
       | preserve pointers on insertion/erasure. I remember that
       | Cataclysm: DDA used this data structure for keeping track of the
       | items on each map tile.
       | 
       | https://plflib.org/colony.htm
        
       | Joel_Mckay wrote:
       | First contact with Functors, multidimensional Kalman filtering
       | abstractions, and Golomb ruler optimized DFT... kind of like
       | meeting the Wizard of Oz, and finding out it was a chicken all
       | along. ;)
        
       | anonymoushn wrote:
       | Robin Hood tables are a common-ish hash map implementation that
       | often shows up on benchmarks for fast inserts. Depending on the
       | implementation, a Robin Hood table is also a sorted sparse array!
       | This could be useful for data that needs random access and
       | sequential access to a subset of the sorted data, like an order
       | book.
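A minimal sketch of the Robin Hood insertion rule (hypothetical class, open addressing with linear probing, no resizing so it must not be overfilled): on a collision, whichever entry is further from its home slot keeps the slot.

```python
class RobinHoodMap:
    def __init__(self, cap=16):
        # each slot is None or a (key, value, probe_distance) tuple
        self.slots = [None] * cap

    def _home(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        i = self._home(key)
        entry = (key, value, 0)
        while True:
            slot = self.slots[i]
            if slot is None:
                self.slots[i] = entry
                return
            if slot[0] == entry[0]:            # overwrite existing key
                self.slots[i] = (entry[0], entry[1], slot[2])
                return
            if slot[2] < entry[2]:             # resident is "richer":
                self.slots[i], entry = entry, slot   # steal its slot
            i = (i + 1) % len(self.slots)
            entry = (entry[0], entry[1], entry[2] + 1)

    def get(self, key):
        i = self._home(key)
        d = 0
        while True:
            slot = self.slots[i]
            # a poorer resident than our distance means key is absent
            if slot is None or slot[2] < d:
                raise KeyError(key)
            if slot[0] == key:
                return slot[1]
            i = (i + 1) % len(self.slots)
            d += 1
```

The swap keeps probe distances tightly clustered, which is what makes lookups (and failed lookups especially) fast.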
        
       | abetusk wrote:
        | Succinct Data Structures [0] [1]. The term encompasses many
        | different underlying data structure types, but the overarching
        | idea is that you want small data size while still keeping "big
        | O" run time.
       | 
       | In other words, data structures that effectively reach a
       | 'practical' entropy lower bound while still keeping asymptotic
       | run time.
       | 
       | [0] https://en.wikipedia.org/wiki/Succinct_data_structure
       | 
       | [1] https://github.com/simongog/sdsl-lite
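As a tiny taste of the flavor, here is a sketch of rank support on a bit vector (illustrative only: real succinct structures use multiple sample levels plus table lookups to get O(1) rank in o(n) extra bits):

```python
class RankBitVector:
    """Bit vector with sampled 1-bit counts every BLOCK positions,
    so rank queries only scan within one block."""
    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits
        self.samples = [0]          # samples[j] = popcount of bits[:j*BLOCK]
        count = 0
        for i, b in enumerate(bits):
            count += b
            if (i + 1) % self.BLOCK == 0:
                self.samples.append(count)

    def rank1(self, i):
        """Number of 1 bits in bits[:i]."""
        block = i // self.BLOCK
        return self.samples[block] + sum(self.bits[block * self.BLOCK:i])
```

Rank (and its inverse, select) are the building blocks most succinct structures, like wavelet trees and FM-indexes, are assembled from.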
        
       | EnderShadow8 wrote:
       | Treap: https://en.wikipedia.org/wiki/Treap
       | 
       | It's like a Red-Black tree in use case, but much faster to
       | implement, which is good for competitive programming. The average
       | case complexity is the same for all operations, but there's an
       | unlikely degeneration to worst-case linked-list behaviour.
       | 
        | Lazy Propagation Segment Tree:
        | https://cp-algorithms.com/data_structures/segment_tree.html
       | 
       | Like a segment tree in that it supports range queries in
       | logarithmic time, but supports range updates in log time as well.
       | 
       | I know a few more fun ones that I occasionally use in contests,
       | but I've never touched otherwise.
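A treap sketch built from split/merge (hypothetical names): keys obey BST order while random priorities obey heap order, and the same two primitives also make set operations like union easy to layer on top.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()   # random priority keeps the tree balanced
        self.left = self.right = None

def split(t, key):
    """Split treap t into (keys < key, keys >= key)."""
    if t is None:
        return None, None
    if t.key < key:
        t.right, hi = split(t.right, key)
        return t, hi
    lo, t.left = split(t.left, key)
    return lo, t

def merge(a, b):
    """Merge treaps a and b, where every key in a < every key in b."""
    if a is None:
        return b
    if b is None:
        return a
    if a.prio > b.prio:               # higher priority becomes the root
        a.right = merge(a.right, b)
        return a
    b.left = merge(a, b.left)
    return b

def insert(t, key):
    lo, hi = split(t, key)
    return merge(merge(lo, Node(key)), hi)

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)
```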
        
         | _benedict wrote:
         | The nicest thing about treaps is how easy
         | union/intersection/disjunction are.
        
         | cassepipe wrote:
         | Speaking of ease of implementation, I just discovered AA trees.
          | They are probably not "obscure" but I think they are worthy of
          | more fame because they perform just as well as red-black trees
          | and are easier to implement. Finding clear, comprehensive
          | documentation about them was not easy though, so here is the
          | best I could find:
          | https://www.cs.umd.edu/class/fall2019/cmsc420-0201/Lects/lec...
          | 
          | And the (unhelpful to me) original paper:
          | https://user.it.uu.se/~arnea/ps/simp.pdf
        
           | kgr wrote:
           | The Wikipedia article on AA-Trees is good:
           | https://en.wikipedia.org/wiki/AA_tree
        
       | reality_inspctr wrote:
       | Resource Description Framework (RDF)
       | 
        | The Resource Description Framework (RDF) is the W3C standard for
        | web data. It allows data to be linked, anywhere, and to be
        | "grounded" in semantic descriptions of the data. The core RDF
        | data model is very simple: it is a set of triples organized into
        | an RDF graph.
       | 
       | https://www.w3.org/egov/wiki/Resource_Description_Format
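The triples-in-a-graph model is small enough to sketch directly (hypothetical mini triple store; real RDF stores index the triples and use IRIs rather than bare strings):

```python
# An RDF graph as a set of (subject, predicate, object) triples,
# queried by pattern matching with None as a wildcard.
graph = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "age", "42"),
}

def match(graph, s=None, p=None, o=None):
    """Return all triples matching the given pattern."""
    return {t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}
```

SPARQL queries are essentially conjunctions of such patterns with shared variables.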
        
       ___________________________________________________________________
       (page generated 2022-07-22 23:00 UTC)