[HN Gopher] Roaring bitmaps: what they are and how they work
___________________________________________________________________

Roaring bitmaps: what they are and how they work

Author : voberoi
Score  : 122 points
Date   : 2022-09-02 15:35 UTC (1 day ago)

(HTM) web link (vikramoberoi.com)
(TXT) w3m dump (vikramoberoi.com)

| celeritascelery wrote:
| I see this common theme among very fast practical algorithms
| (like timsort or pdqsort) where there is no secret math or
| algorithm; rather, they are just a bunch of special cases based
| on heuristics. Often they involve specific knowledge of the
| hardware instead of treating software as abstract. To me this is
| the big difference between "computer science" and "computer
| engineering".
|
| coldtea wrote:
| > _I see this common theme among very fast practical algorithms
| (like timsort or pdqsort) where there is no secret math or
| algorithm; rather, they are just a bunch of special cases based
| on heuristics_
|
| The core here (bitmap storage and the basic optimization) is
| simple, but solid and general. Mathematically, it takes less
| space to store data as positions in a bitmap than as fully
| spelled-out associations, and it takes even less space to
| segregate the bitmaps into chunks.
|
| This will hold, and be smaller and faster, on any computer. So
| it's not some special case based on heuristics, whether related
| to the specific frequencies or sample characteristics of some
| particular set of data, or to specific CPU peculiarities or
| whatever.
|
| More exotic optimizations are piled on top, sure. But
| "compressed bitmaps" themselves as a concept are not.
|
| voberoi wrote:
| I had the same takeaway when I read these papers. It felt like a
| game of whack-a-mole.
|
| There's a massive performance benefit to doing this at the cost
| of implementation complexity.
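coldtea's space argument (store values as positions in a bitmap, and segregate the bitmaps into chunks keyed by the high bits) can be sketched concretely. A minimal Python sketch of a hypothetical two-level chunked bitmap; this is illustrative only, not the RoaringBitmap library's actual layout or API:

```python
# Sketch of the "segregate the bitmap into chunks" idea (hypothetical,
# not the real Roaring implementation): split each 32-bit value into a
# 16-bit chunk key (high bits) and a 16-bit offset (low bits), and only
# allocate a 2^16-bit chunk for keys that actually occur.

def chunked_bitmap(values):
    """Map chunk key -> 8 KiB bytearray bitmap of the low 16 bits."""
    chunks = {}
    for v in values:
        key, low = v >> 16, v & 0xFFFF
        bm = chunks.setdefault(key, bytearray(2**16 // 8))
        bm[low >> 3] |= 1 << (low & 7)
    return chunks

def contains(chunks, v):
    bm = chunks.get(v >> 16)
    return bool(bm and bm[(v & 0xFFFF) >> 3] & (1 << (v & 7)))

# A sparse set spanning the full 32-bit range touches few chunks, so
# storage is ~8 KiB per occupied chunk instead of 512 MiB for one flat
# bitmap over all 2^32 positions.
chunks = chunked_bitmap([7, 42, 2**31, 2**31 + 1])
assert contains(chunks, 42) and not contains(chunks, 43)
assert len(chunks) == 2   # only two 64 KiB ranges are occupied
```

Roaring then goes further than this sketch by picking a cheaper representation (sorted array, bitmap, or runs) per chunk, which is what the rest of the thread discusses.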
| I haven't studied the implementations or tried my hand at one,
| but I get the impression that these are tough to implement
| correctly in a way that takes full advantage of the hardware.
|
| (In that sense, it's awesome that the researchers also did the
| legwork to implement and maintain a library!)
|
| jandrewrogers wrote:
| Much of this is just threshold effects for algorithm
| performance, not special cases per se.
|
| You don't even need specific knowledge of the hardware as long
| as you can identify cross-over points where one algorithm starts
| significantly outperforming others. An old HPC trick is to write
| software that thoroughly measures several algorithm strategies
| on your specific hardware environment and then code-gens an
| algorithm with an optimal set of strategies and
| strategy-switching thresholds. The meta-algorithm is mostly
| focused on cheaply detecting these thresholds at runtime.
|
| fuckstick wrote:
| No, this is still comp sci. Comp sci doesn't mean that practical
| considerations in algorithm design are ignored. Computer
| engineering has to do with the actual design and construction of
| computing machinery. It's an accredited engineering field, and
| the undergraduate coursework includes electronics,
| semiconductors, computer architecture, and generally a lot of
| overlap with electrical engineering. I had to take one dumb
| programming course; all the data structures and algorithms stuff
| I had to learn elsewhere.
|
| jerven wrote:
| I love roaringbitmaps and maybe even overuse this library. On
| the other hand, it made a lot of things possible for
| legacy.uniprot.org over the years that would have been
| unaffordable otherwise. Quick faceting using permanent filters
| over 500+ million Documents would have been really hard
| otherwise.
|
| voberoi wrote:
| I'm really curious about this. Are you able to share a bit about
| your use case and how you use roaring bitmaps outright (vs.
| through search infra like Solr or ElasticSearch)?
|
| pvillano wrote:
| I thought the trick was going to be indirection.
|
| sorted-list/bitmap/runs, in a two-level tree. cool.
|
| it's technically missing a sorted complement-list, i.e. only the
| items that are missing, although worst-case runs only use twice
| as much space, so more complexity without much savings, esp.
| considering real workloads
|
| performs better with sequential ids than random uuids because
| you use fewer pages
|
| further research:
|
| investigating the effect of adding more layers to the tree
|
| using additional set representations, e.g. balanced tree
|
| nattaylor wrote:
| I enjoyed this walkthrough. I'm interested in RB because it's
| used in Lucene but never dug in, and I assumed (without much
| thought) that it had to do with consecutive runs of 1s and 0s.
| Wrong!
|
| LouisSayers wrote:
| Compressing consecutive 1's and 0's was the first idea that
| popped into my head as well. Then when they started to talk
| about inserting into the 800,000th position, my brain went "What
| if they just reversed the array and stored some metadata as the
| type of array they're dealing with?"
|
| It's funny that in the end roaring bitmaps are essentially a
| modified data structure and process.
|
| The way things are going, I imagine someone will at some point
| take all the different tricks we have for inserting, sorting,
| finding, and deleting data, take millions of datasets, and run
| some machine learning type process to create libraries that can
| perform these operations optimally.
|
| nattaylor wrote:
| I was thinking a similar thought: that 4096 seemed quite magic
| (although it seems to be chosen based on research) and that RB
| perf could probably be tuned based on the workload.
|
| nicoburns wrote:
| I don't think 4096 is arbitrary. It's an array of 16-bit
| integers, and 4096 * 16 = 65536. So 4096 represents the boundary
| point below which an array of integers uses less memory than a
| traditional bitmap.
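nicoburns' arithmetic checks out as a back-of-the-envelope calculation. A short Python sketch of the space crossover (my own illustration, not library code):

```python
# Space crossover behind the 4096 threshold: within one 2^16-value
# chunk, a sorted array of 16-bit integers costs 2 bytes per element,
# while a full bitmap over the chunk costs a fixed 2^16 bits = 8 KiB,
# regardless of how many elements are present.

BITMAP_BYTES = 2**16 // 8          # 8192 bytes, independent of n

def array_bytes(n):
    return 2 * n                   # n 16-bit integers

assert array_bytes(4095) < BITMAP_BYTES    # sparse: the array wins
assert array_bytes(4096) == BITMAP_BYTES   # crossover: 4096 * 16 = 65536 bits
assert array_bytes(5000) > BITMAP_BYTES    # dense: the bitmap wins
```

So below 4096 elements the array representation is strictly smaller, and at or above it the bitmap is never larger, which is why the boundary is not arbitrary.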
| nattaylor wrote:
| Oops, thanks for pointing that out.
|
| TacticalCoder wrote:
| > When a set is sparse, traditional bitmaps compress poorly.
|
| They waste space. But if you were to compress them, they'd
| compress _very_ well. As in, for example, if you were to
| compress them using roaring bitmaps.
|
| nullc wrote:
| Those who do not know judy1 are doomed to reinvent it.
|
| http://judy.sourceforge.net/
|
| latchkey wrote:
| http://nothings.org/computer/judy/
|
| froh wrote:
| which possibly is a good thing
|
| the Judy project seems inactive to me:
|
| https://sourceforge.net/p/judy/bugs/
|
| this short HN thread resonates with the intuition of judy1
| being stalled; its last optimistic comment points to an
| orphaned comment in the big list above
|
| https://news.ycombinator.com/item?id=32188204
|
| Xeoncross wrote:
| Can someone explain how this is laid out in memory and what it
| would look like serialized? It's easy to have an array of N size
| that fits all bits and just write it to disk. You can also mmap
| a file as a large array of bytes and just mutate it, letting the
| OS handle the disk syncs, if you're okay with losing some recent
| state changes if the machine crashes.
|
| What would an array of pointers to X containers, each of size N,
| look like on disk?
|
| extesy wrote:
| This article has a section titled "how Roaring bitmaps are
| represented in memory". I suggest you actually read the article;
| it will likely answer all your questions.
|
| pvillano wrote:
| I'm reading this on a runway with terrible internet and the
| figures slowly load to reveal... text :/
|
| superjan wrote:
| It looks a bit too simplistic: blocks with up to 4k items are
| stored as sorted lists. Having to insert into a sorted list of
| up to 4k items seems very expensive to me.
|
| jasonwatkinspdx wrote:
| The threshold was chosen empirically. Modern processors are
| _very_ fast when operating on memory sequentially, as the
| prefetcher achieves the best-case scenario.
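The sorted-list insert superjan worries about can be sketched to show why it is an ordinary sequential operation. A hypothetical array-container insert in Python (illustrative; not CRoaring's code, and the function name is my own):

```python
# Sketch of an array-container insert: find the slot with binary
# search, then shift the tail. The shift is one contiguous memmove
# over at most ~8 KiB of 16-bit values, which is exactly the
# sequential access pattern hardware prefetchers handle best.
import bisect
from array import array

def container_insert(arr, low16):
    """Insert low16 into a sorted array('H') container if absent."""
    i = bisect.bisect_left(arr, low16)
    if i == len(arr) or arr[i] != low16:
        arr.insert(i, low16)   # O(n) shift, but over contiguous memory

c = array('H', [3, 9, 27])
container_insert(c, 10)
container_insert(c, 9)         # duplicate: container is unchanged
assert list(c) == [3, 9, 10, 27]
```

The asymptotic cost is O(n) per insert, but as the surrounding comments note, for n up to a few thousand the contiguous shift tends to beat pointer-chasing structures on real hardware.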
| The cost of a cache line miss is so substantial that this ends
| up dominating things until "n" is in the 1000s. A lot of
| programmers' intuition is off the mark on this point.
|
| jandrewrogers wrote:
| CPUs have a strong performance bias toward sequential memory
| access, and there are large threshold effects at work here. The
| block size used is not arbitrary. Improvements in prefetching
| and cache line utilization can have such large performance
| benefits that they more than justify any apparent increase in
| computational cost from how the code is organized to obtain
| those improvements.
|
| Most developers do not have a good intuition for the efficiency
| and scalability of sequential brute force on modern
| architectures. There is a cross-over threshold, but I think many
| people would be surprised at how high it is in practice.
|
| foota wrote:
| The 4k items are only 512 bytes though, so expensive yes, but
| not overly so.
|
| skitter wrote:
| They are two bytes per item; it's the bitmaps (>4k entries)
| that use 1 bit per item.
|
| Veserv wrote:
| No. The person you are replying to is talking about the sparse
| leaf case, where the contained integers are actually being
| stored (to be more precise, the low bits are being stored) in a
| sorted structure. The switch to the bitmap occurs when the
| sparse structure takes the same number of bits as directly
| encoding the sparsely contained items. In this case they use a
| 2^16 bitmap and encode sparse elements using 16 bits, so it
| occurs at ((2^16 / 16) == 2^12) elements.
___________________________________________________________________
(page generated 2022-09-03 23:00 UTC)