[HN Gopher] Show HN: HyperLogLog in Zig ___________________________________________________________________ Show HN: HyperLogLog in Zig Author : seiflotfy Score : 112 points Date : 2023-01-09 16:41 UTC (6 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | eatonphil wrote: | Awesome! Does Axiom use Zig? Most of your projects seem to be in | Go. Why did you choose to do this in Zig? | henry_viii wrote: | I'm also very curious to know this. HyperLogLog is written in | Go: | | https://github.com/axiomhq/hyperloglog | | I would expect V to be a more natural choice for a port than | Zig. | shilldetector wrote: | As a notice to others reading this user's comments, they are | a recently made account that seems to exist mostly to post | about V and post negatively about Zig. Extremely suspect | given the history of poor behavior from the V project. | seiflotfy wrote: | Author here, will publish my vlang take on it in the next | couple of days. | henry_viii wrote: | Looking forward to it :). | | It seems the V port should be pretty straightforward (the | repo is 880 LOC and doesn't have any dependencies). | extasia wrote: | Looks cool. What are some applications of a cardinality | estimator? | zX41ZdbW wrote: | Memory bound alternative for the COUNT(DISTINCT ...) aggregate | function in SQL. | | ClickHouse[1] has a few functions for that, with different | tradeoffs in memory usage, precision, and performance: | - uniqExact - the same as COUNT(DISTINCT ...) - the exact | version, using a hash table; - uniqCombined - a | combination of a linear array of small size in memory arena, a | hash table, and a HyperLogLog - this data structure is similar | to HyperLogLog++; the log2 of the number of cells is | controllable by the user; - uniq - a cardinality | estimator based on adaptive sampling by hash, lower quality to | memory usage compared to HyperLogLog, but beats it in CPU | efficiency; - uniqTheta - uses the Theta sketch | algorithm, whatever it means; - uniqUpTo(N) - my | favorite, uses a linear array to give the precise answer up to | N, and always answers N + 1 when the number of distinct | elements is larger than N; - groupBitmap - an exact | calculation using Roaring Bitmaps, makes sense for dense | integer sets; | | [1] https://github.com/ClickHouse/ClickHouse/ | | What is often forgotten in designing a data structure for a | cardinality estimator - is that it should work well not only | for a few large states but also for a large number of small | sets. | | For example, in a query like follows: SELECT | URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL | | You should expect that there are a large number of URLs, but | most of them will get just 1..2 unique UserIDs. Obviously, it's | not practical to allocate even a 4 KB HyperLogLog for every | resulting record. That's how complex combined data structures | are justified. They should start with ~16..64 bytes of | automatic memory in arena or stack, and only then upgrade to a | HyperLogLog. | jedisct1 wrote: | Rate-limiting is one. | refset wrote: | Various aspects of query optimization like join planning and | predicate selectivity estimation, e.g. as discussed in | https://arxiv.org/abs/2010.00728 | kristoff_it wrote: | One change to make this library more idiomatic would be to make | HyperLogLog avoid forcing heap allocation on the user. | | So from this (simplified code): pub fn | HyperLogLog(comptime p: u8) type { return struct { | dense: []u6, const Self = @This(); | pub fn init(allocator: Allocator) !Self { | return Self{ .dense = try | allocator.alloc(u6, 0), }; } | } } | | To this: pub fn HyperLogLog(comptime p: u8) | type { return struct { dense: | [1<<p]u6; const Self = @This(); | pub fn init() Self { var s = Self{}; | for (s.dense) |*x| x.* = 0; return s; | } } } | | The advantage to this API is that the user can choose where to | place the HLL (stack, heap, global memory). Also note how the new | init can't fail anymore. | | That said, in the code that I've conveniently omitted there's one | big complication: when the HLL has only few items in it, it | doesn't use the dense representation and instead uses a | `std.AutoHashMap` to keep counts. | | Unfortunately `std.AutoHashMap` is not a type that can be | reasonably disentangled from allocation. | | I was about to drop my idea of commenting when I realized that a | very neat solution would be to also provide the user with the | choice (and admittedly extra responsibility) of when to switch | from a hash map to a proper HLL. pub fn | init(start: std.AutoHashMap) Self { var s = Self{}; | for (s.dense) |*s| s.* = 0; | self.addFromHashMap(start); return s; } | | There you go, allocation-free HyperLogLog :^) | | I've also come up with my own funky interface when I implemented | Cuckoo Filters in Zig, for those who are interested: | | https://github.com/kristoff-it/zig-cuckoofilter | seiflotfy wrote: | This is awesome... question though: ```pub fn | HyperLogLog(comptime p: u8) type { return struct { dense: | [1<<p]u6; const Self = @This(); | pub fn init() Self { var s = Self{}; | for (s.dense) |*x| x.* = 0; return s; | } } }``` | | doesn't this allocate 1<<p upfront though. If yes then the HLL | of size 16384 bytes upfront which kinda beats the purpose of | having a sparse representation no? | kristoff_it wrote: | > doesn't this allocate 1<<p upfront though | | Yes, it does. The idea (see the last code snipped in that | post), is that the user delays the creation of the HLL until | they are ready to switch to a dense representation. Before | then, they just use a std.AutoHashMap directly. | | Or, alternatively, the HLL could use the same buffer for both | dense and sparse representation (see the Redis code). | kristoff_it wrote: | Or alternatively one could make good use of the existing buffer | and use that to encode the sparse representation, like Antirez | did in Redis: | | https://github.com/redis/redis/blob/unstable/src/hyperloglog... ___________________________________________________________________ (page generated 2023-01-09 23:00 UTC)