[HN Gopher] Show HN: HyperLogLog in Zig
       ___________________________________________________________________
        
       Show HN: HyperLogLog in Zig
        
       Author : seiflotfy
       Score  : 112 points
       Date   : 2023-01-09 16:41 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | eatonphil wrote:
       | Awesome! Does Axiom use Zig? Most of your projects seem to be in
       | Go. Why did you choose to do this in Zig?
        
         | henry_viii wrote:
         | I'm also very curious to know this. HyperLogLog is written in
         | Go:
         | 
         | https://github.com/axiomhq/hyperloglog
         | 
         | I would expect V to be a more natural choice for a port than
         | Zig.
        
           | shilldetector wrote:
           | As a notice to others reading this user's comments, they are
           | a recently made account that seems to exist mostly to post
           | about V and post negatively about Zig. Extremely suspect
           | given the history of poor behavior from the V project.
        
           | seiflotfy wrote:
           | Author here, will publish my vlang take on it in the next
           | couple of days.
        
             | henry_viii wrote:
             | Looking forward to it :).
             | 
             | It seems the V port should be pretty straightforward (the
             | repo is 880 LOC and doesn't have any dependencies).
        
       | extasia wrote:
       | Looks cool. What are some applications of a cardinality
       | estimator?
        
         | zX41ZdbW wrote:
         | Memory bound alternative for the COUNT(DISTINCT ...) aggregate
         | function in SQL.
         | 
         | ClickHouse[1] has a few functions for that, with different
         | tradeoffs in memory usage, precision, and performance:
         | - uniqExact - the same as COUNT(DISTINCT ...) - the exact
         | version, using a hash table;         - uniqCombined - a
         | combination of a linear array of small size in memory arena, a
         | hash table, and a HyperLogLog - this data structure is similar
         | to HyperLogLog++; the log2 of the number of cells is
         | controllable by the user;         - uniq - a cardinality
         | estimator based on adaptive sampling by hash, lower quality to
         | memory usage compared to HyperLogLog, but beats it in CPU
         | efficiency;         - uniqTheta - uses the Theta sketch
         | algorithm, whatever it means;         - uniqUpTo(N) - my
         | favorite, uses a linear array to give the precise answer up to
         | N, and always answers N + 1 when the number of distinct
         | elements is larger than N;         - groupBitmap - an exact
         | calculation using Roaring Bitmaps, makes sense for dense
         | integer sets;
         | 
         | [1] https://github.com/ClickHouse/ClickHouse/
         | 
         | What is often forgotten in designing a data structure for a
         | cardinality estimator - is that it should work well not only
         | for a few large states but also for a large number of small
         | sets.
         | 
         | For example, in a query like follows:                   SELECT
         | URL, COUNT(DISTINCT UserID) FROM pageviews GROUP BY URL
         | 
         | You should expect that there are a large number of URLs, but
         | most of them will get just 1..2 unique UserIDs. Obviously, it's
         | not practical to allocate even a 4 KB HyperLogLog for every
         | resulting record. That's how complex combined data structures
         | are justified. They should start with ~16..64 bytes of
         | automatic memory in arena or stack, and only then upgrade to a
         | HyperLogLog.
        
         | jedisct1 wrote:
         | Rate-limiting is one.
        
         | refset wrote:
         | Various aspects of query optimization like join planning and
         | predicate selectivity estimation, e.g. as discussed in
         | https://arxiv.org/abs/2010.00728
        
       | kristoff_it wrote:
       | One change to make this library more idiomatic would be to make
       | HyperLogLog avoid forcing heap allocation on the user.
       | 
       | So from this (simplified code):                   pub fn
       | HyperLogLog(comptime p: u8) type {             return struct {
       | dense: []u6,                      const Self = @This();
       | pub fn init(allocator: Allocator) !Self {
       | return Self{                         .dense = try
       | allocator.alloc(u6, 0),                     };                 }
       | }         }
       | 
       | To this:                   pub fn HyperLogLog(comptime p: u8)
       | type {             return struct {                 dense:
       | [1<<p]u6;                      const Self = @This();
       | pub fn init() Self {                     var s = Self{};
       | for (s.dense) |*x| x.* = 0;                     return s;
       | }             }         }
       | 
       | The advantage to this API is that the user can choose where to
       | place the HLL (stack, heap, global memory). Also note how the new
       | init can't fail anymore.
       | 
       | That said, in the code that I've conveniently omitted there's one
       | big complication: when the HLL has only few items in it, it
       | doesn't use the dense representation and instead uses a
       | `std.AutoHashMap` to keep counts.
       | 
       | Unfortunately `std.AutoHashMap` is not a type that can be
       | reasonably disentangled from allocation.
       | 
       | I was about to drop my idea of commenting when I realized that a
       | very neat solution would be to also provide the user with the
       | choice (and admittedly extra responsibility) of when to switch
       | from a hash map to a proper HLL.                   pub fn
       | init(start: std.AutoHashMap) Self {             var s = Self{};
       | for (s.dense) |*s| s.* = 0;
       | self.addFromHashMap(start);             return s;         }
       | 
       | There you go, allocation-free HyperLogLog :^)
       | 
       | I've also come up with my own funky interface when I implemented
       | Cuckoo Filters in Zig, for those who are interested:
       | 
       | https://github.com/kristoff-it/zig-cuckoofilter
        
         | seiflotfy wrote:
         | This is awesome... question though: ```pub fn
         | HyperLogLog(comptime p: u8) type { return struct { dense:
         | [1<<p]u6;                           const Self = @This();
         | pub fn init() Self {                     var s = Self{};
         | for (s.dense) |*x| x.* = 0;                     return s;
         | }             }         }```
         | 
         | doesn't this allocate 1<<p upfront though. If yes then the HLL
         | of size 16384 bytes upfront which kinda beats the purpose of
         | having a sparse representation no?
        
           | kristoff_it wrote:
           | > doesn't this allocate 1<<p upfront though
           | 
           | Yes, it does. The idea (see the last code snipped in that
           | post), is that the user delays the creation of the HLL until
           | they are ready to switch to a dense representation. Before
           | then, they just use a std.AutoHashMap directly.
           | 
           | Or, alternatively, the HLL could use the same buffer for both
           | dense and sparse representation (see the Redis code).
        
         | kristoff_it wrote:
         | Or alternatively one could make good use of the existing buffer
         | and use that to encode the sparse representation, like Antirez
         | did in Redis:
         | 
         | https://github.com/redis/redis/blob/unstable/src/hyperloglog...
        
       ___________________________________________________________________
       (page generated 2023-01-09 23:00 UTC)