[HN Gopher] Implementing "seen by" functionality with Postgres
       ___________________________________________________________________
        
       Implementing "seen by" functionality with Postgres
        
       Author : kiwicopple
       Score  : 143 points
       Date   : 2022-07-18 14:27 UTC (8 hours ago)
        
 (HTM) web link (supabase.com)
 (TXT) w3m dump (supabase.com)
        
       | js4ever wrote:
        | About 500 rps max? That seems pretty low; I can easily do
        | something like 25k rps with 1 CPU using ClickHouse for the same
        | scenario. Am I missing something?
        
       | fermentation wrote:
        | This is a bit of a weird post. The author sets up benchmarks,
        | shows that hstore appears to work best, then suggests using
        | HyperLogLog despite it performing worse. The stated reason is
        | that it scales better, but the author didn't really discuss how
        | HLL works, so I'm not sure why it scales better other than that
        | it has a cool name.
        
         | asperous wrote:
          | I agree; why not do the work to show that it scales better?
          | It violates scientific integrity to have a hypothesis, run
          | the experiment, find that your hypothesis was wrong, and then
          | say it is probably right regardless.
        
         | hardwaresofton wrote:
         | Looks like I could have been much clearer, apologies!
         | 
          | The main idea behind picking HLL was that it is a good
          | compromise between fast operation and the forget-nothing
          | nature of the assoc table, while avoiding the
          | should-you-really-do-that nature of storing possibly hundreds
          | of thousands or millions of numeric kvs in an hstore column
          | alone.
          | 
          | There's definitely a whole 'nother post in there on pushing it
          | to its limits, but this one was already long enough, I thought!
        
       | kiwicopple wrote:
       | If you like to jump to the comments before reading the post (like
       | me), this is an exploration of a fairly common situation:
       | calculating the total number of people who have "viewed/seen" a
       | blog post.
       | 
        | tl;dr: use HyperLogLog [0] if you can install extensions on
        | your Postgres database, or hstore [1] if you can't.
        | 
        | HyperLogLog is a probabilistic data structure which can
        | "estimate" the number of viewers, so it scales very well for
        | Twitter/FB-scale platforms.
       | 
       | [0] HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog
       | 
       | [1] hstore: https://www.postgresql.org/docs/current/hstore.html
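        | 
        | For a rough flavor of the two approaches, here's an untested
        | sketch (assuming a hypothetical "posts" table and integer user
        | ids -- not the exact SQL from the post):
        | 
        |     -- hstore: store each viewer's id as a key in one column
        |     CREATE EXTENSION IF NOT EXISTS hstore;
        |     ALTER TABLE posts
        |       ADD COLUMN seen_by hstore DEFAULT ''::hstore;
        | 
        |     UPDATE posts
        |        SET seen_by = seen_by || hstore('42', '1')
        |      WHERE id = 1;                       -- user 42 saw post 1
        | 
        |     SELECT coalesce(array_length(akeys(seen_by), 1), 0)
        |       FROM posts WHERE id = 1;           -- exact count
        | 
        |     -- HLL: store a small fixed-size sketch instead
        |     CREATE EXTENSION IF NOT EXISTS hll;
        |     ALTER TABLE posts
        |       ADD COLUMN seen_by_hll hll DEFAULT hll_empty();
        | 
        |     UPDATE posts
        |        SET seen_by_hll =
        |            hll_add(seen_by_hll, hll_hash_integer(42))
        |      WHERE id = 1;
        | 
        |     SELECT hll_cardinality(seen_by_hll)  -- estimated count
        |       FROM posts WHERE id = 1;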
        
       | WayToDoor wrote:
        | So how does HyperLogLog work, exactly? It seems like the
        | solution of choice, but I'd like to know more about the
        | internals.
        
         | kiwicopple wrote:
          | I don't have a good enough grasp of HLL to summarize it
          | accurately, so for convenience here is the original paper [0],
          | and hopefully HN comes through with a more digestible
          | explanation.
          | 
          | The blog post uses Citus' Postgres extension [1], which uses
          | an augmented variant of the original algorithm to reduce the
          | size of the data structure so that it is fixed-size.
         | 
         | [0] HLL paper:
         | http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
         | 
         | [1] postgresql-hll: https://github.com/citusdata/postgresql-hll
        
         | periodontal wrote:
         | Basically, hash all your data items, keep track of the maximum
         | number of leading 0 bits seen in a hash so far.
         | 
          | This is a decent proxy (but not without error) for how many
          | distinct items you've seen so far, since large numbers of
          | leading zeros should be uncommon. Finding one means you
          | probably saw a lot of other things to get to that point
          | (intuitively, about N leading zeroes are expected after seeing
          | 2^N items).
         | 
          | This is actually the same thing that a proof-of-work
          | cryptocurrency does to control difficulty: change the target
          | number of leading zeros so miners have to do more or less
          | work to find a match.
         | 
         | Of course, you could get "lucky" with a single counter and the
         | resolution isn't great, so HLL first separates the values into
         | buckets which are estimated separately and then combined to
         | give a more robust estimate.
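          | 
          | You can sanity-check that intuition in Postgres itself with a
          | toy query (untested; hashtext() is an internal, undocumented
          | hashing function):
          | 
          |     -- Max run of leading zero bits across n hashes grows
          |     -- roughly like log2(n); here n = 100k, so expect ~17.
          |     WITH h AS (
          |       SELECT hashtext(g::text) & x'7FFFFFFF'::int AS v
          |         FROM generate_series(1, 100000) g
          |     )
          |     SELECT max(30 - floor(log(2, v)))::int
          |            AS max_leading_zeros
          |       FROM h
          |      WHERE v > 0;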
        
         | [deleted]
        
         | hardwaresofton wrote:
          | Hey, I wrote the article --- the way I think of HyperLogLog
          | and similar (but different) approaches like Bloom filters and
          | count-min sketches is that the key insight is hashing.
         | 
          | The fact that we can reduce some piece of data to a hash, and
          | then work on the distribution/entropy of those hashes in
          | number space in many different ways, is what makes these data
          | structures work.
         | 
          | I think of it this way --- if you have a hashing algo that
          | turns "someone@example.com" into "7395718184927", you can
          | really easily count, with relative certainty, how often you
          | see that email address by keeping a sparse numeric
          | associative array, since getting the same exact value again
          | produces the same hash. Doing only this isn't SUPER useful,
          | because you just get strict cardinality (same as just
          | checking equality with the other keys). But if you choose a
          | weak enough, fast hashing function (or multiple, in the case
          | of a count-min sketch), some values will collide in what they
          | hash to while others do not -- so there will be some amount
          | of error -- and you can control that error to suit your
          | needs, on a scale from everything in one bucket to
          | exact-cardinality # of buckets.
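          | 
          | As a concrete (untested) version of that sparse numeric
          | associative array in Postgres -- table and names are made up,
          | and hashtextextended() is an internal hashing function:
          | 
          |     -- Count occurrences keyed by a hash of the value,
          |     -- never storing the value itself.
          |     CREATE TABLE seen_hashes (
          |       h bigint PRIMARY KEY,
          |       n bigint NOT NULL DEFAULT 1
          |     );
          | 
          |     INSERT INTO seen_hashes (h)
          |     VALUES (hashtextextended('someone@example.com', 0))
          |     ON CONFLICT (h) DO UPDATE
          |       SET n = seen_hashes.n + 1;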
         | 
         | Here's an actual proper guide (CS lectures, as you might
         | expect!) I found after a quick search, which you might find
         | useful:
         | 
         | https://williams-cs.github.io/cs358-f21/lectures/lecture10/l...
         | 
         | (IMO focus on the algorithm description)
         | 
         | And here's some old HN discussion:
         | 
         | https://news.ycombinator.com/item?id=7508658
         | 
         | To really try and understand the concept I found it useful a
         | while back to actually try and build it:
         | 
         | https://vadosware.io/post/countmin-sketch-in-haskell/
         | 
         | http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf (This
         | paper has a really good visual depiction)
         | 
          | I find the count-min sketch to be much more approachable.
        
       | [deleted]
        
       | mongrelion wrote:
       | Sidetracking a bit: does anybody know what tool was used to
       | generate the table diagrams?
        
         | kiwicopple wrote:
          | These ones are custom (we use Figma).
        
         | junon wrote:
         | If it's a tool I'd love to know, too. However it very well
         | might just be a hand-crafted vector.
        
       | proto-n wrote:
        | If I understand correctly, these numbers are the result of
        | users viewing posts uniformly at random? While the post is a
        | nice overview of the options, I doubt the numbers generalize to
        | the much more realistic scenario of a power-law distribution of
        | view counts across posts.
        
         | hardwaresofton wrote:
          | I thought of that! I cover it very crudely and the generation
          | code isn't as nice as I would have liked, but I did try to
          | pick certain numbers of posts to receive a disproportionate
          | amount of views (and also to make sure posts were not
          | generated equally by everyone).
         | 
         | I'd love to have some feedback on the attached repo!
        
       | ilrwbwrkhv wrote:
        | The ending was weird. After all that test setup and data
        | crunching, it became a bit of "I feel this is better".
        
       | maxloh wrote:
        | The mentioned Postgres extension, citus/postgresql-hll, hasn't
        | been updated for about a year.
        | 
        | Is it really reliable enough to use in production?
        
         | frankgrecojr wrote:
          | Have there been issues that have gone unaddressed?
        
           | cldellow wrote:
           | When we used it (granted, years ago), it would crash
           | occasionally.
           | 
           | Looking at GitHub, there's at least one open issue around
           | crashes based on how large of an HLL you use.
        
       | SigmundA wrote:
        | Can anyone else see the source? I am getting a 404 and would
        | like to see the sql statements that implement these tests.
        | 
        | > not to mention a lot of concurrent database updates (which
        | isn't necessarily Postgres's forte to begin with)
        | 
        | Isn't PG a multithreaded db server, with the default row store
        | tuned for OLTP with row-level locking rather than OLAP? I could
        | see this being said for SQLite but not PG.
        
         | cldellow wrote:
         | I can't see the source either.
         | 
         | Re the concurrent updates - yes, Postgres is multi-process.
         | However, the hstore and HLL approaches are going to have a lot
         | of contention at the row level. The association table wouldn't
         | have that contention, since each user would get their own row.
         | 
         | The hstore/HLL approaches _could_ be sharded on a hash of the
         | user's ID. In fact, this would work really well for HLL since I
         | think you can union HLL structures quickly. But I don't think
         | any of this was addressed in the article...
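          | 
          | For reference, a sharded-HLL rollup could look something like
          | this (untested; the table layout is hypothetical, the
          | aggregate is from postgresql-hll):
          | 
          |     -- One HLL per (post, shard); union them at read time.
          |     SELECT post_id,
          |            hll_cardinality(hll_union_agg(seen_by))
          |       FROM post_seen_shards
          |      GROUP BY post_id;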
        
           | SigmundA wrote:
           | >Re the concurrent updates - yes, Postgres is multi-process.
           | However, the hstore and HLL approaches are going to have a
           | lot of contention at the row level. The association table
           | wouldn't have that contention, since each user would get
           | their own row.
           | 
            | Each process is an OS-level thread; again, saying that
            | concurrent updates are not PG's forte seems pretty
            | ridiculous just because some of your approaches use in-row
            | storage rather than the typical relational design. The
            | statement seems unnecessary and may give someone with less
            | experience the wrong idea about PG's concurrency abilities.
        
       | mdoms wrote:
        
       | bastawhiz wrote:
        | A way to do this for read-heavy applications is to use the
        | separate table to store who has viewed the post, but then use a
        | trigger to update a seen-count column. When you write a (unique)
        | row to the table, the trigger counts all the views and updates
        | the posts table. Reads then have no extra overhead; only
        | recording a view does. You can queue/process the views at your
        | leisure, or pause them altogether as a circuit breaker.
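        | 
        | Something like this untested sketch (table and column names
        | made up):
        | 
        |     CREATE TABLE post_views (
        |       post_id bigint NOT NULL,
        |       user_id bigint NOT NULL,
        |       PRIMARY KEY (post_id, user_id)
        |     );
        | 
        |     CREATE FUNCTION bump_seen_count() RETURNS trigger AS $$
        |     BEGIN
        |       -- recount and cache the total on the posts row
        |       UPDATE posts
        |          SET seen_count = (SELECT count(*)
        |                              FROM post_views
        |                             WHERE post_id = NEW.post_id)
        |        WHERE id = NEW.post_id;
        |       RETURN NEW;
        |     END;
        |     $$ LANGUAGE plpgsql;
        | 
        |     CREATE TRIGGER post_views_bump
        |       AFTER INSERT ON post_views
        |       FOR EACH ROW EXECUTE FUNCTION bump_seen_count();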
        
         | hardwaresofton wrote:
         | This is a great tip, thanks for sharing -- reducing read
         | overhead to a minimum and even delaying the processing is
         | fantastic.
         | 
          | This would have been great to include as a sort of better
          | assoc-table approach.
        
         | hyzyla wrote:
          | I expected to find that solution in the article. Or, if
          | someone doesn't like triggers, you can just insert a row into
          | the association table; if it fails with a duplicate-key
          | error, do nothing, otherwise increment the counter.
        
       | stareatgoats wrote:
        | I have to admit this is over my head. But sometimes being
        | stupid is also what works, and I've always abided by the old
        | adage that "it's better to have a tractor that works than a
        | limo that you can't get out of the garage". I realize that in
        | this case maybe the problem was just a pretext for
        | demonstrating HyperLogLog, but in case someone is looking for a
        | simple solution to this very problem, maybe this naive approach
        | could work (use at your own peril!):
       | 
        | So I'd create an association table that records the user, time,
        | and object viewed, plus a field with the rolled-up count on the
        | object record. The problem with duplicate views is solved by
        | wrapping the insert and the upping of the counter in the same
        | transaction: if the insert fails because of a breach of the
        | unique index, then the upping fails too.
        | 
        | I probably missed something?
        
         | hardwaresofton wrote:
          | Hey, sorry the post wasn't better written!
         | 
         | I tried to cover all the basic or obvious approaches (counter,
         | assoc table) so that people could see the complexity kind of
         | ramp up.
         | 
          | You're right that an assoc table is a very workable solution,
          | especially combined with the partitioning capabilities of
          | Postgres these days.
        
         | davidw wrote:
         | Unless I knew there was going to be a huge boatload of data,
         | I'd go for an association table too. That's the cleanest thing
         | in terms of the DB.
        
         | cldellow wrote:
         | I felt like I didn't understand the blog post very well,
         | either.
         | 
          | It spends a lot of time building a benchmark framework and
         | measuring things. At the end, hstore is the choice supported by
         | the data. But instead the author writes:
         | 
         | > If we go strictly with the data, the best way looks to be the
         | hstore-powered solution, but I think the HLL is probably the
         | right choice.
         | 
         | ?!? Why bother defining a framework for measurement if you're
         | just going to substitute your own judgment? Perhaps the
         | framework didn't capture something important -- like the
         | absolute number of people looking at any given post, concurrent
         | updates, etc.
         | 
          | I'm also confused by the benchmark. The autocannon script has:
          | 
          |     // 10% of the time, get *all* the users
          |     request.method = 'GET'
          |     request.path = `/posts/${postId}/seen-by/users`
         | 
         | Does this imply there's a path it can hit to get the list of
         | users who saw a post? HLL is a probabilistic data structure, so
         | it doesn't support that with 100% accuracy. You could reverse
         | engineer it from the list of users, but the performance will
         | vary with the number of users, which didn't seem to be captured
         | by the benchmark. I tried to look at the source code, but the
         | GitLab URL was 404ing for me.
        
           | dvlsg wrote:
            | I don't think I'm getting 404s, but the repo's GitLab URL
            | does keep bouncing me to a login page.
            | 
            | > You need to sign in or sign up before continuing.
            | 
            | Has GitLab always enforced that, even for public repos? Or
            | is the linked repo not public?
        
             | jakear wrote:
              | The linked repo isn't public; you can check by looking
              | through the author's public repos on GitLab.
        
           | derefr wrote:
           | > ?!? Why bother defining a framework for measurement if
           | you're just going to substitute your own judgment? Perhaps
           | the framework didn't capture something important -- like the
           | absolute number of people looking at any given post,
           | concurrent updates, etc.
           | 
           | Just guessing: the benchmark tells you the time complexity.
           | It doesn't tell you the space complexity. The author is
           | optimizing between the time- and space-complexity of the
           | solutions, with the time-complexity benchmarks as an input.
           | (Also, at a scale the benchmark doesn't reach, space-
           | complexity starts to affect time-complexity, as large
           | datasets become less able to be hot in disk cache.)
        
             | hardwaresofton wrote:
             | Yup this is it! You've said it much better than I did,
             | thank you.
        
             | nnnnnande wrote:
             | Yeah, this is probably the reason and the author even
             | elaborates on this in the sentences following the bit
             | quoted by cldellow:
             | 
             | > Even though the data says hstore, knowing that posts will
             | be seen by more and more people over time, I might choose
             | the HLL solution for an actual implementation. It's far
             | less likely to pose a bloated row problem, [...]
        
               | hardwaresofton wrote:
               | Just posting here too, but yup this is exactly what I was
               | trying to convey.
               | 
                | Hstore might have been the fastest, but the way I am
                | using it might not work out at the scales this use
                | case could grow to.
                | 
                | Bigger benchmarks could have been done! Maybe a
                | multi-part post would have been better, so I could
                | split apart the methodology from the results/blabbering
                | about the approach!
        
       | derefr wrote:
        | If the goal is to count distinct users who have seen something,
        | but _also_ to efficiently retrieve the actual set of users who
        | have seen something -- then why wasn't a roaring bitmap
        | (https://pgxn.org/dist/pg_roaringbitmap/) considered as an
        | option? Presuming user IDs are non-sparse nonnegative integers
        | (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap should
        | do everything hstore does, just a lot more efficiently.
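        | 
        | A sketch of what that might look like (untested; function names
        | as documented by pg_roaringbitmap, table made up):
        | 
        |     CREATE EXTENSION IF NOT EXISTS roaringbitmap;
        |     ALTER TABLE posts
        |       ADD COLUMN seen_by roaringbitmap
        |       DEFAULT rb_build('{}'::integer[]);
        | 
        |     -- record that user 42 saw post 1
        |     UPDATE posts SET seen_by = rb_add(seen_by, 42)
        |      WHERE id = 1;
        | 
        |     -- exact distinct count *and* the actual user ids
        |     SELECT rb_cardinality(seen_by), rb_to_array(seen_by)
        |       FROM posts WHERE id = 1;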
        
         | djKianoosh wrote:
          | The docs/paper for roaring bitmaps say they're meant for
          | large datasets, but I must have missed where "large" is
          | defined. In practice, how large does a set have to be to take
          | advantage of roaring bitmaps?
        
           | derefr wrote:
            | The essential consideration is that roaring bitmaps are
            | _compressed_ -- trading increased CPU overhead per
            | read/write for lower size-on-disk => more rows fetched per
            | IOP + better cache-coherency. So think of the point where
            | the CPU overhead of compression/decompression for every
            | read/write becomes worth those gains.
           | 
           | Most developers make this kind of judgement every day: when
           | deciding whether to e.g. gzip a file before SCPing it
           | somewhere; or whether to apply e.g. lz4 to the data being
           | written to a KV store like Redis. IMHO that same intuition
           | applies here.
           | 
           | See also: the heuristic Redis uses to decide whether to back
           | a set with a hashtable, vs. a sorted list -- sits around 512
           | elements: https://redis.com/ebook/part-2-core-
           | concepts/01chapter-9-red...
           | 
           | ---
           | 
           | Mind you, the compression in roaring bitmaps is very
           | efficient, has extremely low start-up costs, and is
           | effectively "random access" rather than streaming; these low
           | costs are why roaring bitmaps are preferred over other
           | compressed-bitmap approaches.
           | 
            | And also, it's not like, at small data sizes, even a
            | "large" (hundreds/thousands of ops) CPU penalty would
            | really be _much_ of a penalty, either -- because, with
            | small data, you wouldn't be paying it very often! And
            | within a "heavyweight" RDBMS like Postgres, which does a
            | bunch of things when you make a request to it, these sorts
            | of small CPU costs usually get lost entirely under the
            | static overhead of query planning. (Postgres takes
            | advantage of this property in a lot of other ways as well.)
           | 
            | It's only when you're trying to do something _at scale_,
            | but where dataset size _isn't_ the scaling factor (example:
            | running a core DNS server -- trillions of queries, but
            | against only a small amount of data) that a roaring bitmap
            | would be an explicit "wrong choice" vs. a sorted-list or
            | b-tree or hashtable-backed presence set.
        
             | 5e92cb50239222b wrote:
             | > deciding whether to e.g. gzip a file before SCPing it
             | somewhere
             | 
              | Off topic, but to not waste time on such trivia in the
              | future, add
              | 
              |     Compression yes
              | 
              | to your ~/.ssh/config.
        
               | derefr wrote:
               | Not a good idea if you're frequently sending things that
               | are _already_ compressed, though. Which many formats
               | today are.
               | 
               | (Also, I was using `scp` as a general abstraction;
               | personally, I'm not often pushing things around between
               | VMs directly, but rather between VMs and object storage;
               | where part of the question is not just about the
               | transport encoding, but rather whether it's worth it to
               | have the data be compressed-at-rest in the object store.
               | Usually depends on how many times you're going to be
               | downloading it.)
        
         | hardwaresofton wrote:
         | Author here, a roaring bitmap could absolutely have been used!
         | 
         | The concept is similar and it absolutely should have been
         | mentioned.
         | 
         | > Presuming user IDs are non-sparse nonnegative integers (as
         | SERIAL/BIGSERIAL pkeys usually are),
         | 
          | I personally wanted to build a method that worked easily for
          | textual content, but with that constraint (which is valid,
          | and what I ended up doing in practice) a roaring bitmap
          | definitely works!
         | 
         | I want to give this a try in the future
        
         | jakswa wrote:
          | I just wanted to say thanks for linking. I feel like I'm
          | always on the lookout for new PG storage extensions, even if
          | I'm not able to use them everywhere (sad to see AWS RDS
          | wasn't in the list of vendors that include/support it).
        
       ___________________________________________________________________
       (page generated 2022-07-18 23:00 UTC)