[HN Gopher] Implementing "seen by" functionality with Postgres ___________________________________________________________________ Implementing "seen by" functionality with Postgres Author : kiwicopple Score : 143 points Date : 2022-07-18 14:27 UTC (8 hours ago) (HTM) web link (supabase.com) (TXT) w3m dump (supabase.com)
| js4ever wrote:
| About 500 rps max? That seems pretty low; I can easily do something like 25k rps with 1 CPU using ClickHouse for the same scenario. Am I missing something?
| fermentation wrote:
| This is a bit of a weird post. The author sets up benchmarks, shows that hstore appears to work best, then suggests using HyperLogLog despite it performing worse. The stated reason is that it scales better, but the author didn't really discuss how HLL works, so I'm not sure why it scales better other than that it has a cool name.
| asperous wrote:
| I agree; why not do the work to show that it scales better? It violates scientific integrity to have a hypothesis, run the experiment, find that your hypothesis was wrong, and then say it is probably right regardless.
| hardwaresofton wrote:
| Looks like I could have been much clearer, apologies!
|
| The main idea behind picking HLL was that it is a good compromise between fast operation, the forget-nothing nature of the assoc table, and avoiding the should-you-really-do-that nature of storing possibly hundreds of thousands or millions of numeric KVs in an hstore column alone.
|
| There's definitely a whole 'nother post in there on pushing it to its limits, but it was already long enough, I thought!
| kiwicopple wrote:
| If you like to jump to the comments before reading the post (like me), this is an exploration of a fairly common situation: calculating the total number of people who have "viewed/seen" a blog post.
|
| tldr: use HyperLogLog [0] (if you can install extensions on your Postgres database), or hstore [1] if you can't.
|
| HyperLogLog is a probabilistic data structure which can "estimate" the number of viewers, so it scales very well for Twitter/FB-scale platforms.
|
| [0] HyperLogLog: https://en.wikipedia.org/wiki/HyperLogLog
|
| [1] hstore: https://www.postgresql.org/docs/current/hstore.html
| WayToDoor wrote:
| So how does the HyperLogLog work exactly? It seems like the solution of choice, but I'd like to know more about the internals.
| kiwicopple wrote:
| I don't have a good enough grasp of HLL to summarize it accurately, so for convenience here is the original paper [0], and hopefully HN comes through with a more digestible explanation.
|
| The blog post uses Citus' Postgres extension [1], which uses an augmented variant of the original algorithm to reduce the size of the data structure so that it is fixed-size.
|
| [0] HLL paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
|
| [1] postgresql-hll: https://github.com/citusdata/postgresql-hll
| periodontal wrote:
| Basically, hash all your data items and keep track of the maximum number of leading 0 bits seen in a hash so far.
|
| This is a decent proxy (but not without error) for how many distinct items you've seen so far, since large numbers of leading zeros should be uncommon. Finding one means you probably saw a lot of other things to get to that point (intuitively, about N leading zeroes are expected after seeing 2^N items).
|
| This is actually the same thing that a proof-of-work cryptocurrency does to control difficulty: change the target number of leading zeros so miners have to do more/less work to find a match.
|
| Of course, you could get "lucky" with a single counter and the resolution isn't great, so HLL first separates the values into buckets, which are estimated separately and then combined to give a more robust estimate.
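| A minimal sketch of that bucketed estimator in Python (a toy illustration, not the tuned algorithm from the paper; the bucket count and bias-correction constant here are simplifications):

```python
import hashlib

def hll_estimate(items, p=10):
    """Toy HyperLogLog: p bits pick one of m = 2**p buckets; each bucket
    remembers the most leading zeros seen (plus one) in the remaining bits."""
    m = 1 << p
    buckets = [0] * m
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits choose the bucket
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        buckets[idx] = max(buckets[idx], rank)
    # combine the buckets with a bias-corrected harmonic mean
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -b for b in buckets)
```

| With 1024 buckets the typical error is around 1.04/sqrt(1024), roughly 3%, which is the "more robust estimate" the comment above describes; duplicates never change the buckets, which is why only distinct items are counted.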
| [deleted]
| hardwaresofton wrote:
| Hey, I wrote the article --- the way I think of HyperLogLog and similar (but different) approaches like bloom filters and count-min sketches is that the key insight is hashing.
|
| The fact that we can reduce some piece of data to a hash, then work with the distribution of data/entropy of those hashes in number space in many different ways, is what makes these data structures work.
|
| I think of it this way --- if you have a hashing algo that turns "someone@example.com" into "7395718184927", you can really easily count with relative certainty how often you see that email address by keeping a sparse numeric associative array. Getting the same exact value again produces the same hash. Obviously doing this isn't SUPER useful on its own, because then you just get the strict cardinality (same as just checking equality with the other keys). But if you choose a weak enough, fast hashing function (or multiple, in the case of a count-min sketch), some values will collide in what they hash to while others do not -- so there will be some amount of error -- and you can control that to suit your needs, on a scale from everything in one bucket to exact-cardinality # of buckets.
|
| Here's an actual proper guide (CS lectures, as you might expect!) I found after a quick search, which you might find useful:
|
| https://williams-cs.github.io/cs358-f21/lectures/lecture10/l...
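| The sparse-associative-array idea above, pushed toward deliberately colliding hashes, is essentially the count-min sketch. A minimal sketch of one in Python (deriving the per-row hashes from a single digest is just for illustration; real implementations use independent fast hash functions):

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        # depth rows x width counters; each row uses a different hash
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, value):
        # carve one cheap per-row hash out of a single strong digest
        digest = hashlib.sha256(value.encode()).digest()
        for row in range(self.depth):
            col = int.from_bytes(digest[row * 4:(row + 1) * 4], "big") % self.width
            yield row, col

    def add(self, value):
        for row, col in self._cells(value):
            self.table[row][col] += 1

    def count(self, value):
        # collisions can only inflate a counter, so the minimum across
        # rows is an upper bound that is usually close to the true count
        return min(self.table[row][col] for row, col in self._cells(value))
```

| The width/depth knobs are exactly the dial described above, from "everything in one bucket" to "exact cardinality # of buckets": a narrower table means more collisions and more error.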
|
| (IMO focus on the algorithm description in that guide)
|
| And here's some old HN discussion:
|
| https://news.ycombinator.com/item?id=7508658
|
| To really try and understand the concept, I found it useful a while back to actually try and build it:
|
| https://vadosware.io/post/countmin-sketch-in-haskell/
|
| http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf (this paper has a really good visual depiction)
|
| I find the count-min sketch to be much more approachable
| [deleted]
| mongrelion wrote:
| Sidetracking a bit: does anybody know what tool was used to generate the table diagrams?
| kiwicopple wrote:
| These ones are custom (we use figma)
| junon wrote:
| If it's a tool I'd love to know, too. However, it very well might just be a hand-crafted vector.
| proto-n wrote:
| If I understand correctly, these numbers are the result of users viewing posts uniformly at random? While the post is a nice overview of options, I doubt the numbers generalize to the much more realistic scenario of a power-law distribution of view counts for posts.
| hardwaresofton wrote:
| I thought of that! I cover it very crudely and the generation code isn't as nice as I would have liked, but I did try to pick certain numbers of posts that would receive a disproportionate amount of views (and also made sure posts were not generated equally by everyone).
|
| I'd love to have some feedback on the attached repo!
| ilrwbwrkhv wrote:
| The ending was weird. After all that test setup and data crunching it became a bit of "I feel this is better".
| maxloh wrote:
| The mentioned Postgres extension, citus/postgresql-hll, hasn't been updated for about a year.
|
| Is it really reliable to use it in production?
| frankgrecojr wrote:
| Have there been issues that have gone unaddressed?
| cldellow wrote:
| When we used it (granted, years ago), it would crash occasionally.
|
| Looking at GitHub, there's at least one open issue around crashes based on how large of an HLL you use.
| SigmundA wrote:
| Can anyone else see the source? I am getting a 404 and would like to see the SQL statements that implement these tests.
|
| > not to mention a lot of concurrent database updates (which isn't necessarily Postgres's forte to begin with)
|
| Isn't PG a multithreaded db server whose default row store is tuned for OLTP with row-level locking, rather than OLAP? I could see this being said for SQLite, but not PG.
| cldellow wrote:
| I can't see the source either.
|
| Re the concurrent updates - yes, Postgres is multi-process. However, the hstore and HLL approaches are going to have a lot of contention at the row level. The association table wouldn't have that contention, since each user would get their own row.
|
| The hstore/HLL approaches _could_ be sharded on a hash of the user's ID. In fact, this would work really well for HLL, since I think you can union HLL structures quickly. But I don't think any of this was addressed in the article...
| SigmundA wrote:
| > Re the concurrent updates - yes, Postgres is multi-process. However, the hstore and HLL approaches are going to have a lot of contention at the row level. The association table wouldn't have that contention, since each user would get their own row.
|
| Each process is an OS-level thread; again, saying that concurrent updates are not PG's forte seems pretty ridiculous just because some of your approaches use in-row storage rather than the typical relational design. The statement seems unnecessary and may give someone with less experience the wrong idea about PG's concurrency capabilities.
| mdoms wrote:
| bastawhiz wrote:
| A way to do this for read-heavy applications is to use the separate table to store who has viewed the post, but then use a trigger to update a seen-count column.
| When you write a (unique) row to the table, the trigger counts all the views and updates the posts table. Reads have no extra overhead, only views do. You can queue/process the views at your leisure, or pause them altogether as a circuit breaker.
| hardwaresofton wrote:
| This is a great tip, thanks for sharing -- reducing read overhead to a minimum and even delaying the processing is fantastic.
|
| This would have been great to include as a sort of better assoc-table approach
| hyzyla wrote:
| I expected to find that solution in the article. Or, if someone doesn't like triggers, you can just insert a row into the associated table; if it fails with a duplication error, do nothing, otherwise increase the counter.
| stareatgoats wrote:
| I have to admit this is over my head. But sometimes being stupid is also what works, and I've always abided by the old adage that "it's better to have a tractor that works than a limo that you can't get out of the garage". I realize that in this case maybe the problem was just a pretext for demonstrating HyperLogLog, but in case someone is looking for a simple solution to this very problem, maybe this naive solution could work (use at your own peril!):
|
| So I'd create an associated table that records the user, time, and object viewed, plus a field with the rolled-up count on the object record. The problem with duplicate views is solved by wrapping the insert and the upping of the counter in the same transaction: if the insert fails because of a breach of the unique index, then the upping fails too.
|
| I probably missed something?
| hardwaresofton wrote:
| Hey, sorry the post wasn't better written!
|
| I tried to cover all the basic or obvious approaches (counter, assoc table) so that people could see the complexity kind of ramp up.
|
| You're right that an assoc table is a very workable solution, along with the partitioning capabilities of Postgres these days.
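| The unique-row-plus-trigger pattern described in the comments above can be sketched end to end. This uses in-memory SQLite as a stand-in for Postgres (trigger syntax differs slightly there, and the table/column names are made up for illustration):

```python
import sqlite3

# One unique (post_id, user_id) row per view, plus a rolled-up counter
# on the post, kept in sync by a trigger.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, seen_count INTEGER NOT NULL DEFAULT 0);
CREATE TABLE post_seen_by (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    user_id INTEGER NOT NULL,
    PRIMARY KEY (post_id, user_id)
);
-- fires only when a genuinely new (post_id, user_id) pair is inserted
CREATE TRIGGER bump_seen AFTER INSERT ON post_seen_by
BEGIN
    UPDATE posts SET seen_count = seen_count + 1 WHERE id = NEW.post_id;
END;
""")

def record_view(post_id, user_id):
    # a duplicate view hits the primary key and is silently ignored,
    # so the trigger never double-counts
    db.execute("INSERT OR IGNORE INTO post_seen_by VALUES (?, ?)",
               (post_id, user_id))
    db.commit()

db.execute("INSERT INTO posts (id) VALUES (1)")
for user in (10, 11, 10, 12, 11):
    record_view(1, user)
print(db.execute("SELECT seen_count FROM posts WHERE id = 1").fetchone()[0])  # 3
```

| In Postgres the same dedup-without-error is spelled `INSERT ... ON CONFLICT DO NOTHING`, and the insert plus counter bump share a transaction as stareatgoats describes.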
| davidw wrote:
| Unless I knew there was going to be a huge boatload of data, I'd go for an association table too. That's the cleanest thing in terms of the DB.
| cldellow wrote:
| I felt like I didn't understand the blog post very well, either.
|
| It spends a lot of time building a benchmark framework and measuring things. At the end, hstore is the choice supported by the data. But instead the author writes:
|
| > If we go strictly with the data, the best way looks to be the hstore-powered solution, but I think the HLL is probably the right choice.
|
| ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment? Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
|
| I'm also confused by the benchmark. The autocannon script has:
|
|     // 10% of the time, get *all* the users
|     request.method = 'GET'
|     request.path = `/posts/${postId}/seen-by/users`
|
| Does this imply there's a path it can hit to get the list of users who saw a post? HLL is a probabilistic data structure, so it doesn't support that with 100% accuracy. You could reverse-engineer it from the list of users, but the performance will vary with the number of users, which didn't seem to be captured by the benchmark. I tried to look at the source code, but the GitLab URL was 404ing for me.
| dvlsg wrote:
| I don't think I'm getting 404s, but the repo's gitlab url does keep bouncing me to a login page.
|
| > You need to sign in or sign up before continuing.
|
| Has gitlab always enforced that, even for public repos? Or is the linked repo not public?
| jakear wrote:
| Linked repo isn't public, you can check through the author's public repos on gitlab.
| derefr wrote:
| > ?!? Why bother defining a framework for measurement if you're just going to substitute your own judgment?
Perhaps the framework didn't capture something important -- like the absolute number of people looking at any given post, concurrent updates, etc.
|
| Just guessing: the benchmark tells you the time complexity. It doesn't tell you the space complexity. The author is optimizing between the time and space complexity of the solutions, with the time-complexity benchmarks as an input. (Also, at a scale the benchmark doesn't reach, space complexity starts to affect time complexity, as large datasets become less able to stay hot in the disk cache.)
| hardwaresofton wrote:
| Yup, this is it! You've said it much better than I did, thank you.
| nnnnnande wrote:
| Yeah, this is probably the reason, and the author even elaborates on this in the sentences following the bit quoted by cldellow:
|
| > Even though the data says hstore, knowing that posts will be seen by more and more people over time, I might choose the HLL solution for an actual implementation. It's far less likely to pose a bloated row problem, [...]
| hardwaresofton wrote:
| Just posting here too, but yup, this is exactly what I was trying to convey.
|
| Hstore might have been the fastest, but the way I am using it, or the scale the use case could grow to, might not work out.
|
| Bigger benchmarks could have been done! Maybe a multi-part post would have been better so I could split apart the methodology and the results/blabbering about approach!
| derefr wrote:
| If the goal is to count distinct users who have seen something, but _also_ to efficiently retrieve the actual set of users who have seen something -- then why wasn't a roaring bitmap (https://pgxn.org/dist/pg_roaringbitmap/) considered as an option? Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are), a roaring bitmap should do everything hstore does, just a lot more efficiently.
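| For intuition, the bitmap idea (minus roaring's chunked compression) fits in a few lines of Python, treating one big integer as the bitmap and assuming small non-negative integer user IDs as in the parent comment:

```python
class SeenBitmap:
    """Uncompressed bitmap presence set keyed by integer user ID.

    A roaring bitmap stores the same information but chunks the ID space
    and compresses each chunk; this toy version shows only the
    set/test/count/union semantics.
    """

    def __init__(self):
        self.bits = 0

    def mark_seen(self, user_id):
        self.bits |= 1 << user_id          # set the user's bit

    def has_seen(self, user_id):
        return bool(self.bits >> user_id & 1)

    def count(self):
        return bin(self.bits).count("1")   # number of distinct viewers

    def union(self, other):
        # merging two shards is a bitwise OR, which is why sharded
        # bitmap (and HLL) counters recombine so cheaply
        merged = SeenBitmap()
        merged.bits = self.bits | other.bits
        return merged
```

| Unlike HLL, this answers "exactly who has seen it" as well as "how many", which is the trade the parent comment is pointing at.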
| djKianoosh wrote:
| The docs/paper for roaring bitmaps say they're to be used for large datasets, but I must have missed where "large" is defined. In practice, how large does a set have to be to take advantage of roaring bitmaps?
| derefr wrote:
| The essential consideration is that roaring bitmaps are _compressed_ -- trading increased CPU overhead per read/write for lower size-on-disk => more rows fetched per IOP + better cache coherency. So think of the point where the CPU overhead of compression/decompression for every read/write becomes worth those gains.
|
| Most developers make this kind of judgement every day: when deciding whether to e.g. gzip a file before SCPing it somewhere, or whether to apply e.g. lz4 to the data being written to a KV store like Redis. IMHO that same intuition applies here.
|
| See also: the heuristic Redis uses to decide whether to back a set with a hashtable vs. a sorted list -- it sits around 512 elements: https://redis.com/ebook/part-2-core-concepts/01chapter-9-red...
|
| ---
|
| Mind you, the compression in roaring bitmaps is very efficient, has extremely low start-up costs, and is effectively "random access" rather than streaming; these low costs are why roaring bitmaps are preferred over other compressed-bitmap approaches.
|
| And also, it's not like, at small data sizes, even a "large" (hundreds/thousands of ops) CPU penalty would really be _much_ of a penalty, either -- because, with small data, you wouldn't be paying it very often! And within a "heavyweight" RDBMS like Postgres, which does a bunch of things when you make a request to it, these sorts of small CPU costs usually get lost entirely under the static overhead of query planning. (Postgres takes advantage of this property in a lot of other ways as well.)
|
| It's only when you're trying to do something _at scale_, but where dataset size _isn't_ the scaling factor (example: running a core DNS server -- trillions of queries, but against only a small amount of data), that a roaring bitmap would be an explicit "wrong choice" vs. a sorted-list, b-tree, or hashtable-backed presence set.
| 5e92cb50239222b wrote:
| > deciding whether to e.g. gzip a file before SCPing it somewhere
|
| Off topic, but to avoid wasting time on such trivia in the future, add
|
|     Compression yes
|
| to your ~/.ssh/config.
| derefr wrote:
| Not a good idea if you're frequently sending things that are _already_ compressed, though. Which many formats today are.
|
| (Also, I was using `scp` as a general abstraction; personally, I'm not often pushing things around between VMs directly, but rather between VMs and object storage, where part of the question is not just about the transport encoding, but whether it's worth it to have the data be compressed-at-rest in the object store. Usually depends on how many times you're going to be downloading it.)
| hardwaresofton wrote:
| Author here, a roaring bitmap could absolutely have been used!
|
| The concept is similar and it absolutely should have been mentioned.
|
| > Presuming user IDs are non-sparse nonnegative integers (as SERIAL/BIGSERIAL pkeys usually are),
|
| I personally wanted to build a method that worked easily for textual content, but with that constraint (which is valid and what I ended up doing in practice) a roaring bitmap definitely works!
|
| I want to give this a try in the future
| jakswa wrote:
| I just wanted to say thanks for linking. I feel like I'm always on the search for new PG storage extensions, even if I'm not able to use them everywhere (sad to see AWS RDS wasn't in the list of vendors that include/support it). ___________________________________________________________________ (page generated 2022-07-18 23:00 UTC)