[HN Gopher] What's good about offset pagination; designing paral...
       ___________________________________________________________________
        
       What's good about offset pagination; designing parallel cursor-
       based web APIs
        
       Author : clra
       Score  : 54 points
       Date   : 2021-01-03 17:18 UTC (5 hours ago)
        
 (HTM) web link (brandur.org)
 (TXT) w3m dump (brandur.org)
        
       | draw_down wrote:
       | > it uses offsets for pagination... understood to be bad practice
       | by today's standards. Although convenient to use, offsets are
       | difficult to keep performant in the backend
       | 
       | This is funny. Using offsets is known to be bad practice
       | because.... it's hard to do.
       | 
       | Look I'm just a UI guy so what do I know. But this argument gets
       | old because I'm sorry, but people want a paginated list and to
       | know how many pages are in the list. Clicking "next page" 10
       | times instead of clicking to page 10 is bullshit, and users know
       | it.
        
         | alexchamberlain wrote:
          | I can't find one right now, but I feel like there must be an
          | algorithm that can identify the boundary key for each page
          | more cheaply than a full sort (i.e. more cheaply than offset
          | pagination in general). Of course, a skip list would work
         | for a secondary index.
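          |
          | Once such boundary keys exist, the paging itself is the easy
          | part: each page becomes an independent index range scan, so
          | pages can be fetched in any order or in parallel. A minimal
          | sketch in Go (table and column names are made up, Postgres-
          | style placeholders assumed):
          |
          |     package pages
          |
          |     import "database/sql"
          |
          |     // fetchRange reads one "page" as an index range scan
          |     // between two precomputed boundary keys; each range is
          |     // independent of the others.
          |     func fetchRange(db *sql.DB, lo, hi int64) (*sql.Rows, error) {
          |         return db.Query(
          |             `SELECT id, name FROM items
          |               WHERE id >= $1 AND id < $2
          |               ORDER BY id`,
          |             lo, hi)
          |     }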
        
         | Rafert wrote:
         | How do people even figure out they need to be on page 10?
         | Sounds like a filter of some kind (e.g. before a date) would be
          | better, except that people have trained themselves to use the
          | number of pages as a rough approximation.
         | 
         | Either way, 10 pages isn't so bad but tens of thousands can
         | become troublesome as explained on
         | https://shopify.engineering/pagination-relative-cursors
        
           | PurplePanda wrote:
           | Maybe they've been there before and happen to remember "page
           | 10".
           | 
           | Maybe someone else has been there before and told them "Go to
           | page ten".
           | 
           | Maybe you know that there are 20 pages, and are looking to
           | find the median, or are just informally playing around,
           | looking at the distribution.
           | 
           | Same as you'd do with an interesting page of a book. I don't
           | think I'd stop using page numbers in dead tree books if they
           | somehow came with filters.
        
       | ppeetteerr wrote:
       | Pagination of an immutable collection is one thing and can be
       | parallelized. Pagination of a mutable collection (e.g. a database
       | table), on the other hand, is risky since two requests might
       | return intersecting data if new data was added between the
       | requests being executed.
       | 
        | Getting a truly consistent result set requires relative page
        | tokens and, if the software demands it, a synchronization
        | mechanism.
        
         | simonw wrote:
         | Intersecting data is fine provided there's a unique ID for each
         | result that can be used to de-duplicate them.
         | 
         | Ideally I'd want a system that guarantees at-least-once
         | delivery of every item. I can handle duplicates just fine, what
         | I want to avoid is an item being missed out entirely due to the
         | way I break up the data.
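          |
          | A minimal sketch of that de-duplication, assuming every item
          | carries a unique ID (the Item type and fields are made up):
          |
          |     package main
          |
          |     import "fmt"
          |
          |     type Item struct {
          |         ID   string
          |         Name string
          |     }
          |
          |     // dedupe keeps the first occurrence of each ID, so
          |     // overlapping pages from parallel or repeated requests
          |     // collapse back to one copy per item.
          |     func dedupe(pages ...[]Item) []Item {
          |         seen := make(map[string]bool)
          |         var out []Item
          |         for _, page := range pages {
          |             for _, it := range page {
          |                 if !seen[it.ID] {
          |                     seen[it.ID] = true
          |                     out = append(out, it)
          |                 }
          |             }
          |         }
          |         return out
          |     }
          |
          |     func main() {
          |         a := []Item{{"1", "one"}, {"2", "two"}}
          |         b := []Item{{"2", "two"}, {"3", "three"}} // overlaps a
          |         fmt.Println(dedupe(a, b)) // [{1 one} {2 two} {3 three}]
          |     }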
        
           | ppeetteerr wrote:
           | It's more than just de-duplicating, tho. Imagine you query a
           | dataset and get something like a page count and a chunk size.
           | That page count cannot be trusted if the dataset is mutable.
           | If an item is inserted at the beginning of the set, you're
           | going to miss the last item.
           | 
           | Pagination is hard
        
             | the_arun wrote:
              | For the dynamic use case, DynamoDB has implemented pagination
             | with something called _lastEvaluatedKey_ - https://docs.aws
             | .amazon.com/amazondynamodb/latest/developerg...
             | 
              | This is different from LIMIT in an RDBMS.
             | 
             | Wouldn't this pattern solve the complexity you are talking
             | about?
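              |
              | Roughly, the consumer-side loop looks like this with the
              | Go SDK (v1); the table name and page size are made up:
              |
              |     package main
              |
              |     import (
              |         "fmt"
              |         "log"
              |
              |         "github.com/aws/aws-sdk-go/aws"
              |         "github.com/aws/aws-sdk-go/aws/session"
              |         "github.com/aws/aws-sdk-go/service/dynamodb"
              |     )
              |
              |     func main() {
              |         svc := dynamodb.New(session.Must(session.NewSession()))
              |         input := &dynamodb.ScanInput{
              |             TableName: aws.String("items"), // hypothetical
              |             Limit:     aws.Int64(100),
              |         }
              |         for {
              |             out, err := svc.Scan(input)
              |             if err != nil {
              |                 log.Fatal(err)
              |             }
              |             fmt.Println("got", len(out.Items), "items")
              |             // LastEvaluatedKey is the continuation token;
              |             // it is absent on the final page.
              |             if len(out.LastEvaluatedKey) == 0 {
              |                 break
              |             }
              |             input.ExclusiveStartKey = out.LastEvaluatedKey
              |         }
              |     }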
        
       | jasonhansel wrote:
       | It's important here that "created" is an _immutable_ attribute.
       | Otherwise you could get issues where the same item appears on
        | multiple lists (or doesn't appear at all) because its attributes
       | changed during the scanning process.
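        |
        | For example, a keyset-style cursor over an immutable pair such
        | as (created, id) stays stable even while other attributes
        | change. A rough sketch (table and column names are illustrative,
        | Postgres-style placeholders assumed):
        |
        |     package keyset
        |
        |     import (
        |         "database/sql"
        |         "time"
        |     )
        |
        |     // nextPage fetches the page after the last seen (created, id)
        |     // pair. Because both columns are immutable, an item can never
        |     // move between pages mid-scan; only brand-new items can
        |     // appear past the end.
        |     func nextPage(db *sql.DB, lastCreated time.Time, lastID int64,
        |         limit int) (*sql.Rows, error) {
        |         return db.Query(
        |             `SELECT id, created, name FROM items
        |               WHERE (created, id) > ($1, $2)
        |               ORDER BY created, id
        |               LIMIT $3`,
        |             lastCreated, lastID, limit)
        |     }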
        
       | adontz wrote:
        | I believe data export and/or backup should be a separate API,
        | one that runs at low priority and ensures consistency.
       | 
        | Here we just see regular APIs being abused for data export.
       | I'm rather surprised the author did not face rate limiting.
        
       | gampleman wrote:
       | To point out the obvious: generally API providers don't
        | particularly want you to parallelize your requests (they even
       | implement rate limiting to make it harder on purpose). If they
       | wanted to make it easy to get all the results, they would allow
       | you to access the data without pagination - just download all the
       | data in one go.
        
         | sb8244 wrote:
         | > If they wanted to make it easy to get all the results
         | 
         | Speaking from experience...we want to make it easy but also
         | want to keep it performant. Getting the data all in one go is
         | generally not performant and is easy to abuse as an API
         | consumer. For example, always asking for all of the data rather
         | than maintaining a cursor and secondary index (which is so much
         | more performant for everyone involved).
        
           | alexchamberlain wrote:
            | We provide (internal) access to data: interactive access
            | via GraphQL-based APIs and bulk access via
           | CSV or RDF dumps - I feel like dump files are grossly
           | undervalued these days.
        
             | sb8244 wrote:
             | I agree. I am going to reflect on this and see if there's a
             | way to support dump files long term in our app. We sorta
              | support it today but it's an ad hoc implementation since an
             | export can range from a few hundred of a thing to tens of
             | millions of a thing.
             | 
             | Is there any good literature or patterns on supporting
             | dumps in the tens of millions or larger?
             | 
             | I wrote a sheets plug-in that uses our cursor API to
             | provide a full dump into a spreadsheet. Our professional
             | services team is in love with it, so I bet they'd love
             | generic data export capability.
        
           | tshaddox wrote:
           | That's the point. Running multiple paginated queries in
           | parallel is essentially circumventing the API provider's
           | intent to limit the number of items requested at one time.
        
         | eyelidlessness wrote:
         | A certain level of parallelism is generally within the realm of
         | good API citizenship. Even naive rate limiting schemes tend to
         | permit a certain number of concurrent requests (as they well
         | should, since even browsers may perform concurrent requests
         | without any developer intervention).
         | 
         | Rate limiting and pagination aren't (necessarily) about making
         | full data consumption more difficult. They're more often about
         | optimizing common use cases and general quality of service.
         | 
         | Edit to add: in certain circles (eg those of us who take REST
         | and HATEOAS as baseline HTTP API principles), parallelism is
          | not just expected but often encouraged. A service can
         | provide efficient, limited subsets of a full representation and
         | allow clients to retrieve as little or as much of the full
         | representation as they see fit.
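          |
          | For illustration, a minimal sketch of that kind of polite
          | parallelism in Go, capping in-flight requests with a
          | semaphore (the endpoint, page count, and cap of 4 are all
          | made up):
          |
          |     package main
          |
          |     import (
          |         "fmt"
          |         "io"
          |         "log"
          |         "net/http"
          |         "sync"
          |     )
          |
          |     func main() {
          |         sem := make(chan struct{}, 4) // at most 4 in flight
          |         var wg sync.WaitGroup
          |         for page := 1; page <= 20; page++ {
          |             wg.Add(1)
          |             go func(page int) {
          |                 defer wg.Done()
          |                 sem <- struct{}{}        // take a slot
          |                 defer func() { <-sem }() // give it back
          |                 url := fmt.Sprintf(
          |                     "https://api.example.com/items?page=%d",
          |                     page)
          |                 resp, err := http.Get(url)
          |                 if err != nil {
          |                     log.Println(err)
          |                     return
          |                 }
          |                 defer resp.Body.Close()
          |                 body, _ := io.ReadAll(resp.Body)
          |                 fmt.Printf("page %d: %d bytes\n", page, len(body))
          |             }(page)
          |         }
          |         wg.Wait()
          |     }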
        
           | corty wrote:
            | One thing that frequently bugs me is APIs limiting the number
            | of items per page for reasons of efficiency. I can perfectly
           | understand low limits for other reasons, like not helping
           | people scrape your data.
           | 
           | But limiting for efficiency is usually done in a way that I
           | would call a cargo cult: First, the number of items per
           | "page" is usually a number one would pick per displayed page,
            | in the range of 10 to 20. This is inefficient in the general
            | case: the useful data per page is often about the same size
            | as the request and response headers, so roughly half of the
            | traffic is overhead. So if the API isn't strictly for display
            | purposes, pick a number of items per page that strikes a
            | useful balance between not transmitting too much useless data
            | and keeping query and response overhead low. Paginate in
            | chunks of 100 kB or more.
           | 
           | In terms of computation and backend load, pagination can be
           | as expensive for a 1-page-query as for a full query. Usually
            | this occurs when the query doesn't directly hit an index or
            | similar data structure, so a full sweep over all the data
            | cannot be avoided. So think and benchmark before you
           | paginate, and maybe add an index here and there.
        
       | eyelidlessness wrote:
       | I think keeping temporal history and restricting paginated
       | results to the data at the point in time where the first page was
       | retrieved would be a pretty decent way to solve offset based
       | _interfaces_ (regardless of the complexity of making the query
       | implementation efficient). Data with a lot of churn could churn
       | on, but clients would see a consistent view until they return to
       | the point of entry.
       | 
       | Obviously this has some potential caveats if that churn is also
       | likely to quickly invalidate data, or revoke sensitive
       | information. Time limits for historical data retrieval can be
       | imposed to help mitigate this. And individual records can be
       | revised (eg with bitemporal modeling) without altering the set of
       | referenced records.
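        |
        | A rough sketch of what that could look like on the query side,
        | covering the insert case only (table and column names are made
        | up; asOf would be taken from the first request and repeated by
        | the client on later pages):
        |
        |     package snapshot
        |
        |     import (
        |         "database/sql"
        |         "time"
        |     )
        |
        |     // pageAsOf pages by plain OFFSET, but only over rows that
        |     // already existed at asOf, so inserts made while the client
        |     // pages through cannot shift later pages.
        |     func pageAsOf(db *sql.DB, asOf time.Time,
        |         offset, limit int) (*sql.Rows, error) {
        |         return db.Query(
        |             `SELECT id, name FROM items
        |               WHERE created <= $1
        |               ORDER BY created, id
        |               OFFSET $2 LIMIT $3`,
        |             asOf, offset, limit)
        |     }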
        
         | ako wrote:
          | I think for most use cases, as a user I'd rather see the newest
          | items in a list than consistency of pagination. If I forget to
          | manually refresh, I might miss out on important new items.
         | 
         | Why do you think it is important for users to have temporal
         | consistency?
        
           | eyelidlessness wrote:
           | Well I'll use a recent example I encountered that was
           | actually very frustrating. I was looking for a font to use
           | for a logo for a personal project. The site I was using
           | (won't name and shame, and I can't recall the site now
           | anyway) had no sorting options, items were ordered by
           | whatever "popularity" formula they use. As I paginated, many
           | of the fonts I'd previously viewed would appear on subsequent
           | pages, often in a different order. It was frustrating not
           | just because I could tell that I was probably missing fonts
           | that were being bumped up to previous pages, but also because
           | it made me doubt my mental model of my own browsing history:
           | "Did I navigate back too far? Did I forget a tangential click
           | and end up on a different search path?"
           | 
           | It's not a great UX. And in some ways I suspect that _my own
           | views_ were at least partially causing it, which made me more
           | hesitant to even click on anything unless I was sure it was
           | worth the disruption.
        
             | ako wrote:
              | This doesn't sound like a "temporal consistency" problem,
              | but rather an inconsistent and opaque ordering issue.
        
               | eyelidlessness wrote:
               | Huh? It's absolutely a temporal consistency problem. The
               | ordering was absolutely clear and consistent (most->least
               | popular at time of request). But the popularity scores
               | were changing so rapidly that "at time of request" makes
               | the ordering _unstable_. If the ordering was determined
               | by popularity at the time I accessed page 1, the ordering
               | would have been stable.
               | 
               | Sure, that popularity score would be stale. But who
               | cares?
               | 
               | Think of it this way. Suppose you're viewing your Twitter
               | timeline in recent order, and suppose the pagination
               | (lazy loaded on scroll) worked this way, and suppose you
               | have new Tweets arriving at the same rate you scroll.
               | What you would see is the same page of Tweets repeat
               | forever (well, until you hit the pagination cap).
               | 
               | This is why people come up with solutions like cursors.
               | But what I was suggesting is that you can instead
               | continue to use offsets (for the benefits discussed in
               | the article like parallelism and predictability) if you
               | paginate on the state of the data _at the time you began_
               | (edit: or on the state of your sorting criteria at the
               | time you began, which allows for the mitigations I
               | described upthread).
               | 
               | That's not to suggest that once you begin a pagination,
               | you'll forever access stale data. It's to suggest that a
               | set of pagination requests can be treated as a session
               | accessing a stable snapshot.
               | 
               | This can also be totally transparent to the client, and
               | entirely optional (eg pagination is performed with an
               | offset and an optional validAt token).
        
               | ako wrote:
               | Ok, thanks for the explanation. I hadn't expected the
               | popularity score to be so unstable. Means that a lot of
               | users are concurrently scoring the fonts, and that the
               | averages are continuously being recalculated. Unexpected.
               | 
               | But you're right, if the dataset is continuously changing
               | at high frequency, pagination makes no sense.
        
       | arcbyte wrote:
       | I think you could accomplish something similar with token
       | pagination by requesting a number of items that will result in
       | multiple "pages" for your user interface. Then as the user
       | iterates through you can request additional items. This isn't
       | parallelizing, but provides the same low-latency user experience.
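        |
        | A small sketch of that prefetching idea on the client side; the
        | Buffer type, FetchPage hook, and page sizes are all made up:
        |
        |     package prefetch
        |
        |     // Buffer fronts a token-paginated API: it fetches in large
        |     // batches and serves smaller UI "pages" out of what it has
        |     // already buffered.
        |     type Buffer struct {
        |         // FetchPage stands in for one API call: it returns a
        |         // batch of items plus the token for the next batch
        |         // ("" when there is nothing left).
        |         FetchPage func(token string) (items []string, next string)
        |
        |         items []string
        |         next  string
        |         done  bool
        |     }
        |
        |     // UIPage returns the n-th UI page of pageSize items, hitting
        |     // the API only when the buffer doesn't reach that far yet.
        |     func (b *Buffer) UIPage(n, pageSize int) []string {
        |         need := (n + 1) * pageSize
        |         for len(b.items) < need && !b.done {
        |             batch, next := b.FetchPage(b.next)
        |             b.items = append(b.items, batch...)
        |             b.next, b.done = next, next == ""
        |         }
        |         lo, hi := n*pageSize, need
        |         if lo > len(b.items) {
        |             lo = len(b.items)
        |         }
        |         if hi > len(b.items) {
        |             hi = len(b.items)
        |         }
        |         return b.items[lo:hi]
        |     }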
        
       | felixhuttmann wrote:
       | A few thoughts:
       | 
       | 1) AWS dynamodb has a parallel scanning functionality for this
       | exact use case.
       | https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
       | 
       | 2) A typical database already internally maintains an
       | approximately balanced b-tree for every index. Therefore, it
        | should in principle be cheap for the database to return a list
        | of keys that approximately divide the key range into N similarly
        | sized ranges, even if the key distribution is very uneven. Is
        | anybody aware of a way this information could be obtained in a
        | query in e.g. Postgres? (A sketch of one option follows at the
        | end of this comment.)
       | 
        | 3) The term 'cursor pagination' is sometimes used for different
        | things, referring either to a database-level cursor or to an
        | opaque pagination token. Therefore, for the
       | concept described in the article, I have come to prefer the term
       | keyset pagination, as described in
       | https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin....
       | The term keyset pagination makes it clear that we are paginating
       | using conditions on a set of columns that form a unique key for
       | the table.
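        |
        | On 2): Postgres exposes its per-column histogram through the
        | pg_stats view, which can serve as a cheap source of approximate
        | range boundaries (accuracy depends on ANALYZE and the statistics
        | target). A rough sketch against a hypothetical items.id column:
        |
        |     package main
        |
        |     import (
        |         "database/sql"
        |         "fmt"
        |         "log"
        |
        |         _ "github.com/lib/pq" // driver choice is just an example
        |     )
        |
        |     func main() {
        |         db, err := sql.Open("postgres",
        |             "dbname=example sslmode=disable")
        |         if err != nil {
        |             log.Fatal(err)
        |         }
        |         // histogram_bounds is the planner's roughly equal-
        |         // frequency histogram for the column, refreshed by
        |         // ANALYZE; cast to text because its element type varies.
        |         var bounds string
        |         err = db.QueryRow(
        |             `SELECT histogram_bounds::text
        |                FROM pg_stats
        |               WHERE tablename = 'items'
        |                 AND attname = 'id'`).Scan(&bounds)
        |         if err != nil {
        |             log.Fatal(err) // e.g. no histogram collected yet
        |         }
        |         // Prints something like {1,9873,20041,...}; adjacent
        |         // values bound roughly equal-sized key ranges that could
        |         // seed parallel range scans.
        |         fmt.Println(bounds)
        |     }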
        
       | gigatexal wrote:
        | Until I saw the code sample in the article, I didn't know you
        | could append to a slice from within a go func.
        
         | mssundaram wrote:
         | As long as you use the mutex locks
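          |
          | For reference, a minimal version of that pattern (not the
          | article's exact code): goroutines appending pages into a
          | shared slice under a sync.Mutex, since append itself is not
          | goroutine-safe:
          |
          |     package main
          |
          |     import (
          |         "fmt"
          |         "sync"
          |     )
          |
          |     func main() {
          |         var (
          |             mu      sync.Mutex
          |             results []int
          |             wg      sync.WaitGroup
          |         )
          |         for page := 0; page < 10; page++ {
          |             wg.Add(1)
          |             go func(page int) {
          |                 defer wg.Done()
          |                 // stand-in for a fetched page of results
          |                 items := []int{page * 100, page*100 + 1}
          |                 mu.Lock()
          |                 results = append(results, items...)
          |                 mu.Unlock()
          |             }(page)
          |         }
          |         wg.Wait()
          |         fmt.Println(len(results)) // 20
          |     }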
        
       ___________________________________________________________________
       (page generated 2021-01-03 23:01 UTC)