[HN Gopher] What's good about offset pagination; designing paral...
___________________________________________________________________

What's good about offset pagination; designing parallel
cursor-based web APIs

Author : clra
Score  : 54 points
Date   : 2021-01-03 17:18 UTC (5 hours ago)

(HTM) web link (brandur.org)
(TXT) w3m dump (brandur.org)

draw_down wrote:
| > it uses offsets for pagination... understood to be bad practice
| by today's standards. Although convenient to use, offsets are
| difficult to keep performant in the backend
|
| This is funny. Using offsets is known to be bad practice
| because... it's hard to do.
|
| Look, I'm just a UI guy, so what do I know. But this argument
| gets old because, I'm sorry, people want a paginated list and to
| know how many pages are in it. Clicking "next page" 10 times
| instead of clicking straight to page 10 is bullshit, and users
| know it.

alexchamberlain wrote:
| I can't find one right now, but I feel like there must be an
| algorithm that can identify the key to use for each page more
| cheaply than a full sort (i.e. cheaper than offset pagination in
| general). Of course, a skip list would work for a secondary
| index.

Rafert wrote:
| How do people even figure out they need to be on page 10? It
| sounds like a filter of some kind (e.g. before a date) would
| serve them better, except they've trained themselves to use the
| page count as an approximation.
|
| Either way, 10 pages isn't so bad, but tens of thousands can
| become troublesome, as explained at
| https://shopify.engineering/pagination-relative-cursors

PurplePanda wrote:
| Maybe they've been there before and happen to remember "page
| 10".
|
| Maybe someone else has been there before and told them "go to
| page ten".
|
| Maybe you know that there are 20 pages and are looking to find
| the median, or are just informally playing around, looking at
| the distribution.
|
| Same as you'd do with an interesting page of a book. I don't
| think I'd stop using page numbers in dead-tree books if they
| somehow came with filters.

ppeetteerr wrote:
| Pagination of an immutable collection is one thing and can be
| parallelized. Pagination of a mutable collection (e.g. a
| database table), on the other hand, is risky, since two requests
| might return intersecting data if new data was added between
| them.
|
| True result sets require relative page tokens, plus a
| synchronization mechanism if the software demands it.

simonw wrote:
| Intersecting data is fine provided there's a unique ID for each
| result that can be used to de-duplicate them.
|
| Ideally I'd want a system that guarantees at-least-once delivery
| of every item. I can handle duplicates just fine; what I want to
| avoid is an item being missed entirely because of the way I
| break up the data.

ppeetteerr wrote:
| It's more than just de-duplicating, though. Imagine you query a
| dataset and get back a page count and a chunk size. That page
| count cannot be trusted if the dataset is mutable: if an item is
| inserted at the beginning of the set, you're going to miss the
| last item.
|
| Pagination is hard.

the_arun wrote:
| For the dynamic use case, DynamoDB implements pagination with
| something called _lastEvaluatedKey_ :
| https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
|
| This is different from LIMIT in an RDBMS. Wouldn't this pattern
| solve the complexity you are talking about?
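For illustration, the loop those docs describe looks roughly like
this in Go. This is a minimal sketch assuming the aws-sdk-go v1 API;
the "items" table name is made up.

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/dynamodb"
    )

    func main() {
        svc := dynamodb.New(session.Must(session.NewSession()))

        // Each Scan returns up to Limit items plus, if the scan is
        // not finished, a LastEvaluatedKey. Feeding that key back in
        // as ExclusiveStartKey resumes where the previous page left
        // off: a cursor, not an offset.
        input := &dynamodb.ScanInput{
            TableName: aws.String("items"), // hypothetical table
            Limit:     aws.Int64(100),
        }
        for {
            out, err := svc.Scan(input)
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("got %d items\n", len(out.Items))

            // An absent LastEvaluatedKey means the scan is complete.
            if len(out.LastEvaluatedKey) == 0 {
                break
            }
            input.ExclusiveStartKey = out.LastEvaluatedKey
        }
    }

Like any cursor scheme, this serializes the pages; DynamoDB's answer
to that is the parallel scan mentioned further down in the thread,
which splits the table into segments via the Segment and
TotalSegments scan parameters.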
jasonhansel wrote:
| It's important here that "created" is an _immutable_ attribute.
| Otherwise you could get issues where the same item appears on
| multiple lists (or doesn't appear at all) because its attributes
| changed during the scanning process.

adontz wrote:
| I believe data export and/or backup should be a separate API,
| one that is low priority and ensures consistency.
|
| What we see here is regular APIs being abused for data export.
| I'm rather surprised the author did not face rate limiting.

gampleman wrote:
| To point out the obvious: API providers generally don't want you
| to parallelize your requests (they even implement rate limiting
| to make it harder on purpose). If they wanted to make it easy to
| get all the results, they would let you access the data without
| pagination: just download all the data in one go.

sb8244 wrote:
| > If they wanted to make it easy to get all the results
|
| Speaking from experience... we want to make it easy but also
| want to keep it performant. Getting all the data in one go is
| generally not performant and is easy to abuse as an API
| consumer, e.g. always asking for all of the data rather than
| maintaining a cursor and a secondary index (which is so much
| more performant for everyone involved).

alexchamberlain wrote:
| We provide (internal) access to data, with interactive access
| via GraphQL-based APIs and bulk access via CSV or RDF dumps. I
| feel like dump files are grossly undervalued these days.

sb8244 wrote:
| I agree. I am going to reflect on this and see if there's a way
| to support dump files long term in our app. We sorta support it
| today, but it's an ad hoc implementation, since an export can
| range from a few hundred of a thing to tens of millions of a
| thing.
|
| Is there any good literature on, or are there patterns for,
| supporting dumps in the tens of millions or larger?
|
| I wrote a Sheets plug-in that uses our cursor API to provide a
| full dump into a spreadsheet. Our professional services team is
| in love with it, so I bet they'd love a generic data export
| capability.

tshaddox wrote:
| That's the point. Running multiple paginated queries in parallel
| essentially circumvents the API provider's intent to limit the
| number of items requested at one time.

eyelidlessness wrote:
| A certain level of parallelism is generally within the realm of
| good API citizenship. Even naive rate limiting schemes tend to
| permit a certain number of concurrent requests (as they well
| should, since even browsers may perform concurrent requests
| without any developer intervention).
|
| Rate limiting and pagination aren't (necessarily) about making
| full data consumption more difficult. They're more often about
| optimizing common use cases and general quality of service.
|
| Edit to add: in certain circles (e.g. those of us who take REST
| and HATEOAS as baseline HTTP API principles), parallelism is
| often not just expected but encouraged. A service can provide
| efficient, limited subsets of a full representation and allow
| clients to retrieve as little or as much of the full
| representation as they see fit.
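A minimal sketch of what that kind of polite parallelism can look
like from the client side. The endpoint, page count, and concurrency
cap below are all hypothetical; the point is the bounded semaphore,
which keeps the client well inside a typical rate limit while still
exploiting the independence of offset-addressed pages.

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "sync"
    )

    func main() {
        const (
            pages       = 20 // hypothetical: total known up front
            maxInFlight = 4  // stay well under the provider's limits
        )

        sem := make(chan struct{}, maxInFlight) // concurrency cap
        results := make([]string, pages)
        var wg sync.WaitGroup

        for p := 0; p < pages; p++ {
            wg.Add(1)
            go func(page int) {
                defer wg.Done()
                sem <- struct{}{}        // acquire a slot
                defer func() { <-sem }() // release it

                // Offset pagination makes every page URL independent
                // of the others, which is what allows fetching them
                // in parallel at all.
                url := fmt.Sprintf(
                    "https://api.example.com/items?page=%d", page)
                resp, err := http.Get(url)
                if err != nil {
                    return // a real client would retry
                }
                defer resp.Body.Close()
                body, _ := io.ReadAll(resp.Body)

                // Each goroutine writes a distinct index, so no lock
                // is needed here (appending to a shared slice, by
                // contrast, would need one).
                results[page] = string(body)
            }(p)
        }
        wg.Wait()
        fmt.Println("fetched", pages, "pages")
    }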
corty wrote:
| One thing that frequently bugs me is APIs limiting the number of
| items per page for reasons of efficiency. I can perfectly
| understand low limits for other reasons, like not helping people
| scrape your data.
|
| But limiting for efficiency is usually done in a way I would
| call a cargo cult. The number of items per "page" is usually a
| number one would pick for a displayed page, in the range of 10
| to 20. This is inefficient in the general case: the useful
| payload ends up about the same size as the request and response
| headers around it. So if the API isn't strictly for display
| purposes, pick a page size that balances not transmitting too
| much useless data against keeping query and response overhead
| low. Paginate in chunks of 100 kB or more.
|
| In terms of computation and backend load, a one-page query can
| be as expensive as a full query. This usually happens when the
| query doesn't directly hit an index or a similar data structure,
| so a full sweep over all the data cannot be avoided. So think
| and benchmark before you paginate, and maybe add an index here
| and there.

eyelidlessness wrote:
| I think keeping temporal history and restricting paginated
| results to the data as it stood when the first page was
| retrieved would be a pretty decent way to fix offset-based
| _interfaces_ (regardless of the complexity of making the query
| implementation efficient). Data with a lot of churn could churn
| on, but clients would see a consistent view until they return to
| the point of entry.
|
| Obviously this has some potential caveats if that churn is also
| likely to quickly invalidate data or revoke sensitive
| information. Time limits on historical data retrieval can be
| imposed to help mitigate this. And individual records can be
| revised (e.g. with bitemporal modeling) without altering the set
| of referenced records.

ako wrote:
| I think for most use cases, as a user I'd rather see the newest
| items in a list than have consistency of pagination. If I forget
| to manually refresh, I might miss out on important new items.
|
| Why do you think it is important for users to have temporal
| consistency?

eyelidlessness wrote:
| Well, I'll use a recent example I ran into that was actually
| very frustrating. I was looking for a font to use for a logo for
| a personal project. The site I was using (won't name and shame,
| and I can't recall the site now anyway) had no sorting options;
| items were ordered by whatever "popularity" formula they use. As
| I paginated, many of the fonts I'd previously viewed would
| appear again on subsequent pages, often in a different order. It
| was frustrating not just because I could tell I was probably
| missing fonts that were being bumped up to earlier pages, but
| also because it made me doubt my mental model of my own browsing
| history: "Did I navigate back too far? Did I forget a tangential
| click and end up on a different search path?"
|
| It's not a great UX. And in some ways I suspect that _my own
| views_ were at least partially causing it, which made me more
| hesitant to click on anything unless I was sure it was worth the
| disruption.

ako wrote:
| This doesn't sound like a "temporal consistency" problem so much
| as an inconsistent and opaque ordering issue.

eyelidlessness wrote:
| Huh? It's absolutely a temporal consistency problem. The
| ordering was perfectly clear and consistent (most to least
| popular at time of request). But the popularity scores were
| changing so rapidly that "at time of request" makes the ordering
| _unstable_. If the ordering had been determined by popularity at
| the time I accessed page 1, it would have been stable.
|
| Sure, that popularity score would be stale. But who cares?
|
| Think of it this way: suppose you're viewing your Twitter
| timeline in recent order, suppose the pagination (lazy-loaded on
| scroll) worked this way, and suppose new tweets arrive at the
| same rate you scroll. What you would see is the same page of
| tweets repeating forever (well, until you hit the pagination
| cap).
|
| This is why people come up with solutions like cursors. But what
| I was suggesting is that you can instead keep using offsets (for
| the benefits discussed in the article, like parallelism and
| predictability) if you paginate against the state of the data
| _at the time you began_ (edit: or against the state of your
| sorting criteria at the time you began, which allows for the
| mitigations I described upthread).
|
| That's not to suggest that once you begin a pagination you'll
| forever access stale data. It's to suggest that a set of
| pagination requests can be treated as a session accessing a
| stable snapshot.
|
| This can also be totally transparent to the client, and entirely
| optional (e.g. pagination is performed with an offset and an
| optional validAt token).

ako wrote:
| Ok, thanks for the explanation. I hadn't expected the popularity
| score to be so unstable. That means a lot of users are
| concurrently scoring the fonts and the averages are continuously
| being recalculated. Unexpected.
|
| But you're right: if the dataset is continuously changing at
| high frequency, plain offset pagination makes no sense.
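A rough sketch of that validAt idea, assuming a Postgres history
table along the lines of the bitemporal modeling mentioned upthread.
Every identifier here (the table, columns, endpoint, and token name)
is illustrative rather than taken from the article.

    package main

    import (
        "database/sql"
        "encoding/json"
        "net/http"
        "strconv"
        "time"

        _ "github.com/lib/pq" // assumed Postgres driver
    )

    var db *sql.DB

    // listFonts serves offset-paginated results pinned to a
    // snapshot. The first request gets validAt = "now" and the
    // client echoes the token back on later pages, so every page
    // ranks by the same popularity scores even while the live
    // scores keep churning.
    func listFonts(w http.ResponseWriter, r *http.Request) {
        offset, _ := strconv.Atoi(r.URL.Query().Get("offset"))
        validAt := r.URL.Query().Get("validAt")
        if validAt == "" {
            validAt = time.Now().UTC().Format(time.RFC3339)
        }

        rows, err := db.Query(`
            SELECT font_id, name
            FROM font_popularity -- hypothetical history table
            WHERE valid_from <= $1
              AND (valid_to IS NULL OR valid_to > $1)
            ORDER BY score DESC, font_id -- unique tiebreaker
            LIMIT 20 OFFSET $2`, validAt, offset)
        if err != nil {
            http.Error(w, err.Error(),
                http.StatusInternalServerError)
            return
        }
        defer rows.Close()

        type font struct{ ID, Name string }
        var out struct {
            ValidAt string `json:"validAt"` // token for next pages
            Fonts   []font `json:"fonts"`
        }
        out.ValidAt = validAt
        for rows.Next() {
            var f font
            if err := rows.Scan(&f.ID, &f.Name); err != nil {
                http.Error(w, err.Error(),
                    http.StatusInternalServerError)
                return
            }
            out.Fonts = append(out.Fonts, f)
        }
        json.NewEncoder(w).Encode(out)
    }

    func main() {
        var err error
        db, err = sql.Open("postgres",
            "postgres://localhost/fonts?sslmode=disable")
        if err != nil {
            panic(err)
        }
        http.HandleFunc("/fonts", listFonts)
        http.ListenAndServe(":8080", nil)
    }

Because every page of a session filters on the same validAt, the
offsets stay stable and pages can even be fetched in parallel; the
time-limit mitigation described above would just mean rejecting
tokens older than some cutoff.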
arcbyte wrote:
| I think you could accomplish something similar with token
| pagination by requesting a number of items that will fill
| multiple "pages" of your user interface, then requesting
| additional items as the user pages through. This isn't
| parallelizing, but it provides the same low-latency user
| experience.

felixhuttmann wrote:
| A few thoughts:
|
| 1) AWS DynamoDB has parallel scanning functionality for this
| exact use case:
| https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
|
| 2) A typical database already maintains an approximately
| balanced B-tree internally for every index. It should therefore
| in principle be cheap for the database to return a list of keys
| that divide the key range into N similarly sized ranges, even if
| the key distribution is very uneven. Is anybody aware of a way
| to obtain this information in a query in e.g. Postgres?
|
| 3) The term "cursor pagination" is sometimes used for different
| things, referring either to the in-database concept of a cursor
| or to an opaque pagination token. For the concept described in
| the article, I have therefore come to prefer the term keyset
| pagination, as described in
| https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...
| That term makes it clear that we are paginating using conditions
| on a set of columns that form a unique key for the table.

gigatexal wrote:
| From the code sample in the article I didn't know you could
| append to a slice from within a go func.

mssundaram wrote:
| As long as you use the mutex locks
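The pattern they are describing looks roughly like this; a sketch
based on the thread's description rather than the article's actual
code. The append itself is ordinary Go, but because append rewrites
the slice header (and may reallocate the backing array), concurrent
appends are a data race unless they are serialized with a mutex
("go run -race" will flag the unlocked version).

    package main

    import (
        "fmt"
        "sync"
    )

    // fetchPage is a hypothetical stand-in for fetching one page of
    // results from an API.
    func fetchPage(p int) []int {
        return []int{p * 2, p*2 + 1}
    }

    func main() {
        var (
            mu      sync.Mutex
            results []int
            wg      sync.WaitGroup
        )

        for page := 0; page < 10; page++ {
            wg.Add(1)
            go func(p int) {
                defer wg.Done()
                items := fetchPage(p)

                // The lock makes the read-modify-write of the slice
                // header atomic with respect to the other goroutines.
                mu.Lock()
                results = append(results, items...)
                mu.Unlock()
            }(page)
        }
        wg.Wait()
        fmt.Println("collected", len(results), "items")
    }

___________________________________________________________________
(page generated 2021-01-03 23:01 UTC)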