[HN Gopher] Paginating Requests in APIs (2020)
___________________________________________________________________

Paginating Requests in APIs (2020)

Author : exemel_hn
Score  : 181 points
Date   : 2022-05-28 15:31 UTC (7 hours ago)

(HTM) web link (ignaciochiazzo.medium.com)
(TXT) w3m dump (ignaciochiazzo.medium.com)

| a-dub wrote:
| cursors result in expensive server side state. if ordering is
| less important an alternative is to make cardinality queries
| cheaper and then allow the client to request sections, thereby
| making the scheme stateless for the purposes of the server.
|
| so one query returns an approximate cardinality of the result set,
| or a partition count based on the approximate size of the result
| set and a fixed request/response size.
|
| the second to n queries are gets, but the gets include the
| partition count and the partition requested. the server uses a
| hashing scheme to either reduce the input data used for the
| construction of the response or filter the response itself.
|
| use of numerical partition ids and hashing allows the response
| to maintain consistency over time, as well as pushing the
| sharding/selection scheme as deep as necessary into the data
| layer to avoid scalability issues.
|
| requests that are too big are cancelled and given abuse points.
| therusskiy wrote:
| How do you tell on the backend if a page is the last one? The
| trick I used to avoid extra SQL queries is: if a user asks for 10
| records, I load up to 11 and then return 10. If there were 11, I
| return a cursor for the next page; otherwise the next page is
| null. Is there a better way?
| yashap wrote:
| Just let people query for an empty page every now and then, no
| big deal.
| whenlambo wrote:
| Moreover, you can query, like, +3 more, and if the extra is
| less than 3, return those extra items (the short tail) too. This
| way, you won't have a last page with only 1-2 items.
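The limit-plus-one trick described above reads roughly like this in code (a minimal sketch against an in-memory SQLite table; the table and column names are invented for illustration):

```python
import sqlite3

def fetch_page(conn, after_id, limit):
    # Ask for limit + 1 rows; the extra row only tells us whether
    # another page exists and is never returned to the caller.
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    page = rows[:limit]
    # The cursor for the next page is simply the last id we returned;
    # None signals the final page, avoiding a separate COUNT query.
    next_cursor = page[-1][0] if has_more else None
    return page, next_cursor

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)", [(f"item{i}",) for i in range(25)]
)

page, cursor = fetch_page(conn, 0, 10)  # 10 rows back, next_cursor = 10
```

With 25 rows and pages of 10, the third page comes back with 5 rows and a None cursor, so the client never requests an empty page.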
| omegalulw wrote:
| How do you deal with cases where results must be ordered by
| some field and the data changes between the page 1 request and
| the page 2 request?
| NonNefarious wrote:
| You could select into a temporary table and return pages of
| results from that.
|
| You'd need some mechanism to delete the temporary table
| after a time, of course, but I imagine this is not
| uncommon.
|
| Another method would be having a last-modified timestamp on
| every row in the DB; then you could select records at or
| before the initial query time. Seems like overkill just for
| this one purpose, but I imagine it might be generally
| useful information to have in the DB.
| spockz wrote:
| If the ordering changed because the data changed, then just
| show the new situation at that page. If the results' order
| is important there are several things you can do (at
| least):
|
| 1. You could save the order under a token and iterate
| with respect to that saved order, only showing the data for
| the ids that were originally included. 2. Instead of
| continuing from a page count, show the next amount. Any
| mutation before that point will be invisible without having
| to store the original result list (only ids).
| noneeeed wrote:
| That's what we do. I'm not sure there is another way other than
| making a request for the next page.
| drbojingle wrote:
| You could ask for a count of records and divide by the page
| size.
| neoglow wrote:
| Counts become incredibly slow on some databases like
| Postgres. It needs to do a sequential scan, which results in
| slow(er) performance on even tens of thousands of rows. When
| operating at scale, this is very, very slow.
| kadoban wrote:
| This seems unlikely to depend on which database software.
| It could easily depend on what indexes exist for the query.
| fabian2k wrote:
| From what I understand it's not slow if it can use an
| index, but the criteria for that are rather hard to
| understand as a normal user.
If you vacuum often enough and | the visibility map is current, it doesn't need a full table | scan. | spurgu wrote: | This is what I do, since counts should cause minimal overhead | (I've never had to deal with huge databases). But neither | does the N+1 records trick which OP describes, that's clever. | :) | SonOfLilit wrote: | Counts can cause significant overhead on even millions of | rows. | spurgu wrote: | Thanks, that's good to know (will have to perform some | tests at some point). I do have a few tables (namely | logs) that will have millions of rows eventually over | time. But there I've also been planning on rotating them | whenever that becomes a problem, so I'm not overly | concerned. Or experiment with OP's solution. | MereInterest wrote: | If I were to ask the number of even numbers greater than | zero and less than 10 billion, you could very quickly | give an exact answer. If I were to ask the exact number | of prime numbers on that same range, it would take much | longer to find the correct answer. | | In the same way, asking for counts may or may not be a | quick answer, depending on both the table structure and | the query being performed. Query the number of customers | who started this year, on a table indexed by start date, | and it's incredibly cheap. If that same table is only | indexed by last name, then counts become more expensive. | I'd say that the advice to return the counts has an | implicit addendum to make sure that your database is | appropriately indexed such that your typical queries are | cheap to perform. | inferiorhuman wrote: | So I'm coming at this from a Postgres POV, IIRC MySQL can | take some shortcuts for COUNT(*). Constraining the query | will help but counting is expensive as a sequential table | scan is always required. | | Meanwhile I can only think of one use case for pagination | with exact row counts - a user facing application. If | you're preallocating storage you can get by with an | estimate. 
| If you're consuming an API and want something
| from near the end, you can flip the sort order.
| nolevels wrote:
| Not necessarily in postgres. If an index is available it
| may use that, though it will still have to refer to the
| visibility map to see if any row is visible to the
| current transaction (cheap check), and if it can't
| determine that from the map, then and only then does it need
| to also refer to the heap / actual table (slower check).
| inferiorhuman wrote:
| Is that a recent change? Back in 2016 the Citus guys
| claimed that:
|
|     There is no single universal row count that the
|     database could cache, so it must scan through all
|     rows counting how many are visible. Performance
|     for an exact count grows linearly with table size.
|
| https://www.citusdata.com/blog/2016/10/12/count-performance/
| nolevels wrote:
| Apologies, I should've clarified. I just meant postgres
| doesn't necessarily have to scan the actual table to get
| the count in some cases if an index is available, not
| that it caches the actual count.
| inferiorhuman wrote:
| So that's not the greatest quote, but the article shows
| postgres having to fall back on a sequential scan for a
| naive COUNT(*). The author goes on to mention that as of
| 9.2 Postgres can do an index scan (in the context of
| SELECT DISTINCT):
|
|     But now we come to a quirk, SELECT COUNT(DISTINCT n)
|     FROM items will not use the index even though SELECT
|     DISTINCT n does. As many blog posts mention ("one
|     weird trick to make postgres 50x faster!") you can
|     guide the planner by rewriting count distinct as the
|     count of a subquery
|
| To me it seems like you can trick the query planner into
| doing an index scan sometimes, but that it's a bit
| brittle. Has this improved much since?
| inferiorhuman wrote:
| Sure, but at that point you've really got to start thinking
| about architecture.
|
| Do you really need to paginate millions of rows _and_
| count them? Can you distribute the database and
| parallelize counting?
| Can you save counts somewhere else?
| Can you use estimates?
| [deleted]
| [deleted]
| SPBS wrote:
| So cursor pagination is just keyset pagination, except instead of
| using a column value directly you use an opaque string that gets
| translated back to a column value. So you can modify how the
| cursor gets translated back to a column value at any time without
| breaking API compatibility.
|
| That said, I'm not sure if everyone agrees on that definition;
| from what I know, people do consider keyset pagination and cursor
| pagination to mean basically the same thing.
| richbell wrote:
| > So cursor pagination is just keyset pagination, except
| instead of using a column value directly you use an opaque
| string that gets translated back to a column value.
|
| HashIds is a popular solution if those columns are numerical or
| can be represented numerically (e.g. timestamp).
|
| https://hashids.org/
| noneeeed wrote:
| I think that's what the article is saying. I found the
| descriptions didn't really explain the differences very well.
|
| We use a cursor that combines the sort key and normally the ID
| as a secondary sort and pointer value, otherwise you can't
| paginate through data where the sort column has duplicates. We
| don't really do much to obfuscate these, but we don't document
| the structure and people who try to tweak the values won't get
| far.
| giaour wrote:
| One advantage of opaque cursors is that they can include
| information that's not part of your public API. For instance,
| if you allow soft deletes and do not return deleted items in
| the list, you can end up with a massive stretch of deleted
| items separating two live results. You can leave a note in the
| cursor indicating what record was last scanned, and that's not
| possible in keyset pagination.
| blenderdt wrote:
| A REST API can't use page based pagination because it's not
| stateless. It is also not cacheable.
| ok123456 wrote:
| if the page, limit and offset are parameters it is.
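The opaque-cursor wrapping discussed above can be sketched in a few lines (the JSON payload shape here is an invented example, not any particular API's format):

```python
import base64
import json

def encode_cursor(position):
    # The payload shape is private to the server; clients only ever
    # see an opaque string, so the internals can change at any time
    # without breaking API compatibility.
    raw = json.dumps({"k": position}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(token):
    raw = base64.urlsafe_b64decode(token.encode())
    return json.loads(raw)["k"]

# A keyset position (sort-column values of the last row seen)
# round-trips through the opaque token unchanged.
token = encode_cursor(["2022-05-28T18:00Z", 3])
```

Real implementations often sign or encrypt the payload as well, so clients cannot tamper with the position.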
| blenderdt wrote: | If you query the first page with a limit of 10, then insert | 10 items, do you get the same items with the same parameters? | isbvhodnvemrwvn wrote: | Using that logic you can only have restful APIs if your | application is useless and data never changes (which is | fine for me, I hate debates about restfulness) | ok123456 wrote: | If you really want to cache the results use e-tags based on | a hash set on write, or just include that hash as part of | your request as well. | skrebbel wrote: | It's exactly as RESTful and cacheable and stateless as common | websites, eg the HN frontpage or a newspaper homepage. I think | your bar for when something is RESTful or not is too high to be | useful. | blenderdt wrote: | Edit: my bad, I was looking at the wrong 'more' button. | joemi wrote: | How is https://news.ycombinator.com/news?p=2 not | pagination? | croes wrote: | Why? The server doesn't need to know if I already requested any | previous page when I request page 9. And if the data doesn't | change it can easily be cached. | hinkley wrote: | "Rest in Practice" (if I'm remembering the right book) would | disagree strenuously with you. | | Once you're doing REST instead of just "somewhat sane JSON | responses" then the response has self-descriptive elements, | including links to related resources. And importantly, URLs for | older resources become static very, very quickly. Not unlike | the internals of subversion or git. | | The trick there is to realize that O(1) = O(2). If you're | trying to show 25 items per page, you _do not need a query that | returns the most recent 25 items_. That creates a situation | where every single interaction is distinct from every other | interaction. | | Asking for 25 results and everything newer than that is barely | more complicated than pagination already is. | branko_d wrote: | I don't like the term "cursor-based" pagination. 
| Database cursors
| are different from this, have been around for decades, and can
| actually be used to implement pagination (albeit requiring
| the transaction to be kept alive between pages, which can be...
| problematic).
|
| Perhaps, "token-based" pagination would be a better term?
| abraxas wrote:
| This is highly aggravating and confusing, when hipsters try to
| co-opt terms that have had a defined meaning for decades. The ones
| that grate on me are "retina display", "real time", "x-nanometer
| node".
| [deleted]
| blurker wrote:
| I agree. I was unsure if the author meant db cursors, which
| would mean keeping transactions alive. That has obvious scaling
| issues. But reading the comments here has me convinced they are
| talking about application-generated cursors. I think this is a
| pretty reasonable confusion and it would benefit from added
| clarity, like naming them differently.
| kortex wrote:
| > Perhaps, "token-based" pagination would be a better term?
|
| Let's do it. When I was first learning about it, I wondered how
| the concept related to database cursors, if at all.
|
| At its most simple, it's really just a pointer to the next item.
| But it really is just any opaque blob. Maybe you want it to be
| a jwt/encrypted blob with a request counter so you can track
| request usage by specific accounts (along with the pointer). jk
| that's probably a terrible idea (too entangled). Idk, just came
| up with it. Point is, you _could_ do something like that, it's
| just a blob opaque to the client.
|
| So I like "token-based pagination".
| [deleted]
| drdec wrote:
| Thank you, now I don't need to write this comment
| dwaite wrote:
| I tried an experiment a while back to handle this using custom
| units for the "Range" HTTP header.
|
| The biggest drawback is that the expectation is still that data
| would be valid concatenated, so (for example) multiple results
| would be expressed as multiple JSON values concatenated together,
| rather than entries in a single array.
| aaaaaaaaaaab wrote:
| theteapot wrote:
| If I take the keyset approach then obfuscate my sorting key+value
| in some token, it becomes a "cursor"? The only point in doing this
| is to stop abuse of large offsets?
|
| Also, the article states that a con of keyset is you can't use
| offsets. Technically you can, just the same as paging (WHERE ID >
| X ORDER BY ID LIMIT Y OFFSET Z)??
| sa46 wrote:
| The API Improvement Proposals (AIP, yes it's a terrible acronym)
| are a rich source of API design wisdom. It assumes protobufs but
| the majority of proposals would work for any API.
|
| The relevant one is https://google.aip.dev/158, which recommends
| cursor-based pagination and covers different pagination
| use-cases, including:
|
| - support for different sort orders - mentioned in the article as
| a weakness of keyset pagination
|
| - support for skipping results - mentioned as a weakness of
| cursor-based pagination
|
| - why to use opaque page tokens
| grogers wrote:
| This is a good reference, but IMO page tokens need to expire,
| even if you aren't storing any data for them. Expiring tokens
| make it possible to change pagination formats over time (you
| only need to wait out the expiry time while supporting both
| formats), e.g. to improve efficiency.
|
| Also I'll add that if you support a pagination API that accepts
| args (e.g. to filter on different dimensions) you should
| include a hash of the args in the opaque next-page token, and
| fail the call if the args change. This prevents some nonsense
| scenarios where you started paging by scanning index A but the
| user switched filters such that you'd instead scan index B, but
| you don't have the key for that column.
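The args-hash idea above can be sketched like this (a hypothetical token format; the field names and hash length are made up for illustration):

```python
import base64
import hashlib
import json

def _hash_args(filters):
    # Canonicalize the request arguments so the same filters always
    # produce the same hash regardless of key order.
    canon = json.dumps(filters, sort_keys=True).encode()
    return hashlib.sha256(canon).hexdigest()[:16]

def make_page_token(last_key, filters):
    payload = json.dumps({"k": last_key, "h": _hash_args(filters)}).encode()
    return base64.urlsafe_b64encode(payload).decode()

def parse_page_token(token, filters):
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    if payload["h"] != _hash_args(filters):
        # The caller changed filters mid-pagination; the saved scan
        # position is meaningless for the new query, so fail loudly.
        raise ValueError("page token does not match request arguments")
    return payload["k"]
```

Failing the call is deliberate: silently continuing from the old position against a different index is exactly the nonsense scenario the comment describes.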
| fabian2k wrote: | Are these actually server-side cursors or keyset pagination? For | keyset pagination you need essentially the last value of the | columns you used to sort by, but this seems to have some random | or encoded cursor value. I didn't think server-side cursors would | be used at that scale so this has to be keyset pagination, so I | think I'm missing something. | | And while I really like keyset pagination, it is annoying to | implement in some cases. If your sorting column is not unique you | need to sort by multiple columns. And that is really annoying and | less efficient if you don't have row value comparisons. And not | all databases and especially not all ORMs support those. I'm | still waiting on EF Core support here, it looks a bit like this | might happen but it's not certain yet. | xdfgh1112 wrote: | Surely a keyset, even if it's multi-key, is more performant | than an offset query. Offset queries usually require the | previous rows to be computed first, so large offsets are super | slow. | Jweb_Guru wrote: | It is, but it's not a cursor in the traditional sense. | ko27 wrote: | EF core now supports it with PostgreSQL | | https://github.com/npgsql/efcore.pg/pull/2350 | oconnor663 wrote: | By server-side cursors do you mean like open connections to | MySQL reading a stream of query results, or something like | that? I don't think that's feasible for most websites. The | biggest issue is that you'd have to guarantee that when someone | clicks "next page", their browser ends up talking to the exact | same machine that it got the previous page from. But websites | running on big fleets of machines don't usually work like that; | instead your queries are usually somewhat randomly distributed | across a whole tier. 
| You might work around that by trying to
| keep a persistent connection open from the browser to the
| server, but you'd lose that connection when you refreshed the
| page, when the server restarted, or if you lost WiFi for a
| minute. Making hiccups like that visible to the user is pretty
| frustrating. It's a lot more robust to use some sort of cursor
| design that doesn't rely on any persistent connections.
|
| Some other problems that crop up:
|
| - A lot of websites expect to serve most results from cache,
| and to only hit the DB a minority of the time, but to get your
| hands on a real DB cursor you'd have to hit the DB every time.
|
| - Not all DBs support moving cursors backwards. Maybe most of
| them don't? I'm not sure.
|
| - Open DB connections are usually a somewhat limited resource.
| gruez wrote:
| > but this seems to have some random or encoded cursor value
|
| Maybe the actual key value is stored server side for
| obfuscation/optimization reasons?
| terandle wrote:
| Wish this article had gone into detail about how cursor-based
| pagination is actually implemented, like they did for the
| other options by showing SQL. Is this something the database
| engine has to provide support for?
| andrewmackrodt wrote:
| You can do something like the following in application code.
|
| Imagine a table with the following schema/data:
|     id=1,created_at='2022-05-28T18:00Z'
|     id=2,created_at='2022-05-28T18:00Z'
|     id=3,created_at='2022-05-28T18:00Z'
|
| To retrieve articles in order of descending creation time, our
| sort order would be: created_at DESC, id DESC
|
| The last item in the ORDER BY query should be a unique key to
| ensure we have consistent ordering across multiple queries. We
| must also ensure all supported sort columns have an INDEX for
| performance reasons.
|
| Assume our application returns one article at a time and we have
| already retrieved the first article in the result set. Our
| cursor must include information for that last record's ORDER BY
| values, e.g. serialize('2022-05-28T18:00Z,3'); for example
| purposes I will use base64, so our cursor is
| MjAyMi0wNS0yOFQxODowMFosMw==.
|
| When the user requests the next set of results, they will
| supply the cursor and it will be used to construct a WHERE
| (AND) expression: created_at < '2022-05-28T18:00Z' OR
| (created_at = '2022-05-28T18:00Z' AND id < 3)
|
| So our query for the next set of results would be: SELECT *
| FROM articles WHERE (created_at < '2022-05-28T18:00Z' OR
| (created_at = '2022-05-28T18:00Z' AND id < 3)) ORDER BY
| created_at DESC, id DESC LIMIT 1
|
| For queries with more than 2 sort columns the WHERE expression
| will increase slightly in complexity, but it need only be
| implemented once. If the sort order direction is flipped, make
| sure to adjust the comparator appropriately, e.g. for 'DESC'
| use '<' and for 'ASC' use '>'.
| delusional wrote:
| Isn't that basically just a trivial extension of what the
| article calls "keyset based pagination"?
|
| I'm not complaining since this is also what I usually do, I'm
| just wondering if there's more to it.
| cortesoft wrote:
| Oftentimes under the hood, cursor based paging IS keyset
| paging. The difference is that the keyset is obfuscated,
| which can allow the implementation to change without
| changing the API call.
| TravelPiglet wrote:
| How do you handle removal?
| andrewmackrodt wrote:
| I haven't had a need to implement this, but something like
| the following should work to provide the same* records
| regardless of insertion/deletion.
|
| * data of the record may have been updated, but if there
| are 1,000 records total, paginating in either direction
| will always return those same 1,000 records regardless of
| insertion/deletion.
|
| - Filter by created_at <= the time the endpoint first returned
| a result.
|
| - Utilize soft deletes, where deleted_at IS NULL OR
| deleted_at < the time the endpoint first returned a result.
|
| Any keys allowed to be used in the ORDER BY query should
| also be immutable.
| catlifeonmars wrote:
| Makes total sense. It also seems trivially automatable on the
| surface. Echoing GP: is this an existing DBMS feature? If
| not, why not?
| qsort wrote:
| No need for anything special, you would just query with
| something like "SELECT whatever FROM Table WHERE cursor>value
| ORDER BY cursor LIMIT n". Obviously there needs to be an index
| over the columns of the cursor, but any reasonable engine
| should allow you to add that effortlessly.
| mbStavola wrote:
| The complication comes from having a cursor over results that
| are sorted on multiple columns. It's not incredibly difficult
| to implement, but it can certainly be annoying to do,
| especially if you want to support before+after cursors as
| well.
| gabetax wrote:
| Check https://use-the-index-luke.com/no-offset as a better
| reference.
|
| But in most SQL databases, cursors are something you implement
| and parse at the application layer, and translate to a WHERE
| clause on the (hopefully indexed) column you're ordering on.
| That turns the O(N) "OFFSET" into an O(log(n)) index seek.
| nafey wrote:
| How is a cursor different from a keyset in that case?
| cdash wrote:
| It's not, in that case. Cursors can be implemented however
| you want, though. It's an abstraction that prevents you from
| having to change the schema of the API when you change the
| implementation.
| lazide wrote:
| Many databases have actual cursor implementations which are
| transactional. That means you'll get a consistent view at the
| point in time you created the cursor.
|
| That said, they tend to live only as long as the db
| connection (with some exceptions), so yeah you need some
| application work to make it sane.
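The multi-column WHERE construction discussed in this thread only has to be written once; a minimal sketch of a generator for it, assuming `?` placeholders and a descending sort on every column (flip `<` to `>` per column for ascending):

```python
def keyset_predicate(sort_cols, last_values):
    # Expands the row-value comparison
    #   (a, b, ...) < (x, y, ...)
    # into
    #   a < x OR (a = x AND b < y) OR ...
    # for engines/ORMs without native row-value comparison support.
    clauses, params = [], []
    for i, col in enumerate(sort_cols):
        # Equality on all earlier sort columns, strict inequality on this one.
        parts = [f"{c} = ?" for c in sort_cols[:i]] + [f"{col} < ?"]
        clauses.append(" AND ".join(parts))
        params.extend(last_values[: i + 1])
    return "(" + ") OR (".join(clauses) + ")", params

sql, params = keyset_predicate(["created_at", "id"], ["2022-05-28T18:00Z", 3])
# sql    == "(created_at < ?) OR (created_at = ? AND id < ?)"
# params == ["2022-05-28T18:00Z", "2022-05-28T18:00Z", 3]
```

The result matches the two-column expression worked through in the thread, and extends mechanically to three or more sort columns.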
| sgtfrankieboy wrote:
| Some database engines will support cursors like this, but from
| what I can gather, cursor-based navigation is a wrapper around
| keyset or page navigation with base64-encoded parameters,
| making the implementation details of pagination unimportant to
| the API layer the user uses.
| pragmatic wrote:
| Continuation Token is a better term than cursor as it avoids
| confusion with DB cursors.
|
| https://phauer.com/2018/web-api-pagination-timestamp-id-cont...
| rwieruch wrote:
| Kinda related to this discussion about pagination APIs:
|
| One year ago I had to design an API [0] for paginated nodes which
| also allows nested nodes. Overall there would be more than
| 100,000 nodes in the database.
|
| On the UI side there would be a tree table to display the nodes.
| First you would fetch page 0 with 100 nodes, and from there you
| could either perform a paginated fetch for the next page (page 1)
| or a nested fetch on page 0 for one of the initially fetched
| nodes. So essentially the API allows paginated and nested fetches,
| where each nested fetch would be a paginated fetch too.
|
| We used cursor based pagination, because there would be added and
| deleted nodes eventually in a collaborative environment. It was
| definitely a fun challenge!
|
| [0] https://www.robinwieruch.de/react-tree-list/
| zikohh wrote:
| "Cursor-based pagination is the most efficient method of paging
| and should always be used when possible." - Facebook API Docs
|
| Why is it more efficient than keyset pagination? According to
| this article
| https://archive.ph/2021.04.30-125536/https://medium.com/swlh...
| the only difference is that the value of the key is encoded to
| allow the implementation to change.
| wespiser_2018 wrote:
| Both approaches use the underlying index in the database, but
| cursor approaches need to look at far fewer results in order to
| return the page. If you are deep into the result, your offset
| could be high, like 1000 or so.
| The database would have to
| start at the beginning and go through 1000 results, especially
| if filters are in play, in order to get to the N items in the
| page.
| nesarkvechnep wrote:
| Can't we do both offset and keyset-based pagination? If a user
| wants a specific page, let them specify it, and return them links
| to the next and previous pages, using filter + limit.
| mmastrac wrote:
| You should never design an API that uses offsets for pagination
| unless you're dealing with small amounts of data (<10000 rows).
| Cursors give you far more flexibility, and if you want to be lazy,
| you can just hide an offset in an opaque cursor blob and upgrade
| later on.
|
| I don't think I've used offsets in APIs for at least 10 years.
| Lightly-obfuscated cursor tokens are one of the first things I
| build in web projects and that's usually less than an hour's
| work.
|
| If you _really_ need the ability to drop the needle in your
| dataset with pagination, design your system to use pseudo-
| pagination where you approximate page-to-record mappings and
| generate cursors to continue forward or backward from that point.
| xdfgh1112 wrote:
| IIRC the DigitalOcean API uses offsets. We had to write special
| code to handle the possibility of the list changing while it
| was being queried.
| JamesSwift wrote:
| Do any of these avoid that scenario? Unless you are
| paginating on `updated_at ASC` there is an edge case of
| missing new data on previously requested pages.
| singron wrote:
| Missing data that began existing after you started querying
| is usually OK. If you had requested the data 1 second
| earlier, it wouldn't have been returned anyway. With offset
| pagination, you can miss data that always existed. This can
| make you believe an item that previously existed has been
| deleted.
|
| Be very careful with timestamp or auto-increment id
| pagination too.
| These don't necessarily become visible in
| the same order, since the id or timestamp is generated
| before the transaction commits, unless your database has
| some specific way of ensuring otherwise.
| xdfgh1112 wrote:
| Exactly.
| giaour wrote:
| When I worked for a news site, we moved all listing endpoints
| from offset to timestamp based pagination, because offset
| based pagination assumes the data being queried will remain
| stable between requests. This was, of course, very rarely
| true during business hours.
|
| Duplicates or missed entries in the result set are the most
| likely outcome of offset-based pagination.
| gopher_space wrote:
| Something always feels off about repopulating lists that
| people are working on. Like that concept needs an entirely
| different display method.
| lazide wrote:
| It's what happens with stateless protocols and offset based
| pagination 100% of the time, but most people don't notice
| it.
| MAGZine wrote:
| I consider offset-based pagination to be a mark of the
| novice.
| steve_adams_86 wrote:
| edit: I just saw another comment of yours, so I see this was
| meant more contextually than I realized.
|
| Can you explain why offsets would never be a suitable
| solution? Is there a clear explanation as to why?
|
| I understand how cursors are superior in some cases. But what
| if you have a site which paginates static data? Offsets would
| allow you to cache the results easily, overcoming any
| performance concerns (which would be irrelevant if the
| dataset was small anyway) and providing a better user experience
| (and developer experience, due to simpler implementation
| details) overall.
|
| I can see that it would be a novice move if you've got
| staggering amounts of records that take a while to query. But
| that's actually pretty rare in my experience.
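The duplicate/skip behavior described above is easy to reproduce (a toy SQLite demo; the insert between the two page requests stands in for a concurrent writer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)", [(f"post{i}",) for i in range(6)]
)

def offset_page(page_num, size=3):
    # A newest-first feed, paged with LIMIT/OFFSET.
    rows = conn.execute(
        "SELECT title FROM posts ORDER BY id DESC LIMIT ? OFFSET ?",
        (size, page_num * size),
    )
    return [r[0] for r in rows]

first = offset_page(0)   # ['post5', 'post4', 'post3']
# A new row arrives between the two page requests, shifting every
# later offset by one.
conn.execute("INSERT INTO posts (title) VALUES ('post6')")
second = offset_page(1)  # ['post3', 'post2', 'post1'] -- 'post3' repeats
```

With a deletion instead of an insertion, every later page shifts the other way and a row is silently skipped; a keyset/cursor on `id` avoids both cases.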
| MAGZine wrote:
| It doesn't even have to be a cursor, you can technically
| page off of any field you can index (I've written a lot of
| sync APIs so there are gotchas, but the point stands).
|
| Limit/offset is usually (though not always, as you point
| out) egregious for anything more than hobbies or small
| projects. But I am biased, as when I build APIs I
| definitely expect some amount of traffic/volume/tuple
| count, and offset/limit will not do.
| mcphage wrote:
| Or the mark of someone who asked the users what they wanted.
| MAGZine wrote:
| I don't disagree, but everyone would prefer to have an API
| that works over one that doesn't, or one that causes
| service outages due to database latency. (did anyone ask
| the users if they wanted the API to work?)
|
| If you're dealing with very small datasets, it's fine. I'm
| an average person using average APIs, which means that when
| I see offset-based pagination, it's usually on a service
| deployed and used by a lot of people.
|
| Unsurprisingly, the offset based APIs often include some
| other arbitrary limit like "offset limited to 10k" or
| something silly but understandable if you've built an API
| used by thousands of people before, or understand how
| databases work.
|
| They're also often superseded by better APIs that actually
| allow you to page the entire result set. Then you have a
| deprecated API that you either support forever or annoy
| users by turning it off.
|
| So yes, if you are building something non-internal/pet
| project, limit/offset is probably the mark of the novice.
| xthrowawayxx wrote:
| how do you jump to arbitrary pages with cursor pagination?
| singron wrote:
| You don't, but if your cursor is just a position in some
| well-defined order, you can sometimes craft a cursor out of a
| known position.
| Continuing the book metaphor, instead of
| skipping to page 100, which has no semantic meaning, you
| could skip to the beginning of the "M" section of the
| dictionary, or see the first 10 words after "merchant".
| lazide wrote:
| Depending on what you mean by 'page', you probably want a
| query? (Show me all results with values between 1 and 10.)
|
| If you want to jump ahead in the results, that is
| fundamentally unstable though.
| tshaddox wrote:
| You generally don't, unless you want to linearly scan through
| n pages. If your API is using offsets or page numbers and the
| underlying collection is also having items added or removed,
| the behavior is undefined or straight up wrong. I think it's
| okay to use offsets or page numbers in cases where the
| collection is static, or where it's acceptable to occasionally
| skip over an item or see a duplicate item while paginating.
| rhacker wrote:
| Or... search, not page. Which is typically way more useful in
| reality.
| orf wrote:
| Cursor based pagination makes sense. At a certain scale every
| part of your API becomes something you need to support, and that
| includes pagination. This makes it super difficult to change the
| storage you use, as suddenly you need to support integer-based
| offsets rather than opaque tokens.
| duskwuff wrote:
| Cursors also mean you get more predictable behavior when
| interacting with data that's changing under you. Imagine
| scrolling back through an active chat, for example -- you
| wouldn't expect each page to contain some of the same chat
| messages that are already on screen.
| pornel wrote:
| Another downside of offset-based pagination is less consistent
| handling of data changes during pagination.
|
| If any row is inserted anywhere before the page you requested,
| you'll see some other item twice (once at the end of the previous
| page, then again at the top of the next page).
Similarly if any | item was removed, you'll skip over some _other_ item when | paginating due to every page shifting by one. | catlifeonmars wrote: | I'd take it further: offset based pagination failure modes look | different based on the implementation. If you have API clients, | you're exposing that implementation detail to them. | cortesoft wrote: | That was mentioned as a con for offset based pagination. | verst wrote: | Related: Do you think SDKs for these APIs should themselves | require you to interact with the cursors? | | If I ask an SDK to get me 100 items but internally it returns only | 50 before needing to make another call with a marker or cursor, it | would be nice if the SDK had an option to handle this for me | (possibly exposing cursors still for those who really want it). | jahewson wrote: | What if it fails after 50? | verst wrote: | Maybe return the items it found so far and provide the cursor | it failed on? Have a way to optionally provide the cursor? I | can think of some ways I'd implement this in an SDK. Probably | for simplicity I'd have some sort of JSON status field | indicating the request (traversing all cursors) completed. If | it's not complete I still would look for the next cursor it | returned and do the usual thing manually. | progval wrote: | SDKs should typically return a lazy iterator in languages which | support it, so callers do not even care about pagination and | just iterate on it until they don't want any more data, while | the SDK fetches new pages when needed. | | In pseudo-Python:
|
|     def get_stuff():
|         url = "https://example.org/api.php"
|         while url:
|             response = requests.get(url)
|             yield from response.json()
|             url = parse_link(response.headers["link"]).next
|
| and you call it with:
|
|     for item in get_stuff():
|         print(item)
|
| [deleted] | scarface74 wrote: | The AWS Python API uses "paginators" to allow you to iterate | over results with a single for loop and it handles the | pagination in the background.
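Those marker/cursor tokens are usually nothing magical: often just the query position, serialized and base64-encoded so clients treat it as opaque. A sketch (field names invented here; real services typically sign or encrypt the payload so clients can't tamper with it):

```python
import base64
import json

# Hypothetical opaque pagination token: the server's keyset position,
# JSON-serialized and base64-encoded.
def encode_cursor(last_id, sort_key):
    payload = {"last_id": last_id, "sort_key": sort_key}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

def decode_cursor(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))

token = encode_cursor(42, "2022-05-28T15:31:00Z")
state = decode_cursor(token)  # {'last_id': 42, 'sort_key': '2022-05-28T15:31:00Z'}
```

Because the format is hidden, the server is free to change what the token contains without breaking clients.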
The underlying API - which it | also exposes - uses an opaque "NextToken". | nthypes wrote: | Cursor based pagination is a good idea for APIs because it allows | for more flexibility in how data is displayed to the user. It | also helps to prevent large amounts of data from being returned | all at once, which can cause performance issues. | robertlagrant wrote: | All pagination can help prevent large amounts of data being | returned all at once. | skeeter2020 wrote: | > It also helps to prevent large amounts of data from being | returned all at once, which can cause performance issues. | | Isn't that the point of all pagination, regardless of how it's | implemented? | cortesoft wrote: | > It also helps to prevent large amounts of data from being | returned all at once, which can cause performance issues | | You can also limit page size and accomplish the same thing. | gregw2 wrote: | For large datasets, consider avoiding pagination complexity by | having your api return an (async) job ID, and dumping the whole | resultset to a file which can be fetched (over https) via the | jobID. | | I have done this with both Postgres/S3 and Redshift/S3 backends | and presigned URLs, and it looks like I could do it with Snowflake | too. | wespiser_2018 wrote: | Nice write up! | | I've really struggled with Haskell's lack of support for | pagination, and worked on at least two "inner-source" libraries | that offer what the author is calling "Key Set" or limit/offset | and Cursor Based as Servant combinators. | | What I've found is that "Key Set" is much more ergonomic for | library authors to implement, if they have to write the | database/sql calls themselves, although "Cursor Based" is | technically more efficient. | | It's my opinion that the standard, out of the box behavior for an | endpoint should be to paginate every single endpoint that returns | a list of results, but in order to do that easily you'll need | some framework support.
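The "Key Set" style fits in a few lines of SQL; a Python/sqlite3 sketch (table and contents invented for illustration), where the cursor is simply the last key the client saw:

```python
import sqlite3

# Keyset ("seek") pagination: resume after the last key seen.
# An index on the key keeps every page cheap, no matter how deep.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("apple",), ("merchant",), ("mercy",), ("merge",), ("zebra",)])

def page_after(last_word, limit=10):
    rows = conn.execute(
        "SELECT word FROM words WHERE word > ? ORDER BY word LIMIT ?",
        (last_word, limit))
    return [r[0] for r in rows]

first_page = page_after("merchant", 3)  # the first 3 words after "merchant"
```

This is also why crafting a cursor from a known position works: any valid key value is a valid place to start reading.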
More mature frameworks have this right | now, and IMO this is one of the weak points in using a less | supported language like Haskell. You'll either have to roll your | own solution, or rely on spotty library support that does that | pagination operation in CPU. | Cthulhu_ wrote: | The limitation of cursor-based pagination - where you only know | the previous, current, and next page's cursor identifier - sounds | ideal for infinite scrolling applications, where the user does | not get the option to pick a page. | | Another solution or optimization I've seen is in forum software, | where if a user executes a search, the IDs of the search results | are copied into a "search results" table with the equivalent of a | session ID; that way, the data remains stable between paging, and | the data set over which pagination has to be done (e.g. via | offset/limit) is very small; joins can be used to link to the | items in question in the search results (e.g. posts, threads), or | all the items that need to be shown in the paginated search | results can be copied over to the ephemeral search results table. | | And of course, since it's ephemeral and memory is cheap these | days, the whole search result can be cached in server-side | memory. | buro9 wrote: | Moving forward and back in a cursor (pages) is easy... but how | about random pages? | | i.e. this forum, HN. If there were ten pages of comments, going | from page 1 to page 2, or vice versa is trivial with a "after" or | "before" query. But what about jumping directly to page 3 without | previously visiting pages 2 or 4? | | So far I've implemented a hybrid... pages still exist, but next | and previous page (the most common actions) are cursor. | | Is there a better way? | jack_squat wrote: | IMO if a solution requires users to address data by page, it's | a sign that the search functionality is not good enough. 
| | In general a user doesn't want to find "page 3", they are | looking for a record that has certain characteristics. They | shouldn't need to think about pages, just search terms. | kortex wrote: | If you use ULIDs for IDs (sortable by time, millisecond | resolution) and a tombstone field (nullable deleted_at is a | good one) then you have a very stable collection that | guarantees no new objects will be inserted/deleted before the | write head - it is append only in effect. | | You can then do some cool things, especially if your objects | are evenly distributed and especially if you don't need exact | page sizes or can over-query and hide them as needed. | | If you then know the approximate frequency of objects, you can | map some linear scaling to an approximate range of ULIDs. | Basically the German tank problem in reverse. | | https://en.m.wikipedia.org/wiki/German_tank_problem | higeorge13 wrote: | I have seen a variation of the cursor based approach in the Intercom | API [1] but with a fixed cursor (while the example shows a | different cursor on every request), which also does not allow you | to create another one within 1 minute. It feels almost like a | database server cursor, but does anyone want to guess how this is | implemented? | | [1] https://developers.intercom.com/intercom-api- | reference/refer... | jmull wrote: | > Bad performance for large OFFSET in SQL. When doing OFFSET N in | SQL, the database needs to scan and count N rows. | | For user-facing apps, you've failed pretty hard already if users | can't find what they need on the first page or so. | | If your offsets are big enough for this to matter, I'd rather | spend time on why users are so many pages in to the data than | optimizing the performance of later pages. | | (Processing clients, on the other hand, should query for what | they need. Processing in batches may make sense, so the cursor- | or what they call keyset-based "pagination" makes good sense.
| Though in the case of processing clients, I wouldn't call it | "pagination"... it's more like batches. I've mainly used "keyset- | based" pagination for processing events, which can alleviate some | of the "cons".) | giaour wrote: | Users do fuzzy searches across data all the time. They may | remember approximately when they last saw a record but not its | exact name, or their recollection may be close but imperfect. | z3t4 wrote: | Pagination in APIs is a PITA as you have to make many requests | to get all the data. 99% of the time you want all data, not just | the first page. For the 1% use case you could use http-range or | page. Pagination is often implemented as a premature | optimization. | indymike wrote: | 8 minutes of wire time is bad and pagination is usually part of | server side frameworks... | russellendicott wrote: | In my experience the only reason you would want all the data is | if the API lacks good filter queries. IMO, making it a PITA to | download all the data is kind of a feature. | cortesoft wrote: | > 99% of the time you want all data, not just the first page. | | This certainly doesn't seem to be the case. Take viewing this | very website, for example... are you suggesting that most | people want to see all posts of all time, rather than just the | first page or two? Paging is used extensively for feeds, and | feeds are crazy common these days, and people rarely want to | see all posts ever on a feed. | crescentfresh wrote: | > Pagination is often implemented as a premature optimization. | | or for forwards-compatibility. | stevebmark wrote: | I'd also point out that the GraphQL/Relay spec for pagination | enforces cursor based pagination: | https://relay.dev/graphql/connections.htm. It's also a really | nice pagination spec overall; edges/node/pageInfo are well | thought out and flexible.
| | In my professional experience, lots of devs will implement | fetching APIs and not consider pagination, and then when the | responses start growing (testing data, use, whatever), user | experience sucks. Pagination should be a default API design for | anything returning an unbounded list of items. | Ambolia wrote: | How do you deal in GraphQL with a query that hits multiple | nodes and may return multiple cursors for the different nodes? | Seems like a nightmare if you have to check and retry manually | for each node; maybe there are libraries that take care of it | transparently? | stevebmark wrote: | How is this different than hitting multiple REST endpoints | for different resources with their own individual pagination? | If the client needs to glue together paginated resources from | two places, it's a hairy problem no matter what the protocol. | A backend for frontend that does this under the hood is one | solution to both, or a specialized endpoint that does it | under the hood. | Ambolia wrote: | In the case of hitting multiple REST endpoints for | different resources, on each query it's clear what | pagination I'm using, and combining the data is just | appending it to the final list. In each case the partial | results are flat and linear. | | But in the case of GraphQL, I may get an incomplete | list, or I may get a complete list that will have some | fields with missing data, or some combination of both, | going down deep depending on the structure of the query. I | don't see a clear way to request the missing data without | duplicating part of what I already have, and I don't see a | clear way to combine the results. | | But I may be missing something about GraphQL or something | about the tools that makes this simple.
| | For example something like this:
|
|     {
|       users(first: 10000, after: "cursor123") {
|         edges {
|           cursor
|           node {
|             id
|             name
|             friends(first: 10000, after: "cursor235") {
|               edges {
|                 cursor
|                 node {
|                   id
|                   name
|                 }
|               }
|             }
|           }
|         }
|       }
|     }
|
| hamandcheese wrote: | In the relay JS library you can avoid some duplication | with fragments. Fragments are pieces of a GraphQL query | that you can include in multiple queries. If you are | using Flow, the type system will also ensure that you | can't render a component unless you have used the correct | fragment in your query. | | That said, paginating multiple resources on the same page | is still going to be complicated. | stevebmark wrote: | Oh. I think you'd handle it the same as REST: make a new | individual field query for that friend by ID and ask for | the pagination there. I think that would be manual work | for GraphQL or REST; I'm not aware of anything that does | that automatically. It's also not a case I've personally | run into so not sure. | grahamm wrote: | Agreed, so many times I've raised paging with an internal API | provider only to be told they can retrofit it later if needed. | Then you start to use production data and bam, the one call | doesn't work. | User23 wrote: | I disagree. I think pagination is a clumsy error prone kludge. | That it performs better than the awful naive approach by | additionally burdening the caller doesn't remotely make it | good. The correct approach is a buffered stream with proper | flow control. Incidentally that's what _Transmission Control | Protocol_ was invented for. | | Probably the nicest way to expose this that fits contemporary | sensibilities would be as a lazy sequence that does sensible | prefetching and asks the server for more data as it's iterated | through.
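That lazy, prefetching sequence can be sketched with a single-worker executor; `fetch_page` below is a hypothetical stand-in for a real API call, not anything from the thread:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(cursor):
    # Stand-in for a real paginated API call; returns (items, next_cursor).
    # Three hypothetical pages, keyed by cursor.
    pages = {None: ([1, 2], "a"), "a": ([3, 4], "b"), "b": ([5], None)}
    return pages[cursor]

def lazy_items(fetch=fetch_page):
    # Yield items one at a time while a background worker prefetches the
    # next page as the caller is still consuming the current one.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, None)
        while future is not None:
            items, next_cursor = future.result()
            future = pool.submit(fetch, next_cursor) if next_cursor else None
            yield from items

all_items = list(lazy_items())  # [1, 2, 3, 4, 5]
```

The caller just iterates; pagination, and the overlap of fetching with consuming, are hidden behind the generator.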
| stevebmark wrote: | When dealing with the most common case (I think) for | pagination, clients (web/native) requesting data, are you | suggesting keeping a long-running websocket only to fetch data | in response to new page requests? What benefit would that | afford, given that the requests are pretty far apart in time? | | Or are you focusing more on the backend machine to machine | case? In that case, some kind of automatic prefetching, lazy | loading/yield setup sounds pretty nice if that's abstracted | away when you want to iterate over a large dataset from a | separate resource server. It's not a pattern I've used | before, mainly because it's not often that one server wants | to know _everything_ that another server has. | giaour wrote: | > a lazy sequence that does sensible prefetching and asks the | server for more data as it's iterated through | | Isn't this an accurate description of cursor-based | pagination? You can have the server hold the cursor and | associate a request for the next page based on the client ID | (like in the PostgreSQL wire protocol), or you can make the | flow stateless (at the network level) by handing the client a | ticket for the next page. | jayd16 wrote: | Pagination implies many requests whereas a streaming API | would keep to a single request, no? No worry about multiple | servers or juggling state. | giaour wrote: | It's still multiple requests on a streaming API if you're | asking the client to notify you when it's ready for the | next page. Most DB engines use TCP connection affinity to | identify which cursor is in play, but the client still | needs to send a packet asking for the next page. Both | client and server need to keep the TCP connection open, | which is state for both parties. | | I see you mentioned gRPC streams in a sibling comment, | which are a great alternative to REST-based pagination, | but they are highly stateful!
Streams are built into the | protocol, so a generic gRPC client will handle most of | the state juggling that would have been your | responsibility with a REST client. | jayd16 wrote: | This is a bit inaccurate I think. You can make a RESTful | api over gRPC. The protocol is irrelevant. An open | connection is not a violation of statelessness. If | anything, a streaming API where you can close the cursor | upon completion and treat new requests as fully new is a | lot less stateful. | | We're also talking about APIs in general, not just http | REST. | | >It's still multiple requests on a streaming API | | Perhaps, but the requests can be pipelined. You don't | need to wait for the response to complete before asking | for more. | giaour wrote: | The original article is talking about pagination over | stateless (at the protocol level) HTTP requests. | (Referring to APIs that are offered over stateless HTTP | as "REST APIs" is technically incorrect but reflects | common usage and is a convenient shorthand.) | | gRPC is able to offer powerful streaming abstractions | because it utilizes HTTP/2 streams, which are cursor | based rather than strictly connection oriented. The state | is still there; it's just a protocol-level abstraction | rather than an application-level abstraction. | | > Perhaps, but the requests can be pipelined. You don't | need to wait for the response to complete before asking | for more. | | That sort of defeats the purpose of grpc flow control, | doesn't it? | jayd16 wrote: | >That sort of defeats the purpose of grpc flow control, | doesn't it? | | Why do you say that? I don't know the full implementation | details themselves but generally there's no reason you | can't safely ask for even more after having asked and | consumed some. If you have a buffer of M bytes and ask | for M, then consume N, you could immediately ask for N | more without waiting to receive all of the original M. 
| | Although, I wasn't speaking about gRPC in that case. I'm | not sure how exactly gRPC achieves back pressure. I was | only speaking abstractly about a pipelined call vs many | pages. You seemed to claim that multiple requests were a | requirement. | | But fine, ignoring gRPC, you could possibly tune your | stack such that normal http calls achieve back pressure | from network stacks and packet loss. Http is built on top | of TCP streams, after all. That doesn't make it | inherently stateful does it? | | Going all the way back, I still think it's fair to say | that a streaming response with backpressure can be | stateless and without multiple requests. If you want to | argue that multiple packets or TCP signals are needed | then perhaps so, but I think that's a far cry from the | many separate requests a paginated call requires and I | don't think it's accurate enough to conflate them. | giaour wrote: | > I don't know the full implementation details themselves | but generally there's no reason you can't safely ask for | even more after having asked and consumed some | | I think we're saying the same thing but using different | formulations. If you send an HTTP request for a list with | 20 items, then get back a response with 10 items and a | link for the next page, that is essentially the same as | consuming 20 items over a stream with flow control. The | point of returning a cursor in the response and having | the client send a new request for the next page is to | support a stateful stream over a stateless protocol. In | neither case are you waiting for the response to be | complete before processing items, since your message | indicates that "complete" here means the full result set | has been sent to the client. | | > But fine, ignoring gRPC, you could possibly tune your | stack such that normal http calls achieve back pressure | from network stacks and packet loss. Http is built on top | of TCP streams, after all.
That doesn't make it | inherently stateful does it? | | That's pretty much how streaming responses are | implemented in TCP-based protocols (like the SQL query | APIs exposed by Postgres or MySQL). TCP connections can | be terminated for unrelated reasons, which is why you | don't see this pattern very often in recent protocols. | When a TCP connection is dropped and restarted due to, | say, network congestion, you have to restart the | paginated operation from the head of the list. H2 (and, | vicariously, gRPC) streams are built to be more resilient | to network noise. | | But to answer your question, yes, that pattern is | inherently stateful. Pagination has to be, since the | server has to know what the client has seen in order to | prepare the next batch of results. You can manage this | state at the protocol level (with streams), or you can | push it to the application level (with cursor tokens | embedded in the messages exchanged). The streaming | approach requires a stateful server and client, whereas | the application-level approach only requires a stateful | client. | jayd16 wrote: | I think my opinion on close enough also differs from | yours (and that's ok!) but you're also forgetting that a | streamed/connection based approach only involves a single | client and server and state during the connection. | | A paged response needs to sync state across many | machines. As you have said, you need a clientID, a cursor | reference, or a sticky session. That's not nothing. | | I think they're inherently different approaches. | freedomben wrote: | I find pagination really annoying on the client, but the | majority of the time it does seem like the client doesn't | actually need the whole thing so it's a good (and often very | necessary) performance optimization. For endpoints where that | isn't the case, it's (usually) not too hard to allow a | sentinel value for page size that means infinity, so the | client really can get it all in one swoop. 
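A streaming response doesn't have to mean gRPC; even newline-delimited JSON over plain HTTP gives the client a lazy iterator with no application-level paging state. A self-contained sketch, with a StringIO standing in for a streaming response body:

```python
import io
import json

def iter_ndjson(stream):
    # Lazily consume newline-delimited JSON: no pages, no cursors;
    # "backpressure" is simply how fast the caller iterates.
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# In real use `stream` would be a streaming HTTP response body read
# line by line; a StringIO stands in here so the sketch runs as-is.
body = io.StringIO('{"id": 1}\n{"id": 2}\n\n{"id": 3}\n')
ids = [item["id"] for item in iter_ndjson(body)]  # [1, 2, 3]
```

The tradeoff is the one discussed above: the stream is only as durable as the underlying connection.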
| jayd16 wrote: | >Probably the nicest way to expose this that fits | contemporary sensibilities would be as a lazy sequence that | does sensible prefetching and asks the server for more data | as it's iterated through. | | gRPC's streaming API is a good example. It has backpressure | built in and you just need to configure the buffer sizes. | Clients just consume items off the stream and async writes | will suspend as needed. | robertlagrant wrote: | What are you saying that would be different in practice? What | advantages would it have? | jayd16 wrote: | Not sure I agree with the parent but thinking it through, | you have a single connection and a cleaner life cycle. | Streaming writes and reads could mean back-pressure is | possible. A client can consume items as needed as opposed | to mapping usage to a page size. | | There's no worry about wasting a full request on the last | page with a single item. In fact there's no worry about | round trips at all, it's just a matter of buffer back | pressure that communicates the need for more or less. | | Hmm, when you think about it, its a pretty good way to go. | User23 wrote: | That's consistent with what I have in mind. I too think | it's pretty much best of both worlds. You get better | efficiency than pagination with the convenience of an API | that's no different from fetching all the data. | nerdponx wrote: | I worked with an API like this once, which streamed | NDJSON one line at a time. I loved the amount of control | that it gave me on the client side! | amelius wrote: | > There is no way to skip pages. If the user wants page X, it | needs to request pages from 1 to X. | | Huh? The client can ask the server to generate a new cursor, any | desired amount of items ahead in the list. | mypalmike wrote: | With opaque cursors, how does a client specify "a desired | amount"? | amelius wrote: | You just say "this cursor + 5 pages", for example. 
| hombre_fatal wrote: | You send `GET /items[?cursor=<cursor>]` to my server. | | I respond:
|
|     {
|       "items": [ {...}, {...}, {...}, ... ],
|       "next_cursor": "77963b7a"
|     }
|
| How do you request "this cursor + 5 pages"? | amelius wrote: | GET /items?cursor=77963b7a&add_to_cursor=5pages | dingleberry420 wrote: | If the server supports this, which they won't, because they | don't want to support pagination. | travisgriggs wrote: | In 32 years of software development, it fascinates me how much | more time I spend just moving/translating data between | domains. I used to model things more, and do human interaction | more. Now it seems like a dominant task I do in any of the | products I work on is just shifting data from one | place/format/representation/node to the other. | itisit wrote: | Ah, yes...the GRIMM stack: General Repetition of Integrations, | Migrations, and Misery | GiorgioG wrote: | "Modern day best practices"... | RandyRanderson wrote: | Let's rewrite the microservices wiki a bit: | | "A microdata architecture - a variant of the data-oriented | architecture structural style - arranges data as a collection | of loosely-coupled data. In a microdata architecture, data are | fine-grained and the relations are lightweight." | | Likely most experienced data architects are now laughing at | what a dumb idea the above is. That's the same reaction I had | when I first heard of microservices - "that's obviously a bad | idea; ppl aren't that dumb", I thought. | travisgriggs wrote: | But big monoliths/spaghetti balls of code aren't great | either, you'd agree with that? | | It seems like we don't have the ability to do "all things in | moderation" as an industry very well. We're in constant | search of The Silver Bullet, that if we just apply it, | will... buy the world a coke and live in perfect harmonies or | something. And we go through the cycle again and again and | again.
| [deleted] | [deleted] | [deleted] | Lightbody wrote: | The Google Calendar API leaves a lot to be desired | (documentation, etc), but over time what I've found is that it's | really well-designed once you understand it. | | In their case, they went with a "pageToken" variable which is | basically a cursor. | | The API has been around for years. ___________________________________________________________________ (page generated 2022-05-28 23:00 UTC)