[HN Gopher] We switched to cursor-based pagination ___________________________________________________________________ We switched to cursor-based pagination Author : qin Score : 64 points Date : 2022-08-17 13:09 UTC (4 days ago) (HTM) web link (www.moderntreasury.com) (TXT) w3m dump (www.moderntreasury.com) | [deleted] | fein wrote: | I have a few main requirements for what I consider easy to use | pagination. | | 1. I can set how many items per page. | | 2. I can get to an arbitrary page either through a path/ query | param or at least close with a pagination row that contains a way | to jump around. If an item gets removed, whatever I was looking | for should still be in the same vicinity. | | 3. As a result of #1 and #2, I can go back and find items I saw | previously because their placement is at some fairly reliable | position on the page I remember it being on or around. | | You know, like how a physical book or catalogue with pages works. | | Please stop trying to improve on this and making usability more | difficult. I hate infinity scroll and I hate not being able to | pick my page in an intuitive fashion. | int0x2e wrote: | I totally agree with you from a usability standpoint, and while | it is possible to make this work well, for larger-scale | services it is rarely without costs, and for the most part - | people don't seem to care as much as you'd think. | | While it can be frustrating, I doubt much revenue (relatively | speaking) was lost due to this issue, which means for most apps | and services, it will likely not get done. | | (I'm not disputing the merit of your arguments, just explaining | why this will rarely get done well in real life...) | nemothekid wrote: | > _I can get to an arbitrary page either through a path / query | param or at least close with a pagination row that contains a | way to jump around_ | | I've had quite a few heated discussions on this point. 
The | problem is, once your dataset gets large enough, this use case | is incredibly difficult to scale; not impossible (Google does | it with millions of search results), but prohibitively | expensive compared to how often the need of being able to jump | to any specific page arises. | | Now I always try to stick to cursor based pagination as the | default in order to prevent people from building workflows on | top of offsets. | robertknight wrote: | > Google does it with millions of search results | | Google cheats, but in a way that very few users will notice. | You can only access the first 20 pages of search results. | Depending on user behavior, this is one way to offer | navigation via page number while limiting worst-case cost. | zaroth wrote: | I doubt that 21st page actually exists, even on the back | end. | | I would bet the top-line number of results is some sort of | extrapolated estimation thing, a hand-wavey way to | represent how deep the hypothetical result set is for that | query. | camgunz wrote: | I think the challenge is always the stateless nature of HTTP. | It's pretty hard to keep a connection to the API server that has | your database cursor. | | Everything else has big tradeoffs. You can make sure your rows | are sortable by a specific column, but that means your searches | must all be sorted by that column. | | You can save the results of a search and page through it, but | that's a lot of I/O. | | You can not do any of the above, but then you run the risk of | non-deterministic pagination (I went back and because someone | added/deleted a row, that page is slightly different now). You | also can't really link to it. | | These things might all be fine. In my experience though, you run | into people who really want to do search in way A, and will use | the tradeoffs for way B/C/D to argue against it, as though A has | no tradeoffs. 
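[Editor's note: a minimal sketch of the keyset ("cursor") pagination nemothekid advocates above, using Python's stdlib SQLite bindings purely for illustration. The table and column names are hypothetical, not the article's actual implementation; the point is that the `WHERE id > ?` seek replaces `OFFSET`, so deep pages cost the same as page one.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [(f"item{i}",) for i in range(1, 26)])

PAGE_SIZE = 10

def next_page(cursor_id=0):
    # Keyset pagination: seek past the cursor with an indexed comparison
    # instead of OFFSET, so cost stays flat however deep you paginate.
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (cursor_id, PAGE_SIZE),
    ).fetchall()
    # The id of the last row becomes the opaque cursor for the next request.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor

page1, cur = next_page()
page2, cur = next_page(cur)
```

Clients never see page numbers, only the opaque cursor value, which is exactly what prevents the offset-based workflows nemothekid mentions.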
| aarondf wrote: | There are ways to mitigate (although not eliminate) the | slowdown of offset/limit pagination on later pages. The | technique is called a "deferred join" and it is most effective in | MySQL. The basic idea is to paginate as little data as necessary, | and then do a self-join to get the rest of the data for a single | page. | | You can read more about it here: | https://aaronfrancis.com/2022/efficient-pagination-using-def... | or here https://planetscale.com/blog/fastpage-faster-offset-paginati.... | | There are libraries for Laravel | (https://github.com/hammerstonedev/fast-paginate) and Rails | (https://github.com/planetscale/fast_page) as well! | | Cursor-based pagination is wonderful, but sometimes you're stuck | with offset/limit for whatever reason. Might as well make it | fast. | mike_hock wrote: | TL;DR: The headline. TFA doesn't really add any information. | G3rn0ti wrote: | Well, except that TFA explains what DB cursors are and why they | are faster than page offsets (because they skip DB entries). | jsmith99 wrote: | I'm pretty sure they aren't using actual SQL cursors, as that | wouldn't scale well, so it's probably not a 'DB cursor'. It's | a shame, as actual cursors are the native database way to do | pagination. | theogravity wrote: | It needs to talk about how to actually implement it. Most | articles like this one mention its existence and why it's | good, but generally stop there.
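[Editor's note: a concrete sketch of the deferred join aarondf describes, using SQLite for illustration (the technique pays off most on wide rows in MySQL; table and column names here are made up). The inner query paginates only the indexed primary key, then a self-join fetches the wide columns for just the one winning page.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO posts (title, body) VALUES (?, ?)",
                 [(f"t{i}", "x" * 100) for i in range(1, 101)])

# Naive offset/limit: the engine materializes full wide rows,
# then throws most of them away.
naive = conn.execute(
    "SELECT id, title, body FROM posts ORDER BY id LIMIT 10 OFFSET 50"
).fetchall()

# Deferred join: the inner query scans only the narrow id column,
# then joins back to fetch the wide columns for just the 10 winning rows.
deferred = conn.execute(
    """
    SELECT p.id, p.title, p.body
    FROM posts p
    JOIN (SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 50) page
      ON p.id = page.id
    ORDER BY p.id
    """
).fetchall()

assert naive == deferred  # same page, cheaper inner scan on wide tables
```

Both queries still walk the skipped keys, so this speeds up deep offsets rather than eliminating their cost entirely, as aarondf notes.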
| G3rn0ti wrote: | Postgres supports cursors and documents them very well: | | https://www.postgresql.org/docs/current/plpgsql-cursors.html | | Basically, you declare a cursor like this: | | DECLARE cname CURSOR FOR <select statement>; | | You can pass the cursor's name "cname" (and even store it | on the client side, although best encrypted) and obtain | the "next" slice of your data corresponding to the query | on demand like this: | | FETCH NEXT FROM cname; | | Not sure you really gain that much performance on everyday | queries but with a large number of rows it might. | chowells wrote: | Of course, this isn't what the article is talking | about... | theteapot wrote: | TFA isn't talking about that kind of cursor. | thaumasiotes wrote: | > TFA explains what DB cursors are and why they are faster | than page offsets (because they skip DB entries) | | Except that no such information appears in the article. Here's | the entirety of the explanation: | | >> Once you've been returned a cursor, you can provide it in | your requests to act as a farther-along starting point within | your data entries. The server is then able to efficiently | skip all entries that come before your specified cursor value | feike wrote: | Reminds me of Markus Winand, who hands out stickers at database | conferences banning OFFSET. | | His site is a great resource for anyone wanting to take a deeper | dive on SQL performance: | | https://use-the-index-luke.com/sql/partial-results/fetch-nex... | wootest wrote: | Which in turn reminds me of: | http://simonwillison.net/2022/Aug/16/efficient-pagination-us... | pvorb wrote: | Do you know if this is specific to MySQL or does it also | apply to other RDBMS like PostgreSQL? | ReactiveJelly wrote: | So basically do it like Reddit? | | https://old.reddit.com/?count=25&after=t3_wtpvdp | | I noticed Reddit's pagination has that "after" parameter, which | points to the last post on the current page.
| | It glitches out if the last item is deleted by moderators, but | otherwise it works smoothly. | MatmaRex wrote: | Yeah, or Wikipedia. | | https://en.wikipedia.org/w/index.php?title=Category:Living_p. | .. | djbusby wrote: | On Reddit I frequently see the "next" page having the same | posts as the previous page. Not all the same but many of the | same. Like, maybe after is being respected but the sorting is | different or something. | saghm wrote: | I see that on Hacker News a decent amount as well when | going through the top stories across multiple pages. My | assumption has always been that the order changes between | when I load the page and when I move to the next one (which | sometimes is not for another several minutes). | radiojasper wrote: | You took some time to read the page and while reading it, | the homepage changed and some new posts got added to the | top. Therefore some posts get moved to the 2nd page. | dinkledunk wrote: | how to jump to an arbitrary page? | nextaccountic wrote: | You can do that with postgres histograms | https://www.citusdata.com/blog/2016/03/30/five-ways-to- | pagin... - go to the section "Keyset with Estimated | Bookmarks" | | > As we saw, plain keyset pagination offers no facility to | jump a certain percentage into the results except through | client guesswork. However the PostgreSQL statistics collector | maintains per-column histograms of value distribution. We can | use these estimates in conjunction with limits and small | offsets to get fast random-access pagination through a hybrid | approach. | jsmith99 wrote: | Do you really need to jump to an arbitrary page and land on | the exact item? For many applications an approximate jump is | fine. If your column is fairly uniformly distributed you can | guess the index for any arbitrary page. | dinkledunk wrote: | Yes, my business users will feel like they don't have | sufficient access to their data if they can't. 
| | > If your column is fairly uniformly distributed you can | guess the index for any arbitrary page. | | I don't think that'll work in a multi-tenancy situation | with complex filters. | waspight wrote: | I bet your users do not always know what is best for | them. | eurasiantiger wrote: | Some people thoroughly enjoy a linear saccade search! See | for example any social media app. | | It definitely isn't in the users' best interest to have | any method of scrolling through a lot of records. | squeaky-clean wrote: | Jumping to a specific page is a bit of an ambiguous/undefined | term in this case. Like asking for a specific page | in a book that's still being written. Maybe today the plot | twist occurred on page 100, but then the author decides | chapter 1 needs more backstory, and now the plot twist | happens on page 115. | | Unless you can guarantee your data is static, or that the | sort order cannot be mutated and will only append later values, | the concept of what data belongs on which page could be | changing every millisecond. | MatmaRex wrote: | You don't, but instead you can jump to an arbitrary place in | the results. For example, you could show results starting | from the letter P, or show results starting from 2022-04-02. | hnuser847 wrote: | Spoiler: you can't. | thaumasiotes wrote: | But the entire concept is that this is an adaptation to the | fact that data may be added to or removed from the | database. If that's true, there would be no benefit in | jumping to a specific page - there's no guarantee that that | page will display any particular data. | croes wrote: | Can't he just use rowversion? | | No need for row value syntax and it works with MS SQL Server | jvolkman wrote: | For anyone curious about this approach, I've posted an excerpt | [1] of the shared Python + SQLAlchemy + Postgres code we use to | handle pagination at my company.
| | The name "cursor pagination" is super confusing given that | "cursor" is an overloaded term in databases. I always call this | "token pagination", given that the APIs I've seen usually call | the value a token. | | [1] | https://gist.github.com/jvolkman/b8c0e3d05929a1506c99fbc9474... | hirundo wrote: | My sympathies to the coders downstream of this large change for | the amount of work required. I would like to _add_ cursor based | pagination to the APIs I manage. It would not be an option to go | to our clients and explain that offset-pagination will be | removed. There seems to be very little cost in supporting both. | mgkimsal wrote: | I didn't see anything specific about timelines. I would expect | you'd want to offer both for some length of time, with some | deprecation notice and a sunset date for the older approach. | But perhaps they see from their logs that the current | offset approach is used minimally already, and it's better to switch now | vs later if/when adoption is higher? | drdec wrote: | I was hoping to read about how they handled cleaning up cursors, | what the effect on the server was when having a bunch of open, | long-running cursors, etc. Unfortunately the article only treated | the subject at a superficial level. | | So, anyone here implement pagination via cursors? What do you | find to be the drawbacks and how do you mitigate them? | imbusy111 wrote: | You probably assume they are talking about database cursors. | The cursor is just a record ID in this article. There is no | long-term storage of database cursors on the server. Assuming | you can sort all your data, the next query just returns all records | after the given record ID, plus the record with that ID. | | One corner case would be if the cursor record is deleted. I | don't see it mentioned how they handle it. | [deleted] | erikpukinskis wrote: | Does that mean you have to scan the entire result set to get | the right page?
So if I am on page 100, I have to query pages | 1-99 and discard them? | | Or is there a trick here I'm missing? | imbusy111 wrote: | There are no pages anymore. You fetch a record by ID and | next N records. | nerdponx wrote: | I think the point is that clients are not even given the | option to think in terms of "page numbers". | | What's the use case for needing the 100th page of a query | result, that also doesn't allow you to cache the 100th page | locally to retrieve it later? | erikpukinskis wrote: | Yah, another downside of cursor-based pagination is: what | happens when the record the cursor refers to is deleted? | | Do you just crash and ask the user to start over? | | Do you have to nudge open cursors on every delete? | hamandcheese wrote: | Instead of the cursor being an ID, it could directly encode | whatever column(s) you are sorting by. Then you don't have to | locate any record in particular, you can always return | records that sort after the cursor. | simonw wrote: | My implementation of cursors works by encoding the primary ID | of the last row on the page, along with additional | information corresponding to the sort order if that's needed. | | That way it doesn't matter if the record is deleted - I can | still return the next page by showing records that come after | that provided cursor value. | | There's an example on this page: | https://latest.datasette.io/fixtures/sortable?_sort=sortable | | Since the table is sorted by the "sortable" column, the next | page link includes this: ?_next=15%2Cg%2Cz | | 15 is the last value for "sortable" on the page, then g,z are | the compound primary key for that last row. | erikpukinskis wrote: | Interesting... what happens if new records are added with | that 15 value? Do you need an implied secondary sort with | the record created time? | | Also what if there are more than $page_size records with | that 15 value? 
| simonw wrote: | Yes, if you want to support new records being added it's | up to you to include something like a created date as | part of your specified sort order. | | More than page_size records with that value works fine - | that's why the primary key is included as a tie-breaker. | christophilus wrote: | It's not a cursor as in a relational database cursor. It's a | cursor, as in an ID of an item in a stably ordered set. There's | no long-running anything to be worried about. | radiojasper wrote: | ReactiveJelly wrote: | If we're still using SQL, what did PHP do correctly to make | pagination easier? | masklinn wrote: | A: nothing whatsoever, gp was just dealing with low offsets, | or ignoring the issue. | WesleyJohnson wrote: | At my current job, our intranet site has lackluster performance | due, in part, to limit/offset pagination. Unfortunately, the | business treats the "reports" we author like glorified | spreadsheets and want the ability to filter on any column and | order by any column. It makes it near impossible to | tune/optimize. | | The legacy system we're replacing used cursor pagination in a lot | of areas and was perfectly acceptable to them, but now that we're | going web based - it's not. Unfortunately, really - it seems | vastly superior. | yrgulation wrote: | "It makes it near impossible to tune/optimize." | | I recommend using elastic search or a nosql database to | optimise performance. Relational databases can be slow for this | use case. | jitl wrote: | What kind of NoSQL database are you thinking about? What | strategy would you take with that database to optimize this | problem? | yrgulation wrote: | This is hilarious - i am sharing my knowledge and getting | downvoted for it. | | Anyway, the gist of it is that you store data in | denormalised documents whereby searchable columns become | keys of a single entry. The storage is a secondary storage | not the main data source. 
You write data in both - sync in | your relational db, async via a queue or what works best | for your infrastructure. Searches are then made against it. | If you go for ES you can rank results, assuming filtering is | done using free text search. I prefer ES; the flavour of | NoSQL doesn't matter, but ES is great for free text search. | nerdponx wrote: | You said in a sibling comment that denormalized analytics | tables are a hack, but I don't see how this is any less | of a hack. It's literally the same technique, except now | you're adding substantial operational complexity with a | whole extra database server (versus just extra tables in | the same database). And it does not at all solve the | problem of needing to figure out _which fields to | denormalize_ and which ones to leave in separate | documents/collections. | | And even if you do decide that it makes sense to split | your analytics data into a separate database system | entirely, document-oriented databases "ain't it". | | I have very little experience with Elasticsearch, | although I'm surprised to hear it being recommended for | anything other than full text search. GP was talking | about spreadsheet-style filtering, not full-text search. | | But I do have a good amount of hands-on experience in | production with Mongo, and I can say for sure that it is | absolutely not a good option for an analytics database, | and that I would much rather deal with traditional joins | and filters. Even using it as a primary "application | database" for a web app was a constant headache from | beginning to end. I never thought I would miss MySQL so | much, and I would never consider using Mongo again unless | I had the very specific use case of storing a high volume | of semi-structured data with no consistent schema, and | even then I would convert the data to a more traditional | tabular/relational format for use in a data warehouse if | the business analysts or data scientists needed to work | with it.
| yrgulation wrote: | I hate mongodb with a passion. It may very well be that | we are misunderstanding ops use case and making | assumptions. My assumptions are: 1) many types of reports | (as such many tables, es can create docs on the fly), 2) | reports are made of many rows (otherwise why compare them | with spreadsheets). My second assumption is that once you | add pagination you can no longer ctrl f for content, you | need full text search. | | For a set of reports with consistent column names and | values made of aggregate or static data what you are | proposing works fine - since you can just increase | counters or cache values as data comes in. | | But for a use case where different types of reports have | varying columns you can just dump everything into es | documents and run basic aggregate queries. Or you can | precompute data when making inserts. | | Anyway, i am assuming too much about the use case, my | bad. I'd have to hear more about it to defend my point. | nerdponx wrote: | This sounds pretty typical for "analytics" workloads, which | relational databases handle just fine. Maybe by "noSQL" | they just meant something with column-oriented storage? But | even that seems like it might be overkill, compared to | setting up a couple of denormalized "analytics tables" to | avoid the cost of complicated joins. | yrgulation wrote: | Yeah with all due respect but hacks like these are a bit | amateurish. I heard of a dude i think at intuit building | their queues in a relational db because they work "just | fine". Prompted a giggle or two. Use the right tool for | the task at hand, dont do clever hacks as they bite back | later on. | jvolkman wrote: | "works just fine" might be a perfectly reasonable | tradeoff if it avoids adding additional architectural | complexity. | yrgulation wrote: | Depends what you define as complexity. What i described | is trivial. Unless the userbase and or data are small. 
| andy800 wrote: | Why does every web site default to aggressively paginating their | information? Pagination sucks, it's a waste of time and of | clicks, and should be a last resort. Sure, when Google returns | millions of search results, paginate. But: | | _For instance, if you have 40 entries and specify that there | should be 10 items per page_ | | 40 entries??? Just show them all to me. My browser has a | scrollbar, CTRL-F is much faster than your search box. | | "but not everyone has a reliable fast connection" -- yes, which | is a good reason to deliver more data per HTTP request rather than | breaking it up and requiring lots of slow requests. | | "but the database load" -- half the queries, each returning 2x | data, is almost always going to be easier on an RDBMS. If it's not, | then you probably need to rethink your schema. | ilammy wrote: | > _Why does every web site default to aggressively paginating | their information?_ | | You get to see more ads while flipping through pages. | | Timing metrics for loading smaller pages make marketing happy. | | Timing metrics for time spent on the website make marketing | happy. | erikpukinskis wrote: | Worth noting that another solution to the problems with offset | cursors is to do updates. When you add/remove rows you just | update the result set and then there's no issue with missing or | duplicated rows. | | Not easy to do of course, but it's one direction you can go. | debrice wrote: | The pagination con given in the article is wrong: switching to | streaming doesn't fix the dropping issue. Removing sorting is | likely why no records are being dropped (you wouldn't drop | records using a numerically sorted ID or creation date). | | Pagination is incredibly useful to humans. If I tell you I found | something on page 15 you can relate to it, something I cannot do | with infinite scroll.
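[Editor's note: a minimal sketch of the compound-cursor scheme simonw describes upthread - encode the last sort value plus the primary key as a tie-breaker, then compare with row-value syntax (the same "row value syntax" croes mentions; SQLite 3.15+ shown here, with hypothetical table and column names). Because the query only asks for rows strictly greater than the cursor pair, duplicate sort values and deleted cursor rows are both handled.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (pk INTEGER PRIMARY KEY, sortable INTEGER)")
conn.executemany("INSERT INTO rows (sortable) VALUES (?)",
                 [(i % 3,) for i in range(1, 10)])  # many duplicate sort values

def page_after(cursor, limit=4):
    # cursor is (last_sortable, last_pk); (None, None) means "from the start".
    # The primary key acts as a tie-breaker, so rows with equal `sortable`
    # values still have a total order, and the cursor row itself need not
    # exist for the comparison to work.
    if cursor == (None, None):
        sql = "SELECT sortable, pk FROM rows ORDER BY sortable, pk LIMIT ?"
        args = (limit,)
    else:
        sql = ("SELECT sortable, pk FROM rows WHERE (sortable, pk) > (?, ?) "
               "ORDER BY sortable, pk LIMIT ?")
        args = (*cursor, limit)
    rows = conn.execute(sql, args).fetchall()
    return rows, (rows[-1] if rows else None)

p1, cur = page_after((None, None))
p2, _ = page_after(cur)

# Deleting the row the cursor points at does not break the next fetch:
conn.execute("DELETE FROM rows WHERE pk = ?", (cur[1],))
p2_again, _ = page_after(cur)
```

This is the property debrice and others are debating: the cursor names a position in the ordering, not a stored server-side object, so nothing has to be cleaned up and deletions cannot orphan it.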
| bjourne wrote: | Ime, concurrent updates to the data set aren't a problem in | practice and nobody cares if they occasionally get duplicate | entries. The cluttered and sometimes useless URLs cursors cause | are, again ime, a much bigger usability problem. Sure, naively | implemented queries suffer if the user types ?page=123456 in the | URL, but such problems are quite easy to fix. | nickkell wrote: | How do you fix those problems then? Let's say you have a "last | page" button in the UI for example | dmacedo wrote: | I've used the conditional that if the current page is greater | than the last page, just return the last. And the same with negatives, | just returning the first. If records are updated/deleted | and the last page changed, then you'll just get the results | of the "new last page". | | At scale you might care about the duplicate or up-to-date | records. But cursor-based doesn't solve the problem if a | results page is left open for "too long" and stuff was added | behind your cursor (or after it, if navigating backwards). | | It's as if making things less intuitive (per the article's | reference to book pages) somehow makes it easier, as long as you | don't think about any pitfalls. | | My suggestion is to just use pages, and optimise for the | right order (i.e. sequential IDs, creation date, | alphabetical, etc.) that makes sense for your data. | | If you REALLY must care if results have changed, some delta | being stored would be best (like some timestamp that allows | the server side to indicate "hey, your results are out of | date, from 7 days ago, maybe you left that page/API response | unused for too long") | bjourne wrote: | Don't have a last page button. :) Or limit the number of | results to, say, 1000, which is trivial for an rdbms to | handle. Or precompute the result set's rankings and transform | the offset-limit query into a "where rank >= 991 and rank <= | 1000" query.
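[Editor's note: bjourne's precomputed-ranking idea can be sketched as follows. SQLite is used for illustration with made-up names; a real system would refresh the rank table as data changes and likely do so incrementally. The point is that "page 100" becomes an indexed range scan on a precomputed rank column instead of OFFSET 990.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, score INTEGER)")
conn.executemany("INSERT INTO articles (score) VALUES (?)",
                 [(i * 7 % 1000,) for i in range(1, 1001)])

# Precompute each row's rank under the display ordering into a side table
# (window functions need SQLite 3.25+), then index the rank.
conn.execute(
    """
    CREATE TABLE article_rank AS
    SELECT id, ROW_NUMBER() OVER (ORDER BY score DESC, id) AS rn
    FROM articles
    """
)
conn.execute("CREATE INDEX idx_rank ON article_rank(rn)")

# The offset-limit query becomes bjourne's "where rank >= 991 and
# rank <= 1000" range scan.
page = conn.execute(
    """
    SELECT a.id, a.score
    FROM article_rank r JOIN articles a ON a.id = r.id
    WHERE r.rn BETWEEN 991 AND 1000
    ORDER BY r.rn
    """
).fetchall()

# Sanity check against the naive offset query:
naive = conn.execute(
    "SELECT id, score FROM articles ORDER BY score DESC, id LIMIT 10 OFFSET 990"
).fetchall()
assert page == naive
```

The trade-off is staleness: ranks are only correct as of the last refresh, which is exactly the concurrent-update caveat the rest of the thread discusses.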
| hamandcheese wrote: | Cursor based pagination doesn't actually solve the described | pitfalls if your results are ordered by anything mutable, though. | | A result you haven't yet seen can be mutated to sort before your | current cursor; likewise, a result that you've already seen can be | mutated to sort after the current cursor, causing you to see it | twice. | | Cursor based pagination does minimize the issue somewhat, because | only the mutated rows are possibly affected, not unrelated | records that lie on page boundaries. But depending on the use | case I'm not sure that is worth the added complexity of cursor | based pagination (it does get a bit tricky once you have any | non-trivial sort clause). | simonw wrote: | The only approach I can think of which might be able to handle | mutability of the rows used for sorting would be to support | paginating through a snapshot: provide a mechanism whereby a | full snapshot of the query at a specific point in time is | captured such that the user can then paginate through that | snapshot. | | Expensive to implement, so this would only work for situations | where you can afford to spend significant storage and | computation resources to keep a specific user happy. | mikedouglas wrote: | This theoretically should be possible with MVCC, right? It's | not an area I've explored and I could immediately see some | issues with resource clean-up, but I could imagine it being | possible with most modern DBs. | vbezhenar wrote: | Yep, keep the transaction open with the necessary isolation. But it | requires very thorough design of queries, as you'll run | into locks pretty quickly. MVCC is not magic. | twic wrote: | Or using an Oracle-style AS OF query: | | https://oracle-base.com/articles/10g/flashback-query-10g | mbreese wrote: | You could also store timestamps as part of the records. Then | your query will always be consistent if you add an extra | clause of tstamp < query_tstamp.
So long as you store both | the query_tstamp and the last record, you'll get consistent | results without needing to store individual cursors/snapshots | for each user. (You'll still have more CPU time per query, | but that's kind of a given here). | | You probably also need to switch from deleting records to | adding an archive bit (or timestamp). | | This gets complicated fast. | kibwen wrote: | That might be even worse (or just as bad) for users, as now | they won't see any updates to the underlying data set even if | they want to, and will presumably need to perform some | explicit action to get a new snapshot. | | Personally, if you care about users not missing _any_ item in | a query, you just can't use pagination at all, and you have | to give them every item in the query in a single huge dump | (running the query again would be the "explicit action" | mentioned above that gets the user new data). Conversely, if | you use pagination, users are free to assume that they might | miss some items unless they already expect the underlying | data to be immutable. | simonw wrote: | This post is not about database cursors. It's about the style of | pagination where you have a ?_next=xxx link to get to the next | page, where the xxx bit encodes details about the last item on | the current page such that the next page can show everything that | comes after that record. | | This is also sometimes known as keyset pagination. My favourite | technical explanation of that is here: | https://use-the-index-luke.com/no-offset | jlg23 wrote: | This post is about "database cursors" and "keyset pagination". | In practice, these terms refer to the same thing, one seen | bottom-up, the other seen top-down. Implementation-wise, one | saves the state of the cursor in the pagination [parameters] | and resumes reading from the DB with an equivalent cursor.
| nextaccountic wrote: | > these terms refer to the same thing | | No, cursor is this | https://en.wikipedia.org/wiki/Cursor_(databases) | https://www.postgresql.org/docs/current/plpgsql-cursors.html | | I once did pagination using database cursors, which is | something different than keyset pagination: The server would | keep a cursor open and keep fetching more data from the same | query. This enabled the system to have an interface similar | to offset pagination (you get the first page, then the second | page, etc) but without doing a new query for each page | discarding the first n-1 pages per query | | The downside is that it makes the server stateful, and | doesn't scale (you would need to keep hundreds of cursors | open if you had hundreds of simultaneous users) | orf wrote: | Words can have two meanings. Cursor pagination and key set | pagination do indeed refer to the same thing. | | _database_ cursors are a different thing. | mbreese wrote: | Yes words can have two meanings, but given the similarity | of context between these two meanings, calling this | cursor pagination is not a great idea. It screams of | someone that didn't know about database cursors (which is | only one implementation method) trying to describe a | method for web site pagination. I'm not blaming the | author here for this, as they likely know the difference. | But for a new developer trying to Google this, it will be | very confusing. ___________________________________________________________________ (page generated 2022-08-21 23:00 UTC)