[HN Gopher] You don't need a CRDT to build a collaborative exper... ___________________________________________________________________ You don't need a CRDT to build a collaborative experience Author : zknill Score : 142 points Date : 2023-11-16 13:36 UTC (9 hours ago) (HTM) web link (zknill.io) (TXT) w3m dump (zknill.io) | MontagFTB wrote: | One of the main points of this article is to "just use locks", | which glosses over a lot of technical complications about locking | elements within a document. How long is the lock held? Can it be | stolen from a user who has gone offline, or is still online but | off to lunch, and we _really_ need to make this change before the | presentation in an hour? What if the user comes back online and | they have changes, but the lock was stolen - how are those | changes reconciled to the document? | | I am generally in favor of simpler is better, and if there is a | way to build a collaborative experience without using CRDTs, then | go for it. However, sometimes the cure can be worse than the | disease, and solutions like locking may introduce more technical | complexity than originally thought. | dsmmcken wrote: | > How long is the lock held? | | For however long the user has a browser focus state on the | element seems like a reasonable answer, and submit changes as | they are made. However, I don't know how you resolve conflicts | of two users simultaneously attempting to acquire a lock. | Arelius wrote: | > However, I don't know how you resolve conflicts of two | users simultaneously attempting to acquire a lock. | | It turns out you just have to pick one... This all depends on | a source of truth, and when you are there it's easy to pick | one, say based on whichever arrived at the network interface | first. | filleokus wrote: | The server must keep track of the locks, and it can only know | about the lock being released if the client tells it. E.g by | sending a message that the field is not focused any more. The | tricky thing is in the "edge" cases, or really the non- | perfect cases, which there are plenty of (as I think GP | described). | | The server can decide that the client is offline if the | server misses expected heartbeat messages from the client. | But how often will those be sent and how long grace period | will we allow? If it's too short then it will be unreliable | on shaky 4G connections, if it's too long then it will be | annoying in the other direction. | | And that's not considering the "social" problems with locks. | I've worked on replacing a system that was lock-based with | CRDTs where the lunch scenario from MontagFTB actually was a | common occurrence. | | In an "ideal" scenario your lock acquisition problem is not | hard. Client's just show the UI optimistically and whoever | the server decide was first gets the lock. The loosing client | throws any state the user created in the short time-frame. | Over reliable and fast connections for granular locks, this | works fine. But in the real world that's just one of the | issues with a lock based approach... | jitl wrote: | Having used an app with locks like this (Quip) I can tell you | it really sucks to have a lock lease >10s after the last | edit. The "lock" should be a UI level concept anyways - | clients should broadcast "hey I'm taking field X"; if you | have two users submit a write+lock request concurrently | you'll need to pick a winner anyways. | chii wrote: | By exploring that locking and unlocking mechanism, you will | find that the logical conclusion in the end, when enough | complexity and edge cases get covered/fixed as bugs, that it | turns into a crude form of "CRDT" (where it's not actually | consistent, but merges within reason for 99% of use cases). | | It might as well have been CRDT from the get go. | zknill wrote: | You're absolutely right, locking is a hard problem. Especially | when you get into the edge cases of clients not disconnecting | cleanly. | | If you've ever used one of those awful early-2000s Microsoft | word collaboration systems where you have to check-out the doc, | and remember to check it back in, and no one can use it until | you've checked it back in... it's awful! | | I'm not directly in this team, but one of the teams at my | company have been working on this problem. They call it | "Spaces", and one of the features solves this component locking | problem. | | https://ably.com/examples/component-locking | charles_f wrote: | That's true for collaborative experience. Crdts are a mechanism | to handle eventual consistency (that's even the preface of the | paper). If you assume that said collaborative experience is | always online, you don't need them, and "using locks" as you | described is probably enough. | | If you want a mechanism to handle that eventual consistency, it's | probably better to reuse their principles rather than reinventing | something that will eventually ressemble Crdts. | | You mentioned "offline first", I think it's probably a good place | to pluck that ib https://www.inkandswitch.com/local-first/ | insanitybit wrote: | > Ever-growing state: for CRDTs to work well they need to keep a | record of both what exists, and what has been deleted (so that | the deletes aren't accidentally added back in later). This means | that CRDT state will continually expand. There's a bunch of magic | that CRDT library authors are doing with clever compression | techniques to make this problem less-bad, but it's basically in- | escapable. The size of your CRDT state is not purely a function | of the size of the state the CRDT represents, but also of the | number of updates that state has gone through. | | A) This is only the case for certain CRDTs, such as sets that | support deletion - so, if you want Set semantics with deletion | support, you need two sets, one to that tracks all deletions and | one that tracks all insertions. | | B) You can garbage collection your sets. They don't have to grow | forever. | | > Complex implementations: CRDTs are easy to implement wrong, so | probably don't roll your own. | | Personally, I've never done this. I've just added a `merge(&mut | self, other: &Self)` method to structs in Rust. Guaranteeing CRDT | properties is often trivial, or at least it was in my case. | | > Opaque state: Because the CRDT has to represent both the | underlying state and the updates that led to that state | | Again, this is only if you need specific operations on your CRDTs | _and_ if your CRDTs are encoded in specific ways. | | I've said it before, but a trivial crdt looks like this | struct Grows(u64); impl Grows { fn | merge(&mut self, other: &Self) { self.0 = | max(self.0, other.0); } } | | et voila? Obviously you lose all intermediary states, but since | that is specified to be a negative thing, I just want to be clear | that it's often optional. | | > So maybe you are convinced that CRDTs are not the be-all-and- | end-all of collaboration, and that you aren't in one of the two | categories where you probably should use a CRDT, and you've made | it this far in the post. | | I am convinced that CRDTs are not the be-all-and-end-all, because | Strong Eventual Consistency does not provide strong enough | guarantees for all use cases. | | Once again we have a CRDT article that's about user | collaboration, which I find somewhat frustrating because CRDTs | can be used in far more places than that, and user collaboration | is like _the_ most complicated thing you could ever write since | it 's all of the problems of a distributed system _and then we | add humans into the mix_. There is no "good" solution to this | problem - CRDTs aren't going to solve it, and neither is any | other algorithm, because it's not possible to encode every | possible state update in a way that never conflicts and is also | what a human expects (especially since humans have varying | expectations). | | The algorithm/ approach, as described, seems perfectly fine - it | will have edge cases just like CRDTs will. In reality, for such | an impossibly complex problem, you're probably going to end up | with something really complex to solve it. You're almost | certainly going to start adding CRDT-like operations, like "ok | technically this user held a lock on X, but the other user | performed an operation on X that technically commutes, so we can | allow both" to alleviate some of the inherent complexities (and | UX issues) with locking. | saqadri wrote: | You may not need CRDT per-se, but building a collaborative | experience is so difficult. I worked on collaborative systems for | a bit, and also have read a bit about how Figma and Notion do it | (this is a good read: https://www.figma.com/blog/how-figmas- | multiplayer-technology...) -- it's still super hard to get right. | | This talk by Karri about Linear's "sync engine" is also a good | watch: https://www.youtube.com/watch?v=Wo2m3jaJixU. | jitl wrote: | I agree broadly with the article's position but I think locks are | more harmful than helpful. When I was a Quip user (2018) it was | super frustrating to get locked out of a paragraph because | someone's cursor idled there. Instead just allow LWW overwrites. | If users have contention and your sync & presence is fast, | they'll figure it out pretty quick, and at most lose 1-2 | keystrokes, or one drag gesture, or one color pick. | | Notion is "collaborative" and we don't use a CRDT for text, it's | all last-write-wins decided by the server. However our LWW texts | are individually small - one block/paragraph in size - and | adding/moving/removing blocks is intention-preserving if not | perfectly convergent. | | As the article says, the downside for LWW is that "offline" / | async collaboration isn't so great. That's why we're working on | switching to CRDT for our texts. If you're interested in bringing | CRDTs to a product with a lot of users, consider joining Notion's | Docs team - https://boards.greenhouse.io/notion/jobs/5602426003 / | @jitl on Twitter / jake@makenotion.com | btown wrote: | Are you considering having a CRDT for each text block | individually, or moving to a CRDT for the entire data model for | a document? Really curious about the design approach here, | especially insofar as there's now an external API that the data | models need to service! | jitl wrote: | It's Complicated (tm), a hybrid of the two. Text content | across multiple CRDT enabled blocks may be connected in a | single logical CRDT, but the segment visible in any given | block is stored in that block object, which maintains our | permission system invariants. We'll still support moving | blocks to different pages, so one page may have a few | different CRDTs visually interleaved. | | All the edge cases are very interesting, like what happens if | I use backspace to join a block of CRDT B into block of CRDT | A. | | API consumers are unaffected. Likely we'll support API | updates of text as best-effort diff-match-patch of the text | on our side applied as CRDT ops. | npunt wrote: | always great hearing your perspective on the notion text | engine jake, thanks for taking the time to explain | everything! | zknill wrote: | So Last-Write-Wins (LWW) basically _is_ a CRDT, but not in the | sense that anyone really expects, because they aren't that | useful or intention preserving. Especially if the two writes | happen in very quick succession / concurrently. | | LWW becomes useful if you can: | | a) help humans to see who is doing what on a doc | | b) reduce the size of the change that is LWW | | As you've said: | | > However our LWW texts are individually small - one | block/paragraph in size | | This is really important, because by reducing the size of the | individual element that any single user is updating you can | also reduce the chance of conflicts. With a small amount of | contextual feedback (like the notion icon next to the block) a | lot of the problem cases are just avoided. | | Clearly locking and updating the entire document would be | terrible, but if you can do it on a small enough scope that | others can change other elements, it can work really well. | | (If you've worked on the notion things you're describing then | I'm sure you know this better than I do, but just spelling it | out really clearly.) | jitl wrote: | You can have a LWW CRDT, but not every LWW is a CRDT. LWW | CRDTs generally pick a winner based on causal order which is | _convergent_ , the C in CRDT, because every peer receiving | the same ops in any order would pick the same winner. | | Picking a winner based on wall clock time order (as suggested | in the article, and implemented by Notion) is not convergent; | if peers used that algorithm to apply ops I they would not | converge to the same state. Instead you need an authority | (your server) to essentially pick an arbitrary order of ops | and clients just have to deal with it. | | A practical example is to consider three users (A, B, C) | working on a document. | | 1. A, B, C start online working together. | | 2. C goes offline | | 3. A, B stay online, and make 100s of changes. | | 4. C makes a few changes while offline (not sent to other | peers yet) | | 5. C comes back online and sends their changes to (the server | / other peers). | | In wall-clock LWW, C's changes will overwrite A & B's | changes, even though A & B have done a lot more work. | | In a causal ordering CRDT implementing LWW, C's changes will | "lose" to A & B's changes, since they were actually made | "earlier" in causal ordering. | | > Clearly locking and updating the entire document would be | terrible, but if you can do it on a small enough scope that | others can change other elements, it can work really well. | | I'm sure good UX is possible with locks, but I haven't used | one yet for document editing. I'm still traumatized from Quip | which did per-paragraph (per block?) locking and was really | annoying. Eventually they added an unlock button next to the | block so anyone could "steal" the lock at will due to all the | user complaints, but at that point, why put it in at all? | maclockard wrote: | I understand what you are saying here in terms of the | difference between using wall-clock or causal ordering to | determine who 'wins' for LWW. However, both of these | strategies seem convergent to me? In any case, all clients | will agree on whose changes win. | | 1. With wall-clock decided by clients, A + B changes will | win since C's wall-time is earlier (yes, C could lie, but | still would converge). | | 2. With wall-clock decided by server C will win and | everyone will agree. | | 3. With causal ordering, everyone will agree that A + B | won. | | 2 is not a CRDT since it requires a central server, but I | think 1 would still count? Or stated another way: I'm not | sure the _convergence_ is what determines if these | strategies are CRDTs or not, but rather whether or not this | decision making is _distributed_ or not. | asib wrote: | > ...based on causal order which is convergent, the C in | CRDT... | | Doesn't the C stand for conflict-free? I suppose both are | kind of getting at the same idea though. | lewisjoe wrote: | > everyone's gonna say "but hey, google docs uses operational | transform not CRDTs".. OK yes, but you are not google. | | Well, google docs works not because they somehow figured out how | to converge OT edits with as much precision as CRDTs do, but | simply because they have a central server which orders edits | anyway and don't need true leader-less convergence. | | In fact, I agree not many things don't need a CRDT. CRDTs help | with mathematical rigidity of convergence when you want true | peer-2-peer collaboration which works without any central | authority. | | However, most apps anyway work on top of a central authority | (example SaaS apps) so there is no real reason to accomodate all | the compexity that comes with CRDT. They might get far with a | simpler OT based model + central server based ordering. | | For example even Figma doesn't call its model a 100% pure CRDT. | It's a partial, simpler CRDT implemented with an assumption that | there's going to be a server that understands ordering. | | It's the same with Google Docs. They don't need a CRDT because | it's a cloud app after all, which means OT is more convenient | with major heavylifting (ordering and conflict handlings) | outsourced to the server. | josephg wrote: | Yeah. Text based OT is pretty simple to implement, too. It's | about 200 lines of code, plus a simple network protocol. It's | fast and relatively straightforward to code up, and unlike text | CRDTs it's pretty fast by default. | | I use it as my standard test when trying out a new programming | language. It was unexpectedly ugly in go because of go's lack | of enums & unions, and that's one of the big reasons I never | got in to programming Go. | jay-aye-see-key wrote: | Operational transforms are one of those interesting | technologies I've wanted to learn by implementing but haven't | made the time yet. I also didn't realise it could be | implemented in that little code. | | Can you recommend any learning resources for implementing an | OT? (Ideally typescript, python, or rust) | jahfer wrote: | I haven't explored this space in a while, but I have a | couple of examples that could be helpful. A Clojure library | of mine [0] has a decent README with some background | reading on how to use operational transform. | | I also reimplemented it in a surprisingly tiny amount of | OCaml [1] which was a fun way to learn that language :) | | [0]: https://github.com/jahfer/othello [1]: | https://github.com/jahfer/othello-ocaml | dugmartin wrote: | This single file shows the entire set of OT transformations | (retain, insert, delete): | | https://github.com/Operational- | Transformation/ot.js/blob/mas... | | and this is a good post outlining the basics of OT, from | the creator of CodeMirror: | | https://marijnhaverbeke.nl/blog/collaborative-editing- | cm.htm... | antidnan wrote: | I don't think you need a pure CRDT either but I think locking and | presence is a bit of an oversimplification. | | LWW is a good place to start, and updating the smallest piece of | information possible is the right idea in general but there is a | lot more nuance to handling complex applications like a | spreadsheet (I'm working on one) and whiteboard apps. | | Things like reparenting or grouping shapes [1], or updating | elements that aren't at the lowest scale like deleting a row or | column in a spreadsheet make locking challenging to implement. Do | you lock the entire row while I'm moving it? Do you lock the | entire group of shapes? | | With the exception of text editing, the popular libraries like | Yjs don't just give you a perfect CRDT out of the box. You still | have to construct your data model in a way that enables small | scale updates [2], and CRDT libraries and literature are the best | source of thinking for these problems that I've found. | | [1] https://www.figma.com/blog/how-figmas-multiplayer- | technology... | | [2] https://mattweidner.com/2022/02/10/collaborative-data- | design... | earthboundkid wrote: | I've seen locks used at the CMSes of large news organizations. | It's fine, but they all need a mechanism to kick out an editor | who has an idle tab left open. For my own small scale CMS, I just | wrapped Google Docs and let them handle all the syncing | headaches. | ivanjermakov wrote: | Conflict-Free Replicated Data Type (CRDT) is a type of data | structure that enables concurrent updates across multiple | replicas without the need for coordination between them. | jes5199 wrote: | CRDT is a different paradigm. Ideally we'd use it to _replace_ | client-server | iamwil wrote: | I think we can. Following all the implications down the rabbit | hole, you end up with a system architecture where there's no | front-end, and there's no back-end. | | Since CRDTs are still kinda new for most people, the discussion | hasn't really gotten there yet. | yencabulator wrote: | And wrangling CRDTs into a client-server architecture is | actually extra work. The basic design assumes you can trust | every peer -- making sure that jes5199 actually was the peer | who added a comment that is marked as author: "jes5199" is not | trivial. | spion wrote: | For offline first apps, or for applications where very high | degree of control for the content is needed (e.g. legal docs) and | realtime collaboration isn't that valuable, there is also the | option to use 3-way merge instead. | | The benefit is that you can even allow the user to resolve | conflicts in a satisfactory way. | | Another benefit is that the document doesn't even have to be | derived from the original, it could go through exports and re- | imports and it will still be possible to run a 3-way merge as | long as a common base version is declared. This can be especially | covnenient for systems that involve e.g. MS Word. | namelosw wrote: | That's not gonna work for real-world projects. Real-world apps | often have larger edits than locking individual cells/cards e.g. | Move columns or replace large chunks of spreadsheets in Google | Sheets, or Ctrl-A to select all and then drag to move. | | Also, if you consider latency, locking does not work well because | client B might do operations before he/she even acknowledges the | lock from client A because of latency. | iamwil wrote: | > Ever-growing state: for CRDTs to work well they need to keep a | record of both what exists, and what has been deleted (so that | the deletes aren't accidentally added back in later). This means | that CRDT state will continually expand. | | I guess a couple things: | | It depends on the CRDT. Some CRDTs grow with the number of | replicas and others with the number of events. | | State-based CRDTs don't need to keep history and don't need | causal ordering of messages, but internal bookkeeping grows with | the number of replicas. And for large states (like sets and | maps), it can be prohibitive to send the state all over the wire | for an idempotent merge. | | That's why in practice, people implement Op-based CRDTs, which | makes the trade: in order to send small ops over the wire, we now | need causal ordering of messages. To make sure we can sync with | replicas long offline, we keep as much history so that they can | catch up. | | There are other variations, such as delta-state based CRDTs that | send diffs, and merkle CRDTs, which use merkle data structures to | calculate diffs and detect concurrency, which have different | growth characteristics. | | --- | | As for a growing state: Is this actually a concern for devs that | aren't using CRDTs for collaborative text? I can see that being | an issue with the amount of changes that can happen. | | But outside of that, lots of data don't grow that fast. We all | regularly use Git and it keeps a history of everything. Our disks | are huge, and having an immutable record is great for lots of | things (providing you can access it). | | > Opaque state: ...you're generally left with an opaque blob of | binary encoded data. | | Most CRDT libraries take a document-orientated angle. It assumes | that you can contain the entire "unit of work", like a document, | inside of a CRDT. However, if your data is more relational, it | doesn't quite fit. And while there's immutable data in a CRDT, I | do wish it was more accessible and queryable. In addition, being | a binary blob, it's not exactly composable. I think CRDT | libraries should be composable with each other. | chromatin wrote: | We took a super simple (IMO) approach to collaborative editing in | my current project: | | Each block of text has a version number which must be incremented | by one by the client at the time of submission. The database | provides conflict prevention by uniqueness constraint which | bubbles up to the API code. The frontend is informed of conflict, | so that the user can be notified and let the human being perform | conflict resolution. | | Because most concurrent users are working on different blocks, | this works great. | zknill wrote: | How do you handle getting the changes that one client makes | onto the other clients? Are you pushing it from the server to | the clients with websockets, or waiting for the clients to ask | for new info, or waiting for the conflict to happen when | someone else tries to make a change, or something else? | | I'm thinking a lot about keeping server and client data in sync | while working on our hopefully-soon-to-be-released LiveSync | product[1] | | [1] https://ably.com/livesync | chromatin wrote: | When the client gets HTTP 409 Conflict, it asks for the | current version and presents to the user for manual | resolution | maclockard wrote: | Wrote about something similar a while ago | https://hex.tech/blog/a-pragmatic-approach-to-live-collabora... | | Using a server to tie break and locking has worked pretty well | for us | parhamn wrote: | At this point, given the maturity of libraries (I was exploring | this recently), I think you'd have to make the case that CRDTs | are bad not just "too much". | | Interfacing with the 'blob' is a real thing (y-js is solving some | of this with a rust implementation that has cross language | binding) but generally the things they noted (e.g. a Figma | canvas) aren't things you commonly do joins across and if you did | you'd have an independent indexing store for that functionality. | | With tools like SyncedStore [1] and HocusPocus [2] you end up | with a pretty good, we'll tested, easy to implement base for good | collaboration. | | [1] syncedstore.org | | [2] github.com/ueberdosis/hocuspocus | lmm wrote: | > You can't inspect your model represented by the CRDT without | using the CRDT library to decode the blob, and you can't just | store the underlying model state because the CRDT needs its | change history also. You're left with an opaque blob of data in | your database. You can't join on it, you can't search it, you | can't do much without building extra features around that state | blob. | | So use the CRDT library when building your indices? Or better yet | use a CRDT-aware datastore. This doesn't seem like a real | problem. | | > Locking for safety | | Please don't. You're inevitably going to have lost locks, lost | updates, or most likely both. | zknill wrote: | > So use the CRDT library when building your indices? | | Yeah, sure, you can build a secondary index over the data. But | you're still having to decode the blob and index it. There's no | version where you can index the data without using the library | to expose the underlying model state (like you could if you | weren't using a CRDT). | | On locking, yes, it's hard. But it's not the same kind of | locking that you'd expect in other parts of a system. You're | locking the UI, not the actual data, so it's a tiny bit more | forgiving. In general the locks aren't trying to force | consistency, instead they are trying to prompt the humans to | reduce the chance of conflict happening in the first place. | Ofc, you still have to care about the locking, unlocking, | disconnection problems, etc. | | Here's a decent example/demo you can play around with in | multiple windows: | | https://examples.ably.dev/component-locking?space=W0V-t5oY6A... | | https://ably.com/spaces | swyx wrote: | > I'll run through a bunch of broad categories of applications, | and describe how to make use of these features. | | i love these kinds of taxonomies of apps, because then you can | get specific about tech stack choices. just offering a couple | more that i've come across in my years: | | - more prototypical: 7GUIs | https://eugenkiss.github.io/7guis/tasks | | - Application holotypes: | https://twitter.com/arthurwuhoo/status/1470489178186170374 | aboodman wrote: | It's true that a CRDT is often not the right thing for a classic | client/server application. But this doesn't mean we should just | give up on ux and use locking. | | There are approaches to multiplayer that are client/server | native. By leveraging the authoritative server they can offer | features that CRDTs can't, while preserving the great ux. | | I'm partial to server reconciliation: | | https://www.gabrielgambetta.com/client-side-prediction-serve... | | My product, Reflect, implements server reconciliation as a | service. You can learn more about how it works here: | | https://rocicorp.dev/blog/ready-player-two | zknill wrote: | > But this doesn't mean we should just give up on ux | | It's a little unfair to describe locking as "[giving] up on | UX", especially given some well known collaboration products | use it quite successfully; Google Sheets cell locking, Miro | element/Text locking, etc. | | Ofc, it's going to depend on the scope of what's being locked. | These two examples are quite finely scoped element locks. | speps wrote: | The Wikipedia page for Operational Transformation [1] mentions | Differential Synchronization [2] as an alternative, does anyone | have any experience with DS? | | [1] https://en.wikipedia.org/wiki/Operational_transformation [2] | https://neil.fraser.name/writing/sync/ ___________________________________________________________________ (page generated 2023-11-16 23:00 UTC)