[HN Gopher] Show HN: Pg_CRDT - an experimental CRDT extension fo... ___________________________________________________________________ Show HN: Pg_CRDT - an experimental CRDT extension for Postgres This is an experimental extension for CRDTs, pg_crdt: GitHub repo[0]. It supports Yjs/Yrs and Automerge. The linked blog post [1] describes how we're thinking about this extension in a Supabase context. I want to emphasise this part: "pg_crdt has not been released onto the Supabase platform (and it may never be). We're considering many options for offline-sync/support and, while CRDTs will undoubtedly factor in, we're not sure if this is the right approach." [0] GitHub repo: https://github.com/supabase/pg_crdt [1] Blog post: https://supabase.com/blog/postgres-crdt Author : kiwicopple Score : 233 points Date : 2022-12-10 12:20 UTC (10 hours ago) (HTM) web link (supabase.com) (TXT) w3m dump (supabase.com) | kiwicopple wrote: | For quick viewing, here's how it works in Postgres (using Yjs): | create table posts ( id serial primary key, | content crdt.ydoc default crdt.new_ydoc() ); | insert into posts (content) values (crdt.new_ydoc()); | update posts set content = content || crdt.new_ydoc() | where id = 1; | | The heavy-lifting is done by the teams at Yjs[0] and | Automerge[1], who have re-implemented their JS libs in Rust. This | made it possible for us to wrap them into a Postgres extension | using pgx[2] (which is fast becoming a de-facto tool at supabase) | | [0] Yjs: https://github.com/yjs/yjs | | [1] Automerge: https://github.com/automerge/automerge | | [3] pgx: https://github.com/tcdi/pgx | samwillis wrote: | Super excited to see this Paul, and amused to see the nod to my | comment in the post :-) | | I'm going to enjoy getting stuck into this! (When I'm not stuck | driving for six hours... just a quick coffee break reading | this) | | Support for CRDT columns within a db to enable JSON like | indexing/querying capabilities is fundamental to then building | some exciting further developments. Supporting both Yjs and | Automerge is exactly the right route forward. | samwillis wrote: | Had a chance to take a little more of a look, did you | investigate querying/indexing within the documents? (Didn't | spot any sign in the repository) | | Without the ability to index properties, or do ad-hock | queries, on the documents there is little reason to to the | merging in-DB. It then makes more sense to merge outside the | DB and save both the raw CRDT object and a serialised JSON | next to it for querying. I assume that if you had found it's | performance better, indexing+querying would have been the | next step? | | However I'm not surprised you found performance not ideal, I | suspect a layer in-front of PG throttling writes is the only | way forward for a real-time use case. | | Personally I find the offline + eventual consistency the most | interesting use case. The ability to sync a users document | set to a local db (SQLite or a future mini-pg/supabase) with | _local_ querying and then merge them back later. This could | be as basic as a three column table (guid, doc, clock), the | clock being a global Lamport like clock to enable syncing | only changed documents. | aaronblohowiak wrote: | Read modify write outside the db seems worse though? | kiwicopple wrote: | > _This could be as basic as a three column table (guid, | doc, clock), the clock being a global Lamport_ | | This would be achievable, but also perhaps quite strict - | we're dictating a table structure rather than working with | existing CRDT libs. That said, we could probably do this | today, without any mods/extensions | | > _querying /indexing_ | | Yeah, there is nothing in this implementation for that | | > _Without the ability to index properties, or do ad-hock | queries, on the documents there is little reason to to the | merging in-DB_ | | One benefit is the ability to send "directly" to the | database, but I agree that lack of additional features in | the current form outweighs this benefit | | > _a serialised JSON_ | | This is how we might do it with with "Realtime as an | authority" approach (and is the most likely approach tbh). | Horusiath wrote: | Yrs (Yjs on Rust) maintainer here: we actually had some idea | about building extension to Postgres ;) Great to see that | others share the excitement about this idea as well. See: | https://github.com/y-crdt/y-crdt/issues/220 | kiwicopple wrote: | perfect, we'd love to collaborate with you. I'll jump into | your github after our Launch Week (this week), or feel free | to reach out directly (contact details in my bio) | zombodb wrote: | Always nice to see pgx being used in the wild! Awesome work. | nixpulvis wrote: | I for one respect the need to resolve merge conflicts. | danbruc wrote: | I do not get peoples' obsession with CRDTs. They are useful in a | limited set of use cases when your operations are commutative but | having only commutative operations is pretty rare except for some | almost exotic use cases. Distributed document editing is for some | reason a reoccurring theme but document editing operations are | all but commutative. What people then build with all kinds of ad | hoc conflict resolution strategies is barely worth being called a | CRDT - nothing there is conflict-free, you are just defining | conflicts away. Sometimes that works okay, at least for a limited | uses case, but the resulting operations are rarely general, and | sometimes they are just weird. | preseinger wrote: | > CRDTs ... are useful in a limited set of use cases when your | operations are commutative | | As you correctly observe, very few applications that exist | today have data models with operations that satisfy the | requirements of CRDTs, including e.g. commutativity. But CRDTs | were never meant to be a "swappable" state layer for existing | applications. The idea has always been that applications would | need to change how they interact with state, to a model which | does satisfy those requirements. Such models can absolutely | generalize, it just takes a bit of thinking. | kiwicopple wrote: | I touch on this in the blog post: I | personally believe CRDTs are the future. *For some things.* | It's possible that developers will need to be selective about | their "level" of offline support [example table | for a blog post] Perhaps it's not that | important if the title has a "last write wins" strategy, | because it's rarely updated and less likely to have a merge | conflict. But it is critical to use a smarter algorithm for the | content of the blog post, especially in a team where multiple | users are editing a blog post at the same time. | danbruc wrote: | The C in CRDT is for conflict-free, i.e. there must not be | any conflicts between operations. Incrementing and | decrementing a counter, the final result will be the same | independent of the order of operations. You can almost not | get any further away from a CRDT than last write wins. And | for more complicated merge strategies you are essentially | guessing what the desired result is but you do not know and | everything really depends on the quality of your guesses. | | Like the example with the changed fruits in the list, maybe | one or the other should end up in the final result, maybe | both, maybe none. The correct conflict resolution strategy is | not obvious and highly dependent on the use case, that is why | I said that implementations are usually not general or they | are only general in the sense that you can plug in any | behavior you like. Compare this with a real CRDT like a | counter, there there is absolutely no doubt about the desired | outcome. | AlexErrant wrote: | It's on the application developer to choose the correct | CRDT for the job. Last write wins is there as a CRDT of | last resort - there's a trade off between the very general | but "weak" LWW and the specific but "powerful" counter. (In | my own app, I plan on observing "failed" LWW writes and | surfacing them to the user as notifications or something | transient.) Ideally one could choose a CRDT where the | semantics of an application conflict are the same as CRDT | conflicts, but we don't live in an ideal world. I don't | think you'll find anyone arguing that CRDTs are a silver | bullet. | danbruc wrote: | Maybe my point is just that those things should not be | called CRDTs. There are conflicts in the basic operations | and you just decide on a case by case basis how to | resolve them which makes the entire construct look | superficially like a CRDT but kind of ignores the core | characteristic of CRDTs, namely that the operations never | conflict. | preseinger wrote: | LWWRegister does provide the deterministic guarantee that | operations never conflict. The "problem" is just that the | decision is basically arbitrary. | jamil7 wrote: | The point is that they all eventually converge on the same | value, and can do that independent of some centralised | merging location, rather than the exact way in which they | do that, whether that's LLW or anything else. | danbruc wrote: | Just store all updates - as if you do event sourcing - | ordered by a timestamp. The list of updates will | eventually converge as will any function of that list, | including any function that merges all the updates into a | current state. You might have to [partially] reevaluate | the function in case a delayed update shows up and has to | be inserted into the middle of the list of updates. So I | guess there are additional requirements in order to | qualify as a CRDT, bounded additional space or bounded | amount of computation on updates. But I don't know as it | turned out my understanding of what qualifies as a good | CRDT was too narrow. | jamil7 wrote: | Yeah event sourcing and what you described is what these | libraries are calling delta CRDTs. The appeal is they're | hiding away the actual event log, metadata and | rebasing/merging parts and exposing data types that look | something like what you'd normally use. | jclem wrote: | An LWW register is most certainly a CRDT, it just has | limited use. I think you're making a good point about how | one must be careful applying CRDTs, but I don't think it's | accurate to say that a suboptimal merge strategy in terms | of user experience means that something does not meet the | definition of a CRDT. | | In other words, the "conflict-free" part of the term refers | to the fact that all replicas converge on the same value | when all updates have been exchanged. What you're | discussing seems to have more to do with concerns around | intention preservation when using CRDTs. In other words, an | LWW register on a text field is still a CRDT, but in many | applications it is potentially a poor choice in terms of | preserving the intent of user-generated operations, but | determining this is something done in a case-by-case basis. | | If the content is a long document which users collaborate | on over time, then a register is most certainly a poor | choice. If it is a short text field akin to a label, an LWW | register isn't necessarily bad, because user intent is | likely to replace the entire value, rather than to perform | minute edits on the value. | danbruc wrote: | I just reread the Wikipedia article on CRDTs and was | actually wrong. I only had a subset of CRDTs in mind but | they actually encompass a much wider class than I | thought. In some way I now understand the fuzz about them | even less, they now look almost trivial, but maybe I am | still not understanding some important aspect properly. | Should maybe read the paper introducing them. But at the | very least I should not complain about people calling | things CRDTs that actually fall under the actual | definition. | kiwicopple wrote: | > _But at the very least I should not complain about | people calling things CRDTs that actually fall under the | actual definition_ | | that's... a very reasonable response. Nice one. | | I think the fuzz comes from the more esoteric CRDTs, | which solve actually hard problems (traditionally solved | by Operational Transforms). But I don't think I could | have created a simple example in the blog post for one of | these. | matlin wrote: | Last Writer Wins is not the most compelling use case for CRDT | because if those fruits were denormalized into Postgres rows, | you'd already have the behavior described. A more interesting | example, might be if both users were inserting a new fruit, one | at the beginning and one at the end. What should the sequence of | fruits be after that? Likely a list with both new fruits | included. But it depends on the domain and use case. | | In general, Sequence CRDTs seem to be the most difficult and most | interesting area of continuous research. In my own work, I've | implemented a sequence CRDT to create a distributed log of events | across users that then (because the order is guaranteed to | eventually be the same for each device) can just be reduced to | whatever data structure you need. It means, we end up with one | CRDT to rule them all and then can just write the "conflict | resolution" in simple term with the business logic. | jamil7 wrote: | > Last Writer Wins is not the most compelling use case for CRDT | because if those fruits were denormalized into Postgres rows, | you'd already have the behavior described. | | I'm not sure from the article, but I'm guessing the difference | here is that it will apply the updates in the correct order | (using last writer wins) regardless of when the yjs doc hits | the server. I guess the idea here is you can get decent | offline-ish support without directly storing timestamps, | version numbers, soft-deletes etc and associated logic. | Hixon10 wrote: | Hm, I am a bit worried about using PostgreSQL with CRDT | (Limitations section from the blog). I guess that PostgreSQL | doesn't like frequently updates, because of MVCC. Maybe it is OK | to use PG as a cold storage, if you have another application, | which would work with hot data (like, when users change | something), and eventually store it inside PG. | kiwicopple wrote: | unless we can figure out how to overcome those limitations, | you're right. Potentially some sort of throttling system for | CRDT updates (offloaded to a background worker so that it's | non-blocking?). If anyone else has ideas/approaches which could | work in a PG-only context, we'd love to hear them | Hixon10 wrote: | Maybe you can research a bit pluggable table storage | interface in PostgreSQL. It is possible either to implement | your storage, which works fine with CRDT, or try to find some | storage implementation, which works fine with such usage | pattern (like, https://wiki.postgresql.org/wiki/Zheap ) | cranium wrote: | CRDTs are really cool if your merging function behaves | intuitively _in the context of your application_. You can easily | implement a distributed VCS by using the Last Writer Wins, but it | would be a nightmare to work with - imagine having your | modifications rolled back because someone committed after you. | | In the same way some operations are embarrassingly parallelizable | (think adding elements to a set), some operations are inherently | conflicting: editing elements of the same text, image, | directory,... More than trying to decide _in all cases_ which | modification wins, we should detect when there are multiple | legitimate options and ask the user to choose. Like a merge | conflict in Git for example. | | The application state feels so clean when conflicts can't arise | or are automatically dealt with. But we have to acknowledge there | are only so much datastructures that can merge automatically and | still be intuitive for the user. | preseinger wrote: | +1, I think the most valuable lesson we've learned w.r.t. CRDTs | at the moment is that valid merge operations basically don't | generalize, and basically have to leverage properties that are | application-specific. This isn't a fatal flaw, it just | constrains the design space. | Smaug123 wrote: | Forgive me, I'm still recovering from the flu and am having some | trouble thinking, but I didn't find your "Why not build this into | Realtime?" section very compelling. If you've already got a | central source of truth (so you don't currently need the peer-to- | peer nature of CRDTs), isn't it faster to iterate on your | solution if you implement the merge semantics within the central | source of truth that you have full control over? I'm not | convinced from your post that turning the database into a | component of a peer-to-peer distributed system is predictably | worthwhile; it's certainly predictably going to be very complex. | It smacks to me of spending a great deal of engineering effort to | keep your options much further open than I've seen you justify. | kiwicopple wrote: | Thanks, perhaps I can try explaining it from another angle. | | At the moment, every tool in supabase is decoupled from the | other tools. The only thing that each tool depends on is | Postgres. This is beneficial, because now we don't need to | build interfaces - Postgres is the interface. | | Putting this implementation into Realtime would introduce a | second dependency for some tools or create a burden for the | developer. For example, this query goes directly to PostgREST: | supabase.from('posts').update({ title: 'Hello', content: | 'World' }); | | If the "content" part of this is a CRDT, then suddenly the | developer needs to know that _that particular field_ needs to | be sent through realtime (psuedo-code): | supabase.from('posts').update({ title: 'Hello' }); | supabase.from('posts').merge({ content: 'World' }); | | Or we'd need to build some automatic-routing logic into the | libraries (or PostgREST), which is also complex. | | In my view, CRDTs are a data problem, and therefore probably a | database problem. If we can make Postgres act like "just | another peer", then it simplifies the architecture. | | > _isn 't it faster to iterate on your solution if you | implement the merge semantics within the central source of | truth that you have full control over?_ | | We have full control over both Realtime and Postgres (at least | the extension), so this is less of a concern. We're also less | worried about fastest iteration than finding the correct | solution long-term ___________________________________________________________________ (page generated 2022-12-10 23:00 UTC)