[HN Gopher] Show HN: Pg_CRDT - an experimental CRDT extension fo...
       ___________________________________________________________________
        
       Show HN: Pg_CRDT - an experimental CRDT extension for Postgres
        
       This is an experimental extension for CRDTs, pg_crdt: GitHub
       repo[0]. It supports Yjs/Yrs and Automerge.  The linked blog post
       [1] describes how we're thinking about this extension in a Supabase
       context. I want to emphasise this part: "pg_crdt has not been
       released onto the Supabase platform (and it may never be). We're
       considering many options for offline-sync/support and, while CRDTs
       will undoubtedly factor in, we're not sure if this is the right
       approach."  [0] GitHub repo: https://github.com/supabase/pg_crdt
       [1] Blog post: https://supabase.com/blog/postgres-crdt
        
       Author : kiwicopple
       Score  : 233 points
       Date   : 2022-12-10 12:20 UTC (10 hours ago)
        
 (HTM) web link (supabase.com)
 (TXT) w3m dump (supabase.com)
        
       | kiwicopple wrote:
       | For quick viewing, here's how it works in Postgres (using Yjs):
       | create table posts (           id serial primary key,
       | content crdt.ydoc default crdt.new_ydoc()         );
       | insert into posts (content)         values (crdt.new_ydoc());
       | update posts          set content = content || crdt.new_ydoc()
       | where id = 1;
       | 
       | The heavy-lifting is done by the teams at Yjs[0] and
       | Automerge[1], who have re-implemented their JS libs in Rust. This
       | made it possible for us to wrap them into a Postgres extension
       | using pgx[2] (which is fast becoming a de-facto tool at supabase)
       | 
       | [0] Yjs: https://github.com/yjs/yjs
       | 
       | [1] Automerge: https://github.com/automerge/automerge
       | 
       | [3] pgx: https://github.com/tcdi/pgx
        
         | samwillis wrote:
         | Super excited to see this Paul, and amused to see the nod to my
         | comment in the post :-)
         | 
         | I'm going to enjoy getting stuck into this! (When I'm not stuck
         | driving for six hours... just a quick coffee break reading
         | this)
         | 
         | Support for CRDT columns within a db to enable JSON like
         | indexing/querying capabilities is fundamental to then building
         | some exciting further developments. Supporting both Yjs and
         | Automerge is exactly the right route forward.
        
           | samwillis wrote:
           | Had a chance to take a little more of a look, did you
           | investigate querying/indexing within the documents? (Didn't
           | spot any sign in the repository)
           | 
           | Without the ability to index properties, or do ad-hock
           | queries, on the documents there is little reason to to the
           | merging in-DB. It then makes more sense to merge outside the
           | DB and save both the raw CRDT object and a serialised JSON
           | next to it for querying. I assume that if you had found it's
           | performance better, indexing+querying would have been the
           | next step?
           | 
           | However I'm not surprised you found performance not ideal, I
           | suspect a layer in-front of PG throttling writes is the only
           | way forward for a real-time use case.
           | 
           | Personally I find the offline + eventual consistency the most
           | interesting use case. The ability to sync a users document
           | set to a local db (SQLite or a future mini-pg/supabase) with
           | _local_ querying and then merge them back later. This could
           | be as basic as a three column table (guid, doc, clock), the
           | clock being a global Lamport like clock to enable syncing
           | only changed documents.
        
             | aaronblohowiak wrote:
             | Read modify write outside the db seems worse though?
        
             | kiwicopple wrote:
             | > _This could be as basic as a three column table (guid,
             | doc, clock), the clock being a global Lamport_
             | 
             | This would be achievable, but also perhaps quite strict -
             | we're dictating a table structure rather than working with
             | existing CRDT libs. That said, we could probably do this
             | today, without any mods/extensions
             | 
             | > _querying /indexing_
             | 
             | Yeah, there is nothing in this implementation for that
             | 
             | > _Without the ability to index properties, or do ad-hock
             | queries, on the documents there is little reason to to the
             | merging in-DB_
             | 
             | One benefit is the ability to send "directly" to the
             | database, but I agree that lack of additional features in
             | the current form outweighs this benefit
             | 
             | > _a serialised JSON_
             | 
             | This is how we might do it with with "Realtime as an
             | authority" approach (and is the most likely approach tbh).
        
         | Horusiath wrote:
         | Yrs (Yjs on Rust) maintainer here: we actually had some idea
         | about building extension to Postgres ;) Great to see that
         | others share the excitement about this idea as well. See:
         | https://github.com/y-crdt/y-crdt/issues/220
        
           | kiwicopple wrote:
           | perfect, we'd love to collaborate with you. I'll jump into
           | your github after our Launch Week (this week), or feel free
           | to reach out directly (contact details in my bio)
        
         | zombodb wrote:
         | Always nice to see pgx being used in the wild! Awesome work.
        
       | nixpulvis wrote:
       | I for one respect the need to resolve merge conflicts.
        
       | danbruc wrote:
       | I do not get peoples' obsession with CRDTs. They are useful in a
       | limited set of use cases when your operations are commutative but
       | having only commutative operations is pretty rare except for some
       | almost exotic use cases. Distributed document editing is for some
       | reason a reoccurring theme but document editing operations are
       | all but commutative. What people then build with all kinds of ad
       | hoc conflict resolution strategies is barely worth being called a
       | CRDT - nothing there is conflict-free, you are just defining
       | conflicts away. Sometimes that works okay, at least for a limited
       | uses case, but the resulting operations are rarely general, and
       | sometimes they are just weird.
        
         | preseinger wrote:
         | > CRDTs ... are useful in a limited set of use cases when your
         | operations are commutative
         | 
         | As you correctly observe, very few applications that exist
         | today have data models with operations that satisfy the
         | requirements of CRDTs, including e.g. commutativity. But CRDTs
         | were never meant to be a "swappable" state layer for existing
         | applications. The idea has always been that applications would
         | need to change how they interact with state, to a model which
         | does satisfy those requirements. Such models can absolutely
         | generalize, it just takes a bit of thinking.
        
         | kiwicopple wrote:
         | I touch on this in the blog post:                   I
         | personally believe CRDTs are the future. *For some things.*
         | It's possible that developers will need to be selective about
         | their "level" of offline support                [example table
         | for a blog post]                  Perhaps it's not that
         | important if the title has a "last write wins" strategy,
         | because it's rarely updated and less likely to have a merge
         | conflict. But it is critical to use a smarter algorithm for the
         | content of the blog post, especially in a team where multiple
         | users are editing a blog post at the same time.
        
           | danbruc wrote:
           | The C in CRDT is for conflict-free, i.e. there must not be
           | any conflicts between operations. Incrementing and
           | decrementing a counter, the final result will be the same
           | independent of the order of operations. You can almost not
           | get any further away from a CRDT than last write wins. And
           | for more complicated merge strategies you are essentially
           | guessing what the desired result is but you do not know and
           | everything really depends on the quality of your guesses.
           | 
           | Like the example with the changed fruits in the list, maybe
           | one or the other should end up in the final result, maybe
           | both, maybe none. The correct conflict resolution strategy is
           | not obvious and highly dependent on the use case, that is why
           | I said that implementations are usually not general or they
           | are only general in the sense that you can plug in any
           | behavior you like. Compare this with a real CRDT like a
           | counter, there there is absolutely no doubt about the desired
           | outcome.
        
             | AlexErrant wrote:
             | It's on the application developer to choose the correct
             | CRDT for the job. Last write wins is there as a CRDT of
             | last resort - there's a trade off between the very general
             | but "weak" LWW and the specific but "powerful" counter. (In
             | my own app, I plan on observing "failed" LWW writes and
             | surfacing them to the user as notifications or something
             | transient.) Ideally one could choose a CRDT where the
             | semantics of an application conflict are the same as CRDT
             | conflicts, but we don't live in an ideal world. I don't
             | think you'll find anyone arguing that CRDTs are a silver
             | bullet.
        
               | danbruc wrote:
               | Maybe my point is just that those things should not be
               | called CRDTs. There are conflicts in the basic operations
               | and you just decide on a case by case basis how to
               | resolve them which makes the entire construct look
               | superficially like a CRDT but kind of ignores the core
               | characteristic of CRDTs, namely that the operations never
               | conflict.
        
               | preseinger wrote:
               | LWWRegister does provide the deterministic guarantee that
               | operations never conflict. The "problem" is just that the
               | decision is basically arbitrary.
        
             | jamil7 wrote:
             | The point is that they all eventually converge on the same
             | value, and can do that independent of some centralised
             | merging location, rather than the exact way in which they
             | do that, whether that's LLW or anything else.
        
               | danbruc wrote:
               | Just store all updates - as if you do event sourcing -
               | ordered by a timestamp. The list of updates will
               | eventually converge as will any function of that list,
               | including any function that merges all the updates into a
               | current state. You might have to [partially] reevaluate
               | the function in case a delayed update shows up and has to
               | be inserted into the middle of the list of updates. So I
               | guess there are additional requirements in order to
               | qualify as a CRDT, bounded additional space or bounded
               | amount of computation on updates. But I don't know as it
               | turned out my understanding of what qualifies as a good
               | CRDT was too narrow.
        
               | jamil7 wrote:
               | Yeah event sourcing and what you described is what these
               | libraries are calling delta CRDTs. The appeal is they're
               | hiding away the actual event log, metadata and
               | rebasing/merging parts and exposing data types that look
               | something like what you'd normally use.
        
             | jclem wrote:
             | An LWW register is most certainly a CRDT, it just has
             | limited use. I think you're making a good point about how
             | one must be careful applying CRDTs, but I don't think it's
             | accurate to say that a suboptimal merge strategy in terms
             | of user experience means that something does not meet the
             | definition of a CRDT.
             | 
             | In other words, the "conflict-free" part of the term refers
             | to the fact that all replicas converge on the same value
             | when all updates have been exchanged. What you're
             | discussing seems to have more to do with concerns around
             | intention preservation when using CRDTs. In other words, an
             | LWW register on a text field is still a CRDT, but in many
             | applications it is potentially a poor choice in terms of
             | preserving the intent of user-generated operations, but
             | determining this is something done in a case-by-case basis.
             | 
             | If the content is a long document which users collaborate
             | on over time, then a register is most certainly a poor
             | choice. If it is a short text field akin to a label, an LWW
             | register isn't necessarily bad, because user intent is
             | likely to replace the entire value, rather than to perform
             | minute edits on the value.
        
               | danbruc wrote:
               | I just reread the Wikipedia article on CRDTs and was
               | actually wrong. I only had a subset of CRDTs in mind but
               | they actually encompass a much wider class than I
               | thought. In some way I now understand the fuzz about them
               | even less, they now look almost trivial, but maybe I am
               | still not understanding some important aspect properly.
               | Should maybe read the paper introducing them. But at the
               | very least I should not complain about people calling
               | things CRDTs that actually fall under the actual
               | definition.
        
               | kiwicopple wrote:
               | > _But at the very least I should not complain about
               | people calling things CRDTs that actually fall under the
               | actual definition_
               | 
               | that's... a very reasonable response. Nice one.
               | 
               | I think the fuzz comes from the more esoteric CRDTs,
               | which solve actually hard problems (traditionally solved
               | by Operational Transforms). But I don't think I could
               | have created a simple example in the blog post for one of
               | these.
        
       | matlin wrote:
       | Last Writer Wins is not the most compelling use case for CRDT
       | because if those fruits were denormalized into Postgres rows,
       | you'd already have the behavior described. A more interesting
       | example, might be if both users were inserting a new fruit, one
       | at the beginning and one at the end. What should the sequence of
       | fruits be after that? Likely a list with both new fruits
       | included. But it depends on the domain and use case.
       | 
       | In general, Sequence CRDTs seem to be the most difficult and most
       | interesting area of continuous research. In my own work, I've
       | implemented a sequence CRDT to create a distributed log of events
       | across users that then (because the order is guaranteed to
       | eventually be the same for each device) can just be reduced to
       | whatever data structure you need. It means, we end up with one
       | CRDT to rule them all and then can just write the "conflict
       | resolution" in simple term with the business logic.
        
         | jamil7 wrote:
         | > Last Writer Wins is not the most compelling use case for CRDT
         | because if those fruits were denormalized into Postgres rows,
         | you'd already have the behavior described.
         | 
         | I'm not sure from the article, but I'm guessing the difference
         | here is that it will apply the updates in the correct order
         | (using last writer wins) regardless of when the yjs doc hits
         | the server. I guess the idea here is you can get decent
         | offline-ish support without directly storing timestamps,
         | version numbers, soft-deletes etc and associated logic.
        
       | Hixon10 wrote:
       | Hm, I am a bit worried about using PostgreSQL with CRDT
       | (Limitations section from the blog). I guess that PostgreSQL
       | doesn't like frequently updates, because of MVCC. Maybe it is OK
       | to use PG as a cold storage, if you have another application,
       | which would work with hot data (like, when users change
       | something), and eventually store it inside PG.
        
         | kiwicopple wrote:
         | unless we can figure out how to overcome those limitations,
         | you're right. Potentially some sort of throttling system for
         | CRDT updates (offloaded to a background worker so that it's
         | non-blocking?). If anyone else has ideas/approaches which could
         | work in a PG-only context, we'd love to hear them
        
           | Hixon10 wrote:
           | Maybe you can research a bit pluggable table storage
           | interface in PostgreSQL. It is possible either to implement
           | your storage, which works fine with CRDT, or try to find some
           | storage implementation, which works fine with such usage
           | pattern (like, https://wiki.postgresql.org/wiki/Zheap )
        
       | cranium wrote:
       | CRDTs are really cool if your merging function behaves
       | intuitively _in the context of your application_. You can easily
       | implement a distributed VCS by using the Last Writer Wins, but it
       | would be a nightmare to work with - imagine having your
       | modifications rolled back because someone committed after you.
       | 
       | In the same way some operations are embarrassingly parallelizable
       | (think adding elements to a set), some operations are inherently
       | conflicting: editing elements of the same text, image,
       | directory,... More than trying to decide _in all cases_ which
       | modification wins, we should detect when there are multiple
       | legitimate options and ask the user to choose. Like a merge
       | conflict in Git for example.
       | 
       | The application state feels so clean when conflicts can't arise
       | or are automatically dealt with. But we have to acknowledge there
       | are only so much datastructures that can merge automatically and
       | still be intuitive for the user.
        
         | preseinger wrote:
         | +1, I think the most valuable lesson we've learned w.r.t. CRDTs
         | at the moment is that valid merge operations basically don't
         | generalize, and basically have to leverage properties that are
         | application-specific. This isn't a fatal flaw, it just
         | constrains the design space.
        
       | Smaug123 wrote:
       | Forgive me, I'm still recovering from the flu and am having some
       | trouble thinking, but I didn't find your "Why not build this into
       | Realtime?" section very compelling. If you've already got a
       | central source of truth (so you don't currently need the peer-to-
       | peer nature of CRDTs), isn't it faster to iterate on your
       | solution if you implement the merge semantics within the central
       | source of truth that you have full control over? I'm not
       | convinced from your post that turning the database into a
       | component of a peer-to-peer distributed system is predictably
       | worthwhile; it's certainly predictably going to be very complex.
       | It smacks to me of spending a great deal of engineering effort to
       | keep your options much further open than I've seen you justify.
        
         | kiwicopple wrote:
         | Thanks, perhaps I can try explaining it from another angle.
         | 
         | At the moment, every tool in supabase is decoupled from the
         | other tools. The only thing that each tool depends on is
         | Postgres. This is beneficial, because now we don't need to
         | build interfaces - Postgres is the interface.
         | 
         | Putting this implementation into Realtime would introduce a
         | second dependency for some tools or create a burden for the
         | developer. For example, this query goes directly to PostgREST:
         | supabase.from('posts').update({ title: 'Hello', content:
         | 'World' });
         | 
         | If the "content" part of this is a CRDT, then suddenly the
         | developer needs to know that _that particular field_ needs to
         | be sent through realtime (psuedo-code):
         | supabase.from('posts').update({ title: 'Hello' });
         | supabase.from('posts').merge({ content: 'World' });
         | 
         | Or we'd need to build some automatic-routing logic into the
         | libraries (or PostgREST), which is also complex.
         | 
         | In my view, CRDTs are a data problem, and therefore probably a
         | database problem. If we can make Postgres act like "just
         | another peer", then it simplifies the architecture.
         | 
         | > _isn 't it faster to iterate on your solution if you
         | implement the merge semantics within the central source of
         | truth that you have full control over?_
         | 
         | We have full control over both Realtime and Postgres (at least
         | the extension), so this is less of a concern. We're also less
         | worried about fastest iteration than finding the correct
         | solution long-term
        
       ___________________________________________________________________
       (page generated 2022-12-10 23:00 UTC)