[HN Gopher] Dolt is Git for Data: a SQL database that you can fo...
       ___________________________________________________________________
        
       Dolt is Git for Data: a SQL database that you can fork, clone,
       branch, merge
        
       Author : crazypython
       Score  : 144 points
       Date   : 2021-03-06 21:15 UTC (1 hour ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | twobitshifter wrote:
       | This is cool, but the parent dolthub project is even cooler!
       | Dolthub.com
        
         | laurent92 wrote:
         | Free for public repos!
        
           | zachmu wrote:
           | Also private repos under a gig, but you have to give us a
           | credit card to be private.
        
       | justincormack wrote:
       | I collected all the git for data open source projects I could
       | find a few months back, there have been a bunch of interesting
       | approaches
       | https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...
        
         | chub500 wrote:
         | I've had a fairly long-term side project working on git for
         | chronological data (data is a cause and effect DAG), know of
         | anybody doing that?
        
           | michaelmure wrote:
           | It might not be exactly what you are looking for, but
           | git-bug[1] is encoding data into regular git objects, with
           | merges and conflict resolution. I'm mentioning this because the
           | hard part is providing an ordering of events. Once you have
           | that you can store and recreate whatever state you want.
           | 
           | This branch[2], which I'm almost done with, removes the purely
           | linear branch constraint and allows full DAGs (that is,
           | concurrent editing) while still providing a good ordering.
           | 
           | [1]: https://github.com/MichaelMure/git-bug
           | [2]: https://github.com/MichaelMure/git-bug/pull/532
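
A rough sketch of the "get an ordering, then replay" idea from the comment above. This is not git-bug's actual algorithm. It assumes every operation has a unique id and lists its parents, and it breaks ties between concurrent operations by id so that every replica computes the same order. The operation ids and the `edits` payload are made up for illustration.

```python
import heapq


def linearize(parents):
    """Return a deterministic order in which every op follows its parents.

    `parents` maps each op id to the list of op ids it depends on.
    Concurrent ops (neither is an ancestor of the other) are ordered by id.
    """
    children = {op: [] for op in parents}
    pending = {op: len(ps) for op, ps in parents.items()}
    for op, ps in parents.items():
        for p in ps:
            children[p].append(op)

    ready = [op for op, count in pending.items() if count == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        op = heapq.heappop(ready)  # smallest id among the ops that are ready
        order.append(op)
        for child in children[op]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, child)

    if len(order) != len(parents):
        raise ValueError("the operation graph contains a cycle")
    return order


def replay(order, edits):
    """Rebuild state by applying (field, value) edits in the given order."""
    state = {}
    for op in order:
        field, value = edits[op]
        state[field] = value
    return state


# "b" and "c" were made concurrently on top of "a"; "d" merges them.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
edits = {
    "a": ("title", "Crash"),
    "b": ("status", "open"),
    "c": ("title", "Crash on startup"),
    "d": ("status", "closed"),
}
order = linearize(parents)
state = replay(order, edits)
```

Breaking ties by id is the simplest possible rule. A real system would more likely order concurrent operations by a logical clock or timestamp.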
        
         | glogla wrote:
         | This one seems to be missing: https://projectnessie.org/
        
       | bitslayer wrote:
       | Is it for versions of the database design or versions of the
       | data?
        
         | skybrian wrote:
         | Both. Schema changes are versioned like everything else. But
         | depending on what the change is, it might make merges
         | difficult.
         | 
         | (I haven't used it; I just read the blog.)
        
         | [deleted]
        
       | kyrieeschaton wrote:
       | People interested in this approach should compare Rich Hickey's
       | Datomic.
        
       | einpoklum wrote:
       | But do you really need this functionality, if you already have an
       | SQL database?
       | 
       | That is, you can:
       | 
       | 1. Create a table with an extra changeset id column and a branch
       | id column, so that you can keep historical values.
       | 
       | 2. Have a view on that table with the latest version of each
       | record on the master branch.
       | 
       | 3. Express branching-related actions as actions on the main table
       | with different record versions and branch names
       | 
       | 4. For the chocolate sprinkles, have tables with changeset info
       | and branch info
       | 
       | and that gives you a poor man's git already - doesn't it?
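
For what it's worth, the four steps above can be sketched in a few lines. This is a minimal illustration using SQLite through Python's `sqlite3` module, not anything Dolt does internally. The table, view, and column names (`changesets`, `item_versions`, `items_master`) are made up, and branching is reduced to writing rows under a different branch name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 4: changeset info (branch info is folded into the same table here).
CREATE TABLE changesets (
    id      INTEGER PRIMARY KEY,
    branch  TEXT NOT NULL,
    message TEXT
);

-- Step 1: every version of every record, tagged with changeset and branch.
CREATE TABLE item_versions (
    item_id      INTEGER NOT NULL,
    changeset_id INTEGER NOT NULL REFERENCES changesets(id),
    branch       TEXT NOT NULL,
    name         TEXT,
    deleted      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (item_id, changeset_id)
);

-- Step 2: the latest non-deleted version of each record on master.
CREATE VIEW items_master AS
SELECT v.item_id, v.name
FROM item_versions v
WHERE v.branch = 'master'
  AND v.deleted = 0
  AND v.changeset_id = (
      SELECT MAX(changeset_id)
      FROM item_versions
      WHERE item_id = v.item_id AND branch = 'master'
  );
""")


def commit(conn, branch, message, rows):
    """Step 3: a 'commit' is one changeset row plus new record versions."""
    cur = conn.execute(
        "INSERT INTO changesets (branch, message) VALUES (?, ?)",
        (branch, message),
    )
    changeset_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO item_versions (item_id, changeset_id, branch, name) "
        "VALUES (?, ?, ?, ?)",
        [(item_id, changeset_id, branch, name) for item_id, name in rows],
    )
    return changeset_id


commit(conn, "master", "initial import", [(1, "apple"), (2, "pear")])
commit(conn, "experiment", "shout", [(1, "APPLE")])
commit(conn, "master", "fix typo", [(2, "peach")])

master_rows = conn.execute(
    "SELECT item_id, name FROM items_master ORDER BY item_id"
).fetchall()
```

The write on `experiment` never shows up in `items_master`, and the old value of item 2 is still in `item_versions`. What the sketch leaves out is the hard part: reading a branch that inherits from master, and merging.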
        
       | andrewmcwatters wrote:
       | Reminds me a bit of datahub.io, but potentially more useful.
        
       | pizzabearman wrote:
       | Is this mysql only?
        
         | zachmu wrote:
         | It uses the mysql SQL dialect for queries. But it's its own
         | database.
        
       | joshspankit wrote:
       | I never understood why we don't have SQL databases that track all
       | changes in a "third dimension" (column being one dimension, row
       | being the second dimension).
       | 
       | It might be a bit slower to write, but hook the logic in to
       | write/delete, and suddenly you can see _exactly_ when a field was
       | changed to break everything. The right middleware and you could
       | see the user, IP, and query that changed it (along with any other
       | queries before or after).
        
         | kenniskrag wrote:
         | Because you can do that with after update triggers or server-
         | side in software.
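
A minimal sketch of the after-update-trigger approach, assuming SQLite via Python's `sqlite3` module. The `accounts` and `accounts_audit` tables are made up for illustration. The trigger only sees old and new row values, so the user, IP, and query text mentioned in the parent comment would have to be supplied by the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    id    INTEGER PRIMARY KEY,
    email TEXT
);

-- One audit row per change: the "third dimension" of the table.
CREATE TABLE accounts_audit (
    account_id INTEGER,
    old_email  TEXT,
    new_email  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER accounts_after_update
AFTER UPDATE OF email ON accounts
BEGIN
    INSERT INTO accounts_audit (account_id, old_email, new_email)
    VALUES (OLD.id, OLD.email, NEW.email);
END;
""")

conn.execute("INSERT INTO accounts (id, email) VALUES (1, 'a@example.com')")
conn.execute("UPDATE accounts SET email = 'b@example.com' WHERE id = 1")

history = conn.execute(
    "SELECT account_id, old_email, new_email FROM accounts_audit"
).fetchall()
```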
        
       | iamwil wrote:
       | which db does it use?
        
         | zachmu wrote:
         | It is a database. It implements the MySQL dialect and binary
         | protocol, but it isn't MySQL. Totally separate storage engine
         | and implementation.
        
       | jrumbut wrote:
       | It's amazing this isn't a standard feature. The database world
       | seems to have focused on large, high volume, globally distributed
       | databases. Presumably you wouldn't version clickstream or IoT
       | sensor data.
       | 
       | Features like this that are only feasible below a certain scale
       | are underdeveloped and I think there's opportunity there.
        
         | fiddlerwoaroof wrote:
         | Datomic has some sort of zero-cost forking of the database:
         | its "add-only" design makes this cheap.
        
         | qbasic_forever wrote:
         | Every DB engine used at scale has a concept of snapshots and
         | backups. This just looks like someone making a git-like
         | porcelain for the same kind of DB management constructs.
        
         | 101008 wrote:
         | Isn't the MySQL log journal* what you are looking for?
         | 
         | * I don't remember the exact name, but I mean the feature that
         | is used to replicate actions if there was an error.
        
       | strogonoff wrote:
       | You can also just use Git for data!
       | 
       | It's a bit slower, but smart use of partial/shallow clones can
       | address performance degradation on large repositories over time.
       | You just need to take care of the transformation between
       | "physical" trees/blobs and "logical" objects in your dataset
       | (which may not have a 1:1 mapping, since a more granular physical
       | layer reduces the likelihood of merge conflicts).
       | 
       | In this regard (versioning data) I think Pijul is promising: it
       | looks like they might introduce primitives that operate on
       | changes in actual data structures rather than between lines in
       | files, like with Git.
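
One way to picture the physical/logical split described above. This is a hypothetical layout, not a standard one: each field of each record becomes its own small blob, so two people editing different fields of the same record touch different files and plain Git can merge them. It assumes record ids contain no "/" and that every record has at least one field.

```python
import json


def to_blobs(dataset):
    """Logical -> physical: one '<record_id>/<field>.json' blob per field."""
    blobs = {}
    for record_id, record in dataset.items():
        for field, value in record.items():
            path = f"{record_id}/{field}.json"
            blobs[path] = json.dumps(value, sort_keys=True) + "\n"
    return blobs


def from_blobs(blobs):
    """Physical -> logical: regroup the per-field blobs into records."""
    dataset = {}
    for path, text in blobs.items():
        record_id, filename = path.split("/", 1)
        field = filename[: -len(".json")]
        dataset.setdefault(record_id, {})[field] = json.loads(text)
    return dataset


dataset = {
    "alice": {"email": "alice@example.com", "tags": ["admin", "ops"]},
    "bob": {"email": "bob@example.com", "tags": []},
}
blobs = to_blobs(dataset)
restored = from_blobs(blobs)
```

Writing `blobs` into a working tree and committing it is left out. The trade-off is many tiny files in exchange for fewer merge conflicts.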
        
         | teej wrote:
         | The fact that I can use git for data if I carefully avoid all
         | the footguns is exactly why I don't use git for data.
        
         | pradn wrote:
         | Git is too complicated. It's barely usable for daily tasks.
         | Look at how many people have to Google for basic things like
         | uncommitting a commit, or cleaning your local repo to mirror a
         | remote one. Complexity is a liability. Mercurial has a nicer
         | interface. And now I see the real simplicity of non-distributed
         | source control systems. I have never actually needed to work in
         | a distributed manner, just client-server. I have never sent a
         | patch to another dev to patch into their local repo or whatnot.
         | All this complexity seems like a solution chasing after a
         | problem - at least for most developers. What works for Linux
         | isn't necessary for most teams.
        
       | yarg wrote:
       | Merging is hard, but the rest can be done with copy-on-write
       | cloning (or am I missing something?).
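
A toy illustration of the copy-on-write part, using `collections.ChainMap` from the Python standard library. It assumes the shared base snapshot is never mutated after the fork. Each branch gets an empty overlay and only the keys it writes are stored there. Deletes and merges, the hard parts, are not handled.

```python
from collections import ChainMap

# Shared snapshot; treated as immutable once branches exist.
base = {"1": "apple", "2": "pear"}

# "Cloning" copies no rows: each branch is an empty overlay on top of base.
main = ChainMap({}, base)
branch = ChainMap({}, base)

# Writes land in the branch's own overlay, never in base.
branch["2"] = "peach"
main["3"] = "plum"

main_view = dict(main)
branch_view = dict(branch)
```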
        
       | laurent92 wrote:
       | WordPress would have benefited from this.
       | 
       | What a lot of webmasters want is to test the site locally, then
       | merge it back. A lot of people turned to Jekyll or Hugo for the
       | very reason that it can be checked into git, and git is reliable.
       | A static website can't get hacked, whereas anyone who has been
       | burnt by a WordPress security failure knows they'd prefer a
       | static site.
       | 
       | And even more: People would like to pass the new website from the
       | designer to the customer to managers -- WordPress might not have
       | needed to develop their approval workflows (permission schemes,
       | draft/preview/publish) if they had had a forkable database.
        
       | Klwohu wrote:
       | Problematic name, could become a millstone on the neck of the
       | developer far into the future.
        
         | ademarre wrote:
         | Agreed. I couldn't immediately see if it was "DOLT" or "do it",
         | as in "just do it". It's the former.
        
           | rapnie wrote:
           | I was going back and forth between the two until seeing doLt
           | in terminal font.
        
             | zachmu wrote:
             | This ambiguity in sans serif fonts has actually been pretty
             | annoying. Especially since GitHub doesn't let you choose
             | your font on readmes and stuff.
        
         | TedDoesntTalk wrote:
         | Already I would not use this project because of its name. I'm
         | not offended by it, but I know others will be, and it will only
         | be a matter of time before we have to replace it with something
         | else. So why bother in the first place?
         | 
         | I know the name is not DOLT but it is close enough to cause
         | offense. Imagine the N-word with one typo. Would it still be
         | offensive? Probably to some.
        
         | maest wrote:
         | It was most likely picked as an analogy to "git".
        
           | gerdesj wrote:
           | Dolt and git are closer to synonymous than analogous.
        
           | zachmu wrote:
           | This is correct. Specifically, to pay homage to git and how
           | Linus named it.
        
       | Ericson2314 wrote:
       | What people usually miss about these things is normal version
       | control benefits hugely from content addressing and normal forms.
       | 
       | The salient aspect of relational data is that it's cyclic; this
       | makes content addressing unable to provide normal forms on its
       | own (unless someone figures out how to Merkle cyclic graphs!), but
       | the normal form can still be made other ways.
       | 
       | The first part is easy enough: store rows in some order.
       | 
       | The second part is more interesting: making the choice of
       | surrogate keys not matter (quotienting it away). Sorting table
       | rows containing surrogate keys depending on the sorting of table
       | rows makes for some interesting bags of constraints, for which
       | there may be more than one fixed point.
       | 
       | Example:
       | 
       |     CREATE TABLE Foo (
       |         a uuid PRIMARY KEY,
       |         b text,
       |         best_friend uuid REFERENCES Foo(a)
       |     );
       | 
       | DB 0:
       | 
       |     0 Alice 0
       | 
       | 1 reclusive Alice, best friends with herself. Just fine.
       | 
       |     0 Alice 1
       |     1 Alice 1
       | 
       | 2 reclusive Alices, both best friends with the second one. The
       | Alices are the same up to primary keys, but while primary keys
       | are to be quotiented out, primary key equality isn't, so this is
       | valid. And we have an asymmetry by which to sort.
       | 
       |     0 Alice 1
       |     1 Alice 0
       | 
       | 2 reclusive Alices, each best friends with the other. The Alices
       | are completely isomorphic, and one notion of normal forms would
       | say this is exactly the same as DB 0: as if this is reclusive
       | Alice in a fun house of mirrors.
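
A brute-force sketch of "quotienting away" surrogate keys for the Foo table above. It assumes rows are `(key, name, best_friend_key)` tuples, tries every renaming of the keys to 0..n-1, and keeps the smallest sorted result as the canonical form. That is O(n!) and only meant to make the idea concrete. It implements the stricter notion (equality up to key renaming), so the two mutually befriended Alices are not collapsed into DB 0. The coarser "fun house of mirrors" notion is not implemented here.

```python
from itertools import permutations


def canonical(rows):
    """Return the smallest sorted row tuple over all renamings of the keys."""
    keys = sorted({key for key, _, _ in rows})
    best = None
    for perm in permutations(range(len(keys))):
        rename = dict(zip(keys, perm))
        candidate = tuple(
            sorted((rename[key], name, rename[friend]) for key, name, friend in rows)
        )
        if best is None or candidate < best:
            best = candidate
    return best


db0 = [(0, "Alice", 0)]
db1 = [(0, "Alice", 1), (1, "Alice", 1)]  # both befriend the second Alice
db2 = [(0, "Alice", 1), (1, "Alice", 0)]  # each befriends the other

# Same data as db1, but with different surrogate keys and row order.
db1_rekeyed = [(9, "Alice", 5), (5, "Alice", 5)]

same_up_to_keys = canonical(db1) == canonical(db1_rekeyed)
mirror_collapses = canonical(db2) == canonical(db0)
```

Here `same_up_to_keys` comes out true and `mirror_collapses` false: the choice of keys no longer matters, but key equality still does.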
        
       ___________________________________________________________________
       (page generated 2021-03-06 23:00 UTC)