[HN Gopher] Dolt is Git for Data: a SQL database that you can fo...
___________________________________________________________________

Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge

Author : crazypython
Score  : 144 points
Date   : 2021-03-06 21:15 UTC (1 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| twobitshifter wrote:
| This is cool, but the parent dolthub project is even cooler!
| Dolthub.com

| laurent92 wrote:
| Free for public repos!

| zachmu wrote:
| Also private repos under a gig, but you have to give us a
| credit card to be private.

| justincormack wrote:
| I collected all the git for data open source projects I could
| find a few months back; there have been a bunch of interesting
| approaches
| https://docs.google.com/spreadsheets/d/1jGQY_wjj7dYVne6toyzm...

| chub500 wrote:
| I've had a fairly long-term side project working on git for
| chronological data (data is a cause and effect DAG), know of
| anybody doing that?

| michaelmure wrote:
| It might not be exactly what you are looking for, but
| git-bug[1] is encoding data into regular git objects, with merges
| and conflict resolution. I'm mentioning this because the hard
| part is providing an ordering of events. Once you have that
| you can store and recreate whatever state you want.
|
| This branch[2], which I'm almost done with, removes the purely
| linear branch constraint and allows full DAGs (that is,
| concurrent editing) while still providing a good ordering.
|
| [1]: https://github.com/MichaelMure/git-bug
| [2]: https://github.com/MichaelMure/git-bug/pull/532

| glogla wrote:
| This one seems to be missing: https://projectnessie.org/

| bitslayer wrote:
| Is it for versions of the database design or versions of the
| data?

| skybrian wrote:
| Both. Schema changes are versioned like everything else. But
| depending on what the change is, it might make merges
| difficult.
|
| (I haven't used it; I just read the blog.)

| [deleted]

| kyrieeschaton wrote:
| People interested in this approach should compare Rich Hickey's
| Datomic.

| einpoklum wrote:
| But do you really need this functionality, if you already have an
| SQL database?
|
| That is, you can:
|
| 1. Create a table with an extra changeset id column and a branch
| id column, so that you can keep historical values.
|
| 2. Have a view on that table with the latest version of each
| record on the master branch.
|
| 3. Express branching-related actions as actions on the main table
| with different record versions and branch names
|
| 4. For the chocolate sprinkles, have tables with changeset info
| and branch info
|
| and that gives you a poor man's git already - doesn't it?

| andrewmcwatters wrote:
| Reminds me a bit of datahub.io, but potentially more useful.

| pizzabearman wrote:
| Is this MySQL only?

| zachmu wrote:
| It uses the MySQL SQL dialect for queries. But it's its own
| database.

| joshspankit wrote:
| I never understood why we don't have SQL databases that track all
| changes in a "third dimension" (column being one dimension, row
| being the second dimension).
|
| It might be a bit slower to write, but hook the logic into
| write/delete, and suddenly you can see _exactly_ when a field was
| changed to break everything. With the right middleware you could
| see the user, IP, and query that changed it (along with any other
| queries before or after).

| kenniskrag wrote:
| Because you can do that with after update triggers or
| server-side in software.

| iamwil wrote:
| which db does it use?

| zachmu wrote:
| It is a database. It implements the MySQL dialect and binary
| protocol, but it isn't MySQL. Totally separate storage engine
| and implementation.

| jrumbut wrote:
| It's amazing this isn't a standard feature. The database world
| seems to have focused on large, high volume, globally distributed
| databases. Presumably you wouldn't version clickstream or IoT
| sensor data.
|
| Features like this that are only feasible below a certain scale
| are underdeveloped and I think there's opportunity there.

| fiddlerwoaroof wrote:
| Datomic has some sort of zero-cost forking of the database:
| its "add-only" design makes this cheap.

| qbasic_forever wrote:
| Every DB engine used at scale has a concept of snapshots and
| backups. This just looks like someone making a git-like
| porcelain for the same kind of DB management constructs.

| 101008 wrote:
| Isn't the MySQL log journal* what you are looking for?
|
| * I don't remember the exact name, but I mean the feature that
| is used to replay actions if there was an error.

| strogonoff wrote:
| You can also just use Git for data!
|
| It's a bit slower, but smart use of partial/shallow clones can
| address performance degradation on large repositories over time.
| You just need to take care of the transformation between
| "physical" trees/blobs and "logical" objects in your dataset
| (which may not have a 1:1 mapping, as making the physical layer
| more granular reduces the likelihood of merge conflicts).
|
| In this regard (versioning data) I think Pijul is promising; it
| looks like they might introduce primitives that allow operating
| on changes in actual data structures rather than between lines in
| files, as with Git.

| teej wrote:
| The fact that I can use git for data if I carefully avoid all
| the footguns is exactly why I don't use git for data.

| pradn wrote:
| Git is too complicated. It's barely usable for daily tasks.
| Look at how many people have to Google for basic things like
| uncommitting a commit, or cleaning your local repo to mirror a
| remote one. Complexity is a liability. Mercurial has a nicer
| interface. And now I see the real simplicity of non-distributed
| source control systems. I have never actually needed to work in
| a distributed manner, just client-server. I have never sent a
| patch to another dev to patch into their local repo or whatnot.
| All this complexity seems like a solution chasing after a
| problem - at least for most developers. What works for Linux
| isn't necessary for most teams.

| yarg wrote:
| Merging is hard, but the rest can be done with copy-on-write
| cloning (or am I missing something?).

| laurent92 wrote:
| WordPress would have benefited from this.
|
| What a lot of webmasters want is to test the site locally, then
| merge it back. A lot of people turned to Jekyll or Hugo for the
| very reason that it can be checked into git, and git is reliable.
| A static website can't get hacked, whereas anyone who has been
| burnt by a WordPress security failure knows they'd prefer a
| static site.
|
| And even more: People would like to pass the new website from the
| designer to the customer to managers -- WordPress might not have
| needed to develop their approval workflows (permission schemes,
| draft/preview/publish) if they had had a forkable database.

| Klwohu wrote:
| Problematic name, could become a millstone on the neck of the
| developer far into the future.

| ademarre wrote:
| Agreed. I couldn't immediately see if it was "DOLT" or "do it",
| as in "just do it". It's the former.

| rapnie wrote:
| I was going back and forth between the two until seeing doLt
| in terminal font.

| zachmu wrote:
| This ambiguity in sans serif fonts has actually been pretty
| annoying. Especially since GitHub doesn't let you choose
| your font on readmes and stuff.

| TedDoesntTalk wrote:
| Already I would not use this project because of its name. I'm
| not offended by it, but I know others will be, and it will only
| be a matter of time before we have to replace it with something
| else. So why bother in the first place?
|
| I know the name is not DOLT but it is close enough to cause
| offense. Imagine the N-word with one typo. Would it still be
| offensive? Probably to some.

| maest wrote:
| It was most likely picked as an analogy to "git".

| gerdesj wrote:
| Dolt and git are closer to synonymous rather than analogous.

| zachmu wrote:
| This is correct. Specifically, to pay homage to git and how
| Linus named it.

| Ericson2314 wrote:
| What people usually miss about these things is that normal
| version control benefits hugely from content addressing and
| normal forms.
|
| The salient aspect of relational data is that it's cyclic; this
| makes content addressing unable to provide normal forms on its
| own (unless someone figures out how to Merkle cyclic graphs!), but
| the normal form can still be made other ways.
|
| The first part is easy enough: store rows in some order.
|
| The second part is more interesting: making the choice of
| surrogate keys not matter (quotienting it away). Sorting table
| rows containing surrogate keys depending on the sorting of table
| rows makes for some interesting bags of constraints, for which
| there may be more than one fixed point.
|
| Example:
|
|     CREATE TABLE Foo (
|       a uuid PRIMARY KEY,
|       b text,
|       best_friend uuid REFERENCES Foo(a)
|     );
|
| DB 0:
|
|     0 Alice 0
|
| 1 reclusive Alice, best friends with herself. Just fine.
|
|     0 Alice 1
|     1 Alice 1
|
| 2 reclusive Alices, both best friends with the second one. The
| Alices are the same up to primary keys, but while primary keys
| are to be quotiented out, primary key equality isn't, so this is
| valid. And we have an asymmetry by which to sort.
|
|     0 Alice 1
|     1 Alice 0
|
| 2 reclusive Alices, each best friends with the other. The Alices
| are completely isomorphic, and one notion of normal forms would
| say this is exactly the same as DB 0: as if this is reclusive
| Alice in a fun house of mirrors.
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)