[HN Gopher] Show HN: Cozo - new Graph DB with Datalog, embedded ...
___________________________________________________________________
Show HN: Cozo - new Graph DB with Datalog, embedded like SQLite

Hi HN, I have been building the Cozo database for the past half year,
and now it is ready for public release.

My initial motivation is that I want a graph database. Lightweight and
easy to use, like SQLite. Powerful and performant, like Postgres. I
found none of the existing solutions good enough.

Deciding to roll my own, I need to choose a query language. I am
familiar with Cypher but consider it not much of an improvement over
CTE in SQL (Cypher is sometimes notationally more convenient, but not
more expressive). I like Gremlin but would prefer something more
declarative. Experiments with Datomic and its clones convinced me that
Datalog is the way to go.

Then I need a data model. I find the property graph model (Neo4j,
etc.) over-constraining, and the triple store model (Datomic, etc.)
suffering from inherent performance problems. They also lack the most
important property of the relational model: being an algebra.
Non-algebraic models are not very composable: you may store data as
property graphs or triples, but when you do a query, you always get
back relations. So I decided to have relational algebra as the data
model.

The end result, I now present to you. Let me know what you think, good
or bad, and I'll do my best to address your feedback.

This is the first time that I have used Rust in a significant project,
and I love the experience!

Author : zh217
Score  : 278 points
Date   : 2022-11-08 12:25 UTC (10 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| canadiantim wrote:
| You mention how Cypher is not much of an improvement over CTE in
| SQL, I was wondering if you could expand on this point a bit if
| possible?
|
| Some part of me is considering using Apache AGE graph extension
| for postgres, but another part wonders whether it's worth it
| considering CTE's can do a lot very similarly.
|
| I'll definitely be following the progress for Cozo though, sounds
| great on the face of it. Definitely will have to consider
| potentially using Cozo as well. I wonder if it could make sense
| to use Postgres and Cozo together?
| zh217 wrote:
| Yes of course.
|
| Perhaps I should start by clarifying that I am talking about
| the number of queries the Cypher language can express, without
| any vendor-specific extensions, since my consideration was
| whether to use it as the query language for my own database.
| And Cypher is of course much more convenient to _type_ than SQL
| for expressing graph traversals - it was built for that.
|
| With that understanding, any cypher pattern can be translated
| into a series of joins and projections in SQL, and any
| recursive query in cypher can be translated into a recursive
| CTE. Theoretically, SQL with recursive CTE is not Turing
| complete (unless you also add in window functions in recursive
| CTE, which I don't think any of the Cypher databases currently
| provide), whereas Datalog with function symbol is. Practically,
| you can easily write a shortest path query in pure Datalog
| without recourse to built-in algorithms (an example is shown in
| README), and at least in Cozo it executes essentially as a
| variant of Dijkstra's algorithm. I'm not sure I can do that in
| Cypher. I don't think it is doable.
| samuell wrote:
| Does Cypher even support nested and/or recursive queries? I
| remember asking the Neo4j guys at a meetup about that many
| years ago, and they didn't even seem to understand the
| question. Might have changed since then of course.
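An aside on the recursive-CTE point raised above: a recursive graph query such as "which nodes can be reached from this one" can be written as a recursive CTE in plain SQL. The following is a minimal sketch using Python's standard `sqlite3` module, not anything taken from Cozo; the table name `edge`, the function `reachable` and the sample data are all hypothetical and chosen only for illustration.

```python
import sqlite3


def reachable(edges, start):
    """Return the set of nodes reachable from `start` via a recursive CTE.

    `edges` is an iterable of (src, dst) pairs. Table and column names
    are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")  # in-process, nothing to connect to
    conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM edge WHERE src = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT edge.dst FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach
        """,
        (start,),
    ).fetchall()
    conn.close()
    return {node for (node,) in rows}


# Hypothetical sample graph: a -> b -> c -> a forms a cycle, d -> e is separate.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
```

Calling `reachable(EDGES, "a")` walks the cycle once and stops, because `UNION` discards rows that were already produced. This only illustrates reachability; it says nothing about how Cozo or any Cypher database evaluates such queries.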
|
| Otherwise the thing I have noticed with the datalog (as well
| as prolog) syntax, is you are able to build a vocabulary of
| re-usable queries, in a much more usable way than any of the
| solutions I've seen in SQL, or other similar languages.
|
| It thus allows you to raise your level of abstraction by
| defining your definitions (or "classes" if you will) layer by
| layer with well crafted queries, which can be used for further
| refined classifying queries.
| zh217 wrote:
| Re Datalog syntax: yes, the "composability" is the main
| reason that I decided to adopt it as the query language.
| This is also the reason why we made storing query results
| back into the database very easy (no pre-declaration of
| "tables" necessary) so that intermediate results can be
| materialized in the database at will and be used by
| multiple subsequent queries.
| samuell wrote:
| Indeed, composability is the spot-on keyword here.
| [deleted]
| samuell wrote:
| How I have waited for this: A simple, accessible library for
| graph-like data with datalog (also in a statically compiled
| language, yay). Have even pondered using SWI-prolog for this kind
| of stuff, but it seems so much nicer to be able to use it
| embedded in more "normal" types of languages.
|
| Looking forward to playing with this!
|
| The main thing I will be wondering now is how it will scale to
| really large datasets. Any input on that?
| samuell wrote:
| For folks looking for documentation or getting-started
| examples, see:
|
| - The tutorial:
| https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...
|
| - The language documentation:
| https://cozodb.github.io/current/manual/
|
| - The pycozo library README for some examples on how to run
| this from inline python:
| https://github.com/cozodb/pycozo#readme
| zh217 wrote:
| Thanks for your interest in this!
|
| It currently uses RocksDB as the storage engine. If your server
| has enough resources, I believe it can store TBs of data with
| no problem.
|
| Running queries on datasets this big is a complicated story.
| Point lookups should be nearly instant, whereas running
| complicated graph algorithms on the whole dataset is
| (currently) out of the question, since all the rows a query
| touches must reside in memory. Also, the algorithmic complexity
| of some of the graph algorithms is too high for big data and
| there's nothing we can do about it. We aim to provide a smooth
| way for big data to be distilled layer by layer, but we are not
| there yet.
| samuell wrote:
| Many thanks for the detailed answer!
| mark_l_watson wrote:
| Thank you, this looks very useful. I will try the Python embedded
| mode when I have time.
|
| I especially like the Datalog query examples in your README
| file. I usually use RDF/RDFS and the SPARQL query language, with
| much less use of property graphs using Neo4J. I expect an easy
| ramp up learning your library.
|
| BTW, I read the discussion of your use of the AGPL license. For
| what it is worth, that license is fine with me. I usually release
| my open source projects using Apache 2, but when required
| libraries use GPL or AGPL, I simply use those licenses.
| dmitriid wrote:
| A nitpick for the README: consider converting examples from
| images to code blocks (you can even directly copy-paste them into
| the code blocks and they should retain their formatting)
|
| Otherwise: yes, please. I love the idea.
| mola wrote:
| Graph query over relational data, brilliant. I need this
| yesterday.
| OtomotO wrote:
| Awesome work, congrats.
|
| As someone who has never done anything with datalog, I didn't see
| an example in the repo, and the docs (docs.rs) could use some more
| content.
|
| I hope to see a 1.0 at some point and performance that can
| compete with SQLite.
|
| Would love to have an alternative, especially as I have a few pet
| projects that have graph data (well, in the end the whole
| universe can be modelled as a graph ;))
| zh217 wrote:
| I'm very happy that you like it!
|
| The "teasers" section in the repo README contains a few
| examples. Or you could look at the tutorial
| (https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...),
| which contains all sorts of examples.
|
| The Rust documentation on docs.rs could certainly be improved,
| will do that later!
| OtomotO wrote:
| Ah, yes, mea culpa. Was browsing on the phone and did miss
| that link indeed.
|
| Is it also okay to store big data that would otherwise go
| into another storage like e.g. blog-posts?
|
| I mean the content could also be modeled as a leaf-node and
| not be part of the db itself. (not sure if that would be
| abusing the kv storage)
| zh217 wrote:
| In short: yes, but not right now. See this issue:
| https://github.com/cozodb/cozo/issues/2. Also in this case
| you are not really using it as an embedded database
| anymore, which is our original motivation. We currently
| also provide a "cozoserver", but it is pretty primitive at
| the moment. "Big data" capabilities, when they arrive in
| Cozo, will probably go into the server instead of the
| embedded binaries.
| OtomotO wrote:
| Hm, why wouldn't that be embedded?
|
| How do you define embedded?
|
| One of my applications is a simple "blog-like" webservice
| where you can either use a SQLite db or Postgres.
|
| Personally I often prefer SQLite because it doesn't need
| a thousand configurations and I can just migrate all the
| content by copying a file.
| zh217 wrote:
| My use of "embedded" means that the whole database runs
| in the same process as your application. This is how
| SQLite works. Your application doesn't "connect" to an
| SQLite database in the usual sense. Your application
| simply contains SQLite as part of itself.
| Contrast this with Postgres, where you first need to start a
| Postgres server and then have your application talk to it.
| OtomotO wrote:
| Exactly.
|
| I was just curious because of your comment:
|
| > Also in this case you are not really using it as an
| embedded database anymore, which is our original
| motivation
|
| As by your (and my) definition, I am indeed using it as
| an embedded database. It's running inside the process and
| storing (and persisting) blog-posts.
| Serow225 wrote:
| I'm excited to get some more Rust docs!
|
| Even just a pointer to serde ::from_value(value).unwrap(),
| and <TheType as Deserialize>::deserialize(value), would be
| helpful to get people pointed in the right direction.
|
| Looks like a super cool project, congrats!
| ithrow wrote:
| _you may store data as property graphs or triples, but when you
| do a query, you always get back relations_
|
| Can you elaborate on this? in datomic you can get back
| hierarchical data
| ekidd wrote:
| This is a really impressive piece of work! Congratulations!
|
| I note that it appears to be a library, but it's licensed under
| the Affero GPL. I believe this means that if I link your library
| into a program, and if I then allow users to interact with that
| combined program in any way over a network, then I have to make
| it possible for users to download the source code to my entire
| program. Is that your goal here? Were you thinking of some kind
| of commercial licensing model for people writing server-side apps
| that use your library?
|
| (I'm curious because I've been deciding whether or not to roll my
| own toy Datalog for a permissively-licensed open source Rust
| project.)
| zh217 wrote:
| No, my understanding is that if you don't make any changes to
| the Cozo code, you don't need to release anything to the
| public. If you do, and you cannot release your non-Cozo code,
| then you must dynamically link to the library (and release your
| changes to the Cozo code).
| The Python, NodeJS and Java/Clojure libraries all use dynamic
| linking.
|
| There is no plan for any commercial license - this is a
| personal project at the moment. My hope is for this project to
| grow into a true FOSS database with wide contributions and no
| company controlling it. If a community forms and after I
| understand the consequences a little bit more, the license may
| change if the community decides that it is better for the
| long-term good of the project. For the moment though, it is
| staying AGPL.
| Cu3PO42 wrote:
| Let me preface by saying that this seems like a great piece
| of software and it is absolutely within your right to license
| it as whatever you would like, no matter what any of the
| commenters here think.
|
| However, I don't believe your understanding of AGPL is
| accurate.
|
| > No, my understanding is that if you don't make any changes
| to the Cozo code, you don't need to release anything to the
| public. If you do, and you cannot release your non-Cozo code,
| then you must dynamically link to the library (and release
| your changes to the Cozo code). The Python, NodeJS and
| Java/Clojure libraries all use dynamic linking.
|
| This sounds like you're thinking of the LGPL, not AGPL.
| LGPL is less strict than GPL because the exception you
| describe above applies; AGPL, on the other hand, is more
| strict. Essentially, if you use any AGPL code to provide a
| service to users then you must also make the source code
| available, even if the software itself is never delivered to
| users.
|
| The intention here is that you can't get around GPL by hiding
| any use of the GPL code behind a server, so it makes perfect
| sense to use it for a database. But I don't think it does
| what you want.
|
| Whichever way you decide to go, be it AGPL, LGPL or something
| else, I encourage you to make a choice before accepting any
| outside contributions.
| As soon as you have code from other authors without a CLA you
| will need to obtain their permission to change the license
| (with some exceptions).
|
| (Disclaimer: I'm not a lawyer, just interested in licenses.)
| zh217 wrote:
| It seems that I really did misunderstand the differences.
| It is now under LGPL. The repo still requires CLA for
| contribution for the moment until I am really sure.
| zh217 wrote:
| Thank you for your perspective.
|
| Maybe I was confused about the case of using an executable
| vs linking against a library. Let me double-check with a
| few friends who understand copyright laws better than me.
| If everything checks out, the next release will be under
| LGPL.
|
| About CLA: at the previous suggestion of a friend, the repo
| is currently locked with a CLA requirement (even though
| nobody outside has contributed yet). This will be lifted once
| the situation becomes clearer.
| [deleted]
| georgewfraser wrote:
| Licensing under AGPL will make it hard for any startup to use
| Cozo. Lawyers always ask about AGPL in venture financing
| diligence and it is considered a red flag. You can argue that
| they are wrong, the linking exception and so on, but you're
| basically shouting into the wind.
| ekidd wrote:
| > If a community forms and after I understand the
| consequences a little bit more, the license may change if the
| community decides that it is better for the long-term good of
| the project. For the moment though, it is staying AGPL.
|
| Yes, I do want to be clear: I encourage you to use whatever
| license you like. You wrote the code! I was just curious,
| because it would also affect the license of any hypothetical
| software I wrote that used the library.
|
| Here's a _super oversimplified_ version of the main license
| types (I am not a lawyer):
|
| - Permissive: "Do whatever you want but don't sue me."
|
| - LGPL: "If you give this library to other people, you must
| 'share and share alike' the source and your changes to this
| library."
|
| - GPL: "If you use this code in your program, you must 'share
| and share alike' your entire program, but only if you give
| people copies of the program."
|
| - AGPL: "If you use this code in your program, you must
| 'share and share alike' your entire program with anyone who
| can interact with it over a network."
|
| The AGPL makes a ton of sense for an advanced database
| _server,_ because otherwise AWS may make their own version
| and run it on their servers as a paid service, without
| contributing back.
|
| But like I said, I'm simplifying way too much. Take a look at
| the FSF's license descriptions and/or talk to a lawyer. This
| shouldn't be stressful. Figure out what license supports the
| kind of users and community you want, pick it, and don't look
| back. :-)
|
| (I may end up writing a super-simple non-persistent Datalog
| at some point for an open source project. My needs are _much_
| simpler than the things you support, anyways--I only ever
| need to run one particular query.)
| zh217 wrote:
| I realized my mistake, as I said in the other comments. The
| main repo is now under LGPL. I'll see what I'll do with the
| bindings. Writing code is so much better than dealing with
| licenses!
| ekidd wrote:
| Oh, cool!
|
| And yeah, licenses can be challenging and frustrating,
| especially the first time you release a major project.
|
| I am really super excited by the idea of embedded Datalog
| in Rust. I sometimes run into situations where I need
| something that fits in that awkward gap between SQL and
| Prolog. I want more expressiveness, better composability,
| and better graph support than SQL. But I also want
| finite-sized results that I can materialize in bounded
| time.
|
| There has been some very neat work with
| incrementally-updated Datalog in the Rust community.
| For example, I think Datafrog is really neat:
| https://github.com/frankmcsherry/blog/blob/master/posts/2018...
| But it's great to see more cool projects in this space, so
| thank you.
| kylebarron wrote:
| If I'm not mistaken that sounds more like LGPL than the AGPL?
| zh217 wrote:
| Maybe, and maybe I need to consult a lawyer someday to get
| the facts straight. To tell you the truth my head hurts
| when I attempt to understand what these licenses say.
| Regardless, I intend this project to be true FOSS, the
| "finer detail" of which FOSS license it uses may change.
| mijoharas wrote:
| My understanding is the same as kylebarron's[0] since you
| lack linking protections (which you would get under
| LGPL), so any work that includes cozo would be a "derived
| work" under the (A)GPL. Interestingly there doesn't seem
| to be an affero LGPL license[1], which could be what you
| might want here.
|
| Otherwise, simplest solution provided you want a copyleft
| license would be to use the LGPL I think.
|
| NOTE: not a lawyer.
|
| [0] https://softwareengineering.stackexchange.com/questions/1078...
|
| [1] https://redmonk.com/dberkholz/2012/09/07/opening-the-infrast...
| (old link, but I couldn't find anything since then describing
| this kind of license?)
| wizzwizz4 wrote:
| We kinda do have it; it's just mostly useless, given the
| linking clause. (Not entirely useless, though, as that
| article sets out.)
|
| GPL and AGPL have the same layout, so you can just take
| the LGPL, and replace all references to 'GPL' and 'GNU
| General Public License' with 'AGPL' and 'GNU Affero
| General Public License'. Of course, you couldn't call
| that license 'GNU ALGPL' or 'GNU LAGPL'; you'd have to
| come up with your own name. (Disclaimer: I'm not a
| lawyer, and I haven't checked this as thoroughly as I
| would if I were going to use this for my own software.)
|
| Maybe it's worth bothering Bradley M.
| Kuhn (http://ebb.org/bkuhn/) again and seeing what the current
| status of a Lesser AGPL is?
| _frkl wrote:
| That's a fair enough stance. I'd recommend not taking any
| outside contributions until you are sure about the
| license, since it'll make it much harder to change the
| license if you do. Or maybe require all outside
| contributions to be licensed very permissively, like
| using the BSD license. Or you could use a CLA, but that's
| not something I'd recommend. Either way, licensing is
| hard :(. I can empathise with the head hurting.... Oh,
| also, check out https://tldrlegal.com/ .
| kapilvt wrote:
| it's also odd then re the python bindings being MIT, as
| the AGPL will convey throughout any aggregation or
| library usage, as would GPL, the primary delta for GPL vs
| AGPL is the intent of the latter for network offered
| services, which in the context of an embedded library/db
| is odd. rightly or wrongly many orgs will refuse to allow
| usage of gpl/agpl software due to the licensing concerns
| around the effects on the rest of their ip. duckdb
| (embedded analytics sql) uses mit, etc. so in terms of
| creating a "true foss" project ie a community of users
| and contributors, it's definitely worth considering a
| licensing change imho, but of course dealer's choice.
| zh217 wrote:
| OP here. Nothing about the license is final yet since
| there are no outside contributors. I just changed the
| main repo to LGPL, not because what I believed in
| changed, but because it seems that I really misunderstood
| the licenses.
| dangoor wrote:
| I am not a lawyer, but I work in an open source programs
| office and am currently working specifically on open source
| license compliance.
|
| Beyond what the sibling comments have said about LGPL
| sounding more like what you're going for, I'll just note that
| if you'd like broad adoption of this while still ensuring
| that changes to your code remain open, you might also want to
| consider the Mozilla Public License.
|
| What I understand of MPL and LGPL is that MPL is better
| for instances where dynamic linking isn't possible. The MPL
| basically says that any changes _to the files you created_
| must be available under the MPL, preserving their public
| availability.
|
| That said, most organizations are fine with the LGPL, but it
| just gets gnarly if there are instances where you really want
| to statically link something but you still fully want to
| support the original library's openness.
| pie_flavor wrote:
| AGPL is a variant of the GPL, not the LGPL. Meaning that
| dynamic linking still constitutes (according to them) a
| derivative work, meaning that even programs that dynamically
| link against it must themselves be AGPL in their entirety.
| Dynamic linking is also meaningfully complicated to do in
| Rust, and this licensure of the crates.io crate will be a
| footgun for anyone not using cargo-deny.
|
| I think this is a very cool project, but its use of *GPL
| essentially ensures I'm not going to use it for anything. If
| you're planning on reducing it to LGPL, I'm not sure what the
| GPL is getting you over going with the Rust standard license
| set of MIT + Apache 2.0.
| jitl wrote:
| This is amazing!
|
| Have you looked at differential-datalog? It's rust-based,
| maintained by VMWare, and has a very rich, well-typed Datalog
| language. differential-datalog is in-memory only right now, but
| could be ideal to integrate your graph as a datastore or disk
| spill cache.
|
| https://github.com/vmware/differential-datalog
| abc3354 wrote:
| This looks nice!
|
| Datascript seems to be another Datalog engine (in memory only)
|
| https://github.com/tonsky/datascript
| fsiefken wrote:
| there are a few more, including ones supporting on disk
| databases
| https://en.wikipedia.org/wiki/Datalog#Systems_implementing_D...
| billylindeman wrote:
| This is amazing.
| I can't wait to play with it
| typon wrote:
| I have been meaning to do this exact project for 5 years at
| least. Congrats on making it happen - looking forward to using it
| stevesimmons wrote:
| This does look very nice!
|
| Especially (from my point of view) having the Python interface.
|
| What are the max practical graph sizes you anticipate?
| zh217 wrote:
| For the moment: you can have as much data as you want on disk
| as long as the RocksDB storage engine can handle it, which I
| believe is quite large. For any single query though, you want
| all the data you touch to fit in memory. The good news is that
| Rust is very efficient in using memory. This will be improved
| in future versions.
|
| For the built-in graph algorithms, you are also limited by the
| algorithmic complexity, which for some of them is quite high
| (notably betweenness centrality). There is nothing the database
| can do to help in this case, though we may add some approximate
| algorithms with lower complexities later.
| pgt wrote:
| Good job! How to transact? The examples only show queries.
| zh217 wrote:
| Transactions are described in the manual:
| https://cozodb.github.io/current/manual/stored.html#chaining....
|
| Sorry about the docs being all over the place at the moment! My
| only excuse is that Cozo is very young. The documentation (and
| the implementation) still needs a lot of work!
| dwenzek wrote:
| Really nice!
|
| I like the design choices of Datalog for the query language and
| Relations for the data model. This contrasts with the typical
| choices made for graph databases where the word graph seems to
| make _links_ a mandatory query and representation tool.
| philzook wrote:
| Very cool! I love the sqlite install everywhere model.
|
| Could you compare use case with Souffle?
| https://souffle-lang.github.io/
|
| I'd suggest putting the link to the docs more prominently on the
| github page
|
| Is the "traditional" datalog `path(x,z) :- edge(x,y), path(y,z).`
| syntax not pleasant to the modern eye? I've grown to rather like
| it. Or is there something that syntax can't do?
|
| I've been building a Datalog shim layer in python to bridge
| across a couple different datalog systems
| https://github.com/philzook58/snakelog (including a datalog built
| on top of the python sqlite bindings), so I should look into
| including yours
| zh217 wrote:
| I find nothing wrong with the classical syntax, but there is a
| very practical, even stupid reason why the syntax is the way it
| is now. As you can see from the tutorial
| (https://nbviewer.org/github/cozodb/cozo-docs/blob/main/tutor...),
| you can run Cozo in Jupyter notebooks and mix it with Python
| code. This is the main way that I myself interact with Cozo.
| Since I don't fancy writing an unmaintainable mess of Jupyter
| frontend code that may become obsolete in a few years,
| CozoScript had better look like python enough so as not to
| completely baffle the Jupyter syntax highlighter. That's why the
| syntax for comments is `#`, not `//`. That's also why the syntax
| for stored relation is `*stored`, not `&stored` or `%stored`.
|
| This is a hack from the beginning, but over time I grew to like
| the syntax quite a bit. And hopefully by being similar to
| Python or JS superficially, less confusion results for new
| users :)
| philzook wrote:
| Ah, that's very interesting. Thank you. `s.add(path(x,z) <=
| edge(x,y) & path(y,z))` is what I chose as python syntax, but
| it is clunkier.
| samuell wrote:
| Interesting! I'm thinking ... perhaps a small syntax
| comparison for prolog/classical datalog vs cozo, would help
| people used to the classical syntax quickly get started.
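For readers new to the classical rule quoted above, `path(x,z) :- edge(x,y), path(y,z).`, here is a minimal sketch of what it computes, written as a naive bottom-up fixpoint in plain Python. It is not how Cozo, Souffle or snakelog evaluate rules; the function name `transitive_closure` is hypothetical, and the base rule `path(x,y) :- edge(x,y).` is an assumption added here, since the thread quotes only the recursive rule.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of two Datalog rules:

        path(x, y) :- edge(x, y).              (assumed base rule)
        path(x, z) :- edge(x, y), path(y, z).  (rule quoted in the thread)

    `edges` is an iterable of (x, y) pairs; the result is the set of all
    derivable path(x, z) facts.
    """
    edges = set(edges)
    path = set(edges)  # facts from the base rule
    while True:
        # Apply the recursive rule once: join edge(x, y) with path(y, z).
        derived = {(x, z) for (x, y) in edges for (y2, z) in path if y == y2}
        new_facts = derived - path
        if not new_facts:
            return path  # fixpoint reached: no rule produces anything new
        path |= new_facts
```

Because facts live in a set, the loop stops as soon as a round derives nothing new, which is why the result is always finite even on cyclic graphs. Real engines typically use semi-naive evaluation, which joins only against the facts found in the previous round instead of recomputing everything.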
| packetlost wrote:
| This is very similar to the goals of a project I've been working
| on, though I've been focusing on the raw storage format
| (literally a drop-in replacement for RocksDB, so this could be
| interesting). I think datalog databases are _far_ underrated.
___________________________________________________________________
(page generated 2022-11-08 23:00 UTC)