[HN Gopher] Diffsitter: A tree-sitter based AST difftool to get ...
___________________________________________________________________
Diffsitter: A tree-sitter based AST difftool to get meaningful
semantic diffs

Author : todsacerdoti
Score  : 88 points
Date   : 2021-07-18 18:41 UTC (4 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| def-lkb wrote:
| The output is a bit underwhelming. You might be interested in
| https://codinuum.github.io/gallery-cca/ It is not based on
| tree-sitter; the parsers they use are quite impressive. Yet a
| difficult part is computing large tree diffs (it is O(n^3) if I
| remember well); the authors devise heuristics that are efficient
| in practice and should be adaptable to tree-sitter.

| nemetroid wrote:
| The example is very confusing. According to the output of
| diffsitter, the diff is (in order):
|
| 1. "let x = 1" was removed
|
| 2. "fn add_one {" was removed
|
| 3. a closing brace was _added_
|
| 4. "fn addition() {" was added
|
| 5. "fn add_two() {" was added
|
| (1) is neat, (2) is reasonable. (3) showing up before both (4)
| and (5) is super weird. Is there a bug, or does this just
| demonstrate a disconnect between how a parser parses the code and
| how a human parses it?

| comex wrote:
| Indeed, the example seems a little underwhelming.
|
| It does demonstrate that diffsitter can tell that `fn main() {`
| was not semantically changed at all, while still being able to
| cope with syntactically invalid code like `fn add_one {`.
|
| But look at the following, where I've attempted to manually
| convert diffsitter's generated diff into a unified diff format,
| making it easier to understand what's going on:
|
|     1   fn main() {
|     2 -     let x = 1;
|     3 - fn add_one {
|     4 + }
|     5 + fn addition() {
|     6   }
|     7 + fn add_two() {
|     8   }
|
| Look at the closing brace on line 6, which doesn't have a - or
| + in front of it, meaning that the diff tool has found a
| closing brace in the old code and one in the new code and
| decided that they correspond to each other. But this
| correspondence is not syntactically meaningful. In the old
| code, that closing brace belongs to `fn main`. In the new code,
| however, it belongs to `fn addition`, while `fn main`'s closing
| brace is now represented by the "inserted" closing brace on
| line 4!
|
| Mixing up braces this way is a typical weakness of
| _traditional_ diff tools. I'd hope that a syntax-aware diff
| tool could prevent such mix-ups, and instead generate a diff
| like:
|
|     1   fn main() {
|     2 -     let x = 1;
|     3   }
|     4 - fn add_one {
|     5 - }
|     6 + fn addition() {
|     7 + }
|     8 + fn add_two() {
|     9 + }
|
| Even though this diff is one line longer (while diff tools
| typically aim to produce the shortest possible diff), it's more
| semantically correct, and would be much easier to deal with if
| a merge conflict came into play.
|
| Unfortunately, diffsitter does not do this, at least not in
| that demo. And I don't see how it would be [edit: caused by] a
| disconnect between how a parser parses the code and how a human
| parses it. The parser has the same idea of closing braces
| belonging to particular opening constructs as humans do.

| feanaro wrote:
| > And I don't see how it would be because of a disconnect
| between how a parser parses the code and how a human parses
| it. The parser has the same idea of closing braces belonging
| to particular opening constructs as humans do.
|
| I don't understand the argument you made here. Why couldn't
| it do this?
|
| Compared to the first snippet, in the second snippet:
|
| 1. There is _still_ a function called `main`, but it has no
| lines, compared to a single line before. The conclusion is
| that this line was removed.
|
| 2. There is no longer an `add_one` function.
|
| 3. There are two new functions, `addition` and `add_two`.
|
| This is exactly what your desired diff shows, and all of
| those things can be determined by a parser.

| comex wrote:
| I agree. I think you misinterpreted my comment. By "I don't
| see how it would be because of a disconnect", I meant "I
| don't see how the issue could be caused by a disconnect".
| (Maybe you read it as "I don't see how diffsitter would do
| that, because there's a disconnect"?) The suggestion of a
| disconnect came from the parent comment, and I was
| attempting to refute it.

| feanaro wrote:
| Oops, yep. I mentally dropped a few words there from the
| quoted part, even though I re-read multiple times. Thanks
| for bearing with me.

| marcodiego wrote:
| Definitely underwhelming. Actually I couldn't see how it is
| better than "diff -u".

| touisteur wrote:
| I wish we could standardise on such tools for all languages,
| maybe through the language server protocol? I'm afraid of
| building and maintaining yet another kind of parser (as awesome
| as tree-sitters are) and to be on the language update treadmill
| then...

| iwwr wrote:
| Is this tool obsoleted by autoformatters?

| codetrotter wrote:
| Not unless you have a git hook to ensure that every commit
| is run through the auto formatter, and you additionally ensure
| that everyone who commits to the repositories you work on has
| this hook installed too. And even then, many people still
| interact with repositories that are outside of their power to
| enforce such rules upon.

| polynomial wrote:
| sorry for the dumb question but why can't you use GH Actions
| for this? (Instead of making sure all committers have the
| hook installed.)

| sahkopoyta wrote:
| >git hook to ensure that every commit is run through the auto
| formatter, and you additionally ensure that everyone that
| commits to the repositories you work on have this hook
| installed too
|
| E.g. with a Node project this is a trivial task with tools like
| Husky in place

| fierro wrote:
| no, because the bigger win isn't ignoring whitespace. It's
| ignoring refactoring changes like renaming functions

| craftkiller wrote:
| It could be useful if it was modified to become a great merge
| tool. I've been using kdiff3 as my git mergetool and it always
| splits my code up at the worst spots because it does not
| understand Python.

| [deleted]

| dataflow wrote:
| How in the world does it handle C++? With macros and all that
| being in C++, surely it's relying on some heuristics? (Which is
| fine; it's still an improvement. Just wondering if it's really
| working semantically.)
|
| Also, one nice thing about diff is that it also gives you a patch
| that will turn the input into the output. Can this do that?

| tyingq wrote:
| Someone is working on tree-sitter for Perl too:
| https://github.com/ganezdragon/tree-sitter-perl
|
| Which is supposed to be difficult, if not impossible:
| https://www.perlmonks.org/?node_id=663393
|
| I suppose, though, if it's for diffs and syntax highlighting,
| flaws matter less.

| armchairhacker wrote:
| Looks similar to https://github.com/github/semantic

| bhl wrote:
| GitHub is also behind the library that diffsitter builds upon:
| https://github.com/tree-sitter/tree-sitter.

| dataflow wrote:
| That one (like most such tools) doesn't handle C++.
___________________________________________________________________
(page generated 2021-07-18 23:00 UTC)
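
The function-by-function comparison that comex and feanaro describe in the thread (main keeps its name but loses a line, add_one disappears, addition and add_two appear) can be sketched roughly as below. This is a toy illustration, not how diffsitter works: the "parse" is a hand-written mapping from function name to body lines, the helper name diff_functions is hypothetical, and it assumes add_one sits at the top level as in comex's preferred diff. A real tool would get this structure from a parser such as tree-sitter.

```python
# Toy sketch only: each "parse" is a hand-written mapping from function
# name to the statements in its body. This is an assumption made for
# illustration; a real tool would build it from a syntax tree.

def diff_functions(old, new):
    """Compare two {function name: body lines} mappings item by item."""
    removed = sorted(name for name in old if name not in new)
    added = sorted(name for name in new if name not in old)
    changed = {}
    for name in old:
        if name in new and old[name] != new[name]:
            # Lines are matched only within the same function, so a
            # closing brace of one function can never be paired with
            # the closing brace of another.
            gone = [line for line in old[name] if line not in new[name]]
            came = [line for line in new[name] if line not in old[name]]
            changed[name] = (gone, came)
    return removed, added, changed


# Hand-written stand-ins for the two versions discussed in the thread.
old = {"main": ["let x = 1;"], "add_one": []}
new = {"main": [], "addition": [], "add_two": []}

removed, added, changed = diff_functions(old, new)
print("removed functions:", removed)
print("added functions:", added)
for name, (gone, came) in changed.items():
    for line in gone:
        print(f"{name}: - {line}")
    for line in came:
        print(f"{name}: + {line}")
```

Because every change is attributed to a named function before any lines are compared, the result lines up with feanaro's three observations rather than with the shortest line-based edit script.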