[HN Gopher] Diffsitter: A tree-sitter based AST difftool to get ...
       ___________________________________________________________________
        
       Diffsitter: A tree-sitter based AST difftool to get meaningful
       semantic diffs
        
       Author : todsacerdoti
       Score  : 88 points
       Date   : 2021-07-18 18:41 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | def-lkb wrote:
       | The output is a bit underwhelming. You might be interested in
       | https://codinuum.github.io/gallery-cca/ It is not based on
       | tree-sitter, but the parsers they use are quite impressive. Yet a
       | difficult part is computing a large tree diff (it is O(n^3) if I
       | remember well); the authors devise heuristics that are efficient
       | in practice and should be adaptable to tree-sitter.
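
To make "tree diff" concrete, here is a minimal sketch of an ordered tree edit distance with unit costs for insert, delete, and relabel. The `Node` type and the `node` helper are hypothetical stand-ins for parsed syntax nodes, not tree-sitter's API. The recursion is the textbook forest formulation with no memoization, so it is far slower than the bound quoted above on large inputs, and it is not the heuristic approach used by the linked project.

```rust
/// A labeled, ordered tree node (a stand-in for a parsed syntax node).
#[derive(Clone, Debug, PartialEq, Eq)]
struct Node {
    label: String,
    children: Vec<Node>,
}

/// Hypothetical helper for building small trees by hand.
fn node(label: &str, children: Vec<Node>) -> Node {
    Node { label: label.to_string(), children }
}

/// Number of nodes in a forest (an ordered list of trees).
fn size(forest: &[Node]) -> usize {
    forest.iter().map(|n| 1 + size(&n.children)).sum()
}

/// Edit distance between two ordered forests, looking at the rightmost root
/// of each: delete it, insert it, or match the two roots with each other.
fn forest_distance(f: &[Node], g: &[Node]) -> usize {
    match (f.split_last(), g.split_last()) {
        (None, None) => 0,
        (Some(_), None) => size(f), // delete everything that is left
        (None, Some(_)) => size(g), // insert everything that is left
        (Some((v, f_rest)), Some((w, g_rest))) => {
            // Delete v: its children move up and take its place.
            let mut f_del = f_rest.to_vec();
            f_del.extend(v.children.iter().cloned());
            let delete = forest_distance(&f_del, g) + 1;

            // Insert w: symmetric to deletion.
            let mut g_ins = g_rest.to_vec();
            g_ins.extend(w.children.iter().cloned());
            let insert = forest_distance(f, &g_ins) + 1;

            // Match v with w, paying 1 if the labels differ.
            let relabel = if v.label == w.label { 0 } else { 1 };
            let matched = forest_distance(&v.children, &w.children)
                + forest_distance(f_rest, g_rest)
                + relabel;

            delete.min(insert).min(matched)
        }
    }
}

/// Edit distance between two single trees.
fn tree_distance(a: &Node, b: &Node) -> usize {
    forest_distance(std::slice::from_ref(a), std::slice::from_ref(b))
}
```

Calling `tree_distance` on two small trees built with `node` returns the number of single-node edits separating them. Practical tools add memoization and matching heuristics on top of this idea.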
        
       | nemetroid wrote:
       | The example is very confusing. According to the output of
       | diffsitter, the diff is (in order):
       | 
       | 1. "let x = 1" was removed
       | 
       | 2. "fn add_one {" was removed
       | 
       | 3. a closing brace was _added_
       | 
       | 4. "fn addition() {" was added
       | 
       | 5. "fn add_two() {" was added
       | 
       | (1) is neat, (2) is reasonable. (3) showing up before both (4)
       | and (5) is super weird. Is there a bug, or does this just
       | demonstrate a disconnect between how a parser parses the code and
       | how a human parses it?
        
         | comex wrote:
         | Indeed, the example seems a little underwhelming.
         | 
         | It does demonstrate that diffsitter can tell that `fn main() {`
         | was not semantically changed at all, while still being able to
         | cope with syntactically invalid code like `fn add_one {`.
         | 
         | But look at the following, where I've attempted to manually
         | convert diffsitter's generated diff into a unified diff format,
         | making it easier to understand what's going on:
         |     1   fn main() {
         |     2 -     let x = 1;
         |     3 - fn add_one {
         |     4 + }
         |     5 + fn addition() {
         |     6   }
         |     7 + fn add_two() {
         |     8   }
         | 
         | Look at the closing brace on line 6, which doesn't have a - or
         | + in front of it, meaning that the diff tool has found a
         | closing brace in the old code and one in the new code and
         | decided that they correspond to each other. But this
         | correspondence is not syntactically meaningful. In the old
         | code, that closing brace belongs to `fn main`. In the new code,
         | however, it belongs to `fn addition`, while `fn main`'s closing
         | brace is now represented by the "inserted" closing brace on
         | line 4!
         | 
         | Mixing up braces this way is a typical weakness of
         | _traditional_ diff tools. I'd hope that a syntax-aware diff
         | tool could prevent such mix-ups, and instead generate a diff
         | like:
         |     1   fn main() {
         |     2 -     let x = 1;
         |     3   }
         |     4 - fn add_one {
         |     5 - }
         |     6 + fn addition() {
         |     7 + }
         |     8 + fn add_two() {
         |     9 + }
         | 
         | Even though this diff is one line longer (while diff tools
         | typically aim to produce the shortest possible diff), it's more
         | semantically correct, and would be much easier to deal with if
         | a merge conflict came into play.
         | 
         | Unfortunately, diffsitter does not do this, at least not in
         | that demo. And I don't see how it would be [edit: caused by] a
         | disconnect between how a parser parses the code and how a human
         | parses it. The parser has the same idea of closing braces
         | belonging to particular opening constructs as humans do.
        
           | feanaro wrote:
           | > And I don't see how it would be because of a disconnect
           | between how a parser parses the code and how a human parses
           | it. The parser has the same idea of closing braces belonging
           | to particular opening constructs as humans do.
           | 
           | I don't understand the argument you made here. Why couldn't
           | it do this?
           | 
           | Compared to the first snippet, in the second snippet:
           | 
           | 1. There is _still_ a function called `main`, but it has no
           | lines, compared to a single line before. The conclusion is
           | that this line was removed.
           | 
           | 2. There is no longer an `add_one` function.
           | 
           | 3. There are two new functions, `addition` and `add_two`.
           | 
           | This is exactly what the diff you wanted shows, and all of
           | those things can be determined by a parser.
        
             | comex wrote:
             | I agree. I think you misinterpreted my comment. By "I don't
             | see how it would be because of a disconnect", I meant "I
             | don't see how the issue could be caused by a disconnect".
             | (Maybe you read it as "I don't see how diffsitter would do
             | that, because there's a disconnect"?) The suggestion of a
             | disconnect came from the parent comment, and I was
             | attempting to refute it.
        
               | feanaro wrote:
               | Oops, yep. I mentally dropped a few words there from the
               | quoted part, even though I re-read multiple times. Thanks
               | for bearing with me.
        
           | marcodiego wrote:
           | Definitely underwhelming. Actually I couldn't see how it is
           | better than "diff -u".
        
       | touisteur wrote:
       | I wish we could standardise on such tools for all languages,
       | maybe through the language server protocol? I'm afraid of
       | building and maintaining yet another kind of parser (as awesome
       | as tree-sitters are) and to be on the language update treadmill
       | then...
        
       | iwwr wrote:
       | Is this tool obsoleted by autoformatters?
        
         | codetrotter wrote:
         | Not unless you have a git hook to ensure that every commit
         | is run through the auto formatter, and you additionally ensure
         | that everyone that commits to the repositories you work on have
         | this hook installed too. And even then, many people still
         | interact with repositories that are outside of their power to
         | enforce such rules upon.
        
           | polynomial wrote:
           | sorry for the dumb question but why can't you use GH Actions
           | for this? (Instead of making sure all committers have the
           | hook installed.)
        
           | sahkopoyta wrote:
           | >git hook to ensure that every commit is run through the auto
           | formatter, and you additionally ensure that everyone that
           | commits to the repositories you work on have this hook
           | installed too
           | 
           | E.g., with a Node project this is a trivial task with tools
           | like Husky in place.
        
         | fierro wrote:
         | no, because the bigger win isn't ignoring whitespace. It's
         | ignoring refactoring changes like renaming functions
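
One way to picture "ignoring renames" is to compare two snippets after replacing every identifier with a placeholder numbered by first appearance. The sketch below is an assumption-laden illustration, not how diffsitter works. It uses a toy tokenizer and a tiny hard-coded keyword list instead of a real parser, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Toy keyword list; a real tool would take this from the grammar.
const KEYWORDS: [&str; 4] = ["fn", "let", "mut", "return"];

/// Splits source into words (identifiers, keywords, numbers) and single
/// punctuation characters, dropping all whitespace.
fn tokenize(src: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for ch in src.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            word.push(ch);
        } else {
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            if !ch.is_whitespace() {
                tokens.push(ch.to_string());
            }
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

/// A word counts as an identifier if it starts with a letter or underscore
/// and is not one of the toy keywords.
fn is_identifier(token: &str) -> bool {
    let starts_ok = token
        .chars()
        .next()
        .map_or(false, |c| c.is_alphabetic() || c == '_');
    starts_ok && !KEYWORDS.iter().any(|k| *k == token)
}

/// Replaces each identifier with `id0`, `id1`, ... in order of first use,
/// so consistent renames produce the same token sequence.
fn normalize(src: &str) -> Vec<String> {
    let mut ids: HashMap<String, usize> = HashMap::new();
    tokenize(src)
        .into_iter()
        .map(|tok| {
            if is_identifier(&tok) {
                let next = ids.len();
                let n = *ids.entry(tok).or_insert(next);
                format!("id{}", n)
            } else {
                tok
            }
        })
        .collect()
}

/// True if the snippets differ only by whitespace and consistent renaming.
fn same_up_to_renaming(a: &str, b: &str) -> bool {
    normalize(a) == normalize(b)
}
```

With this check, `fn add_one(x) { x + 1 }` and a reformatted `fn increment(n) { n + 1 }` compare equal, while changing the constant `1` to `2` does not. An autoformatter alone would still report the first pair as a change.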
        
         | craftkiller wrote:
         | It could be useful if it was modified to become a great merge
         | tool. I've been using kdiff3 as my git mergetool and it always
         | splits my code up at the worst spots because it does not
         | understand Python.
        
       | dataflow wrote:
       | How in the world does it handle C++? With macros and all that
       | being in C++, surely it's relying on some heuristics? (Which is
       | fine; it's still an improvement. Just wondering if it's really
       | working semantically.)
       | 
       | Also, one nice thing about diff is that it also gives you a patch
       | that will turn the input into the output. Can this do that?
        
         | tyingq wrote:
         | Someone is working on tree-sitter for Perl too:
         | https://github.com/ganezdragon/tree-sitter-perl
         | 
         | Which is supposed to be difficult, if not impossible:
         | https://www.perlmonks.org/?node_id=663393
         | 
         | I suppose, though, if it's for diffs and syntax highlighting,
         | flaws matter less.
        
       | armchairhacker wrote:
       | Looks similar to https://github.com/github/semantic
        
         | bhl wrote:
         | GitHub is also behind the library that diffsitter builds upon:
         | https://github.com/tree-sitter/tree-sitter.
        
         | dataflow wrote:
         | That one (like most such tools) doesn't handle C++.
        
       ___________________________________________________________________
       (page generated 2021-07-18 23:00 UTC)