[HN Gopher] Static analysis at GitHub
       ___________________________________________________________________
        
       Static analysis at GitHub
        
       Author : trptcolin
       Score  : 132 points
       Date   : 2022-03-30 13:27 UTC (9 hours ago)
        
 (HTM) web link (cacm.acm.org)
 (TXT) w3m dump (cacm.acm.org)
        
       | evmar wrote:
       | I contributed to a similar system used within Google (partially
       | open source at kythe.io), that took the very different approach
       | of integrating with the language-native toolchain for each
       | language.
       | 
       | As this article describes, doing this requires per-language
       | integrations and also effectively being able to "run the build"
       | for any given code (because e.g. the C++ header search path can
       | vary on a per-source-file basis), which is untenable for a
       | codebase as large and varied as GitHub's. However, if you can
       | make it work, you get the benefit of having the compiler's
       | understanding of the semantics of the code, which is especially
       | finicky in complex languages like C++ or, say, Rust.
       | 
       | For example, if you look at this[1] method call it refers to a
       | symbol generated by a chain of macros, but the browser is still
       | able to point you at the definition of it.
       | 
       | It's an interesting tradeoff to make: the GitHub approach likely
       | doesn't handle corner cases like the above but it makes up for it
       | in broad applicability and performance. I recall an IDE developer
       | once telling me they made a similar tradeoff in code completion,
       | in that it's better DX to pop up completions quickly even if
       | they're "only" 99% correct.
       | 
       | (To be clear, I absolutely think the approach taken in the
       | article was the right one for the domain they're working in, I
       | was just contrasting it against my experience in a similar
       | problem where we took a very different approach.)
       | 
       | [1]
       | https://source.chromium.org/chromium/chromium/src/+/main:v8/...
        
         | dcreager wrote:
         | Note that this article describes our implementation of "search-
         | based" or "ctags-like" Code Navigation, which definitely has
         | the imprecision that you describe. We've also been working over
         | the previous ~year on a framework called Stack Graphs [1,2,3],
         | which lets us tackle "precise" Code Navigation while still
         | having the zero-config and incremental aspects that are
         | described in the paper.
         | 
         | The build-based approach that you describe is also used by the
         | Language Server Protocol (LSP) ecosystem. You've summarized the
         | tradeoffs quite well! I've described a bit more about why we
         | decided against a build-based/LSP approach here [4]. One of the
         | biggest deciding factors is that at our scale, incremental
         | processing is an absolute necessity, not a nice-to-have.
         | 
         | [1] https://github.blog/2021-12-09-introducing-stack-graphs/
         | 
         | [2] https://dcreager.net/talks/2021-strange-loop/
         | 
         | [3] https://news.ycombinator.com/item?id=29500602
         | 
         | [4] https://news.ycombinator.com/item?id=29501824
        
           | evmar wrote:
           | I read about stack graphs before, it sounds interesting!
           | 
           | I think they help, but ultimately I expect you need a
           | compiler solve the absolute madness of the totality of C++.
           | For example I think getting argument-dependent lookup right
           | in the presence of 'auto' requires type information? And
           | there are other categories of things (like header search
           | paths) where I think you are forced to involve the build
           | system too.
        
             | dmoy wrote:
             | Yup, it is probably fair to say that C++ accounts for like
             | 50% of the complexity of Kythe at Google. Or certainly it
             | feels like it.
             | 
             | And it is also worth noting that Kythe goes a bit deeper
             | than what LSP can accomplish. In particular Kythe is built
             | around a sort of two-layer graph, where it separates the
             | physical code/line representation from a more abstract
             | semantic graph. This allows us to accomplish some things
             | that are very difficult to do in LSP.
             | 
             | Finally, Kythe at least internally has a big reliance on a
             | unified build system (Blaze, or Bazel). It becomes rapidly
             | more difficult to do when you have to hook in N different
             | build systems up front, which is why search-based
             | references are so appealing. Build integration is hard.
        
           | nerdponx wrote:
           | Has Tree Sitter been useful to projects like this? Does it
           | have promise to be useful in the future? It seems to be
           | gaining a lot of adoption among Neovim users and plugin
           | developers, but not really anywhere else. I'm curious if
           | that's because of lack of familiarity, or because it's
           | technically deficient somehow.
        
       | [deleted]
        
       | beliu wrote:
       | Sourcegraph CTO here. It's interesting to read about GitHub's
       | approach and how it contrasts with the approach we've taken at
       | Sourcegraph. One of the key tradeoffs the article highlights is
       | GitHub's decision to take the "shallow-but-wide" approach to code
       | navigation, which has enabled them to provide some level of code
       | navigation for most open-source repositories on GitHub, but at
       | the expense of precision/accuracy (i.e., the system can't
       | necessarily differentiate between different symbols with the same
       | name).
       | 
       | Sourcegraph decided early on to take the opposite approach,
       | favoring precision and accuracy over supporting every public
       | codebase. Part of the reason why is that we aren't a code host
       | that hosts millions of open-source repositories, so we didn't
       | feel the need to support all of those at once. Another big reason
       | is we heard from our users and customers that code navigation
       | accuracy was critical for exploring their private code and
       | enabling them to stay in flow (inaccurate results would break the
       | train of thought because you'd have to actively think about how
       | to navigate to the referenced symbol). We actually built out a
       | language-agnostic search-based code navigation, but increasingly
       | user feedback has driven us to adopt a more precise model, based
       | at first on our own protocol (https://srclib.org) and also the
       | LSIF protocol open-sourced by Microsoft that now enables code
       | navigation for many popular editor extensions.
       | 
       | This is not to say that GitHub's approach is wrong, but more to
       | say that it's interesting how different goals and constraints
       | have led to systems that are quite different despite tackling the
       | same general problem. GitHub aiming to provide some level of
       | navigation to every repository on GitHub, and Sourcegraph aiming
       | to provide best-in-class navigation for private codebases and
       | dependencies.
       | 
       | (Btw, hats off to the GitHub team for open-sourcing tree-sitter,
       | a great library which we've incorporated into parts of our stack.
       | We actually hosted the creator of tree-sitter, Max Brunsfeld, on
       | our podcast awhile back and it was a really fun and insightful
       | conversation if people are interested in hearing some of the
       | backstory of tree-sitter:
       | https://about.sourcegraph.com/podcast/max-brunsfeld.)
        
       | mistrial9 wrote:
       | figure 2 is repeated for some reason?
        
         | spatulon wrote:
         | That's odd. The error was not present in the original
         | publication of this article in ACM's Queue magazine:
         | https://queue.acm.org/detail.cfm?id=3487022
        
       | marceloabsousa wrote:
       | This article more about parsing at scale than static analysis at
       | scale.
        
         | dcreager wrote:
         | Parsing is definitely a big part of it, and it's a fair point
         | that for search-based Code Navigation, we don't have to do any
         | real heavy lifting on the analysis side. That said, I think the
         | article describes our non-functional requirements well (zero-
         | config, incremental, language-agnostic). It's those non-
         | functional requirements which are most important for the "at
         | scale" part. I'd go so far as to suggest that any static
         | analysis implementation that can't meet those requirements
         | would be nigh impossible for us to roll out across the entire
         | GitHub corpus.
        
       | miohtama wrote:
       | This is Big Code
        
       ___________________________________________________________________
       (page generated 2022-03-30 23:01 UTC)