[HN Gopher] Static analysis at GitHub ___________________________________________________________________ Static analysis at GitHub Author : trptcolin Score : 132 points Date : 2022-03-30 13:27 UTC (9 hours ago) (HTM) web link (cacm.acm.org) (TXT) w3m dump (cacm.acm.org) | evmar wrote: | I contributed to a similar system used within Google (partially | open source at kythe.io), that took the very different approach | of integrating with the language-native toolchain for each | language. | | As this article describes, doing this requires per-language | integrations and also effectively being able to "run the build" | for any given code (because e.g. the C++ header search path can | vary on a per-source-file basis), which is untenable for a | codebase as large and varied as GitHub's. However, if you can | make it work, you get the benefit of having the compiler's | understanding of the semantics of the code, which is especially | finicky in complex languages like C++ or, say, Rust. | | For example, if you look at this[1] method call it refers to a | symbol generated by a chain of macros, but the browser is still | able to point you at the definition of it. | | It's an interesting tradeoff to make: the GitHub approach likely | doesn't handle corner cases like the above but it makes up for it | in broad applicability and performance. I recall an IDE developer | once telling me they made a similar tradeoff in code completion, | in that it's better DX to pop up completions quickly even if | they're "only" 99% correct. | | (To be clear, I absolutely think the approach taken in the | article was the right one for the domain they're working in, I | was just contrasting it against my experience in a similar | problem where we took a very different approach.) | | [1] | https://source.chromium.org/chromium/chromium/src/+/main:v8/... | dcreager wrote: | Note that this article describes our implementation of "search- | based" or "ctags-like" Code Navigation, which definitely has | the imprecision that you describe. We've also been working over | the previous ~year on a framework called Stack Graphs [1,2,3], | which lets us tackle "precise" Code Navigation while still | having the zero-config and incremental aspects that are | described in the paper. | | The build-based approach that you describe is also used by the | Language Server Protocol (LSP) ecosystem. You've summarized the | tradeoffs quite well! I've described a bit more about why we | decided against a build-based/LSP approach here [4]. One of the | biggest deciding factors is that at our scale, incremental | processing is an absolute necessity, not a nice-to-have. | | [1] https://github.blog/2021-12-09-introducing-stack-graphs/ | | [2] https://dcreager.net/talks/2021-strange-loop/ | | [3] https://news.ycombinator.com/item?id=29500602 | | [4] https://news.ycombinator.com/item?id=29501824 | evmar wrote: | I read about stack graphs before, it sounds interesting! | | I think they help, but ultimately I expect you need a | compiler solve the absolute madness of the totality of C++. | For example I think getting argument-dependent lookup right | in the presence of 'auto' requires type information? And | there are other categories of things (like header search | paths) where I think you are forced to involve the build | system too. | dmoy wrote: | Yup, it is probably fair to say that C++ accounts for like | 50% of the complexity of Kythe at Google. Or certainly it | feels like it. | | And it is also worth noting that Kythe goes a bit deeper | than what LSP can accomplish. In particular Kythe is built | around a sort of two-layer graph, where it separates the | physical code/line representation from a more abstract | semantic graph. This allows us to accomplish some things | that are very difficult to do in LSP. | | Finally, Kythe at least internally has a big reliance on a | unified build system (Blaze, or Bazel). It becomes rapidly | more difficult to do when you have to hook in N different | build systems up front, which is why search-based | references are so appealing. Build integration is hard. | nerdponx wrote: | Has Tree Sitter been useful to projects like this? Does it | have promise to be useful in the future? It seems to be | gaining a lot of adoption among Neovim users and plugin | developers, but not really anywhere else. I'm curious if | that's because of lack of familiarity, or because it's | technically deficient somehow. | [deleted] | beliu wrote: | Sourcegraph CTO here. It's interesting to read about GitHub's | approach and how it contrasts with the approach we've taken at | Sourcegraph. One of the key tradeoffs the article highlights is | GitHub's decision to take the "shallow-but-wide" approach to code | navigation, which has enabled them to provide some level of code | navigation for most open-source repositories on GitHub, but at | the expense of precision/accuracy (i.e., the system can't | necessarily differentiate between different symbols with the same | name). | | Sourcegraph decided early on to take the opposite approach, | favoring precision and accuracy over supporting every public | codebase. Part of the reason why is that we aren't a code host | that hosts millions of open-source repositories, so we didn't | feel the need to support all of those at once. Another big reason | is we heard from our users and customers that code navigation | accuracy was critical for exploring their private code and | enabling them to stay in flow (inaccurate results would break the | train of thought because you'd have to actively think about how | to navigate to the referenced symbol). We actually built out a | language-agnostic search-based code navigation, but increasingly | user feedback has driven us to adopt a more precise model, based | at first on our own protocol (https://srclib.org) and also the | LSIF protocol open-sourced by Microsoft that now enables code | navigation for many popular editor extensions. | | This is not to say that GitHub's approach is wrong, but more to | say that it's interesting how different goals and constraints | have led to systems that are quite different despite tackling the | same general problem. GitHub aiming to provide some level of | navigation to every repository on GitHub, and Sourcegraph aiming | to provide best-in-class navigation for private codebases and | dependencies. | | (Btw, hats off to the GitHub team for open-sourcing tree-sitter, | a great library which we've incorporated into parts of our stack. | We actually hosted the creator of tree-sitter, Max Brunsfeld, on | our podcast awhile back and it was a really fun and insightful | conversation if people are interested in hearing some of the | backstory of tree-sitter: | https://about.sourcegraph.com/podcast/max-brunsfeld.) | mistrial9 wrote: | figure 2 is repeated for some reason? | spatulon wrote: | That's odd. The error was not present in the original | publication of this article in ACM's Queue magazine: | https://queue.acm.org/detail.cfm?id=3487022 | marceloabsousa wrote: | This article more about parsing at scale than static analysis at | scale. | dcreager wrote: | Parsing is definitely a big part of it, and it's a fair point | that for search-based Code Navigation, we don't have to do any | real heavy lifting on the analysis side. That said, I think the | article describes our non-functional requirements well (zero- | config, incremental, language-agnostic). It's those non- | functional requirements which are most important for the "at | scale" part. I'd go so far as to suggest that any static | analysis implementation that can't meet those requirements | would be nigh impossible for us to roll out across the entire | GitHub corpus. | miohtama wrote: | This is Big Code ___________________________________________________________________ (page generated 2022-03-30 23:01 UTC)