[HN Gopher] Stack Graphs ___________________________________________________________________ Stack Graphs Author : todsacerdoti Score : 220 points Date : 2021-12-09 17:51 UTC (5 hours ago) (HTM) web link (github.blog) (TXT) w3m dump (github.blog) | spinningslate wrote: | Cool to see this is inspired by Eelco Visser's [1] group at TU- | Delft, who originated the concept of scope graphs [2]. | | The back story of their language design research and tools is | worth reading [3]. Not least because Nix originated as their | build system! | | [1] https://eelcovisser.org/ | | [2] https://github.blog/2021-12-09-introducing-stack-graphs/ | | [3] https://eelcovisser.org/blog/2021/02/08/spoofax-mip/ | | -- | | EDIT: Fixed formatting. | dcreager wrote: | Hard agree, this is very much a "standing on the shoulders of | giants" situation. Eelco's group has done amazing work on scope | graphs, and we would not have been able to make this without | that work as a foundation. | spinningslate wrote: | ...and I should have said: great that you acknowledged it in | your blog too. | gorgoiler wrote: | How do I plumb this precise code jumping into vim? | | My tags files just don't seem to cut it any more. | dcreager wrote: | The most straightforward option would probably be to write an | LSP wrapper around the stack graphs code. One of the people on | my team wrote a very "contrib directory" version of that for | internal testing, which lets us test our stack graph rules for | a new language in something like VS Code before deploying to | production. If someone were to write a more polished version of | that, that would be a great way to get code nav into any LSP- | compatible editor for any language that supports stack graphs. | bbkane wrote: | NeoVim has Tree-sitter and Language Server integration. I | couldn't make it work for me, bit lots of people really love | it! | botdan wrote: | If you haven't tried again recently the neovim team has done | a ton of work updating the documentation on [nvim- | lspconfig](https://github.com/neovim/nvim-lspconfig). There's | also projects like [kickstart.nvim](https://github.com/nvim- | lua/kickstart.nvim) which aim to provide a very simple | starting point for new users. It's "batteries-included" | neovim which notably includes LSP, TreeSitter, completion | engines, and some basic git functionality. | jandinter wrote: | Looks neat! | | I wish this was available for legal texts, making it easy to jump | from one law to the referenced next legal provision. Many legal | provisions, especially in very regulated areas, make use of | "functions" "imported" from other, totally different laws. | | Sorry for being off-topic, but if anyone knows a resource for | that, I am super interested! | earth_walker wrote: | I started doing this for a niche area: US and European | regulations and guidance documents for Good Laboratory | Practice, and later for Canadian Cannabis regulations. | Basically I created a standard XML schema for regulations and | parsed them into XML [1]. This allowed for e.g. presenting | tables of contents and section folding, pulling and linking | definitions into their own search engine, etc. [2] | | I thought that I could easily write a parser for each | jurisdiction's formats, and then get predicate rules and | related regulations for free. | | I was wrong. a) there are many jurisdictions and sub-groups all | doing their own thing; and b) most don't have any standard | document formatting or tagging, let alone a defined structure. | Even in the most structured formats (like the US eCFR's XML) | the focus is on display rather than content. In the worst cases | it was just whoever wrote up the Word document chose how they | numbered and formatted chapters and sections etc. | | There were so many special cases that it was a huge amount of | work to add or update each document, and I ended up doing a lot | of categorization and fixing by hand. | | [1] I know people hate XML on HN, but I did my research and had | specific reasons for choosing it at the time, including human | readable, nesting sections, being able to easily publish and | validate a schema, etc. | | [2] See ReadtheRegs.com. You can browse the definitions page | without an account. | jandinter wrote: | This looks great! I share your sentiment: I looked into the | XML files for the published German legal texts[1], and they | seem to be made for display purposes only. | | [1] Table of contents for XML files: https://www.gesetze-im- | internet.de/gii-toc.xml | masklinn wrote: | > I wish this was available for legal texts, making it easy to | jump from one law to the referenced next legal provision. Many | legal provisions, especially in very regulated areas, make use | of "functions" "imported" from other, totally different laws. | | I mean, it "is", to the extent that if you put in the work of | hyperlinking all the things during the digitizing process they | can be. | | Legifrance is fairly highly (though nowhere near completely) | hyperlinked for instance, here's one of the laws I selected | from the front page: | https://www.legifrance.gouv.fr/jorf/id/JORFTEXT000044446848 | | Many (though nowhere near all) the legal texts being | referenced, modified, or inserted (as references, into other | texts) are hyperlinked. | ova-throwaway wrote: | ? baby lawyer and former dev here: don't we have that anyways? | E.g. on Casetext, Lexis, all the usual legal research sites. | | I personally haven't encountered a situation where it was | _totally_ lacking. | jandinter wrote: | Many available options seem to be based on manual annotation | and, therefore, cover a limited range of all legal texts. | Especially with regard to regulatory topics, those research | sites usually fall short. | ss108 wrote: | I would be interested to know where you're encountering | these issues, specifically. I'm interested in legal tech, | would like to know where the gaps are | cloogshicer wrote: | I just want to say: This is an absolutely fantastic blog post. So | well written, and really well done graphics that help | understanding. | dcreager wrote: | Thanks very much, I appreciate the kind words! | armchairhacker wrote: | is this from Github semantic | (https://github.com/github/semantic)? | | Seems very suspicious since it's the same goal using the same | technologies. The latest commit is 4mo ago but i assume they have | a closed-source version they've been working on. | dcreager wrote: | It's from the same team (which I am the manager of), but it's | not using that same codebase. In Semantic, you would have to | write Haskell code to add support for a new language, and we've | found that the declarative DSLs that tree-sitter provides are a | lower barrier to entry. (Semantic also uses tree-sitter for the | parsing step, btw.) We do still have plans for Semantic, but | our stack graph code does not live there. | | _[Edit]_ Also, the stack graph implementation is also open- | source, just like Semantic, and we do our development on the | core algorithms directly there. The Python extraction rules | have not yet been moved over to the public tree-sitter-python | language repo, but that 's on the docket. Future language | support would happen directly in each language's public open- | source tree-sitter repo. | | https://github.com/github/stack-graphs/ | parhamn wrote: | Are there any editors that let you edit code in a structure | closer to the way the language handles it (e.g. graphs/stacks)? | | You could achieve a lot more than the conventional file-per- | module approach. Off the top of my head benefits could include: | much easier refactoring, comments bound to specific tokens, | function level versioning, much smarter git diffs, and so much | more. | | The mapping graphs to flat-files thing feels especially silly | when I do FE dev. Manipulating JSX doms on top of the JS function | stack is a constant reminder of how the flattening step feels | unnecessary. | sweetsocks21 wrote: | I don't know of any generic editor that can take a current | (popular) language and do that. There is however a lot going on | in this space, and I'll list some examples. Most are about | editing the tree structure better, but some do branch out a bit | more to the graph idea. | | https://docs.darklang.com/structured-editing | | https://hazel.org/ | | http://lighttable.com/ | | Emacs has some structural editing plugins like ParEdit | | Many graph/node based editors like Blender shaders (3D), | Reaktor (music) | | https://www.unisonweb.org/ (this has some of the function | versioning ideas) | | https://enso.org/ | | Smalltalk | | There's many many more, these are just some I remembered off | the top of my head. | parhamn wrote: | I appreciate the links! | hiaux0 wrote: | The Dion editor [1] is trying to tackling this problem, if I | understand correctly. [2] has some sweet GIFs for you. | | [1]: https://dion.systems/dion_format.html | | [2]: https://dion.systems/gallery.html | ball_of_lint wrote: | Lisp + SLIME is probably the closest thing that exists to what | you're describing. | dcreager wrote: | OP author here. I also gave a talk about this at Strange Loop | back in October if folks want to watch/listen instead of read: | https://dcreager.net/talks/2021-strange-loop/ | dsanchez97 wrote: | I haven't had time to watch the full talk yet, so sorry if this | is answered there. | | When python resolves 'import' statements, it looks for the | modules based on the PYTHONPATH. Although not done that often, | it is possible to modify the PYTHONPATH at runtime, changing | what an imported symbol will resolve to. How do you handle | situations like that? | | Just from a hypothetical stand point, someone could take | advantage of this to make it seem like the library is linking | to a safe implementation of a function such that when using | this feature people are directed to the safe implementation. | Then at runtime without the user knowing, they could | dynamically change the PYTHONPATH so a malicious version of the | function is loaded. | dcreager wrote: | Ooh that's a good one. Right now, our lookups are only within | the single repository. And so if you had two files that | _could_ provide the same (fully qualified) symbol, we aren't | doing any PYTHONPATH analysis to determine which one it is. | We'll show you both. | | We do eventually want to support cross-repository use cases, | and there, the answer boils down to needing to find the set | of dependencies in which to do the search. One we have that, | it's no different than an in-repo case -- we look for any | file in _any_ of the repos (yours and your dependencies) that | could provide the symbol that we 're currently looking for. | | So, short version, we'd be aiming for a solution where we'd | be able to show you both the "good" and "bad" definitions, | and let you the user decide how to use that information. | dsanchez97 wrote: | I have been doing some stuff where I analyze python code | via the AST to try to figure out symbol reference so it was | top of mind when I read the article. My tool works at | runtime by importing the users code as module, which means | all the symbols are evaluated by the python interpreter and | then I can inspect the loaded module to determine the | references. This is all part of a larger framework that has | lifecycle rules for how/when it will load user defined | code, which allows me some flexibility and information. | | Even with that flexibility, there are still some things | that just weren't possible because of how configurable | python is at runtime. For example, someone could write a | factory style class that dynamically creates python object | instances based on a passed in string that represents the | class the object will be of. Then they could pass user | input into this factory making the created objects | completely dependent on runtime input. | | I would wage 99% of python written doesn't use these kinds | of runtime abilities, and it probably isn't a great | practice to use them in general from a maintainability | point of view but they do exist. My solution to this is | that if you are sophisticated enough to be using these | features then you should be able to understand why my tool | can't capture that information from the AST. | | Not sure if that solution would work for what you are | working on, but I figured I'd let you know about my | experience because it can get gnarly quickly once you start | thinking about all the things that are possible in python. | dcreager wrote: | Yes! Those are exactly the kinds of examples that mess up | this kind of analysis. Anything where the _structure_ of | your program depends on arbitrary computation: | https://twitter.com/dcreager/status/1467654252516589571 | dsanchez97 wrote: | Yep I feel you on that making life harder for these kinds | of tools! | jkaptur wrote: | I think the (unstated) expectation with code navigation tools | is that they are best-effort. Beyond your example, plenty of | languages allow exotic and dynamic runtime behavior - | eval(..)ing user/network input, monkeypatching, etc. - that | makes it impossible to know a priori exactly what a call site | might invoke. | one_off_comment wrote: | For people like me who don't have time to watch the talk, | what's the answer to the question posed on the blog post? "Why | aren't we using the Language Server Protocol (LSP) or Language | Server Index Format (LSIF)?" | dcreager wrote: | I go into some amount of detail in a talk I gave at last | year's FOSDEM: https://dcreager.net/talks/2020-fosdem/ | | For LSP, the short version is that running separate sidecar | services in production for every language that we want to | support is a complete non-starter. That would completely eat | up my team's time budget handling operational duties. | | LSIF is a great technology that lets you run LSP servers in a | "batch" mode. But we _really_ need our analysis to be | incremental, where we can reuse results for unchanged files | when new commits come in. Language servers tend to do | monolithic analyses, where _every_ file needs to be | reanalyzed whenever _any_ new commit comes in. If you want to | analyze your dependencies, as well, that exacerbates the | problem. _LSIF_ (the data format) has recently grown the | ability to produce incremental data, but that requires | language servers to work in an incremental mode as well. Very | few (if any?) do, and because language servers tend to piggy- | back on existing compiler technology (which is also not | typically incremental), it will be a heavy lift to get | incrementality into the LSP /LSIF world. | | Whereas stack graphs have incrementality out of the box. | (This was the primary thing that we added to "scope graphs", | the academic framework that stack graphs are built on.) It's | the core algorithm (which is implemented once for all | languages) where the incrementality happens. The only | language-specific parts are figuring out which graph | structures you need to create to mimic the name binding rules | of your language. | billconan wrote: | Thank you very much for this and for open sourcing it! | | For production, is there a good database system that can index | this graph structure? | | for incremental update, how do you prune deprecated part of the | graph (for example, removed/renamed files/functions?) | | and for this example | | (function_definition name: (identifier) @name) @function { | node @function.def attr (@function.def) kind = | "definition" attr (@function.def) symbol = @name | edge @function.containing_scope -> @function.def | | } | | how can it guarantee the python shadowing rule? it doesn't seem | to encode any order preference. does the code traverse the | source file in the reverse order basically? | | And, probably not closely related to stack graph, but about | using tree-sitter for c/c++ understanding, how to handle the | preprocessor? | | because the c preprocessor can make the code look like a | completely different language and mess up the parser. | | And how to prune and simplify CST to AST at scale (supporting | many languages)? | dcreager wrote: | > And, probably not closely related to stack graph, but about | using tree-sitter for c/c++ understanding, how to handle the | preprocessor? | | Ha yeah that's a good question. Some uses of the preprocessor | won't be problematic -- it would require deep token mangling, | for instance, to really start to cause a problem. You can | treat more basic `#ifdef` style conditional compilation as | parsing/analyzing both sides and showing both as potential | definitions. (And from there you could extend it further to | try to identify (or define) "profiles" that have different | preprocessor symbols defined, and use that to actually prune | some of the results.) | billconan wrote: | Thank you very much for the answers! This is a great work! | | I'm thinking maybe stack graph can be used to understand | the preprocessor. finding the original toggle/condition | that turns on/off a #ifdef block. | | I heard a simple c++ hello world contains 5000 #defines | introduced by standard libs. if stack graph can improve | exhaustive search somehow, that would be awesome. | dcreager wrote: | > And how to prune and simplify CST to AST at scale | (supporting many languages)? | | We're not doing any pruning or CST-AST translation, we just | operate directly on the CST. With the new graph DSL you | should be able to implement something like that, since an AST | is a tree, and a tree is one shape of graph that you could | create. For our purposes, that isn't a meaningfully useful | step, since we can just as easily generate the stack graph | structures that we need directly from the CST we get from the | tree-sitter grammar. | dcreager wrote: | > For production, is there a good database system that can | index this graph structure? | | For awhile, we were storing this in a (very large) MySQL | database, sharded with Vitess. The sharding behavior worked | great (since repo ID gives you a nice sharding key), but we | found that it wasn't elastic enough for our needs, since we | quickly filled up the available capacity of the machines that | we had reserved. | | Since then we've switched over to storing this data in Azure | Blob Storage, basically using it as a glorified key/value | store. We had to write custom logic for deciding how to | structure our data so that we can efficiently write it at | index time and read it at query time, but so far it's been | working quite nicely! | | > for incremental update, how do you prune deprecated part of | the graph | | Short version is that we're storing everything on a per-file | basis. So whenever a file is changed, we generate a new stack | graph snippet for that file. There might be lots of content | in that stack graph that is identical to the stack graph of | the previous version of the file, but we don't try to do any | structural sharing more fine-grained than the file. | | Right now we aren't going in any pruning old files that | aren't being touched by any active queries, but we could. Or | move it to a colder storage tier in Blob Storage, something | like that. At least for now, the marginal costs of storing | the data for longer aren't our cost bottleneck. | dcreager wrote: | > how can it guarantee the python shadowing rule? it doesn't | seem to encode any order preference. does the code traverse | the source file in the reverse order basically? | | That snippet of graph DSL does not show the precedences being | applied, but if you look at the diagram a bit earlier in the | post, you'll see that some of edges do have precedence values | applied. In the graph DSL, that would appear as an additional | statement in the stanza: attr | (@function.containing_scope -> @function.def) precedence = 1 | dahart wrote: | This is way cool!! I don't have any deep questions yet, but to | start I'm a bit curious about some very minor things like the | name and description. I'm curious why "stack graph" as opposed | to "call graph" or "callstack graph" or something like that, | I'm guessing you do have some thoughts there. Also curious | about the way you described it, the process overall certainly | sounds a lot like parsing, compiling, and linking at the end, | but you haven't really used that analogy. I guess I'm just | wondering if you're framing the name and description carefully | and if there are specific reasons you'd be willing to discuss. | dcreager wrote: | Great questions! The framework is based on some great | existing academic work from Eelco Visser's group at TU Delft. | Their framework is called "scope graphs": | https://pl.ewi.tudelft.nl/research/projects/scope-graphs/ | | We extended scope graphs to have the symbol stack (described | in OP) and also a "scope stack", which allows us to support | the more advanced examples that I alluded to at the end. So | we chose the name "stack graphs" because it was "scope graphs | but using stacks". | ZeroCool2u wrote: | Just out of curiosity, what's the timeline look like for adding | precise code navigation to other supported languages and what | languages do you think will get support first? | | No need for precise answers, just wondering which ones we're | likely to see next after Python :) | | Also, I'm looking at the list of supported languages here[1]. | Maybe you're not the right person to ask, but are there any | plans to add support for one of the lower level / systems | programming languages like C, C++, or Rust, etc? | | Finally, thank you so much for you and your teams hard work. | This feature is _incredibly_ helpful, especially in Python! | | [1]: https://docs.github.com/en/repositories/working-with- | files/u... | dcreager wrote: | It's not "easy", but we've found that because it's based on a | declarative DSL it's less effort than you might expect. And | we're finding that there are common patterns that you use in | your graph construction rules, because there are many aspects | of name binding that end up working the same way in different | languages. So, hand-wavily (not out of secrecy but because of | not having rigorous data yet), we're finding that it's | O(months) to get a new language out the door. | | We do have a couple of other languages in the pipeline that | my team has been working on, both in terms of writing stack | graph rules to get precise support, and also to write "fuzzy" | tagging rules to get search-based support. And we definitely | do plan to include lower level languages like the ones you | mentioned. | | Lastly, one major reason that we're doing all of this in | open-source projects is that we want to ensure that language | communities can self-serve support for their languages, | should they wish to. That will be especially useful for the | long tail of languages that my team will honestly never be | able to get to ourselves. We have some work to do to get the | documentation written to properly support self-serve stack | graph rules, but it's definitely a goal that we're aiming | for. | ZeroCool2u wrote: | Got it, that's really helpful in terms of estimating the | work involved. Thanks for taking the time to answer all | these questions and congrats again on the release! | chefandy wrote: | As an aside to the technical conversation-- I appreciate the | use of cooking as an analogy in your code samples in the posted | article. | | It's not just because I was a chef! Using nouns, adjectives and | verbs from (familiar) concrete hierarchical analogies makes | technical writing much more accessible. It reduces cognitive | load by implying more about the structure of the relationship | than variables like "intVar" and functions like "printVar" tied | together in entirely contrived, abstract ways. In particular, | newer developers, or ones unfamiliar with your language | paradigms will benefit heartily. | | I implore my fellow developers to follow suit. ___________________________________________________________________ (page generated 2021-12-09 23:00 UTC)