[HN Gopher] AST-grep(sg) is a CLI tool for code structural searc...
___________________________________________________________________

AST-grep(sg) is a CLI tool for code structural search, lint, and rewriting

Author : methou
Score  : 213 points
Date   : 2023-12-10 12:03 UTC (10 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| beardedwizard wrote:
| Is this meant to complement or compete with semgrep?
| andrewshadura wrote:
| Well, it _is_ semgrep (hence sg).
| beardedwizard wrote:
| Yeah, I had this feeling a bit. I guess I'm curious what
| problems they solve differently (if any). My sense is that
| semgrep is an enterprise managed solution of the same kind
| (and btw, is still itself OSS).
| ekidd wrote:
| Well, when I search for "semgrep", I get a very nice corporate
| landing page with a "Book Demo" button. Which is a level of
| hassle that just isn't worth it for smaller teams, because
| "Book Demo" usually means "We're going to do a dance to see how
| much money we can extract from you." Which smaller teams may
| only want to do for a handful of key tools.
|
| (4 years ago, I was more willing to put up with enterprise
| licensing. But in the last two years, I've seen way too many
| enterprise vendors try to squeeze every penny they can get from
| existing clients. An enterprise sales process now often means
| "Expect 30% annual price hikes once you're in too deep to back
| out." The lack of easy VC money seems to have made some
| enterprise vendors pretty desperate.)
|
| There's also an open source "semgrep" project here:
| https://github.com/semgrep/semgrep. But this seems to be
| basically a vulnerability scanner, going by the README.
|
| Whereas AST-grep seems to focus heavily on things like:
|
| 1. One-off searching: "Search my tree for this pattern."
|
| 2. Refactoring: "Replace this pattern with this other pattern."
|
| AST-grep also includes a vulnerability scanning mode like
| semgrep.
|
| It's possible that semgrep also has nice support for (1) and
| (2), but it isn't clearly visible on their corporate landing
| page or the first open source README I found.
| icholy wrote:
| Semgrep is capable of one-off searching and refactoring. I
| agree that the docs are a little hard to navigate.
| herrington_d wrote:
| Thanks, ekidd, for your kind words! ast-grep author here. This
| is a hobby project and mainly focuses on developers' daily
| work, like search and linting. I appreciate that you like it!
|
| Semgrep's vulnerability scanning is much more advanced,
| mostly for enterprise security usage.
| icholy wrote:
| Looks like a competitor to me.
| herrington_d wrote:
| Hi, ast-grep author here. This is a great question, and I asked
| it myself before I started the hobby project.
|
| TL;DR: I designed ast-grep to be on a different track than
| semgrep.
|
| Semgrep is for security and ast-grep is for development.
|
| First and foremost, I have always been in awe of semgrep.
| Semgrep's documentation, product sites and Padioleau's podcast
| all gave me a lot of inspiration. Using code to find code is
| such a cool idea that I never need to craft an intricate regex
| or write a lengthy AST program. sgrep and patch from
| https://github.com/facebookarchive/pfff/wiki/Sgrep have helped
| me a lot in real large codebases.
|
| When I used semgrep as a software engineer, instead of a
| security researcher, I found semgrep has not touched much
| on routine development work. I can use `semgrep -e PATTERN`
| but the Python wrapper is not too fast compared to grep. While
| the pattern syntax is cool, it cannot precisely match some
| syntax nodes (for example, selecting a generator expression in
| Semgrep is very hard). It also does not have an API to find
| code programmatically.
|
| I also have a short summary comparing the tools:
| https://ast-grep.github.io/advanced/tool-comparison.html
| herrington_d wrote:
| Why I think semgrep is a security tool different from
| ast-grep:
|
| * Semgrep is security focused. It has many advanced static
|   analysis features in its core product, such as dataflow
|   analysis, symbolic propagation, and semantic equivalence, all
|   of which are useful for security analysis. They are not
|   available in ast-grep.
| * Semgrep's pattern syntax also prefers matching more
|   potentially vulnerable semantics than matching precise
|   syntax. Semantic level information is the better level of
|   abstraction for a security model. ast-grep, on the other
|   hand, sticks to faithfully translating users' queries
|   syntactically.
| * Semgrep has a one-off search and rewrite feature, but it is
|   not its primary focus. The CLI is a bit slow compared to
|   other tools. ast-grep strives to be a fast CLI tool.
| * Semgrep has a product matrix for vulnerability detection:
|   detecting secrets, supply chain vulnerabilities, and
|   cross-file detection. It also has a plethora of security
|   rules in the registry. These features will not be included
|   in ast-grep.
| hprotagonist wrote:
| Nice to see treesitter showing up in tools that aren't just
| syntax highlighting.
| herrington_d wrote:
| treesitter gives us a uniform interface to parse and manipulate
| code, which is awe-inspiring work. I wish tree-sitter could
| have more contributors to the core library. It still has a lot
| of room for improvement.
|
| Performance, for example. tree-sitter's initial parsing speed
| can be easily beaten by a carefully hand-crafted parser.
| Tree-sitter, written in C, has a similar JavaScript parsing
| speed to Babel, a JS-based parser. See the benchmark
| https://dev.to/herrington_darkholme/benchmark-typescript-par...
| teo_zero wrote:
| Besides, it doesn't shine at syntax highlighting, either!
| In the sense that it doesn't add anything that the traditional
| text-based algorithms embedded in practically any text editor
| can't already do. For example, if I declare a variable called
| "something", it should highlight all successive occurrences of
| "something" in a remarkably different style than "somethink".
| And the "a" in "sizeof(a)" should be rendered differently when
| it's a variable and when it's a type.
| gpuhacker wrote:
| Does anyone happen to know of a similar tool that can compare two
| pieces of code for semantic similarity?
| LelouBil wrote:
| Maybe look here (never used it though)
|
| https://github.com/Wilfred/difftastic
| dorian-graph wrote:
| Or https://github.com/afnanenayet/diffsitter. I've tried both
| and I like them. No preference or notable opinions on them
| yet!
| _a_a_a_ wrote:
| define 'semantic similarity'
|
| would your hoped-for tool recognise that 1
|
| and sin(x)^2 + cos(x)^2
|
| are the same? (I think that identity holds, but if not you get
| the picture)
| _a_a_a_ wrote:
| to the downvoter: I thought this was a reasonable question?
| Semantic equivalence is IIRC undecidable in general. Some
| languages (Backus' FL?) try to deal with that but I dunno.
| tyingq wrote:
| > Semantic equivalence is IIRC undecidable in general.
|
| They did mention code, and said "similarity" rather than
| equivalence.
|
| But, as a trivial example, two different pieces of code can
| compile down to the same AST, or bytecode, or assembler.
| mst wrote:
| That looks like a case where "analyse the AST after constant
| folding" might be a theoretical path if you had a language
| frontend that could emit the AST at that point.
|
| I suspect that things like "these two functions both start
| with the same conditional+early return" would be more useful
| to -me- given the sort of things I tend to be working on _.
| Also a 'fuzzy possible copy+paste detector' in general to
| help identify refactoring targets.
|
| It also strikes me that something that was mostly 'just' a
| structure-aware diff would be useful, so e.g. you got diffs
| within-if-body and similar, but I'm now into vigorous hand
| waving because it's been ages since I've thought about this
| and I probably need more coffee.
|
| _ I -did- do a pure maths degree many years ago but I don't
| generally seem to end up working on computational code
| thfuran wrote:
| Not with floats it isn't.
| _a_a_a_ wrote:
| umm, touche
| benmanns wrote:
| You could try embedding the two pieces of code with an LLM and
| run any number of similarity measures on the output vectors.
| alexpovel wrote:
| Wow! What a coincidence. Just the other day I finished "v1" of a
| similar tool: https://github.com/alexpovel/srgn , calling it a
| combination of tr/sed, ripgrep and tree-sitter. It's more about
| editing code in-place, not finding matches.
|
| I've spent a lot of time trying to find similar tools, and even
| list them in the README, but `AST-grep` did not come up! I was a
| bit confused, as I was sure such a thing _must_ exist already.
| AST-grep looks much more capable and dynamic, great work,
| especially around the variable syntax.
| tekacs wrote:
| This looks really interesting, thank you for putting this
| together! I'll likely give it a go today. I say that as someone
| who has explored quite a few of these and found them mostly
| quite basic. srgn looks like more than the usual.
|
| One minor comment: I personally found the first Python example
| involving a docstring a little hard to parse (ha). It may show
| a variety of features, but in particular I found that it was
| hard to spot at a glance what had changed.
|
| If you could use diff formatting or a screenshot with color to
| show the differences it would make it much easier to follow. If
| I get around to using it later today, I might submit a PR for
| that. :)
| alexpovel wrote:
| > diff formatting
|
| Thank you for the feedback! That sounds good, I'll add that.
| alchemist1e9 wrote:
| Such an awesome idea and useful tool!
|
| Do you use tree-sitter for the AST part also?
| alexpovel wrote:
| Exactly, all the parsing is done by tree-sitter. The Rust
| bindings to the tree-sitter C lib are a "first-class
| consumer".
| eloh wrote:
| There is also a neovim plugin doing structural search/replace,
| also based on treesitter: https://github.com/cshuaimin/ssr.nvim
| wslh wrote:
| ELI5: should you specify the target language? The example is in
| TS; how do we extend it to other programming languages?
| lyjackal wrote:
| I see an -l ts
|
| And an -l rs
|
| In the examples. Those target TypeScript and Rust. Looks like
| it's built on tree-sitter, so presumably any language that
| supports that should work.
| wslh wrote:
| I understand this approach is different from Semmle [1] (has
| queries and states). Do you know if there are modern
| alternatives to it?
|
| [1] https://en.wikipedia.org/wiki/Semmle
| simonw wrote:
| There is a list of supported languages here:
| https://ast-grep.github.io/guide/introduction.html#supported...
|
| If you leave off the language command line option it detects
| the language from the extension on your files.
| gushogg-blake wrote:
| I came up with a similar concept for in-editor SSR as an
| extension to existing find/replace functionality:
| https://codepatterns.io/
|
| It worked great for the use case I built it around initially but
| I think it would need a scripting/logic component to generalise
| to any conceivable refactoring.
| elric wrote:
| If you're into this sort of thing, there's OpenRewrite[1] for the
| Java ecosystem.
|
| [1] https://docs.openrewrite.org/
| anotherpaulg wrote:
| I'll share my similarly named tool `grep-ast` [0], which sort of
| does the opposite of the OP's `ast-grep`. The OP's tool lets you
| specify your search as a chunk of code/AST (and then do AST
| transforms on matches).
|
| My tool lets you grep a regex as usual, but shows you the
| matches in a helpful AST-aware way. It works with most popular
| languages, thanks to tree-sitter.
|
| It uses the abstract syntax tree (AST) of the source code to show
| how the matching lines fit into the code structure. It shows
| relevant code from every layer of the AST, above and below the
| matches. It's useful when you're grepping to understand how
| functions, classes, variables etc are used within a non-trivial
| codebase.
|
| Here's a snippet that shows grep-ast searching the django repo.
| Notice that it finds `ROOT_URLCONF` and then shows you the method
| and class that contain the matching line, including a helpful
| part of the docstring. If you ran this in the terminal, it would
| also colorize the matches.
|
|     django$ grep-ast ROOT_URLCONF
|
|     middleware/locale.py:
|     |from django.conf import settings
|     |from django.conf.urls.i18n import is_language_prefix_patterns_used
|     |from django.http import HttpResponseRedirect
|     ⋮...
|     |class LocaleMiddleware(MiddlewareMixin):
|     |    """
|     |    Parse a request and decide what translation object to install in the
|     |    current thread context.
|     ⋮...
|     |    def process_request(self, request):
|     >        urlconf = getattr(request, "urlconf", settings.ROOT_URLCONF)
|
| [0] https://github.com/paul-gauthier/grep-ast
| herrington_d wrote:
| Hey paulg, ast-grep author here! This is something I also want
| to do in ast-grep! ast-grep prints the surrounding lines around
| matches but they are not aware of which function/scope the
| matches are in. May I ask how you do the scope detection in a
| general fashion? (say, language agnostic)
| https://github.com/ast-grep/ast-grep/issues/155
| anotherpaulg wrote:
| Nice, thanks for checking out grep-ast.
|
| The command line tool is a thin wrapper around the
| `TreeContext` class, whose purpose is to show you a set of
| "lines of interest" in the context of the entire AST.
| This all exists because my other project aider [0] uses
| TreeContext to display a repository map [1] so that GPT-4 can
| understand how the most important classes, methods,
| functions, etc fit into the entire code base of a git
| repository.
|
| But it was easy to make a CLI interface to grep lines of
| interest and display them with TreeContext, and it turned out
| to be quite useful.
|
| The TreeContext class is line-oriented, and is mainly
| interested in tracking language constructs whose scope spans
| multiple lines. Typically these are things like classes,
| methods, functions, loops, if/else constructs, etc. Given a
| line of interest, we look at all the multi-line scopes which
| contain it. For each such multi-line scope, we want to
| display some "header" lines to provide context.
|
| In this example, the match for "two" is contained in the
| multi-line scopes of a method and a class. So we print their
| headers.
|
|     $ grep-ast two example.py
|
|     ⋮...
|     |class MyClass:
|     |    "MyClass is great"
|     ⋮...
|     |    def print2(self):
|     >        print("two")
|     ⋮...
|
| The trick is how to determine the header for each multi-line
| scope? It's not ideal to just use the first line. For
| example, it's nice that the header for the class included the
| docstring as well as the bare `class MyClass:` line.
|
| For any multi-line scope, we look at all the other AST scopes
| which start on the same line. We take the smallest such
| co-occurring scope, and declare the header to be the lines that
| it spans. For a simple method like `def print2(self):`,
| that's all that gets picked up.
|
| But a complex method like `print1()` below picks up all the
| lines which are part of its full function signature:
|
|     $ grep-ast one example.py
|
|     ⋮...
|     |class MyClass:
|     |    "MyClass is great"
|     ⋮...
|     |    def print1(
|     |        self,
|     |        prefix,
|     |        suffix,
|     |    ):
|     ⋮...
|     >        print(f"{prefix} one {suffix}")
|     ⋮...
|
| It's a heuristic, but it seems to work well in practice.
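
The header heuristic described above can be sketched in a few lines of Python. This is a minimal illustration and not grep-ast's actual code: the `Span` type, the `header_lines` function, and the line numbers assumed for `example.py` are all hypothetical, and the sketch assumes that only multi-line spans are tracked, which is one reading of how the comment uses the word "scope".

```python
from typing import List, Tuple

# (first_line, last_line), inclusive. Hypothetical representation of an AST
# node's extent; only nodes that cover more than one line are recorded.
Span = Tuple[int, int]


def header_lines(scope: Span, multiline_spans: List[Span]) -> range:
    """Pick the lines to print as the "header" of a multi-line scope.

    Among the other multi-line spans that start on the same line as `scope`,
    take the smallest one and use the lines it covers. If nothing else starts
    on that line, fall back to the first line of the scope alone.
    """
    start = scope[0]
    co_starting = [s for s in multiline_spans if s[0] == start and s != scope]
    if not co_starting:
        return range(start, start + 1)
    first, last = min(co_starting, key=lambda s: s[1] - s[0])
    return range(first, last + 1)


# Assumed layout of example.py (line numbers are made up for illustration):
#   1: class MyClass:            4-8: def print1( ... ):
#   2:     "MyClass is great"      9:     print(f"{prefix} one {suffix}")
#  11:     def print2(self):      12:     print("two")
EXAMPLE_SPANS: List[Span] = [
    (1, 12),   # class MyClass
    (2, 12),   # class body
    (4, 9),    # def print1, whole function
    (4, 8),    # print1's parameter list, which wraps over several lines
    (11, 12),  # def print2, whole function
]
```

With these assumed spans, `header_lines((4, 9), EXAMPLE_SPANS)` covers the whole wrapped signature of `print1`, while `header_lines((11, 12), EXAMPLE_SPANS)` covers only the `def print2(self):` line, which matches the behaviour described in the comment.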
|
| [0] https://github.com/paul-gauthier/aider
|
| [1] https://aider.chat/docs/repomap.html
| svilen_dobrev wrote:
| hey.. are these tools (or a combination thereof) capable of
| converting parts of code in one language to another? Given no (or
| minimum) idiosyncrasies... e.g. python to javascript or the other
| way around? (And no, ML is not the answer, I need provable
| correctness)
| morgante wrote:
| I've done a lot of work in this space, and unfortunately the
| answer is largely no.
|
| These provide a nice frontend for writing simple rules, but I
| would not want to (essentially) write an entire transpiler in
| yaml.
|
| For Python->JavaScript, you likely want a transpiler focused
| specifically on that.
|
| Unfortunately, every such effort eventually hits serious limits
| in the emergent complexity for languages. There's a reason most
| of the SOTA techniques are ML-based.
| herrington_d wrote:
| Provable correctness means you have to model your source and
| target languages. And then translate the source model to the
| target model. It is theoretically possible, but in practice,
| modeling an industry language is way too much work. Some
| languages do not even have a spec :/
| norir wrote:
| The problem with any tree-sitter based tool is that there will
| typically be edge cases where the tree-sitter parser is wrong.
| Probably not a big deal most of the time, but it makes me wary of
| using it for security.
| Noumenon72 wrote:
| What does it mean to use grep "for security"?
| richbell wrote:
| E.g., "I just read about CVE-2007-4559 being exploited in the
| wild. Are we using this vulnerable method?"
| Phelinofist wrote:
| So this is like a more general Coccinelle?
| morgante wrote:
| AST-grep is well done - the speed is particularly impressive and
| it's quite easy to get started with.
|
| One of the downsides of the simplicity is that rules are written
| in yaml.
| This works nicely for simple rules, but if you try to
| save a complex migration as a rule, you end up programming in
| YAML (which is very hard).
|
| For my similar tool we decided to build a full query language for
| matching code, called GritQL:
| https://docs.grit.io/tutorials/gritql
| herrington_d wrote:
| Hey morgante, nice to meet you again! Indeed YAML is a
| compromise between expressiveness and ease of learning. Grit did
| a great job of providing advanced code manipulation!
| da39a3ee wrote:
| This looks exciting. One thing I've always wanted to do is search
| Rust code but excluding code in tests (marked by a #[cfg(test)]
| annotation). Can it do that?
|
| I certainly hope some excellent AST-based CLI code search tools
| come to exist; hopefully this is one of them.
| herrington_d wrote:
| Of course, it has you covered.
|
| https://ast-grep.github.io/playground.html#eyJtb2RlIjoiQ29uZ...
|
| I have the same problem also, haha,
| https://x.com/hd_nvim/status/1667059966111547392
| da39a3ee wrote:
| Thanks! How would you do that for a #[cfg(test)] attribute in
| Rust? (I believe that is the true identifier of test code; `mod
| test {}` is just a convention). I assume Rust attributes
| "wrap" the AST node rooted at the node that follows them?
| simonw wrote:
| Something I find really interesting about this is the way the
| tool is packaged.
|
| You can install the CLI utility in four different ways:
| https://ast-grep.github.io/guide/quick-start.html#installati...
|
|     # via Homebrew
|     brew install ast-grep
|     # via Cargo
|     cargo install ast-grep
|     # via npm
|     npm i @ast-grep/cli -g
|     # via pip
|     pip install ast-grep-cli
|     # I tested and pipx works too:
|     pipx install ast-grep-cli
|
| I really like this - it means the tool is available to people
| with familiarity of any of those four distribution mechanisms.
|
| You can also download pre-built binaries from their releases
| page: https://github.com/ast-grep/ast-grep/releases/tag/0.14.2
|
| On top of that, they offer API bindings for it in three different
| languages:
|
| - Rust (not yet stable):
|   https://docs.rs/ast-grep-core/latest/ast_grep_core/
|
| - JavaScript/TypeScript:
|   https://ast-grep.github.io/guide/api-usage/js-api.html
|
| - Python: https://ast-grep.github.io/guide/api-usage/py-api.html
|
| It's rare to see a tool/library offer this depth of language
| support out of the box.
| simonw wrote:
| I was curious so I had a look at how the "pip install
| ast-grep-cli" command works. It downloads a wheel for the correct
| platform from https://pypi.org/project/ast-grep-cli/#files
|
| The wheel just contains the two binaries (sg and ast-grep) and
| no Python code:
|
|     $ unzip -l ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
|     Archive:  ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
|       Length      Date    Time    Name
|     ---------  ---------- -----   ----
|          6207  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/METADATA
|           102  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/WHEEL
|          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
|          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
|      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/sg
|      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/ast-grep
|           639  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/RECORD
|     ---------                     -------
|      65740862                     7 files
|
| I haven't seen pip and wheels used to distribute a purely
| binary tool like this before.
| charliermarsh wrote:
| This is how Ruff works too! (Ruff is also a standalone binary
| with no Python dependency.) If you're interested, I recommend
| checking out Maturin, which makes this pretty easy -- you can
| ship any standalone Rust binary as a Python package by
| zipping it into a wheel.
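
Since a wheel is an ordinary zip archive, the `unzip -l` inspection shown above can also be done from Python's standard library. The following is a minimal sketch; `list_wheel` is a hypothetical helper name, not something provided by pip, Maturin, or ast-grep.

```python
import zipfile
from typing import BinaryIO, List, Tuple, Union


def list_wheel(wheel: Union[str, BinaryIO]) -> List[Tuple[str, int]]:
    """Return (member name, uncompressed size in bytes) for each file in a wheel.

    `wheel` may be a path to a .whl file or an open binary file object.
    """
    with zipfile.ZipFile(wheel) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Assuming the wheel has already been downloaded, calling `list_wheel("ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl")` should list the same members that `unzip -l` reports.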
| herrington_d wrote:
| I confess I stole the pip recipe from Charlie :D
|
| https://github.com/astral-sh/ruff/blob/main/.github/workflow...
| tedunangst wrote:
| A looping gif is an unfortunate choice for a demo. It looks cool
| to start, but then I'm trying to see what it's done when it
| restarts and I have to sit through it again. Some before and
| after still screenshots would help.
| eviks wrote:
| indeed, this is a purely textual demo, and it wastes too much
| time with slow typing in the video while also preventing you
| from using search
| Conscat wrote:
| I've tried using this, but the documentation and learning
| resources weren't very good (at least at the time ~6 months ago)
| and structuring refactors with YAML made it very cumbersome for
| me to write and edit one-off commands.
|
| Tree Sitter also leaves a lot to be desired for C++ editing, but
| that's a special problem.
| simonw wrote:
| Looks like the project is only about 12 months old, so if you
| last checked it out 6 months ago it's worth taking another
| look.
|
| Was it possible to use it entirely as a CLI tool without any
| YAML 6 months ago?
| Conscat wrote:
| Unless the search/replace is super simple, you need the YAML
| as far as I can tell. The refactor I gave up on automating
| had to do with changing variadic C++ macros into arithmetic
| expressions, which wasn't conceptually very complicated, but
| felt almost impossible while constantly tripping over YAML
| syntax errors.
| simonw wrote:
| The YAML syntax I find most useful for this kind of thing
| is this:
|
|     something:
|       subkey: |
|         I can put any characters I like in here
|         And they "won't be messed up" by anything
|         Because they are part of a multi-line string
| elanning wrote:
| Also plugging my related project:
| https://github.com/Ichigo-Labs/cgrep
| From the comments in this thread, it seems a lot of
| people have built or needed an easy way to quickly create static
| analysis checks, without a bunch of hassle.
| I think extended regex is a great way to do this.
| cglong wrote:
| I was hoping this could be a local replacement for Azure DevOps's
| functional code search[1], but this seems lower-level than that.
| Basically, I want a tool where I can write something like
| `class:Logger` and it'll show me which file(s) define a class
| with that name, or `ref:Logger` to find all usages of that/those
| class(es).
|
| [1]: https://learn.microsoft.com/en-us/azure/devops/project/searc...

___________________________________________________________________
(page generated 2023-12-10 23:00 UTC)