[HN Gopher] AST-grep(sg) is a CLI tool for code structural searc...
       ___________________________________________________________________
        
       AST-grep(sg) is a CLI tool for code structural search, lint, and
       rewriting
        
       Author : methou
       Score  : 213 points
       Date   : 2023-12-10 12:03 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | beardedwizard wrote:
       | Is this meant to complement or compete with semgrep?
        
         | andrewshadura wrote:
         | Well, it _is_ semgrep (hence sg).
        
           | beardedwizard wrote:
           | yeah I had this feeling a bit, I guess I'm curious what
           | problems they solve differently (if any). My sense is that
           | semgrep is an enterprise managed solution of the same kind
           | (and btw, is still itself OSS)
        
         | ekidd wrote:
         | Well, when I search for "semgrep", I get a very nice corporate
         | landing page with a "Book Demo" button. Which is a level of
         | hassle that just isn't worth it for smaller teams, because
         | "Book Demo" usually means "We're going to do a dance to see how
         | much money we can extract from you." Which smaller teams may
         | only want to do for a handful of key tools.
         | 
         | (4 years ago, I was more willing to put up with enterprise
         | licensing. But in the last two years, I've seen way too many
         | enterprise vendors try to squeeze every penny they can get from
         | existing clients. An enterprise sales process now often means
         | "Expect 30% annual price hikes once you're in too deep to back
         | out." The lack of easy VC money seems to have made some
         | enterprise vendors pretty desperate.)
         | 
         | There's also an open source "semgrep" project here:
         | https://github.com/semgrep/semgrep. But this seems to be
         | basically a vulnerability scanner, going by the README.
         | 
         | Whereas AST-grep seems to focus heavily on things like:
         | 
         | 1. One-off searching: "Search my tree for this pattern."
         | 
         | 2. Refactoring: "Replace this pattern with this other pattern."
         | 
         | AST-grep also includes a vulnerability scanning mode like
         | semgrep.
         | 
         | It's possible that semgrep also has nice support for (1) and
         | (2), but it isn't clearly visible on their corporate landing
         | page or the first open source README I found.
        
           | icholy wrote:
           | Semgrep is capable of one-off searching and refactoring. I
           | agree that the docs are a little hard to navigate.
        
           | herrington_d wrote:
           | Thank you, ekidd, for your kind words! ast-grep author here.
           | This is a hobby project and mainly focuses on developers'
           | daily jobs like search and linting. I appreciate that you
           | like it!
           | 
           | Semgrep's vulnerability scanning is much more advanced,
           | mostly for enterprise security usage.
        
         | icholy wrote:
         | Looks like a competitor to me.
        
         | herrington_d wrote:
         | Hi, ast-grep author here. This is a great question and I asked
         | this in the first place before I started the hobby project.
         | 
         | TL;DR: I designed ast-grep to be on a different track than
         | semgrep.
         | 
         | Semgrep is for security and ast-grep is for development.
         | 
         | First and foremost, I have always been in awe of semgrep.
         | Semgrep's documentation, product sites and Padioleau's podcast
         | all gave me a lot of inspiration. Using code to find code is
         | such a cool idea that I never need to craft an intricate regex
         | or write a lengthy AST program. sgrep and patch from
         | https://github.com/facebookarchive/pfff/wiki/Sgrep have helped
         | me a lot in real large codebases.
         | 
         | When I used semgrep as a software engineer, rather than as a
         | security researcher, I found that semgrep does not cover much
         | routine development work. I can use `semgrep -e PATTERN`,
         | but the Python wrapper is not very fast compared to grep. While
         | the pattern syntax is cool, it cannot precisely match some
         | syntax nodes (for example, selecting a generator expression in
         | Semgrep is very hard). It also does not have an API to find
         | code programmatically.
         | 
         | I also have a short summary comparing the tools:
         | https://ast-grep.github.io/advanced/tool-comparison.html
        
           | herrington_d wrote:
           | Why I think semgrep is a security tool different from
           | ast-grep:
           | 
           | * Semgrep is security focused. It has many advanced static
           | analysis features in its core product, such as dataflow
           | analysis, symbolic propagation, and semantic equivalence, all
           | of which are useful for security analysis. They are not
           | available in ast-grep.
           | 
           | * Semgrep's pattern syntax also prefers matching more
           | potentially vulnerable semantics than matching precise
           | syntax. Semantic level information is the better level of
           | abstraction for security model. ast-grep, on the other hand,
           | sticks to faithfully translating users' queries
           | syntactically.
           | 
           | * Semgrep has a one-off search and rewrite feature, but it is
           | not its primary focus. The CLI is a bit slow compared to
           | other tools. ast-grep strives to be a fast CLI tool.
           | 
           | * Semgrep has a product matrix for vulnerability detection:
           | detecting secrets, supply chain vulnerabilities, and
           | cross-file detection. It also has a plethora of security
           | rules in the registry. These features will not be included in
           | ast-grep.
        
       | hprotagonist wrote:
       | Nice to see treesitter showing up in tools that aren't just
       | syntax highlighting.
        
         | herrington_d wrote:
         | treesitter gives us a uniform interface to parse and manipulate
         | code, which is awe-inspiring work. I wish tree-sitter could
         | have more contributors to the core library. It still has a lot
         | of room for improvement.
         | 
         | Performance, for example. tree-sitter's initial parsing speed
         | can be easily beaten by a carefully hand-crafted parser.
         | Tree-sitter, written in C, has a JavaScript parsing speed
         | similar to that of Babel, a JS-based parser. See the benchmark
         | https://dev.to/herrington_darkholme/benchmark-typescript-par...
        
         | teo_zero wrote:
         | Besides, it doesn't shine at syntax highlighting, either! In
         | the sense that it doesn't add anything that the traditional
         | text-based algorithms embedded in practically any text editor
         | can't already do. For example, if I declare a variable called
         | "something", it should highlight all successive occurrences of
         | "something" in a remarkably different style than "somethink".
         | And the "a" in "sizeof(a)" should be rendered differently when
         | it's a variable and when it's a type.
        
       | gpuhacker wrote:
       | Does anyone happen to know of a similar tool that can compare two
       | codes for semantic similarity?
        
         | LelouBil wrote:
         | Maybe look here (never used it though)
         | 
         | https://github.com/Wilfred/difftastic
        
           | dorian-graph wrote:
           | Or https://github.com/afnanenayet/diffsitter. I've tried both
           | and I like them. No preference or notable opinions on them
           | yet!
        
         | _a_a_a_ wrote:
         | define 'semantic similarity'
         | 
         | would your hoped-for tool recognise that
         | 
         |     1
         | 
         | and
         | 
         |     sin(x)^2 + cos(x)^2
         | 
         | are the same? (I think that identity holds, but if not you get
         | the picture)
        
           | _a_a_a_ wrote:
           | to the downvoter: I thought this was a reasonable question?
           | Semantic equivalence is IIRC undecidable in general. Some
           | languages (Backus' FL?) try to deal with that but I dunno.
        
             | tyingq wrote:
             | > Semantic equivalence is IIRC undecidable in general.
             | 
             | They did mention code, and said "similarity" rather than
             | equivalence.
             | 
             | But, as a trivial example, two different pieces of code can
             | compile down to the same AST, or bytecode, or assembler.
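
A minimal sketch of the "same AST" point, using Python's standard `ast` module rather than any tool from this thread. The helper name `same_ast` is made up for illustration. Two snippets that differ only in spacing, comments, and redundant parentheses parse to the same tree, while a change in grouping does not.

```python
import ast


def same_ast(source_a: str, source_b: str) -> bool:
    """Return True when both snippets parse to an identical Python AST."""
    # ast.dump() leaves out line/column attributes by default,
    # so differences in layout do not affect the comparison.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))


# Different text, same tree: spacing and the comment disappear in the AST.
a = "total = (1 + 2) * x  # comment"
b = "total=(1+2)*x"
print(same_ast(a, b))

# Dropping the parentheses changes the grouping, so the tree differs.
c = "total = 1 + 2 * x"
print(same_ast(a, c))
```

This only shows textual differences collapsing at the syntax level; it says nothing about deeper semantic similarity.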
        
           | mst wrote:
           | That looks like a case where "analyse the AST after constant
           | folding" might be a theoretical path if you had a language
           | frontend that could emit the AST at that point.
           | 
           | I suspect that things like "these two functions both start
           | with the same conditional+early return" would be more useful
           | to -me- given the sort of things I tend to be working on *.
           | Also a 'fuzzy possible copy+paste detector' in general to
           | help identify refactoring targets.
           | 
           | It also strikes me that something that was mostly 'just' a
           | structure-aware diff so e.g. you got diffs within-if-body and
           | similar but I'm now into vigorous hand waving because it's
           | been ages since I've thought about this and I probably need
           | more coffee.
           | 
           | * I -did- do a pure maths degree many years ago but I don't
           | generally seem to end up working on computational code
        
           | thfuran wrote:
           | Not with floats it isn't.
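
A small sketch of the floating-point caveat, assuming ordinary IEEE 754 doubles as used by Python's `float`. The function name and the sample grid are arbitrary choices for illustration. It counts how many sample points give a result that is not exactly 1.0 and records the largest deviation; on typical platforms the count tends to be nonzero while the deviation stays within a few units in the last place.

```python
import math
from typing import Tuple


def identity_check(samples: int = 1000) -> Tuple[int, float]:
    """Evaluate sin(x)**2 + cos(x)**2 on a grid of x values.

    Returns (number of points not exactly equal to 1.0, largest deviation).
    """
    mismatches = 0
    worst = 0.0
    for i in range(samples):
        x = i * 0.01
        value = math.sin(x) ** 2 + math.cos(x) ** 2
        if value != 1.0:
            mismatches += 1
        worst = max(worst, abs(value - 1.0))
    return mismatches, worst


count, deviation = identity_check()
print(count, deviation)
```

A tool comparing the two expressions would therefore need a tolerance such as `math.isclose`, not exact equality.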
        
             | _a_a_a_ wrote:
             | umm, touche
        
         | benmanns wrote:
         | You could try embedding the two codes with an LLM and run any
         | number of similarity measures on the output vectors.
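
One possible similarity measure for that approach is cosine similarity, sketched here in plain Python. The three vectors are invented stand-ins: real embeddings would come from whichever model you choose and would have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings of three code snippets (made-up numbers).
snippet_a = [0.9, 0.1, 0.3]
snippet_b = [0.8, 0.2, 0.4]
snippet_c = [-0.5, 0.9, -0.1]

print(cosine_similarity(snippet_a, snippet_b))  # close to 1: similar direction
print(cosine_similarity(snippet_a, snippet_c))  # negative: dissimilar direction
```

Values near 1 suggest the embeddings point the same way; how well that tracks actual semantic similarity depends entirely on the embedding model.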
        
       | alexpovel wrote:
       | Wow! What a coincidence. Just the other day I finished "v1" of a
       | similar tool: https://github.com/alexpovel/srgn , calling it a
       | combination of tr/sed, ripgrep and tree-sitter. It's more about
       | editing code in-place, not finding matches.
       | 
       | I've spent a lot of time trying to find similar tools, and even
       | list them in the README, but `AST-grep` did not come up! I was a
       | bit confused, as I was sure such a thing _must_ exist already.
       | AST-grep looks much more capable and dynamic, great work,
       | especially around the variable syntax.
        
         | tekacs wrote:
         | This looks really interesting, thank you for putting this
         | together! I'll likely give it a go today. I say that as someone
         | who has explored quite a few of these and found them mostly
         | quite basic. srgn looks like more than the usual.
         | 
         | One minor comment: I personally found the first Python example
         | involving a docstring a little hard to parse (ha). It may show
         | a variety of features, but in particular I found that it was
         | hard to spot at a glance what had changed.
         | 
         | If you could use diff formatting or a screenshot with color to
         | show the differences it would make it much easier to follow. If
         | I get around to using it later today, I might submit a PR for
         | that. :)
        
           | alexpovel wrote:
           | > diff formatting
           | 
           | Thank you for the feedback! That sounds good, I'll add that.
        
         | alchemist1e9 wrote:
         | Such an awesome idea and useful tool!
         | 
         | Do you use tree-sitter for the AST part also?
        
           | alexpovel wrote:
           | Exactly, all the parsing is done by tree-sitter. The Rust
           | bindings to the tree-sitter C lib are a "first-class
           | consumer".
        
       | eloh wrote:
       | There is also a neovim plugin doing structural search/replace,
       | also based on treesitter: https://github.com/cshuaimin/ssr.nvim
        
       | wslh wrote:
       | ELI5: should you specify the target language? The example is in
       | TS, how do we expand it to other programming languages?
        
         | lyjackal wrote:
         | I see an
         | 
         |     -l ts
         | 
         | And an
         | 
         |     -l rs
         | 
         | In the examples. Those target typescript and rust. Looks like
         | it's built on tree-sitter, so presumably any language that
         | supports that should work
        
           | wslh wrote:
           | I understand this approach is different from Semmle [1] (has
           | queries and states). Do you know if they are modern
           | alternatives to it?
           | 
           | [1] https://en.wikipedia.org/wiki/Semmle
        
         | simonw wrote:
         | There is a list of supported languages here:
         | https://ast-grep.github.io/guide/introduction.html#supported...
         | 
         | If you leave off the language command line option it detects
         | the language from the extension on your files.
        
       | gushogg-blake wrote:
       | I came up with a similar concept for in-editor SSR as an
       | extension to existing find/replace functionality:
       | https://codepatterns.io/
       | 
       | It worked great for the use case I built it around initially but
       | I think it would need a scripting/logic component to generalise
       | to any conceivable refactoring.
        
       | elric wrote:
       | If you're into this sort of thing, there's OpenRewrite[1] for the
       | Java ecosystem.
       | 
       | [1] https://docs.openrewrite.org/
        
       | anotherpaulg wrote:
       | I'll share my similarly named tool `grep-ast` [0], which sort of
       | does the opposite of the OP's `ast-grep`. The OP's tool lets you
       | specify your search as a chunk of code/AST (and then do AST
       | transforms on matches).
       | 
       | My tool lets you grep a regex as usual, but shows you the
       | matches in a helpful AST aware way. It works with most popular
       | languages, thanks to tree-sitter.
       | 
       | It uses the abstract syntax tree (AST) of the source code to show
       | how the matching lines fit into the code structure. It shows
       | relevant code from every layer of the AST, above and below the
       | matches. It's useful when you're grepping to understand how
       | functions, classes, variables etc are used within a non-trivial
       | codebase.
       | 
       | Here's a snippet that shows grep-ast searching the django repo.
       | Notice that it finds `ROOT_URLCONF` and then shows you the method
       | and class that contain the matching line, including a helpful
       | part of the docstring. If you ran this in the terminal, it would
       | also colorize the matches.
       | 
       |     django$ grep-ast ROOT_URLCONF
       | 
       |     middleware/locale.py:
       |     |from django.conf import settings
       |     |from django.conf.urls.i18n import is_language_prefix_patterns_used
       |     |from django.http import HttpResponseRedirect
       |     ⋮...
       |     |class LocaleMiddleware(MiddlewareMixin):
       |     |    """
       |     |    Parse a request and decide what translation
       |     |    object to install in the current thread context.
       |     ⋮...
       |     |    def process_request(self, request):
       |     >        urlconf = getattr(request, "urlconf", settings.ROOT_URLCONF)
       | 
       | [0] https://github.com/paul-gauthier/grep-ast
        
         | herrington_d wrote:
         | Hey paulg, ast-grep author here! This is something I also want
         | to do in ast-grep! ast-grep prints the surrounding lines around
         | matches but they are not aware of which function/scope the
         | matches are in. May I ask how you do the scope detection in a
         | general fashion? (say language agnostic)
         | https://github.com/ast-grep/ast-grep/issues/155
        
           | anotherpaulg wrote:
           | Nice, thanks for checking out grep-ast.
           | 
           | The command line tool is a thin wrapper around the
           | `TreeContext` class, whose purpose is to show you a set of
           | "lines of interest" in the context of the entire AST. This
           | all exists because my other project aider [0] uses
           | TreeContext to display a repository map [1] so that GPT-4 can
           | understand how the most important classes, methods,
           | functions, etc fit into the entire code base of a git
           | repository.
           | 
           | But it was easy to make a CLI interface to grep lines of
           | interest and display them with TreeContext, and it turned out
           | to be quite useful.
           | 
           | The TreeContext class is line-oriented, and is mainly
           | interested in tracking language constructs whose scope spans
           | multiple lines. Typically these are things like classes,
           | methods, functions, loops, if/else constructs, etc. Given a
           | line of interest, we look at all the multi-line scopes which
           | contain it. For each such multi-line scope, we want to
           | display some "header" lines to provide context.
           | 
           | In this example, the match for "two" is contained in the
           | multi-line scopes of a method and a class. So we print their
           | headers.
           | 
           |     $ grep-ast two example.py
           | 
           |     ⋮...
           |     |class MyClass:
           |     |    "MyClass is great"
           |     ⋮...
           |     |    def print2(self):
           |     >        print("two")
           |     ⋮...
           | 
           | The trick is how to determine the header for each multi-line
           | scope? It's not ideal to just use the first line. For
           | example, it's nice that the header for the class included the
           | docstring as well as the bare `class MyClass:` line.
           | 
           | For any multi-line scope, we look at all the other AST scopes
           | which start on the same line. We take the smallest such co-
           | occurring scope, and declare the header to be the lines that
           | it spans. For a simple method like `def print2(self):`,
           | that's all that gets picked up.
           | 
           | But a complex method like `print1()` below picks up all the
           | lines which are part of its full function signature:
           | 
           |     $ grep-ast one example.py
           | 
           |     ⋮...
           |     |class MyClass:
           |     |    "MyClass is great"
           |     ⋮...
           |     |    def print1(
           |     |            self,
           |     |            prefix,
           |     |            suffix,
           |     |    ):
           |     ⋮...
           |     >        print(f"{prefix} one {suffix}")
           |     ⋮...
           | 
           | It's a heuristic, but it seems to work well in practice.
           | 
           | [0] https://github.com/paul-gauthier/aider
           | 
           | [1] https://aider.chat/docs/repomap.html
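
The header heuristic described in that comment can be sketched roughly as follows. This is not grep-ast's actual code: the spans are written by hand to stand in for what a parser might report for an `example.py` like the one above, the line numbers are invented, and limiting candidates to spans nested inside the scope is a simplification made for this sketch.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (first_line, last_line), both inclusive


def header_lines(scope: Span, spans: List[Span]) -> Span:
    """Pick the header of a multi-line scope.

    Among the other multi-line spans that start on the same line as the
    scope and end before it does, take the smallest; its lines form the
    header. With no such span, fall back to the scope's first line.
    """
    start, end = scope
    candidates = [
        s for s in spans
        if s != scope and s[0] == start and s[0] < s[1] < end
    ]
    if not candidates:
        return (start, start)
    return min(candidates, key=lambda s: s[1] - s[0])


# Hypothetical spans: a class, print1 with a five-line signature, print2.
spans = [
    (1, 12),   # class MyClass
    (4, 9),    # def print1( ... ): plus its body
    (4, 8),    # print1's parameter list, from "(" to "):"
    (11, 12),  # def print2(self): plus its body
]

print(header_lines((4, 9), spans))    # the whole signature
print(header_lines((11, 12), spans))  # only the "def" line
```

The hand-written spans do not model the docstring case mentioned above; that would need the real parse tree.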
        
       | svilen_dobrev wrote:
       | hey.. are these tools (or a combination thereof) capable of
       | converting parts of code in one language to another? Given no (or
       | minimal) idiosyncrasies... e.g. python to javascript or the other
       | way around? (And no, ML is not the answer, I need provable
       | correctness)
        
         | morgante wrote:
         | I've done a lot of work in this space, and unfortunately the
         | answer is largely no.
         | 
         | These provide a nice frontend for writing simple rules, but I
         | would not want to (essentially) write an entire transpiler in
         | yaml.
         | 
         | For Python->JavaScript, you likely want a transpiler focused
         | specifically on that.
         | 
         | Unfortunately, every such effort eventually hits serious limits
         | in the emergent complexity for languages. There's a reason most
         | of the SOTA techniques are ML-based.
        
         | herrington_d wrote:
         | Provable correctness means you have to model your source and
         | target languages. And then translate the source model to the
         | target model. It is theoretically possible, but in practice,
         | modeling an industry language is way too much work. Some
         | languages do not even have a spec :/
        
       | norir wrote:
       | The problem with any tree-sitter based tool is that there will
       | typically be edge cases where the tree-sitter parser is wrong.
       | Probably not a big deal most of the time, but it makes me wary of
       | using it for security.
        
         | Noumenon72 wrote:
         | What does it mean to use grep "for security"?
        
           | richbell wrote:
           | E.g., "I just read about CVE-2007-4559 being exploited in the
           | wild. Are we using this vulnerable method?"
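
As a rough illustration of that kind of query: CVE-2007-4559 concerns `tarfile` extraction in Python, so a structural search would look for calls to methods named `extract` or `extractall`. The sketch below uses Python's standard `ast` module, not ast-grep or semgrep, and the function name and sample source are made up.

```python
import ast
from typing import List, Tuple

SUSPECT_METHODS = {"extract", "extractall"}


def find_suspect_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line, method name) for every call like obj.extractall(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in SUSPECT_METHODS
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


sample = '''\
import tarfile

def unpack(path, dest):
    # extractall in a comment is not a call
    with tarfile.open(path) as archive:
        archive.extractall(dest)
    note = "extractall in a string is not a call either"
'''

print(find_suspect_calls(sample))
```

Unlike a plain text grep, the comment and the string literal are not reported. The match is purely syntactic, though: it has no type information, so an `extractall` on a non-tarfile object would be flagged as well.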
        
       | Phelinofist wrote:
       | So this is like a more general Coccinelle?
        
       | morgante wrote:
       | AST-grep is well done - the speed is particularly impressive and
       | it's quite easy to get started with.
       | 
       | One of the downsides of the simplicity is that rules are written
       | in yaml. This works nicely for simple rules, but if you try to
       | save a complex migration as a rule, you end up programming in
       | YAML (which is very hard).
       | 
       | For my similar tool we decided to build a full query language for
       | matching code, called GritQL:
       | https://docs.grit.io/tutorials/gritql
        
         | herrington_d wrote:
         | Hey morgante, nice to meet you again! Indeed YAML is a
         | compromise between expressiveness and ease of learning. Grit
         | did a great job of providing advanced code manipulation!
        
       | da39a3ee wrote:
       | This looks exciting. One thing I've always wanted to do is search
       | Rust code but excluding code in tests (marked by a #[cfg(test)]
       | annotation). Can it do that?
       | 
       | I certainly hope some excellent AST-based CLI code search tools
       | come to exist; hopefully this is one of them.
        
         | herrington_d wrote:
         | Of course, it has you covered.
         | 
         | https://ast-grep.github.io/playground.html#eyJtb2RlIjoiQ29uZ...
         | 
         | I have the same problem also, haha,
         | https://x.com/hd_nvim/status/1667059966111547392
        
           | da39a3ee wrote:
           | Thanks! How would you do that for a #[cfg(test)] attribute in
           | Rust? (I believe that is the true identifier of test code;
           | `mod test {}` is just a convention.) I assume Rust attributes
           | "wrap" the AST node rooted at the node that follows them?
        
       | simonw wrote:
       | Something I find really interesting about this is the way the
       | tool is packaged.
       | 
       | You can install the CLI utility in four different ways:
       | https://ast-grep.github.io/guide/quick-start.html#installati...
       | 
       |     # via Homebrew
       |     brew install ast-grep
       |     # via Cargo
       |     cargo install ast-grep
       |     # via npm
       |     npm i @ast-grep/cli -g
       |     # via pip
       |     pip install ast-grep-cli
       |     # I tested and pipx works too:
       |     pipx install ast-grep-cli
       | 
       | I really like this - it means the tool is available to people
       | familiar with any of those four distribution mechanisms.
       | 
       | You can also download pre-built binaries from their releases
       | page: https://github.com/ast-grep/ast-grep/releases/tag/0.14.2
       | 
       | On top of that, they offer API bindings for it in three different
       | languages:
       | 
       | - Rust (not yet stable):
       | https://docs.rs/ast-grep-core/latest/ast_grep_core/
       | 
       | - JavaScript/TypeScript:
       | https://ast-grep.github.io/guide/api-usage/js-api.html
       | 
       | - Python: https://ast-grep.github.io/guide/api-usage/py-api.html
       | 
       | It's rare to see a tool/library offer this depth of language
       | support out of the box.
        
         | simonw wrote:
         | I was curious so I had a look at how the
         | "pip install ast-grep-cli" command works. It downloads a wheel
         | for the correct platform from
         | https://pypi.org/project/ast-grep-cli/#files
         | 
         | The wheel just contains the two binaries (sg and ast-grep) and
         | no Python code:
         | 
         |     $ unzip -l ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
         |     Archive:  ast_grep_cli-0.14.2-py3-none-macosx_10_7_x86_64.whl
         |       Length      Date    Time    Name
         |     ---------  ---------- -----   ----
         |          6207  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/METADATA
         |           102  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/WHEEL
         |          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
         |          1077  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/license_files/LICENSE
         |      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/sg
         |      32865880  12-03-2023 07:34   ast_grep_cli-0.14.2.data/scripts/ast-grep
         |           639  12-03-2023 07:34   ast_grep_cli-0.14.2.dist-info/RECORD
         |     ---------                     -------
         |      65740862                     7 files
         | 
         | I haven't seen pip and wheels used to distribute a purely
         | binary tool like this before.
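
The same check can be done from Python, since a wheel is a zip archive. This sketch uses only the standard `zipfile` module. To stay self-contained it builds a tiny stand-in archive in memory; the `demo_cli` names and contents are invented and merely mimic the layout shown above.

```python
import io
import zipfile
from typing import List


def python_sources(wheel_bytes: bytes) -> List[str]:
    """List the .py files inside a wheel (a wheel is just a zip archive)."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        return [name for name in wheel.namelist() if name.endswith(".py")]


def build_demo_wheel() -> bytes:
    """Build a hypothetical binary-only wheel in memory for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        wheel.writestr("demo_cli-0.0.1.dist-info/METADATA", "Name: demo-cli\n")
        wheel.writestr("demo_cli-0.0.1.dist-info/WHEEL", "Wheel-Version: 1.0\n")
        # Files under <name>.data/scripts/ are installed as executables.
        wheel.writestr("demo_cli-0.0.1.data/scripts/demo", b"placeholder binary")
    return buffer.getvalue()


print(python_sources(build_demo_wheel()))  # no Python modules in the archive
```

For a real wheel, read the downloaded `.whl` file's bytes and pass them to `python_sources` in place of the demo archive.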
        
           | charliermarsh wrote:
           | This is how Ruff works too! (Ruff is also a standalone binary
           | with no Python dependency.) If you're interested, I recommend
           | checking out Maturin, which makes this pretty easy -- you can
           | ship any standalone Rust binary as a Python package by
           | zipping it into a wheel.
        
             | herrington_d wrote:
             | I confess I stole the pip recipe from Charlie :D
             | 
             | https://github.com/astral-sh/ruff/blob/main/.github/workflow...
        
       | tedunangst wrote:
       | A looping gif is an unfortunate choice for a demo. It looks cool
       | to start, but then I'm trying to see what it's done when it
       | restarts and I have to sit through it again. Some before and
       | after still screenshots would help.
        
         | eviks wrote:
         | indeed, this is a purely text demo, and it wastes too much time
         | with slow typing in the video while also preventing you from
         | using search
        
       | Conscat wrote:
       | I've tried using this, but the documentation and learning
       | resources weren't very good (at least at the time ~6 months ago)
       | and structuring refactors with YAML made it very cumbersome for
       | me to write and edit one-off commands.
       | 
       | Tree Sitter also leaves a lot to be desired for C++ editing, but
       | that's a special problem.
        
         | simonw wrote:
         | Looks like the project is only about 12 months old, so if you
         | last checked it out 6 months ago it's worth taking another
         | look.
         | 
         | Was it possible to use it entirely as a CLI tool without any
         | YAML 6 months ago?
        
           | Conscat wrote:
           | Unless the search/replace is super simple, you need the YAML
           | as far as I can tell. The refactor I gave up on automating
           | had to do with changing variadic C++ macros into arithmetic
           | expressions, which wasn't conceptually very complicated, but
           | felt almost impossible while constantly tripping over YAML
           | syntax errors.
        
             | simonw wrote:
             | The YAML syntax I find most useful for this kind of thing
             | is this:
             | 
             |     something:
             |       subkey: |
             |         I can put any characters I like in here
             |         And they "won't be messed up" by anything
             |         Because they are part of a multi-line string
        
       | elanning wrote:
       | Also plugging my related project:
       | https://github.com/Ichigo-Labs/cgrep
       | 
       | From the comments in this thread, it seems a lot of people have
       | built or needed an easy way to quickly create static analysis
       | checks, without a bunch of hassle. I think extended regex is a
       | great way to do this.
        
       | cglong wrote:
       | I was hoping this could be a local replacement for Azure DevOps's
       | functional code search[1], but this seems lower-level than that.
       | Basically, I want a tool where I can write something like
       | `class:Logger` and it'll show me which file(s) define a class
       | with that name, or `ref:Logger` to find all usages of that/those
       | class(es).
       | 
       | [1]: https://learn.microsoft.com/en-us/azure/devops/project/searc...
        
       ___________________________________________________________________
       (page generated 2023-12-10 23:00 UTC)