[HN Gopher] Datasette-ripgrep: a regular expression search engin... ___________________________________________________________________ Datasette-ripgrep: a regular expression search engine for your source code Author : tosh Score : 106 points Date : 2020-11-28 10:01 UTC (12 hours ago) (HTM) web link (simonwillison.net) (TXT) w3m dump (simonwillison.net) | mkl95 wrote: | If you are an Emacs user and are interested in ripgrep, you may | want to check the rg package https://github.com/dajva/rg.el | dig1 wrote: | ripgrep works nicely with the default Emacs grep facility as | well. After "M-x grep", replace the "grep" path with "rg | --no-heading". | | Also, there is a few-line trick [1] that will make it work | with "M-x grep-find", respecting the git project tree. | | [1] https://stegosaurusdormant.com/emacs-ripgrep/ | agustif wrote: | Haven't tried Sourcegraph, but it seems to fit similar use cases, no? | asicsp wrote: | Suggestion: For the ".plugin_config(" example [0], using fixed- | string search with the -F option would be much better. Perhaps you | could add an option to choose among -F, default regex and PCRE | (for features like lookarounds). | | [0] | https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_co... | simonw wrote: | That's a great idea, thanks. Filed an issue: | https://github.com/simonw/datasette-ripgrep/issues/8 | simonw wrote: | That's now shipped in version 0.3 - demo here: | https://ripgrep.datasette.io/-/ripgrep?pattern=.plugin_confi... | hackerpain wrote: | Seems really useful. There's another tool -- https://grep.app -- | that searches for code patterns throughout GitHub, very useful for | taking inspiration from others' code or looking for examples. 
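A note on asicsp's suggestion above: the search string ".plugin_config(" contains regex metacharacters, so it has to be escaped before it can be used as a pattern, whereas a fixed-string search takes it as-is. The sketch below illustrates the difference using Python's standard `re` module as a stand-in; it is not ripgrep's regex engine, and the sample lines and helper names are hypothetical.

```python
import re

# Hypothetical sample data, for illustration only.
NEEDLE = ".plugin_config("
LINES = [
    "config = datasette.plugin_config(name)",
    "xplugin_config = 1",
]


def literal_search(needle, lines):
    """Fixed-string search: the needle is matched character for character."""
    return [line for line in lines if needle in line]


def regex_search(pattern, lines):
    """Regex search: the pattern is compiled first, so metacharacters matter."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]
```

Passing `NEEDLE` straight to `regex_search` fails to compile because of the unbalanced parenthesis, while `regex_search(re.escape(NEEDLE), LINES)` returns the same result as `literal_search(NEEDLE, LINES)`. A fixed-string mode saves the user from doing that escaping by hand.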
| wffurr wrote: | Code search is an amazing tool, and this is a great step, but | it's missing clickable cross-references, which are the killer | feature in Kythe-based code search: https://kythe.io/ | emmanueloga_ wrote: | I'm a little bit confused by the state of that website; it almost | makes Kythe look abandoned, but if you look at the pulse tab on | GitHub it tells a different story (looks very much alive). | | Is Kythe used to index some prominent open source projects? | wffurr wrote: | It's used internally at Google and also for Chrome: | https://source.chromium.org/chromium | | It's also used to index Google's open source: | https://cs.opensource.google/search?q=Kythe&sq= | | Thread from earlier this year: | https://news.ycombinator.com/item?id=22551856 | | Not sure what's giving you that "abandonware" vibe. Lack of | marketing materials? There's pretty thorough documentation on | the indexer. | thesuperbigfrog wrote: | Reminds me of OpenGrok: https://oracle.github.io/opengrok/ | | OpenGrok is powered by Universal Ctags | (https://github.com/universal-ctags/ctags) underneath and uses a | Java servlet web server (Apache Tomcat) with search powered by | Apache Lucene. | | OpenGrok is older, boring technology but it works. | | It is refreshing to see a new, Rust-powered approach to code | search. | jpxw wrote: | Looks like https://github.com/google/zoekt | eatonphil wrote: | Etsy has a web app for searching a git repo, | https://github.com/hound-search/hound. But you have to figure out | how to efficiently load and update all interesting repos on disk. | | I really wish the git protocol had a way to perform remote git | grep so you don't need to install potentially massive repos just | to search for keywords. | karmakaze wrote: | Hound is awesome. I've set it up and used it at a few | companies. Later I SaaS'ed it[0]. Ping me if you want to try it | out. | | Edit: Here's a live sandbox demo[1] of it hosted. 
| | [0] https://gitgrep.com | | [1] https://demo.gitgrep.com/login | simonw wrote: | Tool author here. There are a bunch of really good tools for code | search already which are a lot more sophisticated than this one - | livegrep for example. | | I built this partly as a learning exercise, to see how far I | could push the idea of a Datasette plugin and to figure out how | to call external processes from Python asyncio. | | I also built it because I had a hunch that the "datasette | publish" mechanism for deploying Datasette to serverless hosting | providers such as Google Cloud Run could make it easy to deploy | code search engines too. I think the most interesting thing about | this project is the GitHub Actions workflow I wrote that deploys | the demo instance running against code from 60+ repositories: | https://github.com/simonw/datasette-ripgrep/blob/main/.githu... | oefrha wrote: | Feedback: code search in a repository (or multiple repositories) | should allow blacklisting and/or whitelisting files and | directories. Every time I try to search for something on GitHub, | I have to sift through tons and tons of useless results from | tests. | | Edit: The discussion has taken a weird turn into support for | ignore files. To be clear, this feedback is about dynamically | filtering searches when using a website to do code search, most | likely searching someone else's repo on someone else's website; | you don't have access to the filesystem, the search form is the | only input. So all this "rg supports ignore files" (I'm aware, | but thanks) is not really relevant. | IshKebab wrote: | I agree. https://grep.app/ does this really well. | simonw wrote: | I've added the ability to use globs (via ripgrep --glob) to | include and exclude directories and file path patterns: | | https://ripgrep.datasette.io/-/ripgrep?pattern=with.*AsyncCl... 
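For readers curious how the pieces simonw mentions fit together (calling an external process from Python asyncio, and passing include/exclude globs through to ripgrep), here is a minimal sketch. It is not the plugin's actual code: the function names, the timeout value and the choice of `--json` output are assumptions made for illustration. The ripgrep flags used (`--json`, `--fixed-strings`, `--glob`, `-e`) are real, and a glob prefixed with `!` excludes matching paths.

```python
import asyncio


def build_rg_args(pattern, path, globs=(), fixed_strings=False):
    """Build an argument list for ripgrep (hypothetical helper, not the plugin's code)."""
    args = ["rg", "--json"]
    if fixed_strings:
        # Treat the pattern as a literal string instead of a regex.
        args.append("--fixed-strings")
    for glob in globs:
        # Each glob includes matching paths; a leading "!" excludes them.
        args += ["--glob", glob]
    # -e marks the pattern explicitly, so one starting with "-" is not read as a flag.
    args += ["-e", pattern, path]
    return args


async def run_rg(args, timeout=5.0):
    """Run ripgrep without blocking the event loop and return its output lines."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Stop a runaway search instead of letting it hold the request open.
        proc.kill()
        await proc.wait()
        raise
    return stdout.decode("utf-8", errors="replace").splitlines()
```

With ripgrep installed, `asyncio.run(run_rg(build_rg_args("register_blueprint", "repos", globs=["*.py", "!tests/**"])))` would search only Python files outside `tests/` in a hypothetical `repos` directory. Passing the arguments as a list, rather than building a shell string, avoids quoting problems with user-supplied patterns.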
| zeckalpha wrote: | Include the filename in the output and pipe to grep :) | derefr wrote: | > To be clear, this feedback is about dynamically filtering | searches when using a website to do code search, most likely | searching someone else's repo on someone else's website; you | don't have access to the filesystem, the search form is the | only input. | | I think people are imagining that it should be the repo- | author's responsibility to create the equivalent of an | `.rgignore` file ahead-of-time (since _they_ know the codebase's | structure, _they_ know what's useful vs. useless to show up | in a search of the code -- pretty objectively, "code" should | show up, and "not-code" shouldn't); and that code-search on | websites like GitHub should respect the rules in such a file if | it were discovered in the repo. | | If such a thing existed and was in wide use, you -- a random | passerby to the codebase, who doesn't yet understand it -- | wouldn't need to struggle to do any dynamic filtering, because | static filtering would already have been done for you. | oefrha wrote: | > If such a thing existed and was in wide use, you -- a | random passerby to the codebase, who doesn't yet understand | it -- wouldn't need to struggle to do any dynamic filtering, | because static filtering would already have been done for | you. | | That's the thing, it's impossible to create a generic ignore | list that caters to everyone's interest. Could the_mitsuhiko, | the creator of flask, provide such a list for flask? Nope. | Some people want to understand how register_blueprint is | implemented and don't want to see anything from test, docs | and examples. Some people want to learn how to use | register_blueprint, so they specifically want to limit | themselves to examples. Some other people want to check if | there are test cases for a certain API. You can't ignore any | file that may be interesting to someone, which is basically | all of them. | burntsushi wrote: | Right. 
Ignore files and search-time dynamic filters solve | different problems. They aren't mutually exclusive. | Otherwise, I wouldn't have added the -g/--glob flag to | ripgrep in the first place. | simonw wrote: | I really want this feature. Filed an issue here, thanks: | https://github.com/simonw/datasette-ripgrep/issues/9 | oefrha wrote: | Thanks! | burntsushi wrote: | I agree the replies you're getting are weird. Ignore files would be nice if | you wanted repo specific filters or even user wide global | filters. But for on-demand filters at search time, the frontend | could add a feature for it and implement it with the | `-g/--glob` flag. | the_mitsuhiko wrote: | So I guess since I started this entire thread here is why it | even came up: datasette (a tool I use) is something you run | locally. In particular this plugin also is pointed at a local | checkout of the repositories you're working with. | | So for instance in my case I have all the code I work with in | ~/Development. Since I also use ripgrep for local development | I typically include .ignore/.rgignore files to control what | I'm searching around in my repos. | | Doing "excludes"/"includes" across a variety of repositories | is hard at query time because the layouts of those | repositories can be very different. For instance in some JS | repos you do want to search in "dist" whereas in others you | don't because "dist" gets generated out of "src" for instance | and just ends up being a duplicate etc. | burntsushi wrote: | Ah I see. I didn't know what datasette was. | | The way I see doing includes/excludes at query time is | this. You run a search without excludes for example. You | see a bunch of stuff coming back all in one directory that | you don't want for this particular search. So you add an | exclude rule for that directory. And then just refine | results that way. 
| | If you have a bunch of similarly named directories, then I | think that just means that the exclude rules at query time | need to be more specific. e.g., use the full path to the | directory. | | I think the feature being requested here is an interactive | feature of the user interface, whereas ignore files are | more like a static fact of a directory tree. | the_mitsuhiko wrote: | Ripgrep honors ignore files. | oefrha wrote: | This has nothing to do with ignore files. Tests pushed to a | GitHub repository obviously won't be matched by gitignore | (unless you ignore them in gitignore then force commit them, | which would be really rare), I'm just not interested in them | in most of my searches. | | I'm talking about excluding files and directories during | searches, through a directive like, say, "-exclude tests | -exclude test_*.py". | | Sure, I guess technically you can add a bunch of additional | patterns to gitignore before you use this web frontend, but | that would be a really weird workflow, and it's impossible | when the web frontend is run by someone else. | woadwarrior01 wrote: | You could add them to .rgignore or .ignore (which also | works with ag, IIRC) instead of adding them to .gitignore. | oefrha wrote: | Which is again impossible when you use someone else's web | frontend for code search. Plus if you're editing | .rgignore on the filesystem for every search, what's the | point of using a web frontend? | [deleted] | the_mitsuhiko wrote: | Not sure why ignore files are not the solution here. You | can check an ignore file into your repo then whoever | uses ripgrep through whatever frontend gets that logic | applied. I'm using this and I don't quite see the | downsides of that. | oefrha wrote: | - I can't control what someone else checks into their | repo. I use web frontends for code search almost | exclusively on other people's repos, because my own repos | are already on my machine, so what's the point. 
| | - Most of the time I don't want to see search results in | tests, but occasionally I actually want to find things in | tests and not anywhere else. Different people have | different filtering needs at different times, you can't | have an ignore file that satisfies every need. Search is | about filtering to begin with. | the_mitsuhiko wrote: | Datasette is a tool you run against your stuff usually. I | have control over my repos. | | Once you do cross-repo search, providing global filters | on every query turns into an awful user experience. | | WRT web frontends: datasette runs locally. | oefrha wrote: | Well, all this tool does is call rg from a directory | with a bunch of repos, so providing global filters on | every query is as awful as rg providing a --glob flag... | Which is arguably not that awful to a lot of users. | | > Datasette is a tool you run against your stuff usually. | ... datasette runs locally. | | I don't think so? I could be wrong but I thought a major | use case (or _the_ major use case) is providing a | frontend for your data to other people. Like on | https://ripgrep.datasette.io/-/ripgrep. | | To quote https://docs.datasette.io/en/stable/, | | > Datasette is aimed at data journalists, museum | curators, archivists, local governments and anyone else | who has data that they wish to share with the world. | masklinn wrote: | > To quote https://docs.datasette.io/en/stable/, | | >> Datasette is aimed at data journalists, museum | curators, archivists, local governments and anyone else | who has data that they wish to share with the world. | | That doesn't preclude installing the program locally. | | The official "getting started" talks of remote / web | datasettes as "demos" and "trials". | oefrha wrote: | > That doesn't preclude installing the program locally. | | Of course not. But the project makes it pretty clear that | it's designed with multiuser in mind. Can you host | multiuser web apps only for yourself? 
Of course, I do | that all the time. The unreasonable thing is saying "I | use this thing exclusively for my own stuff on my local | network, so screw your requests for multiuser, public- | facing use cases." | masklinn wrote: | > Of course not. But the project makes it pretty clear | that it's designed with multiuser in mind. | | It really doesn't. AFAIK Simon Willison (the author and | creator) does his exploration locally, on things of | interest to himself. | | Publishing the analysis for others to see is no more | "multiuser" than publishing a PDF. | the_mitsuhiko wrote: | > I don't think so? I could be wrong but I thought a | major use case (or the major use case) is providing a | frontend for your data to other people. | | _I_ use datasette and I have only ever used it locally or in a | docker container. In either case for datasette-ripgrep to | work you need to check out the repos you want to search | in one folder that datasette will then invoke ripgrep on. | | There are folks that publish datasette installations for | public consumption but even in that case I would assume | that code search was preconfigured to make any sense with | ignore files. | TeMPOraL wrote: | Let's flip this around: why would that make sense? | Filtering expressions belong to a query, not | configuration. Having to use ignore files for this is | like not being allowed to use the WHERE clause in SQL, | and having to rely on .sqlwhere config file instead. | the_mitsuhiko wrote: | Because I ripgrep all the time and don't want to | configure it every time I query. | colejohnson66 wrote: | That's great that it works for you, but it doesn't work | for me. | | What about the one-off search of _someone else's_ code? I | don't want to check out the repo just to add/edit a file | so I can run ripgrep locally. If GitHub has a search | feature that is entirely online, it would make sense for | people to request they improve it. | the_mitsuhiko wrote: | The topic is datasette and ripgrep. 
I would be curious | why in the context of those tools ignore files are not a | solution. I agree that github codesearch leaves a lot to | be desired. | OJFord wrote: | Put it this way, OP is basically a web interface for | ripgrep; ripgrep supports not just 'automatic' filtering | through ignore files, but also 'manual' filtering per- | query through the -g flag. OP should have an interface to | the -g flag. | the_mitsuhiko wrote: | I agree that a -g flag would be useful. However given the | original comment and how datasette is used I figured the | ignore file would be a good solution. | the_mitsuhiko wrote: | There are more ignore files that ripgrep honors than | gitignore. | arcatek wrote: | GitHub searches are particularly bad. They always give you test | folders and examples before and after actual relevant source | code. You'd think they would know that folders named "test" | should be in a separate section... | Gehinnn wrote: | Worth mentioning ripgrep-all here | (https://github.com/phiresky/ripgrep-all). | | It's a regular expression search engine built on ripgrep that can | search in virtually anything (pdf files, zip archives, movie | subtitles, sqlite databases, png+ocr, ...). | gigatexal wrote: | How have I not known about this? Man this utility is amazing! | j1elo wrote: | I have used regular expressions to search for code, like I guess | most devs; but, apart from artificial limits such as maximum time | or max number of results, I found regex has never been a reliable | search method for me. | | There is always the case that, when trying to find all instances | of some code, it will miss _that one case_ which is not covered | by the initial intuition. | | For example, to find uses of the function "foo()", the first idea | is usually to search for ".foo(". But that won't find places | where the function is passed as value. 
Nor places where devs were | inconsistent and wrote / didn't write a space between the name | and the opening brace, like ".foo (" (this is a common style for | Gnome or Glib code). Not to mention if the language is C++ and | then you have to distinguish between ".foo", "->foo", and | "::foo". | | In conclusion, regex might be OK to have a superficial look at | some code, but never as a reliable source of information, unless | a lot of work and ironing out of corner cases has been invested | in the regex itself, effectively considering them part of the | code, with unit tests and all the fuss that it entails. | indentit wrote: | Agreed, this is why I like the idea of searching parsed files, | for example using scope selectors against code tokenized with a | sublime-syntax grammar. Unfortunately, it means it needs to be | parsed/indexed first, which is slower than a plain regex | search. | | I wonder if the AST built by Tree-Sitter could also help with | this type of search - does anyone know of any existing | solutions for this? | slimsag wrote: | > I like the idea of searching parsed files, for example | using scope selectors against code tokenized with a sublime- | syntax grammar. | | I've been working on a trigram-based search engine with | support for exactly this (via github.com/trishume/syntect) | over the past several months and plan on open-sourcing it | soon. Cool to see someone else with this idea! | | You might also like https://comby.dev - it is aware of code | structure without parsing files | indentit wrote: | > I've been working on a trigram-based search engine with | support for exactly this (via github.com/trishume/syntect) | over the past several months and plan on open-sourcing it | soon. | | Awesome, I look forward to that - you'll post a "show HN" | for it, I hope? :) Does syntect's lack of support for the | newest sublime-syntax features cause you any problems? 
| | > You might also like https://comby.dev - it is aware of | code structure without parsing files | | Thanks for sharing, will check it out! | slimsag wrote: | > Awesome, I look forward to that - you'll post a "show | HN" for it, I hope? :) Does syntect's lack of support for | the newest sublime-syntax features cause you any | problems? | | Show HN -> Yep :) | | Lack of support for newest sublime-syntax features: less | than you would expect, but a little for sure. | | Most of the syntax definitions in the wild today don't | really use the newer features. At Sourcegraph I wrote a | little Rust HTTP server wrapping syntect[1] and we have used it | for all our syntax highlighting for the past several | years. I would say it works on like 95% of code files | even if you include a lot of additional syntaxes that are | untested with Syntect[2]. That said, it does barf hard on | some specific files - either taking a _really_ long time | to do the work or getting completely stuck in a busy | waiting loop for some reason. Still, it's the | 2nd best syntax highlighter out there (second only to | Sublime itself.) | | One of my hopes for this side project is that I'll be | able to contribute more time upstream with e.g. a more | extensive test suite for Syntect against a much larger | number of syntax definitions from the wild instead of | just Sublime's built-in ones. | | [1] https://github.com/sourcegraph/syntect_server | | [2] https://github.com/slimsag/Packages/#license | masklinn wrote: | > does anyone know of any existing solutions for this? | | https://semgrep.dev, though it's mostly an analysis tool it | can be used as a search tool. IIRC it's not super fast, but | for the cases where there is no way to really contort a regex | into something suitable (the regex has way too many false | positives and / or negatives) it works rather well. 
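To make the sub-thread above concrete: j1elo's point is that a textual pattern like ".foo(" misses uses such as passing the function as a value, and the replies suggest searching parsed code instead. The sketch below shows the idea with Python's standard `ast` module on a small hypothetical snippet. It is not how semgrep, Tree-sitter or syntect work; it only illustrates what a syntax-aware search can see that a naive regex cannot.

```python
import ast
import re

# Hypothetical source to search; every line uses `foo` in a different way.
SOURCE = "obj.foo()\nobj.foo ()\ncallback = obj.foo\nrun(foo)\n"


def find_name_uses(source, name):
    """Return sorted (line, kind) pairs for each reference to `name`.

    kind is "call" when the reference is the thing being called,
    and "value" when it is used any other way (assigned, passed, ...).
    """
    tree = ast.parse(source)
    # Remember which nodes sit in the "function" position of a call.
    call_targets = {
        id(node.func) for node in ast.walk(tree) if isinstance(node, ast.Call)
    }
    uses = []
    for node in ast.walk(tree):
        is_attr = isinstance(node, ast.Attribute) and node.attr == name
        is_name = isinstance(node, ast.Name) and node.id == name
        if is_attr or is_name:
            kind = "call" if id(node) in call_targets else "value"
            uses.append((node.lineno, kind))
    return sorted(uses)


def naive_regex_hits(source, name):
    """The 'first idea' regex from the thread: a dot, the name, an open paren."""
    return re.findall(r"\." + re.escape(name) + r"\(", source)
```

On this snippet `naive_regex_hits(SOURCE, "foo")` finds only the first line, while `find_name_uses(SOURCE, "foo")` reports all four lines and distinguishes calls from value uses. The trade-off indentit mentions still applies: the source has to be parsed first, which costs more than a plain text scan.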
| porpoise wrote: | In some cases the problem can be somewhat mitigated if the | search is incremental (and fast), as it basically allows you to | refine your regex in real time and help train the correct | intuition. | | For example, you can type out "foo" at first which will, say, | display 7 results, then when you add "()" you see the list | shrunk by one, so you immediately know the regex missed one | it's supposed to capture. | karlicoss wrote: | People in comments suggesting alternatives like sourcegraph, | zoekt, hound, etc -- I've set up local code search for | repositories on my computer a while ago and tried out most of the | tools I could find at the time [0]. Nothing comes remotely close | to a simple ripgrep against literally all of the repositories, | both in terms of convenience and speed (it's instantaneous | against hundreds of repositories, at least on SSD!). The only | hassle is configuring .ignore files (most is covered by | .gitignores), but usually only have to do it a few times to | exclude the few spammy offenders. | | I'm using it with my emacs (+helm), I just have a global | keybinding which instantly opens a new code search. I could | imagine having a web frontend being very convenient for people | who aren't hooked on Emacs, even I often want to persist a code | search result for several days, so will give datasette-ripgrep a | try! | | [0] https://beepb00p.xyz/pkm-search.html#code ___________________________________________________________________ (page generated 2020-11-28 23:00 UTC)