[HN Gopher] Datasette-ripgrep: a regular expression search engin...
       ___________________________________________________________________
        
       Datasette-ripgrep: a regular expression search engine for your
       source code
        
       Author : tosh
       Score  : 106 points
       Date   : 2020-11-28 10:01 UTC (12 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | mkl95 wrote:
       | If you are an Emacs user and are interested in ripgrep, you may
       | want to check the rg package https://github.com/dajva/rg.el
        
         | dig1 wrote:
         | ripgrep works nicely with the default Emacs grep facility as
         | well. After "M-x grep", replace the "grep" command with "rg
         | --no-heading".
         | 
         | Also, there is a few-line trick [1] that will make it work
         | with "M-x grep-find", respecting the git project tree.
         | 
         | [1] https://stegosaurusdormant.com/emacs-ripgrep/
        
       | agustif wrote:
       | Haven't tried Sourcegraph, but it seems to fit similar use cases, no?
        
       | asicsp wrote:
       | Suggestion: For the ".plugin_config(" example [0], using fixed-
       | string search with -F option would be much better. Perhaps you
       | could add an option to choose among -F, default regex and PCRE
       | (for features like lookarounds).
       | 
       | [0]
       | https://ripgrep.datasette.io/-/ripgrep?pattern=%5C.plugin_co...
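A minimal Python sketch of why a fixed-string mode helps for text like ".plugin_config(". It uses Python's `re` module as a stand-in rather than ripgrep's own regex engine, so the exact error behaviour is an assumption about regex engines in general, and the sample lines are made up for illustration.

```python
import re

needle = ".plugin_config("

# Taken as a regex, the raw text is not valid here: "(" opens a group that never closes.
try:
    re.compile(needle)
    compiles_as_regex = True
except re.error:
    compiles_as_regex = False

# Escaping every metacharacter gives the regex equivalent of a fixed-string search.
escaped = re.escape(needle)

line = "datasette.plugin_config(name)"
literal_hit = needle in line                       # plain substring test
regex_hit = re.search(escaped, line) is not None   # same result via escaped regex

# An unescaped "." matches any character, so a loosely written pattern over-matches.
loose_hit = re.search(r".plugin_config\(", "my_plugin_config(name)") is not None
```

With a fixed-string option the user types the text as it appears in the code and does not have to think about escaping at all.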
        
         | simonw wrote:
         | That's a great idea, thanks. Filed an issue:
         | https://github.com/simonw/datasette-ripgrep/issues/8
        
           | simonw wrote:
           | That's now shipped in version 0.3 - demo here:
           | https://ripgrep.datasette.io/-/ripgrep?pattern=.plugin_confi...
        
       | hackerpain wrote:
       | Seems really useful. There's another tool, https://grep.app,
       | that searches for code patterns throughout GitHub. It's very
       | useful for taking inspiration from others' code or looking for
       | examples.
        
       | wffurr wrote:
       | Code search is an amazing tool, and this is a great step, but
       | it's missing clickable cross-references, which are the killer
       | feature in Kythe-based code search: https://kythe.io/
        
         | emmanueloga_ wrote:
         | I'm a little bit confused by the state of that website, almost
         | makes Kythe look abandoned, but if you look at the pulse tab on
         | github it tells a different story (looks very much alive).
         | 
         | Is Kythe used to index some prominent open source project/s?
        
           | wffurr wrote:
           | It's used internally at Google and also for Chrome:
           | https://source.chromium.org/chromium
           | 
           | It's also used to index Google's open source:
           | https://cs.opensource.google/search?q=Kythe&sq=
           | 
           | Thread from earlier this year:
           | https://news.ycombinator.com/item?id=22551856
           | 
           | Not sure what's giving you that "abandonware" vibe. Lack of
           | marketing materials? There's pretty thorough documentation
           | on the indexer.
        
       | thesuperbigfrog wrote:
       | Reminds me of OpenGrok: https://oracle.github.io/opengrok/
       | 
       | OpenGrok is powered by Universal Ctags
       | (https://github.com/universal-ctags/ctags) underneath and uses a
       | Java servlet web server (Apache Tomcat) with search powered by
       | Apache Lucene.
       | 
       | OpenGrok is older, boring technology but it works.
       | 
       | It is refreshing to see a new, Rust-powered approach to code
       | search.
        
       | jpxw wrote:
       | Looks like https://github.com/google/zoekt
        
       | eatonphil wrote:
       | Etsy has a web app for searching a git repo,
       | https://github.com/hound-search/hound. But you have to figure out
       | how to efficiently load and update all interesting repos on disk.
       | 
       | I really wish the git protocol had a way to perform remote git
       | grep so you don't need to install potentially massive repos just
       | to search for keywords.
        
         | karmakaze wrote:
         | Hound is awesome. I've set it up and used it at a few
         | companies. Later I SaaS'ed it[0]. Ping me if you want to try it
         | out.
         | 
         | Edit: Here's a live sandbox demo[1] of it hosted.
         | 
         | [0] https://gitgrep.com
         | 
         | [1] https://demo.gitgrep.com/login
        
       | simonw wrote:
       | Tool author here. There are a bunch of really good tools for code
       | search already which are a lot more sophisticated than this one -
       | livegrep for example.
       | 
       | I built this partly as a learning exercise, to see how far I
       | could push the idea of a Datasette plugin and to figure out how
       | to call external processes from Python asyncio.
       | 
       | I also built it because I had a hunch that the "datasette
       | publish" mechanism for deploying Datasette to serverless hosting
       | providers such as Google Cloud Run could make it easy to deploy
       | code search engines too. I think the most interesting thing about
       | this project is the GitHub Actions workflow I wrote that deploys
       | the demo instance running against code from 60+ repositories:
       | https://github.com/simonw/datasette-ripgrep/blob/main/.githu...
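For readers curious about the "call external processes from Python asyncio" part, here is a rough sketch of the general technique using the standard library. It is not the plugin's actual code: the function names, the timeout value, and the `search` wrapper are assumptions made for illustration, and the wrapper assumes an `rg` binary is on PATH.

```python
import asyncio
import sys


async def run_command(args, timeout=5.0):
    """Run an external command without blocking the event loop.

    Returns (exit code, stdout decoded as text).
    """
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave a runaway search behind if it exceeds the time limit.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, stdout.decode("utf-8", errors="replace")


def search(pattern, directory, timeout=5.0):
    # Hypothetical convenience wrapper; assumes `rg` is installed.
    return asyncio.run(run_command(["rg", "--line-number", pattern, directory], timeout))
```

Passing the arguments as a list (rather than a shell string) means the search pattern is never interpreted by a shell, which matters when the pattern comes from a web form.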
        
       | oefrha wrote:
       | Feedback: code search in a repository (or multiple repositories)
       | should allow blacklisting and/or whitelisting files and
       | directories. Every time I try to search for something on GitHub,
       | I have to sift through tons and tons of useless results from
       | tests.
       | 
       | Edit: The discussion has taken a weird turn into support for
       | ignore files. To be clear, this feedback is about dynamically
       | filtering searches when using a website to do code search, most
       | likely searching someone else's repo on someone else's website;
       | you don't have access to the filesystem, the search form is the
       | only input. So all this "rg supports ignore files" (I'm aware,
       | but thanks) is not really relevant.
        
         | IshKebab wrote:
         | I agree. https://grep.app/ does this really well.
        
         | simonw wrote:
         | I've added the ability to use globs (via ripgrep --glob) to
         | include and exclude directories and file path patterns:
         | 
         | https://ripgrep.datasette.io/-/ripgrep?pattern=with.*AsyncCl...
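A sketch of how a web frontend might translate include/exclude fields from a search form into ripgrep's `--glob` flag, where a leading `!` turns a glob into an exclusion. The function name and parameters are hypothetical and this is not taken from the plugin's source.

```python
def build_rg_args(pattern, include_globs=(), exclude_globs=(), fixed_string=False):
    """Build an argument list for ripgrep from per-query filter settings."""
    args = ["rg", "--line-number"]
    if fixed_string:
        args.append("--fixed-strings")
    for glob in include_globs:
        args += ["--glob", glob]
    for glob in exclude_globs:
        # ripgrep treats a glob that starts with "!" as an exclusion.
        args += ["--glob", "!" + glob]
    # "--" stops option parsing, so a pattern starting with "-" is still a pattern.
    args += ["--", pattern, "."]
    return args


example = build_rg_args(
    "register_blueprint",
    include_globs=["*.py"],
    exclude_globs=["tests/**", "test_*.py"],
)
```

The resulting list can be handed to a subprocess call as-is, so each query carries its own filters without touching any ignore file on disk.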
        
         | zeckalpha wrote:
         | Include the filename in the output and pipe to grep :)
        
         | derefr wrote:
         | > To be clear, this feedback is about dynamically filtering
         | searches when using a website to do code search, most likely
         | searching someone else's repo on someone else's website; you
         | don't have access to the filesystem, the search form is the
         | only input.
         | 
         | I think people are imagining that it should be the repo-
         | author's responsibility to create the equivalent of an
         | `.rgignore` file ahead-of-time (since _they_ know the
         | codebase's structure, _they_ know what's useful vs. useless to show up
         | in a search of the code -- pretty objectively, "code" should
         | show up, and "not-code" shouldn't); and that code-search on
         | websites like Github should respect the rules in such a file if
         | it were discovered in the repo.
         | 
         | If such a thing existed and was in wide use, you -- a random
         | passerby to the codebase, who doesn't yet understand it --
         | wouldn't need to struggle to do any dynamic filtering, because
         | static filtering would already have been done for you.
        
           | oefrha wrote:
           | > If such a thing existed and was in wide use, you -- a
           | random passerby to the codebase, who doesn't yet understand
           | it -- wouldn't need to struggle to do any dynamic filtering,
           | because static filtering would already have been done for
           | you.
           | 
           | That's the thing, it's impossible to create a generic ignore
           | list that caters to everyone's interest. Could the_mitsuhiko,
           | the creator of flask, provide such a list for flask? Nope.
           | Some people want to understand how register_blueprint is
           | implemented and don't want to see anything from test, docs
           | and examples. Some people want to learn how to use
           | register_blueprint, so they specifically want to limit
           | themselves to examples. Some other people want to check if
           | there are test cases for a certain API. You can't ignore any
           | file that may be interesting to someone, which is basically
           | all of them.
        
             | burntsushi wrote:
             | Right. Ignore files and search-time dynamic filters solve
             | different problems. They aren't mutually exclusive.
             | Otherwise, I wouldn't have added the -g/--glob flag to
             | ripgrep in the first place.
        
         | simonw wrote:
         | I really want this feature. Filed an issue here, thanks:
         | https://github.com/simonw/datasette-ripgrep/issues/9
        
           | oefrha wrote:
           | Thanks!
        
         | burntsushi wrote:
         | I agree your replies are weird. Ignore files would be nice if
         | you wanted repo specific filters or even user wide global
         | filters. But for on-demand filters at search time, the frontend
         | could add a feature for it and implement it with the
         | `-g/--glob` flag.
        
           | the_mitsuhiko wrote:
           | So I guess since I started this entire thread here is why it
           | even came up: datasette (a tool I use) is something you run
           | locally. In particular this plugin also is pointed at a local
           | checkout of the repositories you're working with.
           | 
           | So for instance in my case I have all the code I work with in
           | ~/Development. Since I also use ripgrep for local development
           | I typically include .ignore/.rgignore files to control what
           | I'm searching around in my repos.
           | 
           | Doing "excludes"/"includes" across a variety of repositories
           | is hard at query time because the layouts of those
           | repositories can be very different. For instance, in some JS
           | repos you do want to search in "dist" whereas in others you
           | don't, because "dist" gets generated out of "src" and just
           | ends up being a duplicate.
        
             | burntsushi wrote:
             | Ah I see. I didn't know what datasette was.
             | 
             | The way I see doing includes/excludes at query time is
             | this. You run a search without excludes for example. You
             | see a bunch of stuff coming back all in one directory that
             | you don't want for this particular search. So you add an
             | exclude rule for that directory. And then just refine
             | results that way.
             | 
             | If you have a bunch of similarly named directories, then I
             | think that just means that the exclude rules at query time
             | need to be more specific. e.g., use the full path to the
             | directory.
             | 
             | I think the feature being requested here is an interactive
             | feature of the user interface, whereas ignore files are
             | more like a static fact of a directory tree.
        
         | the_mitsuhiko wrote:
         | Ripgrep honors ignore files.
        
           | oefrha wrote:
           | This has nothing to do with ignore files. Tests pushed to a
           | GitHub repository obviously won't be matched by gitignore
           | (unless you ignore them in gitignore then force commit them,
           | which would be really rare), I'm just not interested in them
           | in most of my searches.
           | 
           | I'm talking about excluding files and directories during
           | searches, through a directive like, say, "-exclude tests
           | -exclude test_*.py".
           | 
           | Sure, I guess technically you can add a bunch of additional
           | patterns to gitignore before you use this web frontend, but
           | that would be a really weird workflow, and it's impossible
           | when the web frontend is run by someone else.
        
             | woadwarrior01 wrote:
             | You could add them to .rgignore or .ignore (which also
             | works with ag, IIRC) instead of adding them to .gitignore.
        
               | oefrha wrote:
               | Which is again impossible when you use someone else's web
               | frontend for code search. Plus if you're editing
               | .rgignore on the filesystem for every search, what's the
               | point of using a web frontend?
        
               | [deleted]
        
               | the_mitsuhiko wrote:
               | Not sure why ignore files are not the solution here. You
               | can check an ignore file into your repo, then whoever
               | uses ripgrep through whatever frontend gets that logic
               | applied. I'm using this and I don't quite see the
               | downsides of that.
        
               | oefrha wrote:
               | - I can't control what someone else checks into their
               | repo. I use web frontends for code search almost
               | exclusively on other people's repos, because my own repos
               | are already on my machine, so what's the point.
               | 
               | - Most of the time I don't want to see search results in
               | tests, but occasionally I actually want to find things in
               | tests and not anywhere else. Different people have
               | different filtering needs at different times, you can't
               | have an ignore file that satisfies every need. Search is
               | about filtering to begin with.
        
               | the_mitsuhiko wrote:
               | Datasette is a tool you run against your stuff usually. I
               | have control over my repos.
               | 
               | Once you do cross-repo search, providing global filters
               | on every query turns into an awful user experience.
               | 
               | WRT web frontends: datasette runs locally.
        
               | oefrha wrote:
               | Well, all this tool does is call rg from a directory
               | with a bunch of repos, so providing global filters on
               | every query is as awful as rg providing a --glob flag...
               | Which is arguably not that awful to a lot of users.
               | 
               | > Datasette is a tool you run against your stuff usually.
               | ... datasette runs locally.
               | 
               | I don't think so? I could be wrong but I thought a major
               | use case (or _the_ major use case) is providing a
               | frontend for your data to other people. Like on
               | https://ripgrep.datasette.io/-/ripgrep.
               | 
               | To quote https://docs.datasette.io/en/stable/,
               | 
               | > Datasette is aimed at data journalists, museum
               | curators, archivists, local governments and anyone else
               | who has data that they wish to share with the world.
        
               | masklinn wrote:
               | > To quote https://docs.datasette.io/en/stable/,
               | 
               | >> Datasette is aimed at data journalists, museum
               | curators, archivists, local governments and anyone else
               | who has data that they wish to share with the world.
               | 
               | That doesn't preclude installing the program locally.
               | 
               | The official "getting started" talks of remote / web
               | datasettes as "demos" and "trials".
        
               | oefrha wrote:
               | > That doesn't preclude installing the program locally.
               | 
               | Of course not. But the project makes it pretty clear that
               | it's designed with multiuser in mind. Can you host
               | multiuser web apps only for yourself? Of course, I do
               | that all the time. The unreasonable thing is saying "I
               | use this thing exclusively for my own stuff on my local
               | network, so screw your requests for multiuser, public-
               | facing use cases."
        
               | masklinn wrote:
               | > Of course not. But the project makes it pretty clear
               | that it's designed with multiuser in mind.
               | 
               | It really doesn't. AFAIK Simon Willison (the author and
               | creator) does his exploration locally, on things of
               | interest to himself.
               | 
               | Publishing the analysis for others to see is no more
               | "multiuser" than publishing a PDF.
        
               | the_mitsuhiko wrote:
               | > I don't think so? I could be wrong but I thought a
               | major use case (or the major use case) is providing a
               | frontend for your data to other people.
               | 
               | _I_ use datasette and I only ever used it locally or in a
               | docker container. In either case for datasette-ripgrep to
               | work you need to check out the repos you want to search
               | in one folder that datasette will then invoke ripgrep on.
               | 
               | There are folks that publish datasette installations for
               | public consumption, but even in that case I would assume
               | that code search was preconfigured to make any sense with
               | ignore files.
        
               | TeMPOraL wrote:
               | Let's flip this around: why would that make sense?
               | Filtering expressions belong to a query, not
               | configuration. Having to use ignore files for this is
               | like not being allowed to use the WHERE clause in SQL,
               | and having to rely on .sqlwhere config file instead.
        
               | the_mitsuhiko wrote:
               | Because I ripgrep all the time and don't want to
               | configure it every time I query.
        
               | colejohnson66 wrote:
               | That's great that it works for you, but it doesn't work
               | for me.
               | 
               | What about the one-off search of _someone else's_ code? I
               | don't want to check out the repo just to add/edit a file
               | so I can run ripgrep locally. If GitHub has a search
               | feature that is entirely online, it would make sense for
               | people to request they improve it.
        
               | the_mitsuhiko wrote:
               | The topic is datasette and ripgrep. I would be curious
               | why in the context of those tools ignore files are not a
               | solution. I agree that github codesearch leaves a lot to
               | be desired.
        
               | OJFord wrote:
               | Put it this way, OP is basically a web interface for
               | ripgrep; ripgrep supports not just 'automatic' filtering
               | through ignore files, but also 'manual' filtering per-
               | query through the -g flag. OP should have an interface to
               | the -g flag.
        
               | the_mitsuhiko wrote:
               | I agree that a -g flag would be useful. However given the
               | original comment and how datasette is used I figured the
               | ignore file would be a good solution.
        
             | the_mitsuhiko wrote:
             | There are more ignore files that ripgrep honors than
             | gitignore.
        
         | arcatek wrote:
         | GitHub searches are particularly bad. They always give you test
         | folders and examples before and after actual relevant source
         | code. You'd think they would know that folders named "test"
         | should be in a separate section...
        
       | Gehinnn wrote:
       | Worth mentioning ripgrep-all here
       | (https://github.com/phiresky/ripgrep-all).
       | 
       | It's a regular expression search engine built on ripgrep that can
       | search in virtually anything (pdf files, zip archives, movie
       | subtitles, sqlite databases, png+ocr, ...).
        
       | gigatexal wrote:
       | How have I not known about this? Man this utility is amazing!
        
       | j1elo wrote:
       | I have used regular expressions to search for code, like I guess
       | most devs; but, apart from artificial limits such as maximum time
       | or max number of results, I found regex has never been a reliable
       | search method for me.
       | 
       | There is always the case that, when trying to find all instances
       | of some code, it will miss _that one case_ which is not covered
       | by the initial intuition.
       | 
       | For example, to find uses of the function "foo()", the first idea
       | is usually to search for ".foo(". But that won't find places
       | where the function is passed as value. Nor places where devs were
       | inconsistent and wrote / didn't write a space between the name
       | and the opening brace, like ".foo (" (this is a common style for
       | Gnome or Glib code). Not to mention if the language is C++ and
       | then you have to distinguish between ".foo", "->foo", and
       | "::foo".
       | 
       | In conclusion, regex might be OK to have a superficial look at
       | some code, but never as a reliable source of information, unless
       | a lot of work and ironing out of corner cases has been invested
       | in the regex itself, effectively considering them part of the
       | code, with unit tests and all the fuss that it entails.
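The corner cases listed above can be reproduced in a few lines. This sketch uses Python's `re` module and a handful of made-up sample lines; the "broader" pattern is only one possible refinement and still matches inside comments and strings, so it is an illustration rather than a reliable answer.

```python
import re

samples = [
    "obj.foo(1)",          # plain method call
    "obj.foo (1)",         # space before the parenthesis
    "ptr->foo(1)",         # C++ pointer member access
    "Ns::foo(1)",          # C++ scope resolution
    "callback = obj.foo",  # passed as a value, never called here
    "obj.foobar(1)",       # different identifier sharing a prefix
]

# First intuition: a literal ".foo(".
naive = re.compile(r"\.foo\(")

# Wider net: any of the three accessors, then "foo" as a whole word.
broader = re.compile(r"(?:\.|->|::)foo\b")

naive_hits = [s for s in samples if naive.search(s)]
broader_hits = [s for s in samples if broader.search(s)]
```

Here the naive pattern finds only the first sample, while the broader one finds the first five and correctly skips `foobar`.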
        
         | indentit wrote:
         | Agreed, this is why I like the idea of searching parsed files,
         | for example using scope selectors against code tokenized with a
         | sublime-syntax grammar. Unfortunately, it means it needs to be
         | parsed/indexed first, which is slower than a plain regex
         | search.
         | 
         | I wonder if the AST built by Tree-Sitter could also help with
         | this type of search - does anyone know of any existing
         | solutions for this?
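As a small illustration of searching parsed code rather than text, the sketch below uses Python's standard `ast` module on Python source. It is not Tree-sitter and not a sublime-syntax scope selector; the helper name and the sample source are invented for the example.

```python
import ast

source = '''
def foo(x):
    return x

foo(1)
obj.foo (2)
callback = obj.foo
handlers = [foo]
text = "foo("
'''


def find_references(code, name):
    """Return sorted (line, kind) pairs for each syntactic reference to `name`."""
    hits = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name) and node.id == name:
            hits.append((node.lineno, "name"))
        elif isinstance(node, ast.Attribute) and node.attr == name:
            hits.append((node.lineno, "attribute"))
    return sorted(hits)


references = find_references(source, "foo")
```

The walk picks up the call, the call with a space before the parenthesis, and both places where the function is passed as a value, while ignoring the string literal that merely contains "foo(". The cost, as noted above, is that every file has to be parsed first.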
        
           | slimsag wrote:
           | > I like the idea of searching parsed files, for example
           | using scope selectors against code tokenized with a sublime-
           | syntax grammar.
           | 
           | I've been working on a trigram-based search engine with
           | support for exactly this (via github.com/trishume/syntect)
           | over the past several months and plan on open-sourcing it
           | soon. Cool to see someone else with this idea!
           | 
           | You might also like https://comby.dev - it is aware of code
           | structure without parsing files
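For anyone unfamiliar with the term, a trigram index maps every three-character substring to the documents containing it, so a query only has to scan documents that contain all of its trigrams. The toy below shows the general idea for literal queries only; it says nothing about how the engine mentioned above works, and the file names and contents are invented.

```python
from collections import defaultdict


def trigrams(text):
    """All distinct three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Use the index to narrow candidates, then confirm with a real substring test."""
    candidates = set(docs)
    for gram in trigrams(query):
        candidates &= index.get(gram, set())
    # Queries shorter than three characters have no trigrams, so every
    # document stays a candidate and the scan below does all the work.
    return sorted(d for d in candidates if query in docs[d])


docs = {
    "a.py": "def register_blueprint(self):",
    "b.py": "print('hello')",
    "c.py": "blueprint = 1",
}
index = build_index(docs)
```

The final substring check is needed because sharing all trigrams does not guarantee that they appear contiguously and in order.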
        
             | indentit wrote:
             | > I've been working on a trigram-based search engine with
             | support for exactly this (via github.com/trishume/syntect)
             | over the past several months and plan on open-sourcing it
             | soon.
             | 
             | Awesome, I look forward to that - you'll post a "show HN"
             | for it, I hope? :) Does syntect's lack of support for the
             | newest sublime-syntax features cause you any problems?
             | 
             | > You might also like https://comby.dev - it is aware of
             | code structure without parsing files
             | 
             | Thanks for sharing, will check it out!
        
               | slimsag wrote:
               | > Awesome, I look forward to that - you'll post a "show
               | HN" for it, I hope? :) Does syntect's lack of support for
               | the newest sublime-syntax features cause you any
               | problems?
               | 
               | Show HN -> Yep :)
               | 
               | Lack of support for newest sublime-syntax features: less
               | than you would expect, but a little for sure.
               | 
               | Most of the syntax definitions in the wild today don't
               | really use the newer features. At Sourcegraph I wrote a
               | little Rust HTTP server wrapping syntect[1], and we have
               | used it for all our syntax highlighting for the past
               | several years. I would say it works on like 95% of code
               | files, even if you include a lot of additional syntaxes
               | that are untested with Syntect[2]. That said, it does barf
               | hard on some specific files - either taking a _really_
               | long time to do the work or getting completely stuck in a
               | busy waiting loop for some reason. Even so, it's still the
               | 2nd best syntax highlighter out there (second only to
               | Sublime itself.)
               | 
               | One of my hopes for this side project is that I'll be
               | able to contribute more time upstream with e.g. a more
               | extensive test suite for Syntect against a much larger
               | number of syntax definitions from the wild instead of
               | just Sublime's built-in ones.
               | 
               | [1] https://github.com/sourcegraph/syntect_server
               | 
               | [2] https://github.com/slimsag/Packages/#license
        
           | masklinn wrote:
           | > does anyone know of any existing solutions for this?
           | 
           | https://semgrep.dev, though it's mostly an analysis tool it
           | can be used as a search tool. IIRC it's not super fast, but
           | for the cases where there is no way to really contort a regex
           | into something suitable (the regex has way too many false
           | positives and / or negatives) it works rather well.
        
         | porpoise wrote:
         | In some cases the problem can be somewhat mitigated if the
         | search is incremental (and fast), as it basically allows you to
         | refine your regex in real time and help train the correct
         | intuition.
         | 
         | For example, you can type out "foo" at first which will, say,
         | display 7 results, then when you add "()" you see the list
         | shrunk by one, so you immediately know the regex missed one
         | it's supposed to capture.
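The feedback loop described above can be mimicked by counting matches after each refinement. This is a toy with made-up lines and Python's `re` module; an interactive search UI would show the same counts as you type.

```python
import re

lines = [
    "obj.foo(1)",
    "obj.foo (2)",
    "x = foo()",
    "callback = obj.foo",
]


def count_matches(pattern, lines):
    """Number of lines containing at least one match for pattern."""
    return sum(1 for line in lines if re.search(pattern, line))


# Each step is what the user has typed so far.
steps = ["foo", r"foo\(", r"foo\s*\("]
counts = [count_matches(p, lines) for p in steps]
```

The drop from the first count to the second is the cue that the stricter pattern lost something, and the third step recovers the call written with a space before the parenthesis.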
        
       | karlicoss wrote:
       | People in comments suggesting alternatives like sourcegraph,
       | zoekt, hound, etc -- I've set up local code search for
       | repositories on my computer a while ago and tried out most of the
       | tools I could find at the time [0]. Nothing comes remotely close
       | to a simple ripgrep against literally all of the repositories,
       | both in terms of convenience and speed (it's instantaneous
       | against hundreds of repositories, at least on SSD!). The only
       | hassle is configuring .ignore files (most is covered by
       | .gitignores), but I usually only have to do it a few times to
       | exclude the few spammy offenders.
       | 
       | I'm using it with my emacs (+helm), I just have a global
       | keybinding which instantly opens a new code search. I could
       | imagine having a web frontend being very convenient for people
       | who aren't hooked on Emacs, even I often want to persist a code
       | search result for several days, so will give datasette-ripgrep a
       | try!
       | 
       | [0] https://beepb00p.xyz/pkm-search.html#code
        
       ___________________________________________________________________
       (page generated 2020-11-28 23:00 UTC)