[HN Gopher] Rga: Ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz
___________________________________________________________________
Author : angrygoat
Score : 451 points
Date : 2020-12-02 15:38 UTC (7 hours ago)
web link: phiresky.github.io
| awinter-py wrote: | thanks but it's way faster to have my stuff in G drive | | that way I can open a browser tab, wait 5 seconds for it to load, | locate the new screen location of the search bar, click it, wait | for javascript to finish loading so I can click the search bar, | click it for real this time, mistype because there's some kind of | contenteditable event jank, wait 5 seconds for my results to come | up, fix the typo, and just have my results waiting for me | | I'm not going to learn a new tool when web is fine | ffpip wrote: | If you're using DuckDuckGo, just search ''!drive search-term'' | or ''search term !drive'' or ''search !drive term'' | | More !operators here - https://duckduckgo.com/bang | [deleted] | spappal wrote: | Firefox supports custom search engines; the most bang-for-the-buck | custom search engine must be https://duckduckgo.com/?q=%s with the | keyword being the letter d. Then you get all these 13000+ bangs | without having to configure the custom search engines. E.g. write | "d !drive term" in the url bar. And "d !w hacker news" sends you | directly to https://en.wikipedia.org/wiki/Hacker_News | codethief wrote: | Or you just set DDG as your default search engine and then | you don't even have to type the "d" anymore. :) | Tokkemon wrote: | Can't tell if this is sarcasm.
| awinter-py wrote: | not being sarcastic | | if god wanted me to access my files in less than 15 seconds, | they wouldn't have commanded google to package the search bar | as a separate JS bundle that only gets downloaded when you | focus the search bar | | I'm no frontend dev but I know a thing or two about HTML + | there's no built-in way to input text into a box -- this is | the best we can do and we'll just have to wait for 5G + | moore's law to solve this | durnygbur wrote: | Laugh all you want, but try looking for a Fullstack/Frontend | role in today's job market. What do they want? AnGuLaRr | with oBsErVaBlEs! Why do they want it? Because Google can't | be wrong. | awinter-py wrote: | wait, actually? my sense is that react is leading | chrisweekly wrote: | > "I'm no frontend dev but I know a thing or two about HTML | + there's no built-in way to input text into a box" | | hahaha, nice one | murermader wrote: | I don't know if you are joking, but this is clearly sarcasm. | read_if_gay_ wrote: | You're saying you can't tell whether it's sarcasm that he can't | tell it's sarcasm? | enriquto wrote: | This means it's good sarcasm! | | The best sarcasm lies on a ridge: you cannot tell whether it's | sarcasm or not. | hs86 wrote: | If you use Chrome, this might help: | https://www.androidpolice.com/2019/12/04/chrome-omnibox-will... | | For GSuite/Workspace this needs to be enabled by an admin: | https://support.google.com/a/answer/9121487?hl=en | stjohnswarts wrote: | A lot of us don't want our stuff on G-drive due to privacy and | security concerns. Tools like this are valuable to us. It's an | old problem and there are plenty of indexers out there; this more | real-time scanner is more than welcome to join the bunch, of | course. | gopty wrote: | Sounds like a poor man's version of recoll | | https://www.lesbonscomptes.com/recoll/ | | A PDF in a Zip file, in an email attachment.
recoll can index it | and do OCR if you like | SamuelAdams wrote: | Any advantages to this over something like Agent Ransack? | | https://www.mythicsoft.com/agentransack/ | tfigment wrote: | Command line, linux, and open source immediately come to mind | fnord123 wrote: | Works on non-Windows. ripgrep is notoriously fast. Command line | interface. Not comically priced at 59.95 USD. | cpach wrote: | Why is there an expectation that every application should be | free or cheap? IMHO $60 is very reasonable for a program that | can save a lot of time for the user. And developers also have | to eat, and might want to some day retire. | fnord123 wrote: | I'm not the boss of you. If you want to spend 60 USD on a | program that is mostly built into Finder and Nautilus, fill | your boots. | michaelcampbell wrote: | $60 may not be much for you; it may be worth it to you. | For others it may not be. | maxioatic wrote: | This is great. I have 100+ ebooks/pdfs of programming and | textbooks whose index pages I've been extracting. My | intention was always to make some sort of search index out of | them. I will definitely be trialing this (the initial few searches | seem promising!) | skanga wrote: | Great idea. Please update on whether this use case works or | not! And other tips, examples, etc. | diimdeep wrote: | Is anyone preferring a search tool other than Spotlight | on macOS? | cpach wrote: | I like Spotlight, and its CLI companion mdfind. | qppo wrote: | I use ripgrep on macos | michaelcampbell wrote: | ripgrep in emacs for me. | ghoomketu wrote: | On a related note, there is one program that I absolutely miss on | Linux, called Everything (on Windows). | | The closest I can find is mlocate, but it does not have a GUI and, | more importantly, it does not index my Windows NTFS drives. | | Would appreciate any suggestions if someone knows something like | 'everything' for Ubuntu.
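What tools like Everything and mlocate do under the hood boils down to walking the tree once and then filtering an in-memory index, which is why results feel instant after the first scan. A minimal Python sketch (the `build_index`/`search` names are illustrative, not any real tool's API):

```python
import os

def build_index(root):
    """Walk the tree once and record every file path, like updatedb does."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            index.append(os.path.join(dirpath, name))
    return index

def search(index, *patterns):
    """Case-insensitive AND-match of all patterns, in the spirit of `locate -Ai`."""
    lowered = [p.lower() for p in patterns]
    return [path for path in index if all(p in path.lower() for p in lowered)]
```

Subsequent queries only scan the in-memory list; a real indexer would also persist the index and watch for filesystem changes.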
| shscs911 wrote: | fd (https://github.com/sharkdp/fd) is the best command line | search utility IMO. It's crazy fast and has always found what I was | looking for. If you want a GUI alternative, check out Drill | (https://github.com/yatima1460/Drill). Although the development | seems stalled, it works well for normal use cases. | opan wrote: | Haven't used this, but heard of it years ago; it aims to be | similar. https://github.com/dotheevo/angrysearch/ | ghoomketu wrote: | Thanks! That's super nice and very close to what I was | looking for all this time. | | I just learned how to mount all my Windows drives under /mnt | (using the `disks` software), so hopefully this should | index those files too. | pabs3 wrote: | BTW, mlocate has been obsoleted by plocate, which is much faster and | is actually maintained. | | https://plocate.sesse.net/ | captn3m0 wrote: | Seriously - I miss it as well. But my access patterns have | changed as well. I spend more time on the terminal, and with | autojump, the alternatives (with similar features) on Linux | aren't really that useful to my usage. | mrlala wrote: | 'Everything' is a LIFE SAVER. | | Hmm.. I seem to remember creating an excel file for this client | a while back.. open Everything -> filter _client_.xlsx .. boom. | Or maybe I didn't name it properly, at all? Well, still, just a | simple '*.xlsx' and sort by date; I can generally find anything | this way. As long as you let Everything open on Windows | startup, it will be instant when you use it. | ronjouch wrote: | Everything is great on Windows to pick files/folders. | | From the linux command line, I like fzf ( | https://github.com/junegunn/fzf ), which you can instruct to use | the faster fd ( https://github.com/junegunn/fzf#environment-variables ). | Fzf even offers keybindings for your shell. For | example, it binds Alt+C to fuzzy-finding a directory, and Enter | cds to it ( https://github.com/junegunn/fzf#key-bindings-for- | command-lin... ).
| | Fzf is great for other things too; here is a fish function to | bind Alt+G to fuzzy-pick a Git branch and jump to it:
|
|     function fish_user_key_bindings
|         bind \eg 'test -d .git; or git rev-parse --git-dir > /dev/null 2>&1; and git checkout (string trim -- (git branch | fzf)); and commandline -f repaint'
|         bind \eG 'test -d .git; or git rev-parse --git-dir > /dev/null 2>&1; and git checkout (string trim -- (git branch --all | fzf)); and commandline -f repaint'
|     end
|
| demosito666 wrote: | There are about a dozen dedicated file indexers [1] on linux | (some are gui, some are not) and also DE-integrated ones like | Baloo for KDE and Tracker for Gnome. | | [1] | https://lmgtfy.app/?q=Is+there+a+file+search+engine+like+%E2... | pkaye wrote: | For mlocate you can edit /etc/updatedb.conf to specify what to | index. One trick I use is "locate -Ai", which lets you search for | multiple patterns and makes the match case insensitive. So you can use | "locate -Ai linux .pdf" to search for all pdf files related to | Linux. | | Also for gnome there is tracker, which does search and indexing | built into the system. I think by default it's set for minimal | use, but it can be configured in the settings/search panel to | index many locations. I haven't played with it much recently | though. | RMPR wrote: | To traverse my files I use the combo ranger + autojump. It is | not a GUI and you need to traverse a directory at least once | before accessing it automatically, but I just wanted to mention | it. Another CLI tool that seems to do what you want is | fzf. | dang wrote: | If curious see also | | 2019 https://news.ycombinator.com/item?id=20196982 | durnygbur wrote: | No ripgrep-all through the package manager:
|
|     $ sudo dnf install -y ripgrep-all
|     [...]
|     No match for argument: ripgrep-all
|     Error: Unable to find a match: ripgrep-all
|
| Rust's package manager fails:
|
|     $ cargo install ripgrep_all
|     [...]
|     failed to select a version for the requirement `cachedir = "^0.1.1"`
|     candidate versions found which didn't match: 0.2.0
|     location searched: crates.io index
|     required by package `ripgrep_all v0.9.6`
|
| A quick search on the web shows that more people have problems with | the cachedir version. | ChrisSD wrote: | It looks like cachedir yanked version 0.1.1. This is usually | only done when a very serious issue is discovered, though I | don't know what the reason is in this case. | | https://crates.io/crates/cachedir | est31 wrote: | You can do _cargo install --locked ripgrep_all_ as a | workaround. It uses the lockfile that's part of the | ripgrep_all package, so you miss out on some package updates, | but you also get the cachedir version required. | | There is a github issue to make this the default behaviour of | cargo, but you would miss out on updates which might fix security | bugs, so the cargo team is unwilling to change the default. | | https://github.com/rust-lang/cargo/issues/7169 | edm0nd wrote: | Big fan of ripgrep. Use it on Windows to search through 100s of | GBs of data really quickly. | akavel wrote: | The "Integration with fzf" example looks really cool: | | https://github.com/phiresky/ripgrep-all#integration-with-fzf | kovek wrote: | How could I use Rga to search my browsing history? | simonw wrote: | Your browser history (if you use Chrome or Firefox at least) is | stored in a SQLite database. | | It looks like rga can handle SQLite out of the box, so just | making sure your history .db file is visible to rga may be all | you need. | | You can also use my Datasette tool to get a web UI against your | history, see | https://docs.datasette.io/en/stable/getting_started.html#usi... | hiq wrote: | It would be nice to have a direct comparison with ugrep. In the | case of rg, the benchmarks alone were enough to switch. Why | should I use rga instead of ugrep?
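simonw's point above — that browser history is just a SQLite file — is easy to act on directly. A sketch against a Firefox-style `places.sqlite` (the `moz_places` table and its `url`/`title` columns follow Firefox's schema; copy the file first, since the browser keeps it locked while running):

```python
import sqlite3

def search_history(db_path, term):
    """Grep-like substring search over a Firefox places.sqlite history file."""
    con = sqlite3.connect(db_path)
    try:
        # Parameterized LIKE avoids injection and handles quoting for us.
        return con.execute(
            "SELECT url, title FROM moz_places "
            "WHERE url LIKE ? OR title LIKE ?",
            (f"%{term}%", f"%{term}%"),
        ).fetchall()
    finally:
        con.close()
```

Pointing rga at the same file should give comparable results, since it reads SQLite out of the box.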
| burntsushi wrote: | I've called the ugrep benchmarks into question, and I | elaborated on it here (and this includes a frustrating exchange | between myself and the ugrep author): | https://old.reddit.com/r/rust/comments/i6pfb2/ugrep_new_ultr... | | I've also re-run my original set of benchmarks[1] with ugrep | included: | https://github.com/BurntSushi/ripgrep/blob/master/benchsuite... | | [1] - https://blog.burntsushi.net/ripgrep/ | hiq wrote: | Just to be clear, I meant that I had switched to ripgrep | because its speed was convincing enough on its own (so I did | not even need extra features to switch). | | I'm currently not using either ugrep or rga, although I have | used pdfgrep in the past. It'd be nice for casual users like | me to know more about why I should use rga over ugrep (or | vice versa). | cb321 wrote: | NOTE: ripgrep already has --pre. (No pre-built indexing, of | course.) | burntsushi wrote: | That's exactly what ripgrep-all uses to implement this. There's | a lot of integration work required to make this nice. The --pre | flag is just a small hook. More info on it here: | https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md#p... | cb321 wrote: | Yup. | | Something perhaps more helpful but so far unmentioned (and | somewhat OS-specific) is that statically linked executables | usually fork & exec (especially exec) much faster than | dynamically linked ones. This difference is usually only like | 50..150 us vs 500..3000 us but can multiply up over thousands | of files. | | This only matters on the first run of `rga`, of course. While | the dispatched-to decoder is likely mostly out of one's linking | control, this overhead can be saved for the dispatch_er_, at | least. So, I would suggest `rga-preproc` should have a static | linking option/suggestion, at least on Linux. | | Of course, this overhead may also fall below the noise of | PDF/ebook/etc. parsing, but maybe not the decompression of | small files in some dark horse format.
:-) | vmchale wrote: | Wonderful! pdfgrep is good but slow. | phiresky wrote: | pdfgrep has had a --cache option for a while now :) Not sure why | they don't enable it by default. Still, this is much faster. | globular-toast wrote: | I have mixed feelings about these kinds of tools. | | I can understand it might be nice to have a personal library of | PDF books and to search in them. I can't think of a time I've | ever wished I could search my bookshelf in that way, but you | never know. | | Obviously I use tools like ripgrep for searching codebases and | the like. | | But the extreme flexibility of this one in particular (and others | like macOS Spotlight) makes it seem more like a data recovery | tool to me. If my directory structures and databases ever | completely failed for some reason, I might need to search through | everything to find the data again. It's good to know such tools | exist, I suppose. | | But my fear is that tools like this teach people not to worry | about organisation of data and to just fill up their disks with | no structure at all. I think that unless something goes terribly | wrong nobody should ever need a tool like this. Once you rely on | it, you're out of luck if it ever fails you. What if you just | can't remember a single searchable phrase from some document, but | you just _know_ it must exist somewhere? | | It's similar to what Google has done to the web. When I was | growing up it used to be a skill to use the web. People used | tools like bookmarks and followed links from one place to | another. Now it's just type it into Google, and if Google doesn't | know, it doesn't exist. | nojito wrote: | Hierarchical organization of data is not a productive way to | organize, simply because of how much information people | accumulate; oftentimes the structures break down. | | It's more intuitive to simply search for the thing you are | looking for and click it.
| | I haven't used a folder organization structure in many, many | years. Other than the defaults for my cloud folders and a | separation between Personal + Work. | durnygbur wrote: | There is nothing wrong with Google's original premise. Your | local search results are less likely to be hijacked by | entities bidding for your attention. I agree with the argument | for organizing the data anyway. | armoredkitten wrote: | I mean, I understand what you mean when it comes to Google -- | the web essentially becomes locked into a particular | proprietary solution to finding information. I definitely still | have hundreds (maybe into the thousands?) of bookmarks of sites | that store information I care about. | | But I don't think this tool deserves the same sort of mixed | feelings. I don't think this replaces structure -- there's | still value in having a conceptual mapping of where documents | are stored, and in grouping sets of documents together. It's | just that having a structure doesn't help if you don't know | where in the structure something is stored. This sort of tool | is a bottom-up approach for the times when the top-down | approach doesn't work very well. | | Do you have similarly mixed feelings if sometimes, even with my | carefully-crafted set of bookmarks with all their nested | folders, I use the search tool to find the bookmark I'm looking | for? It's the same idea. Sometimes a top-down structure is | beneficial. But sometimes things get misclassified, or you | forget about some piece of the structure, or you aren't | familiar with some new structure, and in those cases, having | bottom-up tools is immensely useful. There's no risk of vendor | lock-in here. It's just a difference of approach in information | retrieval. | chris_st wrote: | Curious why this isn't a pull request to ripgrep? Maybe it was, | and was rejected? It'd be nice to just have one tool, and this | doesn't feel like it's a stretch to add to ripgrep.
| burntsushi wrote: | It's a stretch. A big one. | | I answered this a while back: | https://old.reddit.com/r/rust/comments/c1bjw4/rga_ripgrep_bu... | antegamisou wrote: | I always found useful something along the lines of
|
|     pdftotext -layout file.pdf - | grep -E ...
|
| for PDFs. Good to see a Swiss Army knife utility for all sorts of | files though! | phiresky wrote: | rga uses pdftotext (from poppler) internally for pdfs, except it | wraps it in parallelization and a very fast cache layer, since | you usually want to do multiple queries per file :) | hobofan wrote: | Big fan of rga! I use it almost every day for the academic part | of my life, when I want to know the location of some specific | keywords in my lecture slides, books or papers I've been reading. | Even for single ebooks, it is often more useful than the search | in Acrobat Reader. | durnygbur wrote: | > search in Acrobat Reader | | The search in PDF viewers is an anti-feature in terms of UI and | performance. Their advantage is that they let you scroll to | and highlight the found phrase back in the document. | solstice wrote: | The search in Tracker Software's PDF X-Change Viewer/Editor | is really great. Effective and easy to use. | mssdvd wrote: | The search in most applications is an anti-feature. | faitswulff wrote: | I noticed that you can use Tesseract as an OCR adapter for rga. | Tesseract is written in python, IIRC, and in the OP it comes with | a warning that it's slow and not enabled by default. Are there | any other fast, reliable OCR libs out there? Or any rust OCR | backends? | mouldysammich wrote: | https://github.com/tesseract-ocr/tesseract seems to be written | in c++ not python | faitswulff wrote: | Ah, my mistake then. | hobofan wrote: | I don't think the problem necessarily is that Tesseract is | slow, but that the whole process of rendering a PDF to a series | of PNGs on which you can then run OCR is slow (which is what it | does in the background).
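The "wraps it in parallelization" half of what phiresky describes can be sketched with a worker pool: run one extractor per file concurrently and keep each result paired with its path. Here a trivial file-reading `extract` stands in for an external converter like pdftotext; the function names are illustrative, not rga's internals:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(path):
    """Stand-in for an external converter such as pdftotext."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return path, f.read()

def extract_all(paths, workers=8):
    """Run the extractor over many files concurrently, as (path -> text)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract, paths))
```

Threads (rather than processes) are enough in the real case because the heavy work happens inside a subprocess like pdftotext, so the GIL is not the bottleneck.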
| undebuggable wrote: | The process of converting all pages to raster images and then | OCR-ing each one takes hours for PDFs hundreds of pages long. | This workflow is not suitable for instant search. For non-OCR-ed | PDFs it's worth pregenerating the text. | hobofan wrote: | That's why rga comes with a cache. I've occasionally used | the Tesseract adapter with good success (results-wise), and | after the initial rendering and indexing, it's fast enough | to use. | alexruf wrote: | The idea behind Rga is cool. Anyway, I tried it on Mac, installed | via Homebrew. The formula already says it depends on ripgrep | (that's fine, since I have ripgrep already installed and use it | regularly). I was still surprised when I executed Rga for the | first time and got an error message that 'pdftotext' was not | found. Since pdftotext has been officially discontinued, I am not | sure I want to install an old version just to make Rga work on | my machine. I don't think it's a good idea to rely on a project | which is not actively maintained. | phiresky wrote: | Yeah, in my opinion poppler should be a dependency of rga in | homebrew (since it's kinda useless without the default | adapters), but I don't maintain that package. | there_the_and wrote: | I don't see any indication that pdftotext has been discontinued | [1]. It looks like a Mac-specific installer available via | Homebrew Cask has been discontinued [2], but pdftotext is still | available through the normal poppler formula [3]. | | 1. https://poppler.freedesktop.org/releases.html | | 2. https://formulae.brew.sh/cask/pdftotext | | 3. https://formulae.brew.sh/formula/poppler | burntsushi wrote: | > Since pdftotext has been officially discontinued | | Do you have a link for that? That's news to me. | alexruf wrote: | brew info pdftotext | | https://formulae.brew.sh/cask/pdftotext#default | root_axis wrote: | This is really great.
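The cache hobofan describes amounts to keying extracted text by file identity and modification time, so only the first search pays the conversion cost. A minimal sketch — rga's real cache is an on-disk key-value store, here replaced by an in-memory dict, with a stand-in for an expensive converter like OCR:

```python
import os

_cache = {}

def extract_slow(path):
    """Stand-in for an expensive conversion such as OCR or pdftotext."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read().upper()

def extract_cached(path):
    """Return cached text unless the file changed since it was cached."""
    st = os.stat(path)
    key = (path, st.st_mtime_ns, st.st_size)  # invalidate on modification
    if key not in _cache:
        _cache[key] = extract_slow(path)
    return _cache[key]
```

This is why the first `rga` run over a directory is slow and later runs are nearly as fast as plain `rg`.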
| fock wrote: | Can it produce links to open the file yet? (I don't know Rust, so | I can't easily add a PR.) At least gnome-terminal supports that | (and normally it should also support opening a specific pdf | page)! | mcintyre1994 wrote: | Not sure if the implementation is in rg or zsh, but that | combination produces cmd-clickable file names for me. | soferio wrote: | Can it (or any tool) perform proximity searches on scanned PDFs? | E.g. word1 within 20 words of word2, on scanned PDFs? (I think | this is non-trivial but very useful.) | phiresky wrote: | Scanned PDFs only work well if they already have an OCR layer. | There's some optional integration of rga with tesseract, but | it's pretty slow and less good than external OCR tools. | | ripgrep-all can do the same regexes as rg on any filetypes it | supports. So you could do something like --multiline with
|
|     foo(\w+[\s\n]+){,20}bar
|
| It won't work exactly like this, but something similar should | do it:
|
| * --multiline enables multiline matching
| * foo searches for foo
| * \w+ matches at least one word character
| * [\s\n]+ matches at least one whitespace character, such as spaces or newlines
| * {,20} allows at most 20 repetitions of the word-plus-space combination
| * bar searches for bar
|
| ballmerspeak wrote: | If it's a scanned PDF (essentially a collection of 1 image per | page), there would need to be an OCR step to get some text out | first. Tesseract would work for this. | | Once that's done, you have all the options available to perform | that search. But I don't know of a search tool that does the | OCR for you. I did read a blog post by someone uploading PDFs | to Google Drive (they OCR them on upload) as an easy way to do | this. | ssivark wrote: | I love that we're seeing fast & flexible solutions for personal | search. | | I've recently been playing with Recoll for full-text search on | content. Since it indexes content up front, the search is pretty | fast.
It can also easily accommodate tag metadata on files. | | It would be interesting to consider how ripgrep-based tools can | fit into generically broad "search your database of content" | workflows (as opposed to remembering or walking your file system | paths). | simias wrote: | FZF + ripgrep is really killer for me. I don't even bother | organizing my notes anymore; I just throw everything into markdown | files in a flat directory, and then I have a script that uses | FZF + ripgrep to search through it when I need it. I search by | "last modified first", so unless I'm digging for something very | old the results are instant. Code snippets, finances, TODO | lists, cake recipes... It's all in there. | | I use the same system in Vim to browse source code. It's very | powerful, very fast, works with any language and requires zero | configuration. | rshm wrote: | Can you share your script? | maxioatic wrote: | I'd love some more info as well! | simias wrote: | This is the main one (which actually only uses FZF, not | ripgrep): https://gist.github.com/simias/b1d8356469d2a9386deeb7c45984b... | | You'll need to set NOTES_DIR in your environment to | wherever you want your notes to be stored. Then you can | write `note something` to create or open | $NOTES_DIR/something.md with your $EDITOR. | | If you type "note" without a parameter you'll start a search | on all the note names, ordered by last use. If you type | "note -f" it starts a full-text search. | | For best results you should have fzf.vim's preview.sh | somewhere in your fs, otherwise it'll use "cat", which | won't be as good looking (see FZF_PREVIEW in the script). | | Hopefully, despite being shell, it should be readable enough | to tweak to your liking. | | Note that it was written and used exclusively on Linux, but | I did try to avoid GNU-isms, so hopefully it should work on | BSDs and maybe even on macOS with a bit of luck.
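The flat-directory notes setup simias describes — full-text search over markdown, most recently modified first — is simple enough to sketch without fzf. A minimal Python version (the directory layout and function name are hypothetical, not simias's actual script):

```python
import os
import re

def search_notes(notes_dir, pattern):
    """Search all .md files, newest first, returning (path, matching line) pairs."""
    paths = [
        os.path.join(notes_dir, name)
        for name in os.listdir(notes_dir)
        if name.endswith(".md")
    ]
    paths.sort(key=os.path.getmtime, reverse=True)  # last modified first
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if rx.search(line):
                    hits.append((path, line.rstrip()))
    return hits
```

Sorting by mtime before searching is what makes recent notes surface instantly, regardless of how many old files accumulate.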
| tasuki wrote: | About a year ago, I discovered it was very helpful for me to | have git branches ordered by "recently modified first". | | From my `~/.gitconfig`:
|
|     [alias]
|         brt = "!git for-each-ref refs/heads --color=always --sort=-committerdate --format='%(HEAD)%(color:reset);%(color:yellow)%(refname:short)%(color:reset);%(contents:subject);%(color:green)(%(committerdate:relative))%(color:blue);<%(authorname)>' | column -t -s ';'"
|
| I always spent a lot of time being confused about branches, | and never realised how easy the solution was. | simias wrote: | Oh, that's a great idea, I'll definitely be stealing that, | thanks! | polyrand wrote: | I couldn't agree more with that. I wrote a bash function to | search through my notes folder with fzf + rg and it works | perfectly. | | Also, I have a specific pattern for writing tags inside | files that I can parse with ripgrep. | rjzzleep wrote: | rga also indexes them when you search. To be honest I like that | approach a lot more, since it saves space and I generally know | where I'm looking for things:
|
|     ls -sh ~/.cache/rga/
|     total 336M
|     336M data.mdb
|     4.0K lock.mdb
|
| ssivark wrote: | That kind of caching is an interesting solution to | incrementally building a database instead of spending hours | up-front indexing. So the tool is ready for immediate use. | Quite nifty :-) | curious_tenet wrote: | Wow, that is so cool! | nikisweeting wrote: | Aww hell yeah, we should definitely use this in place of ripgrep | for the new ArchiveBox.io full-text search backend. | | https://github.com/ArchiveBox/ArchiveBox/pull/543 | phiresky wrote: | Developer of the tool here :) Glad to see it posted here; I still | actively use it myself. Also check out the fzf integration in the | README: https://github.com/phiresky/ripgrep-all/blob/master/doc/rga-... | | Currently the main branch is undergoing a refactor to add support | for custom extractors (calling out to other tools), and | more flexible chains of extractors.
| | Ripgrep itself has functionality integrated to call custom | extractors with the `--pre` flag, but by adding it here we can | retain the benefits of the rga wrapper (more accurate file type | matchers, caching, recursion into archives, adapter chaining, no | slow shell scripts in between, etc). | | Sadly, during rewriting it to allow this, I kind of got hung up | and couldn't manage to figure out how to cleanly design that in | Rust. I'd be really glad if a Rust expert could help me out here: | | In the currently stable version, the main interface of each | "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter | chaining I have to change it to be `fn(Read) -> Read` where each | chained adapter wraps the read stream and converts it while | reading. But then I get issues with how to handle threading etc, | as well as a random deadlock that I haven't figured out how to | solve so far :/ | maximz wrote: | Love this. I appreciate your building on ripgrep versus my own | bulky lucene-based approach a while back | (https://github.com/maximz/sift), and that you don't require | pre-indexing but build up a cache as you go. | burntsushi wrote: | > In the currently stable version, the main interface of each | "adapter" is `fn(Read, Write) -> ()`. To allow custom adapter | chaining I have to change it to be `fn(Read) -> Read` where | each chained adapter wraps the read stream and converts it | while reading. But then I get issues with how to handle | threading etc, as well as a random deadlock that I haven't | figured out how to solve so far :/ | | I don't quite grok the problem here. If you file an issue | against ripgrep proper with code links and some more details, I | can try to assist. | | Taken literally, ripgrep uses that exact same approach. There | are potentially multiple adapters being used. 
Each adapter is | just defined to wrap a `std::io::Read` implementation, and the | adapter in turn implements `std::io::Read` so that it can be | composed with others. The part that I'm missing is why this has | anything to do with threading or deadlocks. I/O adapters | shouldn't be having anything to do with synchronization. So I'm | probably misunderstanding your problem. | phiresky wrote: | > If you file an issue against ripgrep proper with code links | and some more details | | Sorry, I don't think I explained my issue very well. In | general it has nothing to do with the interaction with | ripgrep, that works fine. | | It's that each adapter (e.g. zip -> list of file streams) | needs to have an interface of fn(Read) -> Iter<ReadWithMeta> | | But then if there's a PDF within the zip, I have to give the | returned ReadWithMeta to the PDF adapter - but it can't take | ownership, because the Archive file iterators only give | borrowed reads. I maybe worked around this by creating a | wrapper type [3] and adding an unsafe here [2], but something | deadlocks when adapting zip files currently. | | Also, for external programs, I have to copy the data from the | Read into a Write (stdin of the program) - which needs to | happen in a separate thread, otherwise the stdout is never | read [1], but some Reads I have aren't Send since they come | from e.g. zip-rs, so they can't be passed to a thread. | | [1] https://github.com/phiresky/ripgrep- | all/blob/baca166fdab3d24... | | [2] https://github.com/phiresky/ripgrep- | all/blob/baca166fdab3d24... | | [3] https://github.com/phiresky/ripgrep- | all/blob/baca166fdab3d24... | one-punch wrote: | The integration with fzf seems nice. | | Any plans to integrate with skim, a Rust implementation of fzf? 
| | https://github.com/lotabout/skim | cb321 wrote: | One possibility is the almost dirt-simple solution wherein you | just have a "make"/"Makefile" (or your favorite other build | system) maintain a shadow tree of parallel pre-translated | files. You get parallelism via `make -j$(nproc)` or its | equivalent. | | Every name in the shadow is built from the name in the origin, | but maybe with ".txt" added (or ".txt.gz" if you want to keep them | compressed with whatever is the fastest decompressor built into | ripgrep as a library, not called as a program). Untranslated | names can just be symbolic/hard links back to the origin. Build | rules become as flexible as your build system. | | This also scales to deployments that have more disk space than | memory. Admittedly, in that case, the whole procedure probably | becomes disk-IO bound, but maybe not. Maybe some translations | cannot even keep up with disk IO - NVMe storage is pretty fast, | for example. Or available memory may vary dynamically a lot, | sometimes allowing the shadow to be fully in the buffer cache, | other times not. It strikes me as less presumptuous to assume | you can find disk space vs. having that much memory available. | (EDIT2: though I may be confused about how `rga` operates - | your doc says "memory cache", though.) | | On the pro side, apart from updating the shadows based on origins, | the user could even just `rg` from within the shadow and | translate filenames "in their head", although stripping an | always-present string is obviously trivial. Indeed, you won't | need `rg --pre` at all, and the grep itself could become | pluggable. I doubt any of your other `fzf`/etc. integrations | would be made more complicated by this design, either.
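The shadow-tree idea above can be sketched in a few lines: mirror the origin tree, pre-translate known file types to `.txt`, and skip files whose shadow copy is newer than the source, which is exactly the staleness check make would do via timestamps. This is a hypothetical sketch with a pluggable `translators` map, not an existing tool:

```python
import os

def sync_shadow(origin, shadow, translators):
    """Mirror origin/ into shadow/, pre-translating known types to .txt.

    translators maps an extension like '.pdf' to a bytes -> str function;
    a shadow file is rebuilt only when the origin is newer (make-style).
    """
    for dirpath, _dirs, files in os.walk(origin):
        rel = os.path.relpath(dirpath, origin)
        out_dir = shadow if rel == "." else os.path.join(shadow, rel)
        os.makedirs(out_dir, exist_ok=True)
        for name in files:
            src = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1]
            if ext not in translators:
                continue  # a fuller version would hard-link these to the origin
            dst = os.path.join(out_dir, name + ".txt")
            if os.path.exists(dst) and os.path.getmtime(dst) >= os.path.getmtime(src):
                continue  # shadow is up to date
            with open(src, "rb") as f:
                data = f.read()
            with open(dst, "w", encoding="utf-8") as f:
                f.write(translators[ext](data))
```

After a sync, plain `rg` over the shadow tree searches the translated text with no `--pre` hook needed.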
| | This all strikes me as simple/nice enough that someone has | probably already done it... EDIT1: Oh, I see from thumbs-ups and | other comments over at [1] and [2] that @phiresky is probably | already aware of this design idea, but maybe some HN person | knows of an existing solution along these lines. | | [1] https://github.com/BurntSushi/ripgrep/issues/978 [2] | https://github.com/BurntSushi/ripgrep/pull/981 | aembleton wrote: | AUR has both a ripgrep-all [1] and a ripgrep-all-bin [2] package. | Both were added by you. The bin package has a newer version. | What is the difference between the two? | | 1. https://aur.archlinux.org/packages/ripgrep-all/ | | 2. https://aur.archlinux.org/packages/ripgrep-all-bin ___________________________________________________________________ (page generated 2020-12-02 23:00 UTC)