[HN Gopher] If you use GNU grep on text files, use the -a (--tex...
___________________________________________________________________
If you use GNU grep on text files, use the -a (--text) option

Author : rurban
Score  : 120 points
Date   : 2020-04-21 04:52 UTC (18 hours ago)

(HTM) web link (utcc.utoronto.ca)
(TXT) w3m dump (utcc.utoronto.ca)

| zajio1am wrote:
| This 'feature' is especially irritating when one uses grep on
| some text files with legacy (non UTF-8) encoding, but has locale
| with UTF-8 encoding. The grep decides that regular text file is
| binary just because there are byte sequences that are not valid
| UTF-8 sequences.

| battery_cowboy wrote:
| Oh man, I've had this issue before and I just chose to nuke the
| logs and try again, thinking they were corrupted!

| arendtio wrote:
| Actually, I never considered log files untrusted input, but as
| this example shows, it would be wise to do so.

| JoachimSchipper wrote:
| FWIW, attacks like Javascript or SQL injection via logfiles are
| hardly unknown. Log files are plenty scary. ;-)

| avodonosov wrote:
| Also beware of CWE-117

| arendtio wrote:
| https://cwe.mitre.org/data/definitions/117.html
|
| 'CWE-117: Improper Output Neutralization for Logs'
|
| That is something probably often forgotten when simply
| dumping some requests into a log, but at least it should be
| obvious that the source of the content is untrusted. On the
| other hand, a log file is a file on your server, so you would
| probably think of it as nothing dangerous, as everybody has
| cared about CWE-117, right? ;-)

| kwoff wrote:
| Also precede the file list by `--`. A very confusing thing can
| happen if a file happens to begin with a dash... (You can
| intersperse options like `-e pattern` among file names, if for
| some reason you wanted to do that.)

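A minimal sketch of the two habits mentioned so far in the thread, `-a` and `--`, assuming GNU grep and a POSIX-style shell with `mktemp`. The file names (`app.log`, `-n.log`) and their contents are made up for illustration. The wording of grep's binary-file notice, and whether it goes to stdout or stderr, differs between grep versions, so the script only prints whatever it gets.

```shell
#!/bin/sh
# Illustrative only: the file names and contents are invented for this sketch.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

# A "log" that is ordinary text apart from one stray NUL byte at the end.
printf 'first line\nneedle here\n\000\n' > app.log

# A file whose name begins with a dash.
printf 'needle in dash file\n' > ./-n.log

# Without -a, GNU grep treats app.log as binary and prints a notice
# instead of the matching line (wording and stream depend on the version).
plain=$(grep needle app.log 2>&1)

# With -a (--text), the file is processed as text and the line is printed.
text=$(grep -a needle app.log)

# "--" ends option parsing, so -n.log is read as a file name, not as options.
dash=$(grep -a needle -- -n.log)

printf 'without -a: %s\n' "$plain"
printf 'with -a:    %s\n' "$text"
printf 'after --:   %s\n' "$dash"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

When the file list comes from a glob, writing `./*` instead of `*` avoids the leading-dash problem as well, since no expanded name can then start with `-`.
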
| iforgotpassword wrote:
| But that's not grep specific and generally a good idea,
| especially in scripts that get the file names from their
| command line, some input file or god knows where.

| exabrial wrote:
| > 'LC_ALL=C'
|
| Wuff, reminds me of the completely incompatible difference
| between BSD sed and GNU sed

| gpvos wrote:
| I just looked through the GNU grep history to see when it
| suddenly started being able to decide halfway through a file that
| it is binary after all; this is since 16 September 2014, so
| fairly recently. Before that, it just checked the first few
| kilobytes to decide, and didn't change its opinion afterwards. To
| me, this is a very nonintuitive change.

| iforgotpassword wrote:
| Oh yes. I was bitten by that when I piped around some tool's
| output that eventually went through grep. The tool's output was
| actually text but for some reason there was a null byte at the
| end which nobody noticed before.
|
| The fun part was that the way the data got chunked through the
| pipes was not deterministic, so sometimes you got the desired
| output from grep, but other times just "binary file matches",
| even when the raw output from the first tool was identical in
| both runs. That took quite some head scratching to figure out.

| gpvos wrote:
| Why is it that in recent years, with this and the more recent
| ls quoting fiasco, maintainers of longstanding UNIX utilities
| suddenly got the urge to fix what isn't broken?

| chubot wrote:
| What's the 'ls quoting fiasco'?
|
| Actually I recently found that coreutils and ls behave fairly
| well with funny filenames:
|
| Here is an invalid utf-8 byte and then a valid utf-8 sequence
|     $ x=$'\xce\xce\xbc'
|     $ touch "$x"
|
| You can list it:
|     $ ls ?m
|
| And here 'ls' does better than other tools that display
| filenames.
| It shows the invalid byte and then keeps decoding
| with error recovery:
|     $ ls --escape
|     \316m
|
| However GNU stat (which I think is also in coreutils) does
| something similar, but weirdly messed up:
|     $ stat *
|     File: ''$'\316''m'
|
| (it looks like it's outputting a valid shell string, except
| with extra quotes)
|
| -----
|
| Most command line tools are not aware of stuff like this. For
| example you can touch "x$ANSI_TERMINAL_CODES" and if you do
| "bash x??" or "python x??", then your terminal will change
| color because of the escape codes printed back to the
| terminal.
|
| I just changed Oil to use a well-defined format I called QSN
| (quoted string notation):
|
| http://www.oilshell.org/blog/2020/04/release-0.8.pre4.html#t...
|
| It adapts Rust's string literal syntax to express arbitrary
| byte strings precisely and losslessly. (JSON can't express
| arbitrary byte strings.)
|
| The QSN encoder does UTF-8 _decoding_ with a specific error
| recovery mechanism. So it's basically like what ls and stat
| do, but it's more precise.
|
| (If anyone is interested in QSN, please contact me. I think
| it's more generally useful in a lot of places. It's something
| we already do but it's precise like JSON.)

| _jal wrote:
| They broke it in 2016.
|
| https://www.gnu.org/software/coreutils/quotes.html
|
| At least with Gnu, you can recompile your own, non-broken
| version, which is the only saving grace of these stupid,
| trendy changes.

| Anthony-G wrote:
| Not broken at all. I really don't understand the hate for
| this change.
|
| I deal with a lot of filenames with spaces and think this
| change is a great improvement for listing such files.
| With this change it's much easier to see where one
| filename ends and the other begins. Before this change, I
| had to use the `-1` option to ensure that each filename
| was listed on a line by itself. Now the filename listings
| are much more readable and it takes less cognitive effort
| to take it all in.
|
| The way it handles filenames with ASCII
| apostrophes/single quotes works particularly well (wraps
| the filename in double quotes instead of single quotes)
| and makes it very easy to copy and paste filenames to and
| from the terminal.
|
| Best of all, this change only applies when standard
| output is a TTY device so this does not break any shell
| scripts (even though parsing `ls` is a bad idea in any
| case) and is still compliant with the POSIX
| specification[1] which states that _"If the output is to
| a terminal, the format is implementation-defined"_.
|
| 1. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/l...

| boring_twenties wrote:
| GNU ls prints the filenames already-quoted as of late:
|     % touch foo
|     % touch 'bar baz'
|     % ls
|     'bar baz' foo

| knolax wrote:
| A google search for 'ls quoting fiasco' shows this thread as
| the top result. Could you explain what it is?

| [deleted]

| linsomniac wrote:
| I mostly run into this on searching my environment: "set | grep
| whatever", now needs an "-a", possibly because of escape codes
| added to the environment a decade ago.
|
| Maybe the fix would be to only activate the "detect binary files"
| code if stdout isatty?
|
| Because it is a nice feature when I do a big grep to find
| something among my home directory or the entire filesystem. It is
| certainly annoying to get binary garbage in my terminal. Or maybe
| the binary detection could get smarter, maybe making the
| determination on a match-by-match basis ("This line I'm about to
| output is a kilobyte and half of it is non-printable", say).
|
| Though, ack-grep doesn't seem to avoid putting binary garbage on
| my terminal, so maybe reasonable to switch to something that
| isn't so clever? Most of my terminal greping is done with ack
| these days, so I'd probably be happy with gnu-grep disabling this
| cleverness.

| downerending wrote:
| Usually you want this feature no matter where the output is
| going.
| Adding "-a" sucks, but it's not obvious how else this
| could work (and still be backward-compatible).
|
| IIRC the grep heuristic only considers a short prefix of the
| file. If the garbage comes later, you lose. Unfortunately, this
| makes things seem a bit unpredictable.

| gumby wrote:
| > IIRC the grep heuristic only considers a short prefix of
| the file. If the garbage comes later, you lose.
| Unfortunately, this makes things seem a bit unpredictable.
|
| This was changed about five years ago to just keep looking.
| Which makes things a bit unpredictable in a different way.

| chaps wrote:
| Also works well on non-text files, similar to `strings`! I don't
| think it works as well, but can still be useful for quick checks.

| lizknope wrote:
| I normally do:
|
|     strings file | grep search_pattern

| the_jeremy wrote:
| shoutout to [RipGrep](https://github.com/BurntSushi/ripgrep),
| which is generally faster, has more intelligent defaults
| (searches cwd by default, ignores files matching .gitignore), and
| can search through only certain text files (like your .java and
| .py files, say). Not affiliated, just found it worth the effort
| to learn some slightly different flags, though many are the same
| as normal grep.

| freedomben wrote:
| Also a ripgrep fan. I rolled my own search tool[1] due to
| dissatisfaction with the available options at the time, but I
| gave ripgrep an evaluation when it came out and it was really
| good. I ultimately still use my own tool because I'm very
| attached to the way grep colors/formats stuff (and I
| implemented the same scheme in findref) but I use ripgrep on
| large code bases (like the Linux kernel) since it performs a
| lot better due to various optimizations.
|
| [1] https://github.com/FreedomBen/findref

| abrowne wrote:
| IIRC VS Code also uses ripgrep internally for searching.

| ainar-g wrote:
| > ignores files matching .gitignore
|
| Maybe it's just me, but that sounds like a bad default.
| I can definitely imagine people being confused by that.

| choeger wrote:
| It is. At first. But you get used to it quite fast. In my
| experience, when I have a .gitignore I either want to grep
| the ignored stuff or the rest, never both. So I notice quite
| early that something is odd, when rg reports nothing at all.

| Terr_ wrote:
| A problem-scenario that comes to mind involves "set your own"
| config files, like when a codebase has a config.xml.dist and
| you're supposed to copy and customize it to config.xml which
| should never get checked in.

| burntsushi wrote:
| They are. That's why it's always mentioned in the first few
| sentences of docs (man page, --help, README). With that said,
| this default is one of ripgrep's defining features and is
| something that users consistently report as one of their
| favorite things about ripgrep.
|
| You can disable all smart filtering (gitignore, hidden,
| binary) with `rg -uuu foo`. That will search the same stuff
| that `grep -r foo ./` will.

| 2038AD wrote:
| I've not looked too far into it but my guess is that ripgrep
| being faster is just due to Gnu grep using a slower algorithm
| (and supporting unnecessary extensions). The Rust regex library
| excludes look-arounds and backreferences and is openly inspired
| by RE2. Russ Cox, one of the guys behind RE2, wrote
| something[0] on the topic.
|
| [0] https://swtch.com/~rsc/regexp/regexp1.html

| burntsushi wrote:
| No, that's not why. I wrote about why:
| https://blog.burntsushi.net/ripgrep/
|
| GNU grep doesn't have look-arounds either. It does have back-
| references, but that doesn't impact searches that don't use
| back-references.
|
| I don't think there is really a concise way to describe why
| ripgrep is faster when comparing apples-to-apples. It depends
| on the queries and the corpus. The primary reasons are that
| it makes more efficient use of the hardware with algorithms
| that utilize SIMD.
|
| If you do an apples-to-oranges comparison (i.e., "why is `rg
| foo ./` so much faster than `grep -r foo ./`), then the
| answer is pretty easily "because ripgrep uses parallelism and
| employs smart filtering by default."

| pjot wrote:
| Genuinely curious how you noticed that this thread had
| referenced your work. And, ironically enough, a thread
| about text search!

| viklove wrote:
| I started using ripgrep a few years ago and haven't looked back.
| It's way faster, automatically excludes .gitignored files, and
| just has a bunch of common sense functionality.
|
| https://blog.burntsushi.net/ripgrep/

| eu wrote:
| I normally use -I to skip binary files

| jindraj wrote:
| Have you thought about environment variable GREP_OPTIONS?
| https://www.gnu.org/software/grep/manual/grep.html#Environme...
| You can define it at the beginning of the script.

| yellowapple wrote:
| > As this causes problems when writing portable scripts, this
| feature will be removed in a future release of grep, and grep
| warns if it is used. Please use an alias or script instead.

| king_phil wrote:
| I see this and suddenly it clicks! That is exactly why I couldn't
| import an SQL dump that I tried importing for days now, that is
| filtered with grep. Wow.
|
| And I was wondering all the time why mysql reported this strange
| error "SQL error in Binary file" when the .sql file was clearly a
| text file...

| spenrose wrote:
| Love ack for source code and text docs: https://beyondgrep.com

| arendtio wrote:
| Does someone know if using grep on a binary file is somehow
| defined by POSIX?
|
| At a glance, I couldn't find a reference on the grep page:
|
| https://pubs.opengroup.org/onlinepubs/9699919799/utilities/g...

| jwilk wrote:
| It isn't. The page says:
|
| > _The input files shall be text files._
|
| "Text file" is defined in
| https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
| :
|
| > _A file that contains characters organized into zero or more
| lines.
| The lines do not contain NUL characters and none can
| exceed {LINE_MAX} bytes in length, including the <newline>
| character._

| arendtio wrote:
| Thank you :-)
|
| Interesting, especially the part about the LINE_MAX. Even
| though it kinda makes sense, I would never have thought that
| having a very long line makes a file a non-text file when all
| characters are 'normal' characters.
___________________________________________________________________
(page generated 2020-04-21 23:00 UTC)