[HN Gopher] If you use GNU grep on text files, use the -a (--tex...
       ___________________________________________________________________
        
       If you use GNU grep on text files, use the -a (--text) option
        
       Author : rurban
       Score  : 120 points
       Date   : 2020-04-21 04:52 UTC (18 hours ago)
        
 (HTM) web link (utcc.utoronto.ca)
 (TXT) w3m dump (utcc.utoronto.ca)
        
       | zajio1am wrote:
       | This 'feature' is especially irritating when one uses grep on
       | text files with a legacy (non-UTF-8) encoding but has a locale
       | with UTF-8 encoding. grep then decides that a regular text file
       | is binary just because it contains byte sequences that are not
       | valid UTF-8.
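
A minimal sketch of the situation described above, assuming GNU grep and a UTF-8 locale. The file name and contents are made up for illustration. One ISO-8859-1 byte is enough to make the file invalid as UTF-8, and `-a` makes grep treat it as text anyway.

```shell
# Hypothetical file name and contents, for illustration only.
tmpdir=$(mktemp -d)

# \351 is "e with acute accent" in ISO-8859-1; as a lone byte it is not valid UTF-8.
printf 'caf\351 au lait\nplain ascii\n' > "$tmpdir/latin1.txt"

# With -a the file is always processed as text, so the matching line is counted.
count=$(grep -a -c 'caf' "$tmpdir/latin1.txt")
echo "lines matched with -a: $count"

rm -rf "$tmpdir"
```

Without `-a`, the same search in a UTF-8 locale may print only a "binary file matches" notice, depending on the grep version.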
        
       | battery_cowboy wrote:
       | Oh man, I've had this issue before and I just chose to nuke the
       | logs and try again, thinking they were corrupted!
        
       | arendtio wrote:
       | Actually, I never considered log files untrusted input, but as
       | this example shows, it would be wise to do so.
        
         | JoachimSchipper wrote:
         | FWIW, attacks like Javascript or SQL injection via logfiles are
         | hardly unknown. Log files are plenty scary. ;-)
        
         | avodonosov wrote:
         | Also beware of CWE-117
        
           | arendtio wrote:
           | https://cwe.mitre.org/data/definitions/117.html
           | 
           | 'CWE-117: Improper Output Neutralization for Logs'
           | 
           | That is something probably often forgotten when simply
           | dumping some requests into a log, but at least it should be
           | obvious that the source of the content is untrusted. On the
           | other hand, a log file is a file on your server, so you would
           | probably think of it as nothing dangerous, as everybody has
           | cared about CWE-117, right? ;-)
        
       | kwoff wrote:
       | Also precede the file list by `--`. A very confusing thing can
       | happen if a file happens to begin with a dash... (You can
       | intersperse options like `-e pattern` among file names, if for
       | some reason you wanted to do that.)
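
A small sketch of why `--` helps, assuming GNU grep's habit of accepting options anywhere on the command line. The file names are hypothetical.

```shell
# Hypothetical file names, for illustration only.
tmpdir=$(mktemp -d)
printf 'needle\n' > "$tmpdir/-v"          # a file whose name looks like an option
printf 'needle\n' > "$tmpdir/notes.txt"

# Without "--", the glob expands to include "-v", which GNU grep would parse
# as the invert-match option. After "--", every argument is a file name.
found=$(cd "$tmpdir" && grep -l 'needle' -- * | wc -l)
echo "files containing the pattern: $found"

rm -rf "$tmpdir"
```

Both files are listed here because `-v` is read as a file name rather than as an option.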
        
         | iforgotpassword wrote:
         | But that's not grep specific and generally a good idea,
         | especially in scripts that get the file names from their
         | command line, some input file or god knows where.
        
       | exabrial wrote:
       | > 'LC_ALL=C'
       | 
       | Wuff, reminds me of the completely incompatible difference
       | between BSD sed and GNU sed
        
       | gpvos wrote:
       | I just looked through the GNU grep history to see when it
       | suddenly started being able to decide halfway through a file that
       | it is binary after all; this is since 16 September 2014, so
       | fairly recently. Before that, it just checked the first few
       | kilobytes to decide, and didn't change its opinion afterwards. To
       | me, this is a very nonintuitive change.
        
         | iforgotpassword wrote:
         | Oh yes. I was bitten by that when I piped around some tool's
         | output that eventually went through grep. The tool's output was
         | actually text but for some reason there was a null byte at the
         | end which nobody noticed before.
         | 
         | The fun part was that the way the data got chunked through the
         | pipes was not deterministic, so sometimes you got the desired
         | output from grep, but other times just "binary file matches",
         | even when the raw output from the first tool was identical in
         | both runs. That took quite some head scratching to figure out.
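
A rough reproduction of that kind of input, assuming GNU grep. The helper name `make_output` is made up, and what the default invocation prints varies by grep version and, as described above, by how the data is chunked.

```shell
# Hypothetical stand-in for a tool whose output is text plus one trailing NUL byte.
make_output() { printf 'match me\nother line\n\0'; }

# Default behavior: depending on version and buffering, grep may print the
# matching line, a "binary file matches" notice, or both.
make_output | grep 'match' || true

# With -a the input is treated as text, so the matching line is printed.
with_a=$(make_output | grep -a 'match')
echo "$with_a"
```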
        
         | gpvos wrote:
         | Why is it that in recent years, with this and the more recent
         | ls quoting fiasco, maintainers of longstanding UNIX utilities
         | suddenly got the urge to fix what isn't broken?
        
           | chubot wrote:
           | What's the 'ls quoting fiasco'?
           | 
           | Actually I recently found that coreutils and ls behave fairly
           | well with funny filenames:
           | 
           | Here is an invalid utf-8 byte and then a valid utf-8 sequence:
           | 
           |     $ x=$'\xce\xce\xbc'
           |     $ touch "$x"
           | 
           | You can list it:
           | 
           |     $ ls
           |     ?m
           | 
           | And here 'ls' does better than other tools that display
           | filenames. It shows the invalid byte and then keeps decoding
           | with error recovery:
           | 
           |     $ ls --escape
           |     \316m
           | 
           | However GNU stat (which I think is also in coreutils) does
           | something similar, but weirdly messed up:
           | 
           |     $ stat *
           |     File: ''$'\316''m'
           | 
           | (it looks like it's outputting a valid shell string, except
           | with extra quotes)
           | 
           | -----
           | 
           | Most command line tools are not aware of stuff like this. For
           | example you can touch "x$ANSI_TERMINAL_CODES" and if you do
           | "bash x??" or "python x??", then your terminal will change
           | color because of the escape codes printed back to the
           | terminal.
           | 
           | I just changed Oil to use a well-defined format I called QSN
           | (quoted string notation):
           | 
           | http://www.oilshell.org/blog/2020/04/release-0.8.pre4.html#t.
           | ..
           | 
           | It adapts Rust's string literal syntax to express arbitrary
           | byte strings precisely and losslessly. (JSON can't express
           | arbitrary byte strings.)
           | 
           | The QSN encoder does UTF-8 _decoding_ with a specific error
           | recovery mechanism. So it's basically like what ls and stat
           | do, but it's more precise.
           | 
           | (If anyone is interested in QSN, please contact me. I think
           | it's more generally useful in a lot of places. It's something
           | we already do but it's precise like JSON.)
        
             | _jal wrote:
             | They broke it in 2016.
             | 
             | https://www.gnu.org/software/coreutils/quotes.html
             | 
             | At least with GNU, you can recompile your own, non-broken
             | version, which is the only saving grace of these stupid,
             | trendy changes.
        
               | Anthony-G wrote:
               | Not broken at all. I really don't understand the hate for
               | this change.
               | 
               | I deal with a lot of filenames with spaces and think this
               | change is a great improvement for listing such files.
               | With this change it's much easier to see where one
               | filename ends and the other begins. Before this change, I
               | had to use the `-1` option to ensure that each filename
               | was listed on a line by itself. Now the filename listings
               | are much more readable and it takes less cognitive effort
               | to take it all in.
               | 
               | The way it handles filenames with ASCII
               | apostrophes/single quotes works particularly well (wraps
               | the filename in double quotes instead of single quotes)
               | and makes it very easy to copy and paste filenames to and
               | from the terminal.
               | 
               | Best of all, this change only applies when standard
               | output is a TTY device so this does not break any shell
               | scripts (even though parsing `ls` is a bad idea in any
               | case) and is still compliant with the POSIX
               | specification[1] which states that _"If the output is to
               | a terminal, the format is implementation-defined"_.
               | 
               | 1. https://pubs.opengroup.org/onlinepubs/9699919799/utili
               | ties/l...
        
             | boring_twenties wrote:
             | GNU ls prints the filenames already-quoted as of late:
             | 
             |     % touch foo
             |     % touch 'bar baz'
             |     % ls
             |     'bar baz'   foo
        
           | knolax wrote:
           | A google search for 'ls quoting fiasco' shows this thread as
           | the top result. Could you explain what it is?
        
       | linsomniac wrote:
       | I mostly run into this on searching my environment: "set | grep
       | whatever", now needs an "-a", possibly because of escape codes
       | added to the environment a decade ago.
       | 
       | Maybe the fix would be to only activate the "detect binary files"
       | code if stdout isatty?
       | 
       | Because it is a nice feature when I do a big grep to find
       | something among my home directory or the entire filesystem. It is
       | certainly annoying to get binary garbage in my terminal. Or maybe
       | the binary detection could get smarter, maybe making the
       | determination on a match-by-match basis ("This line I'm about to
       | output is a kilobyte and half of it is non-printable", say).
       | 
       | Though, ack-grep doesn't seem to avoid putting binary garbage on
       | my terminal, so maybe reasonable to switch to something that
       | isn't so clever? Most of my terminal grepping is done with ack
       | these days, so I'd probably be happy with gnu-grep disabling this
       | cleverness.
        
         | downerending wrote:
         | Usually you want this feature no matter where the output is
         | going. Adding "-a" sucks, but it's not obvious how else this
         | could work (and still be backward-compatible).
         | 
         | IIRC the grep heuristic only considers a short prefix of the
         | file. If the garbage comes later, you lose. Unfortunately, this
         | makes things seem a bit unpredictable.
        
           | gumby wrote:
           | > IIRC the grep heuristic only considers a short prefix of
           | the file. If the garbage comes later, you lose.
           | Unfortunately, this makes things seem a bit unpredictable.
           | 
           | This was changed about five years ago to just keep looking.
           | Which makes things a bit unpredictable in a different way.
        
       | chaps wrote:
       | Also works well on non-text files, similar to `strings`! I don't
       | think it works as well, but can still be useful for quick checks.
        
         | lizknope wrote:
         | I normally do:
         | 
         | strings file | grep search_pattern
        
       | the_jeremy wrote:
       | shoutout to [RipGrep](https://github.com/BurntSushi/ripgrep),
       | which is generally faster, has more intelligent defaults
       | (searches cwd by default, ignores files matching .gitignore), and
       | can search through only certain text files (like your .java and
       | .py files, say). Not affiliated, just found it worth the effort
       | to learn some slightly different flags, though many are the same
       | as normal grep.
        
         | freedomben wrote:
         | Also a ripgrep fan. I rolled my own search tool[1] due to
         | dissatisfaction with the available options at the time, but I
         | gave ripgrep an evaluation when it came out and it was really
         | good. I ultimately still use my own tool because I'm very
         | attached to the way grep colors/formats stuff (and I
         | implemented the same scheme in findref) but I use ripgrep on
         | large code bases (like the Linux kernel) since it performs a
         | lot better due to various optimizations.
         | 
         | [1] https://github.com/FreedomBen/findref
        
         | abrowne wrote:
         | IIRC VS Code also uses ripgrep internally for searching.
        
         | ainar-g wrote:
         | > ignores files matching .gitignore
         | 
         | Maybe it's just me, but that sounds like a bad default. I can
         | definitely imagine people being confused by that.
        
           | choeger wrote:
           | It is. At first. But you get used to it quite fast. In my
           | experience, when I have a .gitignore I either want to grep
           | the ignored stuff or the rest, never both. So I notice quite
           | early that something is odd, when rg reports nothing at all.
        
           | Terr_ wrote:
           | A problem-scenario that comes to mind involves "set your own"
           | config files, like when a codebase has a config.xml.dist and
           | you're supposed to copy and customize it to config.xml which
           | should never get checked in.
        
           | burntsushi wrote:
           | They are. That's why it's always mentioned in the first few
           | sentences of docs (man page, --help, README). With that said,
           | this default is one of ripgrep's defining features and is
           | something that users consistently report as one of their
           | favorite things about ripgrep.
           | 
           | You can disable all smart filtering (gitignore, hidden,
           | binary) with `rg -uuu foo`. That will search the same stuff
           | that `grep -r foo ./` will.
        
         | 2038AD wrote:
         | I've not looked too far into it but my guess is that ripgrep
         | being faster is just due to GNU grep using a slower algorithm
         | (and supporting unnecessary extensions). The Rust regex library
         | excludes look-arounds and backreferences and is openly inspired
         | by RE2. Russ Cox, one of the guys behind RE2, wrote
         | something[0] on the topic.
         | 
         | [0] https://swtch.com/~rsc/regexp/regexp1.html
        
           | burntsushi wrote:
           | No, that's not why. I wrote about why:
           | https://blog.burntsushi.net/ripgrep/
           | 
           | GNU grep doesn't have look-arounds either. It does have back-
           | references, but that doesn't impact searches that don't use
           | back-references.
           | 
           | I don't think there is really a concise way to describe why
           | ripgrep is faster when comparing apples-to-apples. It depends
           | on the queries and the corpus. The primary reasons are that
           | it makes more efficient use of the hardware with algorithms
           | that utilize SIMD.
           | 
           | If you do an apples-to-oranges comparison (i.e., "why is `rg
           | foo ./` so much faster than `grep -r foo ./`?"), then the
           | answer is pretty easily "because ripgrep uses parallelism and
           | employs smart filtering by default."
        
             | pjot wrote:
             | Genuinely curious how you noticed that this thread had
             | referenced your work. And, ironically enough, a thread
             | about text search!
        
       | viklove wrote:
       | I started using ripgrep a few years ago and haven't looked back.
       | It's way faster, automatically excludes .gitignored files, and
       | just has a bunch of common sense functionality.
       | 
       | https://blog.burntsushi.net/ripgrep/
        
       | eu wrote:
       | I normally use -I to skip binary files
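
A short sketch of what `-I` does, with hypothetical file names. It treats a file that grep classifies as binary as if it contained no match.

```shell
# Hypothetical file names, for illustration only.
tmpdir=$(mktemp -d)
printf 'needle in text\n' > "$tmpdir/text.txt"
printf 'needle\0in blob\n' > "$tmpdir/blob.bin"   # the NUL byte makes grep classify this as binary

# -I treats binary files as non-matching; -l lists the names of matching files.
listed=$(cd "$tmpdir" && grep -I -l 'needle' text.txt blob.bin)
echo "$listed"

rm -rf "$tmpdir"
```

Note the trade-off with `-a`: a text file that grep misclassifies as binary is silently skipped under `-I` rather than searched.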
        
       | jindraj wrote:
       | Have you thought about environment variable GREP_OPTIONS?
       | https://www.gnu.org/software/grep/manual/grep.html#Environme...
       | You can define it at the beginning of the script.
        
         | yellowapple wrote:
         | > As this causes problems when writing portable scripts, this
         | feature will be removed in a future release of grep, and grep
         | warns if it is used. Please use an alias or script instead.
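
One way to follow that advice is a small shell function defined at the top of a script. This is only a sketch, and the name `tgrep` is made up for illustration.

```shell
# Hypothetical wrapper name; "command" bypasses any alias or function named grep.
tgrep() { command grep -a "$@"; }

# Input with a trailing NUL byte is still searched as text through the wrapper.
out=$(printf 'ok line\nskip\n\0' | tgrep 'ok')
echo "$out"
```

Using a distinct name rather than redefining `grep` keeps the script's behavior explicit at each call site.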
        
       | king_phil wrote:
       | I see this and suddenly it clicks! That is exactly why I couldn't
       | import an SQL dump that I tried importing for days now, that is
       | filtered with grep. Wow.
       | 
       | And I was wondering all the time why mysql reported this strange
       | error "SQL error in Binary file" when the .sql file was clearly a
       | text file...
        
       | spenrose wrote:
       | Love ack for source code and text docs: https://beyondgrep.com
        
       | arendtio wrote:
       | Does someone know if using grep on a binary file is somehow
       | defined by POSIX?
       | 
       | At a glance, I couldn't find a reference on the grep page:
       | 
       | https://pubs.opengroup.org/onlinepubs/9699919799/utilities/g...
        
         | jwilk wrote:
         | It isn't. The page says:
         | 
         | > _The input files shall be text files._
         | 
         | "Text file" is defined in
         | https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
         | :
         | 
         | > _A file that contains characters organized into zero or more
         | lines. The lines do not contain NUL characters and none can
         | exceed {LINE_MAX} bytes in length, including the <newline>
         | character._
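
As an illustration of the NUL condition in that definition, the sketch below checks whether a file contains any NUL bytes. The helper name `has_no_nul` and the file names are hypothetical, and the check covers only that one condition, not line length.

```shell
# Hypothetical helper: succeeds when the file contains no NUL bytes,
# which is one of the conditions in the definition quoted above.
has_no_nul() {
    [ "$(tr -d '\000' < "$1" | wc -c)" -eq "$(wc -c < "$1")" ]
}

tmpdir=$(mktemp -d)
printf 'one\ntwo\n' > "$tmpdir/clean.txt"
printf 'one\0two\n' > "$tmpdir/nul.txt"

if has_no_nul "$tmpdir/clean.txt"; then clean=yes; else clean=no; fi
if has_no_nul "$tmpdir/nul.txt"; then nul=yes; else nul=no; fi
echo "clean.txt: $clean, nul.txt: $nul"

# The line-length limit is system-dependent; getconf reports it where available.
getconf LINE_MAX 2>/dev/null || echo "getconf not available"

rm -rf "$tmpdir"
```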
        
           | arendtio wrote:
           | Thank you :-)
           | 
           | Interesting, especially the part about the LINE_MAX. Even
           | though it kinda makes sense, I would never have thought that
           | having a very long line makes a file a non-text file when all
           | characters are 'normal' characters.
        
       ___________________________________________________________________
       (page generated 2020-04-21 23:00 UTC)