[HN Gopher] Tips for analyzing logs ___________________________________________________________________ Tips for analyzing logs Author : historynops Score : 186 points Date : 2022-12-13 16:47 UTC (6 hours ago) (HTM) web link (jvns.ca) (TXT) w3m dump (jvns.ca) | phillipcarter wrote: | Surprised to see that under the section "correlate between | different systems" tracing isn't mentioned as an alternative | approach. That's what tracing is: logging across different | systems and getting that structure all stitched together for you. | tetha wrote: | As much as I approve of a skillset to analyze local logs, after | a relatively small scale (10-20 systems), a decent central | log aggregation like opensearch or ELK just brings so much value | even on 1-3 nodes. It'd be one of the first changes I make to an | infrastructure because it's so powerful. | | And it's not just log searching and correlation value. At work, | the entire discussion "oh but we need access to all servers | because of logs" just died when all logs were accessible via one | web interface. I added a log aggregation and suddenly only ops | needed access to servers. | | Designing that thing with accessibility and discoverability in | mind is a whole 'nother topic though. | rsdbdr203 wrote: | Agreed! Centralizing logs is so helpful, but you don't know it | until you've done it. Too many people rely upon grep, when a | handy tool is just a download away. Plug for said tool: | https://log-store.com | emmelaich wrote: | I found the histogram technique to be really helpful. Slight mod | - I tend to sort reverse at the end of the pipeline (sort -rn); | then |head is often more useful. | | It's also good to have histograms by hour or day. I've hacked up | scripts to do this but I should really make something better! | baud147258 wrote: | A few weeks ago I had a windows installer that was silently | failing when upgrading from an older version (installation from | scratch was working without issues).
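The per-hour histogram mentioned above can be sketched in the shell. This is a hedged sketch assuming a hypothetical log with ISO-8601 timestamps at the start of each line; it buckets by the first 13 characters (date plus hour):

```shell
# Hypothetical sample log with ISO-8601 timestamps leading each line
printf '2022-12-13T16:05:01 ok\n2022-12-13T16:47:02 ok\n2022-12-13T17:01:03 err\n' > app.log

# Keep only the date+hour prefix, then count occurrences, biggest bucket first
cut -c1-13 app.log | sort | uniq -c | sort -rn
```

With real logs, adjust the cut width (or switch to awk) to match wherever the timestamp sits in your format.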
And as windows install logs | aren't exactly easy to read, I was stumped, until I took an | upgrade log from an older, working build, stripped all date | information from both files and compared them, checking all the | sections which were different, until I found a line indicating | that a colleague had forgotten about a limitation when dealing | with msp (don't delete components on minor upgrades, but I didn't | throw any stones as I've made the same mistake, twice, one and | two years ago...) | the_arun wrote: | Do we still use utilities like grep for searching logs? Are these | for when we cannot stream logs to tools like Splunk & Loggly and use | their search services? | ranger207 wrote: | All of our logs are in Kibana, but sometimes I'll `kubectl logs | pod > tmp; grep pattern tmp` because Kibana's search and | filtering is often annoying. Actually, I'll usually open the | logs in vim and use its search, which also gets me the ability | to eg delete every line not containing the pattern and then | search only things matching the first pattern. I'm going to try | lnav as mentioned in this thread but I've gotten by fine with | judicious use of grep | chaps wrote: | One thing I've done to identify infrequent log entries within a | log file is to remove all numbers from a file and print out a | frequency of each. Basically just helps to disregard timestamps | (not just at the beginning of the line), line numbers, etc. | cat file.log | sed 's/[0-9]//g' | sort | uniq -c | sort -nr | | This has been incredibly helpful in quickly resolving outages | more than once. | chasil wrote: | A minor optimization is collapsing the grep -v, from this: | cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | | grep -v THING4 | | to this: egrep -v 'THING1|THING2|THING3|THING4' | file | | That gets rid of the cat and three greps. Both POSIX and GNU | encourage grep -E to be used in preference to egrep.
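The collapsed form above can be checked end to end with a throwaway file (the THING names are the comment's placeholders):

```shell
# Throwaway sample: two lines to keep, four to filter out
printf 'keep one\nTHING1 noise\nTHING2 noise\nTHING3 noise\nTHING4 noise\nkeep two\n' > sample.log

# One grep -E with alternation replaces the cat plus four grep -v processes
grep -E -v 'THING1|THING2|THING3|THING4' sample.log
```

Only the two "keep" lines survive, matching what the chain of grep -v commands would produce.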
| | A pcregrep utility also used to exist, if you want expansive | perl-compatible regular expressions. This has been absorbed into | GNU grep with the -P option. | anotherevan wrote: | Or the following if pipes or egrep make you nervous: | | grep -v -e THING1 -e THING2 -e THING3 -e THING4 file | aargh_aargh wrote: | I once compared the speed of these two approaches, rather | accidentally. I did output colorization by adding ANSI | sequences. I thought, of course one process must be more | efficient than a pipe of processes. After the rewrite, I was | disappointed by the slowdown and reverted back to the pipe. | | PS I checked back and I used sed rather than grep. I think the | result would hold for grep but the moral is that you should | verify rather than assume. | | I have around 50 seds in the pipe, running in parallel (which | is what makes it faster); it would have been about half that | when I tried the rewrite. | yakubin wrote: | I usually prefer to pipe fgrep -v into fgrep -v. With egrep you | need to escape brackets and other characters. | mdaniel wrote: | If you haven't seen it, there's been some noise recently | about fgrep and its egrep friend $ fgrep -h | fgrep: warning: fgrep is obsolescent; using grep -F $ | egrep -h egrep: warning: egrep is obsolescent; using | grep -E | chasil wrote: | True, for absolute fixed strings, the author's approach is | superior. | | Off the cuff, another way to do it is with awk's index | function. I don't know what speed penalty this might impose. | $ cat ~/fmgrep #!/bin/awk -f BEGIN { | split(ARGV[1], s, ","); ARGV[1]="" } { m = 1; | for(c in s) if(index($0, s[c])) m = 0; if(m) print } | $ ~/fmgrep sshd,systemd,polkitd /var/log/secure | anotherevan wrote: | This also works: | | fgrep -v -e THING1 -e THING2 -e THING3 -e THING4 | burntsushi wrote: | > A pcregrep utility also used to exist, if you want expansive | perl-compatible regular expressions. This has been absorbed | into GNU grep with the -P option.
| | 'pcregrep' still exists. But with PCRE2 supplanting PCRE, it is | now spelled 'pcre2grep'. | | I don't know the precise history of 'grep -P' and whether | 'pcregrep' was actually absorbed into it, but 'pcregrep' is its | own thing with its own features. For example, it has a | -M/--multiline flag that no standard grep (that I'm aware of) | has. (Although there are some work-arounds, e.g., by treating | NUL as the line terminator via the -z/--null-data flag in GNU | grep.) | chasil wrote: | Oddly, there are pcre2 packages in RedHat/Alma 9, but they do | not include a pcre2grep. | | GNU grep is also linked to pcre, not pcre2. # | pcre2grep bash: pcre2grep: command not found... | # yum install pcre2grep Last metadata expiration check: | 1:58:58 ago on Tue 13 Dec 2022 11:45:44 AM CST. No | match for argument: pcre2grep Error: Unable to find a | match: pcre2grep # yum whatprovides pcre2grep | Last metadata expiration check: 2:09:25 ago on Tue 13 Dec | 2022 11:45:44 AM CST. Error: No matches found. | # rpm -qa | grep pcre2 | sort | pcre2-10.40-2.0.2.el9.x86_64 | pcre2-syntax-10.40-2.0.2.el9.noarch | pcre2-utf32-10.40-2.0.2.el9.x86_64 # which grep | /usr/bin/grep # ldd /usr/bin/grep | grep pcre | libpcre.so.1 => /lib64/libpcre.so.1 (0x00007efc473c4000) | jldugger wrote: | Honestly, the most amazing thing I did with logs was learn how to | do subtraction. Any time you have multiple instances of a thing | and only some of them are bad, you can easily find the problem | (if anyone bothered to log it) by performing bad - good. | | The way you do this is by aggregating logs by fingerprints. | Removing everything but punctuation is a generic approach to | fingerprinting, but is not exactly human friendly. For Java, | log4j can use class in your logging pattern, and that plus log | level is usually pretty specific. | | Once you have a fingerprint, the rest is just counting and | division. 
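One hedged shell sketch of that counting step (good.log and bad.log are hypothetical captures from healthy and misbehaving instances; the fingerprint here is simply the line with digits stripped):

```shell
# Hypothetical captures from a misbehaving and a healthy instance
printf 'ok request 1\nok request 2\nERROR disk 7 full\n' > bad.log
printf 'ok request 3\nok request 4\n' > good.log

# Fingerprint = line with digits stripped (crude, as noted, but generic)
sed 's/[0-9]//g' good.log | sort | uniq -c > good.fp
sed 's/[0-9]//g' bad.log  | sort | uniq -c > bad.fp

# Score each fingerprint as (1 + bad count) / (1 + good count), worst first
awk 'NR==FNR { good[substr($0, index($0, $2))] = $1; next }
     { f = substr($0, index($0, $2))
       printf "%.2f %s\n", (1 + $1) / (1 + good[f]), f }' good.fp bad.fp |
  sort -rn
```

The +1 smoothing keeps fingerprints that never appear on good systems from dividing by zero, and the ERROR line floats to the top even though nobody curated log levels.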
Over a specific time window, count the number of log | events, for every fingerprint, for both good and bad systems. | Then score every fingerprint as (1 + # of bad events) / (1 + # of | good events) and everything at the top is most strongly bad. And | the more often it's logged, the further up it will be. No more | lecturing people about "correct" interpretations of ERROR vs INFO | vs DEBUG. No more "this ERROR is always logged, even during | normal operations". | 10000truths wrote: | Isn't this basically what structured binary logs are? Instead | of writing a string like `Error: foo timeout after | ${elapsed_amt} ms` to the log, you write a 4-byte error code | and a 4-byte integer for elapsed_amt. I know there are | libraries like C++'s nanolog that do this for you, under the | hood. | jldugger wrote: | That solves the fingerprint, but you still need to count and | score. | nerdponx wrote: | You might be interested in the TF-IDF algorithm used in | information retrieval and text classification. | jldugger wrote: | Yes, this is pretty much TF-IDF for people too lazy to count | the number of unique items in the corpus. | | Since that number should be the same (or at least close!) in | both good and bad datasets, I'm not sure the extra math | matters much. | alexpetralia wrote: | Has anyone tried passing logs into chatGPT? I had thought it | would be especially effective here. | viraptor wrote: | It may work for a few lines, but you can't really pipe a GB of | logs into it, which is what we typically deal with. | | It works for simple explanations though. | didip wrote: | If you are using Kubernetes, I highly recommend using | https://github.com/stern/stern | [deleted] | kqr wrote: | One thing that's improved my log analysis is learning awk (well | actually Perl but I think 95 % of what I do with Perl I could | also do with just awk).
Often the most useful way to look at logs | is statistically, and awk can quickly let you aggregate things | like time-between-statements, counts, rates, state transition | probabilities, etc. for arbitrary patterns. | digitalsushi wrote: | One of my pet peeves is "The Useless Use of cat Award". Someone | awarded it to me as a teenager in the late 90s and I've been sore | ever since. | | Yup, it's often a waste of resources to run an extra 'cat'. It | really demonstrates that you don't have the usage of the command | receiving the output completely memorized. You know, the thousand | or so commands you might be piping it into. | | But, if you're doing a 'useless' use of cat, you're probably just | doing it in an interactive session. You're not writing a script. | (Or maybe you are, but even still, I bet that script isn't | running thousands of times per second. And if it is, ok, time to | question it). | | So you're wasting a few clock cycles. The computer is doing a few | billion of these per second? By the time you explain the | 'useless' use of cat to someone, the time you wasted explaining | to them why they are wrong is greater than the total time that | their lifetime usage of cat was going to waste. | | There's a set of people who correct the same three pairs of | homophones that get used incorrectly, but don't know what the | word 'homophone' is. (Har har, they're/their/there). I always | lump the people who are so quick to chew someone out for using | cat into the same batch as people who do this: what if I just want | to use cat because it makes my command easier to edit? I can | click up, warp to the front of the line, and change it real | quick. | | Sorry. I did say, it is a pet peeve. | amarshall wrote: | I do "useless" use of cat quite often because, in my brain, the | pipeline naturally starts with "given this file", so it makes | the pipeline more consistent e.g.
`cat f | a | b | c` rather | than `a < f | b | c` where one must start with the first | "function" rather than with the data. I see starting with `cat` | as analogous to the `->` thread macro in Clojure, `|>` pipe in | Elixir, and `&` reverse application operator in Haskell. If | bash permitted putting the filename first, I'd stop using | `cat`; alas, it does not. | aendruk wrote: | There's this. Though it would look nicer with another |. | <f a | b | c | amarshall wrote: | That's actually not half-bad. I think I'll use that, | thanks! | civopsec wrote: | Don't people like Rob Pike argue for `cat |` since it fits so | orthogonally with the rest of the pipeline? | infogulch wrote: | > the time [used to] explain the 'useless[ness]' .. of cat to | someone .. is greater than the total time that their lifetime | usage of cat was going to waste | | If you look for situations like this they are surprisingly | common. | | > _pet_ peeve | dllthomas wrote: | It's a lot more than the few extra cycles to spin up the | process - it's also an extra copy of all the data. Usually | that's also not much, but occasionally it's everything, as the | consuming program can seek in a file, but not in a pipe, so | might otherwise only need a tiny bit of the data. | marcosdumay wrote: | Well, you waste an entire fork and exec, so I believe you are | underestimating the time by a few orders of magnitude. Also, | it's almost always grep following the cat, so it's not much to | memorize. | | But it's well worth wasting a process to have a nice pipeline | where each command does a single thing so you can easily reason | about them. | sisk wrote: | Loosely related: a few years ago I wanted a simpler alternative | to some of the more feature-full log viewers out there so I threw | together a tiny (50kb) app that might be useful to some folks in | here. | | All it does is consistently color the first field in a line from | stdin so you can quickly see which log lines have the same first | field.
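A rough, colorless approximation of that idea with plain sed (the file names here are hypothetical): tag each line with its source so interleaved output stays attributable.

```shell
# Two hypothetical replica logs
printf 'alpha says hi\n' > replica1.log
printf 'beta says hi\n' > replica2.log

# Prefix every line with the file it came from, akin to tagging by replica name
for f in replica1.log replica2.log; do
  sed "s/^/[$f] /" "$f"
done
```

The same sed substitution works as a per-host filter behind parallel or a tail -f fan-out.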
| | I used it in combination with the parallel[0] command to prefix | log lines by replica name when tailing logs across machines: | https://github.com/jasisk/color-prefix-pipe | | [0]: https://www.gnu.org/software/parallel/ | chrisweekly wrote: | Glad to see lnav (https://lnav.org) already getting some love in | the comments. Hands-down the most reliable source of "thank you! | I wish I'd known about this tool sooner!" responses, even from v | experienced sysadmins / SREs. | AtNightWeCode wrote: | Usage of correlation ids and structured logs is pretty much | standard. Don't go down this path. | oxff wrote: | Just use structured logs, it's 2022 ffs. | kamma4434 wrote: | Jeez, this stuff is frontpage on HN? Sounds... pretty basic. I'm | sure our AI overlords could produce deeper content. | viraptor wrote: | People on HN have varied levels of skill, and this is a well | structured introduction to diving into logs. It already started | a conversation about better tooling. Let's celebrate today's | lucky 10000 https://xkcd.com/1053/ rather than talk something | down for being basic. | hex4def6 wrote: | I'm interested in this idea as well. Seems like it would be | useful for detecting unusual issues assuming you have a large | enough data set of "normal" runs. | | I've been mulling this idea over in my head as well. I have a | fleet of PCs out in the wild, all running the same software. It | would be nice to have an easy way to detect strange behavior | like processes that are continually respawning / segfaults / | crashes / etc, without explicitly writing a bunch of search | terms. | phillipcarter wrote: | Yours is an unhelpful and non-constructive comment. Clearly a | lot of people have been getting something out of the content in | this post, as it's started several discussions.
| kerblang wrote: | > you'll get overwhelmed by a million irrelevant messages because | the log level is set to INFO | | I know this happens, but I think it's because programmers are | abusing INFO. In principle it's reserved for messages that are | informative at a level sys admins and a few others can make sense | of and use. Unfortunately abuse often leads to "We turned INFO | off" making it much harder to diagnose things after the fact. | pantulis wrote: | I think there should be a counterpart to "log analysis", which | is "logging strategies for your app": WHAT to | log and WHEN. | | Stuff like, if you are exposing an HTTP endpoint, you should | log the request URL and the time it took to serve it. Or if you | are invoking an external service, you should log the response | time of that service. | lowbloodsugar wrote: | And you should produce a single line of output for each | request that identifies all of the pertinent information. You | can have more than one, e.g. for a thread dump, but there | should be one that provides a complete summary. I've lived | with apps that logged at each stage of the process as | separate lines, and that's just not useful data when grepping | for anomalies. | rsdbdr203 wrote: | I hypothesize it's not useful _because_ you're using grep. | If you use a tool that can show you multiple lines all tied | by a request ID, it becomes much more helpful. | jeppesen-io wrote: | also, journalctl. Very nice to have all logs in one place in a | semi-structured nature | | `journalctl -u myservice -S "5 min ago"` | heliostatic wrote: | My biggest quality of life improvement for understanding logs has | been lnav (https://lnav.org/) -- does everything mentioned in | this post in a single tool with interactive filtering and quick | logical and time based navigation. | TeMPOraL wrote: | Thanks for linking to this! | | Until now I thought logview.el[0] is the bee's knees, but now I | can feel feature envy set in.
There are some seriously powerful | ideas listed on the lnav page, and it's also the first time I | saw SQLite virtual tables used in the wild. | | -- | | [0] - https://github.com/doublep/logview | chrisweekly wrote: | Yes! lnav (https://lnav.org) is phenomenal. Embedded SQLite... | easily scriptable... OOB log formats galore, or define your | own... it's a mini ETL powertool that scales to at least a few | million rows and runs in your terminal. Maintainer's a friendly | dude, too. | francislavoie wrote: | At my current company, we're using pdsh with tail -f piped into | sed to follow logs from multiple remote servers at the same | time, label them and colorize them. Works okay. Decent solution | without needing to install much other software. Not my | favourite because it doesn't deal well with SSH sessions timing | out and leaves some tail -f processes hanging, and some other | quirks. But out of laziness and risk of breaking things in | prod, we haven't tried experimenting with much else. | tstack wrote: | lnav has some support for tailing files on remote hosts -- | https://lnav.org/2021/05/03/tailing-remote-files.html | | It works by uploading a stub that is an "actually portable | executable" to the remote over ssh. lnav then talks to the | stub to sync files back to the local host. I don't know the | scale of the hosts or files, but if they're not too big, it | might work fine. | UI_at_80x24 wrote: | Run that in a tmux session. Problem solved. | clumsysmurf wrote: | Interesting! Do you know how it compares to multitail? | (https://www.vanheusden.com/multitail/) They look very similar. | tstack wrote: | I don't think multitail really understands logs like lnav | does, it's just following the last lines in the file. 
For | example, if you try to follow multiple files in multitail | like so: $ multitail /var/log/install.log | -I /var/log/system.log | | You get a view with the tail from one file followed by the | tail from the other; they are not collated by timestamp. In | contrast, if you do the same thing in lnav: | $ lnav /var/log/install.log /var/log/system.log | | You will get a single view with all of the log messages from | both files and they're sorted by their timestamps. Here is | all of what lnav is doing: * Monitoring | files/directories for updates * Decompressing | files/archives * Detecting the log format for each file | * Creating SQLite vtables that provide access to log messages | * Building an index of all the log messages in all the files so | you can jump to a certain point in time or to the | next/previous error message. * Displaying all log messages | with syntax highlighting | aargh_aargh wrote: | I migrated from multitail to lnav. Turned out to be a no- | brainer. | | I second the above, just one pain point with multitail to | add. I often page/search/filter in the scrollback buffer (I | typoed "bugger" - Freudian slip?) and in multitail the | scrollback is a separate window with a frame and | everything, which is a pain (copying whole lines using the | mouse includes the frame, ugh). The filtering/searching | is a separate pain. | | One thing I used in multitail and am not sure I migrated | wholly to lnav for was log file syntax highlighting using | regexes. | oefrha wrote: | Huh, I almost posted a duplicate recommendation. My only | complaint with lnav was that it had to be built from source on | Linux and the build was frigging huge. Apparently they have a | pre-compiled linux-musl binary now. | amarshall wrote: | > that it had to be built from source on Linux | | Seems to be packaged quite widely | https://repology.org/project/lnav/versions | hermitcrab wrote: | Our Easy Data Transform software is intended for data wrangling | and desktop ETL.
But it has lots of features useful for | browsing and manipulating log files, including: | | * powerful filtering (including with regex) | | * smooth scrolling of millions of rows of data | | * support for csv, text, xml, json, Excel and other formats | | * available for Windows and Mac | dj_gitmo wrote: | I love lnav and use it constantly, but it crashes a lot. I do | wish there was something like lnav that was a little simpler to | use, and written in a more resilient way that crashed less. I | can cut lnav some slack for the crashes because identifying and | parsing arbitrary log formats seems like a messy problem. Still, | it shouldn't crash 1/3rd of the time I use it. | heinrichhartman wrote: | Looks great, but I struggle to get this working with the two | main log sources on my system: | | 1) systemd logs (binary data stored at /var/log/journal) | | 2) docker logs (from various containers running on the host) | | Any hints on getting this working? I don't see this covered in | the docs (https://docs.lnav.org/en/latest/formats.html). | ljw1004 wrote: | I wrote https://github.com/ljw1004/seaoflogs - an interactive | filtering tool, for similar ends to what's described here. I | wrote it because my team was struggling to analyze LSP logs | (that's the protocol used by VSCode to communicate with language | servers). But I made it general-purpose, able to analyze other | log formats too - for instance, we want to correlate LSP logs with | server logs and other traffic logs. | | (1) I wanted something where colleagues could easily share links | in workplace chat with each other, so we could cooperatively | investigate bugs. | | (2) For LSP we're often concerned with responsiveness, and I | thought the best way to indicate times when viewing a log is with | whitespace gaps between log messages in proportion to their time | gap. | | (3) For LSP we have lots of interleaved activity going on, and I | wanted to have visual "threads" connecting related logs.
| | (4) As the post and lnav say, interactivity is everything. I | tried to take it a step further with (1) javascript, (2) | playground-style updates as you type, (3) autocomplete which | "learns" what fields are available from structured logs. | | My tool runs all in the browser. (I spent effort figuring out how | people can distribute it safely and use it for their own | confidential logs too). It's fast enough up to about 10k lines of | logs. | unwind wrote: | That sounds really interesting, although too limited for my | current work focus, which deals with Really Big logs. | | I tried the demo link; it rendered a lonely @ at the top left | of the page, the rest was blank. The console said: | Uncaught TypeError: [].filter(...).reverse()[0] is undefined | onload https://ljw1004.github.io/seaoflogs/demo.html:14 | | This was in Firefox 98.0 in Windows. | Manjuuu wrote: | > Often log lines will include a request ID. | | Yes, always include a request id in every request structure you | create, include it also in the response, and print it. You'd | think it's something obvious that everyone does by default, but | no, it seems it's not. | janci wrote: | Not so obvious. How to implement it without passing the request id | to all the functions, when they are unrelated to request/http? | Especially in languages without thread-locals such as | javascript? | yakshaving_jgt wrote: | You can stick it in a middleware which wraps the request | lifecycle. | gh123man wrote: | Many frameworks solve this with logger context. Add the | properties you want to the logging context and all future | logs in that context will have that property. | | One simple example - Serilog enrichment | https://github.com/serilog/serilog/wiki/Enrichment | | This does, however, assume your entire application uses the | same logger. But this is generally a requirement in services | that require tracing like this.
| mrloba wrote: | Nodejs now has async local storage built in, which makes this | fairly easy. | 29athrowaway wrote: | Your friends | | - grep (or ripgrep), sort, uniq, head, tail, cut, colrm, tr | | - diff | | - jq | | - bat | | - ministat | | - less, more, percol | | - Facebook Path Picker (fpp) | Severian wrote: | My tips: | | 1) Fuck grep, use ripgrep, especially if you have to scour over | an entire directory. | | 2) Get good with regex, seriously, it will shave hours off your | searching. | | 3) For whatever application you are using, get to know how the | logging is created. Find the methods used where said logs are | made, and understand why such a log line exists. | | 4) Get good with piping into awk when you need some nice | readable output. | nerdponx wrote: | Piping into AWK feels like a misuse of the tool in all but the | simplest of cases. Don't forget that you can write complete AWK | scripts and invoke them from a file! | geocrasher wrote: | One thing I didn't see was how to use grep to view the lines | before and after a match: grep regex | /var/log/logfile -A5 #To view the next 5 lines grep regex | /var/log/logfile -B5 #To view the previous 5 lines grep | regex /var/log/logfile -C5 #To view the 5 lines before *and* | after the match | | This is super handy to find out what happened just before a | service crashed, for example. | danjc wrote: | We've added a log tailing feature into our product UI which also | has a basic find/filter. It's been enormously useful for cases | where something weird happens as you can immediately access the | last few mins of logs. | terom wrote: | Re timing, logs like nginx access logs have their timestamp from | when the request completed, not when the request came in. That's | a significant difference for long-duration (~10s+) requests, and | matters when trying to correlate logs or metrics to a request.
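If you control the log format, you can recover the start time by also logging the duration. A hedged awk sketch, assuming a hypothetical custom format whose first two fields are nginx's $msec (epoch seconds at completion) and $request_time (duration in seconds):

```shell
# Hypothetical access log: completion epoch, duration, method, path
printf '1670950000.500 12.300 GET /slow\n1670950001.000 0.010 GET /fast\n' > access.log

# start = completion - duration; print both so events can be ordered either way
awk '{ printf "start=%.3f end=%s %s %s\n", $1 - $2, $1, $3, $4 }' access.log
```

Sorting on the computed start column then lines requests up against the metrics and upstream logs that were stamped when the request arrived.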
| trashcan wrote: | Also, sometimes logs stretch across multiple lines, and the other | lines won't have the identifier you are searching for. For | example, Java stack traces. In that case, if you are stuck parsing | unstructured logs, the simplest thing to do is to look at the | entire file and search for the timestamp of the first line you | found. | par wrote: | My tips for analyzing logs: | | cd ~/logs | | cat * | grep <my string> | mxuribe wrote: | I guess it depends, but would you not encounter issues like | permission blocks, or there being so much content in the logs | that it is slow to find what you need? ___________________________________________________________________ (page generated 2022-12-13 23:00 UTC)