# Monitoring crashed services on OpenBSD

Last modification on 2024-04-23

In my homelab I am running quite a few OpenBSD machines (virtual and physical). For monitoring everything, I have settled on Prometheus-style monitoring, specifically VictoriaMetrics.

One thing that was still missing was monitoring failed services on OpenBSD hosts. Usually I run `rcctl ls failed` regularly to check whether any service has crashed. However, that turned out to be quite inefficient for a large number of hosts in the long run.

Today I noticed that I hardly make use of prometheus-node-exporter's *textfile-collector* in my infrastructure, which turned out to be the perfect solution for this problem.

First, a simple shell script outputs OpenMetrics (Prometheus-style) metrics, one line for each failed service. I placed it in `/etc/node_exporter/textfile_collector/crashed_services.sh`. Don't forget to give it executable permissions.

```
#!/bin/sh

# Metric metadata; always emitted, even when no service has failed
echo "# HELP node_rc_unit_state List of crashed OpenBSD rc services (1 = crashed)"
echo "# TYPE node_rc_unit_state gauge"

# One sample per failed service, labelled with the service name
rcctl ls failed | while read -r service; do
    echo "node_rc_unit_state{state=\"failed\",name=\"$service\"} 1"
done
```

Afterwards, set the flags of the prometheus-node-exporter to enable the textfile collector:

```
# rcctl set node_exporter flags --collector.textfile.directory=/var/lib/node_exporter/textfile_collector/
```

To atomically write the file (which is then read periodically by node-exporter), I made use of *sponge* from *moreutils*. (I'm lazy; a few copy/move operations would have done it as well.)
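The copy/move alternative to *sponge* mentioned above could look roughly like the following sketch. The helper name `write_atomically` and the temp file naming are my own assumptions, not part of the original setup. The idea is to write to a temporary file in the same directory and then rename it over the target, so node-exporter never reads a half-written file.

```shell
#!/bin/sh
# Sketch only: write stdin to a target file via temp file + rename.
# write_atomically is a hypothetical helper name.
write_atomically() {
    target="$1"
    dir="$(dirname "$target")"

    # The temp file lives in the target directory so that mv is a plain
    # rename on the same filesystem. Its name does not end in .prom, so
    # the textfile collector does not pick it up.
    tmp="$(mktemp "$dir/.tmp.XXXXXXXXXX")" || return 1

    # mktemp creates the file with mode 0600; relax it in case
    # node-exporter runs as a different user (an assumption).
    if cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"; then
        return 0
    fi

    # Clean up the temp file if any step failed
    rm -f "$tmp"
    return 1
}

# Example call (paths as used in this post):
# /etc/node_exporter/textfile_collector/crashed_services.sh |
#     write_atomically /var/lib/node_exporter/textfile_collector/crashed_services.prom
```

The rename is only atomic when the temp file and the target are on the same filesystem, which is why the temp file is created next to the target instead of in `/tmp`.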
Then add a simple cronjob calling the script with *sponge*, writing to the directory specified in the node-exporter flags above:

```
*/15 * * * * /etc/node_exporter/textfile_collector/crashed_services.sh | sponge /var/lib/node_exporter/textfile_collector/crashed_services.prom
```

Finally, here is a simple alert rule for Prometheus/VictoriaMetrics etc. to notify you when a service has crashed:

```
- alert: HostRcServiceCrashed
  expr: '(node_rc_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host rc service crashed (instance {{ $labels.instance }})
    description: "openbsd rc service crashed\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```
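Before relying on the alert, it can help to check what the cron job actually wrote. The following is a small sketch under the assumption that the file has exactly the line format produced by the script above. `list_failed_units` is a hypothetical helper name. It reads the metrics text on stdin and prints the name of each failed service.

```shell
#!/bin/sh
# Sketch only: extract the "name" label of every failed unit from
# the .prom file written by crashed_services.sh.
# list_failed_units is a hypothetical helper name.
list_failed_units() {
    # Match only sample lines in the exact format the script emits;
    # HELP/TYPE comment lines and anything else are skipped.
    sed -n 's/^node_rc_unit_state{state="failed",name="\([^"]*\)"} 1$/\1/p'
}

# Example call (path as used in this post):
# list_failed_units < /var/lib/node_exporter/textfile_collector/crashed_services.prom
```

If this prints nothing while `rcctl ls failed` lists services, the cron job or the output path is the first place to look.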