       # Monitoring crashed services on OpenBSD
       
       Last modification on 2024-04-23
       
       In my homelab I am running quite a few OpenBSD machines (virtual and physical).
       For monitoring everything, I have settled on prometheus-style monitoring,
       specifically VictoriaMetrics.
       
       One thing that was still missing was monitoring failed services on
       OpenBSD hosts. Usually I run `rcctl ls failed` regularly to check whether any
       service has crashed. However, that turned out to be quite inefficient
       for a large number of hosts in the long run.
       
       Today I noticed that I hardly make use of prometheus-node-exporter's
       *textfile-collector* in my infrastructure, which turned out to be the perfect
       solution for this problem.
       
       First, a simple shell script outputs OpenMetrics (prometheus-style)
       metrics, one line for each failed service. I placed it in
       `/etc/node_exporter/textfile_collector/crashed_services.sh`. Don't forget to
       give it executable permissions.
       
       ```
       #!/bin/sh
       
       echo "# HELP node_rc_unit_state List of crashed OpenBSD rc services (1 = crashed)"
       echo "# TYPE node_rc_unit_state gauge"
       # Emit one sample per failed service; no failed services means no samples.
       rcctl ls failed | while read -r service; do
         echo "node_rc_unit_state{state=\"failed\",name=\"$service\"} 1"
       done
       ```
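
       To check the output format on a machine without any failed services (or
       without `rcctl` at all), the loop can be pulled into a function that reads
       service names from standard input. This is only a sketch; the function name
       `format_failed` and the sample service names are made up for illustration.

       ```sh
       #!/bin/sh

       # Hypothetical helper: reads one service name per line from stdin and
       # prints one metric line per name, in the same format as the script above.
       format_failed() {
         while read -r service; do
           # Skip empty lines so that empty input produces no samples.
           [ -n "$service" ] || continue
           echo "node_rc_unit_state{state=\"failed\",name=\"$service\"} 1"
         done
       }

       # Example input; on a real host this would be: rcctl ls failed | format_failed
       printf 'httpd\nunbound\n' | format_failed
       ```

       With the two sample names, this should print one `node_rc_unit_state` line
       for `httpd` and one for `unbound`, each with the value 1.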
       
       Afterwards, set the flags of the prometheus-node-exporter to enable the
       textfile collector and point it at the directory it should read from:
       
       ```
       # rcctl set node_exporter flags --collector.textfile.directory=/var/lib/node_exporter/textfile_collector/
       ```
       
       To atomically write the file (which node-exporter then reads on every scrape),
       I made use of *sponge* from *moreutils*. (I'm lazy; writing to a temporary
       file and moving it into place would have done it as well.)
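
       For reference, the copy/move variant could look roughly like the sketch
       below. It assumes `mktemp` is available and creates the temporary file next
       to the target, so that the final `mv` is a rename within one filesystem. The
       function name `write_atomically` and the demo path are made up for
       illustration.

       ```sh
       #!/bin/sh

       # Hypothetical helper: writes stdin to the file given as $1 by first
       # filling a temporary file in the same directory and then renaming it.
       write_atomically() {
         target="$1"
         tmp="$(mktemp "${target}.XXXXXXXXXX")" || return 1
         # Collect all of stdin; clean up the temporary file if that fails.
         cat > "$tmp" || { rm -f "$tmp"; return 1; }
         # mktemp creates files with mode 0600; make the result world-readable
         # in case node_exporter runs as a different user than the cron job.
         chmod 0644 "$tmp"
         # A rename within one filesystem replaces the target in a single step.
         mv -f "$tmp" "$target"
       }

       # Demo with a throwaway directory instead of the real collector directory.
       demo_dir="$(mktemp -d)"
       echo 'node_rc_unit_state{state="failed",name="httpd"} 1' \
         | write_atomically "$demo_dir/crashed_services.prom"
       cat "$demo_dir/crashed_services.prom"
       rm -rf "$demo_dir"
       ```

       With such a helper, the `| sponge FILE` part of a pipeline would become
       `| write_atomically FILE`. A reader never sees a half-written file, because
       the target only changes at the moment of the rename.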
       
       Then add a simple cronjob that runs the script and pipes its output to
       *sponge*, writing to the directory specified in the node-exporter flags above:
       
       ```
       */15 * * * * /etc/node_exporter/textfile_collector/crashed_services.sh | sponge /var/lib/node_exporter/textfile_collector/crashed_services.prom
       ```
       
       Finally, here is a simple alert rule for Prometheus/VictoriaMetrics etc. that
       notifies you in time when a service has crashed:
       
       ```
       - alert: HostRcServiceCrashed
         expr: '(node_rc_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
         for: 0m
         labels:
           severity: warning
         annotations:
           summary: Host rc service crashed (instance {{ $labels.instance }})
           description: "openbsd rc service crashed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
       ```
       