commented:
The meat of the article is of course what tripwires you can put into
your code, and what color the walls of the rabbit hole are.
However, if you are mainly interested in using the described Trip
tool, strace might be an alternative. For example, to simulate the
article's case of fork failing every second time:
$ strace -f -e fault=clone:when=1+2:error=ENOMEM -o /dev/null bash
bash: fork: Cannot allocate memory
$ date
Sat Aug 24 11:59:22 CEST 2024
$ date
bash: fork: Cannot allocate memory
$ date
Sat Aug 24 11:59:23 CEST 2024
$ date
bash: fork: Cannot allocate memory
$
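
To spell out how I read the when=1+2 part (first+step, counted from 1,
per the strace man page), here is a minimal Python sketch. The function
name fault_fires is made up for illustration; it only models the
schedule and does not call strace.

```python
def fault_fires(invocation, first=1, step=2):
    """Model of strace's when=first+step: True if the Nth matching
    syscall (1-based) would have the fault injected.
    Hypothetical helper, not part of strace."""
    if invocation < first:
        return False
    return (invocation - first) % step == 0


if __name__ == "__main__":
    # Which of the first six clone() calls would fail with when=1+2?
    print([n for n in range(1, 7) if fault_fires(n)])
```

That matches the transcript: the 1st, 3rd and 5th fork attempts fail,
and the ones in between succeed.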

This uses the ptrace mechanism rather than LD_PRELOAD, so it has
different pros and cons. For example, you have to name the actual
syscall rather than the libc function: fork() apparently is
implemented using the clone() syscall on my system, which is why the
command above injects the fault into clone.
And of course strace is dangerous (but I guess so is LD_PRELOAD).
Still, a handy tool. I stumbled upon these surprising strace features
just yesterday and wanted to share them.

  commented:
I’ve seen that article about strace before but never found out what
they mean by dangerous. AFAICT there’s nothing in the article that
explains that sentence.
Do you know?

  commented:
Good point; maybe “dangerous” is not the right word here. From my
understanding the main danger of strace is that it meddles directly
with the traced process, at least by sending it STOP and CONT signals.
And in general the ptrace interface provided by the kernel can do
basically anything to a process (read and write all of its memory,
including its machine code, and thereby also influence control
flow). I don’t know how many of the ptrace capabilities strace
actually uses, but generally ptrace is a very powerful and therefore
“dangerous” way to influence a process.
The linked article mainly mentions that strace is slow, which I guess
can be a real danger if used for production processes.
The part about strace meddling with the traced process is mentioned
under “Versus Advanced Tracers”:

There is a possible con: in the past, strace has had bugs which can
leave the target process, or its followed children, in the STOP state
(e.g., here, here). This could cause a serious production outage, as
the application is now frozen mid-flight. If you realize this
immediately and can fix it (kill strace, then kill -CONT the process),
then you may avoid a serious outage. However, you may still have
caused a burst of application requests with multi-second latency
(outliers), depending on how quickly you typed in the kill command.
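
To make the "frozen in STOP state, then kill -CONT" part concrete, here
is a small Python sketch that freezes a throwaway child process and
resumes it again. It does not involve strace or ptrace at all; the
child is just a stand-in for the application, and the function name
freeze_and_resume is made up for illustration. It assumes a Unix-like
system.

```python
import os
import signal
import subprocess
import sys


def freeze_and_resume():
    """Stop a child with SIGSTOP, confirm it is stopped, resume it with SIGCONT."""
    # Throwaway child standing in for the "application".
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(1)"]
    )
    # Freeze it, similar to the state a buggy tracer could leave behind.
    os.kill(child.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report the stop instead of waiting for exit.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)
    # Same effect as `kill -CONT <pid>` from a shell.
    os.kill(child.pid, signal.SIGCONT)
    exit_code = child.wait()
    return was_stopped, exit_code


if __name__ == "__main__":
    print(freeze_and_resume())
```

Until the SIGCONT arrives the child makes no progress, which is the
latency burst the quote is talking about.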

My understanding is that nowadays tools like perf or DTrace can
provide the same level of insight in a less dangerous way (they don’t
need to meddle directly with the process, but only “passively” look at
the syscalls inside the kernel). So that’s nicer. OTOH perf apparently
cannot easily influence the process the way strace -e fault or
strace -e inject can.

  commented:
The article is saying it’s dangerous in production because it will
alter the behavior of the program, notably by making it a lot slower.
By inserting innumerable pauses in the execution it might also trigger
issues due to locking or latency assumptions.