[HN Gopher] The Unix process API is unreliable and unsafe (2021)
___________________________________________________________________

Author : todsacerdoti
Score  : 107 points
Date   : 2023-03-22 17:41 UTC (5 hours ago)

(HTM) web link (catern.com)
(TXT) w3m dump (catern.com)

| dataflow wrote:
| It seems there isn't even anything written about FD_CLOEXEC and
| its associated race conditions either, as far as I can tell.
| Basically it's impossible to portably spawn a subprocess in a
| safe manner if you don't have sufficient control over all the
| code running in your process, because you might duplicate file
| descriptors into the child that you might not have intended, and
| that can break things in the parent.

| rwmj wrote:
| AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but the
| possibility that some library might not be using it? (That is
| to say, *_CLOEXEC if used does not have race conditions)
|
| However we usually cope with that by closing all
| unknown/unexpected file descriptors after fork and before exec.
| Linux even has a system call to make that easier:
| https://man7.org/linux/man-pages/man2/close_range.2.html

| dataflow wrote:
| > AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but
| the possibility that some library might not be using it?
|
| Not exactly. The problem is that you have to be able to set
| it atomically from the creation of the file descriptor.
| Setting it after creation is subject to a race condition
| where a fork occurs in the interim. There's no portable way
| to do that, and people often ignore O_CLOEXEC even when
| there's a platform-dependent way to pass it. (How often do
| you see dup3() called, for example? And how often do you see
| higher-level languages and libraries expose this and force
| callers to make a conscious decision?)
|
| > However we usually cope with that by closing all
| unknown/unexpected file descriptors after fork and before
| exec.
|
| You can't really do that portably (well, maybe unless you
| want to call close() billions of times). And even if/when you
| _can_ do that, you run into the reverse problem, where you
| might close descriptors that were supposed to be duplicated
| into the subprocess but that you didn't know about. (One
| example is when a user performs a redirect inside a shell like
| 2>&3 and wants it to work inside a descendant process - you
| don't want to just randomly close FDs you don't recognize.)

| deathanatos wrote:
| > _1.1.4 A should run B inside a container_
|
| I think the author knows this, but you don't have to start a
| full-blown container if all you want is to solve the article's
| stated problem of process leaks. Become a new pid NS: point 1,
| the subprocess.run criticism is fixed (it just works); point 2, I
| don't believe a pid NS requires either root or a user NS; and all
| that remains is point 3. It doesn't _require_ you to start a
| separate init, you can _be_ the init, i.e., whatever your
| top-level service is. IIRC, the only two requirements are
| handling SIGTERM (which you should probably already be doing) and
| reaping reparented orphans who then die. But also dumb-init is
| available? The article notes using a separate init, too: "This
| init process will do nothing but increase the load on the system,
| and it will prevent us from directly monitoring the started
| processes." and ... no? dumb-init, in a container I have here
| that's run for >2 weeks, has used < 20 ms of CPU time. RSS of
| 522 KiB. You'll be fine. I'm not sure how it "will prevent us
| from directly monitoring the started processes" -- it would live
| _above_ you in the process tree. You'd monitor the started
| process the same way you would any started process.
|
| Edit: ah, crap, I've got it wrong. A new PID NS requires root (or
| user NSes); being a subreaper, I think, maybe does not.
| But I'm not sure being a subreaper is sufficient; you want the
| subtree reaped on the subtree root's death.
|
| (I'm also not sure that the subreaper approach is sufficient: if
| the subreaper itself dies, the processes leak.)

| mike_hock wrote:
| The subreaper is also gonna have the same footprint as the
| pidns init, and is _more complicated._
|
| It's just as flawed a solution as the other flawed solutions.
| We can accept the subreaper being bug-free as a requirement for
| this workaround to be working, but we can't prevent it from
| being sigkilled.

| jiveturkey wrote:
| Too bad the article doesn't discuss contracts, the Solaris
| solution. As the article is very linux focused, I imagine the
| author is blissfully unaware.

| jamesdutc wrote:
| I recently wrote an autorunner[1] (like Entr[2] and Watchexec[3])
| so I have some recent exposure to this problem. (I will be
| releasing it on Github shortly.) My autorunner allows running
| interactive programmes, so it is very sensitive to lingering
| child processes.
|
| For the purposes of the autorunner, I use approach 1.1.3 ("always
| write down the pid of every process you start, or otherwise
| coordinate between A and B") and leave it to the user to figure
| out what happens if the child process misbehaves with relation to
| any processes it starts.
|
| However, I want to point out that approach 1.1.4 ("A should run B
| inside a container") is easier to do than one might expect, and
| I'd like to plug one of my favourite utilities--Bubblewrap[4].
| The Bubblewrap documentation says "[y]ou are unlikely to use it
| directly from the commandline, although that is possible" but I
| have built some amazing little tools from it.
|
| Try the following invocation:
|
|     bwrap --ro-bind / / --proc /proc --unshare-pid ps
|
| This launches `ps` in a PID namespace with a new `/proc` (since
| `ps` will read from the host proc otherwise) and the root
| filesystem mounted readonly.
| Any processes within the PID namespace should have been created
| by the immediate command that `bwrap` launched. There are also
| flags `--die-with-parent` and `--as-pid-1` which can further
| reduce runtime overhead. If you really need a supervisor process,
| this can be as simple as a `/bin/sh` script that runs
| `kill TERM --timeout 1000 KILL` in a loop on everything it sees
| in `ps`.
|
| As you can see, there's a lot you can do with this tool with
| significantly lower overhead than using Docker. It has been my
| goal for some time to extract some of the functionality of
| Bubblewrap into a Zsh extension to allow accessing these
| mechanisms with even lower overhead. I think the creation of
| namespaces is a missing primitive in Linux shells, and being able
| to quickly construct namespaced environments allows for a style
| of safe, robust, simple shell scripting. E.g., if you create a
| mount namespace to run your script, you can actually be looser
| about parameterising file locations (since the namespace can
| ensure everything is exactly where you want it to be).
|
| [1] https://fosstodon.org/@dontusethsicode/110019380909461936
|
| [2] http://eradman.com/entrproject/
|
| [3] https://watchexec.github.io/
|
| [4] https://github.com/containers/bubblewrap

| jrootabega wrote:
| Looks interesting. Have you needed or found any good ways to
| detach the wrapped code from the terminal where you first
| launch the wrapper? (for security mostly) I haven't found a
| good way to do that with bwrap other than using sudo or su and
| their pty feature. bwrap's --new-session flag didn't play nice
| with interactive programs in my attempts.

| ary wrote:
| This links one of my favorite critiques of API design: 'A fork()
| in the road'
|
| https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
|
| It's very much worth a read.

| 1vuio0pswjnm7 wrote:
| "I only know one existing solution that fixes all these problems
| without sacrificing flexibility or generality.
|
| Use the C utility supervise to start your processes; for Python,
| you can use its associated Python library."
|
| C utility written in 1999. Last updated in 2001. I'm still using
| it every day, not always with multilog and svscan.

| evilotto wrote:
| Is basic fork/exec from a large process still slow or have newer
| APIs fixed that?

| wmf wrote:
| I kept expecting Capsicum to step from behind the curtain but no.

| loeg wrote:
| Capsicum is about sandboxing code in the same process, not
| really related to the problem the article is talking about.
| FreeBSD's somewhat related mechanism to Linux pidfd is pdfork /
| pdkill:
| https://man.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2&n...

| dataangel wrote:
| > Shell scripts make starting processes trivial, but it's almost
| unthinkable that, say, bash, would integrate functionality for
| starting containers, so that every process is started in a
| container.
|
| Doooooooo it

| edgyquant wrote:
| Is it me or does this not make sense? Bash glues and pipes
| together commands, has network access etc. Every process being
| a container would require either knowing all commands and being
| able to ensure containers have proper access (even across
| pipes) or that containers were so open as to defeat the
| purpose.

| lozenge wrote:
| The problem is every executable can impersonate the user, it
| has access to do anything the user can do, including deleting
| or encrypting all their files, reading ssh private keys etc.
| Network access is rarely concerning unless the program has
| access to credentials.

| Karellen wrote:
| > The problem is every executable can impersonate the user,
|
| Um, what?
|
| What do you mean by "impersonate" here? What does a process
| that does not impersonate the user look like? Do you just
| mean "executables that run as the user"?
|
| When you log in, and a shell is started that runs as you,
| is that shell impersonating the user?
|
| When you execute commands, as yourself, those commands run
| with your credentials. Because you ran them. Isn't that,
| like, the point?

| dllthomas wrote:
| Typically, any program I run has the totality of my
| (regular user) authority, which may let it do things I
| did not intend.
|
| Related:
|
| https://en.wikipedia.org/wiki/Ambient_authority
|
| https://en.wikipedia.org/wiki/Confused_deputy_problem
|
| https://en.wikipedia.org/wiki/Object-capability_model

| nyrikki wrote:
| Nothing is stopping you from using namespaces, and
| containers are just namespaces with cgroups etc.
|
| But containers aren't jails, pid and uid remapping is just
| remapping.
|
| A huge problem is that containers have to drop capabilities
| on the honor system. In the default docker mode, running as
| root, anyone who can launch a container can read from any
| block device if they don't drop the mknod capability, as an
| example.
|
| Actually a privileged container can update the BIOS or even
| load arbitrary kernel modules in the host context or change
| kernel parameters, as it is a shared kernel.
|
| I tried to get the docker folks to add a conf option to
| disallow privileged containers but they refused.
|
| You can run in user mode now but most people want
| persistence and other features that don't allow for that.
|
| The important point is if you assume containers are a
| security feature you are going to have a bad time. Jails
| were bad enough and containers are just one step up from
| chroots as far as security goes.
|
| Namespace isolation is the main benefit of containers.
|
| SELinux and AppArmor are far more appropriate than
| containers for the security concerns. While I don't
| personally like SELinux, AppArmor profiles are pretty easy
| to write.

| nyrikki wrote:
| Plus the 'leaks' in the Linux process API are even worse,
| as each container may run its own tiny-init.
|
| Containers make the first point of the OP far worse by
| adding way more pid namespaces.

| wmf wrote:
| Maybe cgroups would be better than full containers here.

| GauntletWizard wrote:
| Which cgroups? Containers are not actually a thing in
| kernel-land. They're a combination of Process, Network,
| User, and other namespacing.

| wmf wrote:
| No, cgroups are a separate API from namespaces.
| https://man7.org/linux/man-pages/man7/cgroups.7.html

| mattpallissard wrote:
| Done.
| https://pallissard.net/2022/06/27/limiting_application_resou...
|
| TL;DR: two functions, "dispatch", which calls systemd-run, and
| "wrap", which takes a command, a memory limit, and a CPU limit.

| nine_k wrote:
| systemd is not bash. Otherwise indeed true.

| nickdothutton wrote:
| It is after reading pieces like these that I'm reminded of how
| fortunate I am to have had experience of other "serious"
| Operating Systems, used at scale, in complex and sometimes
| unfriendly environments. Namely VAX/VMS. Although some might feel
| the title was a little clickbaity, I enjoyed the article.

| DeathArrow wrote:
| VMS was released for x86, so if you miss it you can give it a
| spin.
|
| https://vmssoftware.com/about/news/2022-07-14-openvms-v92-fo...

| skissane wrote:
| Thus far the x86 port is only available to paying customers.
| The x86 hobbyist program is expected very soon now (within the
| next few days/weeks). Until then, the best x86 option for
| hobbyist use is probably running the Alpha version under an
| emulator. (I don't know if any Itanium emulators are
| available.) Or emulated VAX: OpenVMS for VAX is no longer
| legally available to hobbyists, but is not hard to find if you
| don't care about the legalities of it.

| gtirloni wrote:
| Is Fuchsia any better for what the article is concerned about?
|
| https://fuchsia.dev

| sitkack wrote:
| Excellent article! Thanks for posting it. It outlines all the
| problems and then offers a solution with this tool (by the
| author)
|
| https://github.com/catern/supervise

| cryptonector wrote:
| Yup.
| PIDs are racy unless they belong to your direct child processes
| and you've not reaped them yet. And it goes on.
|
| Windows has a much better process API, except for CreateProcess()
| (the less said about which the better).
|
| One thing I generally do when I have a multi-process program (one
| that starts multiple worker processes, say), is to have a pipe
| with the write end only in the parent process and whose read end
| the children include in their I/O event loops. That way when the
| parent exits the children find out and then they too exit. The
| parent will still try to signal them, but say the parent gets
| `SIGKILL`ed: the children find out and they exit.

| monocasa wrote:
| pidfds solve some of those problems.

| cryptonector wrote:
| Indeed, they do.
|
| One can approximate pidfd in multi-processed programs on OSes
| that lack it, but that's about it. pidfd needs to be
| first-class.

| rand_flip_bit wrote:
| Curious why you think CreateProcess is worse than fork/exec.
| Sure it takes about a dozen parameters but is that really the
| end of the world?!? It's much much easier to use correctly and
| doesn't have nearly as many of the pitfalls as fork/exec.
| Especially in large processes with lots of memory allocated. I
| genuinely don't understand why people dislike it so much.

| jborean93 wrote:
| Most of the complaints I've seen are about the number of args
| and the complexity of calling it vs something simple like
| fork. There are a lot of knobs to turn which you need to be
| explicit about. That's not even getting into the whole
| ProcThreadAttributeList and the myriad of options it exposes.
|
| In saying all that I do prefer the `CreateProcess*` APIs on
| Windows vs the POSIX ones but that might be because I
| understand the former better.

| bolangi wrote:
| Where process supervision is required under Unix, you can use
| systemd, the Linux-only solution pushed by Red Hat, or one of the
| small supervision suites such as s6 developed by skarnet.org.
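
cryptonector's pipe trick from earlier in the thread (the parent keeps the only write end of a pipe, and each worker watches the read end) can be sketched in Python roughly like this. It is a minimal sketch under stated assumptions, not anyone's production code: `wait_for_parent_exit` and `demo` are hypothetical names, the second "report" pipe exists only so the demo can observe what the worker did, and it assumes a Unix where `os.fork` is available.

```python
import os
import select
import signal


def wait_for_parent_exit(life_r):
    """Block until every write end of the lifeline pipe is closed."""
    while True:
        # A real worker would register life_r in its existing event loop.
        readable, _, _ = select.select([life_r], [], [])
        if readable and os.read(life_r, 1) == b"":
            return  # EOF: the parent exited or was killed


def demo():
    """Kill a parent with SIGKILL and report what its worker did."""
    report_r, report_w = os.pipe()  # demo-only channel: worker -> caller
    parent = os.fork()
    if parent == 0:
        os.close(report_r)
        life_r, life_w = os.pipe()  # the lifeline pipe
        if os.fork() == 0:
            # Worker: drop the write end, or EOF would never arrive.
            os.close(life_w)
            os.write(report_w, b"R")  # tell the caller we are set up
            wait_for_parent_exit(life_r)
            os.write(report_w, b"parent gone")
            os._exit(0)
        # Parent: keep life_w open for as long as we live.
        os.close(life_r)
        os.close(report_w)
        while True:
            signal.pause()
    os.close(report_w)
    assert os.read(report_r, 1) == b"R"  # wait until the worker is ready
    os.kill(parent, signal.SIGKILL)  # the parent gets no chance to clean up
    os.waitpid(parent, 0)
    message = os.read(report_r, 100)
    os.close(report_r)
    return message


if __name__ == "__main__":
    print(demo())
```

The worker learns about the parent's death from the kernel closing the parent's descriptors, so it works even though the parent is killed with SIGKILL. The catch is the same one the thread keeps returning to: any other process that inherits the write end and forgets to close it keeps the pipe open, and the workers never see EOF.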

| slondr wrote:
| What happens when s6 crashes, then?

| [deleted]

| aidenn0 wrote:
| Do cgroups solve any of these problems? I was mildly surprised to
| not see them mentioned.

| wmf wrote:
| Where the author talks about containers you can mentally
| substitute cgroups since Linux containers are cgroups +
| namespaces.

| rcoveson wrote:
| That's how I look at it too, but lots of people don't look at
| it that way, hence all the handwaving about "too heavyweight"
| and "seems like overkill" etc.
|
| Largely because of Docker and Kubernetes, many think of a
| container as _all_ of the following:
|
| 1. A cgroup + [all or nearly all of the] unshare-able
| namespaces
|
| 2. A writable, disposable overlay on top of an immutable
| "image", which may be lazily downloaded and extracted
|
| 3. A resource managed by a userspace daemon managed by a
| userspace utility over a socket
|
| 4. Optionally, a seccomp-bpf filter or apparmor profile or
| something
|
| But there's a whole useful spectrum between a vanilla process
| and a Docker container like that. Lots of points on that
| spectrum still feel highly container-ized but aren't really
| much more heavyweight than a vanilla process.
|
| Beyond that, in the point about PID namespaces, the author
| should mention that there are ultra-light-weight init
| implementations that are barely a factor in overhead.

| jwilk wrote:
| They were mentioned in 2.1.1.

| not_enoch_wise wrote:
| you're exciting me

| userbinator wrote:
| Is "unreliable and unsafe" the new "considered harmful"? Because
| it sure feels like that.

| calt wrote:
| I think it's quite a bit more descriptive and objective than
| "considered harmful."

| wang_li wrote:
| This reads like they have a set of requirements and since the
| Unix model doesn't meet their requirements, the Unix model is
| bad. As opposed to it's fine for those who have different
| requirements.

| kccqzy wrote:
| Note that this article is really talking about the general case,
| but in practice a lot of techniques can work if you have narrower
| requirements or if you have more control over what you run.
|
| For example, in 1.1.4 the author talks about why containers are
| not a solution, giving three distinct reasons. But if we change
| our perspective a little bit, none of the three reasons are
| blocking. The first is that it's not easy; but `docker run` or
| `podman run` is easy. Even systemd units start with separate
| control groups to allow you to terminate everything at once. The
| second reason was about gdb; when was the last time you used gdb
| in production? If you are using gdb someone is interactively
| using the computer and can be relied upon to clean up processes
| manually. The third reason is that containers are more
| heavyweight, but there's no need to make every subprocess a
| separate container: if multiple processes should be managed as a
| single unit (including the case when we'd want to terminate a
| whole group of processes) they should run in the same container.
|
| So with a slight change of perspective we find the problem easily
| solved. It has trade-offs but it works well enough in practice
| that only very few purists have a problem with it. Not to diss on
| the author--I think this type of perfectionist thinking is
| illuminating in terms of API design--but pragmatically it's a
| solved problem.

| wahern wrote:
| The same is true for the process group/session + controlling
| terminal solution: the solution doesn't work recursively (can't
| do process management downstream), and it also requires child
| processes to abstain from changing the SIGHUP handler or mask,
| but in the vast majority of cases none of those limitations are
| a problem. Combined with POSIX fcntl locks[1] on a PID file,
| this is my go-to generic solution for Unix-portable[2],
| multiprocess daemons.
| The amount of code required in the supervisor component is
| quite trivial, yet covers almost all of your bases.
|
| [1] fcntl locks permit querying the PID of the lock holder, so
| you don't need to write the PID to the file, providing a
| solution to the PID file race and loaded gun dilemmas. (There's
| still a race, but the same race exists with Linux containers,
| and both can be resolved in a similar manner--query PID, send
| SIGSTOP, verify PID association, send SIGKILL or SIGCONT.)
|
| [2] One of the crucial behaviors, that the kernel atomically
| sends SIGHUP to all processes in the group if the controlling
| process terminates, isn't guaranteed by POSIX, but it's the
| behavior on every Unix I've tried--AIX, FreeBSD, macOS, Linux,
| NetBSD, OpenBSD, and Solaris.

| kwhitefoot wrote:
| That was interesting and clearly written, I wish all such
| articles were as clear.
___________________________________________________________________
(page generated 2023-03-22 23:00 UTC)