[HN Gopher] The Unix process API is unreliable and unsafe (2021)
       ___________________________________________________________________
        
       The Unix process API is unreliable and unsafe (2021)
        
       Author : todsacerdoti
       Score  : 107 points
       Date   : 2023-03-22 17:41 UTC (5 hours ago)
        
 (HTM) web link (catern.com)
 (TXT) w3m dump (catern.com)
        
       | dataflow wrote:
       | It seems there isn't even anything written about FD_CLOEXEC and
       | its associated race conditions either, as far as I can tell.
       | Basically it's impossible to portably spawn a subprocess in a
       | safe manner if you don't have sufficient control over all the
       | code running in your process, because you might duplicate file
       | descriptors into the child that you might not have intended, and
       | that can break things in the parent.
        
         | rwmj wrote:
         | AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but the
         | possibility that some library might not be using it? (That is
         | to say, *_CLOEXEC if used does not have race conditions)
         | 
         | However we usually cope with that by closing all
         | unknown/unexpected file descriptors after fork and before exec.
         | Linux even has a system call to make that easier:
         | https://man7.org/linux/man-pages/man2/close_range.2.html
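
The "close everything unknown between fork and exec" approach described above can be tried without writing the loop by hand. Python's `subprocess` module does this in the child when `close_fds=True` (its default), and `pass_fds` is an explicit allow-list. What follows is a minimal sketch assuming a POSIX system. The helper `fd_state_in_child` and the deliberately inheritable pipe are illustrative and not from the thread.

```python
import os
import subprocess
import sys

# Child program: report whether the fd number given in argv[1] is open.
CHILD_CODE = """
import os, sys
try:
    os.fstat(int(sys.argv[1]))
    print("open")
except OSError:
    print("closed")
"""


def fd_state_in_child(fd, **popen_kwargs):
    """Spawn a child and return 'open' or 'closed' for `fd` as the child sees it."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(fd)],
        stdout=subprocess.PIPE, text=True, check=True, **popen_kwargs,
    )
    return result.stdout.strip()


def demo():
    r, w = os.pipe()
    # Emulate a library that created a descriptor without FD_CLOEXEC.
    os.set_inheritable(w, True)
    try:
        return {
            "close_fds=False": fd_state_in_child(w, close_fds=False),
            "close_fds=True": fd_state_in_child(w, close_fds=True),
            "pass_fds": fd_state_in_child(w, pass_fds=(w,)),
        }
    finally:
        os.close(r)
        os.close(w)
```

Calling `demo()` shows the leaked descriptor reaching the child only when closing is disabled or when it is listed in `pass_fds`.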
        
           | dataflow wrote:
           | > AIUI the problem there is not FD_CLOEXEC/SOCK_CLOEXEC but
           | the possibility that some library might not be using it?
           | 
           | Not exactly. The problem is that you have to be able to set
           | it atomically from the creation of the file descriptor.
           | Setting it after creation is subject to a race condition
           | where a fork occurs in the interim. There's no portable way
           | to do that, and people often ignore O_CLOEXEC even when
           | there's a platform-dependent way to pass it. (How often do
           | you see dup3() called, for example? And how often do you see
           | higher-level languages and libraries expose this and force
           | callers to make a conscious decision?)
           | 
           | > However we usually cope with that by closing all
           | unknown/unexpected file descriptors after fork and before
           | exec.
           | 
           | You can't really do that portably (well, maybe unless you
           | want to call close() billions of times). And even if/when you
           | _can_ do that, you run into the reverse problem, where you
           | might close descriptors that were supposed to be duplicated
           | into the subprocess but that you didn't know about. (One
           | example is when a user performs redirect inside a shell like
           | 2>&3 and wants it to work inside a descendant process - you
           | don't want to just randomly close FDs you don't recognize.)
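
The difference between setting close-on-exec atomically and setting it after the fact can be sketched as follows. This assumes a platform that provides `O_CLOEXEC` and `F_DUPFD_CLOEXEC` (Linux is assumed here), and the function names are illustrative. Python itself creates non-inheritable descriptors by default, so the sketch undoes that for one descriptor to mimic what a plain C `open()` returns.

```python
import fcntl
import os


def has_cloexec(fd):
    """Return True if FD_CLOEXEC is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def open_cloexec(path):
    """Atomic: the flag is part of the open() call, so no window exists."""
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)


def dup_cloexec(fd):
    """Atomic duplicate, similar in spirit to dup3(..., O_CLOEXEC)."""
    return fcntl.fcntl(fd, fcntl.F_DUPFD_CLOEXEC, 0)


def set_cloexec_late(fd):
    """Two-step: a fork+exec in another thread between creation of the
    descriptor and this call would leak fd into the child."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def demo():
    a = open_cloexec(os.devnull)
    b = dup_cloexec(a)
    c = os.open(os.devnull, os.O_RDONLY)
    os.set_inheritable(c, True)  # mimic a descriptor created without the flag
    before = has_cloexec(c)      # this is the racy window
    set_cloexec_late(c)
    states = (has_cloexec(a), has_cloexec(b), before, has_cloexec(c))
    for fd in (a, b, c):
        os.close(fd)
    return states
```

The third element of the tuple returned by `demo()` is the state during the window the comment describes: the descriptor exists, but the flag is not yet set.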
        
       | deathanatos wrote:
       | > _1.1.4 A should run B inside a container_
       | 
       | I think the author knows this, but you don't have to start a
       | full-blown container if all you want is to solve the article's
       | stated problem of process leaks. Become a new pid NS: point 1,
       | the subprocess.run criticism is fixed (it just works); point 2, I
       | don't believe a pid NS requires either root or a user NS; and all
       | that remains is point 3. It doesn't _require_ you to start a
       | separate init, you can _be_ the init, i.e., whatever your
       | top-level service is. IIRC, the only two requirements are handling
       | SIGTERM (which you should probably already be doing) and reaping
       | reparented orphans who then die. But also dumb-init is available?
       | The article notes using a separate init, too:  "This init process
       | will do nothing but increase the load on the system, and it will
       | prevent us from directly monitoring the started processes." and
       | ... no? dumb-init, in a container I have here that's run for >2
       | weeks, has used < 20 ms of CPU time. RSS of 522 KiB. You'll be
       | fine. I'm not sure how it "will prevent us from directly
       | monitoring the started processes" -- it would live _above_ you in
       | the process tree. You'd monitor the started process the same way
       | you would any started process.
       | 
       | Edit: ah, crap, I've got it wrong. A new PID NS requires root (or
       | user NSes); being a subreaper, I think, maybe does not. But I'm
       | not sure being a subreaper is sufficient; you want the subtree
       | reaped on the subtree root's death.
       | 
       | (I'm also not sure that the subreaper approach is sufficient: if
       | the subreaper itself dies, the processes leak.)
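
For readers who have not used the subreaper mechanism mentioned above, here is a minimal Linux-only sketch. It calls `prctl(PR_SET_CHILD_SUBREAPER)` through `ctypes`, since the Python standard library has no wrapper. The constant value 36 comes from `<linux/prctl.h>`, and the helper names are illustrative. It demonstrates only that orphaned descendants are reparented to the subreaper without needing root. It does nothing about the caveat raised above that the descendants leak if the subreaper itself dies.

```python
import ctypes
import os
import struct
import time

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>


def become_subreaper():
    """Mark the calling process as a child subreaper (no privileges needed)."""
    libc = ctypes.CDLL(None, use_errno=True)
    args = [ctypes.c_ulong(v) for v in (1, 0, 0, 0)]
    if libc.prctl(PR_SET_CHILD_SUBREAPER, *args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def spawn_orphan(exit_code):
    """Fork a child that forks a grandchild and exits at once.

    Returns the grandchild's pid; the grandchild is orphaned and, with a
    subreaper in place, reparented to us instead of to pid 1.
    """
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:
            os.close(w)
            time.sleep(0.2)
            os._exit(exit_code)
        os.write(w, struct.pack("i", grandchild))
        os._exit(0)
    os.close(w)
    (grandchild_pid,) = struct.unpack("i", os.read(r, 4))
    os.close(r)
    os.waitpid(child, 0)  # reap the intermediate child
    return grandchild_pid


def reap(pid):
    """Wait for an adopted descendant and return its exit code."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Without the `become_subreaper()` call, `reap()` on the grandchild fails with `ChildProcessError`, because the orphan is reparented elsewhere.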
        
         | mike_hock wrote:
         | The subreaper is also gonna have the same footprint as the
         | pidns init, and is _more complicated._
         | 
         | It's just as flawed a solution as the other flawed solutions.
         | We can accept the subreaper being bug-free as a requirement for
         | this workaround to be working, but we can't prevent it from
         | being sigkilled.
        
       | jiveturkey wrote:
       | Too bad the article doesn't discuss contracts, the Solaris
       | solution. As the article is very linux focused, I imagine the
       | author is blissfully unaware.
        
       | jamesdutc wrote:
       | I recently wrote an autorunner[1] (like Entr[2] and Watchexec[3])
       | so I have some recent exposure to this problem. (I will be
       | releasing it on Github shortly.) My autorunner allows running
       | interactive programmes, so it is very sensitive to lingering
       | child processes.
       | 
       | For the purposes of the autorunner, I use approach 1.1.3 ("always
       | write down the pid of every process you start, or otherwise
       | coordinate between A and B") and leave it to the user to figure
       | out what happens if the child process misbehaves with relation to
       | any processes it starts.
       | 
       | However, I want to point out that approach 1.1.4 ("A should run B
       | inside a container") is easier to do than one might expect, and
       | I'd like to plug one of my favourite utilities--Bubblewrap[4].
       | The Bubblewrap documentation says "[y]ou are unlikely to use it
       | directly from the commandline, although that is possible" but I
       | have built some amazing little tools from it.
       | 
       | Try the following invocation:
       | 
       |     bwrap --ro-bind / / --proc /proc --unshare-pid ps
       | 
       | This launches `ps` in a PID namespace with a new `/proc` (since
       | `ps` will read from the host proc otherwise) and the root
       | filesystem mounted readonly. Any processes within the PID
       | namespace should have been created by the immediate command that
       | `bwrap` launched. There are also flags `--die-with-parent` and
       | `--as-pid-1` which can further reduce runtime overhead. If you
       | really need a supervisor process, this can be as simple as a
       | `/bin/sh` script that `kill TERM --timeout 1000 KILL` in a loop
       | on everything it sees in `ps`.
       | 
       | As you can see, there's a lot you can do with this tool with
       | significantly lower overhead than using Docker. It has been my
       | goal for some time to extract some of the functionality of
       | Bubblewrap into a Zsh extension to allow accessing these
       | mechanisms with even lower overhead. I think the creation of
       | namespaces is a missing primitive in Linux shells, and being able
       | to quickly construct namespaced environments allows for a style
       | of safe, robust, simple shell scripting. e.g., if you create a
       | mount namespace to run your script, you can actually be looser
       | about parameterising file locations (since the namespace can
       | ensure everything is exactly where you want it to be.)
       | 
       | [1] https://fosstodon.org/@dontusethsicode/110019380909461936
       | 
       | [2] http://eradman.com/entrproject/
       | 
       | [3] https://watchexec.github.io/
       | 
       | [4] https://github.com/containers/bubblewrap
        
         | jrootabega wrote:
         | Looks interesting. Have you needed or found any good ways to
         | detach the wrapped code from the terminal where you first
         | launch the wrapper? (for security mostly) I haven't found a
         | good way to do that with bwrap other than using sudo or su and
         | their pty feature. bwrap's --new-session flag didn't play nice
         | with interactive programs in my attempts.
        
       | ary wrote:
       | This links one of my favorite critiques of API design: 'A fork()
       | in the road'
       | 
       | https://www.microsoft.com/en-us/research/uploads/prod/2019/0...
       | 
       | It's very much worth a read.
        
       | 1vuio0pswjnm7 wrote:
       | "I only know one existing solution that fixes all these problems
       | without sacrificing flexibility or generality.
       | 
       | Use the C utility supervise to start your processes; for Python,
       | you can use its associated Python library."
       | 
       | C utility written in 1999. Last updated in 2001. I'm still using
       | it every day, not always with multilog and svscan.
        
       | evilotto wrote:
       | Is basic fork/exec from a large process still slow or have newer
       | apis fixed that?
        
       | wmf wrote:
       | I kept expecting Capsicum to step from behind the curtain but no.
        
         | loeg wrote:
         | Capsicum is about sandboxing code in the same process, not
         | really related to the problem the article is talking about.
         | FreeBSD's somewhat related mechanism to Linux pidfd is pdfork /
         | pdkill:
         | https://man.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2&n...
        
       | dataangel wrote:
       | > Shell scripts make starting processes trivial, but it's almost
       | unthinkable that, say, bash, would integrate functionality for
       | starting containers, so that every process is started in a
       | container.
       | 
       | Doooooooo it
        
         | edgyquant wrote:
         | Is it me or does this not make sense? Bash glues and pipes
         | together commands, has network access etc. every process being
         | a container would require either knowing all commands and being
         | able to ensure containers have proper access (even across
         | pipes) or that containers were so open as to defeat the
         | purpose.
        
           | lozenge wrote:
           | The problem is every executable can impersonate the user, it
           | has access to do anything the user can do, including deleting
           | or encrypting all their files, reading ssh private keys etc.
           | Network access is rarely concerning unless the program has
           | access to credentials.
        
             | Karellen wrote:
             | > The problem is every executable can impersonate the user,
             | 
             | Um, what?
             | 
             | What do you mean by "impersonate" here? What does a process
             | that does not impersonate the user look like? Do you just
             | mean "executables that run as the user"?
             | 
             | When you log in, and a shell is started that runs as you,
             | is that shell impersonating the user?
             | 
             | When you execute commands, as yourself, those commands run
             | with your credentials. Because you ran them. Isn't that,
             | like, the point?
        
               | dllthomas wrote:
               | Typically, any program I run has the totality of my
               | (regular user) authority, which may let it do things I
               | did not intend.
               | 
               | Related:
               | 
               | https://en.wikipedia.org/wiki/Ambient_authority
               | 
               | https://en.wikipedia.org/wiki/Confused_deputy_problem
               | 
               | https://en.wikipedia.org/wiki/Object-capability_model
        
             | nyrikki wrote:
             | Nothing is stopping you from using namespaces, and
             | containers are just namespaces with cgroups etc
             | 
             | But containers aren't jails, pid and uid remapping is just
             | remapping.
             | 
             | A huge problem is that a container has to drop capabilities on
             | the honor system. In the default docker mode, running as root,
             | anyone who can launch a container can read from any block
             | device if they don't drop the mknod capability as an
             | example.
             | 
             | Actually a privileged container can update the bios or even
             | load arbitrary kernel modules in the host context or change
             | kernel parameters as it is a shared kernel.
             | 
             | I tried to get the docker folks to add a conf option to
             | disallow privileged containers but they refused.
             | 
             | You can run in user mode now but most people want
             | persistence and other features that don't allow for that.
             | 
             | The important point is if you assume containers are a
             | security feature you are going to have a bad time. Jails
             | were bad enough and containers are just one step up from
             | chroots as far as security goes.
             | 
             | namespace isolation is the main benefit of containers.
             | 
             | Selinux and apparmor are far more appropriate than
             | containers for the security concerns. While I don't
             | personally like selinux, apparmor profiles are pretty easy
             | to write.
        
               | nyrikki wrote:
               | Plus the 'leaks' in the Linux process API are even worse
               | as each container may run its own tiny-init
               | 
               | Containers make the first point of the OP far worse by
               | adding way more pid namespaces.
        
           | wmf wrote:
           | Maybe cgroups would be better than full containers here.
        
             | GauntletWizard wrote:
             | Which cgroups? Containers are not actually a thing in
             | kernel-land. They're a combination of Process, Network,
             | User, and other namespacing.
        
               | wmf wrote:
               | No, cgroups are a separate API from namespaces.
               | https://man7.org/linux/man-pages/man7/cgroups.7.html
        
         | mattpallissard wrote:
         | Done.
         | https://pallissard.net/2022/06/27/limiting_application_resou...
         | 
         | TL;DR: two functions: "dispatch", which calls systemd-run, and
         | "wrap", which takes a command, a memory limit, and a CPU limit.
        
           | nine_k wrote:
           | systemd is not bash. Otherwise indeed true.
        
       | nickdothutton wrote:
       | It is after reading pieces like these that I'm reminded of how
       | fortunate I am to have had experience of other "serious"
       | Operating Systems, used at scale, in complex and sometimes
       | unfriendly environments. Namely VAX/VMS. Although some might feel
       | the title was a little clickbaity, I enjoyed the article.
        
         | DeathArrow wrote:
         | VMS was released for x86, so if you miss it you can give it a
         | spin.
         | 
         | https://vmssoftware.com/about/news/2022-07-14-openvms-v92-fo...
        
           | skissane wrote:
           | Thus far the x86 port is only available to paying customers.
           | x86 hobbyist program is expected very soon now (within the
           | next few days/weeks). Until then, the best x86 option for
           | hobbyist use is probably running the Alpha version under an
           | emulator. (I don't know if any Itanium emulators are
           | available.) Or emulated VAX: OpenVMS for VAX is no longer
           | legally available to hobbyists, but not hard to find if you
           | don't care about the legalities of it.
        
       | gtirloni wrote:
       | Is Fuchsia any better for what the article is concerned about?
       | 
       | https://fuchsia.dev
        
       | sitkack wrote:
       | Excellent article! Thanks for posting it. It outlines all the
       | problems and then offers a solution with this tool (by the
       | author)
       | 
       | https://github.com/catern/supervise
        
       | cryptonector wrote:
       | Yup. PIDs are racy unless they belong to your direct child
       | processes and you've not reaped them yet. And it goes on.
       | 
       | Windows has a much better process API, except for CreateProcess()
       | (the less said about which the better).
       | 
       | One thing I generally do when I have a multi-process program (one
       | that starts multiple worker processes, say), is to have a pipe
       | with the write end only in the parent process and whose read end
       | the children include in their I/O event loops. That way when the
       | parent exits the children find out and then they too exit. The
       | parent will still try to signal them, but say the parent gets
       | `SIGKILL`ed: the children find out and they exit.
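
The liveness-pipe trick described above can be sketched in a few lines. This is an illustration under assumptions, not the commenter's code. A stand-in "parent" is forked so that it can be killed with `SIGKILL`, and a second pipe (`report_w`, purely for the demo) lets the worker tell us what it observed. The key detail is that the worker closes its inherited copy of the write end, so the only remaining writer is the parent.

```python
import os
import select
import signal


def worker(liveness_r, report_w):
    """Worker event loop: exit once the liveness pipe reaches EOF."""
    os.write(report_w, b"ready")
    while True:
        readable, _, _ = select.select([liveness_r], [], [], 1.0)
        if liveness_r in readable and os.read(liveness_r, 1) == b"":
            # EOF: every write end is closed, so the parent is gone.
            os.write(report_w, b"parent-gone")
            os._exit(0)
        # A real worker would service its other descriptors here.


def start_parent(report_w):
    """Fork a stand-in 'parent' that spawns one worker; return the parent's pid."""
    pid = os.fork()
    if pid != 0:
        return pid
    liveness_r, liveness_w = os.pipe()
    if os.fork() == 0:
        os.close(liveness_w)  # the worker must not hold a write end itself
        worker(liveness_r, report_w)
    os.close(liveness_r)
    while True:
        signal.pause()  # idle while holding liveness_w open


def demo():
    report_r, report_w = os.pipe()
    parent = start_parent(report_w)
    os.close(report_w)
    os.read(report_r, 5)             # wait for the worker's "ready"
    os.kill(parent, signal.SIGKILL)  # the parent gets no chance to clean up
    os.waitpid(parent, 0)
    message = os.read(report_r, 64)
    os.close(report_r)
    return message
```

Because the kernel closes the parent's descriptors even on `SIGKILL`, the worker sees EOF without any cooperation from the dying parent.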
        
         | monocasa wrote:
         | pidfds solve some of those problems.
        
           | cryptonector wrote:
           | Indeed, they do.
           | 
           | One can approximate pidfd in multi-processed programs on OSes
           | that lack it, but that's about it. pidfd needs to be first-
           | class.
        
         | rand_flip_bit wrote:
         | Curious why you think CreateProcess is worse than fork/exec.
         | Sure it takes about a dozen parameters but is that really the
         | end of the world?!? It's much much easier to use correctly and
         | doesn't have nearly as many of the pitfalls as fork/exec.
         | Especially in large processes with lots of memory allocated. I
         | genuinely don't understand why people dislike it so much.
        
           | jborean93 wrote:
           | Most of the complaints I've seen are about the number of args
           | and the complexity of calling it vs something simple like
           | fork. There are a lot of knobs to turn which you need to be
           | explicit about. That's not even getting into the whole
           | ProcThreadAttributeList and the myriad of options it exposes.
           | 
           | In saying all that I do prefer the `CreateProcess*` APIs on
           | Windows vs the POSIX ones but that might be because I
           | understand the former better.
        
       | bolangi wrote:
       | Where process supervision is required under unix, you can use
       | systemd, the linux-only solution pushed by redhat, or one of the
       | small supervision suites such as s6 developed by skarnet.org.
        
         | slondr wrote:
         | What happens when s6 crashes, then?
        
       | aidenn0 wrote:
       | Do cgroups solve any of these problems? I was mildly surprised to
       | not see them mentioned.
        
         | wmf wrote:
         | Where the author talks about containers you can mentally
         | substitute cgroups since Linux containers are cgroups +
         | namespaces.
        
           | rcoveson wrote:
           | That's how I look at it too, but lots of people don't look at
           | it that way, hence all the handwaving about "too heavyweight"
           | and "seems like overkill" etc.
           | 
           | Largely because of Docker and Kubernetes, many think of a
           | container as _all_ of the following:
           | 
           | 1. A cgroup + [all or nearly all of the] unshare-able
           | namespaces
           | 
           | 2. A writable, disposable overlay on top of an immutable
           | "image", which may be lazily downloaded and extracted
           | 
           | 3. A resource managed by a userspace daemon managed by a
           | userspace utility over a socket
           | 
           | 4. Optionally, a seccomp-bpf filter or apparmor profile or
           | something
           | 
           | But there's a whole useful spectrum between a vanilla process
           | and a Docker container like that. Lots of points on that
           | spectrum still feel highly container-ized but aren't really
           | much more heavyweight than a vanilla process.
           | 
           | Beyond that, in the point about PID namespaces, the author
           | should mention that there are ultra-light-weight init
           | implementations that are barely a factor in overhead.
        
         | jwilk wrote:
         | They were mentioned in 2.1.1.
        
       | not_enoch_wise wrote:
       | you're exciting me
        
       | userbinator wrote:
       | Is "unreliable and unsafe" the new "considered harmful"? Because
       | it sure feels like that.
        
         | calt wrote:
         | I think it's quite a bit more descriptive and objective than
         | "considered harmful."
        
       | wang_li wrote:
       | This reads like they have a set of requirements and since the
       | Unix model doesn't meet their requirements, the Unix model is
       | bad. As opposed to it's fine for those who have different
       | requirements.
        
       | kccqzy wrote:
       | Note that this article is really talking about the general case,
       | but in practice a lot of techniques can work if you have narrower
       | requirements or if you have more control over what you run.
       | 
       | For example, in 1.1.4 the author talks about why containers are
       | not a solution giving three distinct reasons. But if we change
       | our perspective a little bit, none of the three reasons are
       | blocking. The first is that it's not easy; but `docker run` or
       | `podman run` is easy. Even systemd units start with separate
       | control groups to allow you to terminate everything at once. The
       | second reason was about gdb; when was the last time you used gdb
       | in production? If you are using gdb someone is interactively
       | using the computer and can be relied upon to clean up processes
       | manually. The third reason is that containers are more
       | heavyweight, but there's no need to make every subprocess a
       | separate container: if multiple processes should be managed as a
       | single unit (including the case when we'd want to terminate a
       | whole group of processes) they should run in the same container.
       | 
       | So with a slight change of perspective we find the problem easily
       | solved. It has trade-offs but it works well enough in practice
       | that only very few purists have a problem with it. Not to diss on
       | the author--I think this type of perfectionist thinking is
       | illuminating in terms of API design--but pragmatically it's a
       | solved problem.
        
         | wahern wrote:
         | The same is true for the process group/session + controlling
         | terminal solution: the solution doesn't work recursively (can't
         | do process management downstream), and it also requires child
         | processes to abstain from changing SIGHUP handler or mask, but
         | in the vast majority of cases none of those limitations are a
         | problem. Combined with POSIX fcntl locks[1] on a PID file, this
         | is my go to generic solution for Unix-portable[2], multiprocess
         | daemons. The amount of code required in the supervisor
         | component is quite trivial, yet covers almost all of your
         | bases.
         | 
         | [1] fcntl locks permit querying the PID of the lock holder, so
         | you don't need to write the PID to the file, providing a
         | solution to the PID file race and loaded gun dilemmas. (There's
         | still a race, but the same race exists with Linux containers,
         | and both can be resolved in similar manner--query PID, send
         | SIGSTOP, verify PID association, send SIGKILL or SIGCONT.)
         | 
         | [2] One of the crucial behaviors, that the kernel atomically
         | sends SIGHUP to all processes in the group if the controlling
         | process terminates, isn't guaranteed by POSIX, but it's the
         | behavior on all Unix I've tried--AIX, FreeBSD, macOS, Linux,
         | NetBSD, OpenBSD, and Solaris.
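
Footnote [1] above relies on `F_GETLK` reporting the PID of the lock holder. A minimal sketch of that query follows, under a loud assumption: the `struct flock` layout packed here (`short, short, off_t, off_t, pid_t` with 64-bit `off_t`) matches 64-bit Linux and may differ on other systems. The stand-in daemon and the function names are illustrative only.

```python
import fcntl
import os
import signal
import struct
import tempfile

# Assumed layout of struct flock on 64-bit Linux:
# short l_type; short l_whence; off_t l_start; off_t l_len; pid_t l_pid
FLOCK_FORMAT = "hhqqi4x"


def lock_holder(fd):
    """Return the PID holding a conflicting POSIX lock on fd, or None."""
    query = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    reply = fcntl.fcntl(fd, fcntl.F_GETLK, query)
    l_type, _, _, _, l_pid = struct.unpack(FLOCK_FORMAT, reply)
    return None if l_type == fcntl.F_UNLCK else l_pid


def demo():
    with tempfile.NamedTemporaryFile() as lockfile:
        ready_r, ready_w = os.pipe()
        daemon = os.fork()
        if daemon == 0:
            # Stand-in daemon: take the lock, announce it, then idle.
            fd = os.open(lockfile.name, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"1")
            while True:
                signal.pause()
        os.close(ready_w)
        os.read(ready_r, 1)  # wait until the lock is held
        os.close(ready_r)
        held_by = lock_holder(lockfile.fileno())
        os.kill(daemon, signal.SIGKILL)
        os.waitpid(daemon, 0)
        released = lock_holder(lockfile.fileno())  # lock dies with its holder
        return daemon, held_by, released
```

Nothing is ever written into the file: the kernel tracks the holder and drops the lock when the holder exits, which is why a stale PID file cannot occur in this scheme.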
        
       | kwhitefoot wrote:
       | That was interesting and clearly written, I wish all such
       | articles were as clear.
        
       ___________________________________________________________________
       (page generated 2023-03-22 23:00 UTC)