[HN Gopher] Linux kernel heap buffer overflow in fs_context.c si...
       ___________________________________________________________________
        
       Linux kernel heap buffer overflow in fs_context.c since version 5.1
        
       Author : todsacerdoti
       Score  : 180 points
       Date   : 2022-01-20 15:35 UTC (7 hours ago)
        
 (HTM) web link (seclists.org)
 (TXT) w3m dump (seclists.org)
        
       | carlhjerpe wrote:
       | Somehow I first thought I was affected personally, but I'm on
       | 5.15. Version numbers crossing 10 messes my head up more often
       | than I'd be willing to admit in person.
        
         | Eduard wrote:
         | So you are affected...?
        
           | carlhjerpe wrote:
           | I need to stop posting HN on the subway, yes I'm affected.
           | This might be a bad time to be on NixOS unstable. But it
           | seems Hydra has been green for 3 hours, so I should get a
           | patch soon.
        
       | stormbrew wrote:
       | > An unprivileged user can use unshare(CLONE_NEWNS|CLONE_NEWUSER)
       | to enter a namespace with the CAP_SYS_ADMIN permission, and then
       | proceed with exploitation to root the system.
       | 
       | I'm confused by this, don't you need CAP_SYS_ADMIN to
       | unshare(CLONE_NEWNS) to begin with?
       | 
       | From unshare(2):
       | 
       | > CLONE_NEWNS
       | 
       | > This flag has the same effect as the clone(2) CLONE_NEWNS flag.
       | Unshare the mount namespace, so that the calling process has a
       | private copy of its namespace which is not shared with any other
       | process. Specifying this flag automatically implies CLONE_FS as
       | well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability.
       | For further information, see mount_namespaces(7).
       | 
       | Edit: Oh does this work specifically _because_ you're also
       | unsharing into a new user namespace where you have that
       | capability? This is kind of wild tbh.
        
         | alerighi wrote:
         | Some distributions disable user namespaces by default because
         | they are considered a dangerous feature. And it probably is, in
         | the end.
        
         | staticassertion wrote:
         | Yeah, a lot of the Linux kernel code was reachable by root, and
         | for a long time the attitude of a lot of kernel maintainers was
         | that privesc from root didn't matter much.
         | 
         | But now any code can be root in its own namespace... so all of
         | this code that's far less scrutinized is now reachable.
        
           | boring_twenties wrote:
           | Exactly the main argument that was being made against
           | enabling this by default all those years ago.
        
       | jakeinspace wrote:
       | Sorry to be somewhat off-topic, but I have a Linux kernel bug
       | question. I found a very small kernel bug (no obvious security
       | implication, only affecting 32-bit builds) at work a few weeks
       | back while working on a custom kernel patch. I sent an email to
       | the maintainers for that kernel subsystem, but didn't hear back.
       | I'm not quite sure if I should keep pestering them until I get a
       | response, or if I should be doing something else to get it
       | addressed. Any suggestions from someone with experience?
        
         | onphonenow wrote:
         | If you are filing a bug you will be ignored (or could be
         | ignored) forever.
         | 
         | If you send in a patch - you are MUCH MUCH more likely to get a
         | response. As it should be.
         | 
         | What is obvious to you as a patch may not be obvious to others,
         | so if you can write and test your patch that would go a long
         | way towards getting things to move forward. Bug reports are
         | noise to many maintainers (they know there are lots, their
         | focus is on code that fixes bugs).
        
         | tych0 wrote:
         | Standard advice is to wait two full weeks, then bump your
         | thread (or rebase the patch and send a v2 if there's new
         | conflicts with the maintainer's tree).
        
           | charcircuit wrote:
           | It sounded like he just filed a bug.
        
           | jakeinspace wrote:
           | It's a 1-line patch, I didn't send it in my initial email but
           | maybe that was a faux pas.
        
             | tych0 wrote:
             | I'd say sending the patch with git send-email
             | --in-reply-to=<the header of your last email> is good. A
             | patch is much easier to apply than write :)
        
               | jakeinspace wrote:
               | Thanks! Would you recommend I send to the maintainer(s),
               | and Cc the mailing list?
        
               | tych0 wrote:
               | Yeah, I'd just use whatever the output of
               | ./scripts/get_maintainer.pl says. It will also suggest
               | any recent committers to that area of the code, which
               | I've found useful in the past. Usually I put the
               | maintainers as To:, and everyone else as Cc:.
        
         | sweettea wrote:
         | There are both maintainers and lists listed in MAINTAINERS (L:
         | entries) -- did you Cc the mailing list? It might be good to
         | bump the mailing list email if it's been several weeks, asking
         | if there's more information you could provide.
        
           | jakeinspace wrote:
           | This is for timekeeping, which doesn't look to have its own
           | mailing list (just points me to the vger linux-kernel mail
           | list). I didn't Cc that mailing list, although I could.
        
         | bonzini wrote:
         | What subsystem is it?
        
           | jakeinspace wrote:
           | Timekeeping
        
       | kidd0 wrote:
       | Does it affect AWS EC2?
        
       | [deleted]
        
       | Diggsey wrote:
       | I assume that `size + len + 2` can't _over_flow :)
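
The difference between a subtraction-style bounds check and the `size + len + 2` form mentioned above can be seen in a minimal sketch. This is not the kernel code: `SKETCH_PAGE_SIZE` and both function names are hypothetical, and the subtraction form is only assumed to resemble the check the thread is discussing.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Subtraction form (assumed shape of the buggy check): once size exceeds
 * SKETCH_PAGE_SIZE - 2, the right-hand side wraps around to a huge
 * unsigned value and the "too long" branch is never taken. */
bool rejects_sub_form(size_t size, size_t len)
{
    return len > SKETCH_PAGE_SIZE - 2 - size;
}

/* Addition form: nothing is subtracted, so the expression cannot wrap
 * below zero; it would only wrap for sizes close to SIZE_MAX. */
bool rejects_add_form(size_t size, size_t len)
{
    return size + len + 2 > SKETCH_PAGE_SIZE;
}
```

With `size = 4095` and `len = 10`, the subtraction form computes `4094 - 4095`, which wraps, so the request is accepted; the addition form rejects it.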
        
       | mjw1007 wrote:
       | The Debian 11 release notes say:
       | 
       | << From Linux 5.10, all users are allowed to create user
       | namespaces by default. This will allow programs such as web
       | browsers and container managers to create more restricted
       | sandboxes for untrusted or less-trusted code, without the need to
       | run as root or to use a setuid-root helper.
       | 
       | The previous Debian default was to restrict this feature to
       | processes running as root, because it exposed more security
       | issues in the kernel. However, as the implementation of this
       | feature has matured, we are now confident that the risk of
       | enabling it is outweighed by the security benefits it provides.
       | 
       | If you prefer to keep this feature restricted, set the sysctl:
       | user.max_user_namespaces = 0
       | 
       | Note that various desktop and container features will not work
       | with this restriction in place, including web browsers,
       | WebKitGTK, Flatpak and GNOME thumbnailing. >>
       | 
       | Does anyone know a reason to keep this feature enabled on a
       | server, other than Docker's rootless mode?
        
         | prpl wrote:
         | If you have a multi-tenant server and don't want to provide
         | root access to users but want them to be able to run
         | containers. Otherwise it's probably not necessary.
        
         | dathinab wrote:
         | Some programs use it to sandbox themselves without needing root,
         | though currently I can only think of desktop apps that do
         | so.
        
       | faisal_ksa wrote:
       | I wonder if Rust (or any other memory-safe systems language in the
       | future) could have avoided this exploit. If not, what could we do
       | to avoid such exploits?
        
         | gpm wrote:
         | One method of forbidding the entire category of bugs is "bounds
         | checks on integer arithmetic". Rust implements this in debug
         | mode, but not by default in release mode, because it comes at a
         | performance cost. To make this sort of solution ubiquitous you
         | really want better hardware support to make bounds checking
         | cheap.
         | 
         | Realistically I think it is unlikely you would have written the
         | same exploit in rust even with integer overflow wrapping by
         | default, because in idiomatic rust you end up using types with
         | lengths attached to them, and memcpy methods that check that
         | you didn't fuck up the lengths before copying. You absolutely
         | could end up writing it in rust though (using unsafe code, but
         | at some level unsafe code is inevitable for this sort of work),
         | and you could if you really wanted to implement a similar set
         | of safer buffer types in C that would provide a similar degree
         | of prevention (though it would be more cumbersome to use than
         | in rust).
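
A "safer buffer type in C" of the kind described above might look like the following minimal sketch. The `bounded_buf` struct and `bounded_append` function are hypothetical names invented for illustration; they are not an existing kernel or libc API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded buffer: the capacity travels with the pointer. */
struct bounded_buf {
    char  *data;
    size_t cap;   /* total bytes available in data */
    size_t used;  /* bytes already written; invariant: used <= cap */
};

/* Append n bytes, or refuse and leave the buffer untouched. */
bool bounded_append(struct bounded_buf *b, const void *src, size_t n)
{
    /* Because used <= cap always holds, cap - used cannot wrap. */
    if (n > b->cap - b->used)
        return false;
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return true;
}
```

The length check lives in one place next to the copy, so callers cannot forget it; the cost is that every write has to go through the helper, which is the extra ceremony the comment alludes to.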
        
         | menaerus wrote:
         | This has nothing to do with memory; it comes from how the CPU
         | works with integers. This means that a (low-level)
         | programming language fundamentally cannot solve this problem,
         | only alleviate it, by either:
         | 
         | 1. Changing the semantics of integer arithmetic (e.g. saturate
         | on overflow)
         | 
         | 2. Keeping the semantics but babysitting the computation at
         | runtime so that the overflow/underflow can never happen
         | (expensive)
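
The two options listed above can be sketched in C roughly as follows. Both helper names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Option 1: change the semantics. Clamp at zero instead of wrapping. */
size_t sat_sub(size_t a, size_t b)
{
    return a > b ? a - b : 0;
}

/* Option 2: keep wrapping arithmetic, but check before computing.
 * On failure *out is left untouched and false is returned. */
bool checked_sub(size_t a, size_t b, size_t *out)
{
    if (b > a)
        return false;
    *out = a - b;
    return true;
}
```

Either way the caller pays one extra comparison per subtraction, which is the runtime cost the comment refers to.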
        
           | duped wrote:
           | Modern CPUs will alert you to overflow and underflow. Rust
           | actually panics on overflow or underflow conditions in debug
           | builds by default.
           | 
           | It is not expensive to check for underflow at runtime in
           | security-critical code, and is actually mandatory for cases
           | like this as it is UB in C.
        
             | menaerus wrote:
             | Sorry, but you're wrong in both of your claims.
             | 
             | First, unsigned integer underflow and overflow are _not_ UB.
             | They are well-defined operations (wrap-around arithmetic),
             | and the bug in question is not the result of undefined
             | behavior, so Rust or whatever other bs I keep hearing
             | around would not have solved it. It's a fundamental
             | artifact of how CPUs work.
             | 
             | Secondly, CPUs have been "alerting" through their carry and
             | overflow bits in registers since forever so this isn't some
             | exclusive feature that only rust compiler writers were
             | smart enough to take advantage of. The same code can be and
             | is written where it matters in C and C++ code too.
             | 
             | It's not only a question of whether such extra checks are
             | expensive (which they are, given that integer arithmetic is
             | such a fundamental operation and your favorite language
             | disables them in release builds for the sake of, I guess,
             | nothing?) but it is also a question of the well-known
             | _semantics_ of unsigned integer arithmetic. That's simply
             | the way they work and I see no near future where the CPU
             | hardware engineers would change that (they will not).
        
               | im3w1l wrote:
               | You could imagine a version of the arithmetic
               | instructions that traps on overflow. Or maybe a prefix
               | for the normal instruction. Then it can be basically free
               | in the happy path.
        
         | mustache_kimono wrote:
         | I'm not an expert, but I will say it may be easier to avoid an
         | over/underflow with:
         | https://doc.rust-lang.org/std/primitive.u32.html#method.satu...
         | 
         | And to check if one has occurred with:
         | https://doc.rust-lang.org/std/primitive.u32.html#method.chec...
        
       | snvzz wrote:
       | There's likely many more of these.
       | 
       | As a reminder, Linux has millions of lines of code, and all of
       | them run with supervisor privileges.
       | 
       | This is not a good architecture. Generally, you'd try to minimize
       | the attack surface.
       | 
       | Multiserver, microkernel systems based on capabilities is where
       | it's at.
       | 
       | seL4 is the better microkernel to build such a system on.
        
         | athrowaway3z wrote:
         | There exist people who own their own hardware and are not
         | providing an API to run arbitrary code.
         | 
         | If they get together and build an OS they are generally more
         | interested in throughput than security models.
         | 
         | Both have pros and cons but the fact is: one is more popular
         | with the "just get something working" crowd for better and
         | worse.
        
         | [deleted]
        
         | a-dub wrote:
         | i predict that tanenbaum will ultimately win the famous
         | monolithic vs. microkernel design debate.
         | 
         | monolithic kernels are good for building features quickly and
         | runtime performance, but the security design reminds me of 90s
         | era computer security approaches, where firewalls were supposed
         | to stop all threats and behind them security was lax on
         | internal networks. microkernels are much more similar to
         | today's more effective defense in depth approaches.
         | 
         | what does it matter if your kernel is fast and featureful if
         | you cannot trust it?
        
           | snvzz wrote:
           | >what does it matter if your kernel is fast and featureful if
           | you cannot trust it?
           | 
           | And if the kernel is Linux, I'm not so sure about fast.
           | 
           | Relative to Linux, seL4 has:
           | 
           | - Order of magnitude faster context switch.
           | 
           | - Order of magnitude lower scheduler latency.
           | 
           | - Order of magnitude faster Inter-Process Communication.
        
         | VWWHFSfQ wrote:
         | It's my understanding that the microkernel architecture is
         | slow. Nearly unusably slow. And that's why nobody uses it. Am I
         | off-base? I'm interested!
        
           | dijit wrote:
           | Yes, it's going to be slower
           | 
           | All of those security checks between components and memory
           | passing will cause it to be slower.
           | 
           | But that doesn't mean it's not a worthy trade-off.
           | 
           | People write software in python despite it being slower than
           | C++.
        
             | hutrdvnj wrote:
             | Except that anything that is actually performance critical
             | is written as a C extension python module.
             | 
             | I think that low level filesystem operations are very
             | performance critical.
        
               | jeffbee wrote:
               | I doubt that mounting a filesystem is performance-
               | critical. You could afford to fork an unprivileged user-
               | space process written in Perl to parse these mount
               | options and that would be fast enough for everyone.
        
               | YarickR2 wrote:
               | Tell that to dockerd mounting images layer by layer,
               | with k8s doing all kinds of emptyDir/PVC mounts on top.
               | Pod start-up speeds are abysmal now; they would be
               | positively glacial with userspace permission validation.
        
               | jeffbee wrote:
               | That's exactly my point. People who are doing lots of
               | mounts are already demonstrably not performance-sensitive
               | to a difference of a few milliseconds. They already
               | waited 20 minutes for the stupid cluster autoscaler to
               | provision a machine for them. They DGAF.
        
           | snvzz wrote:
           | There's more myth than truth[0]. In the early days, they were
           | slow. Mach, used in OSX, is a representative of those early
           | days.
           | 
           | Liedtke's L4 proved that a performant microkernel is
           | possible.
           | 
           | Later, SMP changed the scenario considerably, as all of a
           | sudden the microkernel multiserver fits SMP like a glove,
           | while monolithic kernels need the complexity of locks to
           | handle it.
           | 
           | [0] https://news.ycombinator.com/item?id=10824382
        
           | jeffbee wrote:
           | As more and more things move out of the kernel, the perceived
           | performance problems of microkernels look less important. If
           | you are doing your network protocols in user space, and your
           | thread scheduling is in user space, and you're not using a
           | traditional filesystem much, then suddenly nobody cares how
           | fast the kernel is.
        
             | athrowaway3z wrote:
             | I'm really not understanding what you're saying here.
             | 
             | What the hell is a microkernel here? What kind of security
             | are you talking about?
             | 
             | User space filesystem and network implementations still
             | need access to the hardware. Multiplexing that access is a
             | kernels job. The more you want to separate and hide that
             | between clients the higher the cost.
             | 
             | As far as i understand your argument you are saying "If an
             | application has a dedicated hard drive there will be little
             | overhead"
        
               | [deleted]
        
               | jeffbee wrote:
               | People think a monolithic kernel can be faster because of
               | high-level abstractions that make many calls within the
               | kernel, for example if I write to a TCP socket and
               | everything else is handled for me, in the kernel, by
               | function calls only. People believe this is faster than
               | having an isolated network stack that has to pass
               | messages to an isolated network driver and all that.
               | 
               | But increasingly people realize that the performance of
               | writing to a TCP socket in a unikernel also pretty much
               | sucks and you get a much better result as you move more
               | and more of it into user space. For example you decide,
               | correctly, that TCP is obsolete and you switch to QUIC.
               | Now the existence of the Linux TCP stack is of no value
               | to you. You furthermore discover that Linux firewall,
               | traffic control, and routing also kinda sucks, so you
               | start using raw networking. Then you discover that trying
               | to get your frames processed on the right core at the
               | right moment isn't great in Linux, so you just take over
               | the whole net device with DPDK.
               | 
               | Now, _nothing_ in the whole Linux network stack is of any
               | use to you at all.
               | 
               | The same thing can happen with storage. Maybe you started
               | with files on XFS but then eventually you were using
               | disaggregated storage where a service takes over the
               | whole device with SPDK, and all the storage users are
               | talking to the service instead of the kernel.
        
             | stormbrew wrote:
             | Yeah this, really. For the most part even monolithic
             | kernels have kind of reversed trend in the last decade or
             | so and there's a lot of push to move critical code out of
             | the kernel. A lot of new kernel apis are built to avoid
             | context switching, and part of that usually involves moving
             | to a more asynchronous kind of communication between
             | process and kernel.
             | 
             | Often these APIs even look an awful lot like late
             | generation microkernel shared memory buffer protocols. DRI
             | and uring in linux for example.
             | 
             | A lot of the "microkernels are inherently slow" meme is
             | built on how earlier port-based message passing kernels
             | like mach worked.
        
           | nwmcsween wrote:
           | This depends on the definition of the word microkernel, in
           | the classical definition yes it will be much slower due to
           | IPC, for something like an exokernel though it will be much
           | faster than even a monolithic kernel.
        
       | whimsicalism wrote:
       | The compiler can't warn about something like this? I guess
       | unsigned integer underflow can be the intended behavior often.
        
         | mustache_kimono wrote:
         | It could, for other reasons, never underflow. C expects that
         | you know what you're doing, and C expected you did a bounds
         | check. But I agree. Cases like this should have a lint warn on
         | them, saying -- "Wake up programmer!"
        
           | menaerus wrote:
           | Fundamentally this problem cannot be solved at compile time
           | because, well, the code is dealing with values that are only
           | known at runtime. So I don't think the compiler can do much
           | here other than hint that you might rewrite your expression,
           | and that only reduces the risk of a potential error, e.g. `if
           | (len + 2 + size > PAGE_SIZE)` still remains susceptible to
           | unsigned integer overflow. To handle the problem fully one
           | must either:
           | 
           | 1. Write a lot of convoluted if..else logic such as
           | https://wiki.sei.cmu.edu/confluence/display/c/INT30-C.+Ensur...
           | and
           | https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensur...
           | 
           | 2. Or use compiler built-in intrinsics, e.g.
           | https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
           | 
           | But almost nobody does that ... except probably where it
           | really matters (not the OS kernel).
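
Option 2 above could be sketched with the GCC/Clang `__builtin_add_overflow` intrinsic, as below. The function name `fits_in_page` and the `limit` parameter are hypothetical, chosen for illustration; this is not how the kernel fix was written.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only if size + len + 2 can be computed without
 * wrapping and the result is within limit. Requires GCC or Clang. */
bool fits_in_page(size_t size, size_t len, size_t limit)
{
    size_t total;

    /* Each builtin returns true if the mathematically exact result
     * does not fit in the destination type. */
    if (__builtin_add_overflow(size, len, &total))
        return false;
    if (__builtin_add_overflow(total, (size_t)2, &total))
        return false;
    return total <= limit;
}
```

Unlike the plain `len + 2 + size > PAGE_SIZE` expression, this version also rejects inputs large enough to wrap the sum itself.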
        
             | whimsicalism wrote:
             | Not solved at compile-time, but warned at compile time?
        
               | touisteur wrote:
               | Could be solved at compile time with proof of absence of
               | runtime errors, which somehow forces you to handle all
               | cases for any input.
        
               | menaerus wrote:
               | Warned about what exactly? Literally any operation on two
               | unsigned integers can either underflow or overflow and
               | any of those would still be correct and expected
               | behavior.
        
             | mustache_kimono wrote:
             | What you say makes sense. I was obviously wishcasting. ;)
        
         | dahfizz wrote:
         | It's not obvious what the warning would be, unless you want a
         | warning attached to every single arithmetic operation? The
         | compiler can't know what `size` will be in this case.
        
           | whimsicalism wrote:
           | Comparisons in an if statement involving subtracting two
           | unsigned variables from each other?
        
       | pxeger1 wrote:
       | This is CVE-2022-0185 if you need to know it
        
       | caaqil wrote:
       | See it in action:
       | https://twitter.com/ryaagard/status/1483592308352294917
        
         | encryptluks2 wrote:
         | Arch shows some warnings about unprivileged user namespaces but
         | it is enabled by default I believe which allows for rootless
         | podman/docker. I didn't realize we'd actually see an exploit so
         | soon
        
       | tremon wrote:
       | What's the origin of the legacy_parse_param size parameter (from
       | struct fs_context->fs_private->data_size)? Is this a mount
       | option, a format-time fs configuration option, or does it require
       | writing a specially-crafted inode to disk? The exploit says the
       | user needs CAP_SYS_ADMIN, so I'm guessing it's the first one?
        
         | shakna wrote:
         | From the commit [0] that added it:
         | 
         | > Legacy filesystems are supported by the provision of a set of
         | legacy fs_context operations that build up a list of mount
         | options and then invoke fs_type->mount() from within the
         | fs_context ->get_tree() operation. This allows all filesystems
         | to be accessed using fs_context.
         | 
         | And then the description of the function itself:
         | 
         | > Add a parameter to a legacy config. We build up a comma-
         | separated list of options.
         | 
         | It looks to be the second one.
         | 
         | [0]
         | https://github.com/torvalds/linux/commit/3e1aeb00e6d132efc15...
        
         | AshamedCaptain wrote:
         | I am surprised that mounting is now allowed inside containers.
         | Doesn't this expose a load of new attack surface in the
         | kernel? All this pesky academic filesystem code does not
         | inspire a lot of confidence when parsing user data/disk
         | images....
        
       | phendrenad2 wrote:
       | Does anyone offer security fix backports for Linux? If I'm stuck
       | on Linux 5.1, is my only recourse to update or patch it myself?
        
         | rwmj wrote:
         | It's a one line patch in old code so it should apply easily.
         | However if you encounter this kind of problem a lot I'd highly
         | advise some kind of long-term supported Linux distribution. (I
         | work at Red Hat on RHEL, and that's what people pay us for)
        
         | singron wrote:
         | There are Linux stable branches that backport fixes. It looks
         | like all the affected branches have the fix now: linux-5.16.y
         | linux-5.15.y linux-5.10.y linux-5.4.y linux-rolling-lts linux-
         | rolling-stable
         | 
         | EDIT: notably absent are linux-5.1.y and other non-lts
         | releases. If you can't stay on the most recent stable release,
         | you should use lts releases.
        
         | adfsdsaf wrote:
         | If you're stuck on 5.1, you probably have a ton of other
         | vulnerabilities too. 5.1 isn't even an LTS release, so support
         | for it was dropped in 2019.
         | 
         | 5.4 is the first LTS of 5.x, and is supported through 2025. You
         | should try to find a way to get on an LTS kernel, or plan on
         | managing a lot of kernel patches.
        
       ___________________________________________________________________
       (page generated 2022-01-20 23:00 UTC)