[HN Gopher] Linux kernel heap buffer overflow in fs_context.c si...
___________________________________________________________________
Linux kernel heap buffer overflow in fs_context.c since version 5.1
Author : todsacerdoti
Score : 180 points
Date : 2022-01-20 15:35 UTC (7 hours ago)
(HTM) web link (seclists.org)
(TXT) w3m dump (seclists.org)
| carlhjerpe wrote:
| Somehow I first thought I was affected personally, but I'm on 5.15. Version numbers crossing 10 mess my head up more often than I'd be willing to admit in person.
| Eduard wrote:
| So you are affected...?
| carlhjerpe wrote:
| I need to stop posting on HN on the subway. Yes, I'm affected. This might be a bad time to be on NixOS unstable. But it seems Hydra has been green for 3 hours, so I should get a patch soon.
| stormbrew wrote:
| > An unprivileged user can use unshare(CLONE_NEWNS|CLONE_NEWUSER) to enter a namespace with the CAP_SYS_ADMIN permission, and then proceed with exploitation to root the system.
|
| I'm confused by this, don't you need CAP_SYS_ADMIN to unshare(CLONE_NEWNS) to begin with?
|
| From unshare(2):
|
| > CLONE_NEWNS
|
| > This flag has the same effect as the clone(2) CLONE_NEWNS flag. Unshare the mount namespace, so that the calling process has a private copy of its namespace which is not shared with any other process. Specifying this flag automatically implies CLONE_FS as well. Use of CLONE_NEWNS requires the CAP_SYS_ADMIN capability. For further information, see mount_namespaces(7).
|
| Edit: Oh, does this work specifically _because_ you're also unsharing into a new user namespace where you have that capability? This is kind of wild tbh.
| alerighi wrote:
| Some distributions disable user namespaces by default because they are considered a dangerous feature. And they probably are, in the end.
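The question raised above, how an unprivileged process can pass the CAP_SYS_ADMIN check for CLONE_NEWNS, can be illustrated with a minimal C sketch. It relies on the documented behavior that when CLONE_NEWUSER is combined with other CLONE_NEW* flags, the user namespace is created first. The helper name `try_unshare_userns` is hypothetical, and the sketch only probes whether the call is permitted; it does nothing further.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, let it try to enter a new user + mount
 * namespace, and report the outcome to the parent.
 * Returns 0 if unshare() succeeded in the child, a positive errno value if
 * it failed, or -1 if the child could not be created or did not exit
 * normally.
 */
int try_unshare_userns(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The user namespace is created first, so the caller already holds
         * a full capability set (including CAP_SYS_ADMIN) inside it by the
         * time the CLONE_NEWNS permission check runs. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
            _exit(errno & 0xff);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a system that allows unprivileged user namespaces the helper is expected to return 0 for an ordinary user; where they are restricted it returns an errno value such as EPERM or ENOSPC. The work is done in a forked child so the calling process keeps its original namespaces.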
| staticassertion wrote:
| Yeah, a lot of the Linux kernel code was reachable only by root, and for a long time the attitude of a lot of kernel maintainers was that privesc from root didn't matter much.
|
| But now any code can be root in its own namespace... so all of this code that's far less scrutinized is now reachable.
| boring_twenties wrote:
| Exactly the main argument that was being made against enabling this by default all those years ago.
| jakeinspace wrote:
| Sorry to be somewhat off-topic, but I have a Linux kernel bug question. I found a very small kernel bug (no obvious security implication, only affecting 32-bit builds) at work a few weeks back while working on a custom kernel patch. I sent an email to the maintainers for that kernel subsystem, but didn't hear back. I'm not quite sure if I should keep pestering them until I get a response, or if I should be doing something else to get it addressed. Any suggestions from someone with experience?
| onphonenow wrote:
| If you are filing a bug you will be ignored (or could be ignored) forever.
|
| If you send in a patch - you are MUCH MUCH more likely to get a response. As it should be.
|
| What is obvious to you as a patch may not be obvious to others, so if you can write and test your patch that would go a long way towards getting things to move forward. Bug reports are noise to many maintainers (they know there are lots, their focus is on code that fixes bugs).
| tych0 wrote:
| Standard advice is to wait two full weeks, then bump your thread (or rebase the patch and send a v2 if there are new conflicts with the maintainer's tree).
| charcircuit wrote:
| It sounded like he just filed a bug.
| jakeinspace wrote:
| It's a 1-line patch. I didn't send it in my initial email, but maybe that was a faux pas.
| tych0 wrote:
| I'd say sending the patch with git send-email --in-reply-to=<the header of your last email> is good.
| A patch is much easier to apply than write :)
| jakeinspace wrote:
| Thanks! Would you recommend I send to the maintainer(s), and Cc the mailing list?
| tych0 wrote:
| Yeah, I'd just use whatever the output of ./scripts/get_maintainer.pl says. It will also suggest any recent committers to that area of the code, which I've found useful in the past. Usually I put the maintainers as To:, and everyone else as Cc:.
| sweettea wrote:
| There are both maintainers and lists listed in MAINTAINERS (L: entries) -- did you Cc the mailing list? It might be good to bump the mailing list email if it's been several weeks, asking if there's more information you could provide.
| jakeinspace wrote:
| This is for timekeeping, which doesn't look to have its own mailing list (just points me to the vger linux-kernel mail list). I didn't Cc that mailing list, although I could.
| bonzini wrote:
| What subsystem is it?
| jakeinspace wrote:
| Timekeeping
| kidd0 wrote:
| Does it affect AWS EC2?
| [deleted]
| Diggsey wrote:
| I assume that `size + len + 2` can't _over_flow :)
| mjw1007 wrote:
| The Debian 11 release notes say:
|
| << From Linux 5.10, all users are allowed to create user namespaces by default. This will allow programs such as web browsers and container managers to create more restricted sandboxes for untrusted or less-trusted code, without the need to run as root or to use a setuid-root helper.
|
| The previous Debian default was to restrict this feature to processes running as root, because it exposed more security issues in the kernel. However, as the implementation of this feature has matured, we are now confident that the risk of enabling it is outweighed by the security benefits it provides.
|
| If you prefer to keep this feature restricted, set the sysctl:
| user.max_user_namespaces = 0
|
| Note that various desktop and container features will not work with this restriction in place, including web browsers, WebKitGTK, Flatpak and GNOME thumbnailing. >>
|
| Does anyone know a reason to keep this feature enabled on a server, other than Docker's rootless mode?
| prpl wrote:
| If you have a multi-tenant server and don't want to provide root access to users but want them to be able to run containers. Otherwise it's probably not necessary.
| dathinab wrote:
| Some programs use it to sandbox themselves without needing root. Though currently I can only think of desktop apps which do so.
| faisal_ksa wrote:
| I wonder if Rust (or any other memory-safe systems language in the future) could have avoided this exploit. If not, what could we do to avoid such exploits?
| gpm wrote:
| One method of forbidding the entire category of bugs is "bounds checks on integer arithmetic". Rust implements this in debug mode, but not by default in release mode, because it comes at a performance cost. To make this sort of solution ubiquitous you really want better hardware support to make bounds checking cheap.
|
| Realistically I think it is unlikely you would have written the same exploit in Rust even with integer overflow wrapping by default, because in idiomatic Rust you end up using types with lengths attached to them, and memcpy methods that check that you didn't fuck up the lengths before copying. You absolutely could end up writing it in Rust though (using unsafe code, but at some level unsafe code is inevitable for this sort of work), and you could, if you really wanted to, implement a similar set of safer buffer types in C that would provide a similar degree of prevention (though it would be more cumbersome to use than in Rust).
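The "safer buffer types in C" idea mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: `struct bounded_buf` and `bounded_buf_append` are hypothetical names, not anything taken from the kernel or from a real library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying buffer: the capacity travels with the data. */
struct bounded_buf {
    char  *data;
    size_t cap;   /* total bytes available in data */
    size_t len;   /* bytes currently used; invariant: len <= cap */
};

/*
 * Append n bytes from src, refusing (and copying nothing) if they do not
 * fit. The comparison is made against the remaining space, which cannot
 * wrap because len <= cap is verified first.
 */
bool bounded_buf_append(struct bounded_buf *b, const void *src, size_t n)
{
    if (b->len > b->cap || n > b->cap - b->len)
        return false;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return true;
}
```

Every copy has to go through the checked helper, which is the cumbersome part: nothing in C stops a caller from touching `data` directly, whereas a language with bounds-checked slices enforces the same rule by default.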
| menaerus wrote:
| This has nothing to do with memory but with how the CPU works with integers. This means that a (low-level) programming language fundamentally cannot solve this problem but only alleviate it, either by:
|
| 1. Changing the semantics of integer arithmetic (e.g. saturate on overflow)
|
| 2. Keeping the semantics but babysitting the computation at runtime so that the overflow/underflow can never happen (expensive)
| duped wrote:
| Modern CPUs will alert you to overflow and underflow. Rust actually panics on overflow or underflow conditions in debug builds by default.
|
| It is not expensive to check for underflow at runtime in security-critical code, and is actually mandatory for cases like this as it is UB in C.
| menaerus wrote:
| Sorry, but you're wrong in both of your claims.
|
| First, unsigned integer underflow and overflow is _not_ UB. It is a very well-defined operation (wrap-around arithmetic), the bug in question is not the result of undefined behavior, and Rust or whatever other bs I keep hearing around would not have solved it. It's a fundamental artifact of how CPUs work.
|
| Secondly, CPUs have been "alerting" through their carry and overflow bits in registers since forever, so this isn't some exclusive feature that only Rust compiler writers were smart enough to take advantage of. The same code can be and is written where it matters in C and C++ code too.
|
| It's not only a question of whether such extra checks are expensive (which they are, given that integer arithmetic is such a fundamental operation and your favorite language disables them in release builds for the sake of, I guess, nothing?) but it is also a question of the well-known _semantics_ of unsigned integer arithmetic. That's simply the way they work and I see no near future where CPU hardware engineers would change that (they will not).
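The point about unsigned wrap-around being defined behavior, and about the wrap being detectable in plain C, can be shown with a small sketch. Both function names are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned subtraction is defined to wrap modulo 2^32; no UB is involved. */
uint32_t wrapping_sub_u32(uint32_t a, uint32_t b)
{
    return a - b;
}

/*
 * The same subtraction, but reporting the wrap instead of hiding it. For
 * unsigned operands a borrow happens exactly when b > a, so a single
 * comparison is enough. *out is written only on success.
 */
bool checked_sub_u32(uint32_t a, uint32_t b, uint32_t *out)
{
    if (b > a)
        return false;
    *out = a - b;
    return true;
}
```

The extra comparison is the runtime cost being argued about in this subthread; whether it matters depends on how hot the code path is.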
| im3w1l wrote:
| You could imagine a version of the arithmetic instructions that traps on overflow. Or maybe a prefix for the normal instruction. Then it can be basically free in the happy path.
| mustache_kimono wrote:
| I'm not an expert, but I will say it may be easier to avoid an over/underflow with: https://doc.rust-lang.org/std/primitive.u32.html#method.satu...
|
| And to check if one has occurred with: https://doc.rust-lang.org/std/primitive.u32.html#method.chec...
| snvzz wrote:
| There are likely many more of these.
|
| As a reminder, Linux has millions of lines of code, and all of them run with supervisor privileges.
|
| This is not a good architecture. Generally, you'd try to minimize the attack surface.
|
| Multiserver, microkernel systems based on capabilities are where it's at.
|
| seL4 is the better microkernel to build such a system on.
| athrowaway3z wrote:
| There exist people who own their own hardware and are not providing an API to run arbitrary code.
|
| If they get together and build an OS they are generally more interested in throughput than security models.
|
| Both have pros and cons but the fact is: one is more popular with the "just get something working" crowd, for better and worse.
| [deleted]
| a-dub wrote:
| i predict that tanenbaum will ultimately win the famous monolithic vs. microkernel design debate.
|
| monolithic kernels are good for building features quickly and runtime performance, but the security design reminds me of 90s era computer security approaches, where firewalls were supposed to stop all threats and behind them security was lax on internal networks. microkernels are much more similar to today's more effective defense in depth approaches.
|
| what does it matter if your kernel is fast and featureful if you cannot trust it?
| snvzz wrote:
| >what does it matter if your kernel is fast and featureful if you cannot trust it?
|
| And if the kernel is Linux, I'm not so sure about fast.
|
| Relative to Linux, seL4 has:
|
| - Order of magnitude faster context switch.
|
| - Order of magnitude lower scheduler latency.
|
| - Order of magnitude faster Inter-Process Communication.
| VWWHFSfQ wrote:
| It's my understanding that the microkernel architecture is slow. Nearly unusably slow. And that's why nobody uses it. Am I off-base? I'm interested!
| dijit wrote:
| Yes, it's going to be slower.
|
| All of those security checks between components and memory passing will cause it to be slower.
|
| But that doesn't mean it isn't a worthy trade-off.
|
| People write software in Python despite it being slower than C++.
| hutrdvnj wrote:
| Except that anything that is actually performance critical is written as a C extension Python module.
|
| I think that low level filesystem operations are very performance critical.
| jeffbee wrote:
| I doubt that mounting a filesystem is performance-critical. You could afford to fork an unprivileged user-space process written in Perl to parse these mount options and that would be fast enough for everyone.
| YarickR2 wrote:
| Tell that to dockerd mounting images layer by layer, with k8s doing all kinds of emptyDir/PVC mounts on top. Pod start-up speeds are abysmal now; they would be positively glacial with userspace permission validation.
| jeffbee wrote:
| That's exactly my point. People who are doing lots of mounts are already demonstrably not performance-sensitive to a difference of a few milliseconds. They already waited 20 minutes for the stupid cluster autoscaler to provision a machine for them. They DGAF.
| snvzz wrote:
| There's more myth than truth[0]. In the early days, they were slow. Mach, used in OSX, is a representative of those early days.
|
| Liedtke's L4 proved that a performant microkernel is possible.
|
| Later, SMP changed the scenario considerably, as all of a sudden the microkernel multiserver fits SMP like a glove, while monolithic kernels need the complexity of locks to handle it.
|
| [0] https://news.ycombinator.com/item?id=10824382
| jeffbee wrote:
| As more and more things move out of the kernel, the perceived performance problems of microkernels look less important. If you are doing your network protocols in user space, and your thread scheduling is in user space, and you're not using a traditional filesystem much, then suddenly nobody cares how fast the kernel is.
| athrowaway3z wrote:
| I'm really not understanding what you're saying here.
|
| What the hell is a microkernel here? What kind of security are you talking about?
|
| User space filesystem and network implementations still need access to the hardware. Multiplexing that access is a kernel's job. The more you want to separate and hide that between clients, the higher the cost.
|
| As far as I understand your argument you are saying "If an application has a dedicated hard drive there will be little overhead"
| [deleted]
| jeffbee wrote:
| People think a monolithic kernel can be faster because of high-level abstractions that make many calls within the kernel, for example if I write to a TCP socket and everything else is handled for me, in the kernel, by function calls only. People believe this is faster than having an isolated network stack that has to pass messages to an isolated network driver and all that.
|
| But increasingly people realize that the performance of writing to a TCP socket in a monolithic kernel also pretty much sucks and you get a much better result as you move more and more of it into user space. For example you decide, correctly, that TCP is obsolete and you switch to QUIC. Now the existence of the Linux TCP stack is of no value to you.
| You furthermore discover that the Linux firewall, traffic control, and routing also kinda suck, so you start using raw networking. Then you discover that trying to get your frames processed on the right core at the right moment isn't great in Linux, so you just take over the whole net device with DPDK.
|
| Now, _nothing_ in the whole Linux network stack is of any use to you at all.
|
| The same thing can happen with storage. Maybe you started with files on XFS but then eventually you were using disaggregated storage where a service takes over the whole device with SPDK, and all the storage users are talking to the service instead of the kernel.
| stormbrew wrote:
| Yeah this, really. For the most part even monolithic kernels have kind of reversed the trend in the last decade or so and there's a lot of push to move critical code out of the kernel. A lot of new kernel APIs are built to avoid context switching, and part of that usually involves moving to a more asynchronous kind of communication between process and kernel.
|
| Often these APIs even look an awful lot like late generation microkernel shared memory buffer protocols. DRI and io_uring in Linux, for example.
|
| A lot of the "microkernels are inherently slow" meme is built on how earlier port-based message passing kernels like Mach worked.
| nwmcsween wrote:
| This depends on the definition of the word microkernel. In the classical definition, yes, it will be much slower due to IPC; for something like an exokernel, though, it will be much faster than even a monolithic kernel.
| whimsicalism wrote:
| The compiler can't warn about something like this? I guess unsigned integer underflow can be the intended behavior often.
| mustache_kimono wrote:
| It could, for other reasons, never underflow. C expects that you know what you're doing, and C expected you did a bounds check. But I agree. Cases like this should have a lint warning on them, saying -- "Wake up programmer!"
| menaerus wrote:
| Fundamentally this problem cannot be solved at the compile-time level because, well, the code is dealing with values which are only known at runtime. So I don't think the compiler can do much here other than providing you with a hint that you may rewrite your expression, but only to reduce the risk of a potential error, e.g. `if (len + 2 + size > PAGE_SIZE)` still remains susceptible to unsigned integer overflow, and to handle the problem fully one must either:
|
| 1. Write a lot of convoluted if..else logic such as https://wiki.sei.cmu.edu/confluence/display/c/INT30-C.+Ensur... and https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensur...
|
| 2. Or use compiler built-in intrinsics, e.g. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...
|
| But almost nobody does that ... except probably where it really matters (not the OS kernel).
| whimsicalism wrote:
| Not solved at compile-time, but warned at compile time?
| touisteur wrote:
| Could be solved at compile time with proof of absence of runtime errors, which somehow forces you to handle all cases for any input.
| menaerus wrote:
| Warned about what exactly? Literally any operation on two unsigned integers can either underflow or overflow and any of those would still be correct and expected behavior.
| mustache_kimono wrote:
| What you say makes sense. I was obviously wishcasting. ;)
| dahfizz wrote:
| It's not obvious what the warning would be, unless you want a warning attached to every single arithmetic operation? The compiler can't know what `size` will be in this case.
| whimsicalism wrote:
| Comparisons in an if statement involving subtracting two unsigned variables from each other?
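The two shapes of length check debated above, a comparison that subtracts unsigned values and an addition guarded by the GCC/Clang overflow builtins, can be compared in a short C sketch. It is modeled on the expressions quoted in the thread rather than copied from the kernel: `DEMO_PAGE_SIZE`, `fits_subtracting` and `fits_checked` are hypothetical names, and `size_t` is assumed for both operands.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u   /* stand-in for the kernel's PAGE_SIZE */

/*
 * Subtraction form: once size exceeds DEMO_PAGE_SIZE - 2 the right-hand
 * side wraps to a huge value, so the "too long" test silently stops firing.
 */
bool fits_subtracting(size_t size, size_t len)
{
    return !(len > DEMO_PAGE_SIZE - 2 - size);
}

/*
 * Addition form guarded by __builtin_add_overflow (GCC and Clang), so a
 * wrap in either addition is reported instead of being compared.
 */
bool fits_checked(size_t size, size_t len)
{
    size_t total;
    if (__builtin_add_overflow(size, len, &total) ||
        __builtin_add_overflow(total, (size_t)2, &total))
        return false;
    return total <= DEMO_PAGE_SIZE;
}
```

With `size = 4095` and `len = 1000` the subtracting version reports that the data fits, while the checked version rejects it. For small inputs both agree, which is part of why this kind of bug is easy to miss in testing.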
| pxeger1 wrote:
| This is CVE-2022-0185 if you need to know it
| caaqil wrote:
| See it in action:
| https://twitter.com/ryaagard/status/1483592308352294917
| encryptluks2 wrote:
| Arch shows some warnings about unprivileged user namespaces, but it is enabled by default I believe, which allows for rootless podman/docker. I didn't realize we'd actually see an exploit so soon.
| tremon wrote:
| What's the origin of the legacy_parse_param size parameter (from struct fs_context->fs_private->data_size)? Is this a mount option, a format-time fs configuration option, or does it require writing a specially-crafted inode to disk? The exploit says the user needs CAP_SYS_ADMIN, so I'm guessing it's the first one?
| shakna wrote:
| From the commit [0] that added it:
|
| > Legacy filesystems are supported by the provision of a set of legacy fs_context operations that build up a list of mount options and then invoke fs_type->mount() from within the fs_context ->get_tree() operation. This allows all filesystems to be accessed using fs_context.
|
| And then the description of the function itself:
|
| > Add a parameter to a legacy config. We build up a comma-separated list of options.
|
| It looks to be the second one.
|
| [0] https://github.com/torvalds/linux/commit/3e1aeb00e6d132efc15...
| AshamedCaptain wrote:
| I am surprised that mounting is now allowed inside containers. Doesn't this expose a load of new attack surface for the kernel? All this pesky academic filesystem code does not inspire a lot of confidence when parsing user data/disk images....
| phendrenad2 wrote:
| Does anyone offer security fix backports for Linux? If I'm stuck on Linux 5.1, is my only recourse to update or patch it myself?
| rwmj wrote:
| It's a one-line patch in old code so it should apply easily. However, if you encounter this kind of problem a lot I'd highly advise some kind of long-term supported Linux distribution.
| (I work at Red Hat on RHEL, and that's what people pay us for)
| singron wrote:
| There are Linux stable branches that backport fixes. It looks like all the affected branches have the fix now: linux-5.16.y linux-5.15.y linux-5.10.y linux-5.4.y linux-rolling-lts linux-rolling-stable
|
| EDIT: notably absent are linux-5.1.y and other non-LTS releases. If you can't stay on the most recent stable release, you should use LTS releases.
| adfsdsaf wrote:
| If you're stuck on 5.1, you probably have a ton of other vulnerabilities too. 5.1 isn't even an LTS release, so support for it was dropped in 2019.
|
| 5.4 is the first LTS of 5.x, and is supported through 2025. You should try to find a way to get on an LTS kernel, or plan on managing a lot of kernel patches.
___________________________________________________________________
(page generated 2022-01-20 23:00 UTC)