[HN Gopher] The Linux Scheduler: A Decade of Wasted Cores (2016)
___________________________________________________________________

The Linux Scheduler: A Decade of Wasted Cores (2016) [pdf]

Author : PaulHoule
Score  : 113 points
Date   : 2023-12-13 14:31 UTC (8 hours ago)

(HTM) web link (people.ece.ubc.ca)
(TXT) w3m dump (people.ece.ubc.ca)

  | westurner wrote:
  | > Abstract: _As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs and providing the fixes. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bugs, because their symptoms are often evasive. To drive our investigation, we built new tools that check for violation of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with a negligible overhead. We believe that making these tools part of the kernel developers' tool belt can help keep this type of bug at bay._

  | megaloblasto wrote:
  | This paper is from 2016. Anyone know the status of this now?

  | pokler wrote:
  | The Linux kernel received a new scheduler not too long ago (https://lwn.net/Articles/925371/), so I'm not sure how relevant the critiques of CFS still are.

  | whoisthemachine wrote:
  | Looks great! According to one article, those changes just landed last month in kernel 6.6, so they're likely not in most OS distributions yet.
  |
  | https://tuxcare.com/blog/linux-kernel-6-6-is-here-find-out-w....

  | nickelpro wrote:
  | It was discussed at the time, and the consensus was that the patches were unusable, the testing methodology dubious, and the machine configuration rare.
  |
  | https://lkml.org/lkml/2016/4/23/135

  | Agingcoder wrote:
  | This paper caused me to start doubting the Linux scheduler on medium-sized boxes (36 physical cores in 2016) and helped me find an annoying bug in the RHEL 7 kernel which prevented some processes from being scheduled at all in some cases (it was an issue with NUMA load balancing, if I remember correctly). Note that my bug was most likely unrelated to the paper, but it marked that class of bug as possible in my mind.
  |
  | So it helped me at least :-)

  | Aerbil313 wrote:
  | Didn't read the paper, but the Linux kernel and userspace have accumulated a lot of tech debt while becoming the most common OS.
  |
  | I'd love to see something new built with modern practices and ideas come and dethrone it. Maybe Theseus OS (not affiliated).

  | entropicdrifter wrote:
  | I've been keeping an eye on Redox for several years now, as well. It'll be interesting to see what floats and what sinks in the decades to come.

  | adql wrote:
  | I hesitate to call "the shit that keeps stuff older than a year working on that kernel" tech debt. It's a necessity in any general-use OS.
  |
  | The stuff that can be changed without breaking compatibility is, well, the developer's best idea at the time; some of those ideas turned out to be good, some turned out to be bad.
  |
  | Going "but on paper that idea would be better, let's build an OS around it" rarely ends up being plainly better. It's not one-dimensional.
  |
  | For example, if your CPU scheduler makes all jobs run faster (less cache thrashing while keeping CPUs busy, etc.) but is unfair, that might be great for people running batch jobs, but shit for desktop users.
  |
  | A scheduler that wastes cycles but makes sure interactive tasks get the right level of priority might be an improvement for desktop users but not for server use, etc.
  |
  | Or you figure out that the reason something was "unnecessarily complex" was actually necessary complexity that wasn't obvious at first.
  |
  | Also, Linux is no stranger to taking a whole bunch of code and replacing it with something better; I think we're on the 3rd firewall stack now (ipchains -> iptables -> nftables).

  | gerdesj wrote:
  | There was another one before ipchains - ipfw.

  | throwaway914 wrote:
  | Something I would love to find is a practical/succinct guide on Linux kernel performance tuning. I'd love it to give examples of sysctls you'd adjust for specific use cases like:
  |
  | - a real-time kernel
  |
  | - a single-user-mode kernel
  |
  | - an embedded device kernel (like an ESP32 or RPi)
  |
  | - general-purpose desktop use (surely there are gems in here)
  |
  | - using the Linux kernel as a hypervisor hosting others
  |
  | - tuning the Linux kernel within a VM (as a guest to another hypervisor)
  |
  | - tuning for gaming performance
  |
  | I myself do not know enough about sysctls, and I'm sure it's a goldmine.

  | dang wrote:
  | Related. Others?
  |
  | _The Linux scheduler: A decade of wasted cores (2016)_ - https://news.ycombinator.com/item?id=33462345 - Nov 2022 (26 comments)
  |
  | _The Linux Scheduler: A Decade of Wasted Cores (2016)_ - https://news.ycombinator.com/item?id=15531332 - Oct 2017 (35 comments)
  |
  | _The Linux Scheduler: A Decade of Wasted Cores_ - https://news.ycombinator.com/item?id=11570606 - April 2016 (38 comments)
  |
  | _The Linux Scheduler: A Decade of Wasted Cores [pdf]_ - https://news.ycombinator.com/item?id=11501493 - April 2016 (142 comments)

  | mpixel wrote:
  | Ultimately -- the thing is, if anyone is both capable and willing, they can, and sometimes do, fix it.
  |
  | Granted, this combination is rather rare. Most people aren't capable. Of those who are, most have better things to do, and they probably have very well-paying jobs they could be focusing on instead.
  |
  | With that being said, Linux is _still_ more efficient than Windows.
  |
  | I don't want to say Linux is free; in practice it's not: those running the big powerful machines are using RHEL and paying hefty licenses.
  |
  | Which are still better than any other alternative.

  | adql wrote:
  | > I don't want to say Linux is free; in practice it's not: those running the big powerful machines are using RHEL and paying hefty licenses.
  |
  | Google/Amazon/MS aren't paying RHEL licenses.
  |
  | The reason for using RHEL is basically "we don't want to hire experts", which makes a lot of sense in a small/mid-sized company, but in a big one it's probably mostly the ability to blame someone else if something is fucked up, and the fact that some software basically says "run it on this distro or it is unsupported".
  | deafpolygon wrote:
  | That's basically it. As a sysadmin, I supported hundreds of Linux servers (mostly virtual) pretty much by myself. We paid the license so that if the shit hit the fan (and we really had no way of fixing it) we could call on Red Hat. At least I would be able to tell my bosses that things were being handled.
  |
  | This never happened, of course. But it's CYA.

  | ElijahLynn wrote:
  | I've used RHEL support too, believe it or not, while working as a DevOps engineer at Red Hat on www.redhat.com. And the support specialist assigned to my issue was very sharp and solved the issue quickly. Easy in hindsight, but I was stuck, so I reached out, and I was glad I did. It involved a segfault and grabbing a core dump and analyzing it. ++Red Hat Support

  | saagarjha wrote:
  | There are well-paying jobs that allow smart people to focus on exactly this.

  | commandlinefan wrote:
  | > capable and willing
  |
  | This is the sort of research that scientific grants are _supposed_ to be targeting for the public good. Supposed to be.

  | AaronFriel wrote:
  | The Linux 6.6 kernel ships with a new default scheduler [1]. Is the testing methodology in this paper relevant and used to assess the current (EEVDF) scheduler, or is it irrelevant?
  |
  | [1] - https://lwn.net/Articles/925371/

  | tremon wrote:
  | Same goes for I/O scheduling. The virtual memory subsystem seems to have only two modes: buffering and write-out. The system will buffer writes until some threshold is reached, then it switches to write-out mode and flushes all dirty pages regardless of their age. The problem is that new pages are locked during write-out, so pending writes will stall, regardless of the memory pressure on the rest of the system.
  |
  | A better design would be to lock only a subset of dirty pages and let new writes to virtual memory continue while the write-out is happening, but it doesn't seem like the system can accommodate that.
  |
  | A simple test to see this in action is to use netcat to transfer a large file across the network and monitor the device activity with sar or atop (or just check the disk/network activity lights). What you'll see is that while the disk is writing, network activity drops to zero, and when network activity resumes, the disk sits idle again for seconds. It doesn't matter how much smaller you make vm.dirty_background_{bytes,ratio} with respect to vm.dirty_{bytes,ratio}; the network traffic will block as soon as the "background" disk write starts. The only effect a low value for vm.dirty_background_{bytes,ratio} has is to increase the frequency at which the network and disk activity alternate.
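A quick way to watch the alternation described above while the netcat test runs is to poll the kernel's Dirty and Writeback counters. The sketch below is illustrative only (it is not from the comment above); it assumes a Linux /proc/meminfo and simply prints both counters once per second, alongside whatever sar or atop shows.

    /* Poll /proc/meminfo once per second and print the Dirty and
     * Writeback counters, so the buffer-then-flush cycle described
     * above can be observed. Illustrative sketch only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char line[256];

        for (;;) {
            FILE *f = fopen("/proc/meminfo", "r");
            if (!f) { perror("fopen /proc/meminfo"); return 1; }

            while (fgets(line, sizeof line, f)) {
                /* Keep only the "Dirty:" and "Writeback:" lines. */
                if (strncmp(line, "Dirty:", 6) == 0 ||
                    strncmp(line, "Writeback:", 10) == 0)
                    fputs(line, stdout);
            }
            fclose(f);

            puts("---");
            fflush(stdout);
            sleep(1);
        }
    }

During the transfer you would expect Dirty to climb toward the vm.dirty_* thresholds and then drain while Writeback spikes, matching the stall pattern the comment describes.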
  | freedomben wrote:
  | Why do you think it's still this way? Why hasn't somebody implemented something better? Is it so core that we can't afford any instability?
  |
  | Edit: apparently it has been done, but only very recently, and no major distro has shipped it yet.

  | ahartmetz wrote:
  | Oh? That sounds important and interesting. Do you have a URL or commit hash with more information?

  | freedomben wrote:
  | I'm very confused now, so I have no idea what to think. I need to take some time to read through everything, but in the meantime here are some relevant links I've seen:
  |
  | https://lwn.net/Articles/925371/
  |
  | https://lkml.org/lkml/2016/4/23/135

  | ahartmetz wrote:
  | Hm, yes, maybe you are confused. The articles you posted seem to be about CPU, not I/O.

  | alyandon wrote:
  | Ran into similar pathological behavior on servers with huge amounts of RAM (>1 TiB). Negative dentries build up until the kernel decides it is time to reclaim, and it locks up the server for >60 seconds scanning pages. :-/

  | tremon wrote:
  | Yes, indeed. The more RAM available, the more pages are available as buffer cache, and the more pronounced the system stalls are. Though in my example the determining factor is available_memory/disk_write_throughput, while in your example it is available_memory/single_core_throughput (assuming the page scan is a single-threaded process).

  | anonacct37 wrote:
  | I also ran into spooky negative dentry issues. On a fleet of seemingly identical physical machines, one machine would constantly check for files that didn't exist (iirc it was somehow part of NSS, and a quick Google search seems to confirm it: https://lwn.net/Articles/894098/ ).
  |
  | I'm okayish at systems-level debugging but I never figured that one out. It caused the kernel to use a lot of memory. It's arguable whether or not it impacted performance, since it was a cache.

  | alexey-salmin wrote:
  | A complete nightmare. In the past I had to work on an I/O-heavy system with 1-5 ms read-latency guarantees, and the only working solution was to completely manage all writes in userspace: manual buffering, manual throttling, manual fallocate, then either O_DIRECT or mmap+sync_file_range, depending on whether you need the written data to be readily available for reads or not.
  |
  | There was a kernel patch, though, that solved around 95% of the issues with no userspace changes: vm.dirty_write_behind. Unfortunately it never made it into the mainline kernel. [1] For strong latency guarantees it was insufficient, but it greatly improved the typical network/IO spike alternation described above.
  |
  | I'm surprised it was never fixed upstream, despite the fact that even the most basic and simple scenarios, like "nginx writes a log file to the disk", sometimes explode with seconds-long lockups on memory-fat machines.
  |
  | [1] https://lore.kernel.org/linux-fsdevel/156896493723.4334.1334...
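For reference, a minimal sketch of the write-plus-sync_file_range pattern mentioned above (the non-mmap variant), assuming Linux with sync_file_range(2). The 8 MiB chunk size, command-line handling, and 1 GiB demo limit are illustrative choices, not values from the thread.

    /* Userspace write-behind sketch: write a file in fixed-size chunks,
     * start asynchronous write-out for each chunk right after writing it,
     * and wait for the previous chunk before moving on, so dirty pages
     * never pile up to the vm.dirty_* thresholds. Error handling is minimal. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (8 * 1024 * 1024)   /* illustrative 8 MiB write-behind window */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s outfile\n", argv[0]); return 1; }

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        if (!buf) return 1;
        memset(buf, 'x', CHUNK);

        off_t off = 0;
        ssize_t n;
        while ((n = write(fd, buf, CHUNK)) > 0) {
            /* Kick off write-out of the chunk we just wrote, without blocking. */
            sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE);

            if (off >= CHUNK) {
                /* Wait for the previous chunk to hit the disk, then drop it
                 * from the page cache so it cannot contribute to a stall. */
                sync_file_range(fd, off - CHUNK, CHUNK,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
                posix_fadvise(fd, off - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
            }

            off += n;
            if (off >= (off_t)1 << 30)   /* stop after ~1 GiB for the demo */
                break;
        }

        free(buf);
        close(fd);
        return 0;
    }

This is roughly the behavior that the out-of-tree vm.dirty_write_behind patch referenced above tried to provide automatically inside the kernel.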
  | bee_rider wrote:
  | Is their LU benchmark just LU factorization, and possibly a solve? That seems like an odd candidate for this kind of problem, because of course anybody running that kind of problem who cares about performance just doesn't oversubscribe their system... the scheduler has a much easier job when there's one big chunky thread per core.
  |
  | I mean, I get that they are testing a _general_ fix for the sort of problems that, like, non-scientific-computing users get. So it isn't to say the work is useless. It just seems like a funny example.
  |
  | I think coming up with benchmarks for schedulers must actually be very difficult, because you have to figure out exactly how silly of a thing you want the imagined user to do.

  | semi-extrinsic wrote:
  | Further to this, if you are running a program where the compute is mainly a sparse linear solve, you are memory-bandwidth limited, and on modern systems like AMD Epyc you will very, very likely see improved performance if you undersubscribe cores by 2x - e.g. using at most 64 threads on 128 cores.

  | bee_rider wrote:
  | That's interesting. I think it must be the case, but just to triple-check my intuition: I guess you want the threads spread out, so you can get every chiplet in on the action?

  | csdvrx wrote:
  | No, I think it's more about how cores may share infrastructure (like cache, etc.)

  | swatson741 wrote:
  | And yet Linux has consistently outperformed macOS in terms of throughput.

  | jerf wrote:
  | Yes. It turns out one thing being bad does not make a completely different thing good.
  |
  | (You'd think this was obvious, but I've been on the Internet a while, and it sure isn't.)

  | haberman wrote:
  | > As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues.
  |
  | This oddly lines up with an observation I've made about traffic lights. It is very surprising how often you'll be stopped at an intersection where everybody is waiting and nobody is allowed to go, sometimes for what feels like 10-30 seconds.
  |
  | It seems like the lowest-hanging fruit for improving traffic is to aim never to have degenerate intersections. If someone is waiting and nobody is going, the light should change quickly.

  | zukzuk wrote:
  | Isn't that usually to give pedestrians a head start on crossing?

  | wmf wrote:
  | That sounds like a protected left turn where no one is turning, so it looks stalled.

  | nextaccountic wrote:
  | In some places the lights work like this, but you need to install relatively expensive sensors.
  |
  | The sensors help gather high-quality data for traffic models, though. That should help traffic engineers do their job.
___________________________________________________________________
(page generated 2023-12-13 23:00 UTC)