[HN Gopher] The Linux Scheduler: A Decade of Wasted Cores (2016)...
       ___________________________________________________________________
        
       The Linux Scheduler: A Decade of Wasted Cores (2016) [pdf]
        
       Author : PaulHoule
       Score  : 113 points
       Date   : 2023-12-13 14:31 UTC (8 hours ago)
        
 (HTM) web link (people.ece.ubc.ca)
 (TXT) w3m dump (people.ece.ubc.ca)
        
       | westurner wrote:
        | > Abstract: _As a central part of resource management, the OS
        | thread scheduler must maintain the following, simple, invariant:
        | make sure that ready threads are scheduled on available cores. As
        | simple as it may seem, we found that this invariant is often
        | broken in Linux. Cores may stay idle for seconds while ready
        | threads are waiting in runqueues. In our experiments, these
        | performance bugs caused many-fold performance degradation for
        | synchronization-heavy scientific applications, 13% higher latency
        | for kernel make, and a 14-23% decrease in TPC-H throughput for a
        | widely used commercial database. The main contribution of this
        | work is the discovery and analysis of these bugs and providing
        | the fixes. Conventional testing techniques and debugging tools
        | are ineffective at confirming or understanding this kind of
        | bugs, because their symptoms are often evasive. To drive our
        | investigation, we built new tools that check for violation of
        | the_ invariant _online and visualize scheduling activity. They
        | are simple, easily portable across kernel versions, and run with
        | a negligible overhead. We believe that making these tools part of
        | the kernel developers' tool belt can help keep this type of bug
        | at bay._
        
       | megaloblasto wrote:
       | This paper is from 2016. Anyone know the status of this now?
        
         | pokler wrote:
         | The Linux kernel received a new scheduler not too long ago
         | (https://lwn.net/Articles/925371/), so I'm not sure how
         | relevant the critiques about CFS are.
        
           | whoisthemachine wrote:
            | Looks great! According to one article, those changes just
            | landed last month in kernel 6.6, so it's likely not in most OS
            | distributions yet.
           | 
           | https://tuxcare.com/blog/linux-kernel-6-6-is-here-find-
           | out-w....
        
         | nickelpro wrote:
         | It was discussed at the time, and the consensus was the patches
          | were unusable, the testing methodology dubious, and the
         | machine configuration rare.
         | 
         | https://lkml.org/lkml/2016/4/23/135
        
           | Agingcoder wrote:
           | This paper caused me to start doubting the Linux scheduler on
           | medium sized boxes ( 36 physical cores in 2016 ) and helped
           | me find an annoying bug in the rhel 7 kernel which prevented
           | some processes from being scheduled at all in some cases( it
           | was an issue with Numa load balancing if I remember well).
           | Note my bug was most likely unrelated to the paper, but it
           | marked in my mind a specific class of bugs as possible.
           | 
           | So it helped me at least :-)
        
       | Aerbil313 wrote:
        | Didn't read the paper, but the Linux kernel and userspace have
        | accumulated a lot of tech debt while becoming the most common OS.
       | 
       | I'd love to see something new built with modern practices and
       | ideas come and dethrone it. Maybe Theseus OS (not affiliated).
        
         | entropicdrifter wrote:
         | I've been keeping an eye on Redox for several years now, as
         | well. It'll be interesting to see what floats and what sinks in
         | the decades to come.
        
         | adql wrote:
          | I hesitate to call "the shit that keeps stuff older than a year
          | working on that kernel" tech debt. It's a necessity in any
          | general-purpose OS.
         | 
          | The stuff that can be changed without breaking compatibility
          | was, well, the developers' best idea at the time; some of those
          | ideas turned out to be good, some turned out to be bad.
          | 
          | Going "but on paper that idea would be better, let's build an
          | OS around it" rarely ends up being plainly better. It's not
          | one-dimensional.
         | 
          | For example, if your CPU scheduler makes all jobs run faster
          | (less cache thrashing while keeping CPUs busy, etc.) but is
          | unfair, that might be great for people running batch jobs, but
          | shit for desktop users.
          | 
          | A scheduler that wastes cycles but makes sure interactive tasks
          | get the right level of priority might be an improvement for
          | desktop users but not for server use, etc.
         | 
          | Or you figure out that the reason something was "unnecessarily
          | complex" was actually necessary complexity that wasn't obvious
          | at first.
          | 
          | Also, Linux is no stranger to taking a whole bunch of code and
          | replacing it with something better; I think we're at the third
          | firewall stack now (ipchains -> iptables -> nftables).
        
           | gerdesj wrote:
           | There was another one before ipchains - ipfw.
        
           | throwaway914 wrote:
           | Something I would love to find is a practical/succinct guide
           | on Linux kernel performance tuning. I'd love it to give
            | examples of sysctls you'd adjust for specific use cases
           | like:
           | 
           | - a real-time kernel
           | 
           | - a single user mode kernel
           | 
            | - an embedded device kernel (like an ESP32 or a Raspberry Pi)
           | 
           | - general purpose desktop use (surely there are gems in here)
           | 
            | - using the Linux kernel as a hypervisor hosting others
           | 
            | - tuning the Linux kernel inside a VM (as a guest of another
            | hypervisor)
           | 
           | - tuning for gaming performance
           | 
           | I myself do not know enough about sysctls, and I'm sure it's
           | a goldmine.
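            | 
            | As a starting point (a minimal sketch of my own, not from any
            | guide): most of these knobs live under /proc/sys, so you can
            | dump the current values before touching anything, e.g.:
            | 
            |   /* Print a few common vm.* sysctls by reading /proc/sys
            |      directly; the knobs listed are just illustrative. */
            |   #include <stdio.h>
            | 
            |   int main(void) {
            |       const char *knobs[] = {
            |           "/proc/sys/vm/swappiness",
            |           "/proc/sys/vm/dirty_ratio",
            |           "/proc/sys/vm/dirty_background_ratio",
            |       };
            |       char buf[64];
            |       for (int i = 0; i < 3; i++) {
            |           FILE *f = fopen(knobs[i], "r");
            |           if (!f) { perror(knobs[i]); continue; }
            |           if (fgets(buf, sizeof buf, f))
            |               printf("%s = %s", knobs[i], buf);
            |           fclose(f);
            |       }
            |       return 0;
            |   }
            | 
            | Writing a new value is just writing to the same file (or
            | sysctl -w), but which values make sense for each use case is
            | exactly the guide I'm missing.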
        
       | dang wrote:
       | Related. Others?
       | 
       |  _The Linux scheduler: A decade of wasted cores (2016)_ -
       | https://news.ycombinator.com/item?id=33462345 - Nov 2022 (26
       | comments)
       | 
       |  _The Linux Scheduler: A Decade of Wasted Cores (2016)_ -
       | https://news.ycombinator.com/item?id=15531332 - Oct 2017 (35
       | comments)
       | 
       |  _The Linux Scheduler: A Decade of Wasted Cores_ -
       | https://news.ycombinator.com/item?id=11570606 - April 2016 (38
       | comments)
       | 
       |  _The Linux Scheduler: A Decade of Wasted Cores [pdf]_ -
       | https://news.ycombinator.com/item?id=11501493 - April 2016 (142
       | comments)
        
       | mpixel wrote:
       | Ultimately -- the thing is, if anyone is both capable and
       | willing, they can, and sometimes do, fix it.
       | 
       | Granted, this combination is rather rare. Most people aren't
       | capable. Of those who are, they have better things to do and they
        | probably have very well-paying jobs they could be focusing on
       | instead.
       | 
       | With that being said, Linux is _still_ more efficient than
       | Windows.
       | 
        | I don't want to say Linux is free; in practice it's not. Those
        | who are running the big powerful machines are using RHEL and
        | paying hefty license fees.
       | 
       | Which are still better than any other alternative.
        
         | adql wrote:
         | > I don't want to say Linux is free, in practice it's not,
         | those who are running the big powerful machines are using RHEL
         | and paying hefty licenses.
         | 
          | Google/Amazon/MS aren't paying for RHEL licenses.
         | 
          | The reason for using RHEL is basically "we don't want to hire
          | experts", which makes a lot of sense in a small/midsized
          | company. In a big one it's probably mostly the ability to
          | blame someone else if something is fucked up, plus the fact
          | that some software basically says "run it on this distro or it
          | is unsupported".
        
           | deafpolygon wrote:
            | That's basically it. As a sysadmin, I supported hundreds of
            | Linux servers (mostly virtual) pretty much by myself. We paid
            | the license so that if the shit hit the fan (and we really
            | had no way of fixing it) we could call on Red Hat. At least I
            | would be able to tell my bosses that things were being
            | handled.
           | 
           | This never happened, of course. But it's CYA.
        
             | ElijahLynn wrote:
             | I've used RHEL support too, believe it or not, while
             | working as a DevOps engineer at Red Hat on www.redhat.com.
             | And the support specialist assigned to my issue was very
             | sharp and solved the issue quickly. Easy in hindsight but I
             | was stuck so I reached out, and I was glad I did. It
             | involved a segfault and grabbing a core dump and analyzing
             | it. ++Red Hat Support
        
         | saagarjha wrote:
         | There are well-paying jobs that allow smart people to focus on
         | exactly this.
        
         | commandlinefan wrote:
         | > capable and willing
         | 
         | This is the sort of research that scientific grants are
         | _supposed_ to be targeting for the public good. Supposed to be.
        
       | AaronFriel wrote:
        | The Linux 6.6 kernel ships with a new default scheduler[1]. Is
        | the testing methodology in this paper still relevant, and has it
        | been used to assess the current (EEVDF) scheduler?
       | 
       | [1] - https://lwn.net/Articles/925371/
        
       | tremon wrote:
       | Same goes for I/O scheduling. The virtual memory subsystem seems
       | to only have two modes: buffering and write-out. The system will
       | buffer writes until some threshold is reached, then it switches
       | to write-out mode and flushes all dirty pages regardless of their
       | age. The problem is that new pages are locked during write-out,
       | so pending writes will stall, regardless of the memory pressure
       | on the rest of the system.
       | 
       | A better design would be to only lock a subset of dirty pages and
       | let new writes to virtual memory continue while the write-out is
        | happening, but it doesn't seem like the system can accommodate
       | that.
       | 
       | A simple test to see this in action is to use netcat to transfer
       | a large file across the network, and monitor the device activity
       | with sar or atop (or just check the disk/network activity
       | lights). What you'll see is that while the disk is writing,
       | network activity drops to zero, and when network activity
        | resumes, the disk sits idle again for seconds. It doesn't
       | matter how much smaller you make
       | vm.dirty_background_{bytes,ratio} with respect to
       | vm.dirty_{bytes,ratio}, the network traffic will block as soon as
       | the "background" disk write starts. The only effect a low value
       | for vm.dirty_background_{bytes,ratio} has is to increase the
       | frequency at which the network and disk activity will alternate.
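        | 
        | A crude way to watch the two phases (a rough sketch, not how sar
        | or atop actually gather their numbers) is to poll the Dirty and
        | Writeback counters in /proc/meminfo while the transfer runs:
        | Dirty grows during the buffering phase, then Writeback spikes
        | and Dirty drains during the stall.
        | 
        |   /* Poll Dirty/Writeback from /proc/meminfo once per second. */
        |   #include <stdio.h>
        |   #include <string.h>
        |   #include <unistd.h>
        | 
        |   int main(void) {
        |       char line[128];
        |       for (;;) {
        |           FILE *f = fopen("/proc/meminfo", "r");
        |           if (!f) { perror("/proc/meminfo"); return 1; }
        |           while (fgets(line, sizeof line, f)) {
        |               if (!strncmp(line, "Dirty:", 6) ||
        |                   !strncmp(line, "Writeback:", 10))
        |                   fputs(line, stdout);
        |           }
        |           fclose(f);
        |           puts("--");
        |           sleep(1);
        |       }
        |   }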
        
         | freedomben wrote:
         | Why do you think it's still this way? Why hasn't somebody
          | implemented something better? Is it so core that we can't
          | afford any instability?
         | 
         | Edit: apparently it has been done, but only very recently and
         | hasn't had a major distro ship it yet
        
           | ahartmetz wrote:
           | Oh? That sounds important and interesting. Do you have a URL
           | or commit hash with more information?
        
             | freedomben wrote:
             | I'm very confused now so I have no idea what to think. I
             | need to take some time to read through everything, but in
             | the mean time here are some relevant links I've seen:
             | 
             | https://lwn.net/Articles/925371/
             | 
             | https://lkml.org/lkml/2016/4/23/135
        
               | ahartmetz wrote:
               | Hm yes, maybe you are confused. The articles you posted
               | seem to be about CPU, not I/O.
        
         | alyandon wrote:
         | Ran into similar pathological behavior on servers with huge
          | amounts of RAM (>1 TiB). Negative dentries build up until the
         | kernel decides it is time to reclaim and locks up the server
         | for >60 seconds scanning pages. :-/
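          | 
          | For anyone who wants to watch it build up: the counters are
          | exposed in /proc/sys/fs/dentry-state. On the recent kernels I
          | looked at, the fifth field is the negative-dentry count, but
          | treat that layout as an assumption and check your kernel's
          | docs. A minimal dump:
          | 
          |   /* Dump fs.dentry-state; field meanings assumed per recent
          |      kernels (nr_dentry, nr_unused, ..., nr_negative). */
          |   #include <stdio.h>
          | 
          |   int main(void) {
          |       long v[6] = {0};
          |       FILE *f = fopen("/proc/sys/fs/dentry-state", "r");
          |       if (!f) { perror("dentry-state"); return 1; }
          |       if (fscanf(f, "%ld %ld %ld %ld %ld %ld", &v[0], &v[1],
          |                  &v[2], &v[3], &v[4], &v[5]) != 6)
          |           fprintf(stderr, "unexpected format\n");
          |       fclose(f);
          |       printf("nr_dentry=%ld nr_unused=%ld nr_negative=%ld\n",
          |              v[0], v[1], v[4]);
          |       return 0;
          |   }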
        
           | tremon wrote:
            | Yes, indeed. The more RAM available, the more pages are
            | available as buffer cache, and the more pronounced the system
            | stalls are. Though in my example, the determining factor is
            | available_memory/disk_write_throughput, while in your example
            | it is available_memory/single_core_throughput (assuming the
            | page scan is a single-threaded process).
        
           | anonacct37 wrote:
           | I also ran into spooky negative dentry issues. On a fleet of
            | seemingly identical physical machines, one machine would
           | constantly check for files that didn't exist (iirc it was
           | somehow part of nss and a quick Google search seems to
           | confirm https://lwn.net/Articles/894098/ ).
           | 
            | I'm okayish at systems-level debugging but I never figured
            | that one out. It caused the kernel to use a lot of memory.
           | Arguable whether or not it impacted performance since it was
           | a cache.
        
         | alexey-salmin wrote:
         | A complete nightmare. In the past I had to work on an IO-heavy
         | system with 1-5ms read-latency guarantees and the only working
          | solution was to completely manage all writes in userspace.
         | Manual buffering, manual throttling, manual fallocate, then
          | either O_DIRECT or mmap+sync_file_range depending on whether
         | you need the written data to be readily available for reads or
         | not.
         | 
         | There was a kernel patch though that solved around 95% of
         | issues with no userspace changes: vm.dirty_write_behind.
         | Unfortunately it never made it into the mainline kernel. [1]
          | For strong latency guarantees it was insufficient but it greatly
         | improved the typical network/IO spike alternations described
         | above.
         | 
          | I'm surprised it was never fixed upstream despite the fact
          | that even the most basic and simple scenarios like "nginx
          | writes a log file to the disk" sometimes explode with
          | seconds-long lockups on memory-fat machines.
         | 
         | [1] https://lore.kernel.org/linux-
         | fsdevel/156896493723.4334.1334...
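          | 
          | For the curious, the write+sync_file_range flavour of "manage
          | it yourself" looks roughly like this (a simplified sketch from
          | memory, not the actual code): write in fixed windows, kick off
          | async writeback for the window you just filled, then wait on
          | and evict the previous one so dirty pages never accumulate.
          | 
          |   /* Write-behind sketch: copy stdin to a file in 8 MiB
          |      windows. Error handling is minimal on purpose. */
          |   #define _GNU_SOURCE
          |   #include <fcntl.h>
          |   #include <stdio.h>
          |   #include <stdlib.h>
          |   #include <unistd.h>
          | 
          |   #define WIN (8L << 20)
          | 
          |   int main(int argc, char **argv) {
          |       if (argc != 2) return 1;
          |       int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC,
          |                     0644);
          |       if (fd < 0) { perror("open"); return 1; }
          |       char *buf = malloc(1 << 20);
          |       off_t off = 0, win = 0, prev = -1, prev_len = 0;
          |       ssize_t n;
          |       while ((n = read(0, buf, 1 << 20)) > 0) {
          |           if (write(fd, buf, n) != n) { perror("write");
          |                                         return 1; }
          |           off += n;
          |           if (off - win >= WIN) {
          |               /* start async writeback of the new window */
          |               sync_file_range(fd, win, off - win,
          |                               SYNC_FILE_RANGE_WRITE);
          |               if (prev >= 0) {
          |                   /* wait on the old window, then evict it */
          |                   sync_file_range(fd, prev, prev_len,
          |                       SYNC_FILE_RANGE_WAIT_BEFORE |
          |                       SYNC_FILE_RANGE_WRITE |
          |                       SYNC_FILE_RANGE_WAIT_AFTER);
          |                   posix_fadvise(fd, prev, prev_len,
          |                                 POSIX_FADV_DONTNEED);
          |               }
          |               prev = win;
          |               prev_len = off - win;
          |               win = off;
          |           }
          |       }
          |       free(buf);
          |       close(fd);
          |       return 0;
          |   }
          | 
          | The window size and whether you also fallocate up front are
          | workload-dependent; the point is just that the kernel never
          | has more than a couple of windows of dirty data to flush at
          | once.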
        
       | bee_rider wrote:
       | Is their LU benchmark just LU factorize and possibly solve? That
       | seems like an odd candidate for this kind of problem, because of
        | course anybody running that kind of problem who cares about
       | performance just doesn't over-subscribe their system... the
       | scheduler has a much easier job when there's one big chunky
       | thread per core.
       | 
       | I mean, I get that they are testing a _general_ fix for the sort
       | of problems that, like, non-scientific-computing users get. So it
       | isn't to say the work is useless. It just seems like a funny
       | example.
       | 
       | I think coming up with benchmarks for schedulers must actually be
        | very difficult, because you have to figure out exactly how silly
        | of a thing you want the imagined user to do.
        
         | semi-extrinsic wrote:
         | Further to this, if you are running a program where compute is
          | mainly a sparse linear solve, you are memory-bandwidth limited,
          | and on modern systems like AMD Epyc you will very likely see
          | improved performance if you undersubscribe cores by 2x, e.g.
          | using at most 64 threads on 128 cores.
        
           | bee_rider wrote:
           | That's interesting. I think it must be the case but just to
           | triple check my intuition, I guess you want the threads
           | spread out, so you can get every chiplet in on the action?
        
             | csdvrx wrote:
             | No, I think it's more about how cores may share
             | infrastructure (like M cache etc)
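              | 
              | Either way, the "spread out" part usually ends up as
              | explicit pinning. A minimal sketch (assuming even core IDs
              | land in distinct cache/chiplet domains, which you must
              | verify with lscpu or lstopo first) pins worker i to core
              | 2*i:
              | 
              |   /* Pin worker i to core 2*i so neighbours don't share a
              |      cache/chiplet domain (topology assumption!). */
              |   #define _GNU_SOURCE
              |   #include <pthread.h>
              |   #include <sched.h>
              |   #include <stdio.h>
              | 
              |   static void *worker(void *arg) {
              |       int idx = (int)(long)arg;
              |       cpu_set_t set;
              |       CPU_ZERO(&set);
              |       CPU_SET(2 * idx, &set);
              |       pthread_setaffinity_np(pthread_self(),
              |                              sizeof set, &set);
              |       printf("worker %d on cpu %d\n", idx,
              |              sched_getcpu());
              |       return NULL;  /* compute kernel would run here */
              |   }
              | 
              |   int main(void) {
              |       pthread_t t[4];
              |       for (long i = 0; i < 4; i++)
              |           pthread_create(&t[i], NULL, worker, (void *)i);
              |       for (int i = 0; i < 4; i++)
              |           pthread_join(t[i], NULL);
              |       return 0;
              |   }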
        
       | swatson741 wrote:
        | And yet Linux has consistently outperformed macOS in terms of
       | throughput.
        
         | jerf wrote:
         | Yes. It turns out one thing being bad does not make a
         | completely different thing good.
         | 
         | (You'd think this was obvious, but I've been on the Internet a
         | while and it sure isn't.)
        
       | haberman wrote:
       | > As a central part of resource management, the OS thread
       | scheduler must maintain the following, simple, invariant: make
       | sure that ready threads are scheduled on available cores. As
       | simple as it may seem, we found that this invariant is often
       | broken in Linux. Cores may stay idle for seconds while ready
       | threads are waiting in runqueues.
       | 
       | This oddly lines up with an observation I've made about traffic
       | lights. It is very surprising how often you'll be stopped at an
       | intersection where everybody is waiting and nobody is allowed to
       | go, sometimes for what feels like 10-30 seconds.
       | 
       | It seems like the lowest-hanging fruit for improving traffic is
       | to aim never to have degenerate intersections. If someone is
       | waiting and nobody is going, the light should change quickly.
        
         | zukzuk wrote:
         | Isn't that usually to give pedestrians a head start on
         | crossing?
        
         | wmf wrote:
         | That sounds like a protected left turn but no one is turning so
         | it looks stalled.
        
         | nextaccountic wrote:
          | In some places the lights work like this, but you need to
          | install relatively expensive sensors.
          | 
          | The sensors help gather high-quality data for traffic models,
          | though. That should help traffic engineers do their jobs.
        
       ___________________________________________________________________
       (page generated 2023-12-13 23:00 UTC)