[HN Gopher] Go, Containers, and the Linux Scheduler
       ___________________________________________________________________
        
       Go, Containers, and the Linux Scheduler
        
       Author : rbanffy
       Score  : 120 points
       Date   : 2023-11-07 19:10 UTC (3 hours ago)
        
 (HTM) web link (www.riverphillips.dev)
 (TXT) w3m dump (www.riverphillips.dev)
        
       | ntonozzi wrote:
       | I've been bitten many times by the CFS scheduler while using
       | containers and cgroups. What's the new scheduler? Has anyone here
       | tried it in a production cluster? We're now going on two decades
       | of wasted cores:
       | https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf.
        
         | donaldihunter wrote:
         | https://kernelnewbies.org/Linux_6.6#New_task_scheduler:_EEVD...
        
         | the8472 wrote:
          | The problem here isn't the scheduler. It's that the container
          | imposes resource restrictions, but the containerized process
          | (Go) doesn't check the OS features used to impose them when
          | calculating the available amount of parallelism.
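          | 
          | As a rough illustration, a minimal sketch of what such a check
          | could look like -- reading the cgroup v2 cpu.max file and
          | deriving GOMAXPROCS from the quota. The path and fallbacks are
          | assumptions (cgroup v1 uses cpu.cfs_quota_us and
          | cpu.cfs_period_us instead), not a description of what any
          | particular runtime actually does:
          | 
          |     package main
          | 
          |     import (
          |         "fmt"
          |         "math"
          |         "os"
          |         "runtime"
          |         "strconv"
          |         "strings"
          |     )
          | 
          |     // cgroupCPUs returns the CPU limit implied by the cgroup v2
          |     // cpu.max file ("<quota> <period>" or "max <period>"), or 0
          |     // if no quota is configured or the file is absent.
          |     func cgroupCPUs() float64 {
          |         data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
          |         if err != nil {
          |             return 0 // not cgroup v2, or not in a container
          |         }
          |         fields := strings.Fields(string(data))
          |         if len(fields) != 2 || fields[0] == "max" {
          |             return 0 // no quota set
          |         }
          |         quota, err1 := strconv.ParseFloat(fields[0], 64)
          |         period, err2 := strconv.ParseFloat(fields[1], 64)
          |         if err1 != nil || err2 != nil || period == 0 {
          |             return 0
          |         }
          |         return quota / period
          |     }
          | 
          |     func main() {
          |         if cpus := cgroupCPUs(); cpus > 0 {
          |             runtime.GOMAXPROCS(int(math.Ceil(cpus)))
          |         }
          |         fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
          |     }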
        
       | dilyevsky wrote:
       | This is subtly incorrect - as far as Docker is concerned CFS
       | cgroup extension has several knobs to tune - cfs_quota_us,
       | cfs_period_us (typical default is 100ms not a second) and shares.
       | When you set shares you get weighted proportional scheduling (but
       | only when there's contention). The former two enforce strict
       | quota. Don't use Docker's --cpu flag and instead use --cpu-shares
       | to avoid (mostly useless) quota enforcement.
       | 
        | From Linux docs:
        | 
        |     - cpu.shares: The weight of each group living in the same
        |       hierarchy, which translates into the amount of CPU it is
        |       expected to get. Upon cgroup creation, each group gets
        |       assigned a default of 1024. The percentage of CPU assigned
        |       to the cgroup is the value of shares divided by the sum of
        |       all shares in all cgroups at the same level.
        | 
        |     - cpu.cfs_period_us: The duration in microseconds of each
        |       scheduler period, for bandwidth decisions. This defaults to
        |       100000us or 100ms. Larger periods will improve throughput at
        |       the expense of latency, since the scheduler will be able to
        |       sustain a cpu-bound workload for longer. The opposite is
        |       true for smaller periods. Note that this only affects non-RT
        |       tasks that are scheduled by the CFS scheduler.
        | 
        |     - cpu.cfs_quota_us: The maximum time in microseconds during
        |       each cfs_period_us for which the current group will be
        |       allowed to run. For instance, if it is set to half of
        |       cfs_period_us, the cgroup will only be able to run at peak
        |       for 50% of the time. One should note that this represents
        |       aggregate time over all CPUs in the system. Therefore, in
        |       order to allow full usage of two CPUs, for instance, one
        |       should set this value to twice the value of cfs_period_us.
        
         | Thaxll wrote:
         | People using Kubernetes don't tune or change those settings,
         | it's up to the app to behave properly.
        
           | dilyevsky wrote:
           | False. Kubernetes cpu request sets the shares, cpu limit sets
           | the cfs quota
        
             | Thaxll wrote:
              | You said to change docker flags. Anyway, your post is
              | irrelevant; the goal is to let the runtime know how many
              | POSIX threads it should use.
              | 
              | If you set the request/limit to 1 core but run on a 64-core
              | node, then the runtime will see all 64 cores, which will
              | bring performance down.
        
               | dilyevsky wrote:
                | The original article is about docker. That's the point of
                | my comment - don't set a cpu limit.
        
               | riv991 wrote:
               | I intended it to be applicable to all containerised
               | environments. Docker is just easiest on my local machine.
               | 
               | I still believe it's best to set these variables
               | regardless of cpu limits and/or cpu shares
        
               | dilyevsky wrote:
                | All you did is kneecap your app to have lower performance
                | so it fits under your arbitrary limit. Hardly what most
                | people describe as "best" - only useful in a small
                | percentage of use cases (like reselling compute).
        
               | riv991 wrote:
               | I've seen significant performance gains from this in
               | production.
               | 
                | Other people have encountered it too, hence libraries
                | like automaxprocs existing and issues being open against
                | Go for it.
        
         | riv991 wrote:
         | Hi I'm the blog author, thanks for the feedback
         | 
          | I'll try to clarify this. I think this is how the symptom
          | presents, but I should be clearer.
        
         | mratsim wrote:
         | > Don't use Docker's --cpu flag and instead use --cpu-shares to
         | avoid (mostly useless) quota enforcement.
         | 
          | One caveat is that an application can detect when --cpu is
          | used, as I think it's using cpuset. When quotas are used it
          | cannot detect them, and more threads than necessary will likely
          | be spawned.
        
           | cpuguy83 wrote:
           | It is not using cpuset (there is a separate flag for this).
           | --cpus tweaks the cfs quota based on the number of cpus on
           | the system and the requested amount.
        
           | dilyevsky wrote:
            | --cpu sets the quota; there is a separate --cpuset-cpus flag
            | for cpuset, and you can detect both by looking at
            | /sys/fs/cgroup
        
         | cpuguy83 wrote:
         | > "Don't use Docker's --cpu flag and instead use"
         | 
         | This is rather strong language without any real qualifiers. It
         | is definitely not "mostly useless". Shares and quotas are for
         | different use-cases, that's all. Understand your use-case and
         | choose accordingly.
        
           | dilyevsky wrote:
            | It doesn't make any sense to me why the --cpu flag tweaks the
            | quota and not shares, since quota is useful in a tiny
            | minority of use cases. A lot of people waste a ton of time
            | debugging weird latency issues as a result of this decision.
        
             | the8472 wrote:
             | With shares you're going to experience worse latency if all
             | the containers on the system size their thread pool to the
             | maximum that's available during idle periods and then
             | constantly context-switch due to oversubscription under
             | load. With quotas you can do fixed resource allocation and
             | the runtimes (not Go apparently) can fit themselves into
             | that and not try to service more requests than they can
             | currently execute given those resources.
        
       | gregfurman wrote:
       | Discovered this sometime last year in my previous role as a
       | platform engineer managing our on-prem kubernetes cluster as well
       | as the CI/CD pipeline infrastructure.
       | 
       | Although I saw this dissonance between actual and assigned CPU
       | causing issues, particularly CPU throttling, I struggled to find
       | a scalable solution that would affect all Go deployments on the
       | cluster.
       | 
        | Getting all devs to include the automaxprocs dependency was not
        | exactly an option for hundreds of projects. Alternatively,
       | setting all CPU request/limit to a whole number and then
       | assigning that to a GOMAXPROCS environment variable in a k8s
       | manifest was also clunky and infeasible.
       | 
       | I ended up just using this GOMAXPROCS variable for some of our
       | more highly multithreaded applications which yielded some
       | improvements but I've yet to find a solution that is applicable
       | to all deployments in a microservices architecture with a high
       | variability of CPU requirements for each project.
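        | 
        | For reference, the "set it in the manifest" approach mentioned
        | above can lean on the Downward API so the value tracks the limit
        | instead of being hard-coded. A sketch (container name and image
        | are placeholders; limits.cpu gets rounded up to a whole core,
        | which is what GOMAXPROCS expects):
        | 
        |     containers:
        |       - name: app            # hypothetical container
        |         image: example.com/app:latest
        |         resources:
        |           requests:
        |             cpu: "1"
        |           limits:
        |             cpu: "1"
        |         env:
        |           - name: GOMAXPROCS
        |             valueFrom:
        |               resourceFieldRef:
        |                 resource: limits.cpu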
        
         | jeffbee wrote:
         | There isn't one answer for this. Capping GOMAXPROCS may cause
         | severe latency problems if your process gets a burst of traffic
         | and has naive queueing. It's best really to set GOMAXPROCS to
         | whatever the hardware offers regardless of your ideas about how
         | much time the process will use on average.
        
         | linuxftw wrote:
         | You could define a mutating webhook to inject GOMAXPROCS into
         | all pod containers.
        
       | hiroshi3110 wrote:
       | How about GKE and containerd?
        
       | rickette wrote:
       | Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go releases.
       | You can use https://github.com/KimMachineGun/automemlimit to
        | automatically set this limit, kinda like
       | https://github.com/uber-go/automaxprocs.
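        | 
        | Both are typically wired up with blank imports whose init
        | functions read the cgroup limits and set the corresponding knobs
        | (a minimal sketch based on their READMEs; check each project for
        | configuration options):
        | 
        |     package main
        | 
        |     import (
        |         // Sets GOMEMLIMIT from the cgroup memory limit at init.
        |         _ "github.com/KimMachineGun/automemlimit"
        |         // Sets GOMAXPROCS from the cgroup CPU quota at init.
        |         _ "go.uber.org/automaxprocs"
        |     )
        | 
        |     func main() {
        |         // Application code runs with both values already set.
        |     }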
        
       | ImJasonH wrote:
       | Thanks for sharing this!
       | 
        | And as a maintainer of ko[1], it was a pleasant surprise to see
        | ko mentioned briefly, so thanks for that too :)
       | 
       | 1: https://ko.build
        
       | dekhn wrote:
       | The common problem I see across many languages is: applications
       | detect machine cores by looking at /proc/cpuinfo. However, in a
       | docker container (or other container technology), that file looks
       | the same as the container host (listing all cores, regardless of
       | how few have been assigned to the container).
       | 
       | I wondered for a while if docker could make a fake /proc/cpuinfo
       | that apps could parse that just listed "docker cpus" allocated to
       | the job, but upon further reflection, that probably wouldn't work
       | for many reasons.
        
         | dharmab wrote:
         | Point of clarification: Containers, when using quota based
         | limits, can use all of the CPU cores on the host. They're
         | limited in how much time they can spend using them.
         | 
         | (There are exceptions, such as documented here:
         | https://kubernetes.io/docs/tasks/administer-cluster/cpu-
         | mana...)
        
           | dekhn wrote:
           | Maybe I should be clearer: Let's say I have a 16 core host
           | and I start a flask container with cpu=0.5 that forks and has
           | a heavy post-fork initializer.
           | 
           | flask/gunicorn will fork 16 processes (by reading
           | /proc/cpuinfo and counting cores) all of which will try to
           | share 0.5 cores worth of CPU power (maybe spread over many
           | physical CPUs; I don't really care about that).
           | 
            | I can solve this by passing a flag to my application; my
            | complaint is more that apps shouldn't consult /proc/cpuinfo,
            | but should have another standard interface to ask "what
            | should I set my max parallelism to (NOT CONCURRENCY, ROB) so
            | my worker threads get adequate CPU time and the framework
            | doesn't time out on startup?"
        
             | status_quo69 wrote:
             | https://stackoverflow.com/questions/65551215/get-docker-
             | cpu-...
             | 
              | It's been a bit, but I do believe that dotnet has this
              | exact behavior. Sounds like gunicorn needs a PR to mimic
              | it, if they want to replicate this.
             | 
             | https://github.com/dotnet/runtime/issues/8485
        
             | Volundr wrote:
             | It's not clear to me what the max parallelism should
             | actually be on a container with a CPU limit of .5. To my
             | understanding that limits CPU time the container can use
             | within a certain time interval, but doesn't actually limit
             | the parallel processes an application can run. In other
              | words, that container with .5 on the CPU limit can indeed
              | use all 16 physical cores of that machine. It'll just burn
              | through its budget 16x faster. Whether that's desirable vs
              | limiting itself to one process is going to be highly
              | application dependent and not something kubernetes and
              | docker can just tell you.
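              | 
              | Rough arithmetic for that example (using the common 100ms
              | CFS period; the thread count and numbers are illustrative,
              | not measured):
              | 
              |     package main
              | 
              |     import "fmt"
              | 
              |     // With cfs_period_us = 100ms and cfs_quota_us = 50ms
              |     // (a 0.5 CPU limit), 16 busy threads running in
              |     // parallel exhaust the 50ms budget after about
              |     // 50/16 = 3.1ms of wall-clock time, then sit throttled
              |     // for the rest of the period.
              |     func main() {
              |         const periodMs, quotaMs, threads = 100.0, 50.0, 16
              |         runnable := quotaMs / threads
              |         fmt.Printf("runs ~%.1fms, throttled ~%.1fms "+
              |             "per %.0fms period\n",
              |             runnable, periodMs-runnable, periodMs)
              |     }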
        
         | jeffbee wrote:
         | That's not what Go does though. Go looks at the population of
          | the CPU mask at startup. It never looks again, which is
         | problematic in K8s where the visible CPUs may change while your
         | process runs.
        
           | dekhn wrote:
           | What is the population of the CPU mask at startup? Is this a
           | kernel call? A /proc file? Some register?
        
             | EdSchouten wrote:
             | On Linux, it likely calls sched_getaffinity().
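              | 
              | A minimal sketch of the same query from Go, using the
              | golang.org/x/sys/unix wrapper (pid 0 means "this process");
              | the popcount of the mask is what's typically used as the
              | default parallelism:
              | 
              |     package main
              | 
              |     import (
              |         "fmt"
              | 
              |         "golang.org/x/sys/unix"
              |     )
              | 
              |     func main() {
              |         // Ask the kernel which CPUs this process may run on.
              |         var set unix.CPUSet
              |         if err := unix.SchedGetaffinity(0, &set); err != nil {
              |             panic(err)
              |         }
              |         fmt.Println("schedulable CPUs:", set.Count())
              |     }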
        
               | dekhn wrote:
               | hmm, I can see that as being useful but I also don't see
               | that as the way to determine "how many worker threads I
               | should start"
        
               | jeffbee wrote:
               | It's not a bad way to guess, up to maybe 16 or so. Most
               | Go server programs aren't going to just scale up forever,
               | so having 188 threads might be a waste.
               | 
               | Just setting it to 16 will satisfy 99% of users.
        
               | dekhn wrote:
               | There's going to be a bunch of missing info, though, in
               | some cases I can think of. For example, more and more
               | systems have asymmetric cores. /proc/cpuinfo can expose
               | that information in detail, including (current) clock
               | speed, processor type, etc, while cpu_set is literally
               | just a bitmask (if I read the man pages right) of system
               | cores your process is allowed to schedule on.
               | 
               | Fundamentally, intelligent apps need to interrogate their
                | environment to make concurrency decisions. But I agree -
                | Go would probably work best if it just picked a standard
                | parallelism constant like 16 and just let users know that
                | it can be tuned if they have additional context.
        
               | jeffbee wrote:
                | Yes, running on a set of heterogeneous CPUs presents
               | further challenges, for the program and the thread
               | scheduler. Happily there are no such systems in the
               | cloud, yet.
               | 
               | Most people are running on systems where the CPU capacity
               | varies and they haven't even noticed. For example in EC2
               | there are 8 victim CPUs that handle all the network
               | interrupts, so if you have an instance type with 32 CPUs,
               | you already have 24 that are faster than the others.
               | Practically nobody even notices this effect.
        
       | bruh2 wrote:
       | As someone not that familiar with Docker or Go, is this behavior
       | intentional? Could the Go team make it aware of the CGroups
       | limit? Do other runtimes behave similarly?
        
         | yjftsjthsd-h wrote:
          | I'm fairly certain that .NET had to deal with it, and Java had
          | or still has a problem, I forget which. (Or did you mean
         | runtimes like containerd?)
        
       | evntdrvn wrote:
       | I know that the .NET CLR team adjusted its behavior to address
       | this scenario, fwiw!
        
         | the8472 wrote:
         | So did OpenJDK and the Rust standard library.
        
       ___________________________________________________________________
       (page generated 2023-11-07 23:00 UTC)