[HN Gopher] Go, Containers, and the Linux Scheduler
___________________________________________________________________

Go, Containers, and the Linux Scheduler

Author : rbanffy
Score  : 120 points
Date   : 2023-11-07 19:10 UTC (3 hours ago)

(HTM) web link (www.riverphillips.dev)
(TXT) w3m dump (www.riverphillips.dev)

| ntonozzi wrote:
| I've been bitten many times by the CFS scheduler while using
| containers and cgroups. What's the new scheduler? Has anyone here
| tried it in a production cluster? We're now going on two decades
| of wasted cores:
| https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf
|
| donaldihunter wrote:
| https://kernelnewbies.org/Linux_6.6#New_task_scheduler:_EEVD...
|
| the8472 wrote:
| The problem here isn't the scheduler. It's resource restrictions
| imposed by the container, with the containerized process (Go) not
| checking the OS features used to impose them when calculating the
| available amount of parallelism.
|
| dilyevsky wrote:
| This is subtly incorrect - as far as Docker is concerned, the CFS
| cgroup extension has several knobs to tune: cfs_quota_us,
| cfs_period_us (the typical default is 100ms, not a second) and
| shares. When you set shares you get weighted proportional
| scheduling (but only when there's contention). The former two
| enforce a strict quota. Don't use Docker's --cpu flag and instead
| use --cpu-shares to avoid (mostly useless) quota enforcement.
|
| From the Linux docs:
|
|     - cpu.shares: The weight of each group living in the same
|       hierarchy, that translates into the amount of CPU it is
|       expected to get. Upon cgroup creation, each group gets
|       assigned a default of 1024. The percentage of CPU assigned
|       to the cgroup is the value of shares divided by the sum of
|       all shares in all cgroups at the same level.
|     - cpu.cfs_period_us: The duration in microseconds of each
|       scheduler period, for bandwidth decisions. This defaults to
|       100000us or 100ms. Larger periods will improve throughput
|       at the expense of latency, since the scheduler will be able
|       to sustain a cpu-bound workload for longer. The opposite is
|       true for smaller periods. Note that this only affects
|       non-RT tasks that are scheduled by the CFS scheduler.
|     - cpu.cfs_quota_us: The maximum time in microseconds during
|       each cfs_period_us in which the current group will be
|       allowed to run. For instance, if it is set to half of
|       cfs_period_us, the cgroup will only be able to run for 50%
|       of the time. One should note that this represents aggregate
|       time over all CPUs in the system. Therefore, in order to
|       allow full usage of two CPUs, for instance, one should set
|       this value to twice the value of cfs_period_us.
|
| Thaxll wrote:
| People using Kubernetes don't tune or change those settings;
| it's up to the app to behave properly.
|
| dilyevsky wrote:
| False. The Kubernetes CPU request sets the shares; the CPU limit
| sets the CFS quota.
|
| Thaxll wrote:
| You said to change Docker flags. Anyway, your post is irrelevant;
| the goal is to let the runtime know how many POSIX threads it
| should use.
|
| If you set the request/limit to 1 core but run on a 64-core node,
| then your runtime will see all 64 cores, which will bring
| performance down.
|
| dilyevsky wrote:
| The original article is about Docker. That's the point of my
| comment - don't set a CPU limit.
|
| riv991 wrote:
| I intended it to be applicable to all containerised environments.
| Docker is just easiest on my local machine.
|
| I still believe it's best to set these variables regardless of
| CPU limits and/or CPU shares.
|
| dilyevsky wrote:
| All you did was kneecap your app to have lower performance so it
| fits under your arbitrary limit. Hardly what most people describe
| as "best" - it's only useful in a small percentage of use cases
| (like reselling compute).
|
| riv991 wrote:
| I've seen significant performance gains from this in production.
|
| Other people have encountered it too, hence libraries like
| automaxprocs existing and issues being opened with Go about it.
|
| riv991 wrote:
| Hi, I'm the blog author, thanks for the feedback.
|
| I'll try to clarify this. I think this is how the symptom
| presents, but I should be clearer.
|
| mratsim wrote:
| > Don't use Docker's --cpu flag and instead use --cpu-shares to
| avoid (mostly useless) quota enforcement.
|
| One caveat is that an application can detect when --cpu is used,
| as I think it's using cpuset. When quotas are used it cannot
| detect them, and more threads than necessary will likely be
| spawned.
|
| cpuguy83 wrote:
| It is not using cpuset (there is a separate flag for that).
| --cpus tweaks the CFS quota based on the number of CPUs on the
| system and the requested amount.
|
| dilyevsky wrote:
| --cpu sets the quota; there is a --cpuset-cpus flag for cpuset,
| and you can detect both by looking at /sys/fs/cgroup.
|
| cpuguy83 wrote:
| > "Don't use Docker's --cpu flag and instead use"
|
| This is rather strong language without any real qualifiers. It is
| definitely not "mostly useless". Shares and quotas are for
| different use cases, that's all. Understand your use case and
| choose accordingly.
|
| dilyevsky wrote:
| It doesn't make any sense to me why the --cpu flag tweaks the
| quota and not the shares, since the quota is useful in a tiny
| minority of use cases. A lot of people waste a ton of time
| debugging weird latency issues as a result of this decision.
|
| the8472 wrote:
| With shares you're going to experience worse latency if all the
| containers on the system size their thread pools to the maximum
| that's available during idle periods and then constantly
| context-switch due to oversubscription under load. With quotas
| you can do fixed resource allocation, and the runtimes (not Go,
| apparently) can fit themselves into that and not try to service
| more requests than they can currently execute given those
| resources.
| gregfurman wrote:
| Discovered this sometime last year in my previous role as a
| platform engineer managing our on-prem Kubernetes cluster as well
| as the CI/CD pipeline infrastructure.
|
| Although I saw this dissonance between actual and assigned CPU
| causing issues, particularly CPU throttling, I struggled to find
| a scalable solution that would affect all Go deployments on the
| cluster.
|
| Getting all devs to include that automaxprocs dependency was not
| exactly an option for hundreds of projects. Alternatively,
| setting all CPU requests/limits to a whole number and then
| assigning that to a GOMAXPROCS environment variable in a k8s
| manifest was also clunky and infeasible.
|
| I ended up just using this GOMAXPROCS variable for some of our
| more highly multithreaded applications, which yielded some
| improvements, but I've yet to find a solution that is applicable
| to all deployments in a microservices architecture with high
| variability in the CPU requirements of each project.
|
| jeffbee wrote:
| There isn't one answer for this. Capping GOMAXPROCS may cause
| severe latency problems if your process gets a burst of traffic
| and has naive queueing. It's really best to set GOMAXPROCS to
| whatever the hardware offers, regardless of your ideas about how
| much time the process will use on average.
|
| linuxftw wrote:
| You could define a mutating webhook to inject GOMAXPROCS into all
| pod containers.
|
| hiroshi3110 wrote:
| How about GKE and containerd?
|
| rickette wrote:
| Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go releases.
| You can use https://github.com/KimMachineGun/automemlimit to
| automatically set this limit, kinda like
| https://github.com/uber-go/automaxprocs.
|
| ImJasonH wrote:
| Thanks for sharing this!
|
| And as a maintainer of ko[1], it was a pleasant surprise to see
| ko mentioned briefly, so thanks for that too :)
|
| 1: https://ko.build
|
| dekhn wrote:
| The common problem I see across many languages is: applications
| detect machine cores by looking at /proc/cpuinfo. However, in a
| Docker container (or other container technology), that file looks
| the same as on the container host (listing all cores, regardless
| of how few have been assigned to the container).
|
| I wondered for a while whether Docker could make a fake
| /proc/cpuinfo that apps could parse that just listed the "docker
| cpus" allocated to the job, but upon further reflection, that
| probably wouldn't work for many reasons.
|
| dharmab wrote:
| Point of clarification: containers, when using quota-based
| limits, can use all of the CPU cores on the host. They're limited
| in how much time they can spend using them.
|
| (There are exceptions, such as documented here:
| https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...)
|
| dekhn wrote:
| Maybe I should be clearer: let's say I have a 16-core host and I
| start a Flask container with cpu=0.5 that forks and has a heavy
| post-fork initializer.
|
| flask/gunicorn will fork 16 processes (by reading /proc/cpuinfo
| and counting cores), all of which will try to share 0.5 cores'
| worth of CPU power (maybe spread over many physical CPUs; I don't
| really care about that).
|
| I can solve this by passing a flag to my application; my
| complaint is more that apps shouldn't consult /proc/cpuinfo, but
| should have another standard interface to ask "what should I set
| my max parallelism (NOT CONCURRENCY, ROB) to so my worker threads
| get adequate CPU time and the framework doesn't time out on
| startup".
|
| status_quo69 wrote:
| https://stackoverflow.com/questions/65551215/get-docker-cpu-...
|
| It's been a bit, but I do believe that .NET has this exact
| behavior. Sounds like gunicorn needs a PR to mimic it, if they
| want to replicate this.
|
| https://github.com/dotnet/runtime/issues/8485
|
| Volundr wrote:
| It's not clear to me what the max parallelism should actually be
| in a container with a CPU limit of .5. To my understanding that
| limits the CPU time the container can use within a certain time
| interval, but doesn't actually limit the parallel processes an
| application can run. In other words, that container with .5 as
| the CPU limit can indeed use all 16 physical cores of the
| machine. It'll just burn through its budget 16x faster. Whether
| that's desirable vs. limiting itself to one process is going to
| be highly application-dependent and not something Kubernetes and
| Docker can just tell you.
|
| jeffbee wrote:
| That's not what Go does, though. Go looks at the population of
| the CPU mask at startup. It never looks again, which is
| problematic in K8s, where the visible CPUs may change while your
| process runs.
|
| dekhn wrote:
| What is the population of the CPU mask at startup? Is this a
| kernel call? A /proc file? Some register?
|
| EdSchouten wrote:
| On Linux, it likely calls sched_getaffinity().
|
| dekhn wrote:
| Hmm, I can see that being useful, but I also don't see it as the
| way to determine "how many worker threads I should start".
|
| jeffbee wrote:
| It's not a bad way to guess, up to maybe 16 or so. Most Go server
| programs aren't going to just scale up forever, so having 188
| threads might be a waste.
|
| Just setting it to 16 will satisfy 99% of users.
|
| dekhn wrote:
| There's going to be a bunch of missing info, though, in some
| cases I can think of. For example, more and more systems have
| asymmetric cores. /proc/cpuinfo can expose that information in
| detail, including (current) clock speed, processor type, etc.,
| while cpu_set is literally just a bitmask (if I read the man
| pages right) of the system cores your process is allowed to
| schedule on.
|
| Fundamentally, intelligent apps need to interrogate their
| environment to make concurrency decisions.
| But I agree - Go would probably work best if it just picked a
| standard parallelism constant like 16 and let users know that it
| can be tuned if they have additional context.
|
| jeffbee wrote:
| Yes, running on a set of heterogeneous CPUs presents further
| challenges, for the program and the thread scheduler. Happily
| there are no such systems in the cloud, yet.
|
| Most people are running on systems where the CPU capacity varies
| and they haven't even noticed. For example, in EC2 there are 8
| victim CPUs that handle all the network interrupts, so if you
| have an instance type with 32 CPUs, you already have 24 that are
| faster than the others. Practically nobody even notices this
| effect.
|
| bruh2 wrote:
| As someone not that familiar with Docker or Go, is this behavior
| intentional? Could the Go team make it aware of the cgroups
| limit? Do other runtimes behave similarly?
|
| yjftsjthsd-h wrote:
| I'm fairly certain that .NET had to deal with it, and Java had or
| still has a problem, I forget which. (Or did you mean runtimes
| like containerd?)
|
| evntdrvn wrote:
| I know that the .NET CLR team adjusted its behavior to address
| this scenario, fwiw!
|
| the8472 wrote:
| So did OpenJDK and the Rust standard library.
___________________________________________________________________
(page generated 2023-11-07 23:00 UTC)