[HN Gopher] The Container Throttling Problem ___________________________________________________________________ The Container Throttling Problem Author : rognjen Score : 214 points Date : 2021-12-26 08:42 UTC (14 hours ago) (HTM) web link (danluu.com) (TXT) w3m dump (danluu.com) | jeffrallen wrote: | Tl;dr, which is too bad, because normally danluu's stuff is | great. | | From the bit I had patience to read it sounds like "we made a | complicated thing and it's doing complicated things wrong in | complicated ways". | | It is hard to believe that some of these CPU heavy, latency | sensitive servers should really be in containers. Why are they | not on dedicated machines? KISS. | marcosdumay wrote: | Linux is optimized for desktops and shared servers. When you | own the entire machine and want to use it fully, that | optimization gets in your way. | londons_explore wrote: | I think this problem would have been debugged and solved much | quicker if they'd done a CPU scheduling trace. Then they could | see, microsecond by microsecond, exactly which processes were | doing which work, and what incoming requests are still waiting. | | Then, let a human go in and say "How come request#77 hasn't yet | been processed at this point, even though CPU#3 is working on | printing unused debug data for a low priority request and #77 is | well after its deadline!??". | | Then you debug deeper and deeper, adjusting parameters and | patching algorithms till you can get a CPU trace that a human can | look at and think "yeah, I couldn't adjust this schedule by hand | to get this work done better". | | In this process, most people/teams will find at least 10x | performance gains if they've never done it before, and usually | still 2x if you limit changes to one layer of the stack ('e.g. I'm | just tweaking the application code - we won't touch the runtime, | VM, OS or hypervisor parameters'). | neerajsi wrote: | I don't know why the negative reaction. I've done the kind of | analysis you've described many times and essentially been able | to quickly identify such issues over the years. We had a | similar problem in Windows when we first implemented the | Dynamic Fair Share thread scheduler. It took a couple months to | have the right tooling to do a proper scheduler trace, but with | that available the problem was better understood in a week. I | eventually rewrote the scheduler component and added a control | law to give better burstable behavior than the hard cap quota | that this article seems to be describing. | ghusbands wrote: | That covers almost nothing in the article. It's a long | article, so maybe you could quote the bit you're responding to. | | A CPU scheduling trace wouldn't easily show you the details of | the kernel-level group throttling that was causing a lot of | issues, for example. They weren't having an issue with threads | fighting other threads, they were having an issue with threads | being penalised now for activity from several seconds ago, | drastically reducing the amount of available CPU. | | The article clearly shows a lot of debugging and diagnostic | patching ability, so it's unlikely they missed the simple | options. Rather, they probably didn't mention them because they | were obvious to try and didn't help. | londons_explore wrote: | > threads being penalised now for activity from several | seconds ago, | | Exactly... They would have found this out much quicker with a | trace.
They would have seen "how come this application level | request is being handled on thread number X, yet that thread | is not running on any core, and many cores are idle"? Then | they could quickly see the reason that thread isn't scheduled | by enabling extra tracing detail and inspecting the internal data | structures the scheduler uses to decide whether something is | schedulable or not at that instant. | jeffbee wrote: | I completely agree. KUTrace would have been ideal for this | and indeed KUTrace was developed to diagnose this exact | problem. | ghusbands wrote: | I think you're suffering hindsight bias, here. A trace is | rarely as clear as that, and it's hard to see the details | it's not designed to expose. | | Your original message would probably be better received if | you'd omitted the "I think this problem would have been | debugged and solved much quicker [...]" and its insulting | implications and instead started with "Sometimes, I find | that CPU activity traces can really help with diagnosing | this sort of problem". | The_rationalist wrote: | Please stop advocating for politeness over correctness. | Sure, hindsight helps, but regardless, a company such as | Twitter should have experts at tracing who have tools | and knowledge that go beyond the average developer's | knowledge of tracing methodologies. Excusing that is | an appeal to a lowering of technical excellence | worldwide, which is majorly important and matters more | than hypothetical feelings. | londons_explore wrote: | > a company such as Twitter should have experts at | tracing | | In a big company, getting the person with the most skills | to solve a problem to be the one actually tasked with | solving the problem is very hard. This particular problem | had many avenues to find a solution - and while I think | my proposed route would have been quicker, if you aren't | aware of those tools or techniques, then other avenues | might be much quicker. When starting an investigation | like this, you don't know where you're going to end up | either - if it turned out that the performance cliff was | caused by CPU thermal throttling, it would be hard to see | in a scheduling trace - everything would just seem | universally slow all of a sudden. | neerajsi wrote: | On Windows, we have the xperf and wpa toolset that makes | looking at holistic scheduling performance, including | processor power management and device io, tractable. Even | then, the skillset to analyze an issue like the one | presented here takes months to acquire and only a few | engineers can do it. We have dedicated teams to do this | performance analysis work, and they're always in high | demand. | treffer wrote: | I have been running k8s clusters at utilizations far beyond 50% | (up to 90% during incidents). These were web services/microservices, so | tail latencies were important. | | The way we solved this? 1. Kernel settings. Check e.g. the | settings of the Ubuntu low latency kernel for example. 2. CFS | tuning. Short timeslices. There is good documentation on how to | do that. 3. CPU pressure. We cordoned and load shedded overloaded | nodes (k8s-pressurecooker). | | By limiting the maximum CPU pressure to 20% you can say "every | service will get all the CPU it needs at least 80% of the time on | most nodes". This is what you want. A low chance of seeing CPU | exhaustion. This is needed for predictable and stable tail | latencies. | | There are a few more knobs. E.g.
scale services such that they use | at least one core, since requests are effectively limits under | congestion and you can't get half a core continuously. | | Very nice to see that people go public about this. We need to | drop the footprint of services. It is straight up wasted money | and CO2. | diegocg wrote: | Quite interesting problem. It is indeed a contradiction to make a | service use all the CPUs on a system, and, at the same time, have | an upper limit on how much CPU utilisation it can do. | | The thread pool size negotiation seems a necessary fix - | applications shouldn't be pre-calculating their pool sizes on | their own anyway. But you get additional (smaller) problems, like | giving more or fewer threads to a service depending on its | priority. | | One of the big problems here as I understand it is trying to use | a resource whose "size" changes dynamically (max CPU usage on a | cgroup, which can change depending on whether another prioritised | service is currently running or not) with a fixed-size resource | (nr of threads when a service starts). | | As the number of cores per CPU grows, I wonder if this whole | approach of scheduling tasks based on their CPU "usage" makes any | sense. At some point, the basic scheduling unit should be one | core, and tasks should be assigned a number of core units on the | system for a given time. | mabbo wrote: | I have to wonder why the authors skipped the potential solution | of removing containers and Mesos from the equation entirely. | | If you gave this service a dedicated, non-co-located fleet, | running the JVM directly on the OS, and ran basic autoscaling of | the number of hosts, you'd eliminate a huge number of the moving | parts of the system that are causing these issues. | | Yes, that would add to ops costs (edit: _human_ ops costs) for | this service, but when you're spending 8 figures per year on it, | clearly the budget is available. | | To quote the great philosopher Avril Lavigne: "Why'd you have to | go and make things so complicated?" | marcinzm wrote: | Isn't the problem then that each host would be underutilized on | average by a lot? It has X cpus and the service can never use | more than X cpus. If a service has any spiky loads then it'd | need to be overprovisioned on CPU to handle them at good latency. | | That seems significantly more expensive at scale. | xorcist wrote: | > that would add to ops costs for this service, | | Wouldn't fewer moving parts mean lower operational costs? | Kalium wrote: | Only to the extent that cost is a function of complexity. | This isn't always the case. In a case like this, going to | bare metal likely brings with it significant drawbacks in | organizational complexity, orchestrational complexity, and | more while allowing for much better utilization of memory and | cpu resources. | | Telling someone whose car is making some funny noises that | it's simpler to go back to horse-and-buggy times would both | increase costs and decrease the number of user-serviceable | moving parts. There's some significant overhead attached. | xorcist wrote: | Bare metal has nothing to do with this. It isn't even | touched upon in the article. It discusses a scheduler, and | the parent post suggests exempting these kinds of jobs from | the scheduler in question, which they obviously aren't a | very good product fit for.
| | Should you wish to really stretch that car analogy, maybe a | bit more appropriate than a horse would be: If you aren't | happy that your travel agency isn't booking your taxi | trips in time, try booking with the taxi company directly. | mabbo wrote: | Yes and no. | | It would lower the operations costs of hardware, hopefully | (that's the entire goal of this article) but you'd need more | people resources to manage it, I would guess. Mesos and | containers automate a lot of thinking work. | Kalium wrote: | Once you move to hosts dedicated to specific services, as | seems to be the suggestion here, you also might increase | the overall hardware cost across your set of services. The | cost per some of the services might decrease, though. | toast0 wrote: | I suspect it's the temptation of oversubscription. If service A | and service B each use 50% of a server, it's so tempting to put | them both on one server to maximize efficiency. Even if | sometimes you need 4 servers running A and B to serve the load | that can be managed with one server each of A and B. | | Or if you've broken things up into small pieces that aren't big | enough to use a whole server, that can feel inefficient as | well. | nvarsj wrote: | CFS quotas have been broken for a long time - with processes | being starved even when their utilisation is far below their quota. I think | every serious user of k8s discovers this the hard way. Recent | changes have been done to improve the scheduler for quotas but | I'm surprised Twitter was using them at all in 2019. Java GC also | suffers badly with quotas. Pinning cpu is probably the best | compromise, otherwise just use CPU requests with no limits. | genewitch wrote: | I can't imagine the man-hours that went into creating this, and, | from here on out, knowing that core contention is still an issue | that isn't solved will allow me to waltz in to contract jobs and | save companies money, e-waste, and power costs - this causes | hope, joy, something like that. | | In case anyone missed it, the removal of throttling in certain | circumstances saved Twitter ~$5mm/year, if I read it correctly. | With a naive kernel patch. While it takes dedicated engineers | decades of knowledge to know where to aim an intern, an intern | still banged out a kernel scheduling patch that made what I | assume is a huge difference. | | Dan Luu is a gem. | euiq wrote: | Note that the intern in question was close to finishing their | PhD in a related area. | wolf550e wrote: | "Low 8 figures" is more like $25M per year, and that's a single | service. Across all services it's more. | fulafel wrote: | Self-teergrubing by cpu quotas. | | Wonder what mechanism could be used to communicate the available | timeslice length so that the app/thread could stop taking on a | request when throttling is imminent. | mkhnews wrote: | Hi, I recently found similar behavior in an app for our company. | A simple threaded cpu benchmark shows: | | % numactl -C 0,5 ./ssp 12 => elapsed time: 99943 ms | | cpu.cfs_quota_us = 200000, cpu.cfs_period_us = 100000: | % cgexec -g cpu:cgtestq ./ssp 12 => elapsed time: 420888 ms | | cpu.cfs_quota_us = 2000, cpu.cfs_period_us = 1000: | % cgexec -g cpu:cgtestqx ./ssp 12 => elapsed time: 168104 ms | | Also interesting: in our app some RR thread priorities are | used, and those do not get controlled via the cgroup cpu.cfs | settings. | tybit wrote: | I realise that Twitter is using Mesos, but for those of us on | Kubernetes does guaranteed QoS solve this?
| https://kubernetes.io/docs/tasks/configure-pod-container/qua... | mac-chaffee wrote: | QoS classes are only used "to make decisions about scheduling | and evicting Pods." It still uses the Completely Fair | Scheduler, which is where the problem came from (as far as I | understand). | KptMarchewa wrote: | I think they are not using Mesos now. | | https://dzone.com/articles/what-can-we-learn-from-twitters-m... | bboreham wrote: | If you also use the CPU Manager feature and request an integer | number of cores, yes. Then for example if you request 3 cores | your process will be pinned onto 3 specific cores and nothing | else will be scheduled onto those cores, and CFS will not | throttle your process. | | https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana... | d3nj4l wrote: | As a newbie developer who hasn't dug into this stuff before, but | found this post fascinating: does anybody have any good pointers, | like books/articles/videos to learn about low-level details like | this? | closeparen wrote: | Computer Systems: A Programmer's Perspective. | | Operating Systems: Three Easy Pieces. | | Most important parts of my undergrad. Much more so than | Algorithms or anything mathematical. | mochomocha wrote: | At Netflix, we're doing a mix of what Dan calls "CPU Pinning and | Isolation" (ie, host-level scheduling controlled by user-space | logic) [1] and "Oversubscription at the cluster scheduler level" | (through a bunch of custom k8s controllers) to avoid placing | unhappy neighbors on the same box in the first place, while | oversubscribing the machines based on container usage patterns. | | [1]: https://netflixtechblog.com/predictive-cpu-isolation-of- | cont... | tyingq wrote: | That's a really terrific article, thanks for sharing. I wonder | if Linux will eventually tie the CPU scheduler together with | the cgroup cpu affinity functionality, and some awareness of | cores, smt, shared cache, etc. Seems a shame that you have to | tie all that together yourself, including a solver. | eternalban wrote: | The article mentions "nice values". What does that mean? | Underutilization/under-provisioning? | | [p.s. thanks for the replies] | bboreham wrote: | "nice" in Unix is a way to lower the priority of a process, | so that others are more likely to be scheduled. | | Eg https://man7.org/linux/man-pages/man2/nice.2.html | chrisoverzero wrote: | It's the kind of values used by things like `nice(2)`: | https://linux.die.net/man/2/nice | | In short, an offset from the base process priority. | staticassertion wrote: | I remember working in Java where we'd have huge threadpools that | sat idle 90% of the time. | | It feels like you can eliminate most of this problem in other | languages by using a much smaller pool and then leveraging | userland concurrency/scheduling. You probably don't want to have | N cores and N + K threads, but in some languages you don't have | much choice. Java has options for userland concurrency but | they're pretty ugly and I don't think you'll find a lot of | integration. | | Containers make this a bit harder, and the Linux kernel sounds | like it had a pretty silly default behavior, but how much of this | is also just Java? | le-mark wrote: | I don't think blaming the JVM is productive, but identifying "JVM" | as a proxy for "language runtime designed to optimize for multi- | processor machines" is the core element here.
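A minimal sketch of that pool-sizing mismatch, assuming a recent JDK whose container support makes availableProcessors() reflect the cgroup CPU limit rather than the host's core count; the class name and pool sizes below are invented for illustration and are not anyone's production configuration:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            // On JDKs with container support, this is intended to reflect the
            // cgroup CPU limit rather than the host's physical core count.
            int cpus = Runtime.getRuntime().availableProcessors();

            // The pattern criticised above: hundreds of mostly-idle threads
            // that can all wake at once and burn the CFS quota early in a period.
            ExecutorService oversized = Executors.newFixedThreadPool(200);

            // A bounded alternative: roughly one worker per CPU the container
            // can actually use, leaving headroom for GC and runtime threads.
            ExecutorService bounded = Executors.newFixedThreadPool(Math.max(1, cpus - 1));

            oversized.shutdown();
            bounded.shutdown();
        }
    }

The bounded pool stays roughly within what CFS will actually let the container run per period, which is the mismatch the article describes.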
| | One could imagine a vm or runtime that is async and | multiprocess that also enforces quotas on cycles and heap such | that these types of "noisy neighbor" events aren't a problem. | | In this direction there have been solutions that haven't caught | on; a multi tenant jvm existed at one time, and at least one js | implementation has this ability. I've often thought Lua would | be ideal for this. | staticassertion wrote: | Yeah to be clear I'm not saying all fault lies with the JVM | here. But a lack of concurrency primitives exacerbates the | problem by encouraging very large threadpools. | neerajsi wrote: | I haven't used it in anger, but it looks to me like the C# | async compiler and library support helps reduce the need | for large threadpools. | | But it also looks like the GC was a major contributor, so | that would not be as influenced by the differences between | dotnet and Java. | YokoZar wrote: | > The gains for doing this for individual large services are | significant (in the case of service-1, it's [mid 7 figures per | year] for the service and [low 8 figures per year] including | services that are clones of it, but tuning every service by hand | isn't scalable. | | This point seems wrong to me, bound too much by requiring | solutions to be done by a small team of engineers who already | have a mandate to work on the problem. | | With numbers like that Twitter could, profitably, hire dozens of | engineers that do literally nothing else. Just tweak thread pool | sizes all day, every day, for service after service. Even though | it's a boring, manual, "noncomplex" thing, this type of work is | clearly valuable and should have happened years ago. | | Most likely Twitter's job ladder, promotion process, and hiring | pipeline is highly incentivizing people to avoid such work even | when it has clear impact. They are very much not alone in that | regard. | xyzzy_plugh wrote: | I solved this same problem for a company also in 2019 (as the | CPU quota bug hadn't been fixed yet) and it resulted in | something like 8 figures of yearly cost savings. | | You are correct in that most companies are not equipped to | staff issues like this. Most places just accept their bills as | a cost of doing business, not something that can be optimized. | karmakaze wrote: | I can see how there are diminishing returns when optimizing | but I would never say that server bills are not a metric to | be aware of and address. I've always had some idea of what's | practically achievable in terms of efficiency within a given | architecture and aim for something that gets a good amount of | the way there without undue effort. I also enjoy thinking of | longer term improvements for efficiency whether that could | improve latency or the bottom line and at the same time know | that's secondary to providing additional value and gaining | customers during a growth period. | [deleted] | eloff wrote: | This. I've never worked anywhere that had dedicated ongoing | effort to cost reduction in compute services. It's always a | once every couple of years thing to look at the cloud | spending and spend a little effort dealing with the low | hanging fruit. | _jal wrote: | A side effect of deciding systems engineers can be replaced | by devops. | | In reality, you want both. A good systems person can save you | a ton of money. 
| ithkuil wrote: | A nice approach is to staff a DevOps team with people from | diverse backgrounds; some more towards the system side of | the spectrum and some more towards the dev side of the | spectrum. As long as everybody knows a little bit of the | other side. This helps in avoiding a culture where devs | "throw some code over the fence" and sysops people just | moan that devs are careless and/or that they should do | things differently, but without a clear way of showing | exactly how differently things should be made (and also | without a clear understanding of why devs ended up | choosing the way they did) | spockz wrote: | Afaik, Twitter already has significant and mature | infrastructure in place to run a plethora of different | instances on shadowed traffic and compare settings. It is used | at least by the people working on the optimised JRE. | mnutt wrote: | There was a startup I talked to at one point that had a service | where they'd run an agent on your instances that would collect | performance data and live tune kernel parameters, and they had | some AI to find the best parameters for your workload. No idea | how well it worked, but it seems like a potentially good | application of AI. | servytor wrote: | Do you remember the name for it? Sounds really useful. | syngrog66 wrote: | Log4Shell.com | vegetablepotpie wrote: | They could hire contractors or consultants to do the job, no? | That class of worker would not be concerned about promotion | opportunities. For some reason they haven't done that either. | joshuamorton wrote: | > With numbers like that Twitter could, profitably, hire dozens | of engineers that do literally nothing else. Just tweak thread | pool sizes all day, every day, for service after service. Even | though it's a boring, manual, "noncomplex" thing, this type of | work is clearly valuable and should have happened years ago. | | The issue is that once you hire a dozen engineers to do this | (say for 5M a year in total), and they do it for a year, they | save mid 8 figures (keep in mind this was the largest service, | so the savings across other services will be smaller). | | Then can they keep saving mid 8 figures every year? | | I'll paraphrase something I previously wrote privately, but | imagine you have some team that's able to save 10% of your | fleetwide resources this year. They densify and optimize and | improve defaults. So you can now grow your services by 10% | without any additional cost increase, and you do! The next | year, they've already saved the easiest 10%. Can they save 10% | again? Can they keep it up every year? How long until they're | saving 3% a year, or 1% a year? And that's if you keep the team | the same size, where it's clearly losing money! If you could | afford a dozen people to save 10%, you can only really afford | 1-2 to save 1%, but then you're likely to get an even smaller | return. | | Unless you expect to be able to maintain the same relative | value of optimizations every year, 3 or 5 years out, it's not | worth it to hire an FTE to work on them. | | I should note that I've experienced this myself: I was working | in an area where resource optimization could lead to | "significant" savings (not 8 figures, but 6 or maybe 7). My | first 6 months working in this area, I found all sorts of low | hanging fruit and saved a fair amount. The second six months, | 5-10x less ROI. I gave up even trying in the third six months; | if I come across a thing, I'll fix it, but it's no longer | worthwhile to _look_.
| ZephyrBlu wrote: | If you're looking in the same area/domain, what you're saying | is almost certainly true. | | If you're looking across the business as a whole, it seems | likely that there is a lot of this kind of work lying around | because there is not much incentive for people to tackle it | as described in this comment: | https://news.ycombinator.com/item?id=29691847. | ghusbands wrote: | Certainly, at one place I worked, the higher-ups were very | clear that any work on cost reduction was wasted and devs and | ops should always work on increasing net income, not decreasing | costs. It was consistently claimed that cost reduction can only | get you small percentage decreases, whereas increases in income | are larger and compound better. | avianlyric wrote: | The big difference between cost reduction and income | increase, is that one has a hard limit on possible upside, | whereas the other does not. You can reduce your costs by more | than your total costs, but it's quite possible to increase | your income by many multiples of your existing income. | | Result is that maximising income is generally better than | reducing cost. Of course, as with all generalisation, there | are situations where this approach doesn't hold true. But as | a high level first order strategy, it's a good one to adopt. | ClumsyPilot wrote: | "one has a hard limit on possible upside, whereas the other | does not." | | That's plain wrong; the global market for cars, bicycles and | what have you has a limited size. Every large company | that's a market leader understands that. | beiller wrote: | Would capturing 100% of the market, granting a monopoly, | essentially grant unlimited upside? Cause you can just | jack the price to absurdity? Also wouldn't reducing cost | to zero have an infinite upside as well? Basically at zero | cost you can produce infinite output. Gettin real | pedantic here heh. | wolf550e wrote: | You meant to write "You cannot reduce your costs" | lostdog wrote: | You can do similar types of work, but target speed increases | instead. Getting all the batch jobs to finish faster can help | developer productivity, and is even worthwhile at a new | startup. | tomrod wrote: | The higher-ups need some basic economics education, it | appears. Certainly you shouldn't invest everything in long | term returns, but you should be open to it. | | Instead, when something has a payoff of 3 years, executives | get antsy in orgs that have a 2-year cycle on exec positions. | kevin_nisbet wrote: | I've had similar messages articulated to me by my manager, | and have found myself articulating similar messages to my | team. | | In my team, the key is that, for the state of the project/product | we manage, cost optimization is likely one of the lowest ROI | activities we could spend too much time on. That doesn't mean | we don't tackle some clear low hanging fruit when we see it, | or use low hanging fruit as training opportunities to onboard | new team members, but that we need to be conscious of where | we make investments, and for the stage we're at, the more | important investment is into areas that make our product more | appealing to more customers. | | I think it's easy to say someone, like an intern, could pay | for themselves with savings. But this to me overlooks that | someone has to manage that intern, get them into change | management, review the work, investigate mistakes or | interruptions from the changes, etc.
And then they're still | the lowest earning employee, since most of us aren't hired to | pay for ourselves, but actually to turn a profit for the | company. | | So while I'm not sure I agree with the message "whereas | increases in income are larger and compound better.", I | certainly understand and have pushed a similar message, that | we be conscious of where we're spending our time, and that | we're selecting the highest impact activities based on the | resources and team we have. Sometimes that may be fixing high | wastage, but very frequently that will be investing into the | product. And I think for the stage of the product we manage, | that is the best choice for us. | R0b0t1 wrote: | This was verified by (what should be) a famous Harvard | Business school study. Quality before cost, revenue before | expenses, and there is no three. | richardwhiuk wrote: | This article is quite old - the kernel patch has been available | for a while now, I believe, and CMK is no longer in beta (the | article references K8s 1.8 and 1.10, but the current latest | version is 1.23). | cpitman wrote: | There are updates from this month at the bottom! | [deleted] | throwaway984393 wrote: | I wonder if k8s' bin-packing features would help here. | | The graphs seem to validate my general assumption that large- | load tasks just suck at scaling whereas small-load tasks can be | horizontally scaled more easily without falling over. The general | assumption being that for most applications, if you ignore | everything else about an operation and assume a somewhat random | distribution of load, smaller-load services use up more available | resources on average than a single large-load service. That's | just been an assumption in my head, I can't remember any data to | back that up. | | Back in the day when I worked on a large-traffic internet site, | we tried jiggering with the scheduler and other kernel tweaks, | and in the end we literally just routed certain kinds of requests | to certain machines and said "these requests need loads of cache" | (memcache, local disk, nfs cache, etc) and "these requests need | fast io" and "these requests need a ton of cpu". It was "dumb" | engineering but it worked out. | zekrioca wrote: | I think there is another solution, not discussed in the article, | which lies between CPU isolation and pinning, and that is | virtualizing the container's /proc so that it doesn't think the | number of available (logical) processors is larger than a certain | limit set by the cluster operator - a limit which is actually lower | than the physical capacity of the server (so as to allow overbooking | and increase their 'redacted' savings in $M). This is basically | presenting a container/application with a number of vCPUs that it | can use in any way it sees fit, but with all the (invisible) | control group (quota) limits (i.e., "throttling") the author | discusses in the text, and it prevents the application from spawning so | many threads that it inevitably overloads the physical server and | destroys tail latency. | | This is at the kernel level, as opposed to paravirtualization. And I | guess this is Twitter's use case, but it should not be confused with | the typical vCPU offers one sees in most cloud providers, which are | usually done through hypervisors such as Qemu/KVM, VMware, or | Xen.
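A minimal sketch of the arithmetic such a scheme would surface, assuming cgroup v1 file locations under /sys/fs/cgroup/cpu (cgroup v2 exposes the same numbers through a single cpu.max file); the class name and the round-up choice are invented for illustration:

    import java.nio.file.Files;
    import java.nio.file.Path;

    public class EffectiveCpus {
        // Derive the CPU entitlement implied by the CFS quota, falling back to
        // the visible processor count when no quota is set (quota == -1).
        static int effectiveCpus() throws Exception {
            Path base = Path.of("/sys/fs/cgroup/cpu");  // cgroup v1 layout (assumption)
            long quota = Long.parseLong(
                    Files.readString(base.resolve("cpu.cfs_quota_us")).trim());
            long period = Long.parseLong(
                    Files.readString(base.resolve("cpu.cfs_period_us")).trim());
            if (quota <= 0 || period <= 0) {
                return Runtime.getRuntime().availableProcessors();
            }
            // Round up: a quota of 250000us over a 100000us period behaves
            // like roughly 3 CPUs (2.5 rounded up).
            return (int) Math.max(1, (quota + period - 1) / period);
        }

        public static void main(String[] args) throws Exception {
            System.out.println("effective CPUs: " + effectiveCpus());
        }
    }

An application (or a virtualized /proc) that advertised this number instead of the host's core count would naturally size its thread pools to what the quota actually allows.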
| | I'm not sure why Mesos (maybe it tried and didn't succeed), | K8S (available through external Intel code) or even Docker | never really thought about that, but I guess they want to keep | their internal (operational) overheads up to a limit, and | possibly also to maintain the metastability of their services | [1]. But now we see where it leads, with all these redacted | numbers in the article. | | [1] | https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s... | | Ps: edits for clarifications. | KaiserPro wrote: | We had a similar problem, but it manifested differently. | | We had two lumps of compute: | | 1) huge render farm, at the time it was 36k CPU. The driving goal | was 100% utilisation. It was a shared resource, and when someone | wasn't using their share it was aggressively loaned out (both | CPU and licenses). Latency wasn't an issue. | | 2) much smaller VM fleet. Latency was an issue. Even though the | contention was much less, as was the utilisation. | | Number two was the biggest issue. We had a number of processes | that needed 100% of one CPU all the time, and they were | stuttering. Even though the VM thought they were getting 100% of | a core, they were in practice getting ~50% according to the | hypervisor. (this was a 24 core box, with only one CPU heavy | process) | | After much graphing, it turned out that it was because we had too | many VMs on a machine defined with 4-8 CPUs. Because the hypervisor | won't allocate only 2 CPUs to a 4 CPU VM, there was lots of spin | locking waiting for space to schedule the VM. This meant that | even though the VMs thought they were getting 100% cpu, the host | was actually giving the VM 25%. | | The solution was to have more, smaller machines. The more threads | you ask to be scheduled at the same time, the less ability there is | to share. | | We didn't see this on the big farm, because the only thing we | constrained was memory. The orchestrator would make sure that a | thing configured for 4 threads was put in a 4 thread slot, but we | would configure each machine to have 125% CPU allocated to it. | 3np wrote: | The way the issue is presented, it sounds to me like context | switching should be one of the major considerations, especially | when talking about CPU pinning. Yet it's barely mentioned in | passing. How come? | bboreham wrote: | Dave Chiluk did a great talk covering a similar scheduler | throttling problem. | | https://m.youtube.com/watch?v=UE7QX98-kO0 ___________________________________________________________________ (page generated 2021-12-26 23:01 UTC)