[HN Gopher] The Container Throttling Problem
       ___________________________________________________________________
        
       The Container Throttling Problem
        
       Author : rognjen
       Score  : 214 points
       Date   : 2021-12-26 08:42 UTC (14 hours ago)
        
 (HTM) web link (danluu.com)
 (TXT) w3m dump (danluu.com)
        
       | jeffrallen wrote:
       | Tl;dr, which is too bad, because normally danluu's stuff is
       | great.
       | 
       | From the bit I had patience to read it sounds like "we made a
       | complicated thing and it's doing complicated things wrong in
       | complicated ways".
       | 
        | It is hard to believe that some of these CPU-heavy,
        | latency-sensitive servers should really be in containers. Why
        | are they not on dedicated machines? KISS.
        
         | marcosdumay wrote:
         | Linux is optimized for desktops and shared servers. When you
          | own the entire machine and want to use it fully, that
         | optimization gets in your way.
        
       | londons_explore wrote:
       | I think this problem would have been debugged and solved much
       | quicker if they'd done a CPU scheduling trace. Then they could
        | see, microsecond by microsecond, exactly which processes were
        | doing which work, and which incoming requests were still waiting.
       | 
       | Then, let a human go in and say "How come request#77 hasn't yet
       | been processed at this point, even though CPU#3 is working on
       | printing unused debug data for a low priority request and #77 is
       | well after its deadline!??".
       | 
       | Then you debug deeper and deeper, adjusting parameters and
       | patching algorithms till you can get a CPU trace that a human can
       | look at and think "yeah, I couldn't adjust this schedule by hand
       | to get this work done better".
       | 
       | In this process, most people/teams will find at least 10x
       | performance gains if they've never done it before, and usually
       | still 2x if you limit changes to one layer of the stack ('eg. Im
       | just tweaking the application code - we won't touch the runtime,
       | VM, OS or hypervisor parameters').
        
         | neerajsi wrote:
         | I don't know why the negative reaction. I've done the kind of
         | analysis you've described many times and essentially been able
         | to quickly identify such issues over the years. We had a
         | similar problem in Windows when we first implemented the
         | Dynamic Fair Share thread scheduler. It took a couple months to
         | have the right tooling to do a proper scheduler trace, but with
         | that available the problem was better understood in a week. I
         | eventually rewrote the scheduler component and added a control
         | law to give better burstable behavior than the hard cap quota
         | that this article seems to be describing.
        
         | ghusbands wrote:
         | That does not cover almost anything in the article. It's a long
         | article, so maybe you could quote the bit you're responding to.
         | 
         | A CPU scheduling trace wouldn't easily show you the details of
         | the kernel-level group throttling that was causing a lot of
         | issues, for example. They weren't having an issue with threads
         | fighting other threads, they were having an issue with threads
         | being penalised now for activity from several seconds ago,
         | drastically reducing the amount of available CPU.
         | 
         | The article clearly shows a lot of debugging and diagnostic
         | patching ability, so it's unlikely they missed the simple
         | options. Rather, they probably didn't mention them because they
         | were obvious to try and didn't help.
        
           | londons_explore wrote:
           | > threads being penalised now for activity from several
           | seconds ago,
           | 
           | Exactly... They would have found this out much quicker with a
           | trace. They would have seen "how come this application level
           | request is being handled on thread number X, yet that thread
           | is not running on any core, and many cores are idle"? Then
            | quickly they could see the reason that thread isn't scheduled
            | by enabling extra tracing detail, exposing the internal data
            | structures used by the scheduler to show why something is
            | schedulable or not at that instant.
        
             | jeffbee wrote:
             | I completely agree. KUTrace would have been ideal for this
             | and indeed KUTrace was developed to diagnose this exact
             | problem.
        
             | ghusbands wrote:
              | I think you're suffering from hindsight bias here. A trace
              | is
             | rarely as clear as that, and it's hard to see the details
             | it's not designed to expose.
             | 
             | Your original message would probably be better received if
             | you'd omitted the "I think this problem would have been
             | debugged and solved much quicker [...]" and its insulting
             | implications and instead started with "Sometimes, I find
             | that CPU activity traces can really help with diagnosing
             | this sort of problem".
        
               | The_rationalist wrote:
               | Please stop advocating for politeness over correctness.
                | Sure, hindsight helps, but regardless, a company such as
               | Twitter should have experts at tracing that have tools
               | and knowledge that goes beyond the average developer
               | knowledge about tracing methodologies. Excusing that is
               | an appeal to a lowering of technical excellence
                | worldwide, which is majorly important and matters more
               | than hypothetical feelings.
        
               | londons_explore wrote:
               | > a company such as Twitter should have experts at
               | tracing
               | 
               | In a big company, getting the person with the most skills
               | to solve a problem to be the one actually tasked with
               | solving the problem is very hard. This particular problem
               | had many avenues to find a solution - and while I think
               | my proposed route would have been quicker, if you aren't
               | aware of those tools or techniques, then other avenues
               | might be much quicker. When starting an investigation
               | like this, you don't know where you're going to end up
               | either - if it turned out that the performance cliff was
               | caused by CPU thermal throttling, it would be hard to see
               | in a scheduling trace - everything would just seem
               | universally slow all of a sudden.
        
               | neerajsi wrote:
               | On Windows, we have the xperf and wpa toolset that makes
               | looking at holistic scheduling performance, including
                | processor power management and device IO, tractable. Even
               | then, the skillset to analyze an issue like the one
               | presented here takes months to acquire and only a few
               | engineers can do it. We have dedicated teams to do this
               | performance analysis work, and they're always in high
               | demand.
        
       | treffer wrote:
       | I have been running k8s clusters at utilizations far beyond 50%
        | (up to 90% during incidents). For web services/microservices,
        | where tail latencies were important.
       | 
        | The way we solved this? 1. Kernel settings. Check e.g. the
        | settings of the Ubuntu low latency kernel. 2. CFS tuning. Short
        | timeslices. There is good documentation on how to do that. 3.
        | CPU pressure. We cordoned and load-shedded overloaded nodes
        | (k8s-pressurecooker).
       | 
       | By limiting the maximum CPU pressure to 20% you can say "every
       | service will get all the CPU it needs at least 80% of the time on
       | most nodes". This is what you want. A low chance of seeing CPU
       | exhaustion. This is needed for predictable and stable tail
       | latencies.
       | 
        | There are a few more knobs. E.g. scale services such that they
        | use at least one core, as requests are effectively limits under
        | congestion and you can't get half a core continuously.
       | 
       | Very nice to see that people go public about this. We need to
       | drop the footprint of services. It is straight up wasted money
       | and CO2.
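The CPU-pressure signal such a policy keys off is Linux PSI (pressure stall information); below is a minimal, hypothetical sketch of reading it in Python (the actual k8s-pressurecooker tooling may work differently):

```python
# Minimal sketch of reading Linux PSI ("pressure stall information") for
# CPU, the signal a max-20%-pressure policy like the one above could key
# off. A line of /proc/pressure/cpu looks like:
#   some avg10=31.40 avg60=12.00 avg300=4.00 total=987654
# "some avg10" = share of the last 10 s during which at least one
# runnable task was stalled waiting for a CPU. Helper names are invented.

def parse_psi(text):
    out = {}
    for line in text.splitlines():
        kind, *fields = line.split()          # "some" (and "full" on cgroups)
        out[kind] = {k: float(v)
                     for k, v in (f.split("=") for f in fields)}
    return out

def over_pressure_limit(psi, limit=20.0):
    # The cordoning policy sketched above: shed load off any node whose
    # 10-second CPU pressure exceeds `limit` percent.
    return psi["some"]["avg10"] > limit

sample = "some avg10=31.40 avg60=12.00 avg300=4.00 total=987654"
```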
        
       | diegocg wrote:
       | Quite interesting problem. It is indeed a contradiction to make a
        | service use all the CPUs on a system and, at the same time, have
       | an upper limit over how much CPU utilisation they can do.
       | 
       | The thread pool size negotiation seems a necessary fix -
        | applications shouldn't be pre-calculating their pool sizes on
        | their own anyway. But you get additional (smaller) problems, like
        | giving more or fewer threads to some service depending on their
       | priority.
       | 
       | One of the big problems here as I understand it is trying to use
       | a resource whose "size" changes dynamically (Max CPU usage on a
       | cgroup, which can change depending on whether other prioritised
       | service is currently running or not) with a fixed sized resource
       | (nr of threads when a service starts).
       | 
       | As the number of cores per CPU grows, I wonder if this whole
       | approach of scheduling tasks based on their CPU "usage" makes any
       | sense. At some point, the basic scheduling unit should be one
       | core, and tasks should be assigned a number of core units on the
       | system for a given time.
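The mismatch described above (a fixed pool sized against a dynamic limit) is why container-aware runtimes derive pool sizes from the cgroup quota rather than the host's core count. A hedged sketch, assuming the cgroup v1 file layout; function names are invented:

```python
import math
import os

def cpus_from_quota(quota_us, period_us, host_cpus):
    """CPU count implied by a CFS quota; quota <= 0 means 'unlimited'."""
    if quota_us <= 0:
        return host_cpus
    return max(1, math.ceil(quota_us / period_us))

def effective_cpus(cgroup_dir="/sys/fs/cgroup/cpu"):
    """Pool-size hint for a CPU-bound thread pool inside a container."""
    try:
        with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us")) as f:
            quota = int(f.read())
        with open(os.path.join(cgroup_dir, "cpu.cfs_period_us")) as f:
            period = int(f.read())
    except OSError:
        return os.cpu_count()    # no cgroup limit visible on this host
    return cpus_from_quota(quota, period, os.cpu_count())
```

E.g. a 200000 us quota over a 100000 us period yields a 2-thread hint even on a 56-core box, roughly what container-aware JVMs report via availableProcessors(). It still leaves the problem the comment raises: the quota can change after the pool is sized.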
        
       | mabbo wrote:
       | I have to wonder why the authors skipped the potential solution
       | of removing containers and mesos from the equation entirely.
       | 
       | If you gave this service a dedicated, non-co-located fleet,
       | running the JVM directly on the OS, and ran basic autoscaling of
       | the number of hosts, you'd eliminate a huge number of the moving
       | parts of the system that are causing these issues.
       | 
       | Yes, that would add to ops costs (edit: _human_ ops costs) for
        | this service, but when you're spending 8 figures per year on it,
       | clearly the budget is available.
       | 
       | To quote the great philosopher Avril Lavigne: "Why'd you have to
       | go and make things so complicated?"
        
         | marcinzm wrote:
         | Isn't the problem then that each host would be underutilized on
         | average by a lot? It has X cpus and the service can never use
          | more than X CPUs. If a service has any spiky loads then it'd
          | need to be overprovisioned with CPU to handle them at good
          | latency.
         | 
         | That seems significantly more expensive at scale.
        
         | xorcist wrote:
         | > that would add to ops costs for this service,
         | 
         | Wouldn't fewer moving parts mean lower operational costs?
        
           | Kalium wrote:
           | Only to the extent that cost is a function of complexity.
           | This isn't always the case. In a case like this, going to
           | bare metal likely brings with it significant drawbacks in
           | organizational complexity, orchestrational complexity, and
           | more while allowing for much better utilization of memory and
           | cpu resources.
           | 
           | Telling someone whose car is making some funny noises that
           | it's simpler to go back to horse-and-buggy times would both
            | increase costs and decrease the number of user-serviceable
           | moving parts. There's some significant overhead attached.
        
             | xorcist wrote:
             | Bare metal has nothing to do with this. It isn't even
             | touched upon in the article. It discusses a scheduler, and
             | the parent post suggests exempting these kind of jobs from
             | the scheduler in question, which they obviously aren't a
             | very good product fit for.
             | 
              | Should you wish to really stretch that car analogy, maybe a
              | bit more appropriate than a horse would be: if you aren't
              | happy that your travel agency isn't booking your taxi trips
              | in time, try booking with the taxi company directly.
        
           | mabbo wrote:
           | Yes and no.
           | 
           | It would lower the operations costs of hardware, hopefully
           | (that's the entire goal of this article) but you'd need more
           | people resources to manage it, I would guess. Mesos and
           | containers automate a lot of thinking work.
        
             | Kalium wrote:
             | Once you move to hosts dedicated to specific services, as
             | seems to be the suggestion here, you also might increase
             | the overall hardware cost across your set of services. The
             | cost per some of the services might decrease, though.
        
         | toast0 wrote:
         | I suspect it's the temptation of oversubscription. If service A
         | and service B each use 50% of a server, it's so tempting to put
         | them both on one server to maximize efficiency. Even if
         | sometimes you need 4 servers running A and B to serve the load
         | that can be managed with one server each of A and B.
         | 
         | Or if you've broken things up into small pieces that aren't big
         | enough to use a whole server, that can feel inefficient as
         | well.
        
       | nvarsj wrote:
        | CFS quotas have been broken for a long time - with processes
        | being throttled well before they use up their quota. I think
       | every serious user of k8s discovers this the hard way. Recent
       | changes have been done to improve the scheduler for quotas but
        | I'm surprised Twitter was using them at all in 2019. Java GC also
       | suffers badly with quotas. Pinning cpu is probably the best
       | compromise, otherwise just use CPU requests with no limits.
        
       | genewitch wrote:
       | I can't imagine the man-hours that went into creating this, and,
       | from here on out, knowing that core contention is still an issue
        | that isn't solved will allow me to waltz into contract jobs and
       | save companies money, e-waste, and power costs - this causes
       | hope, joy, something like that.
       | 
       | In case anyone missed it, the removal of throttling in certain
       | circumstances saved twitter ~$5mm/year, if I read it correctly.
       | With a naive kernel patch. While it takes dedicated engineers
       | decades of knowledge to know where to aim an intern, an intern
        | still banged out a kernel scheduling patch that made what I
        | assume is a huge difference.
       | 
       | Dan Luu is a gem.
        
         | euiq wrote:
         | Note that the intern in question was close to finishing their
         | PhD in a related area.
        
         | wolf550e wrote:
          | "Low 8 figures" is more like $25M per year, and that's a single
         | service. Across all services it's more.
        
       | fulafel wrote:
       | Self-teergrubing by cpu quotas.
       | 
       | Wonder what mechanism could be used to communicate the available
       | timeslice length so that the app/thread could stop taking on a
       | request when throttling is imminent.
        
       | mkhnews wrote:
       | Hi, I recently found similar behavior in an app for our company.
       | A simple threaded cpu benchmark shows:
       | 
        | % numactl -C 0,5 ./ssp 12
        |   elapsed time: 99943 ms
        | 
        | cpu.cfs_quota_us = 200000, cpu.cfs_period_us = 100000:
        | % cgexec -g cpu:cgtestq ./ssp 12
        |   elapsed time: 420888 ms
        | 
        | cpu.cfs_quota_us = 2000, cpu.cfs_period_us = 1000:
        | % cgexec -g cpu:cgtestqx ./ssp 12
        |   elapsed time: 168104 ms
       | 
       | Also interesting was in our app some RR thread priorities are
       | used, and those do not get controlled via the cgroup cpu.cfs
       | settings.
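Both cgroup configurations above permit the same average rate (quota/period = 2 CPUs), so an idealized model predicts equal throughput; what differs is how long the group can be frozen inside each enforcement period. A rough back-of-envelope sketch (real CFS behavior has more moving parts than this):

```python
def burst_and_stall_us(quota_us, period_us, nthreads):
    """With all threads busy, roughly how long the group runs and then
    stalls within one enforcement period (microseconds, idealized)."""
    run = min(period_us, quota_us / nthreads)
    return run, period_us - run

# 200000/100000 with 12 threads: ~16.7 ms of running, then an ~83 ms
# freeze every period. With the 2000/1000 config, each freeze is under
# 1 ms, at the price of far more frequent throttle/unthrottle events.
long_period = burst_and_stall_us(200000, 100000, 12)
short_period = burst_and_stall_us(2000, 1000, 12)
```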
        
       | tybit wrote:
        | I realise that Twitter is using Mesos, but for those of us on
       | Kubernetes does guaranteed QoS solve this?
       | https://kubernetes.io/docs/tasks/configure-pod-container/qua...
        
         | mac-chaffee wrote:
         | QoS classes are only used "to make decisions about scheduling
         | and evicting Pods." It still uses the Completely Fair
         | Scheduler, which is where the problem came from (as far as I
         | understand).
        
         | KptMarchewa wrote:
         | I think they are not using Mesos now.
         | 
         | https://dzone.com/articles/what-can-we-learn-from-twitters-m...
        
         | bboreham wrote:
         | If you also use the CPU Manager feature and request an integer
         | number of cores, yes. Then for example if you request 3 cores
         | your process will be pinned onto 3 specific cores and nothing
         | else will be scheduled onto those cores, and CFS will not
         | throttle your process.
         | 
         | https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
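The eligibility rule described here (static CPU Manager policy: Guaranteed QoS plus an integer CPU request) can be sketched as a simple predicate. This is an illustration with invented names, not the Kubernetes client API, and real QoS classing considers every container in the pod:

```python
def gets_exclusive_cores(requests, limits):
    """True if a container would get pinned, exclusive cores under the
    static CPU Manager policy: Guaranteed QoS (requests == limits) plus
    an integer CPU request. CPU values here are plain numbers; the real
    API also accepts strings like "500m"."""
    if requests != limits:            # not Guaranteed QoS
        return False
    cpu = requests.get("cpu")
    if cpu is None or cpu <= 0:
        return False
    return float(cpu).is_integer()

spec = {"cpu": 3, "memory": "4Gi"}    # matching limits -> 3 pinned cores
```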
        
       | d3nj4l wrote:
        | As a newbie developer who hasn't dug into this stuff before, but
       | found this post fascinating: does anybody have any good pointers,
       | like books/articles/videos to learn about low-level details like
       | this?
        
         | closeparen wrote:
         | Computer Systems: A Programmer's Perspective.
         | 
         | Operating Systems: Three Easy Pieces.
         | 
         | Most important parts of my undergrad. Much more so than
          | Algorithms or anything mathematical.
        
       | mochomocha wrote:
       | At Netflix, we're doing a mix of what Dan calls "CPU Pinning and
       | Isolation" (ie, host-level scheduling controlled by user-space
       | logic) [1] and "Oversubscription at the cluster scheduler level"
       | (through a bunch of custom k8s controllers) to avoid placing
       | unhappy neighbors on the same box in the first place, while
       | oversubscribing the machines based on containers usage patterns.
       | 
       | [1]: https://netflixtechblog.com/predictive-cpu-isolation-of-
       | cont...
        
         | tyingq wrote:
         | That's a really terrific article, thanks for sharing. I wonder
         | if Linux will eventually tie the CPU scheduler together with
         | the cgroup cpu affinity functionality, and some awareness of
         | cores, smt, shared cache, etc. Seems a shame that you have to
         | tie all that together yourself, including a solver.
        
         | eternalban wrote:
         | The article mentions "nice values". What does that mean?
         | Underutilization/under-provisioning?
         | 
         | [p.s. thanks for the replies]
        
           | bboreham wrote:
           | "nice" in Unix is a way to lower the priority of a process,
           | so that others are more likely to be scheduled.
           | 
           | Eg https://man7.org/linux/man-pages/man2/nice.2.html
        
           | chrisoverzero wrote:
            | It's the kind of value used by things like `nice(2)`:
           | https://linux.die.net/man/2/nice
           | 
           | In short, an offset from the base process priority.
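A tiny demonstration of nice values via Python's binding to nice(2) (note that an unprivileged process may raise its niceness but not lower it):

```python
import os

before = os.nice(0)   # an increment of 0 just reports the current value
after = os.nice(5)    # ask to be deprioritized by 5 (higher = "nicer")
# `after` is `before + 5` unless we were already near the cap of 19.
```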
        
       | staticassertion wrote:
       | I remember working in Java where we'd have huge threadpools that
       | sat idle 90% of the time.
       | 
       | It feels like you can eliminate most of this problem in other
       | languages by using a much smaller pool and then leveraging
       | userland concurrency/ scheduling. You probably don't want to have
       | N cores and N + K threads, but in some languages you don't have
       | much choice. Java has options for userland concurrency but
       | they're pretty ugly and I don't think you'll find a lot of
       | integration.
       | 
       | Containers make this a bit harder, and the Linux kernel sounds
       | like it had a pretty silly default behavior, but how much of this
       | is also just Java?
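The small-pool alternative sketched above multiplexes many in-flight requests onto few OS threads via userland scheduling; Python's asyncio stands in here for whatever userland concurrency a given runtime offers:

```python
import asyncio

async def handle(i):
    # Stand-in for a request that spends most of its life awaiting I/O;
    # while one request waits, the event loop runs the others.
    await asyncio.sleep(0)
    return i * 2

async def main():
    # 1000 concurrent "requests" on a single OS thread - no 1000-thread
    # pool sitting idle 90% of the time.
    return await asyncio.gather(*(handle(i) for i in range(1000)))

results = asyncio.run(main())
```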
        
         | le-mark wrote:
          | I don't think blaming the JVM is productive, but identifying
          | "JVM" as a proxy for "language runtime designed to optimize
          | multi-processor machines" is the core element here.
         | 
         | One could imagine a vm or runtime that is async and
         | multiprocess that also enforces quotas on cycles and heap such
         | that these types of "noisy neighbor" events aren't a problem.
         | 
         | In this direction there have been solutions that haven't caught
          | on; a multi-tenant JVM existed at one time, and at least one JS
         | implementation has this ability. I've often thought Lua would
         | be ideal for this.
        
           | staticassertion wrote:
           | Yeah to be clear I'm not saying all fault lies with the JVM
           | here. But a lack of concurrency primitives exacerbates the
           | problem by encouraging very large threadpools.
        
             | neerajsi wrote:
             | I haven't used it in anger, but it looks to me like the C#
             | async compiler and library support helps reduce the need
             | for large threadpools.
             | 
             | But it also looks like the GC was a major contributor, so
             | that would not be as influenced by the differences between
             | dotnet and Java.
        
       | YokoZar wrote:
       | > The gains for doing this for individual large services are
       | significant (in the case of service-1, it's [mid 7 figures per
       | year] for the service and [low 8 figures per year] including
        | services that are clones of it), but tuning every service by hand
       | isn't scalable.
       | 
       | This point seems wrong to me, bound too much by requiring
       | solutions to be done by a small team of engineers who already
       | have a mandate to work on the problem.
       | 
       | With numbers like that Twitter could, profitably, hire dozens of
       | engineers that do literally nothing else. Just tweak thread pool
       | sizes all day, every day, for service after service. Even though
       | it's a boring, manual, "noncomplex" thing, this type of work is
       | clearly valuable and should have happened years ago.
       | 
       | Most likely Twitter's job ladder, promotion process, and hiring
       | pipeline is highly incentivizing people to avoid such work even
       | when it has clear impact. They are very much not alone in that
       | regard.
        
         | xyzzy_plugh wrote:
         | I solved this same problem for a company also in 2019 (as the
         | CPU quota bug hadn't been fixed yet) and it resulted in
         | something like 8 figures of yearly cost savings.
         | 
         | You are correct in that most companies are not equipped to
         | staff issues like this. Most places just accept their bills as
         | a cost of doing business, not something that can be optimized.
        
           | karmakaze wrote:
           | I can see how there are diminishing returns when optimizing
           | but I would never say that server bills are not a metric to
           | be aware of and address. I've always had some idea of what's
           | practically achievable in terms of efficiency within a given
           | architecture and aim for something that gets a good amount of
           | the way there without undue effort. I also enjoy thinking of
           | longer term improvements for efficiency whether that could
           | improve latency or the bottom line and at the same time know
           | that's secondary to providing additional value and gaining
           | customers during a growth period.
        
           | [deleted]
        
           | eloff wrote:
           | This. I've never worked anywhere that had dedicated ongoing
           | effort to cost reduction in compute services. It's always a
           | once every couple of years thing to look at the cloud
           | spending and spend a little effort dealing with the low
           | hanging fruit.
        
           | _jal wrote:
           | A side effect of deciding systems engineers can be replaced
           | by devops.
           | 
           | In reality, you want both. A good systems person can save you
           | a ton of money.
        
             | ithkuil wrote:
             | A nice approach is to staff a DevOps team with people from
             | diverse backgrounds; some more towards the system side of
             | the spectrum and some more towards the dev side of the
             | spectrum. As long as everybody knows a little bit of the
              | other side. This helps in avoiding a culture where devs
             | "throw some code over the fence" and sysops people just
             | moan that devs are careless and/or that they should do
             | things differently, but without a clear way of showing
             | exactly how differently things should be made (and also
              | without a clear understanding of why devs ended up
              | choosing the way they chose)
        
         | spockz wrote:
         | Afaik, Twitter already has significant and mature
         | infrastructure in place to run a plethora of different
         | instances on shadowed traffic and compare settings. It is used
          | at least by the people working on the optimised JRE.
        
         | mnutt wrote:
         | There was a startup I talked to at one point that had a service
         | where they'd run an agent on your instances that would collect
         | performance data and live tune kernel parameters, and they had
         | some AI to find the best parameters for your workload. No idea
         | how well it worked, but it seems like a potentially good
         | application of AI.
        
           | servytor wrote:
           | Do you remember the name for it? Sounds really useful.
        
             | syngrog66 wrote:
             | Log4Shell.com
        
         | vegetablepotpie wrote:
         | They could hire contractors or consultants to do the job, no?
         | That class of worker would not be concerned about promotion
         | opportunities. For some reason they haven't done that either.
        
         | joshuamorton wrote:
         | > With numbers like that Twitter could, profitably, hire dozens
         | of engineers that do literally nothing else. Just tweak thread
         | pool sizes all day, every day, for service after service. Even
         | though it's a boring, manual, "noncomplex" thing, this type of
         | work is clearly valuable and should have happened years ago.
         | 
         | The issue is that once you hire a dozen engineers to do this
         | (say for 5M a year in total), and they do it for a year, they
         | save mid 8 figures (keep in mind this was the largest service,
         | so the savings across other services will be smaller).
         | 
         | Then can they keep saving mid 8 figures every year?
         | 
         | I'll paraphrase something I previously wrote privately, but
         | imagine you have some team that's able to save 10% of your
         | fleetwide resources this year. They densify and optimize and
         | improve defaults. So you can now grow your services by 10%
         | without any additional cost increase, and you do! The next
         | year, they've already saved the easiest 10%. Can they save 10%
         | again? Can they keep it up every year? How long until they're
         | saving 3% a year, or 1% a year? And that's if you keep the team
          | the same size, where it's clearly losing money! If you could
         | afford a dozen people to save 10%, you can only really afford
         | 1-2 to save 1%, but then you're likely to get an even smaller
         | return.
         | 
         | Unless you expect to be able to maintain the same relative
          | value of optimizations every year, 3 or 5 years out, it's not
         | worth it to hire an FTE to work on them.
         | 
         | I should note that I've experienced this myself: I was working
         | in an area where resource optimization could lead to
         | "significant" savings (not 8 figures, but 6 or maybe 7). My
         | first 6 months working in this area, I found all sorts of low
         | hanging fruit and saved a fair amount. The second six months,
          | 5-10x less ROI. I gave up even trying in the third six months:
          | if I come across a thing, I'll fix it, but it's no longer
          | worthwhile to _look_.
        
           | ZephyrBlu wrote:
           | If you're looking in the same area/domain, what you're saying
           | is almost certainly true.
           | 
           | If you're looking across the business as a whole, it seems
           | likely that there is a lot of this kind of work lying around
           | because there is not much incentive for people to tackle it
           | as described in this comment:
           | https://news.ycombinator.com/item?id=29691847.
        
         | ghusbands wrote:
         | Certainly, at one place I worked, the higher-ups were very
         | clear that any work on cost reduction was wasted and devs and
         | ops should always work on increasing net income, not decreasing
         | costs. It was consistently claimed that cost reduction can only
         | get you small percentage decreases, whereas increases in income
         | are larger and compound better.
        
           | avianlyric wrote:
           | The big difference between cost reduction and income
           | increase, is that one has a hard limit on possible upside,
           | whereas the other does not. You can reduce your costs by more
           | than your total costs, but it's quite possible to increase
           | your income by many multiples of your existing income.
           | 
           | Result is that maximising income is generally better than
           | reducing cost. Of course, as with all generalisation, there
           | are situations where this approach doesn't hold true. But as
           | a high level first order strategy, it's a good one to adopt.
        
             | ClumsyPilot wrote:
             | "one has a hard limit on possible upside, whereas the other
             | does not."
             | 
              | That's plain wrong, the global market for cars, bicycles
              | and what have you has a limited size. Every large company
             | that's a market leader understands that.
        
               | beiller wrote:
                | Would capturing 100% of the market, granting a monopoly,
                | essentially grant unlimited upside? Cause you can just
               | jack the price to absurdity? Also wouldn't reducing cost
               | to zero have an infinite upside as well? Basically zero
               | cost you can produce infinite output. Gettin real
               | pedantic here heh.
        
             | wolf550e wrote:
             | You meant to write "You cannot reduce your costs"
        
           | lostdog wrote:
           | You can do similar types of work, but target speed increases
           | instead. Getting all the batch jobs to finish faster can help
           | developer productivity, and is even worthwhile at a new
           | startup.
        
           | tomrod wrote:
           | The higher ups need some basic economics education, it
           | appears. Certainly you shouldn't invest everything in long
           | term returns, but you should be open to it.
           | 
           | Instead, when something has a payoff of 3 years, executives
           | get antsy in orgs that have a 2-year cycle on exec positions.
        
           | kevin_nisbet wrote:
           | I've had similar messages articulated to me by my manager,
           | and have found myself articulating similar messages to my
           | team.
           | 
            | In my team, the key point is that, for the stage of the
            | project/product we manage, cost optimization is likely one
            | of the lowest-ROI activities we can spend time on. That
            | doesn't mean we don't tackle clear low-hanging fruit when
            | we see it, or use low-hanging fruit as training
            | opportunities to onboard new team members, but we need to
            | be conscious of where we make investments, and for the
            | stage we're at, the more important investment is into areas
            | that make our product more appealing to more customers.
           | 
            | I think it's easy to say someone, like an intern, could pay
            | for themselves with savings. But to me this overlooks that
            | someone has to manage that intern, get them through change
            | management, review the work, investigate mistakes or
            | interruptions from the changes, etc. And then they're still
            | the lowest-earning employee, since most of us aren't hired
            | to pay for ourselves, but to turn a profit for the company.
           | 
            | So while I'm not sure I agree with the message that
            | "increases in income are larger and compound better", I
            | certainly understand and have pushed a similar message:
            | that we be conscious of where we're spending our time, and
            | that we're selecting the highest-impact activities based on
            | the resources and team we have. Sometimes that may be
            | fixing high wastage, but very frequently that will be
            | investing in the product. And I think for the stage of the
            | product we manage, that is the best choice for us.
        
           | R0b0t1 wrote:
            | This was verified by (what should be) a famous Harvard
            | Business School study. Quality before cost, revenue before
            | expenses, and there is no three.
        
       | richardwhiuk wrote:
       | This article is quite old - the kernel patch has been available
       | for a while now, I believe, and CMK is no longer in beta (the
       | article references K8s 1.8 and 1.10, but the current latest
       | version is 1.23).
        
         | cpitman wrote:
         | There are updates from this month at the bottom!
        
       | [deleted]
        
       | throwaway984393 wrote:
       | I wonder if k8s' bin-packing features would help here.
       | 
        | The graphs seem to validate my general assumption that large-
        | load tasks just suck at scaling whereas small-load tasks can be
       | horizontally scaled easier without falling over. The general
       | assumption being that for most applications, if you ignore
       | everything else about an operation and assume a somewhat random
       | distribution of load, smaller-load services use up more available
       | resources on average than a single large-load service. That's
       | just been an assumption in my head, I can't remember any data to
       | back that up.
       | 
       | Back in the day when I worked on a large-traffic internet site,
       | we tried jiggering with the scheduler and other kernel tweaks,
       | and in the end we literally just routed certain kinds of requests
       | to certain machines and said "these requests need loads of cache"
       | (memcache, local disk, nfs cache, etc) and "these requests need
       | fast io" and "these requests need a ton of cpu". It was "dumb"
       | engineering but it worked out.
        
       | zekrioca wrote:
        | I think there is another solution, not discussed in the
        | article, which lies between CPU isolation and pinning:
        | virtualizing the container's /proc so as not to let it think
        | the number of available (logical) processors is larger than the
        | limit set by the cluster operator, which is actually lower than
        | the physical capacity of the server (so as to allow overbooking
        | and increase their 'redacted' savings in $M). This is basically
        | presenting a container/application with a number of vCPUs that
        | it can use in any way it sees fit, but with all the (invisible)
        | control group (quota) limits (i.e., "throttling") the author
        | discusses in the text, and it prevents the application from
        | spawning so many threads that it inevitably overloads the
        | physical server and destroys tail latency.
       | 
        | This is at the kernel level, as opposed to paravirtualization.
        | And I guess this is Twitter's use case, but it should not be
        | confused with the typical vCPU offers one sees in most cloud
        | providers, which are usually done through hypervisors such as
        | Qemu/KVM, VMware, or Xen.
       | 
        | I'm not sure why Mesos (maybe this one tried and didn't
        | succeed), K8S (available through external Intel code), or even
        | Docker never really pursued this, but I guess they want to keep
        | their internal (operational) overheads below a limit, and
        | possibly also to maintain the metastability of their services
        | [1]. But now we see where it leads, with all these redacted
        | numbers in the article.
       | 
       | [1]
       | https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
       | 
       | Ps: edits for clarifications.
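The quota-aware sizing described in the comment above can be approximated today from inside a container, without any /proc virtualization: instead of sizing thread pools from the host's CPU count, read the cgroup CFS quota. A minimal sketch (the file paths are the standard cgroup v1/v2 locations; the fallback behaviour when no limit is found is an assumption):

```python
import math
import os

def cpus_from_quota(quota_us: int, period_us: int) -> int:
    """Translate a CFS quota into an equivalent whole-CPU count.

    A quota of -1 (cgroup v1's "unlimited") falls back to the host count.
    """
    if quota_us <= 0:
        return os.cpu_count() or 1
    return max(1, math.ceil(quota_us / period_us))

def container_cpu_limit() -> int:
    """Best-effort detection of the container's CPU limit (v2, then v1)."""
    try:
        # cgroup v2: a single file, e.g. "200000 100000" or "max 100000"
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
        if quota != "max":
            return cpus_from_quota(int(quota), int(period))
    except (OSError, ValueError):
        pass
    try:
        # cgroup v1: quota and period live in separate files
        quota = int(open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").read())
        period = int(open("/sys/fs/cgroup/cpu/cpu.cfs_period_us").read())
        return cpus_from_quota(quota, period)
    except (OSError, ValueError):
        pass
    return os.cpu_count() or 1  # no cgroup limit detected
```

Sizing a worker pool with `container_cpu_limit()` rather than `os.cpu_count()` avoids the pathology the article describes, where a runtime spawns one thread per host core and then burns its entire quota in a fraction of each CFS period.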
        
       | KaiserPro wrote:
        | We had a similar problem, but it manifested differently.
       | 
       | We had two lumps of compute:
       | 
        | 1) A huge render farm, at the time ~36k CPUs. The driving goal
        | was 100% utilisation. It was a shared resource, and when
        | someone wasn't using their share it was aggressively loaned out
        | (both CPU and licenses). Latency wasn't an issue.
        | 
        | 2) A much smaller VM fleet. Latency was an issue, even though
        | both the contention and the utilisation were much lower.
       | 
        | Number two was the biggest issue. We had a number of processes
        | that needed 100% of one CPU all the time, and they were
        | stuttering. Even though the VM thought they were getting 100%
        | of a core, they were in practice getting ~50% according to the
        | hypervisor. (This was a 24-core box, with only one CPU-heavy
        | process.)
       | 
        | After much graphing, it turned out to be because we had too
        | many VMs defined with 4-8 CPUs on each machine. Because the
        | hypervisor won't allocate only 2 CPUs to a 4-CPU VM, there was
        | lots of spin-locking while waiting for space to schedule the
        | VM. This meant that even though the VMs thought they were
        | getting 100% CPU, the host was actually giving each VM ~25%.
       | 
        | The solution was to have more, smaller machines. The more
        | threads you ask to be scheduled at the same time, the less able
        | the host is to share.
       | 
        | We didn't see this on the big farm, because the only thing we
        | constrained was memory. The orchestrator would make sure that a
        | thing configured for 4 threads was put in a 4-thread slot, but
        | we would configure each machine to have 125% of its CPU
        | allocated.
        
       | 3np wrote:
       | The way the issue is presented, it sounds to me like context
       | switching should be one of the major considerations, especially
       | when talking about CPU pinning. Yet it's barely mentioned in
       | passing. How come?
        
       | bboreham wrote:
       | Dave Chiluk did a great talk covering a similar scheduler
       | throttling problem.
       | 
       | https://m.youtube.com/watch?v=UE7QX98-kO0
        
       ___________________________________________________________________
       (page generated 2021-12-26 23:01 UTC)