[HN Gopher] Carving the scheduler out of our orchestrator ___________________________________________________________________ Carving the scheduler out of our orchestrator Author : darthShadow Score : 150 points Date : 2023-02-02 16:30 UTC (6 hours ago) (HTM) web link (fly.io) (TXT) w3m dump (fly.io) | mochomocha wrote: | I have my own fair share of complaints about k8s, but I can't say | the author articulates clearly what exactly is wrong with the k8s | scheduler. | | IMO it's one of the better parts of k8s. The core scheduler is | pretty well written and extensible through scheduling plugins, to | implement whatever policies your heart desires (which we | extensively make use of at Netflix). | | The main issue I have with it is the lack of built-in | observability, which makes it non-trivial to A/B test scheduling | policies in large scale deployment setups, because you want to be | able to log the various subscores of your plugins. But it's so | extensible through NodeAffinity and PodAffinity plugins that you | can even delegate part (or all!) of the scheduling decisions | outside of it if you want. | | Besides observability, one issue we've had to overcome with k8s | scheduling is the inheritance of the Borg design decisions around | pod shape immutability, which makes implementing things like | oversubscription less easy in a "native" way. | tptacek wrote: | Nothing is wrong with the k8s scheduler! | | Really, nothing is wrong with k8s at all, beyond our more | general problem of "our users want to run Linux apps, not k8s | apps". | | K8s, Borg, Omega, Flynn, Nomad, and to some extent Mesos all | share a common high-level scheduler architecture: a logically | centralized, possibly distributed server process that functions | like an allocator and is based on a consistent view of | available cluster resources. | | It's a straightforward and logical way to design an | orchestrator. It's probably the way you'd decide to do it by | default. Why wouldn't you?
It's an approach that works well in | other domains. And: it works well for clusters too. | | The point of the post is that it's not the only way to design | an orchestrator. You can effectively schedule without a | centralized consistent allocator scheduler, and when you do | that, you get some interesting UX implications. | | They're not _necessarily_ good implications! If you're running | a cluster for, like, Pixar, they're probably bad. You probably | want something that works like Borg or Omega did. You have a | (relatively) small number of (relatively) huge jobs, you want | optimal placement+, and you probably want to minimize your | hardware costs. | | We have the opposite constraints, so the complications of | keeping a _globally_ consistent real-time inventory of | available resources and scheduling decisions don't pay their | freight in benefits. That's just us, though. It's probably not | anything resembling _most_ k8s users. | | + _In fact, going back even before Borg but especially once | Borg came on the scene, mainstream schedulers have been making | this distinction --- between service jobs and batch jobs, where | batch jobs are less placement sensitive and more delay | sensitive. So one way to think about the design approach we're | taking is, what if the whole platform scheduler thought in | terms of a batch-friendly notion of jobs, and then you built | the service placement logic on top of it, rather than alongside | it?_ | audi0slave wrote: | I wonder if you ever looked at the Sparrow paper[1], which came | out of Ion Stoica's lab. It's also decentralized in nature, | focusing on low-latency placement. | | [1] | https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf | tptacek wrote: | Lots of similarities! | | At a low level, there are distinctions like Sparrow being | explicitly p2c based (choose two random workers, then use | metrics to pick the "best" between them).
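(The power-of-two-choices selection described there fits in a few lines; this is an illustrative Python sketch, not Sparrow's actual code — the worker names and the load metric are made up.)

```python
import random

# Illustrative sketch of power-of-two-choices (p2c) placement: sample two
# random workers, then use a load metric to pick the "better" of the pair.
# The `load` mapping and worker names are hypothetical.

def p2c_pick(workers, load):
    """Sample two workers at random and return the less-loaded one."""
    a, b = random.sample(list(workers), 2)
    return a if load[a] <= load[b] else b
```

The appeal is that each scheduling decision only needs fresh metrics for two workers, not a globally consistent view of the whole fleet.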
We don't p2c at | all right now; part of our idea is being able to scale- | from-zero to serve incoming requests with sleeping Fly | Machines, rather than keeping everything online. And, of | course, we run whole VMs (/Docker containers), not jobs | with fixed definitions running on long-running executors | --- but that's just a detail. | | My sense of it is that the big distinction, though, is that | you'd run Sparrow _alongside_ a classic scheduler design | like Mesos or Borg or Omega. You'd schedule placement- | sensitive servers with the Borglike, and handle Lambda- | style HTTP request work with the Sparrowlike. | | What we're working on instead is whether you can build | something sufficiently Borg-like _on top of_ something | Sparrow-like, using "Sparrow-style" scheduling as a low- | level primitive. This doesn't quite capture what we're | doing (the scheduling API we export implicitly handles some | constraints, for instance), but is maybe a good way to | think about it. | | I should have cited this paper! Thanks for linking it. | mochomocha wrote: | I have a hard time being on board with the logic. Neither k8s | nor Mesos guarantees optimal placements (which, as you pointed | out, is unrealistic given the NP-hardness of the problem); in | fact they are explicitly trading placement quality for lower | scheduling latency - and are mostly designed to schedule | really fast. | | Mainstream schedulers make no distinction between service & | batch jobs out of the box - you need to explicitly go out of | your way to implement differential preferences when it comes | to scheduling latency, for example. | | I am surprised cost isn't a concern for you guys; I actually | had assumed you went all in on bin-packing+firecracker | oversubscription to maximize your margin. | tptacek wrote: | Nothing guarantees optimal placement, but all the | mainstream schedulers attempt an approximation of it.
The | general assumption in mainstream schedulers is that servers | are placement-sensitive, and batch jobs aren't. If you | assume there's only one kind of job (all servers, or all | batch jobs), the design of a scheduler gets a lot simpler; | like, most of what the literature talks about with respect | to Mesos and Borg is how to do distributed scheduling for | varying scheduling regimes. | | I like how I worded it earlier: instead of having a complex | scheduler that keeps a synchronized distributed database of | previous schedulings and available resources, what if you | just scheduled everything as if it were a placement- | insensitive batch job? Obviously, your servers aren't happy | about that. But you don't expose that to your servers; you | build another scheduling layer _on top of_ the batch | scheduling. | | That's not precisely what we're doing (there are ultimately | still constraints in our service model!) but it sort of | gets at the spirit of it. | | As for cost: we rent out server space. We rack servers to | keep up with customer load. The more load we have, the more | money we're making. If we're racking a bunch of new servers | in FRA, that means FRA is making us more money. Ideally, we | want to rack enough machines to have a decent amount of | headroom _past_ what we absolutely need for our current | workload --- if our business is going well, customer load | is consistently growing. So if ops is going well, we've | generally got more machines than a mainstream scheduler | would "need" to schedule on. But there's no point in having | those machines racked and doing nothing. | | At some point we'll reach a stage where we'll be more | sensitive to hardware costs and efficiency. Think of a | slider of sensitivity; we're currently closer to the | insensitive side, because we're a high-growth business. | mwcampbell wrote: | > Think of a slider of sensitivity; we're currently | closer to the insensitive side, because we're a high- | growth business.
| | If the big-tech layoffs are any indication of a general | trend toward belt-tightening, then reducing headroom, and | even over-subscribing servers shared-hosting-style, must | be tempting. | tptacek wrote: | Not for us, right now. We have, if anything, the opposite | problem. | kalev wrote: | I'm annoyed by the way this is written. The topic is super | interesting, but the author tried too hard to be funny, and as a | non-native reader it's difficult to determine whether certain words | are technical jargon or trying to be funny. | Dowwie wrote: | I think it succeeded at being funny. tptacek is a Numad lad. | tptacek wrote: | I'm mostly just curious about which terms the reader thinks | are made up. | romantomjak wrote: | I'm quite the opposite. This style of writing is quite | refreshing. It's playful, has enough technical detail and is | easy to read. | filereaper wrote: | Mesos always had these notions of two-level scheduling that let | you build your own orchestration. | | Aurora, Marathon, etc. would add the flavor of orchestration | that's needed. Mesos provided the resources requested. | | https://mesos.apache.org/documentation/latest/architecture/ | necubi wrote: | I remain sad that Mesos never really took off, and then k8s ate | the market. It had a lot of really clever ideas and could do | stuff well that k8s still can't (although it has been catching | up, ten years later). In particular, the scheduler architecture | allowed it to work well for both batch workloads and service- | like workloads within a single resource pool. | tptacek wrote: | Borg, Omega, K8s, and Nomad all have a separate scheduling | pathway for batch jobs, don't they?
My understanding is that | this is the big complexifier for all distributed schedulers: | that once you have two different service models for | scheduling (placement-sensitive/delay-insensitive, and | placement-insensitive/delay-sensitive), you now have two | scheduler processes, and you have to work out concurrency | between their claims on resources. | | The Omega design in particular is, like, ground-up meant to | address this problem; the whole paper is basically "why the | Mesos approach is suboptimal for this problem". | necubi wrote: | It's true that Omega was designed specifically to solve | this problem well (and one of the big insights from the | original Borg paper is how efficient it is to have | heterogeneous workloads on a single cluster). But (as a non- | Google person with no inside info) I think the amount of | Omega inspiration in Kubernetes is overstated. | | Until a few years ago there was very limited support for | pluggable schedulers in Kubernetes (I think it's gotten | better recently, but I'm a bit out of date). The beauty of | Omega and Mesos is that schedulers were a first-class | concept. How Spark thinks about job scheduling will not be | the same as how Presto does, or a real-time system like | Flink. Mesos provided a framework where you could create | custom schedulers for each class of application you run. | | For example, for a query engine you might care about | scheduling latency more than anything else, and you'd be | willing to accept subpar resources if you're able to get | them more quickly. For a Spark job, the scheduler can | introspect the actual properties of the job to figure out | whether a particular resource is suitable. | tptacek wrote: | For what it's worth: I didn't know if there was any Omega | inspiration in K8s at all; I just know that Nomad was | heavily inspired by it.
| plaidfuji wrote: | > With strict bin packing, we end up with Katamari Damacy | scheduling, where a couple overworked servers in our fleet suck | up all the random jobs they come into contact with. | tptacek wrote: | I'm worried that I'm meaner about K8s in this than I mean to be, | which would be a problem not least because I don't have enough | K8s experience to justify meanness. I'm really more just | surprised at how path-dependent the industry is; even systems | that were consciously built not to echo Borg, even greenfield | systems like Flynn that were reimaginings of orchestration, | all seem to follow the same model of central, allocating | schedulers based on distributed consensus about worker inventory. | linuxftw wrote: | The nice thing about a centralized scheduler is that the code for | scheduling only needs to run in one place. If each node 'bids' | on the workload, you need to run scheduling fit on every node | for every application. You also need to work out a way to | decide who wins the bid. | | Every time you want to change your scheduling logic, guess | what? Gotta update all the nodes. | | The k8s scheduler is battle-tested. Across all the different | environments and CI pipelines, it has scheduled billions, if | not trillions, of pods at this point. If you have a better way | of scheduling workloads across nodes, I advise you to make it | work for Kubernetes and sell it as an enterprise scheduler, and | you'll make tons of money. Even if it's not better for the | general use case, and just better for some edge (pun intended) | use case, it would do quite well. | | IMO, all roads lead to k8s. You're just eventually going to | have to solve the same set of problems. | tptacek wrote: | The code for scheduling in mainstream schedulers runs in lots | of places, which is why mainstream schedulers tend to run | Paxos or Raft.
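(For concreteness, the node-"bidding" idea discussed above could look something like this Python sketch: each node scores a job against its local free resources, and a deterministic tie-break resolves the winner. All field names and the scoring rule here are hypothetical, not taken from any real scheduler.)

```python
# Hypothetical sketch of decentralized bidding: each node computes a fitness
# score from its local resource view, and the winner is resolved
# deterministically (highest score, then lowest node id) so every node
# evaluating the same inputs agrees on the outcome without a central broker.

def bid(node, job):
    """Return a fitness score for running `job` on `node`, or None if it won't fit."""
    if node["free_cpu"] < job["cpu"] or node["free_mem"] < job["mem"]:
        return None
    # Prefer nodes with the most remaining headroom (spread, not bin-pack).
    return (node["free_cpu"] - job["cpu"]) + (node["free_mem"] - job["mem"])

def resolve(nodes, job):
    """Pick the winning node id; any node running this sees the same winner."""
    scored = [(bid(n, job), n["id"]) for n in nodes]
    scored = [(s, nid) for s, nid in scored if s is not None]
    if not scored:
        return None  # no node can host the job
    # min over (-score, id): highest score wins, lowest id breaks ties.
    return min(scored, key=lambda c: (-c[0], c[1]))[1]
```

This also makes the stated trade-off visible: `bid` runs on every node for every job, and changing the policy means shipping new logic to the whole fleet.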
| lupex wrote: | Borg - at least as it is published in those papers(*) - and these | other schedulers are really designed to maximize the availability | of large distributed apps. | | (*) The real Borg now does a lot more... | | But there is a completely different world if you look | elsewhere. In batch systems like supercomputers (HPC), you want | to maximize throughput, not availability. It is common to have | something closer to your design: an inventory of jobs and a | scheduler that allocates workers to them. E.g. https://hpc- | wiki.info/hpc/Scheduling_Basics | vidarh wrote: | Be mean about K8s, it can take it. I don't think any of the | criticism of it is unreasonable. It's a huge, convoluted beast. | To me, K8s makes most sense in terms of creating a Swiss Army | knife that lots of people know. | | You can do far better than K8s for specialised cases like yours | and/or with people who have the right skills. But for a lot of | people with simpler needs, it's better to just stick to what is | easy to hire for than look for an optimal solution. | intelVISA wrote: | > You can do far better than K8s | | "Nobody ever got fired for K8s" is how I imagine most SWEs | justify using such a crappy system. Kinda like SAP. | btown wrote: | However bad you think k8s is, SAP is worse on its best day. | [deleted] | jallmann wrote: | This was a great article. While I was at Livepeer (distributed | video transcoding on Ethereum [1]), we converged on a very | similar architecture, disaggregating scheduling into the client | itself. | | The key piece is to have a registry [2] with a (somewhat) up-to- | date view of worker resources. This could actually be a | completely static list that gets refreshed once in a while. | Whenever a client has a new job, they can look at their latest | copy of the registry, select workers that seem suitable, submit | jobs to workers directly, handle retries, etc.
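(A minimal Python sketch of that direct-to-worker flow; the registry fields, the BUSY/ACCEPTED replies, and the `send` callback are all illustrative assumptions, not Livepeer's actual API.)

```python
# Sketch of client-side scheduling against a (possibly stale) registry
# snapshot: filter workers that look suitable, try them in order, and
# treat a "busy" reply as backpressure that sends the client elsewhere.

BUSY = "busy"
ACCEPTED = "accepted"

def submit(job, registry, send):
    """Try workers from a registry snapshot until one accepts the job."""
    candidates = [w for w, info in registry.items()
                  if info["capacity"] >= job["size"]]
    # Least-loaded first, per the snapshot; staleness is tolerable because
    # an overloaded worker will simply refuse and we move down the list.
    for worker in sorted(candidates, key=lambda w: registry[w]["load"]):
        if send(worker, job) == ACCEPTED:
            return worker
    return None  # everyone pushed back; caller can refresh the registry and retry
```

The retry loop is where the backpressure described below shows up: a refusal costs one round trip, not a centralized rescheduling pass.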
| | One neat thing about this "direct-to-worker" architecture is that | it allows for backpressure from the workers themselves. Workers | can respond to a shifting load profile almost instantaneously | without having to centrally deallocate resources, or wait for | healthchecks to pick up the latest state. Workers can tell | incoming jobs, "hey sorry but I'm busy atm," and the client will | try elsewhere. | | This also allows for rich worker selection strategies on the | client itself; e.g. it can preemptively request the same job on | multiple workers, keep the first one that is accepted, and cancel | the rest, or favor workers that respond fastest, and so forth. | | [1] We were more of a "decentralized job queue" than a "distributed | VM scheduler," with the corresponding differences, e.g. shorter- | lived jobs with fluctuating load profiles, and our clients could | be thicker. But many of the core ideas are shared; even our | workers were called "orchestrators," which in turn could use | similar ideas to manage jobs on the GPUs attached to them... schedulers | all the way down! | | [2] Here the registry seems to be constructed via the Corrosion | gossip protocol; we used the blockchain with regular healthcheck | probes. | coredog64 wrote: | Although it's currently archived, there's another open source | orchestrator that's similar to Borg: Treadmill. | | AFS solves for Colossus, with packages being distributed into | AFS. | | [0] https://github.com/morganstanley/treadmill | korijn wrote: | I can't shake the sense that they seem to just prefer building | something new (and blogging about it) over figuring out how to | configure a/the scheduler properly. | | Anyway, they did get it done and made it work, so whatever. | [deleted] | mrkurt wrote: | We built Fly.io to treat chronic NIH syndrome. | korijn wrote: | I don't understand how that relates to my comment. | | Anyway, I'm happy you seem to be successful. Keep up the good | work.
| linuxftw wrote: | NIH syndrome is 'not invented here syndrome.' This equates | to the 'build something new and blog about it' remark you | made, IMO. | [deleted] | schmichael wrote: | As the Nomad Team Lead, this article is a _gift_ - thank you, Fly! | - even if they're transitioning off of Nomad. Their description | of Nomad is exactly what I would love people to hear, and their | reasons for DIYing their own orchestration layer seem totally | reasonable to me. Nomad has never wanted people to think we're | The One True Way to run all workloads. | | I hope Nomad covers cases like scaling-from-zero better in the | future, but to do that within the latency requirements of a | single HTTP request is quite the feat of design and | implementation. There's a lot of batching Nomad does for scale | and throughput that conflicts with the desire for minimal | placement+startup latency, and it's yet to be seen whether | "having our cake and eating it too" is physically possible, much | less whether we can package it up in a way that lets operators | understand what tradeoffs they're choosing. | | I've had the pleasure of chatting with mrkurt in the past, and I | definitely intend to follow fly.io closely even if they're no | longer a Nomad user! Thanks again for yet another fantastic post, | and I wish fly.io all the best. | tptacek wrote: | This whole article started with me rewatching your Nomad deep | dive video and then chasing papers. :) ___________________________________________________________________ (page generated 2023-02-02 23:00 UTC)