[HN Gopher] Carving the scheduler out of our orchestrator
       ___________________________________________________________________
        
       Carving the scheduler out of our orchestrator
        
       Author : darthShadow
       Score  : 150 points
       Date   : 2023-02-02 16:30 UTC (6 hours ago)
        
 (HTM) web link (fly.io)
 (TXT) w3m dump (fly.io)
        
       | mochomocha wrote:
        | I have my own fair share of complaints about k8s, but I
        | can't say the author articulates clearly what exactly is
        | wrong with the k8s scheduler.
       | 
        | IMO it's one of the better parts of k8s. The core scheduler
        | is pretty well written and extensible through scheduling
        | plugins, to implement whatever policies your heart desires
        | (which we make extensive use of at Netflix).
       | 
       | The main issue I have with it is the lack of built-in
       | observability, which makes it non-trivial to A/B test scheduling
       | policies in large scale deployment setups because you want to be
       | able to log the various subscores of your plugins. But it's so
       | extensible through NodeAffinity and PodAffinity plugins that you
       | can even delegate part (or all!) of the scheduling decisions
       | outside of it if you want.
       | 
       | Besides observability, one issue we've had to overcome with k8s
       | scheduling is the inheritance of the Borg design decisions around
       | pod shape immutability, which makes implementing things like
       | oversubscription less easy in a "native" way.
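(The subscore-logging gap mentioned above can be worked around by wrapping your scoring plugins. A minimal sketch in plain Python - illustrative only: the plugin names, weights, and node fields are all hypothetical, not the real kube-scheduler framework API.)

```python
# Toy model of score-plugin aggregation with per-node subscore logging
# (the observability the comment says k8s lacks built-in). All plugin
# names, weights, and node fields here are made up.

def spread_score(node):
    # Prefer emptier nodes; scores bounded 0-100 like the framework's.
    return 100 - int(100 * node["used_cpu"] / node["cpu"])

def image_locality_score(node):
    # Prefer nodes that already have the container image cached.
    return 100 if node["has_image"] else 0

PLUGINS = [
    ("spread", spread_score, 2.0),
    ("image_locality", image_locality_score, 0.5),
]

def score_nodes(nodes, log):
    totals = {}
    for node in nodes:
        subscores = {name: fn(node) for name, fn, _ in PLUGINS}
        log.append({"node": node["name"], **subscores})  # kept for A/B analysis
        totals[node["name"]] = sum(w * subscores[name] for name, _, w in PLUGINS)
    return max(totals, key=totals.get)

log = []
nodes = [
    {"name": "a", "cpu": 8, "used_cpu": 6, "has_image": True},
    {"name": "b", "cpu": 8, "used_cpu": 2, "has_image": False},
]
best = score_nodes(nodes, log)  # "b": spread outweighs image locality here
```

With the per-plugin subscores logged, comparing two scoring policies becomes an offline analysis job rather than a scheduler change.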
        
         | tptacek wrote:
         | Nothing is wrong with the k8s scheduler!
         | 
         | Really, nothing is wrong with k8s at all, beyond our more
         | general problem of "our users want to run Linux apps, not k8s
         | apps".
         | 
         | K8s, Borg, Omega, Flynn, Nomad, and to some extent Mesos all
         | share a common high-level scheduler architecture: a logically
         | centralized, possibly distributed server process that functions
         | like an allocator and is based on a consistent view of
         | available cluster resources.
         | 
         | It's a straightforward and logical way to design an
         | orchestrator. It's probably the way you'd decide to do it by
         | default. Why wouldn't you? It's an approach that works well in
         | other domains. And: it works well for clusters too.
         | 
         | The point of the post is that it's not the only way to design
         | an orchestrator. You can effectively schedule without a
         | centralized consistent allocator scheduler, and when you do
         | that, you get some interesting UX implications.
         | 
          | They're not _necessarily_ good implications! If you're running
         | a cluster for, like, Pixar, they're probably bad. You probably
         | want something that works like Borg or Omega did. You have a
         | (relatively) small number of (relatively) huge jobs, you want
         | optimal placement+, and you probably want to minimize your
         | hardware costs.
         | 
         | We have the opposite constraints, so the complications of
         | keeping a _globally_ consistent real-time inventory of
          | available resources and scheduling decisions don't pay their
         | freight in benefits. That's just us, though. It's probably not
         | anything resembling _most_ k8s users.
         | 
         | + _In fact, going back even before Borg but especially once
         | Borg came on the scene, mainstream schedulers have been making
         | this distinction --- between service jobs and batch jobs, where
         | batch jobs are less placement sensitive and more delay
          | sensitive. So one way to think about the design approach we're
         | taking is, what if the whole platform scheduler thought in
         | terms of a batch-friendly notion of jobs, and then you built
         | the service placement logic on top of it, rather than alongside
         | it?_
        
           | audi0slave wrote:
            | I wonder if you ever looked at the Sparrow paper[1], which
            | came out of Ion Stoica's lab. It's also decentralized in
            | nature, focusing on low-latency placement.
           | 
           | [1]
           | https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf
        
             | tptacek wrote:
             | Lot of similarities!
             | 
             | At a low level, there are distinctions like Sparrow being
             | explicitly p2c based (choose two random workers, then use
             | metrics to pick the "best" between them). We don't p2c at
             | all right now; part of our idea is being able to scale-
             | from-zero to serve incoming requests with sleeping Fly
             | Machines, rather than keeping everything online. And, of
             | course, we run whole VMs (/Docker containers), not jobs
             | with fixed definitions running on long-running executors
             | --- but that's just a detail.
             | 
              | My sense, though, is that the big distinction is that
              | you'd run Sparrow _alongside_ a classic scheduler design
              | like Mesos or Borg or Omega. You'd schedule placement-
              | sensitive servers with the Borglike, and handle Lambda-
              | style HTTP request work with the Sparrowlike.
             | 
              | What we're working on instead is whether you can build
             | something sufficiently Borg-like _on top of_ something
             | Sparrow-like, using  "Sparrow-style" scheduling as a low-
             | level primitive. This doesn't quite capture what we're
             | doing (the scheduling API we export implicitly handles some
             | constraints, for instance), but is maybe a good way to
             | think about it.
             | 
             | I should have cited this paper! Thanks for linking it.
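(The power-of-two-choices selection mentioned in this subthread can be sketched in a few lines of Python - an illustration of the general technique, not Sparrow's actual probe protocol; the worker names and load table are made up.)

```python
import random

def p2c_pick(workers, load):
    """Power-of-two-choices: sample two workers uniformly at random,
    keep the less loaded one. Near-balanced placement without any
    global, consistent view of cluster state."""
    a, b = random.sample(workers, 2)
    return a if load[a] <= load[b] else b

# Made-up load table for three workers.
load = {"w1": 5, "w2": 1, "w3": 9}
picks = {p2c_pick(list(load), load) for _ in range(200)}  # never "w3"
```

The heavily loaded worker can never win either pairing, so it is simply never picked - the load-balancing effect falls out of comparing just two samples.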
        
           | mochomocha wrote:
            | I have a hard time being on board with the logic. Neither
            | k8s nor Mesos guarantees optimal placements (which, as you
            | pointed out, is unrealistic given the NP-hardness of the
            | problem); in fact, they explicitly trade placement quality
            | for lower scheduling latency - and are mostly designed to
            | schedule really fast.
           | 
           | Mainstream schedulers make no distinction between service &
           | batch jobs out of the box - you need to explicitly go out of
           | your way to implement differential preferences when it comes
           | to scheduling latency for example.
           | 
            | I am surprised cost isn't a concern for you guys; I had
            | actually assumed you went all in on bin-packing +
            | firecracker oversubscription to maximize your margin.
        
             | tptacek wrote:
             | Nothing guarantees optimal placement, but all the
             | mainstream schedulers attempt an approximation of it. The
             | general assumption in mainstream schedulers is that servers
             | are placement-sensitive, and batch jobs aren't. If you
             | assume there's only one kind of job (all servers, or all
             | batch jobs), the design of a scheduler gets a lot simpler;
             | like, most of what the literature talks about with respect
             | to Mesos and Borg is how to do distributed scheduling for
             | varying schedule regimes.
             | 
             | I like how I worded it earlier: instead of having a complex
             | scheduler that keeps a synchronized distributed database of
             | previous schedulings and available resources, what if you
             | just scheduled everything as if it was a placement-
             | insensitive batch job? Obviously, your servers aren't happy
             | about that. But you don't expose that to your servers; you
             | build another scheduling layer _on top of_ the batch
             | scheduling.
             | 
             | That's not precisely what we're doing (there are ultimately
             | still constraints in our service model!) but it sort of
             | gets at the spirit of it.
             | 
             | As for cost: we rent out server space. We rack servers to
             | keep up with customer load. The more load we have, the more
             | money we're making. If we're racking a bunch of new servers
             | in FRA, that means FRA is making us more money. Ideally, we
             | want to rack enough machines to have a decent amount of
             | headroom _past_ what we absolutely need for our current
             | workload --- if our business is going well, customer load
              | is consistently growing. So if ops is going well, we've
             | generally got more machines than a mainstream scheduler
             | would "need" to schedule on. But there's no point in having
             | those machines racked and doing nothing.
             | 
             | At some point we'll reach a stage where we'll be more
             | sensitive to hardware costs and efficiency. Think of a
             | slider of sensitivity; we're currently closer to the
              | insensitive side, because we're a high-growth business.
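(The "schedule everything as a placement-insensitive batch job, then build the service layer on top" idea from earlier in this comment, sketched in hypothetical Python - launch() stands in for the fast batch primitive, reconcile() for the service layer built above it.)

```python
def launch(cluster, name, cpu=1):
    """Batch-style primitive: first fit, fast, no claim of optimality."""
    for node in cluster:
        if node["free"] >= cpu:
            node["free"] -= cpu
            node["jobs"].append(name)
            return node["name"]
    return None  # nothing fits right now

def reconcile(cluster, service, want):
    """Service layer on top: count live replicas and launch whatever
    is missing, each replica as an ordinary batch job."""
    live = sum(n["jobs"].count(service) for n in cluster)
    for _ in range(want - live):
        launch(cluster, service)

cluster = [
    {"name": "a", "free": 1, "jobs": []},
    {"name": "b", "free": 2, "jobs": []},
]
reconcile(cluster, "web", 3)  # places 3 "web" replicas wherever they fit
```

The batch primitive stays dumb and fast; everything service-shaped (replica counts, replacement on failure) lives in the reconcile loop above it.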
        
               | mwcampbell wrote:
               | > Think of a slider of sensitivity; we're currently
                | closer to the insensitive side, because we're a high-
               | growth business.
               | 
               | If the big-tech layoffs are any indication of a general
               | trend toward belt-tightening, then reducing headroom, and
               | even over-subscribing servers shared-hosting-style, must
               | be tempting.
        
               | tptacek wrote:
               | Not for us, right now. We have if anything the opposite
               | problem.
        
       | kalev wrote:
        | I'm annoyed by the way this is written. The topic is super
        | interesting, but the author tried too hard to be funny, and
        | as a non-native reader it's difficult to determine whether
        | certain words are technical jargon or attempts at humor.
        
         | Dowwie wrote:
         | I think it succeeded at being funny. tptacek is a Numad lad.
        
           | tptacek wrote:
           | I'm mostly just curious about which terms the reader thinks
           | are made up.
        
         | romantomjak wrote:
         | I'm quite the opposite. This style of writing is quite
         | refreshing. It's playful, has enough technical details and is
         | easy to read.
        
       | filereaper wrote:
       | Mesos always had these notions of two level scheduling that let
       | you build your own orchestration.
       | 
       | Aurora, Marathon, etc... would add the flavor of Orchestration
       | that's needed. Mesos provided the resources requested.
       | 
       | https://mesos.apache.org/documentation/latest/architecture/
        
         | necubi wrote:
         | I remain sad that Mesos never really took off, and then k8s ate
         | the market. It had a lot of really clever ideas and could do
         | stuff well that k8s still can't (although it has been catching
         | up, ten years later). In particular, the scheduler architecture
         | allowed it to work well for both batch workloads and service-
         | like workloads within a single resource pool.
        
           | tptacek wrote:
           | Borg, Omega, K8s, and Nomad all have a separate scheduling
           | pathway for batch jobs, don't they? My understanding is that
           | this is the big complexifier for all distributed schedulers:
           | that once you have two different service models for
           | scheduling (placement sensitive delay insensitive, and
           | placement insensitive delay sensitive), you now have two
            | scheduler processes, and you have to work out concurrency
           | between their claims on resources.
           | 
           | The Omega design in particular is, like, ground up meant to
           | address this problem; the whole paper is basically "why the
           | Mesos approach is suboptimal for this problem".
        
             | necubi wrote:
             | It's true that Omega was designed specifically to solve
             | this problem well (and one of the big insights from the
             | original Borg paper is how efficient it is to have
             | heterogenous workloads on a single cluster). But (as a non-
             | google person with no inside info) I think the amount of
             | omega inspiration in kubernetes is overstated.
             | 
             | Until a few years ago there was very limited support for
             | pluggable schedulers in kubernetes (I think it's gotten
             | better recently but I'm a bit out of date). The beauty of
             | Omega and Mesos is that schedulers were a first-level
             | concept. How Spark thinks about job scheduling will not be
             | the same as how Presto does, or a real-time system like
             | Flink. Mesos provided a framework where you could create
             | custom schedulers for each class of application you run.
             | 
             | For example, for a query engine you might care about
             | scheduling latency more than anything else, and you'd be
             | willing to accept subpar resources if you're able to get
             | them more quickly. For a Spark job, the scheduler can
             | introspect the actual properties of the job to figure out
             | whether a particular resource is suitable.
        
               | tptacek wrote:
               | For what it's worth: I didn't know if there was any Omega
               | inspiration in K8s at all; I just know that Nomad was
               | heavily inspired by it.
        
       | plaidfuji wrote:
       | > With strict bin packing, we end up with Katamari Damacy
       | scheduling, where a couple overworked servers in our fleet suck
       | up all the random jobs they come into contact with.
        
       | tptacek wrote:
       | I'm worried that I'm meaner about K8s in this than I mean to be,
       | which would be a problem not least because I don't have enough
       | K8s experience to justify meanness. I'm really more just
       | surprised at how path-dependent the industry is; even systems
       | that were consciously built not to echo Borg, even greenfield
        | systems like Flynn that were reimaginings of orchestration,
       | all seem to follow the same model of central, allocating
       | schedulers based on distributed consensus about worker inventory.
        
         | linuxftw wrote:
         | The nice thing about a centralized scheduler is the code for
         | scheduling only needs to run in one place. If each node 'bids'
         | on the workload, you need to run scheduling fit on every node
         | for every application. You also need to work out a way to
         | decide who wins the bid.
         | 
         | Every time you want to change your scheduling logic, guess
         | what? Gotta update all the nodes.
         | 
         | The k8s scheduler is battle-tested. Across all the different
         | environments and CI pipelines, it has scheduled billions, if
         | not trillions, of pods at this point. If you have a better way
         | of scheduling workloads across nodes, I advise you to make it
         | work for kubernetes and sell it as an enterprise scheduler, and
         | you'll make tons of money. Even if it's not better for the
         | general use case, and just better for some edge (pun intended)
         | use case, it would do quite well.
         | 
         | IMO, all roads lead to k8s. You're just eventually going to
         | have to solve the same set of problems.
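(A tiny sketch of the node-bidding flow described above, in hypothetical Python - the fit rule and bid formula are made up for illustration, as is the requester-side tie-breaking.)

```python
# Hypothetical "nodes bid on work" flow: each node runs the fit check
# locally and returns a bid; the requester takes the best valid bid.

def node_bid(node, job):
    """Return None if the job doesn't fit, else a bid (lower is better)."""
    if node["free_cpu"] < job["cpu"] or node["free_mem"] < job["mem"]:
        return None
    # This toy bid favors the node with the least leftover headroom,
    # i.e. a loose bin-packing preference.
    return (node["free_cpu"] - job["cpu"]) + (node["free_mem"] - job["mem"])

def place(nodes, job):
    # "Deciding who wins the bid" happens requester-side here.
    bids = {n["name"]: node_bid(n, job) for n in nodes}
    valid = {name: b for name, b in bids.items() if b is not None}
    return min(valid, key=valid.get) if valid else None

nodes = [
    {"name": "a", "free_cpu": 2, "free_mem": 4},
    {"name": "b", "free_cpu": 8, "free_mem": 16},
]
winner = place(nodes, {"cpu": 4, "mem": 8})  # only "b" fits
```

The trade-off the comment raises is visible even in the toy: the fit logic in node_bid() lives on every node, so changing the policy means updating all of them.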
        
           | tptacek wrote:
           | The code for scheduling in mainstream schedulers runs in lots
           | of places, which is why mainstream schedulers tend to run
           | Paxos or Raft.
        
         | lupex wrote:
          | Borg - at least as it is published in those papers(*) - and
          | these other schedulers are really designed to maximize the
          | availability of large distributed apps.
          | 
          | (*) The real Borg now does a lot more...
         | 
          | But there is a completely different world if you look
          | elsewhere. In batch systems like supercomputers (HPC), you
          | want to maximize throughput, not availability. It is common
          | to have something closer to your design: an inventory of
          | jobs and a scheduler that allocates workers to them. E.g.
          | https://hpc-wiki.info/hpc/Scheduling_Basics
        
         | vidarh wrote:
         | Be mean about K8s, it can take it. I don't think any of the
         | criticism of it is unreasonable. It's a huge, convoluted beast.
         | To me, K8s makes most sense in terms of creating a swiss army
         | knife that lots of people know.
         | 
          | You can do far better than K8s for specialised cases like
          | yours and/or with people who have the right skills. But for
          | a lot of people with simpler needs, it's better to stick
          | with what is easy to hire for than to look for an optimal
          | solution.
        
           | intelVISA wrote:
           | > You can do far better than K8s
           | 
           | "Nobody ever got fired for K8s" is how I imagine most SWEs
           | justify using such a crappy system. Kinda like SAP.
        
             | btown wrote:
             | However bad you think k8s is, SAP is worse on its best day.
        
               | [deleted]
        
       | jallmann wrote:
       | This was a great article. While I was at Livepeer (distributed
       | video transcoding on Ethereum [1]), we converged onto a very
       | similar architecture, disaggregating scheduling into the client
       | itself.
       | 
       | The key piece is to have a registry [2] with a (somewhat) up-to-
       | date view of worker resources. This could actually be a
       | completely static list that gets refreshed once in a while.
       | Whenever a client has a new job, they can look at their latest
        | copy of the registry, select workers that seem suitable, submit
       | jobs to workers directly, handle re-tries, etc.
       | 
       | One neat thing about this "direct-to-worker" architecture is that
       | it allows for backpressure from the workers themselves. Workers
       | can respond to a shifting load profile almost instantaneously
       | without having to centrally deallocate resources, or wait for
       | healthchecks to pick up the latest state. Workers can tell
       | incoming jobs, "hey sorry but I'm busy atm" and the client will
       | try elsewhere.
       | 
       | This also allows for rich worker selection strategies on the
       | client itself; eg it can preemptively request the same job on
       | multiple workers, keep the first one that is accepted, and cancel
       | the rest, or favor workers that respond fastest, and so forth.
       | 
       | [1] We were more of a "decentralized job queue" than "distributed
       | VM scheduler" with the corresponding differences, eg shorter-
       | lived jobs with fluctuating load profiles and our clients could
       | be thicker. But many of the core ideas are shared, even our
       | workers were called "orchestrators" which in turn could use
       | similar ideas to manage jobs on GPUs attached to it... schedulers
       | all the way down!
       | 
       | [2] Here the registry seems to be constructed via the Corrosion
       | gossip protocol; we used the blockchain with regular healthcheck
       | probes.
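(An illustrative sketch of the "direct-to-worker" flow this comment describes - a client working from a possibly stale registry snapshot, with a "busy" reply acting as backpressure. All names and fields are hypothetical.)

```python
# The client walks its last registry snapshot, submits directly to a
# worker, and treats a refusal as backpressure to try elsewhere.

class Busy(Exception):
    pass

def submit(worker, job):
    # Stand-in for a network call; a real worker decides from its own
    # instantaneous load whether to accept.
    if worker["load"] >= worker["capacity"]:
        raise Busy()
    worker["load"] += 1
    return f"{job} accepted by {worker['name']}"

def run_job(registry, job):
    # Prefer the least loaded workers per our (stale) snapshot.
    for worker in sorted(registry, key=lambda w: w["load"]):
        try:
            return submit(worker, job)
        except Busy:
            continue  # worker pushed back; try the next candidate
    raise RuntimeError("no worker accepted the job")

registry = [
    {"name": "w1", "load": 0, "capacity": 0},  # full: will refuse
    {"name": "w2", "load": 1, "capacity": 3},
]
result = run_job(registry, "job-42")  # w1 refuses, w2 accepts
```

The richer strategies mentioned above (racing the same job on several workers, cancelling the losers) slot into run_job() without any central component changing.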
        
       | coredog64 wrote:
       | Although it's currently archived, there's another open source
       | orchestrator that's similar to Borg: Treadmill.
       | 
       | AFS solves for Colossus, with packages being distributed into
       | AFS.
       | 
       | [0] https://github.com/morganstanley/treadmill
        
       | korijn wrote:
        | I can't shake the sense that they just prefer building
        | something new (and blogging about it) over figuring out how
        | to configure a/the scheduler properly.
       | 
       | Anyway, they did get it done and made it work, so whatever.
        
         | [deleted]
        
         | mrkurt wrote:
         | We built Fly.io to treat chronic NIH syndrome.
        
           | korijn wrote:
           | I don't understand how that relates to my comment.
           | 
           | Anyway, I'm happy you seem to be successful. Keep up the good
           | work.
        
             | linuxftw wrote:
              | NIH syndrome is 'not invented here syndrome.' This
              | equates to the 'build something new and blog about it'
              | remark you made, IMO.
        
         | [deleted]
        
       | schmichael wrote:
       | As the Nomad Team Lead, this article is a _gift_ - thank you Fly!
        | - even if they're transitioning off of Nomad. Their description
       | of Nomad is exactly what I would love people to hear, and their
       | reasons for DIYing their own orchestration layer seem totally
       | reasonable to me. Nomad has never wanted people to think we're
       | The One True Way to run all workloads.
       | 
       | I hope Nomad covers cases like scaling-from-zero better in the
       | future, but to do that within the latency requirements of a
       | single HTTP request is quite the feat of design and
       | implementation. There's a lot of batching Nomad does for scale
        | and throughput that conflicts with the desire for minimal
        | placement+startup latency, and it's yet to be seen whether
        | "having our cake and eating it too" is physically possible,
        | much less whether we can package it up in a way that lets
        | operators understand what tradeoffs they're choosing.
       | 
       | I've had the pleasure of chatting with mrkurt in the past, and I
       | definitely intend to follow fly.io closely even if they're no
       | longer a Nomad user! Thanks again for yet another fantastic post,
       | and I wish fly.io all the best.
        
         | tptacek wrote:
         | This whole article started with me rewatching your Nomad deep
         | dive video and then chasing papers. :)
        
       ___________________________________________________________________
       (page generated 2023-02-02 23:00 UTC)