[HN Gopher] Treat Kubernetes clusters as cattle, not pets
___________________________________________________________________

Treat Kubernetes clusters as cattle, not pets

Author : mffap
Score  : 110 points
Date   : 2021-06-30 13:24 UTC (9 hours ago)

(HTM) web link (zitadel.ch)
(TXT) w3m dump (zitadel.ch)

| mrweasel wrote:
| While I don't disagree, this is also a reminder that debugging Kubernetes can be terribly complicated.

| ulzeraj wrote:
| I want to go back in time, when naming your servers after X-Men characters or Dune houses was a thing. I'm not a big fan of this brave new DevOps world.

| exdsq wrote:
| My first job, tech support, had fantasy server names. That was fun :)

| psanford wrote:
| > It means that we'd rather just replace a "sick" instance by a new healthy one than taking it to a doctor.
|
| This analogy really bothers me. Cattle are expensive. They are an investment. You don't put down an investment just because it got sick.
|
| If you have a sick cow you will in fact call your local large-animal vet to come and treat it.

| ska wrote:
| Yes, that statement is wrong. I guess the real difference in the pet/cow calculus is that for cattle you probably won't pay more for treatment than the cow is worth; with pets, people do this all the time.

| psanford wrote:
| Yup. If the argument is "don't fall in love with your servers", I'm in agreement.
|
| However, the idea that whenever anything weird happens you should just kill your server/cluster and move on without doing any sort of investigation seems like a recipe for disaster. That's a great way to mask bugs that may in fact be systemic in nature, and that are causing or will eventually cause service degradation for your customers.
|
| I would hate to work in an environment where bugs are ignored and worked around instead of understood and fixed.

| lamontcg wrote:
| I like the idea of killing it and moving on without paging someone at 2 AM.
|
| Ideally that goes into an async queue of issues, though, and someone finds the root cause, and that goes into a queue of issues to fix which is actually burned down.
|
| I suspect what is happening more often is that the whole stack has so many levels, and the SREs responsible for it all don't have the visibility into the stack they need to debug it all, so they use their SLOs as a club to ignore issues as long as they're meeting their metrics, until it becomes a firefighting drill.
|
| A pile of cargo-culted best practices and SLOs replacing hands-on debugging.

| pyuser583 wrote:
| Don't let facts get in the way of a good metaphor.
|
| There's nothing remotely disgusting about making sausages, but people say "Laws are like sausages - it's better not to see them being made."
|
| I stopped correcting them long ago.

| coding123 wrote:
| AWS, in my mind, can quickly lose the Kubernetes war amongst cloud providers. This is every cloud provider's chance: EKS on AWS is so damn tied into a bunch of other AWS products that it's literally impossible to just delete a cluster now. I tried. It's tied into VPCs and subnets and EC2 and load balancers and a bunch of other products that no longer make sense now that K8s won.
|
| In my opinion it needs to be re-engineered completely into a super slim product that is not tied to all these crazy things.

| ffo wrote:
| You mean EKS needs re-engineering?
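A declarative cluster definition is what makes the "cattle" workflow concrete here. As a minimal sketch, an eksctl config file along these lines (cluster and node-group names are hypothetical; check the ClusterConfig schema for your eksctl version) makes create and delete symmetric:

```yaml
# cluster.yaml - illustrative eksctl ClusterConfig
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo           # hypothetical cluster name
  region: eu-central-1
nodeGroups:
  - name: workers      # hypothetical node group
    instanceType: m5.large
    desiredCapacity: 3
```

`eksctl create cluster -f cluster.yaml` stands the cluster up, and `eksctl delete cluster -f cluster.yaml` tears it down again, including the VPC and node groups eksctl created for it - which sidesteps much of the entanglement described above.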
| kinghajj wrote:
| Also, the fact that a new EKS cluster takes at least 20 minutes to come up and be ready makes AWS' offering the weakest among the Big Three cloud vendors.

| jen20 wrote:
| This isn't true. I've provisioned 53 EKS clusters this week, and every one of them has been up in under 11 minutes with all of the accoutrements. I understand it has been substantially slower in the past.

| nickjj wrote:
| 11 minutes is still a long time IMO.
|
| You can spin up a local multi-node cluster using kind[0] in 1 minute on 6+ year old hardware. I know it's not the same, but I really have to imagine there are ways to speed this up on the cloud. I haven't spun up a cluster on DO or Linode in a while; does anyone know how long it takes there, for comparison?
|
| [0]: https://kind.sigs.k8s.io/

| mlnj wrote:
| I think it is unfair to compare KinD with an actual Kubernetes cluster which comes with load balancers, external IP addresses, etc.
|
| My Terraform scripts get an HA K3s cluster in Google Cloud VMs in less than 9 minutes, which in my opinion is fantastic.

| misiti3780 wrote:
| Are you using Terraform?
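For scale, the local multi-node cluster nickjj mentions is only a few lines of kind config - a minimal sketch, with a hypothetical file name:

```yaml
# kind-config.yaml - one control-plane node and two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

`kind create cluster --config kind-config.yaml` then boots all three "nodes" as Docker containers, which is why it is so fast - and also why, as mlnj notes, it isn't directly comparable to a managed cloud cluster.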
| hodgesrm wrote:
| It really depends on the use case. We use Kubernetes for hosting managed data warehouses. EKS cluster spin-up time is a one-time cost to start a new environment for data warehouses. It's insignificant compared to the time required to load data. Other issues like auto-scaling performance are more important, at least in our case.

| hughrr wrote:
| Also the base cluster cost of $0.10 an hour really hurts for test and experimentation.

| jrockway wrote:
| GCP charges the same now (though your first one is free). I don't think it's an unreasonable cost, but it's certainly not competitive with places like DigitalOcean, which are still free.

| atmosx wrote:
| Not sure. I'm using DigitalOcean's managed version and have experience with EKS. There are pros and cons on both sides. What I like about EKS is the support for API and k8s logs (accounting/security) coupled with CloudTrail, and the IAM integration for RBAC/user management - it's bliss.

| misiti3780 wrote:
| I am about to move our infrastructure from ECS to EKS. What other AWS products is EKS tied to?

| k__ wrote:
| I don't have the impression AWS even sees it as a war worth fighting.
|
| And I can't blame them.

| unfunco wrote:
| It's not literally impossible to delete a cluster. I do this many times daily, fully automated, with no issues, and an EKS cluster is not tied to a VPC or subnets; you can spin them up independently, and you can delete your clusters without affecting the VPC or subnets in any way.
|
| An Ingress or a Service of type LoadBalancer will create a load balancer in AWS that's tied to your cluster, but that's the whole point of Kubernetes: it'll spin up the equivalent resources in Azure or GCP or DigitalOcean.
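To make unfunco's point concrete: the AWS load balancer exists because a workload asked for one, not because EKS itself needs it. A minimal sketch of such a Service (name, selector and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web             # hypothetical
spec:
  type: LoadBalancer    # provisions an ELB on EKS; GKE/AKS/DO create their equivalents
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

Deleting the Service releases the cloud load balancer again - which is also why it is good hygiene to delete LoadBalancer Services and Ingresses before tearing a cluster down.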
| aliswe wrote:
| > ... the GitOps pattern is gaining adoption for Kubernetes workload deployment.
|
| Is it really, though? I for one am glad I didn't jump on the bandwagon early. A lot of the articles popping up nowadays mentioning the downsides of GitOps make a lot of sense.

| tw600040 wrote:
| This idea that cattle are meant to be slaughtered and can't be pets is extremely offensive.

| sweetheart wrote:
| Preach. It's a metaphor that enforces harmful beliefs, like master/slave terminology. Our language defines our world, so we should aim to shy away from language that amplifies a harmful message, even in radically different contexts.

| zachrose wrote:
| I'd like to see a vegan alternative to the pets/cattle meme.

| Zababa wrote:
| Fungible vs. non-fungible seems to be the concept hiding behind pets/cattle. You can substitute a dollar for another; you can't substitute your favorite rock for another.

| sweetheart wrote:
| Screen printing, not painting.

| mfer wrote:
| If I've read the Google papers on Borg right (Kubernetes is conceptually Borg v3, with Omega being v2), this is different from how Google runs things.
|
| They'll do warehouse-scale computing with Borg operating large clusters. Borg is at the bottom.
|
| The workloads spanning dev, test, and prod then run on these clusters. By having large clusters with lots of things running on them, they get high utilization of the hardware and need less hardware.
|
| It's amusing to see k8s used in such a different way and one that often uses a lot more hardware while driving up costs - with the very concepts Google used to lower them.
|
| Or, maybe I read the papers and book wrong.
|
| I like the idea of higher utilization and better efficiency, because it uses fewer resources, which is more green.

| closeparen wrote:
| Exactly. We practice this with Mesos. The point is that a central infrastructure team maintains the (regional) clusters and provides an interface for application teams to submit their services. Each application team maintaining its own cluster, or dealing directly with the full power of the cluster scheduler ecosystem, would defeat the purpose.

| atmosx wrote:
| > It's amusing to see k8s used in such a different way and one that often uses a lot more hardware while driving up costs.
|
| To be fair, Google runs its own datacenters, with teams doing research & optimization on all levels of the stack, hardware and software, and most importantly an amount of engineering resources converging to infinity.
|
| The rest are stuck with VMs that share network interfaces and have to monitor CPU steal, understand complex pricing models, etc. Engineering resources are scarce, so most companies will over-provision just to be safe, and because profiling the application and fixing that API call that takes too long is expensive, they will spin up another 50 pods.

| bambambazooka wrote:
| Could you please name the papers and the book?

| q3k wrote:
| Not GP, but:
|
| https://research.google/pubs/pub43438/
|
| https://sre.google/sre-book/production-environment/
|
| are the best public documentation entrypoints that I know of.

| q3k wrote:
| > Or, maybe I read the papers and book wrong.
|
| No, that's exactly how it works. You have clusters spanning a datacenter failure domain (~= an AZ), and everything from prod to dev workloads runs on there, with low-priority batch jobs bringing the resource utilization up to a sensible level.
|
| You can do the same thing with k8s, you just have to trust its multitenancy support. You have RBAC, priority, quotas, preemption, pod security policies, network policies... Use them. You can even force some workloads to use gVisor, or separate prod and dev workloads onto different worker machines.
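Two of the multitenancy knobs q3k lists, sketched minimally (the namespace and object names are hypothetical):

```yaml
# Cap what one tenant's namespace can request in total.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    pods: "200"
---
# Let production workloads preempt lower-priority batch/dev jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-critical
value: 1000000
globalDefault: false
description: Production workloads; preempts lower-priority pods under contention.
```

Pods opt in with `priorityClassName: prod-critical`; batch and dev jobs get lower values and are the ones preempted under contention, which is what makes the mixed prod/dev/batch packing described above workable.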
| conradev wrote:
| I've also seen Kubernetes being used this way, with one cluster per data center for company-wide utilization, with segmentation at the _namespace_ layer. The priority class system is heavily utilized to make sure production workloads are always running, and other workloads are preempted as needed.

| pyuser583 wrote:
| Sounds like a Mesos-style arrangement.

| fulafel wrote:
| How do they do version upgrades? Isn't that the traditional Achilles' heel of K8s that leads people to want to frequently recreate clusters from scratch and/or do blue/green?

| mfer wrote:
| First, if I understand it right... Google does some smart things in upgrades. They do things like run tests and then upgrade their equivalent of AZs in a data center. I'm sure there have been upgrades gone bad that they've had to fix.
|
| Kubernetes can be upgraded. I've watched nicely done upgrades happening 4 or 5 years ago. I've watched simple upgrades happen more recently. It's not unheard of. Even in public clouds I've upgraded Kubernetes through many minor versions without issue.
|
| I would argue it's more work to create more clusters. You need to migrate workloads and anything pointing to them. It also would cost more, as you have to run more hardware.

| [deleted]

| dharmab wrote:
| We have active clusters that have been continually updated since 1.13 or so.
|
| Of course it is the cluster of Theseus, since every single bit of compute has been entirely replaced :)

| moondev wrote:
| Highly available kubeadm clusters are designed to be upgraded in place. The Kubernetes api-server is also designed to function with a minor version skew (for example v1.19.x and v1.20.x) that would happen during an in-place upgrade.
|
| Cluster API takes the above and can in-place upgrade clusters for you. It's pretty awesome to see first hand. Bumping the Kubernetes version string and machine image reference can upgrade and even swap distros of a running cluster with no downtime.
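A heavily abbreviated sketch of what "bumping the Kubernetes version string" looks like with Cluster API. Treat this as illustrative only: the exact API group/version and the required fields (machine template, kubeadm config, infrastructure references) vary by Cluster API release and are omitted here, and the names are hypothetical:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha4  # varies by Cluster API release
kind: KubeadmControlPlane
metadata:
  name: prod-control-plane
spec:
  replicas: 3
  version: v1.21.2  # was v1.20.8; editing this triggers a rolling in-place upgrade
```

On such an edit the controllers replace control-plane machines one at a time at the new version - the cattle pattern applied to the control plane itself.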
| dharmab wrote:
| One caveat is that downgrading the apiserver is not guaranteed to be possible, since the schema of some types in the API may have been migrated to newer versions in the etcd database that the previous version may not be able to read. There are tools such as Velero (https://velero.io) which can restore a previous snapshot, but you will likely incur downtime and lose any changes since the snapshot.

| lokar wrote:
| Nothing important is single-homed to one cluster. Much emphasis is placed on preventing correlated failures of clusters.

| jeffbee wrote:
| The Borg concepts and interfaces have been the same for ages. The borglet and borgmaster get released and pushed out very frequently (daily or weekly) and it doesn't break anything among the workloads. There are no maintenance windows for these changes, because containers can run while the borglet restarts, and borgmasters are protocol/format-compatible, so the new release binary simply joins the quorum. Machines also get rotated out of the Borg cell regularly for kernel upgrades, way more often than I've seen outside of Google.
|
| I think an important thing to know about K8s is that Omega failed to replace Borg, and then the Omega people created K8s. So K8s does not necessarily descend from Borg, and not all of Borg's desirable attributes made it into K8s.

| dekhn wrote:
| The Omega folks were on the borgmaster and borglet teams when they were building Omega (I was on the Borg team at the same time, but working on a different project). It's fair to say that k8s is an intellectual inheritor of the parts of Borg that are required for it to be useful on the outside.
|
| Borglet/borgmaster releases definitely break workloads. I recall one where something was rolled out to all the test clusters, passed all the tests (except one of ours), and was about to be promoted to increasing percentages of the prod clusters. While debugging why our test (which is not part of the feature rollout) broke, we realized that if this had rolled out to prod, it would have broken all TensorFlow jobs, and would have been a major OMG.
|
| But yeah, most of the time, the release process for borglet and borgmaster is fairly fast and fairly reliable.

| jeffbee wrote:
| > required for it to be useful on the outside.
|
| A big part of this is that outside Google, the number of people who have to _operate_ k8s as a fraction of all k8s users is way higher than the fraction of Borg users who have to operate Borg, so there's a lot of stuff in k8s that is 'end user experience' comforts and affordances for operators.

| q3k wrote:
| If you don't depend on clusters being 100% available all the time and design your applications to handle cluster-wide outages (which you need to do at Google scale anyway), then simply doing progressive rollouts across clusters is good enough. If a cluster rollout fails and knocks some workloads offline, so be it. Just revert the rollout and figure out what went wrong.
|
| You can also throw in some smaller clusters that replicate the 'main' clusters' software stack and have some workloads whose downtime does not impact production jobs (not just dev workloads, but replicas of prod workloads!). These can be the earliest in your rollout pipeline and serve as an early-stage canary.

| lifty wrote:
| You're right, vanilla Kubernetes has a level of complexity that starts paying off at a certain cluster scale. But the wide adoption of K8s also shows that people love the standardization and API it offers for orchestrating workloads, even if they don't take advantage of its scaling capabilities.
|
| My hope is that projects like k3s will manage to cover that small-scale end of the market.

| void_mint wrote:
| I've intentionally ignored Kubernetes for a few years. Is it worth looking into K3s instead?

| antonvs wrote:
| It depends what you're looking for.
|
| One attraction of k3s is you can get a "real" production-grade K8s running on a single machine, or a small cluster, very quickly and easily. That can be great for learning, development and testing, certainly.
|
| K3s is actually targeted at "edge" scenarios where you aren't running in a cloud, and don't have the ability and/or desire to dynamically adjust the cluster size.
|
| You can scale a k3s cluster easily enough manually, adding or deleting nodes, but to get cluster autoscaling you'd need to do environment-dependent work to make it automatic.
|
| If you do want that, then things will likely be quite a bit easier with a managed cluster like Google's GKE, or possibly even a self-installed cluster using a tool like kops, kubespray, etc. (It's a while since I've used any of those, so I'm not sure which is best.) AWS EKS is another choice, although its setup is a bit more complex and I wouldn't really recommend it to someone getting started.
|
| And for production scenarios, integration with cloud load balancers, the network environment, etc. is similarly going to be easier with a provider-managed cluster. It's all possible to do with k3s, but it's more work and a steeper learning curve.

| koeng wrote:
| I've really enjoyed developing on K3s, would recommend.

| andrewstuart2 wrote:
| And furthermore, the primary problem with scaling out the number of clusters is that it hamstrings one of the primary value propositions of Kubernetes: increased utilization. The scheduler can't do its job, without yet another scheduler on top, if you spread your workloads outside of its sphere of influence.

| deeblering4 wrote:
| All this does is make me want to go vegan and avoid maintaining the entire k8s farm.
|
| Truly, if your software team headcount is under 500, why are you running k8s?

| exdsq wrote:
| I'm in a team of 1, but using it because my product is based on spinning up dedicated services for users on demand, so it works well.

| mschuster91 wrote:
| It gives you automatic failover and decent-ish management (at least when coupled with Rancher; naked k8s is nuts) compared to a couple of manually (or Puppet) managed servers.
|
| A well-implemented mini cluster can and will save you so much time in later maintenance and deployments.

| LAC-Tech wrote:
| Raising cattle is a lot of work. You have to weigh them regularly, apply treatments for intestinal worms and for lice, and move them from pasture to pasture so they don't overgraze. It's a full-time job.
|
| Also, if a cow dies, people don't just buy a new one. It represents quite a loss of profit. It also represents a big potential problem on the farm that people will want to resolve - they're your money makers; if they're dying, it's an issue.

| [deleted]

| [deleted]

| rllin wrote:
| really you should treat your entire cloud deployment (sans state where impossible) as cattle

| jeffbee wrote:
| Running a separate cluster for every service assures high overhead and poor utilization. Fine if you can afford it, but be aware that you are paying for it.

| q3k wrote:
| Yeah, especially in production bare-metal clusters. If you want N+2 redundancy, that's at least 5 physical machines for just the control plane (etcd & apiservers), more if you don't want to colocate worker nodes with that...
|
| Even if you have full bare-metal automation and thousands of machines, that seems like unnecessary waste.

| dharmab wrote:
| Add the cost of administrative services: DNS resolvers, ingress controllers, log forwarders, monitoring (e.g. Prometheus, some exporters), autoscalers, tracing infrastructure, admission controllers, backup/disaster recovery tools (e.g. Velero)...
|
| It can add up to millions per year if you aren't auditing your costs.

| dijit wrote:
| Yeah, that's insane as a concept. One of the larger selling points of Kubernetes was bin packing. Removing that selling point leaves you with...
|
| * Orchestration of jobs (restart, start); this can be achieved easily without the complexity of k8s
|
| * Sidecar loading; literally the easiest thing to do with normal VMs
|
| and...?

| dolni wrote:
| Packing everything together is a "selling point" until you find that a service can fill up ephemeral storage and take down other services, or consume bandwidth without limit.
|
| Let's not forget the potential security implications of not keeping things properly isolated.
|
| People who were around when provisioning on bare-metal was still a thing already learned all these lessons. Somehow it seems they have been forgotten by all the people driving hype around Kubernetes.
| q3k wrote:
| > Packing everything together is a "selling point" until you find that a service can fill up ephemeral storage and take down other services, or consume bandwidth without limit.
|
| Ephemeral storage has resource requests/limits in pods.
|
| Traffic shaping/limiting can be accomplished using kubernetes.io/{ingress,egress}-bandwidth annotations. It's not as nice as resources (because there are no quotas and no capacity planning, and it's generally very simplistic), but you can still easily build on this.
|
| Pods can also have priorities, and higher-priority workloads can and will preempt lower-priority workloads.
|
| > Let's not forget the potential security implications of not keeping things properly isolated.
|
| For hardware isolation, you can use gVisor or even Kata Containers.
|
| > People who were around when provisioning on bare-metal was still a thing already learned all these lessons. Somehow it seems they have been forgotten by all the people driving hype around Kubernetes.
|
| Kubernetes explicitly aims to solve resource isolation. It was built by people who have decades of experience solving this exact problem in production, on bare metal, at scale. Effectively, Kubernetes resource isolation is one of the best solutions out there to easily, predictably and strongly isolate resources between workloads _and_ maximize utilization at the same time.
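Both of the knobs q3k mentions live on the pod itself. A minimal sketch (name, image and values are hypothetical; the bandwidth annotations only take effect where the CNI's bandwidth plugin is enabled):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: contained-app                     # hypothetical
  annotations:
    kubernetes.io/ingress-bandwidth: 10M  # traffic shaping via the bandwidth CNI plugin
    kubernetes.io/egress-bandwidth: 10M
spec:
  containers:
    - name: app
      image: nginx:1.21
      resources:
        requests:
          ephemeral-storage: 1Gi          # counted by the scheduler when placing the pod
        limits:
          ephemeral-storage: 2Gi          # the kubelet evicts the pod if it writes past this
```

Note the enforcement model: a pod that exceeds its ephemeral-storage limit is evicted by the kubelet rather than getting ENOSPC - a distinction that comes up again in the issue discussion below.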
| packetlost wrote:
| Does ephemeral storage not have configurable limits? That seems like quite the oversight if not.

| dijit wrote:
| Kubernetes has the concept of limits, especially on ephemeral storage; additionally: if your node becomes unhealthy then the workloads would be rescheduled on another node.
|
| I'm super not hypey about Kubernetes, mostly because the complexity surrounding networking is opaque and built on a foundation of sand... But let's not argue things that aren't true.

| dolni wrote:
| > Kubernetes has the concept of limits, especially on ephemeral storage
|
| So... why are these issues open?
|
| https://github.com/kubernetes/enhancements/issues/1029
| https://github.com/kubernetes/enhancements/issues/361
| https://github.com/kubernetes/kubernetes/issues/54384
|
| > additionally: if your node becomes unhealthy then the workloads would be rescheduled on another node.
|
| Well of course, but you're going to run into that issue (likely) on all of the nodes where the offending service lives.
|
| > But let's not argue things that aren't true.
|
| If what I've said is untrue, looking at open GitHub issues and the Kubernetes documentation is certainly no indication. That's a massive problem all by itself.

| q3k wrote:
| The first issue you've linked concerns quota support for ephemeral storage requests/limits - which is not about the limits themselves, but the ability to set limit quotas per tenant/namespace, e.g., team A cannot use more than 100G of ephemeral storage in total in the cluster. EDIT: No, sorry, it's about using underlying filesystem quotas for limiting ephemeral storage, vs. the current implementation; see the third point below. Also see the KEP: https://github.com/kubernetes/enhancements/tree/master/keps/...
|
| The second is a tracking issue for a KEP that has been implemented but is still in alpha/beta. This will be closed when all the related features are stable. There's also some discussion about related functionality that might be added as part of this KEP/design.
|
| The third issue is about integrating Docker storage quotas with Kubernetes ephemeral quotas - i.e., translating ephemeral storage limits into disk quotas (which would result in -ENOSPC to workloads), vs. the standard kubelet implementation, which just kills/evicts workloads that run past their limit.
|
| I agree these are difficult to understand if you're not familiar with the k8s development/design process. I also had to spend a few minutes on each one of them to understand what the actual state of the issues is. However, they're in a development issue tracker, and the end-user k8s documentation clearly states that ephemeral storage requests/limits work, how they work, and what their limitations are: https://kubernetes.io/docs/concepts/configuration/manage-res...

| fshbbdssbbgdd wrote:
| Wait till you hear that AWS is renting you instances that are running on the same metal as other customers.

| dolni wrote:
| You seem to be asserting that the ability of Linux containers to isolate workloads is on par with virtual machines.
|
| That's just not the case.

| Spooky23 wrote:
| Remember, in big companies the internal politics rule the day. It's cheaper to buy more computers than to become the overlord of computing.

| ffo wrote:
| You don't exactly need to run a cluster per service ;-) Instead you can choose to colocate services which belong together and form a "domain". But don't go the route of building the almighty single Kubernetes cluster where all your domains run in one place.

| eliodorro wrote:
| Decreases utilization, but also decreases coordination between teams (no man-bear-pigs). Also weigh the long-term costs of poorly maintained platforms and infrastructure in disaster cases, security issues, or when migrating to other providers.
|
| High overhead can be automated away; google ORBOS.

| yongjik wrote:
| At this point, why not just take it to the logical conclusion? Treat your business model as cattle, not pets. Customers leaving? Fire up another business until capital runs out, and if it does, no worries, just hop to another job!
|
| Sorry, but I feel like I landed in crazy-land. Kubernetes is already an exercise in how many layers you can insert with nobody understanding the whole picture. Ostensibly, it's so that you can _isolate_ those fucking jobs so that different teams can run different tasks in the same cluster without interfering with each other. Hence namespaces, services, resource requirements, port translation, autoscalers, and all those yaml files.
|
| It boggles my mind that people look at something like Kubernetes and decide "You know what? We need more layers. On top of this."

| tyingq wrote:
| _"You know what? We need more layers. On top of this."_
|
| Heh. Service Mesh!

| rantwasp wrote:
| K8s on K8s! K64s!

| voidfunc wrote:
| But can it run on an N64?

| yongjik wrote:
| Kuberneteuberneteuberneteuberneteuberneteuberneteuberneteubernetes is a perfect name for the beautiful system you are about to architect: it's an ancient Greek word for helmsman-man-man-man-man-man-man-man, who is in charge of helmsman-man-man-man-man-man-man, who is in charge of ...
|
| (...sorry =_=)

| spondyl wrote:
| I mean, depending on the business, employees aren't trusted to understand the whole picture regardless. Many employees at traditional businesses don't even have administrator access on their laptops, depending on their position, so it's not logically inconsistent with how things seem to operate. With that lack of big-picture overview, it makes things hard to scrutinise, since you're only seeing one sliver of an implementation (i.e. "It saves money" without seeing, let alone understanding, the technical requirements, and vice versa).
| aynyc wrote:
| Limited admin practice isn't just about saving money. It's a security practice, and a good one. In a large business, no one understands the whole business, except maybe the legal department and the CEO.

| MuffinFlavored wrote:
| > In a large business, no one understands the whole business, except maybe the legal department and the CEO.
|
| lol... what?
|
| A CEO of a 20,000+ person company literally sees projects as revenue sources and a deadline, nothing more.
|
| To get it delivered, he/she walks down the chain of managers until they get answers. It might as well be a black hole.

| rq1 wrote:
| > It means that we'd rather just replace a "sick" instance by a new healthy one than taking it to a doctor.
|
| Oh god! Please treat your cattle better!

| tnisonoff wrote:
| When I worked at Asana, we created a small framework that allowed for blue-green deployments of Kubernetes clusters (and the apps that lived on top of them) called KubeApps[0].
|
| It worked out great for us - upgrading Kubernetes was easy and testable, we never worried about code drift, etc.
|
| [0] https://blog.asana.com/2021/02/kubernetes-at-asana/ (Not written by me.)

| AtNightWeCode wrote:
| Resource utilization is the main reason I would run a cluster in the first place. Immutable infrastructure is also expensive to build and maintain.

| tnisonoff wrote:
| For larger companies, I think a huge benefit of Kubernetes is the shared language for defining and operating services, as well as the well-thought-out abstractions for how these services interact.
|
| Costs are generally less of a concern, but having one way of running, operating, and writing services allowed our dev team to move faster, share knowledge, etc.

| ffo wrote:
| True, cost is not the biggest issue. Separation of teams with different velocities and needs, on the other hand, is.
|
| One API as an abstraction, with shared processes, eases the pain for the people relying on a platform.
| jzelinskie wrote:
| I like the idea of "Building on Quicksand" as the analogy for distributed systems, but also for maintaining your software dependencies. This article basically recommends trying to minimize your dependencies to keep reproducibility/portability high. I generally agree with this, but also carry an "all things within reason" mentality. But just as the article describes coworkers growing into their cluster, the complexity of what they run in their cluster will also grow over time, and eventually they'll realize they've just built up their own "distribution". A few years ago, I wrote a post asking people to think critically when they hear someone mention "vanilla" Kubernetes[0].
|
| The real problem they suffered is actually that Kubernetes isn't fundamentally designed for multi-tenancy. Instead, you're forced to make separate clusters to isolate different domains. Google themselves run multiple Borg clusters to isolate different domains, so it's natural that Kubernetes ended up with a similar design.
|
| [0]: https://jzelinskie.com/posts/youre-not-running-vanilla-kuber...
|
| Disclosure: I worked as an engineer and product manager on CoreOS Tectonic, the (now defunct) Kubernetes distribution used in the post.

| GauntletWizard wrote:
| You're just wrong, unfortunately - Google runs dev and test and prod all on the same clusters. Kubernetes multi-tenancy works just fine, but the conventional definition of multi-tenancy includes things like "network isolation" that are misguided. Multi-tenancy should be set up (and is, within Google) by understanding what is and isn't shared with the environment, and through cryptographic assertion of who you're speaking to. If you want to see the latter part nicely integrated, come to a SIG Auth meeting and help me argue for it.

| jzelinskie wrote:
| All I said was that they use separate clusters to "isolate domains", which is a purposely vague description - I did not intend to claim they do it for different deployment environments, as you've described.
|
| It's fairly subjective what types of isolation define "multi-tenancy", which is why there hasn't been progress made despite SIG and WG efforts in the past. While you do not believe network isolation should be included, there are plenty of developers working on OpenShift Online who may disagree. OSO lets anyone on the internet sign up for free and instantly be given their own namespace on a shared cluster full of untrusted actors.

| q3k wrote:
| > If you want to see the latter part nicely integrated, come to a SIG Auth meeting and help me argue for it.
|
| Invitation accepted! :) I've been dying to see an ALTS-like [1] thing that works with Kubernetes. I really should be able to talk encrypted and authenticated gRPC-to-gRPC without ever having to set up secrets or manually provision certificates, dammit.
|
| [1] - https://cloud.google.com/security/encryption-in-transit/appl...

| ffo wrote:
| Hey, we used Tectonic ;-) It was a great tool at that time. Tectonic did influence some of the concepts around ORBOS. Just think of Tectonic combined with GitOps, minus the iPXE part.
|
| Disclaimer: I am working with ORBOS.
___________________________________________________________________
(page generated 2021-06-30 23:00 UTC)