[HN Gopher] Writing a Kubernetes Operator
       ___________________________________________________________________
        
       Writing a Kubernetes Operator
        
       Author : todsacerdoti
       Score  : 141 points
       Date   : 2023-03-09 13:41 UTC (9 hours ago)
        
 (HTM) web link (metalbear.co)
        
       | sigwinch28 wrote:
       | I find myself conflicted between two approaches at work:
       | 
       | 1. Write a provider/extension/whatever for a tool like Terraform
       | or Pulumi. I live in a world where the infrastructure doesn't
       | move underneath my feet. I am the source of truth. I feel like I
       | only need to reconcile changes when _I_ make changes to my IaC
       | repositories.
       | 
       | 2. I could write something that exists in a control plane, like
       | Kubernetes operators or Crossplane. I live in a world where I
       | look at the world, find the delta between current state and
       | desired state, then try to reconcile. This is an endless loop.
       | 
       | I feel like these are different approaches with the same goal.
       | Why should I decide either way beyond tossing a coin?
       | 
       | Some use cases:
       | 
       | - an internal enterprise DNS system which is not standards-
       | compliant with the world at large
       | 
       | - an internal certificate authority and certificate issuing
       | system.
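
A minimal sketch of the "find the delta between current state and desired state, then reconcile" step described above, using the internal DNS use case. The record names, the `Action` type, and the in-memory `observed` dictionary are illustrative assumptions, not any real provider or operator API. Roughly speaking, approach 1 runs this step when the IaC repository changes, while approach 2 runs it again and again so that drift is also corrected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Action:
    op: str  # "create", "update", or "delete"
    name: str
    value: Optional[str] = None


def diff(desired: Dict[str, str], observed: Dict[str, str]) -> List[Action]:
    """Return the actions that move `observed` toward `desired`."""
    actions: List[Action] = []
    for name, value in sorted(desired.items()):
        if name not in observed:
            actions.append(Action("create", name, value))
        elif observed[name] != value:
            actions.append(Action("update", name, value))
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(Action("delete", name))
    return actions


def apply(actions: List[Action], observed: Dict[str, str]) -> None:
    """Apply actions to the managed system (here: an in-memory dict)."""
    for action in actions:
        if action.op == "delete":
            del observed[action.name]
        else:
            observed[action.name] = action.value


def reconcile_once(desired: Dict[str, str], observed: Dict[str, str]) -> int:
    """One pass of the loop; returns how many changes were needed."""
    actions = diff(desired, observed)
    apply(actions, observed)
    return len(actions)
```

Calling `reconcile_once` a second time with unchanged inputs returns 0. If something edits `observed` behind your back, the next pass picks it up, which is the practical difference between the two approaches.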
        
       | debarshri wrote:
       | A better way to write an operator these days is to use
       | kubebuilder [1].
       | 
       | My complaint is that I have seen orgs write operators for random
       | stuff, often reinventing the wheel. A lot of operators in orgs
       | are the result of resume-driven development. Having said that,
       | they often come in handy for complex orchestration.
       | 
       | [1] https://github.com/kubernetes-sigs/kubebuilder
        
         | casperc wrote:
         | What would be a good example where an operator would make
         | sense?
        
           | EdwardDiego wrote:
           | I worked on an operator that manages Kafka in K8s. If you
           | want to upgrade the brokers in a Kafka cluster, you generally
           | do a rolling upgrade to ensure availability.
           | 
           | The operator will do this for you, you just update the
           | version of the broker in the CR spec, it notices, and then
           | applies the change.
           | 
           | Likewise, some configuration options can be applied at
           | runtime, some need the broker to be restarted to be applied,
           | the operator knows which are which, and will again manage the
           | process of a rolling restart if needed to apply the change.
           | 
           | You can also define topics and users as custom resources, so
           | have a nice Gitops approach to declaring resources.
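
A rough sketch of the two decisions described above: whether a config change needs a restart, and how to roll brokers one at a time. The broker names, version strings, setting names, and the `is_healthy` callback are all hypothetical; this is not taken from any real Kafka operator.

```python
from typing import Callable, Dict, List, Set

# Hypothetical setting names; which real settings are dynamic is
# knowledge the operator author has to encode.
DYNAMIC_SETTINGS: Set[str] = {"example.dynamic.setting"}


def needs_restart(changed_settings: Set[str]) -> bool:
    """True if any changed setting cannot be applied at runtime."""
    return bool(changed_settings - DYNAMIC_SETTINGS)


def rolling_upgrade(
    brokers: Dict[str, str],
    target_version: str,
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Move brokers to target_version one at a time.

    Stops after the first broker that fails its health check, so the
    remaining brokers keep serving on the old version. Returns the
    names of the brokers that were restarted.
    """
    restarted: List[str] = []
    for name in sorted(brokers):
        if brokers[name] == target_version:
            continue  # already at the desired version
        brokers[name] = target_version  # stand-in for replacing the pod
        restarted.append(name)
        if not is_healthy(name):
            break  # do not touch further brokers until this one recovers
    return restarted
```

A real operator would spread this across many reconcile passes instead of a single function call, but the ordering and stop-on-failure logic is the part that is tedious to do by hand.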
        
           | debarshri wrote:
           | There is a whole list of public operators that you can find
           | in operator hub [1].
           | 
           | [1] https://operatorhub.io/
        
           | spenczar5 wrote:
           | Operators make sense when you need to automatically modify
           | resources in response to changes in the cluster's state.
           | 
           | An example that has come up for me is an operator for a Kafka
           | Schema Registry. This is a service that needs some
           | credentials in a somewhat obscure format so it can
           | communicate very directly with a Kafka broker. If the
           | broker's certificates (or CA) are modified, then the Schema
           | Registry needs to have new credentials generated, and needs
           | to be restarted. But the registry shouldn't (obviously) have
           | direct access to the broker's certificates. Instead, there's
           | a more-privileged subsystem which orchestrates that dance;
           | that's the operator.
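
The "dance" above can be sketched as a controller that remembers a fingerprint of the broker CA and acts only when it changes. The class name and the event strings are made up for illustration; a real implementation would read the CA from wherever the cluster stores it and would generate credentials in whatever format the registry expects.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def fingerprint(ca_pem: bytes) -> str:
    """Stable identifier for the CA material the controller last acted on."""
    return hashlib.sha256(ca_pem).hexdigest()


@dataclass
class RegistryCredentialController:
    last_seen: Optional[str] = None
    # Stand-in for side effects; a real controller would call out here.
    events: List[str] = field(default_factory=list)

    def reconcile(self, ca_pem: bytes) -> bool:
        """Return True if credentials were regenerated on this pass."""
        current = fingerprint(ca_pem)
        if current == self.last_seen:
            return False  # nothing changed, nothing to do
        self.events.append("regenerate-credentials")
        self.events.append("restart-registry")
        self.last_seen = current
        return True
```

Only the controller ever sees the CA bytes; the registry just receives the derived credentials, which matches the privilege split described in the comment.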
        
           | sleepybrett wrote:
           | kubernetes itself is a collection of controllers/operators.
           | It takes manifests like pods and uses that information to
           | create the workload in your container runtime on a node with
           | the resources it needs.
        
           | debarshri wrote:
           | A good example from my perspective is when you are delivering
           | an application as a third-party vendor and you wish to
           | automate a lot of operational work, like backups, scaling
           | based on events, or other actions triggered by cluster
           | events. It starts becoming very valuable. I am sure there are
           | many more use cases.
        
             | jrockway wrote:
             | I would not write an operator to do any of these things. To
             | me an "operator" strongly implies the existence of a CRD
             | and the need to manage it. So for autoscaling, HPA/VPA are
             | built into k8s. Backups should be an application-level
             | feature; when the "take a backup" RPC or time arrives, take
             | a backup and dump it in configured object storage.
             | Automating stuff based on cluster events also doesn't
             | require an operator; call client.V1().Whatever().Watch and
             | do what you need to do.
             | 
             | The only moderately justifiable operator I've ever seen is
             | cert-manager. Even then, one wonders what it would be like
             | if it just updated a Secret every 3 months based on a hard-
             | coded config passed to it, and skipped the CRDs.
        
           | jhoelzel wrote:
           | - creating databases for your app on the fly.
           | 
           | - scaling applications up and down based on time instead of
           | demand, or based on non-metric triggers
           | 
           | - Extending kubernetes to understand your workload
           | 
           | - Automating configuration and management of complex
           | applications
           | 
           | - Managing legacy applications that cannot be easily
           | containerized or migrated to the cloud.
           | 
           | if you love k8s you'll love operators
           | 
           | the list is endless!
        
             | dilyevsky wrote:
             | With respect, being "in love" with a technology is not a
             | good way to go about it - it leads to tunnel vision
        
           | remram wrote:
           | An operator operates something, e.g. it actively makes
           | changes. If you want to deploy an application, a Helm Chart
           | is the correct way. It will allow you to have deterministic
           | deployment, that you can duplicate multiple times in your
           | cluster, and you can dry-run it and see the generated
           | manifests.
           | 
           | An operator is needed when you can't just deploy and forget
           | about it. An example is the Prometheus operator, which will
           | track annotations created by users to configure the scraping
           | configuration of your Prometheus instances. Another example
           | is cert-manager, which gets certificates into secrets based
           | on Certificate and Ingress objects, renews them automatically
           | before expiry, and does that by creating ingresses picked up
           | by your ingress controller.
           | 
           | The advantage of an operator is that it will react to stuff
           | happening in the cluster. The drawback is that it reacts to
           | stuff happening, potentially doing unexpected things because
           | changes happen at any time and you can't dry-run them.
           | Another drawback is that they are usually global, so you
           | can't run multiple versions at the same time for different
           | namespaces (mainly because custom resource definitions are
           | global).
           | 
           | Unfortunately many people think packaging an application =
           | creating an operator, and that operator does nothing a chart
           | couldn't do.
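
The "renews them automatically before expiry" behaviour mentioned above comes down to a time comparison that has to be re-evaluated continuously, which is why it fits a controller and not a one-shot deploy. This is a generic sketch with assumed parameter names, not cert-manager's actual code.

```python
from datetime import datetime, timedelta, timezone


def should_renew(not_after: datetime, now: datetime, renew_before: timedelta) -> bool:
    """True once `now` is inside the renewal window before expiry."""
    return now >= not_after - renew_before


# Example: a certificate expiring on 1 June with a 30-day renewal window.
expiry = datetime(2023, 6, 1, tzinfo=timezone.utc)
window = timedelta(days=30)
early = should_renew(expiry, datetime(2023, 5, 1, tzinfo=timezone.utc), window)
due = should_renew(expiry, datetime(2023, 5, 15, tzinfo=timezone.utc), window)
```

Here `early` is `False` (31 days remain) and `due` is `True`.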
        
             | stasmo wrote:
             | The CockroachDB example in the article is a perfect
             | example of an unnecessary CRD. Acquiring certificates
             | within a Kubernetes cluster is a common requirement for
             | lots of applications and there are lots of solutions out
             | there. Is it really necessary to spend time writing your
             | own operator? Now you have a second helm chart and an
             | operator to maintain. Now you have to explain to people
             | which chart to use. You could get rid of the non-operator
             | chart but now I have operators within the cluster acquiring
             | certificates in 5 or 6 different ways. Do I have to
             | configure the credentials for 6 operators so they can make
             | Route53 DNS challenge records?
             | 
             | Edit: maybe we could shift left and ask the app developers
             | to add certificate acquisition directly into the app
             | source.
        
               | outworlder wrote:
               | > Do I have to configure the credentials for 6 operators
               | so they can make Route53 DNS challenge records?
               | 
               | A certificate for service to service communication does
               | not have to correspond to a public endpoint.
        
             | mdaniel wrote:
             | > that operator does nothing a chart couldn't do.
             | 
             | Or it can be _actively harmful_ when they don't do any
             | error checking whatsoever, causing it to be less accurate
             | than `helm template` would be. Relatedly, it's also one more
             | thing to monitor, because it can decide to start vomiting
             | errors for whatever random reason.
        
             | dpkirchner wrote:
             | Neither of those cases really need an operator --
             | Prometheus and cert-manager both have code that watches for
             | changes on ingresses/services/custom resources and reacts
             | to changes (using permissions granted via RBAC). I've used
             | both without an operator and still use Prometheus without
             | one.
        
         | cacois wrote:
         | I've found operator-sdk [1] (which uses kubebuilder under the
         | hood) to be a better starting point for operator development.
         | 
         | [1] https://github.com/operator-framework/operator-sdk
        
           | MuffinFlavored wrote:
           | Can you give me an example use case you've run into where you
           | need to write a custom k8s operator/API?
        
             | [deleted]
        
       | darren0 wrote:
       | I'm not sure why this is a top post. The definitions of
       | controller and operator are completely wrong. The example code is
       | for creating a custom api server which is only done in the most
       | advanced of advanced use cases. The implementation of the
       | apiserver is too naive to demonstrate they have any understanding
       | of the complexity that supporting watch will cause.
        
         | mfer wrote:
         | The article's description of what an operator is, is wrong.
         | The definition of an operator originally was...
         | 
         | > An Operator is an application-specific controller that
         | extends the Kubernetes API to create, configure, and manage
         | instances of complex stateful applications on behalf of a
         | Kubernetes user. It builds upon the basic Kubernetes resource
         | and controller concepts but includes domain or application-
         | specific knowledge to automate common tasks.
         | 
         | This is the original definition of an operator [1]. People now
         | use them for stateless things, and domain-specific work has
         | taken off.
         | 
         | You can look at the Kubernetes docs [2] to see refinements on
         | it...
         | 
         | > Kubernetes' operator pattern concept lets you extend the
         | cluster's behaviour without modifying the code of Kubernetes
         | itself by linking controllers to one or more custom resources.
         | Operators are clients of the Kubernetes API that act as
         | controllers for a Custom Resource.
         | 
         | [1]
         | https://web.archive.org/web/20190113035722/https://coreos.co...
         | 
         | [2] https://kubernetes.io/docs/concepts/extend-kubernetes/operat...
        
           | richardwhiuk wrote:
           | You don't need to implement a custom API server to implement
           | an operator - you can just watch a CR.
        
             | jhoelzel wrote:
             | for an operator you do, what you mean is a controller =)
        
             | [deleted]
        
         | timelapse wrote:
         | > The definitions of controller and operator are completely
         | wrong.
         | 
         | mind clarifying?
        
       | devkulkarni wrote:
       | We have an FAQ about Operators here:
       | https://github.com/cloud-ark/kubeplus/blob/master/Operator-F...
       | 
       | It should be helpful if you are new to the Operator concept.
       | 
       | Operators are generally useful for handling domain-specific
       | actions - for example, performing database backups, installing
       | plugins on Moodle/Wordpress, etc. If you are looking for
       | application deployment then a Helm chart should be sufficient.
        
       | kimbernator wrote:
       | I didn't really enjoy my experience with the few operators I've
       | worked with, mainly because they require the maintainer to build
       | in some sort of access to basic kubernetes functionality. I see
       | the benefit of operators, but I hated that in order to do
       | something as simple as define memory/CPU limits to certain
       | containers I would need to open a PR to the repo and wait weeks,
       | sometimes months, for a new release.
       | 
       | It's frustrating to be a kubernetes admin but not have access to
       | basic configuration options because the maintainers of even some
       | very high-profile operators (looking at you, AWX) neglected to
       | build in access to basic functionality.
        
         | evancordell wrote:
         | This is a common frustration of mine as well!
         | 
         | In the latest release of the spicedb-operator[0], I added a
         | feature that allows users to specify arbitrary patches over
         | operator-managed resources directly in the API (examples in the
         | link).
         | 
         | There are some other projects like Kyverno and Gatekeeper that
         | try to do this generically with mutating webhooks, but
         | embedding a `patches` API into the operator itself gives the
         | operator a chance to ensure the changes are within some
         | reasonable guardrails.
         | 
         | [0]: https://github.com/authzed/spicedb-operator/releases/tag/v1....
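
One way to picture a `patches` API with guardrails: the operator generates a resource, then applies user-supplied patches only if they touch paths it has chosen to allow. The resource shape, the allowed prefixes, and the function names below are simplified assumptions for illustration; they are not the spicedb-operator's actual API or a real Kubernetes object layout.

```python
import copy
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]

# Hypothetical guardrails: users may tune resources and labels only.
ALLOWED_PREFIXES: Tuple[Path, ...] = (
    ("spec", "resources"),
    ("metadata", "labels"),
)


def set_path(obj: Dict[str, Any], path: Path, value: Any) -> None:
    """Set a nested key, creating intermediate dicts as needed."""
    node = obj
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = value


def apply_patches(
    generated: Dict[str, Any], patches: List[Tuple[Path, Any]]
) -> Dict[str, Any]:
    """Return a patched copy; reject patches outside the allowed paths."""
    result = copy.deepcopy(generated)
    for path, value in patches:
        allowed = any(path[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
        if not allowed:
            raise ValueError("patch not allowed at " + "/".join(path))
        set_path(result, path, value)
    return result
```

A patch to `("spec", "resources", "limits", "memory")` goes through, while one to `("spec", "image")` raises `ValueError`. That is the difference from a generic mutating webhook, which has no idea which fields the operator depends on.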
        
           | remram wrote:
           | The SpiceDB operator looks like a prime example of something
           | that should have been a Helm Chart. Migrations can be run in
           | the containers.
           | 
           | Operators are just the non-containerized daemons of the
           | Kubernetes OS. We did all this work to run everything in
           | neatly encapsulated containers, and then everyone wants to
           | run stuff globally on the whole cluster. What's the point? Do
           | we just containerize clusters and start over?
        
             | xyzzy_plugh wrote:
             | I'm not sure what you're on about. Operators don't need to
             | run in cluster at all. And even then, they can absolutely
             | run as containers. And as far as permissions go, that's up
             | to you. They're just regular service accounts.
        
             | evancordell wrote:
             | I get the sentiment. We held off on building an operator
             | until we felt there was actually value in doing so (for the
             | most part, Deployments cover the operational needs pretty
             | well).
             | 
             | Migrations can be run in containers (and they are, even
             | with the operator), but it's actually a lot of work to run
             | them at the right time, only once, with the right flags, in
             | the right order, waiting for SpiceDB to reach a specific
             | spot in phased migrations, etc.
             | 
             | Moving from v1.13.0 to v1.14.0 of SpiceDB requires a multi-
             | phase migration to avoid downtime[0], as could any phased
             | migration for any stateful workload. The operator will walk
             | you through them correctly, without intervention. Users who
             | aren't running on Kubernetes or aren't using the operator
             | often have problems running these steps correctly.
             | 
             | The value is in this automation, but also in the API
             | interface itself. RDS is just some automation and an API on
             | top of EC2, and I think RDS has value over running postgres
             | on EC2 myself directly.
             | 
             | As for helm charts, this is just my opinion, but I don't
             | think they're a good way to distribute software to end
             | users. The interface for a helm chart becomes polluted over
             | time in the same way that most operator APIs become
             | polluted over time, as more and more configuration is
             | pulled up to the top. I think helm is better suited to
             | managing configuration you write yourself to deploy on your
             | own clusters (I realize I'm in the minority here).
             | 
             | [0]:
             | https://github.com/authzed/spicedb/releases/tag/v1.14.0
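
The "right time, only once, in the right order" problem can be sketched as a small state machine that decides the next step from recorded progress. The phase names and the `MigrationState` fields are hypothetical and do not correspond to SpiceDB's actual migration phases.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical ordered phases of a zero-downtime migration.
PHASES: List[str] = ["phase-1", "phase-2", "phase-3"]


@dataclass
class MigrationState:
    completed: List[str] = field(default_factory=list)  # migrations already run
    rolled_out: Set[str] = field(default_factory=set)  # phases the workload serves


def next_step(state: MigrationState) -> str:
    """Decide the single next action; never skips or repeats a phase."""
    for phase in PHASES:
        if phase not in state.completed:
            return "run-migration:" + phase
        if phase not in state.rolled_out:
            # Migration ran, but the workload has not caught up yet.
            return "wait-for-rollout:" + phase
    return "done"
```

Because the decision depends only on recorded state, the operator can crash and restart at any point and still resume at the correct step, which is hard to guarantee with a sequence of manually run jobs.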
        
           | ojhughes wrote:
           | Adding the patch api is neat! I've solved this in the past by
           | embedding the entire PodSpec etc into the CRD
        
             | remram wrote:
             | Did you call your CRD "Deployment"?
        
             | sklarsa wrote:
             | I might have to borrow that! Very clever
        
         | hintymad wrote:
         | > I would need to open a PR to the repo and wait weeks,
         | sometimes months, for a new release.
         | 
         | Just curious, is this a limitation of the Operators framework,
         | or that of your system's implementation? My knee-jerk reaction
         | is that any implementation should absolutely not require
         | opening a ticket. After all, Amazon's API mandate happened 20
         | years ago, and Netflix followed suit to achieve phenomenal
         | productivity for their engineers. I have a hard time imagining
         | why any engineer would think that gatekeeping configuration
         | with a PR is a good idea (a UI with proper automation and an
         | approval process that hides the generated PR for specific use
         | cases is a different matter).
        
           | IceWreck wrote:
           | Not a kubernetes expert, but my understanding is that
           | operators are regular programs that run in a kubernetes
           | container and interact with the kubernetes API to
           | launch/manage other containers and custom kubernetes
           | resources.
           | 
           | An operator (or its custom resource) can be configured by
           | Kubernetes YAML/API, and it's up to the creator of the
           | operator to specify the kind of configuration. If the
           | operator creator did not specify options to set CPU/memory
           | limits on the pods managed by the operator, then you can't do
           | anything. You have to add that feature to the operator, make
           | a pull request, and wait for it to be upstreamed.
           | 
           | Or fork it instead. Same thing for helm charts (except
           | forking and patching them is easier than forking an
           | operator).
        
       | fedreg wrote:
       | Here's another example of a custom rust operator,
       | https://github.com/mach-kernel/databricks-kube-operator
       | 
       | Written by a co-worker to help manage our databricks projects
       | across clusters. Works wonderfully!!
        
         | alexott wrote:
         | But why such complexity? Is it easier to maintain than
         | terraform code?
        
           | EdwardDiego wrote:
           | Yes. Terraform doesn't actively manage resources, operators
           | do.
        
       | jhoelzel wrote:
       | Oh, I love operators. They usually tie the entire cluster
       | together and lead to amazing things! Think of Kubernetes as an
       | advanced API server that can be extended endlessly, and
       | operators are the way to do it.
       | 
       | There really is no magic; it's all there, and with Go the images
       | are usually what? Like 10 MB?
       | 
       | It's essential to have a solid understanding of Kubernetes
       | architecture, concepts such as custom resources and controllers,
       | and the tools and APIs available for working with Operators.
       | 
       | Don't use Rust though; use an SDK like the operator-sdk or
       | kubebuilder. It's native to k8s and you will have a much easier
       | time too.
        
       | Thaxll wrote:
       | Using Rust for that is a bad idea; just use the official and
       | native SDKs (in Go). Rust does not have any equivalent to
       | https://sdk.operatorframework.io/
        
       | jzelinskie wrote:
       | Since Go got generics, working with the Kubernetes API could
       | become far more ergonomic. It's been like pulling teeth until
       | now. I'm eager to see how the upstream APIs change over time.
       | 
       | In the meantime, one of the creators of the Operator
       | Framework[0] built a bunch of useful patterns using generics that
       | we used to build the SpiceDB Operator[1] called controller-
       | idioms[2].
       | 
       | Does anyone know of other efforts to improve the status quo?
       | 
       | [0]: https://operatorframework.io
       | 
       | [1]: https://github.com/authzed/spicedb-operator
       | 
       | [2]: https://github.com/authzed/controller-idioms
        
       | crabbone wrote:
       | I've written (well, participated in development of) two
       | Kubernetes operators, and support about a dozen of them (in our
       | own deployment of Kubernetes): Jupyter, PostgreSQL, a bunch of
       | Prometheus operators and a handful of proprietary ones.
       | 
       | In my years of working with Kubernetes I cannot shake the feeling
       | that it's, basically, an MLM. It carefully obscures its
       | functionality by hiding behind opaque definitions. It doesn't
       | really work, when push comes to shove. And, most importantly, it
       | survives in a parasitic kind of way: by piggybacking on those who
       | develop all kinds of extensions, be it operators, custom
       | networking or storage plugins, authentication and so on.
       | 
       | My problem is I cannot find who stands at the top of the pyramid.
       | There's the Cloud Native Computing Foundation, but all it does is
       | sell certifications nobody really needs... so, that cannot
       | possibly be it. No big name really benefits from this in an
       | obvious way...
       | 
       | So... anyways, when I hear people argue about how to implement
       | this or another extension of Kubernetes, it rings the same as
       | when people argue about styles of agile, code readability, and
       | similar nonsense. There isn't a good way. There are no acceptance
       | criteria. The whole system is flawed to no end.
        
       | _muff1nman_ wrote:
       | This article is mistaken from the get-go as an operator is not
       | the same as an apiservice. Rather an operator is a wider term for
       | something that includes a controller. See
       | https://kubernetes.io/docs/concepts/extend-kubernetes/operat...
       | 
       | Also it's important for people reading this article - an
       | apiservice (which this article talks about) is very rarely
       | something that should be done. An operator is more appropriate
       | for nearly all cases except for when you truly need your state
       | stored outside of the internal Kubernetes etcd datastore.
        
         | reedjosh wrote:
         | Custom Resource + Controller = Operator. Good call!
         | 
         | > Operators are clients of the Kubernetes API that act as
         | controllers for a Custom Resource.
        
           | jhoelzel wrote:
           | exactly! controlling refers to directing or regulating the
           | behavior of something, while operating refers to the actual
           | execution or manipulation.
        
         | tenac23 wrote:
         | After reading the comments we updated the article
        
       | rdtsc wrote:
       | You have a problem: orchestrating some thing in kube, so you
       | write some custom operator logic running alongside your main
       | product; but now you have two problems to worry about.
       | 
       | I've seen just as many issues, if not more, with debugging the
       | operator logic itself as with the main pods/deployments it was
       | trying to manage.
       | 
       | So just from a practical point of view, I think it should be a
       | last resort after everything else fails (helm charts, etc).
        
       ___________________________________________________________________
       (page generated 2023-03-09 23:01 UTC)