[HN Gopher] Kubernetes' dirty endpoint secret and Ingress
___________________________________________________________________
 
  Kubernetes' dirty endpoint secret and Ingress
 
  Author : richardfey
  Score  : 72 points
  Date   : 2022-03-22 20:34 UTC (2 hours ago)
 
  (HTM) web link (philpearl.github.io)
  (TXT) w3m dump (philpearl.github.io)
 
  | motoboi wrote:
  | There is a bug in GKE causing 502 when using CNI network.
  |
  | The bug is triggered by port names in ingresses. Use port
  | numbers and you should be good to go.
 
  | TurningCanadian wrote:
  | Check out
  |
  | https://kubernetes.github.io/ingress-nginx/user-guide/nginx-...
  |
  | if you're running nginx. Consider setting it to true instead of
  | the default (false).
  |
  | ---
  |
  | By default the NGINX ingress controller uses a list of all
  | endpoints (Pod IP/port) in the NGINX upstream configuration.
  |
  | The nginx.ingress.kubernetes.io/service-upstream annotation
  | disables that behavior and instead uses a single upstream in
  | NGINX, the service's Cluster IP and port.
  |
  | This can be desirable for things like zero-downtime deployments.
 
  | nhoughto wrote:
  | ah good tip, I don't care about session affinity or custom
  | balancing algos so that works. I'd imagine running in GKE or
  | AWS you would also avoid the DNAT / conntrack overhead, as pods
  | by default use a routable VPC IP instead of a magic CNI IP.
  | Would have to test that though.
  |
  | Quote from a related issue:
  |
  | The NGINX ingress controller does not use Services to route
  | traffic to the pods. Instead it uses the Endpoints API in order
  | to bypass kube-proxy to allow NGINX features like session
  | affinity and custom load balancing algorithms. It also removes
  | some overhead, such as conntrack entries for iptables DNAT.
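 
  A minimal sketch of the service-upstream annotation TurningCanadian
  describes, applied to a hypothetical Ingress (the resource name,
  host, ingress class and backend port below are placeholders):
 
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      metadata:
        name: my-app
        annotations:
          # Proxy to the Service's ClusterIP rather than to individual
          # pod endpoints; endpoint churn is then handled by kube-proxy,
          # at the cost of session affinity and custom load balancing.
          nginx.ingress.kubernetes.io/service-upstream: "true"
      spec:
        ingressClassName: nginx
        rules:
        - host: my-app.example.com
          http:
            paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: my-app
                  port:
                    number: 80
 
  With the annotation set, nginx keeps a single upstream (the Service's
  ClusterIP), so reacting to endpoint changes becomes kube-proxy's job
  rather than nginx's, per the docs quoted above.
 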
  | AaronBBrown wrote:
  | This is a design flaw in Kubernetes. The article doesn't really
  | explain what's happening though. The real problem is that there
  | is no synchronization between the ingress controller (which
  | manages the ingress software configuration, e.g. nginx, from the
  | Endpoints resources), kube-proxy (which manages iptables rules
  | from the Endpoints resource), and kubelet (which sends the
  | signals to the container). A preStop hook w/ a sleep equivalent
  | to an acceptable timeout will handle the 99%+ cases (and the
  | cases it doesn't will have exceeded your timeout anyhow). Things
  | become more complicated when there are sidecar containers (say an
  | envoy or nginx routing to another container in the same pod), and
  | that often requires shenanigans such as shared emptyDir{} volumes
  | that wait (with fsnotify or similar) for socket files to be
  | closed to ensure requests are fully completed.
 
  | kodah wrote:
  | I mean, technically, you can recreate this scenario on a single
  | host as well. Send a SIGTERM to an application and try to swap
  | in another instance of it.
  |
  | System fundamentals are at the heart of that problem: SIGTERM
  | is just what it is, it's a signal, and an application can choose
  | to acknowledge it and do something, or catch it and ignore it.
  | The system also has no way of knowing what the application
  | chose to do.
  |
  | All that to say, I'm not sure it's as much of a _flaw_ in
  | Kubernetes as much as it's the way systems work and Kubernetes
  | is reflecting that.
 
  | lolc wrote:
  | In my view it is a clear flaw that the signal to terminate
  | can arrive while the server is still getting new requests.
  | Being able to steer traffic based on your knowledge of the
  | state of the system is one of the reasons why you'd want to
  | set up an integrated environment where the load-balancer and
  | servers are controlled from the same process.
  |
  | The time to send the signal is entirely under the control of the
  | managing process. It could synchronize with the load-balancer
  | before sending pods the term signal, and I'm unclear why this
  | isn't done.
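 
  A minimal sketch of the preStop-sleep mitigation AaronBBrown
  describes, assuming a plain pod spec (the pod name, image and the
  30/60-second figures are placeholders; pick a sleep that matches
  whatever drain time is acceptable for your traffic):
 
      apiVersion: v1
      kind: Pod
      metadata:
        name: my-app
      spec:
        # Must exceed the preStop sleep plus in-flight request drain
        # time, or the kubelet kills the container before it finishes.
        terminationGracePeriodSeconds: 60
        containers:
        - name: app
          image: example/app:latest   # the image needs a `sleep` binary
          lifecycle:
            preStop:
              exec:
                # Keep serving while the endpoint removal propagates to
                # kube-proxy and the ingress controller; SIGTERM is only
                # sent after this hook returns.
                command: ["sleep", "30"]
 
  The pod is removed from Endpoints in parallel with the kubelet
  running the preStop hook, so the sleep gives kube-proxy and the
  ingress controller a window to stop routing new requests before the
  application ever sees SIGTERM.
 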
  | spullara wrote:
  | Tomcat had similar behavior when I was using it, except it would
  | bind the listener before it was ready to serve traffic, with
  | similar results.
 
  | zeckalpha wrote:
  | (2019)
 
  | richardfey wrote:
  | It is still 100% applicable (AFAIK) and informative; with the
  | (2019) in the title, readers will think it's not relevant anymore?
 
  | gscho wrote:
  | Wait until you find out that kubernetes secrets aren't actually
  | secrets but base64 encoded strings.
 
  | twalla wrote:
  | Encryption at rest for secrets can be enabled; the base64 thing
  | is more of an artifact of how JSON serialization works with
  | byte arrays.
 
  | notwedtm wrote:
  | I think K8S secrets get a bad rap. They are not intended to be
  | secret in the sense that they are "kept from prying eyes by
  | default". The secret object is simply a first-class citizen
  | that differentiates it from a ConfigMap in a way that allows
  | distinct ACLs.
  |
  | Most organizations I know will still use something like
  | ExternalSecret for source control and then populate the Secret
  | with the values once in cluster and to an object with very few
  | access points.
 
  | gscho wrote:
  | I think calling it a secret when it isn't gave it a bad rap.
  | The last time I looked at the documentation it didn't even
  | clearly describe that it is not a secure object (that may
  | have changed recently). Why call it a secret when it is not
  | even close to one? I guess thing-to-store-secrets-if-you-use-rbac
  | was too long.
 
  | dharmab wrote:
  | They're not necessarily strings. You can put binary data in the
  | data field, which is why it is base64.
  |
  | You can also configure the apiserver/etcd to encrypt specific
  | keyspaces, such as the secrets/ keyspace.
 
  | zaat wrote:
  | It is in your hands (the version where it became available has
  | been end-of-life for more than a year, basically forever in
  | Kubernetes life); maybe they will change the default too. At
  | least there's a fine bold warning box in the docs.
  |
  | https://kubernetes.io/docs/tasks/administer-cluster/encrypt-...
 
  | nhoughto wrote:
  | Is this true of the OOTB GKE nginx ingress? Hard to tell; by
  | 'load balancer' do they mean the nginx ingress reverse proxy?
  |
  | I can imagine the delay between updating the GCP global load
  | balancer service from GKE would be much higher than nginx-ingress
  | reacting to changes in pod health/endpoints.
  |
  | Either way I guess the takeaway is there is a race there between
  | endpoints being updated and those updates propagating, and it
  | seems like that isn't handled as perfectly as people assume, and
  | this likely gets worse with node contention and Kube API
  | performance problems.
 
  | cyberpunk wrote:
  | By load balancer they mean the internal kubernetes "service"
  | object that a given ingress uses as its backing service.
 
  | mad_vill wrote:
  | The issues I see with kubernetes ingress are more related to an
  | ingress pod going down than the upstream.
 
  | cyberpunk wrote:
  | What controller are you using? I've absolutely smashed nginx
  | and the aws elb controllers and never seen them flinch...
 
  | blaisio wrote:
  | Yes! I think this is a really under-reported issue. It's
  | basically caused by kubernetes doing things without confirming
  | everyone responded to prior status updates. It affects every
  | ingress controller, and it also affects services of type
  | "LoadBalancer", and there isn't a real fix. Even if you add a
  | timeout in the preStop hook, that still might not handle it 100%
  | of the time. IMO it is a design flaw in Kubernetes.
 
  | [deleted]
 
  | LimaBearz wrote:
  | Not defending the situation, but with a preStop hook, at least
  | in the case of APIs, k8s can handle it 100%; it's just messy.
  |
  | We have a preStop hook of 62s. 60s timeouts are set in our
  | apps, 61s is set on the ALBs (ensuring the ALB is never the
  | cause of the hangup), and 62s on the preStop to make sure
  | nothing has come into the container in the last 62s.
  |
  | Then we set a terminationGracePeriodSeconds of 60 just to make
  | sure it doesn't pop off too fast. This gives us 120s where
  | nothing happens and anything in flight can get to where it's
  | going.
 
  | rifelpet wrote:
  | There's a KubeCon North America talk (also from 2019) that goes
  | into more detail on this very issue, including some additional
  | recommendations:
  |
  | https://youtu.be/0o5C12kzEDI?t=57
___________________________________________________________________
(page generated 2022-03-22 23:00 UTC)