[HN Gopher] Kubernetes' dirty endpoint secret and Ingress
       ___________________________________________________________________
        
       Kubernetes' dirty endpoint secret and Ingress
        
       Author : richardfey
       Score  : 72 points
       Date   : 2022-03-22 20:34 UTC (2 hours ago)
        
 (HTM) web link (philpearl.github.io)
 (TXT) w3m dump (philpearl.github.io)
        
       | motoboi wrote:
        | There is a bug in GKE causing 502s when using CNI networking.
        | 
        | The bug is triggered by port names in ingresses. Use port numbers
        | and you should be good to go.
        
       | TurningCanadian wrote:
       | Check out
       | 
       | https://kubernetes.github.io/ingress-nginx/user-guide/nginx-...
       | 
        | if you're running nginx. Consider setting it to true instead of
        | the default (false).
       | 
       | ---
       | 
       | By default the NGINX ingress controller uses a list of all
       | endpoints (Pod IP/port) in the NGINX upstream configuration.
       | 
       | The nginx.ingress.kubernetes.io/service-upstream annotation
       | disables that behavior and instead uses a single upstream in
       | NGINX, the service's Cluster IP and port.
       | 
        | This can be desirable for things like zero-downtime deployments.
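        | 
        | For reference, a minimal sketch of what enabling it looks like,
        | expressed with the k8s.io/api Go types (the ingress name here is
        | a placeholder, not from the docs):
        | 
        |   package example
        |   
        |   import (
        |       netv1 "k8s.io/api/networking/v1"
        |       metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        |   )
        |   
        |   // Hypothetical Ingress: with service-upstream enabled, nginx
        |   // proxies to the Service ClusterIP and port instead of building
        |   // an upstream list from individual pod Endpoints.
        |   var ingress = netv1.Ingress{
        |       ObjectMeta: metav1.ObjectMeta{
        |           Name: "my-app", // placeholder name
        |           Annotations: map[string]string{
        |               "nginx.ingress.kubernetes.io/service-upstream": "true",
        |           },
        |       },
        |   }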
        
         | nhoughto wrote:
          | ah good tip, I don't care about session affinity or custom
          | balancing algos so that works. I'd imagine that running in GKE
          | or AWS you would also avoid the DNAT / conntrack overhead, as
          | pods by default use a routable VPC IP instead of a magic CNI
          | IP. Would have to test that though.
         | 
         | Quote from related issue:
         | 
          | The NGINX ingress controller does not use Services to route
         | traffic to the pods. Instead it uses the Endpoints API in order
         | to bypass kube-proxy to allow NGINX features like session
         | affinity and custom load balancing algorithms. It also removes
         | some overhead, such as conntrack entries for iptables DNAT.
        
       | AaronBBrown wrote:
       | This is a design flaw in Kubernetes. The article doesn't really
       | explain what's happening though. The real problem is that there
       | is no synchronization between the ingress controller (which
       | manages the ingress software configuration, e.g. nginx from the
       | Endpoints resources), kube-proxy (which manages iptables rules
       | from the Endpoints resource), and kubelet (which sends the
        | signals to the container). A preStop hook w/ a sleep equivalent
        | to an acceptable timeout will handle the 99%+ cases (and the
        | cases it doesn't will have exceeded your timeout anyhow). Things
        | become more complicated when there are sidecar containers (say an
        | envoy or nginx routing to another container in the same pod), and
        | that often requires shenanigans such as shared emptyDir{} volumes
        | that are watched (with fsnotify or similar) for socket files to
        | be closed to ensure requests are fully completed.
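        | 
        | For illustration, a rough sketch of the preStop-sleep pattern
        | using the k8s.io/api Go types (assuming a recent version where
        | the hook type is LifecycleHandler; names, image, and durations
        | are placeholders):
        | 
        |   package example
        |   
        |   import corev1 "k8s.io/api/core/v1"
        |   
        |   func int64Ptr(i int64) *int64 { return &i }
        |   
        |   // Hypothetical pod spec: sleep in preStop so the pod keeps
        |   // serving while Endpoints, iptables, and ingress configs catch
        |   // up, then give SIGTERM handling a grace period to drain.
        |   var podSpec = corev1.PodSpec{
        |       TerminationGracePeriodSeconds: int64Ptr(60),
        |       Containers: []corev1.Container{{
        |           Name:  "app",                 // placeholder
        |           Image: "example.com/app:1.0", // placeholder
        |           Lifecycle: &corev1.Lifecycle{
        |               PreStop: &corev1.LifecycleHandler{
        |                   Exec: &corev1.ExecAction{
        |                       Command: []string{"sleep", "15"},
        |                   },
        |               },
        |           },
        |       }},
        |   }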
        
         | kodah wrote:
          | I mean, technically, you can recreate this scenario on a single
          | host as well. Send a SIGTERM to an application and try to swap
          | in another instance of it.
         | 
         | System fundamentals are at the heart of that problem: SIGTERM
         | is just what it is, it's a signal and an application can choose
         | to acknowledge it and do something or catch it and ignore it.
         | The system also has no way of knowing what the application
         | chose to do.
         | 
          | All that to say, I'm not sure it's so much a _flaw_ in
          | Kubernetes as it's just the way systems work, and Kubernetes
          | is reflecting that.
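          | 
          | To illustrate the application side: catch SIGTERM, keep serving
          | for a bit (since requests can still arrive after the signal),
          | then drain. A rough Go sketch, with illustrative durations:
          | 
          |   package main
          |   
          |   import (
          |       "context"
          |       "log"
          |       "net/http"
          |       "os"
          |       "os/signal"
          |       "syscall"
          |       "time"
          |   )
          |   
          |   func main() {
          |       srv := &http.Server{Addr: ":8080"} // placeholder server
          |   
          |       stop := make(chan os.Signal, 1)
          |       signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
          |   
          |       go func() {
          |           if err := srv.ListenAndServe(); err != http.ErrServerClosed {
          |               log.Fatal(err)
          |           }
          |       }()
          |   
          |       <-stop // the kubelet sends SIGTERM on pod deletion
          |   
          |       // Keep accepting traffic while endpoint/LB updates race
          |       // with the signal, then drain in-flight requests.
          |       time.Sleep(10 * time.Second)
          |       ctx, cancel := context.WithTimeout(context.Background(),
          |           30*time.Second)
          |       defer cancel()
          |       if err := srv.Shutdown(ctx); err != nil {
          |           log.Printf("shutdown: %v", err)
          |       }
          |   }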
        
           | lolc wrote:
           | In my view it is a clear flaw that the signal to terminate
           | can arrive while the server is still getting new requests.
           | Being able to steer traffic based on your knowledge of the
           | state of the system is one of the reasons why you'd want to
           | set up an integrated environment where the load-balancer and
           | servers are controlled from the same process.
           | 
            | The time to send the signal is entirely under the control of
            | the managing process. It could synchronize with the load-
            | balancer before sending pods the term signal, and I'm unclear
            | why this isn't done.
        
       | spullara wrote:
        | Tomcat had similar behavior when I was using it, except it would
        | bind the listener before it was ready to serve traffic, with
        | similar results.
        
       | zeckalpha wrote:
       | (2019)
        
         | richardfey wrote:
          | It is still 100% applicable (AFAIK) and informative; with the
          | (2019) in the title, readers will think it's not relevant
          | anymore?
        
       | gscho wrote:
        | Wait until you find out that Kubernetes secrets aren't actually
        | secrets but base64-encoded strings.
        
         | twalla wrote:
          | Encryption at rest for secrets can be enabled; the base64 thing
          | is more of an artifact of how JSON serialization works with
          | byte arrays.
        
         | notwedtm wrote:
          | I think K8S secrets get a bad rap. They are not intended to be
          | secret in the sense that they are "kept from prying eyes by
          | default". The Secret object is simply a first-class citizen,
          | differentiated from a ConfigMap in a way that allows distinct
          | ACLs.
          | 
          | Most organizations I know will still use something like
          | ExternalSecret for source control and then populate the Secret
          | with the values once in the cluster, into an object with very
          | few access points.
        
           | gscho wrote:
            | I think calling it a secret when it isn't gave it a bad rap.
           | The last time I looked at the documentation it didn't even
           | clearly describe that it is not a secure object (that may
           | have changed recently). Why call it a secret when it is not
           | even close to one? I guess thing-to-store-secrets-if-you-use-
           | rbac was too long.
        
         | dharmab wrote:
         | They're not necessarily strings. You can put binary data in the
         | data field, which is why it is base64.
         | 
          | You can also configure the apiserver/etcd to encrypt specific
          | keyspaces, such as the secrets/ keyspace.
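          | 
          | A tiny sketch of the base64 point: in the Go client types the
          | data field is raw bytes, and base64 only shows up when the
          | object is serialized for the API server (values below are made
          | up):
          | 
          |   package main
          |   
          |   import (
          |       "encoding/json"
          |       "fmt"
          |   
          |       corev1 "k8s.io/api/core/v1"
          |       metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          |   )
          |   
          |   func main() {
          |       s := corev1.Secret{
          |           ObjectMeta: metav1.ObjectMeta{Name: "demo"}, // placeholder
          |           Data:       map[string][]byte{"blob": {0x00, 0xff, 0x10}},
          |       }
          |       out, _ := json.Marshal(s.Data)
          |       fmt.Println(string(out)) // {"blob":"AP8Q"} - base64, not encryption
          |   }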
        
         | zaat wrote:
          | It is in your hands (the version where it became available has
          | been end-of-life for more than a year, basically forever in
          | Kubernetes terms); maybe they will change the default too. At
          | least there's a fine bold warning box in the docs.
         | 
         | https://kubernetes.io/docs/tasks/administer-cluster/encrypt-...
        
       | nhoughto wrote:
        | Is this true of the OOTB GKE nginx ingress? Hard to tell: by
        | 'load balancer' do they mean the nginx ingress reverse proxy?
       | 
       | I can imagine the delay between updating the GCP global load
       | balancer service from GKE would be much higher than nginx-ingress
       | reacting to changes in pod health/endpoints.
       | 
        | Either way, I guess the takeaway is that there is a race between
        | endpoints being updated and those updates propagating; it seems
        | like that isn't handled as perfectly as people assume, and this
        | likely gets worse with node contention and Kube API performance
        | problems.
        
         | cyberpunk wrote:
          | By load balancer they mean the internal Kubernetes "Service"
          | object that a given ingress uses as its backing service.
        
       | mad_vill wrote:
        | The issues I see with Kubernetes ingress are more related to an
        | ingress pod going down than to the upstream.
        
         | cyberpunk wrote:
         | What controller are you using? I've absolutely smashed nginx
         | and the aws elb controllers and never seen them flinch...
        
       | blaisio wrote:
        | Yes! I think this is a really under-reported issue. It's
        | basically caused by Kubernetes doing things without confirming
        | everyone responded to prior status updates. It affects every
        | ingress controller, and it also affects services of type
        | LoadBalancer, and there isn't a real fix. Even if you add a
        | timeout in the preStop hook, that still might not handle it 100%
        | of the time. IMO it is a design flaw in Kubernetes.
        
         | [deleted]
        
         | LimaBearz wrote:
          | Not defending the situation, but with a preStop hook, at least
          | in the case of APIs, k8s can handle it 100%; it's just messy.
         | 
         | We have a preStop hook of 62s. 60s timeouts are set in our
         | apps, 61s is set on the ALBs (ensuring the ALB is never the
         | cause of the hangup), and 62s on the preStop to make sure
         | nothing has come into the container in the last 62s.
         | 
          | Then we set a terminationGracePeriodSeconds of 60 just to make
          | sure it doesn't pop off too fast. This gives us 120s where
          | nothing happens and anything in flight can get to where it's
          | going.
        
       | rifelpet wrote:
       | There's a KubeCon North America talk (also from 2019) that goes
       | into more detail on this very issue including some additional
       | recommendations
       | 
       | https://youtu.be/0o5C12kzEDI?t=57
        
       ___________________________________________________________________
       (page generated 2022-03-22 23:00 UTC)