
We deploy our microservices to two distinct GKE clusters, one for testing and the other for production.

Our workloads use workload identity. In the test environment everything works well: all workloads share the same Kubernetes service account, which has been bound to a GCP service account.
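
"Bound" here means the usual workload identity setup, roughly like the following, where KSA_NAME, NAMESPACE, GSA_NAME and PROJECT_ID are placeholders for our actual names:

# Point the Kubernetes service account at the GCP service account it should impersonate
kubectl annotate serviceaccount KSA_NAME --namespace NAMESPACE \
    iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

# Allow that Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding \
    GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"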

In "production environment" the cluster is backed by three node pools (I include this info for completeness but I'm not sure it is important) and we have problems with workload identity.

In the production environment, in some containers, if we query the metadata server from a shell or run gcloud, we unexpectedly find that the current identity is the service account associated with the node, not the one from workload identity. In other pods workload identity works as expected.
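
For example, from inside an affected container we check the active identity with something like this (the metadata path and the gcloud command are the standard ones; exact usage may vary):

# Ask the metadata server which service account this pod is running as
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"

# Same check via gcloud, when it is available in the image
gcloud auth list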

Another potentially interesting detail is that only the pods that have been added recently through new Deployments seem to be affected by this "misconfiguration".

I'm at a loss as to how to investigate this issue. Do you have any ideas?

Thanks in advance.


1 Answer


As usual, explaining a problem is the first step toward finding the solution on your own.

It turns out that our production environment was backed by three node pools. In one of those node pools, workload identity was not enabled.

So all the pods that were scheduled onto nodes from that pool were not acquiring the right credentials and fell back to the node's service account.
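
One way to see which pods landed on the misconfigured pool (the label keys below are the standard GKE ones):

# Show which node each pod was scheduled on
kubectl get pods -o wide

# Show each node's pool and whether the GKE metadata server is enabled on it
kubectl get nodes -L cloud.google.com/gke-nodepool -L iam.gke.io/gke-metadata-server-enabled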

The solution was twofold:

  1. add a node selector to the workloads that need workload identity, as specified in this GCP document:
# Ensure we run on nodes that support workload identity
nodeSelector:
  iam.gke.io/gke-metadata-server-enabled: 'true'
  2. enable workload identity on all node pools (a gcloud sketch follows this list).
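
A sketch of step 2 for an existing node pool; CLUSTER_NAME, POOL_NAME and ZONE are placeholders for our actual values, and workload identity must already be enabled at the cluster level:

# Enable the GKE metadata server (and thus workload identity) on an existing node pool
gcloud container node-pools update POOL_NAME \
    --cluster=CLUSTER_NAME \
    --zone=ZONE \
    --workload-metadata=GKE_METADATA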