I am trying to create a Google Cloud Composer 2 environment in my project, but it fails to become healthy.
I am creating it with the default settings and a service account that has the Cloud Composer v2 API Service Agent Extension, Composer Worker and Editor roles.
The environment starts and creates some of the pods, but ultimately fails to become healthy with this error: Some of the GKE pods failed to become healthy. Please check the GKE logs for details, and retry the operation.
In the logs there seems to be an issue with Kubernetes:
Traceback (most recent call last):
  File "/opt/python3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/python3.11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/python3.11/lib/python3.11/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/airflow/src/composer_monitoring/lib/event_scheduler.py", line 36, in repeat
    action(*action_args)
  File "/home/airflow/src/composer_monitoring/lib/composer_metric.py", line 140, in update_recent_metric_values
    values = self.calculate_metric_values()
kubernetes.client.exceptions.ApiException: (404)
This all happens during environment creation; I have not uploaded any DAGs yet. Other potentially relevant entries logged during the creation process:
[conn-id:69bde5bac9fb7084 rpc-id:72f13b85e6957114 remote-addr:10.60.0.204:57830 pod:composer-system/airflow-monitoring-7b76fb5846-q7gnw] "/computeMetadata/v1/universe/universe_domain" HTTP/404: generic::not_found: no child "universe_domain", Reason: "NOT_FOUND", UserMessage: "Not Found", started at 2024-05-16 10:53:13.799086669 +0000 UTC m=+1558.116947164
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"horizontalpodautoscalers.autoscaling \"airflow-worker-hpa\" not found","reason":"NotFound","details":{"name":"airflow-worker-hpa","group":"autoscaling","kind":"horizontalpodautoscalers"},"code":404}
The resource 'projects/xxx/global/instanceTemplates/gk3-europe-west3-xxx-scheduli-pool-2-1c38f934' was not found
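To make sure I am reading the Kubernetes 404 correctly, here is a minimal sketch that decodes the HTTP response body from the log entry above. The helper name `summarize_status` is my own and is not part of any Composer or Kubernetes library; it only uses Python's standard `json` module.

```python
import json

# Response body copied verbatim from the Kubernetes 404 log entry above.
# A raw string keeps the escaped quotes (\") intact for the JSON parser.
BODY = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"horizontalpodautoscalers.autoscaling \"airflow-worker-hpa\" not found","reason":"NotFound","details":{"name":"airflow-worker-hpa","group":"autoscaling","kind":"horizontalpodautoscalers"},"code":404}'


def summarize_status(body: str) -> dict:
    """Hypothetical helper: pull out the fields that identify the missing object."""
    status = json.loads(body)
    details = status.get("details", {})
    return {
        "code": status.get("code"),
        "reason": status.get("reason"),
        "kind": details.get("kind"),
        "group": details.get("group"),
        "name": details.get("name"),
    }


summary = summarize_status(BODY)
print(summary)
```

So the 404 in the traceback refers to the `airflow-worker-hpa` HorizontalPodAutoscaler object in the `autoscaling` API group, which the monitoring pod cannot find.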
The project is fairly new, with very few custom settings. I have added a firewall rule that allows all egress traffic. I also granted all the service accounts that Cloud Composer 2 created the permissions specified in the documentation.
I have repeated the process a few times, each time making the service account permissions and the firewall rules more permissive, but without success.