I am trying to create a Google Cloud Composer 2 environment in my project, but it fails to become healthy.

I am creating it with the default settings and a service account that has the Cloud Composer v2 API Service Agent Extension, Composer Worker, and Editor roles. The environment starts and creates some of the pods, but ultimately fails to become healthy with this error: "Some of the GKE pods failed to become healthy. Please check the GKE logs for details, and retry the operation." In the logs there seems to be an issue with Kubernetes:

Traceback (most recent call last):
  File "/opt/python3.11/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/python3.11/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/python3.11/lib/python3.11/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/airflow/src/composer_monitoring/lib/event_scheduler.py", line 36, in repeat
    action(*action_args)
  File "/home/airflow/src/composer_monitoring/lib/composer_metric.py", line 140, in update_recent_metric_values
    values = self.calculate_metric_values()

kubernetes.client.exceptions.ApiException: (404)

This is still part of the environment creation; I have not uploaded any DAGs yet. Other potentially relevant messages that were logged during the creation process:

[conn-id:69bde5bac9fb7084 rpc-id:72f13b85e6957114 remote-addr:10.60.0.204:57830 pod:composer-system/airflow-monitoring-7b76fb5846-q7gnw] "/computeMetadata/v1/universe/universe_domain" HTTP/404: generic::not_found: no child "universe_domain", Reason: "NOT_FOUND", UserMessage: "Not Found", started at 2024-05-16 10:53:13.799086669 +0000 UTC m=+1558.116947164
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"horizontalpodautoscalers.autoscaling \"airflow-worker-hpa\" not found","reason":"NotFound","details":{"name":"airflow-worker-hpa","group":"autoscaling","kind":"horizontalpodautoscalers"},"code":404}
The resource 'projects/xxx/global/instanceTemplates/gk3-europe-west3-xxx-scheduli-pool-2-1c38f934' was not found
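
To make the 404 easier to read, here is a minimal sketch that parses the Kubernetes Status body from the log line above and reports which resource the API server could not find. The helper name describe_missing_resource is only illustrative and is not part of Composer or the Kubernetes client; the body string is copied verbatim from the log.

```python
import json

# Response body copied verbatim from the log line above (raw string keeps the \" escapes).
BODY = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"horizontalpodautoscalers.autoscaling \"airflow-worker-hpa\" not found","reason":"NotFound","details":{"name":"airflow-worker-hpa","group":"autoscaling","kind":"horizontalpodautoscalers"},"code":404}'


def describe_missing_resource(body: str) -> str:
    """Summarise a Kubernetes Status object as '<code> <reason>: <kind>.<group>/<name>'.

    Hypothetical helper for illustration only.
    """
    status = json.loads(body)
    details = status.get("details", {})
    kind = details.get("kind", "unknown")
    group = details.get("group", "")
    name = details.get("name", "unknown")
    # Core-group resources have no group, so only append it when present.
    qualified = f"{kind}.{group}" if group else kind
    return f"{status.get('code')} {status.get('reason')}: {qualified}/{name}"


print(describe_missing_resource(BODY))
```

So the monitoring pod is asking for the airflow-worker-hpa HorizontalPodAutoscaler, and the API server reports that it does not exist.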

The project is fairly new with very few custom settings. I have added a firewall rule to allow all egress traffic. I also gave all the service accounts that Cloud Composer 2 created the permissions specified in the documentation.

I have tried the process a few times, each time making the service account permissions and the firewall rules more permissive, but without success.

1 Answer

I have the same issue.

I also saw a quota-related error in the logs:

"value = workers_quota_helper.get_workers_hpa_spec..."

We deleted our old Composer 2 environment in order to recreate it with more memory, and now a new environment can no longer be created.