Best Approach for Managing ECS Clusters and IP Address Allocation for Batch Job Workflows in a VPC with Step Functions

Question

We orchestrate batch jobs as workflows using AWS Step Functions, with ECS clusters on Fargate for container execution. The containers are deployed in a VPC with 3 public subnets. Other services and EC2 instances also run inside the same VPC and use these public subnets.

We may have multiple VPCs, each with its own ECS cluster and 3 subnets, hosting both workflow containers and other services/EC2 instances.

Each ECS cluster has a limit of 5,000 containers, allowing us to theoretically run workflows with up to 5,000 containers in parallel. This would also require 5,000 IP addresses per workflow. Each workflow takes approximately 20–30 minutes to complete. While the cluster can support 5,000 containers, in practice, we would typically run at most a few hundred containers per workflow.

Plan: I am considering creating an ECS cluster, task definitions, and a state machine for each workflow execution, and then terminating them upon completion. This would allow for better infrastructure management and make it easier to update workflows (e.g., updating container images or changes to the state machine). The running workflows would remain unaffected by updates because they would use older state machines and task definitions that remain intact while new ones are being created for future executions.
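To make the plan concrete, here is a minimal sketch of how a per-release state machine definition could pin a specific task definition revision, so that in-flight executions keep using the revision they started with. The function name `build_definition`, the state name `RunBatchTask`, and every ARN and subnet ID below are hypothetical placeholders, not values from our environment.

```python
import json


def build_definition(cluster_arn, task_definition_arn, subnet_ids):
    """Build a single-state Amazon States Language definition that runs one
    Fargate task and waits for it to finish (the .sync integration)."""
    return {
        "Comment": "Batch workflow pinned to one task definition revision",
        "StartAt": "RunBatchTask",
        "States": {
            "RunBatchTask": {
                "Type": "Task",
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    # Pinning the revision (":7") rather than the family name
                    # keeps this state machine on the image it was built for.
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnet_ids,
                            "AssignPublicIp": "ENABLED",
                        }
                    },
                },
                "End": True,
            }
        },
    }


# Hypothetical identifiers, for illustration only.
definition = build_definition(
    cluster_arn="arn:aws:ecs:us-east-1:123456789012:cluster/workflows",
    task_definition_arn="arn:aws:ecs:us-east-1:123456789012:task-definition/batch-job:7",
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
)
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what would be passed as the definition when creating a new state machine for each release. Whether the cluster also needs to be recreated each time is part of what I am asking below.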

The following is the design I have planned.

I have two main questions regarding the architecture and setup:

1. ECS Cluster Setup: Would it be better to create a separate ECS cluster for each workflow and terminate it after the workflow completes, or should I have one ECS cluster per VPC that handles all workflows? My preference leans towards the first approach (separate ECS cluster per workflow) because workflows can change frequently, and this approach would make updates easier (e.g., pushing new container images or modifying the state machine). Running workflows wouldn't be affected by updates, as they would continue using older task definitions and state machines. The new workflow executions would use newly created task definitions and ECS clusters. Are there any potential scalability, resource management, or cost concerns with this approach, especially if the number of workflows increases over time?

2. IP Address Allocation for ECS Containers: How can I ensure that 5,000 public IP addresses are available for the workflow containers in a VPC, dedicated only to these workflows? I also want other ECS clusters within the VPC, as well as those in other VPCs, to be able to run their own containers with their own pool of 5,000 IP addresses without interference. Are there strategies or best practices for managing IP address allocation that avoid conflicts between ECS clusters (within the same VPC or across multiple VPCs) and ensure that each cluster has its own dedicated IP pool for containers?
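For context on the numbers, here is a rough sizing sketch of the private address space behind this question. It assumes that each Fargate task takes one private IP address from its subnet through its own network interface and that tasks spread evenly across subnets. AWS reserves 5 addresses in every subnet. The CIDR blocks and helper names are hypothetical.

```python
import ipaddress
import math

AWS_RESERVED_PER_SUBNET = 5  # addresses AWS reserves in every subnet


def usable_ips(cidr):
    """Number of addresses in a subnet that tasks or instances can use."""
    net = ipaddress.ip_network(cidr)
    return max(net.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def smallest_prefix_for(tasks, subnets):
    """Longest prefix (smallest subnet) that fits an even share of the tasks.
    Assumes tasks are spread evenly; VPC subnets range from /28 to /16."""
    per_subnet = math.ceil(tasks / subnets)
    for prefix in range(28, 15, -1):
        if 2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET >= per_subnet:
            return prefix
    raise ValueError("task count does not fit in the given number of /16 subnets")


# Hypothetical dedicated workflow subnets, one per Availability Zone.
workflow_subnets = ["10.0.0.0/21", "10.0.8.0/21", "10.0.16.0/21"]

needed_prefix = smallest_prefix_for(tasks=5000, subnets=3)
total_capacity = sum(usable_ips(c) for c in workflow_subnets)
```

Under these assumptions, 5,000 tasks across 3 subnets need at least a /21 per subnet, and three /21 subnets provide 6,129 usable addresses. Any other services or EC2 instances sharing those subnets would draw from the same pool, which is why I am asking about dedicating address space to the workflows.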
