
I have a system that needs to deploy hundreds of thousands of short-lived jobs per day. Each job runs anywhere from a few seconds to a couple of hours. Each job makes HTTP requests to external web servers, writes data to disk (anywhere from a few megabytes to hundreds of gigabytes), and makes a series of connections to databases.

Every job is the same Docker container, running the same single Java process. Each job has a different configuration, passed as an environment variable.
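To make that concrete, here's a simplified sketch of the entry point (the `JOB_CONFIG` name is illustrative, not our real variable):

    public final class JobEntryPoint {
        public static void main(String[] args) {
            // Each container gets its whole per-job configuration in one env var.
            String config = System.getenv("JOB_CONFIG"); // illustrative name
            if (config == null || config.isBlank()) {
                System.err.println("JOB_CONFIG not set; exiting");
                System.exit(2); // non-zero exit marks the Job's pod as failed
            }
            // ... parse config, then make HTTP requests, write to disk,
            // and talk to databases as described above ...
        }
    }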

We currently deploy these jobs on a Kubernetes cluster using the "Job" spec. However, the cluster does not immediately have capacity when a large influx of jobs needs to run. We also have to constantly poll the Kubernetes cluster to determine whether a Job has finished or was killed (e.g. out of memory).

I'd like to find a solution that lets us deploy these jobs as quickly as possible, with minimal concern about whether resources are available and without having to poll a system to determine whether a job has completed.

AWS Lambda comes to mind, but I have little experience with it.

As an architectural note, we have a process that serves as a scheduler, in that it calculates which job should be run, and when. That process currently submits the job to the Kubernetes cluster.

Given the above description, what architectures should I be evaluating to minimize the amount of concern this system has around 1) whether resources are available to handle the job, and 2) whether the job failed for any "non-application" reason?

This system currently runs on GCP and AWS. We're open to any solution, even if it means selecting a single (and potentially different) platform.

Brett

2 Answers


If the jobs are short-lived, your purpose might be better served by implementing a job queue, and a set of longer-lived workers which consume jobs from the queue. Is there a reason you need to run the jobs in k8s itself?
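As a hedged sketch of that pattern, assuming SQS as the queue (the `JOB_QUEUE_URL` variable and the message format are placeholders, using the AWS SDK for Java v2):

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
    import software.amazon.awssdk.services.sqs.model.Message;
    import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

    public final class QueueWorker {
        public static void main(String[] args) {
            String queueUrl = System.getenv("JOB_QUEUE_URL"); // placeholder name
            try (SqsClient sqs = SqsClient.create()) {
                while (true) {
                    // Long polling: block up to 20s rather than hammering the API.
                    var resp = sqs.receiveMessage(ReceiveMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .maxNumberOfMessages(1)
                            .waitTimeSeconds(20)
                            .build());
                    for (Message m : resp.messages()) {
                        runJob(m.body()); // per-job config travels in the message body
                        // Delete only after success; if the worker dies mid-job,
                        // SQS redelivers after the visibility timeout.
                        sqs.deleteMessage(DeleteMessageRequest.builder()
                                .queueUrl(queueUrl)
                                .receiptHandle(m.receiptHandle())
                                .build());
                    }
                }
            }
        }

        private static void runJob(String config) {
            // ... the existing job logic, parameterized by config ...
        }
    }

Your scheduler then just publishes messages instead of creating Job objects, and queue depth replaces "is the cluster available" as the thing to watch.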


Presumably your cluster is resource-limited. Achieving a higher job volume, if that is a requirement, must involve either a more efficient application or more resources.

Large providers like the ones you use will rent you as many instances as your budget allows. Scale up your cluster, possibly automatically. You may need some spare capacity if you schedule jobs on short notice.

An alternative to polling the Kubernetes Job is having your code pass a message: at the end of a job, make some kind of callback to your scheduler indicating it has finished.
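A minimal sketch of such a callback, assuming the scheduler exposes an HTTP endpoint (the URL and payload shape here are assumptions), using java.net.http from Java 11+:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public final class CompletionCallback {
        public static void reportDone(String jobId, boolean succeeded) throws Exception {
            // Payload shape is an assumption; send whatever your scheduler expects.
            String payload = String.format("{\"jobId\":\"%s\",\"status\":\"%s\"}",
                    jobId, succeeded ? "finished" : "failed");
            HttpRequest req = HttpRequest.newBuilder(
                            URI.create("https://scheduler.example.internal/callback")) // placeholder URL
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.discarding());
            // A retry on failure would belong here; the scheduler also still needs
            // its own timeout for jobs that die before this ever runs (below).
        }
    }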

Of course, a job may have died and will never report back. Eventually this needs to become a failure state. Consider polling the job at intervals after your typical shortest job time, and giving up on it after a hard limit like activeDeadlineSeconds.

John Mahowald