
In a typical continuous integration environment, you configure an environment capable of executing compilation and test batches (an agent or slave), coordinated by a scheduler (a master or server).

But what if your "client" environment is a Graphics Processing Unit (GPU) cluster used to run model training in different configurations? Does that change anything, or would you simply, for example, have the head cluster node run a Jenkins slave (or a Bamboo agent, etc.)?

Ta Mu

1 Answer


Since you're talking about CI/CD, I presume you can automate the model training runs in those configurations. Let's call the scripts able to do that train_model_config_A, train_model_config_B, etc.

Then you could have a wrapper script which checks an environment variable used to select the desired client environment and invokes the corresponding train_model_config_<blah> script, ideally translating the outcome of the training (whatever that is) into one or more pass/fail results. Such a wrapper script can then be integrated into a CI/CD pipeline as a custom test step/stage (or even a build step, if it produces artifacts you might want to archive), just like any test executed on a testbed incorporating some non-generic piece of hardware. In other words, the GPU cluster makes no real difference.
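For illustration, here's a minimal Python sketch of such a wrapper. The environment variable name TRAIN_CONFIG and the script paths are assumptions; substitute whatever your pipeline actually defines:

```python
#!/usr/bin/env python3
"""Hypothetical CI wrapper: picks a training configuration from an
environment variable and reports pass/fail via the exit code."""
import os
import subprocess
import sys

# TRAIN_CONFIG and the train_model_config_* script names are assumed
# for illustration; use whatever your environment actually provides.
config = os.environ.get("TRAIN_CONFIG", "A")
script = f"./train_model_config_{config}"

result = subprocess.run([script], capture_output=True, text=True)
print(result.stdout)

# Translate the training outcome into a CI-friendly pass/fail result:
# a non-zero exit code makes this step (and thus the pipeline) fail.
sys.exit(result.returncode)
```

The CI tool only sees an executable that exits 0 on success, which is exactly what a generic test step expects.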

You might not need to install the slave directly on the GPU cluster: if the train_model_config_<blah> scripts (and thus the wrapper script as well) can be executed on some other host and remotely control the GPUs, you can run the slave on that host, leaving the GPU cluster free to do only its own work.
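A sketch of that variant, assuming the cluster's head node is reachable over SSH (the hostname gpu-head is hypothetical):

```python
#!/usr/bin/env python3
"""Sketch of running the training remotely: the CI slave lives on an
ordinary host and only this SSH call touches the GPU cluster."""
import os
import subprocess
import sys

config = os.environ.get("TRAIN_CONFIG", "A")

# ssh propagates the remote command's exit code, so pass/fail still
# flows back to the CI pipeline without installing an agent on the cluster.
result = subprocess.run(["ssh", "gpu-head", f"./train_model_config_{config}"])
sys.exit(result.returncode)
```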

Dan Cornilescu