I administer a cluster running CentOS that uses SLURM to send jobs from a login node to compute nodes. Recently, a user reported some unexpected behaviour with their jobs. If a user starts a job with srun and then logs out, the job keeps running as expected. However, when the user is disconnected by an SSH timeout, the job is killed.

I've replicated this behaviour by sending SIGHUP to a shell that is running a job, using kill -1 ShellJobID: the job is killed. The SLURM logs indicate that the job actually received a SIGKILL rather than a SIGHUP, based on the line WSIGTERM 9. Likewise, if I run kill -1 ActiveSrunJob against the srun process itself, the job exits with WSIGTERM 9.

What is it about logging out with exit that prevents the SLURM job from being cancelled? I was under the impression, and my research seems to back this up, that SIGHUP is propagated to a shell's children on logout. Am I missing something, or am I completely off base?
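To make the SIGHUP versus SIGKILL distinction concrete outside of SLURM, here is a minimal sketch. It assumes a plain bash shell, and it uses a throwaway sleep process as a hypothetical stand-in for the real job; the helper name signal_status is also made up for illustration. It relies on the shell convention that a child terminated by signal N is reported with exit status 128 + N, so it only shows how to tell the two signals apart, not what srun itself does.

```shell
#!/usr/bin/env bash
# Start a stand-in job, send it the given signal, and print the exit
# status that the shell reports for it (128 + signal number).
signal_status() {
    sleep 10 &              # hypothetical stand-in for the real job
    local pid=$!
    kill -s "$1" "$pid"     # deliver the requested signal to the child
    wait "$pid"             # collect the child's termination status
    echo $?
}

hup_status=$(signal_status HUP)     # same signal as kill -1
kill_status=$(signal_status KILL)   # same signal as kill -9

echo "SIGHUP:  exit status $hup_status (signal $((hup_status - 128)))"
echo "SIGKILL: exit status $kill_status (signal $((kill_status - 128)))"
```

If the shell running this already ignores SIGHUP (for example when started under nohup), the stand-in process inherits that and will not be terminated by the first signal, so the sketch is only meaningful from an ordinary session.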
TheOneHyer