SLURM cluster inside k8s cannot run srun command

I’m a beginner k8s user and I’m trying to recreate this docker-compose SLURM cluster with Kubernetes.
First I converted the docker-compose.yaml file into Kubernetes YAML manifests so that I could use
kubectl apply -f . to create the pods and services.
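
The conversion step was essentially this (a sketch; kompose here stands in for whatever compose-to-k8s converter you prefer):

    # convert the compose file into one Kubernetes manifest per service
    # (kompose is just an example converter, any compose-to-k8s tool would do)
    kompose convert -f docker-compose.yaml
    # then create everything from the generated manifests
    kubectl apply -f .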

I’m using minikube on my computer with the none driver (following this tutorial, since I need GPU support) to create a single-node k8s cluster.
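
Starting the cluster itself looks roughly like this (a sketch; with the none driver minikube runs the Kubernetes components directly on the host, which is what lets the pods use the GPU once the NVIDIA device plugin is deployed):

    # run Kubernetes directly on the host (no VM), so the host’s NVIDIA driver
    # and GPUs are usable by pods via the NVIDIA device plugin daemonset
    sudo minikube start --driver=none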

Then I run kubectl get pods to list the running pods and open a bash shell in the slurmctld container with kubectl exec -it <slurmctld-xxxx-xxxx> -- bash in order to run SLURM commands.
I have a simple bash script named test.sh:


    #!/bin/bash -l
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --mem=0
    #SBATCH --time=0-02:00:00
    
    srun hostname

The problem is that if I submit it with sbatch test.sh from the slurmctld bash shell and then open the .out file in the c1 container of the cluster, I find only


    c1

instead of the expected output:


    c1
    c2

Also, simple commands such as srun hostname don’t work; they just hang on the command line and I have to kill them with Ctrl+C.
I think that is because srun is an interactive command that uses an open port to send the job’s output back to the shell on slurmctld. The same thing happens with sbatch and more than one node, since the output of c2 is collected on c1 (in the .out file), so c2 has to send the output of the task it ran to an open port on c1.
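
(One way to get a bit more detail on that connection setup is to run srun with extra verbosity; whether the listening ports actually show up in the debug output is an assumption on my part:)

    # run the same command with extra verbosity to watch the job/I-O setup;
    # the exact debug lines printed depend on the SLURM version
    srun -vv hostname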

According to the SLURM documentation you can specify the port range from which srun will pick a port for interactive output, so I chose to set SrunPortRange=60001-60101 in my slurm.conf file.
In order to tell k8s to expose a port range I used gomplate, as suggested in one of these answers, and I managed to expose the range on the c1, c2 and slurmctld pods.
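
The Service template I feed to gomplate looks roughly like this (a sketch rather than my exact file; the srun-<port> names and the app: c1 selector are illustrative):

    # c1-service.yaml.tmpl – rendered with: gomplate -f c1-service.yaml.tmpl -o c1-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: c1
    spec:
      selector:
        app: c1
      ports:
        - name: slurmd
          port: 6818
        # one port entry per port in SrunPortRange (60001-60101)
    {{- range math.Seq 60001 60101 }}
        - name: srun-{{ . }}
          port: {{ . }}
    {{- end }}

c2 and slurmctld get the same treatment, just with a different name, selector and base port (6818 for the compute nodes, 6817 for slurmctld).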

The output of kubectl get pods -o wide --all-namespaces is:

NAMESPACE              NAME                                         READY   STATUS    RESTARTS       AGE   IP             NODE           NOMINATED NODE   READINESS GATES
default                c1-6b9d86d59d-sm946                          1/1     Running   0              57m   172.17.0.9     miki-desktop   <none>           <none>
default                c2-7fc7ccc789-5qgpl                          1/1     Running   0              57m   172.17.0.4     miki-desktop   <none>           <none>
default                mysql-5bc8b5c987-8wmnx                       1/1     Running   0              57m   172.17.0.10    miki-desktop   <none>           <none>
default                slurmctld-b8c9bf7c8-v5pfg                    1/1     Running   0              57m   172.17.0.8     miki-desktop   <none>           <none>
default                slurmdbd-c48bd8d64-rg7kl                     1/1     Running   0              57m   172.17.0.3     miki-desktop   <none>           <none>
kube-system            coredns-78fcd69978-pmfn6                     1/1     Running   2 (17h ago)    2d    172.17.0.6     miki-desktop   <none>           <none>
kube-system            etcd-miki-desktop                            1/1     Running   3 (17h ago)    2d    192.168.1.10   miki-desktop   <none>           <none>
kube-system            kube-apiserver-miki-desktop                  1/1     Running   10 (17h ago)   2d    192.168.1.10   miki-desktop   <none>           <none>
kube-system            kube-controller-manager-miki-desktop         1/1     Running   3 (17h ago)    2d    192.168.1.10   miki-desktop   <none>           <none>
kube-system            kube-proxy-wp4qh                             1/1     Running   2 (17h ago)    2d    192.168.1.10   miki-desktop   <none>           <none>
kube-system            kube-scheduler-miki-desktop                  1/1     Running   3 (17h ago)    2d    192.168.1.10   miki-desktop   <none>           <none>
kube-system            nvidia-device-plugin-daemonset-g99wj         1/1     Running   2              2d    172.17.0.7     miki-desktop   <none>           <none>
kube-system            storage-provisioner                          1/1     Running   12 (60m ago)   2d    192.168.1.10   miki-desktop   <none>           <none>
kubernetes-dashboard   dashboard-metrics-scraper-5594458c94-lmghb   1/1     Running   1 (17h ago)    25h   172.17.0.2     miki-desktop   <none>           <none>
kubernetes-dashboard   kubernetes-dashboard-654cf69797-6g29f        1/1     Running   2 (60m ago)    25h   172.17.0.5     miki-desktop   <none>           <none>

While the output of kubectl get service is:

    NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      AGE
c1           ClusterIP   10.110.44.238    <none>        6818/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP   58m
c2           ClusterIP   10.107.158.107   <none>        6818/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP   58m
kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      2d
mysql        ClusterIP   10.108.91.235    <none>        3306/TCP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     58m
slurmctld    ClusterIP   10.108.97.196    <none>        6817/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP   58m
slurmdbd     ClusterIP   10.103.41.160    <none>        6819/TCP                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     58m

My problem is that the log file /var/log/slurm/slurmd.log on the c1 node, after the simple srun hostname command (killed with Ctrl+C), shows

    [2021-12-16T14:51:55.517] launch task 3.0 request from UID:0 GID:0 HOST:172.17.0.1 PORT:60972

and after:

    [2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
    [2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
    [2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
    [2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
    [2021-12-16T14:51:55.522] [3.0] debug2: Error connecting slurm stream socket at 172.17.0.1:60016: Connection refused
    [2021-12-16T14:51:55.522] [3.0] error: connect io: Connection refused

I think it is trying to connect back to the srun port in order to send the output to the slurmctld shell from which I launched the command, but the connection is refused for some reason.
I also noticed that 172.17.0.1 is an IP address that doesn’t belong to any of my pods; since I’m not an experienced k8s user, I find it very difficult to figure out what’s going on here.

Can someone please explain why the task launch request seems to come from 172.17.0.1 instead of 172.17.0.8 (slurmctld)? I think that is the main problem, because c1 then tries to connect to a port on 172.17.0.1, while my service exposes that port on slurmctld (172.17.0.8), and so nothing works.
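
For what it’s worth, these are the kinds of checks I can run to see where 172.17.0.1 lives; my guess (and it is only a guess) is that it is the host’s docker0 bridge address, since the none driver puts the pods on the host’s Docker bridge network:

    # inside the slurmctld pod: check the pod’s own address and default gateway
    # (assuming the iproute2 tools are present in the image)
    kubectl exec -it slurmctld-b8c9bf7c8-v5pfg -- ip route
    # on the host itself (with the none driver everything runs directly on the host);
    # 172.17.0.1 is typically the address of the docker0 bridge
    ip addr show docker0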

PS: I slightly modified the Dockerfile and the docker-entrypoint.sh files, so I put them here for completeness.
