I’m a beginner k8s user, and I’m trying to recreate this docker-compose SLURM cluster with Kubernetes.
First I converted the docker-compose.yaml file into k8s YAML files so that I could run kubectl apply -f . to create the pods and services.
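For reference, a minimal sketch of the kind of Service manifest this conversion yields for a compute node (the app: c1 selector label is an assumption; port 6818 is slurmd's port, matching the services listed further down):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: c1
spec:
  selector:
    app: c1            # assumed pod label
  ports:
    - name: slurmd
      port: 6818       # slurmd's listening port
      protocol: TCP
```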
I’m using minikube on my computer with the none driver (as in this tutorial, since I need GPU support) to create a single-node k8s cluster.
Then I run kubectl get pods to list the running pods, and I open a bash shell in the slurmctld container with kubectl exec -it <pod-name> -- bash in order to launch SLURM commands.
I have a simple bash script named test.sh:
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --mem=0
#SBATCH --time=0-02:00:00
srun hostname
The problem is that if I run it with sbatch test.sh in the slurmctld bash shell and then open the .out file in the c1 container, I find only
c1
instead of the expected output:
c1
c2
Even simple commands such as srun hostname don’t work: the command just hangs and I have to kill it with Ctrl+C.
I think that is because srun is an interactive command that uses an open port to send the job’s output back to the shell on slurmctld. The same thing happens with sbatch when more than one node is involved, since the output of c2 is collected on c1 (in the .out file), so c2 has to send the output of the job it ran to an open port on c1.
The SLURM documentation says you can specify the port range from which srun will pick a port for interactive output, so I set SrunPortRange=60001-60101 in my slurm.conf file.
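For completeness, the relevant slurm.conf excerpt would look something like this (a sketch; the 6817 and 6818 daemon ports match the ones exposed by the services below, and the SlurmctldPort/SlurmdPort lines are shown only for context):

```
# slurm.conf (excerpt)
SlurmctldPort=6817         # slurmctld control port (matches the slurmctld Service)
SlurmdPort=6818            # slurmd port on the compute nodes (matches c1/c2)
SrunPortRange=60001-60101  # srun picks its interactive I/O ports from this range
```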
To tell k8s to expose a port range I used gomplate, as suggested in one of these answers, and I managed to expose the range on the c1, c2 and slurmctld pods.
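As a plain-shell alternative to the gomplate template, the 101 Service port entries can also be generated with a small loop (a sketch; the srun-ports.yaml file name and the srun- name prefix are my own choices):

```shell
# Emit one Kubernetes Service port entry per port in SrunPortRange=60001-60101
for p in $(seq 60001 60101); do
  printf '    - name: srun-%s\n      port: %s\n      protocol: TCP\n' "$p" "$p"
done > srun-ports.yaml
```

The resulting snippet can then be pasted (or templated) into each Service’s ports: list.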
The output of kubectl get pods -o wide --all-namespaces is:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default c1-6b9d86d59d-sm946 1/1 Running 0 57m 172.17.0.9 miki-desktop <none> <none>
default c2-7fc7ccc789-5qgpl 1/1 Running 0 57m 172.17.0.4 miki-desktop <none> <none>
default mysql-5bc8b5c987-8wmnx 1/1 Running 0 57m 172.17.0.10 miki-desktop <none> <none>
default slurmctld-b8c9bf7c8-v5pfg 1/1 Running 0 57m 172.17.0.8 miki-desktop <none> <none>
default slurmdbd-c48bd8d64-rg7kl 1/1 Running 0 57m 172.17.0.3 miki-desktop <none> <none>
kube-system coredns-78fcd69978-pmfn6 1/1 Running 2 (17h ago) 2d 172.17.0.6 miki-desktop <none> <none>
kube-system etcd-miki-desktop 1/1 Running 3 (17h ago) 2d 192.168.1.10 miki-desktop <none> <none>
kube-system kube-apiserver-miki-desktop 1/1 Running 10 (17h ago) 2d 192.168.1.10 miki-desktop <none> <none>
kube-system kube-controller-manager-miki-desktop 1/1 Running 3 (17h ago) 2d 192.168.1.10 miki-desktop <none> <none>
kube-system kube-proxy-wp4qh 1/1 Running 2 (17h ago) 2d 192.168.1.10 miki-desktop <none> <none>
kube-system kube-scheduler-miki-desktop 1/1 Running 3 (17h ago) 2d 192.168.1.10 miki-desktop <none> <none>
kube-system nvidia-device-plugin-daemonset-g99wj 1/1 Running 2 2d 172.17.0.7 miki-desktop <none> <none>
kube-system storage-provisioner 1/1 Running 12 (60m ago) 2d 192.168.1.10 miki-desktop <none> <none>
kubernetes-dashboard dashboard-metrics-scraper-5594458c94-lmghb 1/1 Running 1 (17h ago) 25h 172.17.0.2 miki-desktop <none> <none>
kubernetes-dashboard kubernetes-dashboard-654cf69797-6g29f 1/1 Running 2 (60m ago) 25h 172.17.0.5 miki-desktop <none> <none>
The output of kubectl get service is:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
c1 ClusterIP 10.110.44.238 <none> 6818/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP 58m
c2 ClusterIP 10.107.158.107 <none> 6818/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP 58m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 2d
mysql ClusterIP 10.108.91.235 <none> 3306/TCP 58m
slurmctld ClusterIP 10.108.97.196 <none> 6817/TCP,60001/TCP,60002/TCP,60003/TCP,60004/TCP,60005/TCP,60006/TCP,60007/TCP,60008/TCP,60009/TCP,60010/TCP,60011/TCP,60012/TCP,60013/TCP,60014/TCP,60015/TCP,60016/TCP,60017/TCP,60018/TCP,60019/TCP,60020/TCP,60021/TCP,60022/TCP,60023/TCP,60024/TCP,60025/TCP,60026/TCP,60027/TCP,60028/TCP,60029/TCP,60030/TCP,60031/TCP,60032/TCP,60033/TCP,60034/TCP,60035/TCP,60036/TCP,60037/TCP,60038/TCP,60039/TCP,60040/TCP,60041/TCP,60042/TCP,60043/TCP,60044/TCP,60045/TCP,60046/TCP,60047/TCP,60048/TCP,60049/TCP,60050/TCP,60051/TCP,60052/TCP,60053/TCP,60054/TCP,60055/TCP,60056/TCP,60057/TCP,60058/TCP,60059/TCP,60060/TCP,60061/TCP,60062/TCP,60063/TCP,60064/TCP,60065/TCP,60066/TCP,60067/TCP,60068/TCP,60069/TCP,60070/TCP,60071/TCP,60072/TCP,60073/TCP,60074/TCP,60075/TCP,60076/TCP,60077/TCP,60078/TCP,60079/TCP,60080/TCP,60081/TCP,60082/TCP,60083/TCP,60084/TCP,60085/TCP,60086/TCP,60087/TCP,60088/TCP,60089/TCP,60090/TCP,60091/TCP,60092/TCP,60093/TCP,60094/TCP,60095/TCP,60096/TCP,60097/TCP,60098/TCP,60099/TCP,60100/TCP,60101/TCP 58m
slurmdbd ClusterIP 10.103.41.160 <none> 6819/TCP 58m
My problem is that the log file /var/log/slurm/slurmd.log on the c1 node, after the simple srun hostname command (killed with Ctrl+C), shows:
[2021-12-16T14:51:55.517] launch task 3.0 request from UID:0 GID:0 HOST:172.17.0.1 PORT:60972
followed by:
[2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
[2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
[2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
[2021-12-16T14:51:55.522] [3.0] debug2: slurm_connect failed: Connection refused
[2021-12-16T14:51:55.522] [3.0] debug2: Error connecting slurm stream socket at 172.17.0.1:60016: Connection refused
[2021-12-16T14:51:55.522] [3.0] error: connect io: Connection refused
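The failing address can be pulled out of such log lines with a quick filter; a minimal sketch, applied here to the quoted line rather than to the live /var/log/slurm/slurmd.log:

```shell
# Extract the unreachable address (IP:port) from a slurmd log line
line='[2021-12-16T14:51:55.522] [3.0] debug2: Error connecting slurm stream socket at 172.17.0.1:60016: Connection refused'
echo "$line" | sed -E 's/.*at ([0-9.]+:[0-9]+).*/\1/'
# -> 172.17.0.1:60016
```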
I think it is trying to connect back to the srun port to send the output to the slurmctld shell from which I launched the command, but for some reason it cannot.
I also noticed that 172.17.0.1 is an IP address that doesn’t belong to any of my pods. Since I’m not an experienced k8s user, I find it very difficult to figure out what’s going on here.
Can someone please explain to me why the task launch requests seem to come from 172.17.0.1 instead of 172.17.0.8 (slurmctld)? I think that is the main problem: c1 then tries to connect to a port on 172.17.0.1, but my service exposes that port on slurmctld (172.17.0.8), so nothing works.
PS: I slightly modified the Dockerfile and the docker-entrypoint.sh files, so I put them here for completeness.