Background
Just for everyone else: SGE = Sun Grid Engine, which is the qsub system mentioned in the title of the question.
qsub:
If by parallelising you mean multithreading a single job across lots of cores, then the first thing to find out is how many cores there are per node.
Usually it is around 10 (but it could be many more). That exact number is important to know, because if you request more cores than exist on a single node the job will never run: at best qsub will refuse the job, at worst it will sit in the queue forever. For multi-processing on qsub it's …
qsub -pe omp 8 myscript.sh
Here it's requesting 8 cores, and myscript.sh contains the shell command that runs the Python script. If there is a 10-core node with 2 cores already in use, the job will be loaded onto it, taking the node to its maximum capacity.
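For what it's worth, here is a rough sketch of what myscript.sh could look like. It assumes the site's parallel environment really is called omp (the name varies between clusters), and the Python file name myscript.py and its --threads flag are made up for illustration. NSLOTS is the variable Grid Engine sets to the number of slots the job was granted.

```shell
#!/bin/bash
#$ -cwd            # run from the directory the job was submitted from
#$ -pe omp 8       # same request as on the command line; "omp" is site-specific

# NSLOTS is set by Grid Engine inside a job; fall back to 1 when run elsewhere.
threads="${NSLOTS:-1}"
export OMP_NUM_THREADS="$threads"
echo "Using ${threads} thread(s)"

# Hypothetical script name and flag: swap in whatever your Python script expects.
if [ -f myscript.py ]; then
    python myscript.py --threads "$threads"
else
    echo "myscript.py not found, nothing to run" >&2
fi
```

Reading the core count from NSLOTS rather than hard-coding 8 keeps the script in step with whatever was actually requested at submission.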
Thus qsub does not do coarse-grained parallelisation; that is MPI. Used like this, qsub will only parallelise across the cores in a single node. This is without question a limitation for cluster computing.
qsub is more than just submitting to a queue: you need to monitor the queuing system before and after submission. Firstly, to see how many jobs already exist across the system. Secondly, to see what's happening to your own jobs.
qstat is the way to understand what's there:
qstat -f # what's what on the cluster
qstat -q
This lists all the queues.
qstat -u username # looks at what a given user is doing
qstat long-queue
Targets availability on a specific queue.
A complete list of qstat options is here. Also, qusage -l is useful.
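If the raw qstat listing gets long, a small helper can boil it down to a count per job state. This is only a sketch: count_states is a made-up name, and it assumes the usual Grid Engine layout of two header lines with the state (r, qw, and so on) in the fifth column, so check it against what your cluster actually prints.

```shell
#!/bin/bash
# Summarise job states from qstat output read on stdin.
# Assumption: two header lines, then one job per line with the state in column 5.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a cluster (not run here):
#   qstat -u "$USER" | count_states
```

Run before submitting, it gives a quick feel for how much is already queued; run afterwards, it shows how many of your own jobs are running versus waiting.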
“Parallelising” loops
If you are submitting lots of jobs that all work independently in parallel, it is better to use a different strategy on qsub: submit each job separately to the queue via a qsub loop. There would then be two scripts. The first is the script in the question; the second is a submission loop, comprising a qsub call within a loop that submits each job in turn. The array might therefore be better placed in the qsub submission loop.
Monitoring the queue takes place as normal via qstat.
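A submission loop along these lines might look like the sketch below. The names run_one.sh, submit_all and the sample names are placeholders, not anything from the question. It defaults to a dry run that only prints the qsub commands, so you can check them before setting DRY_RUN=0 to submit for real.

```shell
#!/bin/bash
# Submission loop sketch: one independent job per input.
# run_one.sh and the sample names are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the qsub commands, 0 = really submit

submit_all() {
    for sample in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "qsub -N job_${sample} run_one.sh ${sample}"
        else
            # -N names the job; arguments after the script name are passed to the script
            qsub -N "job_${sample}" run_one.sh "$sample"
        fi
    done
}

submit_all sampleA sampleB sampleC
```

Each job then sits in the queue on its own, and the list of inputs lives in the submission loop rather than inside the job script.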
Rationale
The reason for this is the way the queue works: qsub prioritises single-core jobs over multi-threaded ones, so you get your results much quicker.