[CSC 435] CSC 435: Issues with Queuing into cluster
Andrew J. Pounds
pounds_aj at mercer.edu
Tue Apr 23 03:35:59 EDT 2024
I killed two of the jobs and the cluster is processing again. I am
monitoring it.
I think the main culprit here is that when David ran his jobs he
specified the nodes as (in job 42480)
7:lab218:ppn=1
See that ppn=1 at the end. That means that PBS/Torque is assigning one
process of the job to that particular machine (which sounds great), but
it ALSO means that the system has other processors (9 more) that
PBS/Torque sees as available for other jobs. PBS/Torque will then let other
jobs run on that system. David scheduled many jobs simultaneously using
commands like this, and so PBS/Torque was trying to run many jobs
simultaneously on the same sets of nodes.
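To make the arithmetic concrete, here is a small shell sketch. The 10-core node size comes from the numbers above; the function name free_cores is made up for illustration and is not part of PBS/Torque.

```shell
#!/bin/sh
# Assumption taken from this message: each lab node has 10 cores.
CORES_PER_NODE=10

# Hypothetical helper: cores the scheduler still treats as free on a node
# after one job has claimed `ppn` of them.
free_cores() {
    echo $(( CORES_PER_NODE - $1 ))
}

echo "ppn=1 leaves $(free_cores 1) cores open to other jobs"
echo "ppn=10 leaves $(free_cores 10) cores open to other jobs"
```

With ppn=1 the scheduler can still place other work on the same node; with ppn=10 nothing is left for it to hand out.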
BTW -- David, that is why you saw the "can't reach node" error -- the
PBS/Torque process network was flooded and PBS has since flagged those
nodes as "down". I am working to restore them.
If you look at the PBS scripts I gave you for these projects, I always
use ppn=10 so the PBS/Torque system will know to assign ALL of the node
to your job. If you are not doing that, then you will most likely be
sharing a node and colliding with other jobs -- which, as we have seen,
causes discrepancies in your runtimes. There are times when you want
to run with ppn less than 10, but not for this type of benchmarking.
Please note that in the MPIMMM.pbs job script I gave you we ALLOCATE all
of the node in PBS/Torque, but choose to use only ONE MPI process per
node, which is specified with the --map-by ppr:1:node flag sent to the
mpirun command. I know that I discussed this in class briefly last
Wednesday for those who were there, but several of you were processing
many things during class that day.
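As a rough sketch of that pattern (this is NOT the actual MPIMMM.pbs script), a job file might look like the following. The job name, walltime, and executable name are hypothetical placeholders; the node count and the lab218 property are taken from job 42480 above. It also assumes an Open MPI mpirun that picks up the node list from PBS/Torque.

```shell
#!/bin/bash
# Sketch only: job name, walltime, and executable below are hypothetical.
#PBS -N mmm_sketch
# Reserve all 10 cores on each of 7 lab218 nodes so no other job lands there.
#PBS -l nodes=7:lab218:ppn=10
#PBS -l walltime=00:30:00

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Launch one MPI process per node, even though each whole node is allocated.
mpirun --map-by ppr:1:node ./mmm_benchmark
```

The point of the split is that ppn controls what the scheduler reserves, while --map-by controls how many processes mpirun actually starts on each reserved node.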
On 4/23/24 00:12, Ervin Escalona Pangilinan wrote:
> Hi Dr. Pounds,
>
> I haven’t been able to queue jobs into the cluster for the past couple
> of hours. David has a bunch of jobs lined up but the status of it
> hasn’t changed since a couple of hours ago. I attached a picture
> below. The same job has been frozen at 30 seconds for a while now.
>
> Thank you,
> Ervin Pangilinan
>
--
Andrew J. Pounds, Ph.D.
Professor of Chemistry and Computer Science
Director of the Computational Science Program
Mercer University, Macon, GA 31207 (478) 301-5627
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2024-04-23 at 12.09.17 AM.png
Type: image/png
Size: 535483 bytes
Desc: not available
URL: <http://theochem.mercer.edu/pipermail/csc435/attachments/20240423/0b8709c8/attachment-0001.png>