[CSC 435] CSC 435: Issues with Queuing into cluster

Andrew J. Pounds pounds_aj at mercer.edu
Tue Apr 23 03:35:59 EDT 2024


I killed two of the jobs and the cluster is processing again.   I am 
monitoring it.

I think the main culprit here is that when David ran his jobs he 
specified the nodes as (in job 42480)

7:lab218:ppn=1

See that ppn=1 at the end.  That means that PBS/Torque is assigning one 
process of the job to that particular machine (which sounds great) but 
it ALSO means that the system has other processors (9 more) that 
PBS/Torque sees as available for other jobs.  PBS/Torque will then let 
other jobs run on that system.  David scheduled many jobs simultaneously 
using commands like this, and so PBS/Torque was trying to run many jobs 
simultaneously on the same sets of nodes.
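
As a rough illustration of the arithmetic, here is a small shell sketch. 
It assumes 10 cores per node (the one requested plus the "9 more" 
mentioned above), and the helper name jobs_sharing_node is made up for 
this example; it is not part of PBS/Torque.

```shell
#!/bin/sh
# Assumption: each node has 10 cores (1 requested + the "9 more" noted above).
cores_per_node=10

# Hypothetical helper: how many jobs that each request `ppn` cores
# could the scheduler place on the same node at the same time?
jobs_sharing_node() {
    echo $((cores_per_node / $1))
}

echo "ppn=1:  up to $(jobs_sharing_node 1) jobs on one node"
echo "ppn=10: up to $(jobs_sharing_node 10) job on one node"
```

With ppn=1 the scheduler can stack several jobs on one machine, which is 
the overlap described above; with ppn=10 there is no room left to share.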

BTW -- David, that is why you saw the "can't reach node" error -- the 
PBS/Torque process network was flooded and PBS has since flagged those 
nodes as "down".  I am working to restore them.

If you look at the PBS scripts I gave you for these projects, I always 
use ppn=10 so the PBS/Torque system will know to assign ALL of the node 
to your job.  If you are not doing that then you will most likely be 
sharing a node and colliding with other jobs -- which, as we have seen, 
causes discrepancies in your runtimes.   There are times when you want 
to run with ppn less than 10, but not for this type of benchmarking.
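
One way to catch this before submitting is a small pre-submit check. 
The following is only a sketch: the function names are hypothetical, 
and it assumes 10 cores per node and a directive written in the form 
"#PBS -l nodes=...:ppn=N" inside the job script.

```shell
#!/bin/sh
# Hypothetical pre-submit check (not part of PBS/Torque).
# Assumption: 10 cores per node, so a whole-node request is ppn=10.
cores_per_node=10

# Print the ppn value from the first "#PBS" line in file $1 that has one.
requested_ppn() {
    sed -n 's/^#PBS.*ppn=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Return 0 if the script requests whole nodes, 1 if it would share
# nodes with other jobs, 2 if no ppn value could be found.
check_whole_node() {
    ppn=$(requested_ppn "$1")
    if [ -z "$ppn" ]; then
        echo "$1: no ppn value found" >&2
        return 2
    fi
    if [ "$ppn" -lt "$cores_per_node" ]; then
        echo "$1: ppn=$ppn leaves cores open to other jobs" >&2
        return 1
    fi
    return 0
}
```

A possible use would be "check_whole_node MPIMMM.pbs && qsub MPIMMM.pbs", 
so that a benchmarking job is only submitted when it claims the full node.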

Please note that in the MPIMMM.pbs job script I gave you we ALLOCATE all 
of the node in PBS/Torque, but only use ONE MPI process per node, which 
is specified with the --map-by ppr:1:node flag sent to the mpirun 
command.  I know that I discussed this in class briefly last Wednesday 
for those that were there, but several of you were processing many 
things during class that day.
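
To make the difference between what is allocated and what is launched 
concrete, here is a sketch that is NOT the actual MPIMMM.pbs. It assumes 
a request of 7 nodes with ppn=10 (by analogy with job 42480), relies on 
the $PBS_NODEFILE that Torque provides inside a job (one line per 
allocated core), and uses ./mmm as a placeholder executable name. The 
two helper functions are made up for this example.

```shell
#!/bin/sh
#PBS -l nodes=7:lab218:ppn=10
# Assumed request: 7 whole nodes, i.e. all 10 cores on each one.

# Number of core slots Torque allocated (one line per core in the file).
allocated_slots() { wc -l < "$1" | tr -d ' '; }

# Number of distinct hosts in the allocation; with --map-by ppr:1:node
# mpirun starts one MPI process on each of them.
distinct_nodes() { sort -u "$1" | wc -l | tr -d ' '; }

# Only meaningful inside a running job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated core slots: $(allocated_slots "$PBS_NODEFILE")"
    echo "MPI processes to launch: $(distinct_nodes "$PBS_NODEFILE")"
    # The launch line would then look like this (./mmm is a placeholder):
    # mpirun --map-by ppr:1:node ./mmm
fi
```

Under those assumptions the job holds 70 core slots but starts only 7 
MPI processes, so nothing else can be scheduled onto those nodes while 
the benchmark runs.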




On 4/23/24 00:12, Ervin Escalona Pangilinan wrote:
> Hi Dr. Pounds,
>
> I haven’t been able to queue jobs into the cluster for the past couple 
> of hours. David has a bunch of jobs lined up but the status of it 
> hasn’t changed since a couple of hours ago. I attached a picture 
> below. The same job has been frozen at 30 seconds for a while now.
>
> Thank you,
> Ervin Pangilinan
>

-- 
*/Andrew J. Pounds, Ph.D./*
/Professor of Chemistry and Computer Science/
/Director of the Computational Science Program/
/Mercer University, Macon, GA 31207 (478) 301-5627/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot 2024-04-23 at 12.09.17?AM.png
Type: image/png
Size: 535483 bytes
Desc: not available
URL: <http://theochem.mercer.edu/pipermail/csc435/attachments/20240423/0b8709c8/attachment-0001.png>

