<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">I killed two of the jobs and the
cluster is processing again. I am monitoring it.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">I think the main culprit here is that
when David ran his jobs he specified the nodes as (in job 42480)</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">7:lab218:ppn=1</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">See that ppn=1 at the end. That means
that PBS/Torque is assigning one process of the job to that
particular machine (which sounnds great) but it ALSO means that
the systems has other processors (9 more) that PBS/Torque sees
available for othe jobs. PBS/Torque will then let other jobs run
on that system. David schedule many jobs simultaneously using
commands like this and so PBS/Torque was trying to run many jobs
simultaneosly on the same sets of nodes. <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">BTW -- David, that is why you saw the
"can't reach node error" -- the PBS/Torque process network was
flooded and PBS has since flagged those nodes as "down". I am
working to restore them.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">If you look at the PBS scripts I gave
you for these projects I always use ppn=10 so the PBS/torque
system will know to assign ALL of the node to your job. If you
are not doing that then you will most likely be running on a node
and collide with other jobs -- which, we have seen, causes
discrepancies in your runtimes. There are times when you want to
run with ppn less than 10, but not for this type of benchmarking.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Please note that In the MPIMMM.pbs job
script I gave you we ALLOCATE all of the node in PBS/Torque, but
only select to use ONE MPI Process per node which is specified
with the --map-by ppr:1:node flag sent to the mpirun command. I
know that I discussed this in class briefly last Wednesday for
those that were there but several of you were processing many
things during class that day.</div>
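<div class="moz-cite-prefix">The relevant lines look roughly like
this (a sketch, not the full script -- the executable name
./mpimmm is a placeholder; see MPIMMM.pbs for the real thing):</div>
<pre>
# Reserve every core on each node so no other job can land there.
#PBS -l nodes=7:lab218:ppn=10

cd $PBS_O_WORKDIR
# Launch only ONE MPI process per allocated node, even though all
# 10 cores on each node are held by this job.
mpirun --map-by ppr:1:node ./mpimmm
</pre>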
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<br>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 4/23/24 00:12, Ervin Escalona
Pangilinan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:08563107bdc64b68820a1274a0369667@MW6PR01MB8580.prod.exchangelabs.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div class="BodyFragment"><font size="2"><span
style="font-size:11pt;">
<div class="PlainText">Hi Dr. Pounds,<br>
<br>
I haven’t been able to queue jobs into the cluster for the
past couple of hours. David has a bunch of jobs lined up,
but their status hasn’t changed in a couple of hours. I
attached a picture below. The same job has been frozen at
30 seconds for a while now.<br>
<br>
Thank you,<br>
Ervin Pangilinan<br>
<br>
</div>
</span></font></div>
<div><img src="cid:part1.qCl7Xz4P.kk7zyZP0@mercer.edu" class=""> </div>
</blockquote>
<p><br>
</p>
<div class="moz-signature">-- <br>
<b><i>Andrew J. Pounds, Ph.D.</i></b><br>
<i>Professor of Chemistry and Computer Science</i><br>
<i>Director of the Computational Science Program</i><br>
<i>Mercer University, Macon, GA 31207 (478) 301-5627</i></div>
</body>
</html>