<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">I killed two of the jobs and the
cluster is processing again. I am monitoring it.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">I think the main culprit here is that
when David ran his jobs he specified the nodes as (in job 42480)</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">7:lab218:ppn=1</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">See that ppn=1 at the end. That means
that PBS/Torque is assigning one process of the job to that
particular machine (which sounnds great) but it ALSO means that
the systems has other processors (9 more) that PBS/Torque sees
available for othe jobs. PBS/Torque will then let other jobs run
on that system. David schedule many jobs simultaneously using
commands like this and so PBS/Torque was trying to run many jobs
simultaneosly on the same sets of nodes. <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">BTW -- David, that is why you saw the
"can't reach node error" -- the PBS/Torque process network was
flooded and PBS has since flagged those nodes as "down". I am
working to restore them.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">If you look at the PBS scripts I gave
you for these projects I always use ppn=10 so the PBS/torque
system will know to assign ALL of the node to your job. If you
are not doing that then you will most likely be running on a node
and collide with other jobs -- which, we have seen, causes
discrepancies in your runtimes. There are times when you want to
run with ppn less than 10, but not for this type of benchmarking.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Please note that In the MPIMMM.pbs job
script I gave you we ALLOCATE all of the node in PBS/Torque, but
only select to use ONE MPI Process per node which is specified
with the --map-by ppr:1:node flag sent to the mpirun command. I
know that I discussed this in class briefly last Wednesday for
those that were there but several of you were processing many
things during class that day.</div>
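<div class="moz-cite-prefix">The relevant lines look roughly like
this (a sketch, not the full script -- the executable name
./mpimmm is a placeholder; see MPIMMM.pbs for the real thing):</div>
<pre>
# Reserve every core on each node so no other job can land there.
#PBS -l nodes=7:lab218:ppn=10

cd $PBS_O_WORKDIR
# Launch only ONE MPI process per allocated node, even though all
# 10 cores on each node are held by this job.
mpirun --map-by ppr:1:node ./mpimmm
</pre>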
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<br>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 4/23/24 00:12, Ervin Escalona
Pangilinan wrote:<br>
</div>
<blockquote type="cite"
cite="mid:08563107bdc64b68820a1274a0369667@MW6PR01MB8580.prod.exchangelabs.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<div class="BodyFragment"><font size="2"><span
style="font-size:11pt;">
<div class="PlainText">Hi Dr. Pounds,<br>
<br>
I haven’t been able to queue jobs into the cluster for the
past couple of hours. David has a bunch of jobs lined up,
but their status hasn’t changed in a couple of hours. I
attached a picture below. The same job has been frozen at
30 seconds for a while now.<br>
<br>
Thank you,<br>
Ervin Pangilinan<br>
<br>
</div>
</span></font></div>
<div><img src="cid:part1.qCl7Xz4P.kk7zyZP0@mercer.edu" class=""> </div>
</blockquote>
<p><br>
</p>
<div class="moz-signature">-- <br>
<b><i>Andrew J. Pounds, Ph.D.</i></b><br>
<i>Professor of Chemistry and Computer Science</i><br>
<i>Director of the Computational Science Program</i><br>
<i>Mercer University, Macon, GA 31207 (478) 301-5627</i></div>
</body>
</html>