[CSC 435] MPI -- really think it's fixed this time...

Andrew J. Pounds pounds_aj at mercer.edu
Wed Apr 27 00:30:41 EDT 2016


Thanks to Nick for letting me hop on his account and do some testing 
this evening.  The new version of MPI has some quirks that I have 
not encountered before, but after doing some reading and testing things 
out, I think I have it working.  I know it works on his account.

There are SEVERAL changes -- all small modifications to things we have 
already done -- needed to get this all to work.  I will try to explain 
each one.

FIRST  -- PBS.  Several of you noted today that you had never logged into 
the systems using the fully qualified host name, like 
csc204com21.cs.mercer.edu.  You have just been using csc204com21.  While 
the short name works for logging in, PBS/Torque requires the fully 
qualified name to copy files.  Just look at your PBS e-mails and you 
will see a lot of "post-processing errors".  The problem is that the job 
on the node can't copy the output back to zeus.  To fix this, log in 
from zeus to zeus.cs.mercer.edu.  This will put the appropriate line in 
your ~/.ssh/known_hosts file and all should be good for copying files 
back to zeus.  Make sure that you do this for all the systems in lab 204!
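
As a rough sketch of the step above, the two helper functions below 
expand a short host name to its full form and log in once per host so 
the key is recorded.  The function names are made up for illustration, 
the shell is assumed to be bash or sh, and this assumes the intent is 
one login per host by its full name.  Only zeus and csc204com21 are 
named in this message; supply the remaining lab 204 names yourself.

```shell
# Domain taken from the host names quoted in the message.
DOMAIN="cs.mercer.edu"

# Hypothetical helper: expand a short host name to its fully qualified
# form; names that already contain a dot are passed through unchanged.
fqdn() {
  case "$1" in
    *.*) printf '%s\n' "$1" ;;
    *)   printf '%s.%s\n' "$1" "$DOMAIN" ;;
  esac
}

# Hypothetical helper: log in once to each host by its full name so that
# ssh can record the host key in ~/.ssh/known_hosts.  Answer "yes" when
# ssh asks whether to continue connecting.
prime_known_hosts() {
  for short in "$@"; do
    ssh "$(fqdn "$short")" hostname
  done
}
```

Run from zeus, a call might look like: prime_known_hosts zeus csc204com21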

SECOND -- PATHS.  In both your PATH and LD_LIBRARY_PATH environment 
variables, move the CUDA entries to the end.  Then, in your executable 
PATH, move /usr/local/maui/bin so it sits just before the CUDA entries.  
This way you will be assured of being able to use showq on zeus, 
but will still have access to CUDA on the clusters.
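
One way to get that ordering without editing the variables by hand is 
sketched below.  It assumes bash or sh, and it assumes that every CUDA 
directory has the lowercase string "cuda" somewhere in its path (for 
example /usr/local/cuda/bin, which is a guess and not taken from this 
message).  The function name reorder_path is made up for illustration.

```shell
# Hypothetical helper: reorder a colon-separated path list so that
# ordinary entries come first, then /usr/local/maui/bin (if present),
# then every entry containing "cuda" (an assumed naming pattern).
reorder_path() {
  rest=""
  maui=""
  cuda=""
  old_ifs=$IFS
  IFS=:
  for dir in $1; do
    case "$dir" in
      *cuda*)              cuda="${cuda:+$cuda:}$dir" ;;
      /usr/local/maui/bin) maui="$dir" ;;
      *)                   rest="${rest:+$rest:}$dir" ;;
    esac
  done
  IFS=$old_ifs
  out="$rest"
  [ -n "$maui" ] && out="${out:+$out:}$maui"
  [ -n "$cuda" ] && out="${out:+$out:}$cuda"
  printf '%s\n' "$out"
}
```

In a login script this could be used as PATH=$(reorder_path "$PATH") 
and LD_LIBRARY_PATH=$(reorder_path "$LD_LIBRARY_PATH"), followed by an 
export of both.  Check the result with echo before relying on it.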

THIRD -- I noted today that several of you could not run the "orted" 
command even though it was in your path.  This only happened when you 
ran mpirun.  To get around this we need to tell the MPI environment 
exactly where to find the MPI installation on each system.  Modify the 
mpirun command in your PBS job file to look like this:

  mpirun --prefix /usr/lib64/openmpi -np $n --hostfile $PBS_NODEFILE \
      --map-by ppr:1:node mpimmm
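
The message does not say how $n is set.  Since --map-by ppr:1:node 
places one process on each node, one plausible choice is the number of 
distinct hosts in the node file.  The sketch below works under that 
assumption; count_nodes and mpirun_line are made-up names, and 
mpirun_line only prints the command so it can be inspected first.

```shell
# Hypothetical helper: count the distinct hosts in a PBS node file,
# which lists one host name per line, repeated once per allocated slot.
count_nodes() {
  sort -u "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: print (not run) the launch line for a node file.
# The prefix and program name are the ones given in the message; using
# the distinct-host count for -np is an assumption.
mpirun_line() {
  printf 'mpirun --prefix /usr/lib64/openmpi -np %s --hostfile %s --map-by ppr:1:node mpimmm\n' \
    "$(count_nodes "$1")" "$1"
}
```

Inside a job file, n=$(count_nodes "$PBS_NODEFILE") would then feed the 
mpirun command shown above.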

FOURTH -- Cleanup Script.  Modify the ortecleanup.pl script so it uses 
the full path to orte-clean.

Change the lines that have orte-clean so that they now read 
/usr/lib64/openmpi/bin/orte-clean
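
If you would rather not edit the script by hand, a sed filter along the 
following lines might do it.  This is only a sketch: it assumes the 
script refers to the command by its bare name (or already by the full 
path above) and not by some other directory, and the function name 
qualify_orte_clean is made up for illustration.

```shell
# Full path given in the message.
ORTE_CLEAN=/usr/lib64/openmpi/bin/orte-clean

# Hypothetical filter: read a script on stdin and write it to stdout with
# every bare "orte-clean" replaced by the full path.  Occurrences that
# already carry the full path are parked behind a placeholder first, so
# running the filter twice does not double the prefix.
qualify_orte_clean() {
  sed -e "s#$ORTE_CLEAN#@@ORTE_CLEAN@@#g" \
      -e "s#orte-clean#$ORTE_CLEAN#g" \
      -e "s#@@ORTE_CLEAN@@#$ORTE_CLEAN#g"
}
```

For example: qualify_orte_clean < ortecleanup.pl > ortecleanup.pl.new 
Then compare the two files with diff before replacing the original.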

After making all these changes, try recompiling your code and running 
the PBS job.

-- 
Andrew J. Pounds, Ph.D.  (pounds_aj at mercer.edu)
Professor of Chemistry and Computer Science
Mercer University,  Macon, GA 31207   (478) 301-5627
http://faculty.mercer.edu/pounds_aj
