[CSC 435] MPI -- really think it's fixed this time...
Andrew J. Pounds
pounds_aj at mercer.edu
Wed Apr 27 00:30:41 EDT 2016
Thanks to Nick for letting me hop on his account and try some testing
this evening. The new version of MPI has some quirkiness that I have
not encountered before, but after doing some reading and testing things
out, I think I have it. I know it works on his account.
There are SEVERAL things -- all small modifications to things we have
already done -- to get this all to work. I will try to explain each one.
FIRST -- PBS. Several of you noted today that you had not logged into
systems using the full host name -- like csc204com21.cs.mercer.edu. You
have just been using csc204com21. While the shorthand works for
logging in, PBS/Torque requires the full host name to copy files. Just
look at your PBS e-mails and you will see a lot of "post-processing
errors". The problem is that the job on the node can't copy the output
back to zeus. From zeus, log in to zeus.cs.mercer.edu. This will put the
appropriate line in your ~/.ssh/known_hosts file and all should be good
for copying files back to zeus. Make sure that you do this for all the
systems in lab 204!
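One way to work through several machines at once is a small loop like the sketch below. The host list is only a placeholder (the real names of the lab 204 machines are not listed here), and the loop prints the ssh commands rather than running them, so nothing happens until you remove the "echo".

```shell
#!/bin/bash
# Sketch: visit each machine by its full host name so that its host key
# is recorded in ~/.ssh/known_hosts. Run this from zeus.

DOMAIN="cs.mercer.edu"

# ASSUMPTION: placeholder list of short host names; replace it with the
# actual machines in lab 204.
HOSTS="zeus csc204com21"

# Append the domain to a short host name.
full_name() {
    printf '%s.%s\n' "$1" "$DOMAIN"
}

for h in $HOSTS; do
    # Dry run: only print the command. Remove "echo" to really log in,
    # answer "yes" at the host-key prompt, then type "exit".
    echo ssh "$(full_name "$h")"
done
```

After logging in to a machine this way once, its full name should show up in ~/.ssh/known_hosts.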
SECOND -- PATHS. In both your PATH and LD_LIBRARY_PATH environment
variables, move the CUDA entries to the end, and then put
/usr/local/maui/bin just before the CUDA entries in your executable
PATH. This way you will be assured of being able to use showq on zeus,
but will still have access to CUDA on the clusters.
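If you would rather not edit the variables by hand, the reordering can be sketched as a small shell function. This is only an illustration: the function name "reorder" is made up, and it assumes that every CUDA directory has "cuda" somewhere in its path (for example /usr/local/cuda/bin), which may not match your setup.

```shell
#!/bin/bash
# reorder LIST [EXTRA]
# Prints LIST (colon-separated) with the non-CUDA entries first, then
# EXTRA (if given), then the CUDA entries.
# ASSUMPTION: a CUDA entry is any entry containing the string "cuda".
reorder() {
    local list="$1" extra="${2:-}" entry front="" back="" result
    local IFS=':'
    for entry in $list; do
        # Drop any existing copy of EXTRA; it is re-added below.
        if [ -n "$extra" ] && [ "$entry" = "$extra" ]; then
            continue
        fi
        case "$entry" in
            *cuda*) back="${back:+$back:}$entry" ;;
            *)      front="${front:+$front:}$entry" ;;
        esac
    done
    result="$front"
    if [ -n "$extra" ]; then result="${result:+$result:}$extra"; fi
    if [ -n "$back" ]; then result="${result:+$result:}$back"; fi
    printf '%s\n' "$result"
}

# Put maui just before the CUDA entries in PATH, and CUDA last in both.
PATH=$(reorder "$PATH" /usr/local/maui/bin)
LD_LIBRARY_PATH=$(reorder "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

The last three lines would go in your shell startup file, after the point where the CUDA and maui directories are added.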
THIRD -- I noted today that several of you could not run the "orted"
command even though it was in your path. This only happened when you
ran mpirun. To get around this we need to tell the MPI environment
exactly where to find the MPI installation on each system. Modify your
mpirun command in your PBS job file to look like this (all on one line):

    mpirun --prefix /usr/lib64/openmpi -np $n --hostfile $PBS_NODEFILE --map-by ppr:1:node mpimmm
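For context, here is a rough sketch of how that command might sit in a job file. It assumes that $n is meant to be the number of distinct nodes PBS gave the job, which fits running one process per node; if you already set n some other way, keep your own value. The helper name count_nodes is made up for this example, and your existing #PBS directives would stay above it.

```shell
#!/bin/bash
# Count the distinct hosts listed in a PBS node file.
# (Torque lists a host once per allocated core, hence the sort -u.)
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# PBS_NODEFILE is set by PBS/Torque inside a running job, so this part
# does nothing when the script is run by hand outside of PBS.
if [ -n "${PBS_NODEFILE:-}" ]; then
    # ASSUMPTION: one MPI process per node, so n = number of nodes.
    n=$(count_nodes "$PBS_NODEFILE")
    mpirun --prefix /usr/lib64/openmpi -np "$n" --hostfile "$PBS_NODEFILE" \
        --map-by ppr:1:node mpimmm
fi
```

The trailing backslash is what lets the mpirun command continue onto a second line in the job file.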
FOURTH -- Cleanup Script. Modify the ortecleanup.pl script so it uses
the full path to orte-clean. Change the lines that have orte-clean so
that they now use

    /usr/lib64/openmpi/bin/orte-clean
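If you prefer not to edit the script by hand, a sed one-liner along these lines should make the change. It is a sketch only: it assumes ortecleanup.pl is in the current directory, the function name is made up, and it leaves alone any orte-clean that already has a "/" in front of it. The original is saved as ortecleanup.pl.bak, so check the result before deleting the backup.

```shell
#!/bin/bash
# Rewrite bare "orte-clean" calls in a file to use the full path.
# Occurrences already preceded by "/" are left untouched.
# A backup of the original is kept as FILE.bak.
use_full_orte_clean() {
    sed -E -i.bak \
        's#(^|[^/])orte-clean#\1/usr/lib64/openmpi/bin/orte-clean#g' "$1"
}

# Apply it only if the script is present in the current directory.
if [ -f ortecleanup.pl ]; then
    use_full_orte_clean ortecleanup.pl
fi
```

Running "grep orte-clean ortecleanup.pl" afterwards is a quick way to confirm that every line now shows the full path.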
After making all these changes, try recompiling your code and running
the PBS job.
--
Andrew J. Pounds, Ph.D. (pounds_aj at mercer.edu)
Professor of Chemistry and Computer Science
Mercer University, Macon, GA 31207 (478) 301-5627
http://faculty.mercer.edu/pounds_aj