[CSC 435] Working on Distributed Environments
Andrew J. Pounds
pounds_aj at mercer.edu
Tue Mar 18 18:47:56 EDT 2014
Okay -- so progress was made today. Hopefully by the end of class on
Thursday we will all have CUDA code running and be able to work across
multiple machines.
When I looked at the errors you all sent me last week and went through
the TORQUE system log files, I was able to replicate the errors you were
seeing -- and also determined why none of the TORQUE output files were
showing up in your directories. The cause was failed file transfers,
which misconfigured authorization files can produce. In a nutshell,
while we can use the short names (e.g., csc100com01) to log into each of
the machines remotely, TORQUE wants to see the fully qualified domain
name (e.g., csc100com01.cs.mercer.edu). So while you may have all of the
short names in your authorized_keys and known_hosts files, the
known_hosts file needs to have BOTH the short and the long version of
each machine name (I may fix my key-generation script later so it adds
both automatically).
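If some of the long names are missing from your known_hosts, one quick
way to add them by hand -- assuming ssh-keyscan is installed on the
machine you are working from -- is something like this (the two host
names are only examples; use whichever machines you care about):
for h in csc100com20 csc100com21 ; do
    # grab both the short-name and the fully qualified host keys in one pass
    ssh-keyscan $h $h.cs.mercer.edu >> ~/.ssh/known_hosts
done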
The other piece of the puzzle is that zeus sometimes has to be rebooted
(or the TORQUE server restarted) to enable you to submit jobs from the
client machines. This was not a problem with older versions of TORQUE,
but when I upgraded it to handle GPUs I think the patch "broke" some of
the older pieces. The easy fix is to submit your jobs from zeus.
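Concretely, submitting from zeus just looks like this (nodemap.pbs is
only a placeholder for whatever you name your submission script):
ssh zeus.cs.mercer.edu    # log into the TORQUE server itself
cd ~/nodemapper           # wherever your submission script lives
qsub nodemap.pbs          # submit from zeus rather than from a client machine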
What I need you to do:
1. Make sure that you can log into all of the machines using both the
short name and the fully qualified domain name without being asked for a
password (a quick scripted check is shown after the example job script
below). If you cannot, then we need to fix the keys for the machines
where this is a problem.
2. Verify, using two or three of the machines that you confirmed work
in part 1, that you can run a batch job across them. You might want to
use a TORQUE script like this:
#!/bin/sh
#PBS -N NODEMAP
#PBS -m abe
#PBS -M pounds_aj at mercer.edu
#PBS -j oe
#PBS -k n
#PBS -l nodes=1:csc100com21:ppn=2+1:csc100com20:ppn=2,walltime=8:00:00
#PBS -V
#
export OMPI_MCA_btl=self,tcp     # tell Open MPI to use only the self and tcp transports
cat $PBS_NODEFILE                # show the processor slots TORQUE assigned (one line per slot)
cd /home/chemist/nodemapper      # change this to your own nodemapper directory
n=`wc -l < $PBS_NODEFILE`        # total slots = nodes x ppn
n=`expr $n / 2`                  # with ppn=2 this gives the number of nodes
# --pernode launches one copy of nodemapper on each node
mpirun -np $n --hostfile $PBS_NODEFILE --pernode nodemapper
Notice that in this example I used csc100com20 and csc100com21 (the two
machines I was working on today).
The only modifications you should have to make are the nodes, the
directory, and the email address. You should already have this in your
nodemapper directory. Anyway -- give it a shot and let me know if it
creates an output file in your directory with the correct output.
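Here is the quick login check I mentioned in item 1. It is only a
sketch -- substitute the machines you actually plan to use for the host
names:
for h in csc100com20 csc100com21 ; do
    for name in $h $h.cs.mercer.edu ; do
        # BatchMode makes ssh fail instead of prompting, so any host
        # that still wants a password gets flagged here
        ssh -o BatchMode=yes $name hostname > /dev/null || echo "key problem on $name"
    done
done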
Stay safe out there. Hopefully on Thursday we can get this all
straightened out and also figure out what was going on with the CUDA
card on Steve's computer today.
--
Andrew J. Pounds, Ph.D. (pounds_aj at mercer.edu)
Professor of Chemistry and Computer Science
Mercer University, Macon, GA 31207 (478) 301-5627
http://faculty.mercer.edu/pounds_aj