[CSC 435] Working on Distributed Environments

Andrew J. Pounds pounds_aj at mercer.edu
Tue Mar 18 18:47:56 EDT 2014


Okay -- so progress was made today.  Hopefully by the end of class on 
Thursday we will all have CUDA code running and be able to work across 
multiple machines.


When I looked at the errors you all sent me last week and went through 
the TORQUE system log files, I was able to replicate the errors you 
were seeing -- and also determined why none of the TORQUE output files 
were showing up in your directories.  The output files were not being 
copied back because the file transfers were failing, and misconfigured 
SSH authorization files are a common cause of that.  In a nutshell, 
while we can use the short names (e.g. -- csc100com01) to log into each 
of the machines remotely, TORQUE wants to see the fully qualified 
domain name (e.g. -- csc100com01.cs.mercer.edu).  So even if you have 
all of the short names in your authorized_keys file and your 
known_hosts file, the known_hosts file needs BOTH the short and the 
fully qualified version of each machine name (I may fix my 
generate-keys script to add both automatically later).
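
If any machine is missing the fully qualified entry, something along 
these lines should record both forms of the host key at once (just a 
sketch -- the two hosts listed are only examples, so substitute the 
machines you actually use):

#!/bin/sh
# Sketch: add both the short and fully qualified host keys to known_hosts.
# The hosts listed here are examples only -- use your own assignments.
for host in csc100com20 csc100com21 ; do
    ssh-keyscan $host $host.cs.mercer.edu >> ~/.ssh/known_hosts
done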

The other piece of the puzzle is that zeus sometimes has to be rebooted 
(or the TORQUE server restarted) before you can submit jobs from the 
client machines.  This was not a problem with older versions of TORQUE, 
but when I upgraded it to handle GPUs I think the patch "broke" some of 
the older pieces.  The easy fix is to submit your jobs from zeus.
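
If you want to go that route, it is just a matter of logging into zeus 
and running qsub there.  I am assuming zeus sits in the same 
cs.mercer.edu domain as the lab machines, and the directory and file 
name below are placeholders for wherever your own TORQUE script lives:

ssh zeus.cs.mercer.edu
cd ~/nodemapper
qsub nodemap.pbs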


What I need you to do:

1.  Make sure that you can log into all of the machines using both the 
short name and the fully qualified domain name without being prompted 
for a password.  If you cannot, then we need to fix the keys on the 
machines where this is a problem.
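
One quick way to check is a loop like the one below, which should run 
the hostname command on every machine under both names without ever 
stopping to ask for a password (a sketch only -- replace the host list 
with the machines you were assigned):

#!/bin/sh
# Sketch: confirm password-less logins under both the short and long names.
# BatchMode=yes makes ssh fail outright instead of prompting for a password.
for host in csc100com20 csc100com21 ; do
    ssh -o BatchMode=yes $host hostname
    ssh -o BatchMode=yes $host.cs.mercer.edu hostname
done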

2.  Verify, using two or three of the machines that you confirmed work 
in part 1, that you can run a batch job across them.  You might want to 
use a TORQUE file like this....

#!/bin/sh
# Job name as it will appear in the queue
#PBS -N NODEMAP
# Send mail on abort, begin, and end to the address below
#PBS -m abe
#PBS -M pounds_aj@mercer.edu
# Merge stderr into stdout
#PBS -j oe
# Do not keep copies of the output on the execution host
#PBS -k n
# Two cores each on csc100com21 and csc100com20, 8 hour walltime
#PBS -l nodes=1:csc100com21:ppn=2+1:csc100com20:ppn=2,walltime=8:00:00
# Export the submitting shell's environment to the job
#PBS -V
#
# This is a /bin/sh script, so use export (setenv is csh syntax);
# restrict Open MPI to the self and tcp transports
export OMPI_MCA_btl=self,tcp
# Show which hosts TORQUE assigned to the job
cat $PBS_NODEFILE
cd /home/chemist/nodemapper
# PBS_NODEFILE lists each host once per requested core (ppn=2),
# so divide the line count by 2 to get the number of nodes
n=`wc -l < $PBS_NODEFILE`
n=`expr $n / 2`
# Start one copy of nodemapper per node
mpirun -np $n --hostfile $PBS_NODEFILE --pernode nodemapper

Notice that in this example I used csc100com20 and csc100com21 (the two 
machines I was working at today).

The only modifications you should have to make are the nodes, the 
directory, and the EMAIL.  You should already have this in your 
nodemapper directory.  Anyway -- give it a shot and let me know if it 
creates an output file in your directory with the correct output.
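
Once you have edited your copy, submitting it and checking on it should 
look something like this (I am assuming you saved the script as 
nodemap.pbs -- use whatever name you gave it):

qsub nodemap.pbs
qstat -u $USER

When the job finishes, the merged output should come back to the 
directory you submitted from as a file named NODEMAP.o<jobnumber>.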

Stay safe out there.  Hopefully on Thursday we can get this all 
straightened out and also figure out what was going on with the CUDA 
card on Steve's computer today.


-- 
Andrew J. Pounds, Ph.D.  (pounds_aj at mercer.edu)
Professor of Chemistry and Computer Science
Mercer University,  Macon, GA 31207   (478) 301-5627
http://faculty.mercer.edu/pounds_aj
