<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<font face="serif">Okay -- so progress was made today. Hopefully by
the end of class on Thursday we all have CUDA code running and can
work across multiple machines. <br>
<br>
</font><br>
<font face="serif"><font face="serif">When I looked at the errors
you all sent me last week and when I went through the torque
system log files I was able to replicate the errors you were
seeing -- and also determined the reason why none of the TORQUE
output files were showing up in your directories. It was due to
errors in file transmissions. Misconfigured authorization files
can be the cause of this. </font></font><font face="serif"><font
face="serif"><font face="serif">In a nutshell, while we can use
the short names (e.g. -- csc100com01) to log into each of the
machines remotely, TORQUE wants to see the fully qualified
domain name (e.g. -- csc100com01.cs.mercer.edu). While you
may have all of the short names in your authorized_keys file
and in your known_hosts file, the known_hosts file needs to
have BOTH the short and long version of the machine name (I
may fix my generate keys file later to do both automatically
later). <br>
<br>
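A quick way to take care of that (and to check the logins at the
same time) is a little loop like the one below. This is just a
sketch on my part -- adjust the machine list to whichever nodes you
are using, and skip the ssh-keyscan line if you would rather accept
the host keys by hand:<br>
<pre># Record BOTH the short and fully qualified host keys in known_hosts,
# then confirm that a passwordless login works with each form of the name.
for m in csc100com20 csc100com21 ; do
    ssh-keyscan $m $m.cs.mercer.edu >> ~/.ssh/known_hosts
    ssh $m hostname
    ssh $m.cs.mercer.edu hostname
done
</pre>
If either ssh command prompts you for a password, that machine still
needs its keys fixed.<br>
<br>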
The other piece of the puzzle is that zeus sometimes has to be
rebooted (or the TORQUE server restarted) to enable you to submit
jobs from the client machines. This was not a problem with older
versions of TORQUE, but when I upgraded it to handle GPUs I think
the patch "broke" some of the older pieces. The easy fix is to
submit your jobs from zeus.<br>
<br>
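In practice that just means something like the following (I am
assuming here that you saved the batch script below as nodemap.pbs
in your nodemapper directory -- use whatever file name you actually
gave it):<br>
<pre>ssh zeus              # submit from the server itself
cd ~/nodemapper
qsub nodemap.pbs      # qsub prints the job id, e.g. 123.zeus
qstat                 # check on the job while it runs
</pre>
<br>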
<br>
What I need you to do:<br>
<br>
1. Make sure that you can log into all of the machines using both
the short name and the fully qualified domain name without using a
password (the little loop above will shake this out quickly). If
you cannot, then we need to fix the keys for the machines where
this is a problem.<br>
<br>
2. Verify, using two or three of the machines that you confirmed
work in part 1, that you can run a batch job across them. You
might want to use a TORQUE file like this....<br>
<br>
</font><pre>#!/bin/sh
#PBS -N NODEMAP
#PBS -m abe
#PBS -M <a class="moz-txt-link-abbreviated" href="mailto:pounds_aj@mercer.edu">pounds_aj@mercer.edu</a>
#PBS -j oe
#PBS -k n
#PBS -l nodes=1:csc100com21:ppn=2+1:csc100com20:ppn=2,walltime=8:00:00
#PBS -V
#
# Restrict Open MPI to the tcp and self transports.
OMPI_MCA_btl=self,tcp
export OMPI_MCA_btl
cat $PBS_NODEFILE
cd /home/chemist/nodemapper
# The nodefile lists each node once per processor (ppn=2), so divide
# by 2 to get one MPI process per node.
n=`wc -l &lt; $PBS_NODEFILE`
n=`expr $n / 2`
mpirun -np $n --hostfile $PBS_NODEFILE --pernode nodemapper
</pre><font face="serif"><br>
<br>
Notice that in this example I used csc100com20 and csc100com21
(the two machines I was working at today).<br>
<br>
The only modifications you should have to make are the nodes, the
directory, and the EMAIL address. You should already have this file
in your nodemapper directory. Anyway -- give it a shot and let me
know if it creates an output file in your directory with the correct
output.<br>
<br>
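For reference, here is roughly what a successful run looks like from
your end (the job number will differ, of course):<br>
<pre>qstat                 # wait for the job to finish and leave the queue
ls NODEMAP.o*         # the joined output/error file lands in the directory
                      # you submitted the job from
cat NODEMAP.o123      # should show the contents of $PBS_NODEFILE followed
                      # by the nodemapper output
</pre>
You should also get the "begin" and "end" e-mails that the -m abe
line requests.<br>
<br>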
Stay safe out there. Hopefully on Thursday we can get this all
straightened out and also figure out what was going on with the
CUDA card on Steve's computer today.<br>
<br>
<br>
</font>
<pre class="moz-signature" cols="72">--
Andrew J. Pounds, Ph.D. (<a class="moz-txt-link-abbreviated" href="mailto:pounds_aj@mercer.edu">pounds_aj@mercer.edu</a>)
Professor of Chemistry and Computer Science
Mercer University, Macon, GA 31207 (478) 301-5627
<a class="moz-txt-link-freetext" href="http://faculty.mercer.edu/pounds_aj">http://faculty.mercer.edu/pounds_aj</a>
</pre>
</body>
</html>