<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p><font face="serif">Guys -- before I delete the class mailing
listserve for this semester, I wanted to thank you all for your
hard work. I think you all got a good "foundation" in high
performance computing that will hopefully serve you well in the
years to come. I say foundation because there are so many
topics that we did not get to cover. Most of you will never
utilize MPI again - but if you learn how to use it properly you
can really unlock the power of a cluster. If you ever get the
chance to work in this type of cluster arena again I encourage
you to do so and work on mastering the craft.<br>
</font></p>
<p><font face="serif">I didn't want to show you this before you got
your hybrid benchmarking project done because I didn't want you
focusing on trying to replicate my results -- but here are the
performance results for some of my MPI/OpenMP hybrid code
running over the GSC 218 cluster. In my compilations I
explicitly specified processor type, as well as L2 and L3 cache
sizes, provided additional information to MPI about how I wanted
to schedule processors, and also pinned threads to processor
sockets. In the mmm_mpi function ALL loops were completely
OpenMP parallelized (I removed any data dependencies of
bufIndex++ and re-coded for thread-safety). This sped up the
time between the MPI sends and receives. To ensure accuracy, I
constructed a well conditioned matrix, computed its inverse,
multiplied the matrix by its inverse, and summed up the
diagonal. The code worked perfectly for any dimension, cores,
and threads. </font><font face="serif"><font face="serif">For
the 8000x8000 case I got the following. </font></font></p>
<p><br>
</p>
<p><img moz-do-not-send="false"
src="cid:part1.ou4cxU8j.w4NP7w4K@mercer.edu" alt="mflops"
width="640" height="480"></p>
<p><img moz-do-not-send="false"
src="cid:part2.maw6ijDF.V3FCLv30@mercer.edu" alt="speedup"
width="640" height="480"></p>
<p>I was able to run matrices up to 20000x20000 and get similar
results. At dimensions above 20000 I started to hit machine
memory limits. There are some interesting things happening at 16
nodes that I want to investigate (I did not see those results when
I ran this code before, and I think they may be tied to the new
switching protocols in the MPI-4 build I made for the class a
couple of months ago - I did, however, verify that the timings
were correct). What most of you saw in your results were maxima
between 2 and 6 nodes and 15 to 18 processes per node, with peak
performance under 20 GFlops and speedups under 50. Some of
you saw significantly smaller performance numbers.<br>
</p>
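<p>To make the bufIndex++ comment concrete, here is a minimal sketch of
the kind of rewrite I mean. This is not my actual mmm_mpi code, and the
names (pack_strip, trace_of, sendBuf, rowStart, rowEnd) are placeholders:
the point is only that each iteration computes its own buffer offset from
the loop indices instead of incrementing a shared counter, so the loop
carries no dependency and OpenMP can divide it among threads. The second
little routine is the kind of trace check I described -- in exact
arithmetic the trace of A times its inverse is exactly n, so summing the
diagonal tells you right away whether the multiply went wrong.<br>
</p>
<pre>
/* Sketch only -- not the class code.  Names here (pack_strip, trace_of,
 * sendBuf, rowStart, rowEnd) are placeholders.                          */

/* Pack rows rowStart..rowEnd-1 of an n x n row-major matrix a into
 * sendBuf before an MPI send.  Each iteration computes its own offset,
 * so there is no shared bufIndex++ counter and the outer loop can be
 * split safely across OpenMP threads.                                   */
void pack_strip(const double *a, double *sendBuf,
                int n, int rowStart, int rowEnd)
{
    #pragma omp parallel for
    for (int i = rowStart; i < rowEnd; i++) {
        for (int j = 0; j < n; j++) {
            /* was: sendBuf[bufIndex++] = a[i*n + j];  (not thread-safe) */
            sendBuf[(i - rowStart) * n + j] = a[i * n + j];
        }
    }
}

/* Accuracy check: sum the diagonal of c = A * A^(-1).  For a
 * well-conditioned A the result should come out very close to n.        */
double trace_of(const double *c, int n)
{
    double t = 0.0;
    #pragma omp parallel for reduction(+:t)
    for (int i = 0; i < n; i++)
        t += c[i * n + i];
    return t;
}
</pre>
<p>On the launch side, the pinning I mentioned is the sort of thing you
control with the OMP_PROC_BIND and OMP_PLACES environment variables on
the OpenMP side and with your MPI launcher's mapping and binding options
(for example, --bind-to socket in Open MPI's mpirun) -- the exact flags
depend on which compiler and MPI stack you built against.<br>
</p>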
<p>I hope that the plots above convince you that there are numerous
ways to improve the scalability and performance of the codes you
wrote. Combining MPI with OpenMP adds complexity that most students
never have an opportunity to see. On top of that, new versions of
MPI come out often (I just got a notification from the developers
that they are ready for me to evaluate MPI 5 release candidate
7). What you need to remember is that HPC is a dynamic and
ever-changing field -- we only scratched the surface -- I
encourage you to NEVER STOP LEARNING! I've been a developer in
the field for almost 30 years - and I'm still learning new stuff!<br>
</p>
<p>Again, thank you all for a great semester.<br>
</p>
<p>All the best!<br>
</p>
<div class="moz-signature">-- <br>
<b><i>Andrew J. Pounds, Ph.D.</i></b><br>
<i>Professor of Chemistry and Computer Science</i><br>
<i>Director of the Computational Science Program</i><br>
<i>Mercer University, Macon, GA 31207 (478) 301-5627</i></div>
</body>
</html>