[CSC 435] Thank You (and the Final...)
Andrew J. Pounds
pounds_aj at mercer.edu
Sat May 14 15:19:06 EDT 2022
Guys -- before I delete the class mailing listserv for this semester, I
wanted to thank you all for your hard work. I think you all got a good
"foundation" in high performance computing that will hopefully serve you
well in the years to come. I say foundation because there are so many
topics that we did not get to cover. Most of you will never use MPI
again -- but if you learn to use it properly, you can really unlock the
power of a cluster. If you ever get the chance to work in this kind of
cluster environment again, I encourage you to take it and to keep
mastering the craft.
I didn't want to show you this before you got your hybrid benchmarking
project done because I didn't want you focusing on trying to replicate
my results -- but here are the performance results for some of my
MPI/OpenMP hybrid code running over the GSC 218 cluster. In my
compilations I explicitly specified the processor type and the L2 and
L3 cache sizes, told MPI how I wanted processes scheduled across the
processors, and pinned threads to processor sockets.
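
To make the "pinned threads" part concrete, here is a tiny
where-am-I-running check -- this is NOT my benchmark code, just a
sketch that assumes Open MPI, GCC, and Linux -- that prints the host,
core, rank, and thread for every piece of work. The build and launch
lines in the comment are examples only; the right mapping and binding
options depend on your hardware.

/* Quick "where am I running?" check for a hybrid MPI/OpenMP job.
 * Illustrative build and launch (your flags will differ):
 *
 *   mpicc -O3 -fopenmp -march=native affinity_check.c -o affinity_check
 *   OMP_NUM_THREADS=4 OMP_PLACES=sockets OMP_PROC_BIND=close \
 *     mpirun --map-by ppr:2:node --bind-to socket ./affinity_check
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>     /* sched_getcpu() -- glibc/Linux specific */
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* OpenMP threads live inside each rank, so ask for a threaded MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(host, &namelen);

    #pragma omp parallel
    {
        /* Each thread reports the core it is currently running on. */
        printf("host %s  rank %d/%d  thread %d/%d  core %d\n",
               host, rank, nranks,
               omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Run it once with the binding options and once without, and watch the
core numbers move -- that is usually the fastest way to convince
yourself the pinning is (or is not) doing anything.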
In the mmm_mpi function ALL loops were completely OpenMP parallelized
(I removed the data dependency on bufIndex++ and re-coded it for
thread-safety -- see the sketch below). This shortened the time between
the MPI sends and receives. To check accuracy, I constructed a
well-conditioned matrix, computed its inverse, multiplied the matrix by
its inverse, and summed the diagonal. The code worked perfectly for
every combination of dimension, core count, and thread count.
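
In case the bufIndex++ comment is cryptic: a running counter creates a
loop-carried dependency, so OpenMP cannot split the packing loop
safely. The fix is to compute each element's offset from the loop
indices instead. The little program below is only an illustration --
the names (a, sendbuf, NROWS, NCOLS) are mine, not the actual mmm_mpi
code -- but it shows the pattern.

/* Illustration of the bufIndex++ fix -- not the real mmm_mpi code.
 * Old pattern (serial only):
 *     int bufIndex = 0;
 *     for (i = 0; i < nrows; i++)
 *         for (j = 0; j < ncols; j++)
 *             sendbuf[bufIndex++] = a[i][j];
 * bufIndex++ makes every iteration depend on the one before it, so
 * the loop cannot be divided among threads.  Computing the offset
 * from (i, j) makes every iteration independent.
 */
#include <stdio.h>

int main(void)
{
    enum { NROWS = 4, NCOLS = 3 };
    double a[NROWS][NCOLS], sendbuf[NROWS * NCOLS];

    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            a[i][j] = 10.0 * i + j;

    /* Thread-safe packing: each iteration owns its own slot. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            sendbuf[i * NCOLS + j] = a[i][j];

    for (int k = 0; k < NROWS * NCOLS; k++)
        printf("%g%c", sendbuf[k], k == NROWS * NCOLS - 1 ? '\n' : ' ');
    return 0;
}

Compile with -fopenmp (without it the pragma is simply ignored and the
loop runs serially). The accuracy check is then nothing exotic: the
trace of the matrix times its inverse should come back as the
dimension N.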
For the 8000x8000 case I got the following two plots.

[Plots: MFLOPS and speedup for the 8000x8000 case -- see the attached
mflops8000.png and speedup8000.png.]
I was able to run matrices up to 20000x20000 and get similar results.
At dimensions above 20000 I started to hit machine memory limits. There
are some interesting things happening at 16 nodes that I want to
investigate (I did not see those results when I ran this code before and
I think they may be tied to new switching protocols in the MPI-4 build
I put together a couple of months ago for the class -- I did, however,
verify that the timings were correct). What most of you saw in your
results were maxima between 2 and 6 nodes and 15 to 18 processes per
node, with peak performance under 20 GFlops and speedups under 50.
Some of you saw significantly smaller performance numbers.
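
To put those numbers in perspective (counting only the multiply itself,
not the inversion or the MPI traffic): an NxN matrix multiply costs
about 2N^3 floating-point operations, so for N = 8000

    2 x 8000^3 ~= 1.0 x 10^12 flops,

which at 20 GFlops would be roughly 51 seconds of wall time for the
multiply alone. On the memory side, a single 20000x20000
double-precision matrix is

    20000^2 x 8 bytes = 3.2 GB,

and the accuracy test holds at least three of them (the matrix, its
inverse, and their product), which is why the runs above 20000 start
bumping into machine memory.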
I hope that the plots above convince you that there are numerous ways to
improve the scalability and performance of the codes you wrote.
Combining MPI with OpenMP adds complexity that most students never have an
opportunity to see. On top of that, new versions of MPI come out often
(I just got a notification from the developers that they are ready for
me to evaluate MPI 5 release candidate 7). What you need to remember is
that HPC is a topic that is dynamic and ever-changing -- we scratched
the surface -- I encourage you to NEVER STOP LEARNING! I've been a
developer in the field for almost 30 years - and I'm still learning new
stuff!
Again, thank you all for a great semester.
All the best!
--
Andrew J. Pounds, Ph.D.
Professor of Chemistry and Computer Science
Director of the Computational Science Program
Mercer University, Macon, GA 31207  (478) 301-5627