[CSC 435] Thank You (and the Final...)
Andrew J. Pounds
pounds_aj at mercer.edu
Sat May 14 15:19:06 EDT 2022
Guys -- before I delete the class mailing listserv for this semester, I
wanted to thank you all for your hard work. I think you all got a good
"foundation" in high performance computing that will hopefully serve you
well in the years to come. I say foundation because there are so many
topics that we did not get to cover. Most of you will never use MPI
again -- but if you learn to use it properly, you can really unlock the
power of a cluster. If you ever get the chance to work in this kind of
cluster environment again, I encourage you to take it and to keep
mastering the craft.
I didn't want to show you this before you got your hybrid benchmarking
project done because I didn't want you focusing on trying to replicate
my results -- but here are the performance results for some of my
MPI/OpenMP hybrid code running over the GSC 218 cluster. In my
compilations I explicitly specified the processor type and the L2 and
L3 cache sizes, told MPI how I wanted processes scheduled across the
processors, and pinned threads to processor sockets.
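
To make the "pinned threads" part concrete, here is a tiny
where-am-I-running check -- this is NOT my benchmark code, just a
sketch that assumes Open MPI, GCC, and Linux -- that prints the host,
core, rank, and thread for every piece of work. The build and launch
lines in the comment are examples only; the right mapping and binding
options depend on your hardware.

/* Quick "where am I running?" check for a hybrid MPI/OpenMP job.
 * Illustrative build and launch (your flags will differ):
 *
 *   mpicc -O3 -fopenmp -march=native affinity_check.c -o affinity_check
 *   OMP_NUM_THREADS=4 OMP_PLACES=sockets OMP_PROC_BIND=close \
 *     mpirun --map-by ppr:2:node --bind-to socket ./affinity_check
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>     /* sched_getcpu() -- glibc/Linux specific */
#include <omp.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks, namelen;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* OpenMP threads live inside each rank, so ask for a threaded MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(host, &namelen);

    #pragma omp parallel
    {
        /* Each thread reports the core it is currently running on. */
        printf("host %s  rank %d/%d  thread %d/%d  core %d\n",
               host, rank, nranks,
               omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Run it once with the binding options and once without, and watch the
core numbers move -- that is usually the fastest way to convince
yourself the pinning is (or is not) doing anything.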
In the mmm_mpi function ALL loops were completely OpenMP parallelized
(I removed the data dependency on bufIndex++ and re-coded it for
thread-safety -- see the sketch below). This shortened the time between
the MPI sends and receives. To check accuracy, I constructed a
well-conditioned matrix, computed its inverse, multiplied the matrix by
its inverse, and summed the diagonal. The code worked perfectly for
every combination of dimension, core count, and thread count.
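
In case the bufIndex++ comment is cryptic: a running counter creates a
loop-carried dependency, so OpenMP cannot split the packing loop
safely. The fix is to compute each element's offset from the loop
indices instead. The little program below is only an illustration --
the names (a, sendbuf, NROWS, NCOLS) are mine, not the actual mmm_mpi
code -- but it shows the pattern.

/* Illustration of the bufIndex++ fix -- not the real mmm_mpi code.
 * Old pattern (serial only):
 *     int bufIndex = 0;
 *     for (i = 0; i < nrows; i++)
 *         for (j = 0; j < ncols; j++)
 *             sendbuf[bufIndex++] = a[i][j];
 * bufIndex++ makes every iteration depend on the one before it, so
 * the loop cannot be divided among threads.  Computing the offset
 * from (i, j) makes every iteration independent.
 */
#include <stdio.h>

int main(void)
{
    enum { NROWS = 4, NCOLS = 3 };
    double a[NROWS][NCOLS], sendbuf[NROWS * NCOLS];

    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            a[i][j] = 10.0 * i + j;

    /* Thread-safe packing: each iteration owns its own slot. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            sendbuf[i * NCOLS + j] = a[i][j];

    for (int k = 0; k < NROWS * NCOLS; k++)
        printf("%g%c", sendbuf[k], k == NROWS * NCOLS - 1 ? '\n' : ' ');
    return 0;
}

Compile with -fopenmp (without it the pragma is simply ignored and the
loop runs serially). The accuracy check is then nothing exotic: the
trace of the matrix times its inverse should come back as the
dimension N.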
For the 8000x8000 case I got the following two plots.

[Plots: MFLOPS and speedup for the 8000x8000 case -- see the attached
mflops8000.png and speedup8000.png.]
I was able to run matrices up to 20000x20000 and get similar results.
At dimensions above 20000 I started to hit machine memory limits. There
are some interesting things happening at 16 nodes that I want to
investigate (I did not see those results when I ran this code before and
I think they may be tied to new switching protocols in the MPI-4 build
I put together a couple of months ago for the class -- I did, however,
verify that the timings were correct). What most of you saw in your
results were maxima between 2 and 6 nodes and 15 to 18 processes per
node, with peak performance under 20 GFlops and speedups under 50.
Some of you saw significantly smaller performance numbers.
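
To put those numbers in perspective (counting only the multiply itself,
not the inversion or the MPI traffic): an NxN matrix multiply costs
about 2N^3 floating-point operations, so for N = 8000

    2 x 8000^3 ~= 1.0 x 10^12 flops,

which at 20 GFlops would be roughly 51 seconds of wall time for the
multiply alone. On the memory side, a single 20000x20000
double-precision matrix is

    20000^2 x 8 bytes = 3.2 GB,

and the accuracy test holds at least three of them (the matrix, its
inverse, and their product), which is why the runs above 20000 start
bumping into machine memory.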
I hope that the plots above convince you that there are numerous ways to
improve the scalability and performance of the codes you wrote.
Combining MPI with OpenMP adds complexity that most students never have an
opportunity to see. On top of that, new versions of MPI come out often
(I just got a notification from the developers that they are ready for
me to evaluate MPI 5 release candidate 7). What you need to remember is
that HPC is a topic that is dynamic and ever-changing -- we scratched
the surface -- I encourage you to NEVER STOP LEARNING! I've been a
developer in the field for almost 30 years - and I'm still learning new
stuff!
Again, thank you all for a great semester.
All the best!
--
Andrew J. Pounds, Ph.D.
Professor of Chemistry and Computer Science
Director of the Computational Science Program
Mercer University, Macon, GA 31207  (478) 301-5627