[CSC 435] CSC 435 BBLAS Question

Andrew J. Pounds pounds_aj at mercer.edu
Fri Apr 17 10:48:45 EDT 2020


Will -- are you working on hammer?  Hammer should show a degradation
after 24 processors.

I wish it were that simple.  Hammer has three Intel Xeon E5-2650
CPUs.  Each of these has 8 cores and can accommodate up to 16 threads.
There are in-processor algorithms and OS-based algorithms (as well as
your own code) that determine when to allow for threading and when to
spread the work to another set of cores to avoid threading.  On a single
CPU the 8 cores should have their own cache lines (I have not looked at
the technical drawings for the E5-2650).  When the cores on a single CPU
start threading, they have to share the cache, which can degrade
performance.
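
To make the arithmetic concrete, here is a minimal sketch using the
figures above (3 CPUs, 8 cores each, 16 threads per CPU).  The names
Node and best_case_sharing are made up for illustration, and the
classification ASSUMES the scheduler fills every physical core before
putting a second thread on any of them, which is the best case and not
something the hardware or OS guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a compute node's CPU layout."""
    sockets: int            # number of physical CPUs
    cores_per_socket: int   # physical cores on each CPU
    threads_per_core: int   # hardware threads each core can run

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def hardware_threads(self) -> int:
        return self.physical_cores * self.threads_per_core


def best_case_sharing(node: Node, nthreads: int) -> str:
    """Classify a thread count, ASSUMING every physical core is filled
    before any core is asked to run a second thread."""
    if nthreads <= node.physical_cores:
        return "one thread per core"
    if nthreads <= node.hardware_threads:
        return "some cores run more than one thread and share cache"
    return "more threads than the hardware can run at once"


# Figures from this message: 3 CPUs, 8 cores each, 16 threads per CPU.
hammer = Node(sockets=3, cores_per_socket=8, threads_per_core=2)

if __name__ == "__main__":
    print(hammer.physical_cores, hammer.hardware_threads)
    for n in (20, 24, 25, 48, 49):
        print(n, best_case_sharing(hammer, n))
```

This is only the best case.  Because the processor and the OS decide
where threads actually land, two threads can end up on one core before
all 24 cores are busy, so a drop-off can show up earlier than 24.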

Competing for memory is a different issue altogether, but is tied to the
ability of the cache lines to pull from memory quickly and align the
data for processing. If you start to thread then you could have two
threads pulling from different sections of memory that are not at all
aligned.  This would cause a performance hit.  Depending on how the
memory is constructed there will be varying degrees of a performance hit
as well.   For example, if you had a memory structure that was
completely random access and every single memory element was directly
accessible, there would be less of a hit than if you had a system in
which your code had to jump to a certain section of memory and then
reference from the pointer to that section of memory.  Think about it
this way -- if you had 64 GB of memory you could have all of it on one
memory stick or you could spread it across 4 sticks with 16 GB
each.  The first option would be FASTER because all of your memory is on
a single chip.  The second would require going through the computer bus
to access portions and would therefore be slower.  So why don't we just all
use the first method?  Because it is MUCH more expensive.
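
The earlier point about pulling from sections of memory that are not
aligned can be illustrated with a toy count of how many cache lines a
loop touches.  This is a rough sketch, not a measurement: it ASSUMES
64-byte cache lines, 8-byte doubles, and an array that starts on a
cache-line boundary, and the function name cache_lines_touched is made
up for illustration.

```python
def cache_lines_touched(n_elements: int, stride: int,
                        elem_bytes: int = 8, line_bytes: int = 64) -> int:
    """Count the distinct cache lines hit when reading n_elements values,
    stepping `stride` elements each time, starting at byte offset 0.

    elem_bytes=8 (a double) and line_bytes=64 are assumptions.
    """
    lines = set()
    for i in range(n_elements):
        byte_offset = i * stride * elem_bytes
        lines.add(byte_offset // line_bytes)
    return len(lines)


if __name__ == "__main__":
    n = 1024
    # Contiguous doubles: 8 values share each cache line.
    print(cache_lines_touched(n, stride=1))
    # Every 8th double: each value sits on its own cache line.
    print(cache_lines_touched(n, stride=8))
```

Under those assumptions the same 1024 reads need 128 cache lines when
the data is contiguous and 1024 when it is strided by 8, so the cache
has to pull far more from memory to do the same amount of work.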

Your basic premise is correct -- but I wanted to clarify.

Hope that helps.


   

On 4/16/20 10:06 PM, William Carl Baglivio wrote:
> Hey Dr. Pounds,
>
> I want to get the semantics right on this: the reason why the speedup
> declines after 20 processors is that there are 2 threads per core, so
> when there is more than 1 thread per core, they have to compete for
> memory. Am I getting that right? Anything I missed out on?
>
> ~Will B.


-- 
Andrew J. Pounds, Ph.D.  (pounds_aj at mercer.edu)
Professor of Chemistry and Computer Science
Director of the Computational Science Program
Mercer University,  Macon, GA 31207   (478) 301-5627
