<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p><font face="serif">I am VERY CONCERNED about the lack of timing
        tests that I have seen, as well as the performance of the timing
        tests that I have seen.  Some of the tests that are running in
        the queues never use more than 2 threads on a node.   So that you
        all can have a calm week and not stress about the final, I want
        to break the Hybrid MPI project in two: one portion will come in
        earlier in the week, and the second portion will come in
        for the actual final.</font></p>

    <p><font face="serif">Following my theme that accuracy matters more

        than speed, I want you to first verify that you are getting

        ACCURATE RESULTS when you break up your matrix multiplication

        problem across multiple computers and multiple threads.  For

        example -- what happens if I use an odd number of processors and

        an odd number of threads, but a matrix dimension size that is

        even -- do you still get accurate results?   In this first paper
        you have to PROVE that you are actually using the number of
        processors, and the number of threads per processor, that you claim,
        AND that you get accurate results for any combination of these. 

        I recommend doing SMALL tests here where you pick a large

        matrix, but break it across 1 to 4 processors and then use 1 to

        10 threads per processor.   Don't go for "all the marbles" with

        exhaustive tests until you are sure that your single jobs that

        run in parallel are running as you expect.   You can log into
        one of the actual nodes that is being used to see whether the code is
        running in parallel the way you think it should be. 

        Checking for accuracy should be obvious -- the trace of the
        returned matrix should equal the dimension of the matrix

        (whether it is done serially on one node, done in parallel

        across multiple threads on one node, done in parallel across

        multiple nodes with one thread each, or done across multiple

        nodes with each one using multiple threads).<br>

      </font></p>
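    <p><font face="serif">A minimal sketch of that trace check in C may
        help make it concrete.  Everything in it is an assumption for
        illustration and not taken from the project code: the function
        names (block_range, partial_trace, simulated_trace), the
        row-block split, and the test pair of matrices, which is chosen
        only because its product is the identity.  A plain loop stands
        in for the MPI ranks, and the OpenMP pragma is ignored unless
        the code is compiled with OpenMP enabled.</font></p>

```c
#include <assert.h>
#include <stdlib.h>

/* Rows [*start, *start + *count) owned by `rank` when n rows are split
   across nprocs ranks. The first (n % nprocs) ranks take one extra row,
   so uneven splits (e.g. 3 ranks, n = 10) still cover every row once. */
void block_range(int n, int nprocs, int rank, int *start, int *count)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = n / nprocs;
    int extra = n % nprocs;
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}

/* Sum of the diagonal entries of C = A*B for the rows owned by one rank.
   Only the diagonal is formed here, since that is all the trace needs.
   The index is computed from i and k, so threads share no counter. */
double partial_trace(const double *a, const double *b, int n,
                     int start, int count)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = start; i < start + count; i++) {
        double cii = 0.0;
        for (int k = 0; k < n; k++)
            cii += a[i * n + k] * b[k * n + i];
        sum += cii;
    }
    return sum;
}

/* Builds a hypothetical pair with A*B = I (A: ones on and above the
   diagonal; B: 1 on the diagonal, -1 just above it), then adds up the
   partial traces rank by rank. In a real run the sum would come from a
   reduction across ranks (for example MPI_Reduce) instead of this loop. */
double simulated_trace(int n, int nprocs)
{
    double *a = calloc((size_t)n * n, sizeof *a);
    double *b = calloc((size_t)n * n, sizeof *b);
    assert(a != NULL && b != NULL);
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++)
            a[i * n + j] = 1.0;
        b[i * n + i] = 1.0;
        if (i + 1 < n)
            b[i * n + i + 1] = -1.0;
    }
    double trace = 0.0;
    for (int rank = 0; rank < nprocs; rank++) {
        int start, count;
        block_range(n, nprocs, rank, &start, &count);
        trace += partial_trace(a, b, n, start, count);
    }
    free(a);
    free(b);
    return trace;
}
```

    <p><font face="serif">Calling simulated_trace(10, 3) exercises the
        case of an odd number of processors with an even dimension; the
        value that comes back should equal the dimension no matter how
        the rows were split.</font></p>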

    <p><font face="serif">Let's have those papers come in by TUESDAY

        NIGHT at midnight.  I will modify the CANVAS dropbox and

        assignment description.  <br>

      </font></p>

    <p><font face="serif">I will check those AS THEY COME IN.   Once I

        have given you the "okay", you can proceed to complete the final

        exam portion.</font></p>

    <p><font face="serif">The final exam portion will be essentially

        completing what is already shown for the Hybrid MPI project

        where you demonstrate what combination of processors and threads

        gives you the maximum performance.  I will modify the project

        description and post the Final exam "assignment" in CANVAS.  If

        these are parallelized correctly then they should run fast --

        and not take 60+ hours in the queues.   <br>

      </font></p>

    <p><font face="serif">I know I am pushing a project into finals week
        -- but based on what I was seeing, there was little to no chance
        of you all finishing this by Sunday night with correct results.
        By breaking it up and forcing you to think about accuracy first,
        and then the levels of parallelism, you all have a much higher
        chance of getting this done correctly and finishing strong.<br>

      </font></p>

    <p><font face="serif">Stay safe out there... and play nice!<br>

      </font></p>

    <p>p.s. -- I know that you are going to hit some snags in this

      coding -- and I'm pretty sure I know where those snags are.  I'm

      not going to give away the answers, but I will give you a hint and

      tell you that if you want the fine-grained threading to have any
      chance of working properly, you've got to dump the array access

      via the sequentially incremented bufIndex variable in the worker

      process.</p>

    <p>p.p.s. -- qstat -rn shows you the processors on which a specific

      job number is running in PBS/Torque<br>

    </p>

    <pre class="moz-signature" cols="72">-- 

Andrew J. Pounds, Ph.D.  (<a class="moz-txt-link-abbreviated" href="mailto:pounds_aj@mercer.edu">pounds_aj@mercer.edu</a>)

Professor of Chemistry and Computer Science

Director of the Computational Science Program

Mercer University,  Macon, GA 31207   (478) 301-5627

</pre>

  </body>

</html>