<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 04/01/2016 09:36 PM,  wrote:<br>

    </div>

    <blockquote

cite="mid:BY2PR01MB178384310024E789AD46BBE7EF9A0@BY2PR01MB1783.prod.exchangelabs.com"

      type="cite">

      <meta http-equiv="Content-Type" content="text/html;

        charset=windows-1252">

      <style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>

      <div id="divtagdefaultwrapper"

style="font-size:12pt;color:#000000;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">

        <p>Dr. Pounds:</p>

        <p> I have changed the OpenMP to have the pragma: </p>

        <div>#pragma omp parallel shared(N,maxiter,tol,a,b,x,XT,i,k,err)

          private(err,olderr,S,j,l,m) reduction(+ : err). Would this

          remove the dependency upon S and keep the while loop and outer

          for loop parallelized while letting the processor vectorize

          the insides of the loops? Sorry about all the trouble, but

          data dependency and parallelization have been giving me

          lots of trouble.</div>

        <div>Sincerely,</div>

        <br>

      </div>

    </blockquote>

    <font face="serif">Or you could try something like this...<br>

      <br>

      <font face="Courier New, Courier, monospace">#pragma omp parallel

        for shared(A,B,S,X,XT,N) private(i,j) <br>

                for (i=0; i&lt;N; i++) {<br>

                    *(S+i) = 0.0;<br>

                    for (j=0  ; j&lt;i; j++ ) *(S+i) = *(S+i) +

        *(A+i*N+j) * *(X+j);<br>

                    for (j=i+1; j&lt;N; j++ ) *(S+i) = *(S+i) +

        *(A+i*N+j) * *(X+j);<br>

                    *(S+i) = (*(B+i) - *(S+i))/ *(A+i*N+i);<br>

                    *(XT+i) = *(S+i) ;<br>

                }<br>

        <br>

                for (i=0; i&lt;N; i++) err = fmax(fabs(*(S+i)),err);</font><br>

      <br>

    </font><br>

    This splits the iterations of the first loop (over i) into chunks,

    typically one chunk per thread under the default schedule, and

    completely removes any data dependence between the threads: each

    iteration writes only its own S[i] and XT[i] and only reads A, B,

    and X.  The inner loops over j will not be parallelized themselves;

    each thread runs them serially for the rows it was given.  The last

    for loop (which updates err) will run serially.  You could

    parallelize that too, but I think the speedup would be minimal.<br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Andrew J. Pounds, Ph.D.  (<a class="moz-txt-link-abbreviated" href="mailto:pounds_aj@mercer.edu">pounds_aj@mercer.edu</a>)

Professor of Chemistry and Computer Science

Mercer University,  Macon, GA 31207   (478) 301-5627

<a class="moz-txt-link-freetext" href="http://faculty.mercer.edu/pounds_aj">http://faculty.mercer.edu/pounds_aj</a>

</pre>

  </body>

</html>