| Note: please observe that in the routine conj_grad three |
| implementations of the sparse matrix-vector multiply have |
| been supplied. The default matrix-vector multiply is not |
| loop unrolled. The alternate implementations are unrolled |
| to a depth of 2 and unrolled to a depth of 8. Please |
| experiment with these to find the fastest for your particular |
| architecture. If reporting timing results, any of these three may |
| be used without penalty. |
| |
| Performance examples: |
| The non-unrolled version of the multiply is slightly faster |
| (perhaps 5%) on the sp2-66MHz-WN on 16 nodes than the |
| unrolled-by-2 version below. On the Cray T3D the reverse is true: |
| the unrolled-by-2 version is some 10% faster. |
| The unrolled-by-8 version below is significantly faster |
| on the Cray T3D: the code as a whole runs 1.5 times faster. |
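For readers unfamiliar with the technique, the difference between a plain and an unrolled-by-2 sparse matrix-vector multiply can be sketched as follows. This is a minimal illustration in C, not the code in conj_grad. It assumes a compressed-row layout, and the function names (spmv_plain, spmv_unroll2) and the argument names are hypothetical, chosen for this sketch only.

```c
#include <stddef.h>

/* Assumed compressed-row storage (illustrative names):
 *   a[k]       value of the k-th stored nonzero
 *   colidx[k]  column index of a[k]
 *   rowstr[j]  position in a[] where row j starts; rowstr[nrows] = nnz
 */

/* Plain version: one multiply-add per inner-loop iteration. */
void spmv_plain(size_t nrows, const size_t *rowstr, const size_t *colidx,
                const double *a, const double *p, double *w)
{
    for (size_t j = 0; j < nrows; j++) {
        double sum = 0.0;
        for (size_t k = rowstr[j]; k < rowstr[j + 1]; k++)
            sum += a[k] * p[colidx[k]];
        w[j] = sum;
    }
}

/* Unrolled by 2: two independent partial sums per iteration.
 * A row with an odd number of nonzeros has its first element
 * handled separately so the main loop always sees an even count. */
void spmv_unroll2(size_t nrows, const size_t *rowstr, const size_t *colidx,
                  const double *a, const double *p, double *w)
{
    for (size_t j = 0; j < nrows; j++) {
        size_t k = rowstr[j];
        size_t hi = rowstr[j + 1];
        double sum0 = 0.0, sum1 = 0.0;

        if ((hi - k) & 1) {            /* odd leftover element */
            sum0 = a[k] * p[colidx[k]];
            k++;
        }
        for (; k < hi; k += 2) {
            sum0 += a[k] * p[colidx[k]];
            sum1 += a[k + 1] * p[colidx[k + 1]];
        }
        w[j] = sum0 + sum1;
    }
}
```

Because the unrolled loop adds the products in a different order, its result may differ from the plain version in the last few bits for general floating-point data. Which variant is faster depends on the compiler and the machine, so timing each one on the target architecture is the only reliable guide.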