 Note: please observe that in the routine conj_grad three
 implementations of the sparse matrix-vector multiply have
 been supplied.  The default matrix-vector multiply is not
 loop unrolled.  The alternate implementations are unrolled
 to a depth of 2 and unrolled to a depth of 8.  Please
 experiment with these to find the fastest for your particular
 architecture.  If reporting timing results, any of these three may
 be used without penalty.

 Performance examples:
 The non-unrolled version of the multiply is actually slightly
 (perhaps 5%) faster on the sp2-66MHz-WN on 16 nodes than the
 unrolled-by-2 version.  On the Cray T3D the reverse is true,
 i.e., the unrolled-by-2 version is some 10% faster.
 The unrolled-by-8 version is significantly faster
 on the Cray T3D: overall speed of the code is 1.5 times faster.
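
 Illustration (not the benchmark's own code): the sketch below shows,
 in C, what "not unrolled" and "unrolled to a depth of 2" mean for a
 sparse matrix-vector multiply.  It assumes a compressed-row layout
 with 0-based indices; the names spmv_plain, spmv_unroll2, rowstr,
 colidx and a are chosen for this example only and may not match the
 identifiers used in conj_grad.

```c
#include <assert.h>

/* y = A*x, A stored by rows: the nonzeros of row j are
 * a[rowstr[j]] .. a[rowstr[j+1]-1], with column indices in colidx.
 * rowstr has nrows+1 entries (assumed layout, 0-based). */
void spmv_plain(int nrows, const int *rowstr, const int *colidx,
                const double *a, const double *x, double *y)
{
    assert(nrows >= 0);
    for (int j = 0; j < nrows; j++) {
        double sum = 0.0;
        for (int k = rowstr[j]; k < rowstr[j + 1]; k++)
            sum += a[k] * x[colidx[k]];
        y[j] = sum;
    }
}

/* Same product, inner loop unrolled to a depth of 2.
 * Two partial sums are kept so the two multiply-adds in the
 * loop body do not depend on each other. */
void spmv_unroll2(int nrows, const int *rowstr, const int *colidx,
                  const double *a, const double *x, double *y)
{
    assert(nrows >= 0);
    for (int j = 0; j < nrows; j++) {
        int k  = rowstr[j];
        int hi = rowstr[j + 1];
        double sum1 = 0.0, sum2 = 0.0;

        /* If the row has an odd number of nonzeros, peel one off
         * so the remaining count is even. */
        if ((hi - k) % 2 != 0) {
            sum1 = a[k] * x[colidx[k]];
            k++;
        }
        for (; k < hi; k += 2) {
            sum1 += a[k]     * x[colidx[k]];
            sum2 += a[k + 1] * x[colidx[k + 1]];
        }
        y[j] = sum1 + sum2;
    }
}
```

 An unrolled-by-8 variant follows the same pattern with a remainder of
 up to seven elements.  Because unrolling changes the order in which
 the terms are summed, results can differ from the plain loop in the
 last bits of floating-point precision.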