src/npb/disk-image/npb/npb-hooks/NPB3.3.1/NPB3.3-SER/README.install - public/gem5-resources - Git at Google

 Some explanations on NAS Parallel Benchmarks 3.3 - Serial version
 -----------------------------------------------------------------

 The serial version of NPB3.x (NPB3.x-SER) is based on NPB2.3-serial
 with a number of improvements (see Section 3 below) and added with
 two new benchmarks (UA and DC).

 For problem reports and suggestions on the implementation, please contact

    NAS Parallel Benchmarks Team
    npb@nas.nasa.gov


 1. Compilation

    NPB3.3-SER uses the same directory tree as NPB2.3.
    Before compilation, one needs to check the configuration file
    'make.def' in the config directory and modify the file as necessary.
    If it does not (yet) exist, copy 'make.def.template' or one of the
    sample files in the NAS.samples subdirectory to 'make.def' and
    edit the content for site- and machine-specific data.  Then

       make <benchmark> CLASS=<class> [VERSION=VEC]

    <benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, IS, UA, DC) and
    <class> is one of (S, W, A, B, C).  Although Classes D and E are also
    defined for a number of benchmarks, the memory requirement and
    execution time likely exceed what most of the single processor
    systems can support.

    Classes C, D and E are not defined for DC.
    Class E is not defined for IS and UA.

    The "VERSION=VEC" option is used for selecting the vectorized
    versions of BT and LU.

    Class D for IS (Integer Sort) requires a compiler/system that
    supports the "long" type in C to be 64-bit.  As examples, the SGI
    MIPS compiler for the SGI Origin using the "-64" compilation flag and
    the Intel compiler for IA64 are known to work.

    In order to build the class E version of CG, the integer type
    needs to be promoted to 64-bit, which is usually done through
    compilation flag (such as "-i8" for FFLAGS in config/make.def).

    To build a suite of benchmarks, one can create the file
    "config/suite.def", which contains a list of executables to build.
    Each line in the file contains the name of a benchmark and the class,
    separated by spaces or tabs (see suite.def.template for an example).
    Then

       make suite


    ================================

    The "RAND" variable in make.def
    --------------------------------

    Most of the NPBs use a random number generator. In two of the NPBs (FT
    and EP) the computation of random numbers is included in the timed
    part of the calculation, and it is important that the random number
    generator be efficient.  The default random number generator package
    provided is called "randi8" and should be used where possible. It has
    the following requirements:

    randi8:
      1. Uses integer*8 arithmetic. Compiler must support integer*8
      2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
      3. Assumes overflow bits are discarded by the hardware. In particular,
         that the lowest 46 bits of a*b are always correct, even if the
         result a*b is larger than 2^64.

    Since randi8 may not work on all machines, we supply the following
    alternatives:

    randi8_safe
      1. Uses integer*8 arithmetic
      2. Uses the Fortran 90 IBITS intrinsic.
      3. Does not make any assumptions about overflow. Should always
         work correctly if compiler supports integer*8 and IBITS.

    randdp
      1. Uses double precision arithmetic (to simulate integer*8 operations).
         Should work with any system with support for 64-bit floating
         point arithmetic.

    randdpvec
      1. Similar to randdp but written to be easier to vectorize.


 2. Execution

    The executable is named <benchmark-name>.<class>.x and is placed
    in the bin subdirectory (or in the directory BINDIR specified in
    make.def, if you've defined it).  NPB3.3-SER can be run as regular
    executables without additional settings.  For example:

       bin/bt.A.x > BT.A_out

    It runs BT Class A problem and the output is stored to BT.A_out.

    Each benchmark includes a set of additional timers for profiling purpose
    (reporting timing for selected code blocks).  By default, these timers
    are disabled.  To enable the timers, create a dummy file 'timer.flag'
    in the current working directory (not necessarily where the executable
    is located) before running a benchmark.


 3. Notes on the implementation (NPB3.0-SER)

 3.1 BT

 This version is optimized for memory performance.  It uses much less
 memory than the original version due to the size reduction of working
 arrays.

 Serial performance in comparison with the original NPB2.3-serial.
 ----------------------------------------------------------------------
 Machine	(Speed) 	Class	NPB2.3-serial	NPB3.0-SER
 Origin2000 (250MHz)	A	2162.4(77.82)	1075.2(156.51)	50.3%
 T3E (300MHz)    	W	218.1(35.39)	117.0(65.95)	46.4%
                 	A	~5285.5(31.84)	2836.5(59.33)
 SGI R5000 (150MHz)	W	549.8(14.04)	265.0(29.13)	51.8%
 PPro (200MHz)		W	316.8(24.36)	121.2(63.69)	61.7%
 ----------------------------------------------------------------------
 -- memory usage (Class A):
       	 NPB2.3 - 323MB, PBN - 46MB
 ----------------------------------------------------------------------

 3.2 SP

 This version is optimized for memory performance.  The smaller dimension
 in U and RHS was moved to the inner-most, which gives better cache
 performance.  However, the code is not as friendly to vector machines as
 the original version.

 Serial performance in comparison with the original NPB2.3-serial.
 ----------------------------------------------------------------------
 Machine	(Speed) 	Class	NPB2.3-serial	NPB3.0-SERerial
 Origin2000 (250MHz)	A	1478.3(57.51)	971.4(87.52)	34.3%
 T3E (300MHz)    	A	3194.3(26.61)	1708.3(49.76)	46.5%
 SGI R5000 (150MHz)	W	1324.2(10.70)	774.1(18.31)	41.5%
 PPro (200MHz)		W	758.9(18.68)	449.0(31.57)	40.8%
 ----------------------------------------------------------------------
 -- memory usage (Class A):
       	 NPB2.3 - 82MB, PBN - 48MB
 ----------------------------------------------------------------------

 3.3 LU and LU-hp

 LU is essentially the same as the original NPB2.3-serial.
 It is a good starting point for a pipeline implementation.

 LU-hp contains a hyper-plane implementation of the SSOR algorithm.
 The default version is 3-D hyper-plane and has worse cache performance
 than LU.  Six relevant routines for a 2-D hyper-plane (wave-front)
 implementation are included in the subdirectory 'ver2'.

 Some of the timings on a single processor:
 ----------------------------------------------------------------------
 Class A                	LU      	LU-hp-3D	LU-hp-2D
 Origin2000 (250MHz)	1389.4(85.87)	1605.1(74.32)	1325.1(90.03)
 ----------------------------------------------------------------------

 3.4 FT

 Summary of changes from NPB2.3-serial

 - Reduce the use of memory for big arrays by 1/3
 - Random number generator is made parallelizable

 3.5 CG, MG

 Except for removal of some working buffers (used in the MPI
 program), the implementation has the same structure as the
 NPB2.3-serial.

 3.6 EP

 It has the same implementation as in the original NPB2.3-serial.

 3.7 IS

 An extra array copy in the iteration loop was eliminated in the new
 version.  This improved performance by about 35% on a CLASS A problem
 on Origin2000 (195MHz).

 Old version (NPB2.3-serial)-
  Time in seconds =                     9.06
  Mop/s total     =                     9.25

 New version (NPB3.0-SER)-
  Time in seconds =                     5.89
  Mop/s total     =                    14.23


 3.8 Timers

 NPB3.x-SER includes additional timers in the seven Fortran
 benchmarks.  To activate these timers, create a dummy file
 'timer.flag' in the directory where the program is to run.
	Some explanations on NAS Parallel Benchmarks 3.3 - Serial version
	-----------------------------------------------------------------

	The serial version of NPB3.x (NPB3.x-SER) is based on NPB2.3-serial
	with a number of improvements (see Section 3 below) and added with
	two new benchmarks (UA and DC).

	For problem reports and suggestions on the implementation, please contact

	NAS Parallel Benchmarks Team
	npb@nas.nasa.gov


	1. Compilation

	NPB3.3-SER uses the same directory tree as NPB2.3.
	Before compilation, one needs to check the configuration file
	'make.def' in the config directory and modify the file as necessary.
	If it does not (yet) exist, copy 'make.def.template' or one of the
	sample files in the NAS.samples subdirectory to 'make.def' and
	edit the content for site- and machine-specific data. Then

	make <benchmark> CLASS=<class> [VERSION=VEC]

	<benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, IS, UA, DC) and
	<class> is one of (S, W, A, B, C). Although Classes D and E are also
	defined for a number of benchmarks, the memory requirement and
	execution time likely exceed what most of the single processor
	systems can support.

	Classes C, D and E are not defined for DC.
	Class E is not defined for IS and UA.

	The "VERSION=VEC" option is used for selecting the vectorized
	versions of BT and LU.

	Class D for IS (Integer Sort) requires a compiler/system that
	supports the "long" type in C to be 64-bit. As examples, the SGI
	MIPS compiler for the SGI Origin using the "-64" compilation flag and
	the Intel compiler for IA64 are known to work.

	In order to build the class E version of CG, the integer type
	needs to be promoted to 64-bit, which is usually done through
	compilation flag (such as "-i8" for FFLAGS in config/make.def).

	To build a suite of benchmarks, one can create the file
	"config/suite.def", which contains a list of executables to build.
	Each line in the file contains the name of a benchmark and the class,
	separated by spaces or tabs (see suite.def.template for an example).
	Then

	make suite


	================================

	The "RAND" variable in make.def
	--------------------------------

	Most of the NPBs use a random number generator. In two of the NPBs (FT
	and EP) the computation of random numbers is included in the timed
	part of the calculation, and it is important that the random number
	generator be efficient. The default random number generator package
	provided is called "randi8" and should be used where possible. It has
	the following requirements:

	randi8:
	1. Uses integer8 arithmetic. Compiler must support integer8
	2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
	3. Assumes overflow bits are discarded by the hardware. In particular,
	that the lowest 46 bits of a*b are always correct, even if the
	result a*b is larger than 2^64.

	Since randi8 may not work on all machines, we supply the following
	alternatives:

	randi8_safe
	1. Uses integer*8 arithmetic
	2. Uses the Fortran 90 IBITS intrinsic.
	3. Does not make any assumptions about overflow. Should always
	work correctly if compiler supports integer*8 and IBITS.

	randdp
	1. Uses double precision arithmetic (to simulate integer*8 operations).
	Should work with any system with support for 64-bit floating
	point arithmetic.

	randdpvec
	1. Similar to randdp but written to be easier to vectorize.


	2. Execution

	The executable is named <benchmark-name>.<class>.x and is placed
	in the bin subdirectory (or in the directory BINDIR specified in
	make.def, if you've defined it). NPB3.3-SER can be run as regular
	executables without additional settings. For example:

	bin/bt.A.x > BT.A_out

	It runs BT Class A problem and the output is stored to BT.A_out.

	Each benchmark includes a set of additional timers for profiling purpose
	(reporting timing for selected code blocks). By default, these timers
	are disabled. To enable the timers, create a dummy file 'timer.flag'
	in the current working directory (not necessarily where the executable
	is located) before running a benchmark.


	3. Notes on the implementation (NPB3.0-SER)

	3.1 BT

	This version is optimized for memory performance. It uses much less
	memory than the original version due to the size reduction of working
	arrays.

	Serial performance in comparison with the original NPB2.3-serial.
	----------------------------------------------------------------------
	Machine (Speed) Class NPB2.3-serial NPB3.0-SER
	Origin2000 (250MHz) A 2162.4(77.82) 1075.2(156.51) 50.3%
	T3E (300MHz) W 218.1(35.39) 117.0(65.95) 46.4%
	A ~5285.5(31.84) 2836.5(59.33)
	SGI R5000 (150MHz) W 549.8(14.04) 265.0(29.13) 51.8%
	PPro (200MHz) W 316.8(24.36) 121.2(63.69) 61.7%
	----------------------------------------------------------------------
	-- memory usage (Class A):
	NPB2.3 - 323MB, PBN - 46MB
	----------------------------------------------------------------------

	3.2 SP

	This version is optimized for memory performance. The smaller dimension
	in U and RHS was moved to the inner-most, which gives better cache
	performance. However, the code is not as friendly to vector machines as
	the original version.

	Serial performance in comparison with the original NPB2.3-serial.
	----------------------------------------------------------------------
	Machine (Speed) Class NPB2.3-serial NPB3.0-SERerial
	Origin2000 (250MHz) A 1478.3(57.51) 971.4(87.52) 34.3%
	T3E (300MHz) A 3194.3(26.61) 1708.3(49.76) 46.5%
	SGI R5000 (150MHz) W 1324.2(10.70) 774.1(18.31) 41.5%
	PPro (200MHz) W 758.9(18.68) 449.0(31.57) 40.8%
	----------------------------------------------------------------------
	-- memory usage (Class A):
	NPB2.3 - 82MB, PBN - 48MB
	----------------------------------------------------------------------

	3.3 LU and LU-hp

	LU is essentially the same as the original NPB2.3-serial.
	It is a good starting point for a pipeline implementation.

	LU-hp contains a hyper-plane implementation of the SSOR algorithm.
	The default version is 3-D hyper-plane and has worse cache performance
	than LU. Six relevant routines for a 2-D hyper-plane (wave-front)
	implementation are included in the subdirectory 'ver2'.

	Some of the timings on a single processor:
	----------------------------------------------------------------------
	Class A LU LU-hp-3D LU-hp-2D
	Origin2000 (250MHz) 1389.4(85.87) 1605.1(74.32) 1325.1(90.03)
	----------------------------------------------------------------------

	3.4 FT

	Summary of changes from NPB2.3-serial

	- Reduce the use of memory for big arrays by 1/3
	- Random number generator is made parallelizable

	3.5 CG, MG

	Except for removal of some working buffers (used in the MPI
	program), the implementation has the same structure as the
	NPB2.3-serial.

	3.6 EP

	It has the same implementation as in the original NPB2.3-serial.

	3.7 IS

	An extra array copy in the iteration loop was eliminated in the new
	version. This improved performance by about 35% on a CLASS A problem
	on Origin2000 (195MHz).

	Old version (NPB2.3-serial)-
	Time in seconds = 9.06
	Mop/s total = 9.25

	New version (NPB3.0-SER)-
	Time in seconds = 5.89
	Mop/s total = 14.23


	3.8 Timers

	NPB3.x-SER includes additional timers in the seven Fortran
	benchmarks. To activate these timers, create a dummy file
	'timer.flag' in the directory where the program is to run.