blob: a3d79df27d0665bffd944ef60cec29002b2705d5 [file] [log] [blame]
Some explanations on NAS Parallel Benchmarks 3.3 - Serial version
-----------------------------------------------------------------
The serial version of NPB3.x (NPB3.x-SER) is based on NPB2.3-serial
with a number of improvements (see Section 3 below) and added with
two new benchmarks (UA and DC).
For problem reports and suggestions on the implementation, please contact
NAS Parallel Benchmarks Team
npb@nas.nasa.gov
1. Compilation
NPB3.3-SER uses the same directory tree as NPB2.3.
Before compilation, one needs to check the configuration file
'make.def' in the config directory and modify the file as necessary.
If it does not (yet) exist, copy 'make.def.template' or one of the
sample files in the NAS.samples subdirectory to 'make.def' and
edit the content for site- and machine-specific data. Then
make <benchmark> CLASS=<class> [VERSION=VEC]
<benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, IS, UA, DC) and
<class> is one of (S, W, A, B, C). Although Classes D and E are also
defined for a number of benchmarks, the memory requirement and
execution time likely exceed what most of the single processor
systems can support.
Classes C, D and E are not defined for DC.
Class E is not defined for IS and UA.
The "VERSION=VEC" option is used for selecting the vectorized
versions of BT and LU.
Class D for IS (Integer Sort) requires a compiler/system that
supports the "long" type in C to be 64-bit. As examples, the SGI
MIPS compiler for the SGI Origin using the "-64" compilation flag and
the Intel compiler for IA64 are known to work.
In order to build the class E version of CG, the integer type
needs to be promoted to 64-bit, which is usually done through
compilation flag (such as "-i8" for FFLAGS in config/make.def).
To build a suite of benchmarks, one can create the file
"config/suite.def", which contains a list of executables to build.
Each line in the file contains the name of a benchmark and the class,
separated by spaces or tabs (see suite.def.template for an example).
Then
make suite
================================
The "RAND" variable in make.def
--------------------------------
Most of the NPBs use a random number generator. In two of the NPBs (FT
and EP) the computation of random numbers is included in the timed
part of the calculation, and it is important that the random number
generator be efficient. The default random number generator package
provided is called "randi8" and should be used where possible. It has
the following requirements:
randi8:
1. Uses integer*8 arithmetic. Compiler must support integer*8
2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
3. Assumes overflow bits are discarded by the hardware. In particular,
that the lowest 46 bits of a*b are always correct, even if the
result a*b is larger than 2^64.
Since randi8 may not work on all machines, we supply the following
alternatives:
randi8_safe
1. Uses integer*8 arithmetic
2. Uses the Fortran 90 IBITS intrinsic.
3. Does not make any assumptions about overflow. Should always
work correctly if compiler supports integer*8 and IBITS.
randdp
1. Uses double precision arithmetic (to simulate integer*8 operations).
Should work with any system with support for 64-bit floating
point arithmetic.
randdpvec
1. Similar to randdp but written to be easier to vectorize.
2. Execution
The executable is named <benchmark-name>.<class>.x and is placed
in the bin subdirectory (or in the directory BINDIR specified in
make.def, if you've defined it). NPB3.3-SER can be run as regular
executables without additional settings. For example:
bin/bt.A.x > BT.A_out
It runs BT Class A problem and the output is stored to BT.A_out.
Each benchmark includes a set of additional timers for profiling purpose
(reporting timing for selected code blocks). By default, these timers
are disabled. To enable the timers, create a dummy file 'timer.flag'
in the current working directory (not necessarily where the executable
is located) before running a benchmark.
3. Notes on the implementation (NPB3.0-SER)
3.1 BT
This version is optimized for memory performance. It uses much less
memory than the original version due to the size reduction of working
arrays.
Serial performance in comparison with the original NPB2.3-serial.
----------------------------------------------------------------------
Machine (Speed) Class NPB2.3-serial NPB3.0-SER
Origin2000 (250MHz) A 2162.4(77.82) 1075.2(156.51) 50.3%
T3E (300MHz) W 218.1(35.39) 117.0(65.95) 46.4%
A ~5285.5(31.84) 2836.5(59.33)
SGI R5000 (150MHz) W 549.8(14.04) 265.0(29.13) 51.8%
PPro (200MHz) W 316.8(24.36) 121.2(63.69) 61.7%
----------------------------------------------------------------------
-- memory usage (Class A):
NPB2.3 - 323MB, PBN - 46MB
----------------------------------------------------------------------
3.2 SP
This version is optimized for memory performance. The smaller dimension
in U and RHS was moved to the inner-most, which gives better cache
performance. However, the code is not as friendly to vector machines as
the original version.
Serial performance in comparison with the original NPB2.3-serial.
----------------------------------------------------------------------
Machine (Speed) Class NPB2.3-serial NPB3.0-SERerial
Origin2000 (250MHz) A 1478.3(57.51) 971.4(87.52) 34.3%
T3E (300MHz) A 3194.3(26.61) 1708.3(49.76) 46.5%
SGI R5000 (150MHz) W 1324.2(10.70) 774.1(18.31) 41.5%
PPro (200MHz) W 758.9(18.68) 449.0(31.57) 40.8%
----------------------------------------------------------------------
-- memory usage (Class A):
NPB2.3 - 82MB, PBN - 48MB
----------------------------------------------------------------------
3.3 LU and LU-hp
LU is essentially the same as the original NPB2.3-serial.
It is a good starting point for a pipeline implementation.
LU-hp contains a hyper-plane implementation of the SSOR algorithm.
The default version is 3-D hyper-plane and has worse cache performance
than LU. Six relevant routines for a 2-D hyper-plane (wave-front)
implementation are included in the subdirectory 'ver2'.
Some of the timings on a single processor:
----------------------------------------------------------------------
Class A LU LU-hp-3D LU-hp-2D
Origin2000 (250MHz) 1389.4(85.87) 1605.1(74.32) 1325.1(90.03)
----------------------------------------------------------------------
3.4 FT
Summary of changes from NPB2.3-serial
- Reduce the use of memory for big arrays by 1/3
- Random number generator is made parallelizable
3.5 CG, MG
Except for removal of some working buffers (used in the MPI
program), the implementation has the same structure as the
NPB2.3-serial.
3.6 EP
It has the same implementation as in the original NPB2.3-serial.
3.7 IS
An extra array copy in the iteration loop was eliminated in the new
version. This improved performance by about 35% on a CLASS A problem
on Origin2000 (195MHz).
Old version (NPB2.3-serial)-
Time in seconds = 9.06
Mop/s total = 9.25
New version (NPB3.0-SER)-
Time in seconds = 5.89
Mop/s total = 14.23
3.8 Timers
NPB3.x-SER includes additional timers in the seven Fortran
benchmarks. To activate these timers, create a dummy file
'timer.flag' in the directory where the program is to run.