blob: 9a7a4231e710c6bed87bccd96346f77c5e62edc1 [file] [log] [blame]
Some explanations on the OpenMP implementation of NPB 3.3 (NPB3.3-OMP)
----------------------------------------------------------------------
NPB-OMP is a sample OpenMP implementation based on NPB3-SER,
the sequential implementation of the NAS Parallel Benchmarks.
This implementation (NPB3.3-OMP) contains all ten benchmarks:
eight in Fortran: BT, SP, LU, FT, CG, MG, EP, and UA; two in C: IS
and DC. Starting in version 3.3, only the pipeline OpenMP
implementation of LU is included.
For changes from different versions, see the Changes.log file
included in the upper directory of this distribution.
This version has been tested, among others, on an SGI Origin3000 and
an SGI Altix. For problem reports and suggestions on the implementation,
please contact
NAS Parallel Benchmark Team
npb@nas.nasa.gov
1. Compilation
NPB3.x-OMP uses the same directory tree as NPB3.x-SER (and NPB2.x) does.
Before compilation, one needs to check the configuration file
'make.def' in the config directory and modify the file if necessary.
If it does not (yet) exist, copy 'make.def.template' or one of the
sample files in the NAS.samples subdirectory to 'make.def' and
edit the content for site- and machine-specific data. Then
make <benchmark> CLASS=<class> [VERSION=VEC]
<benchmark> is one of (BT, SP, LU, FT, CG, MG, EP, UA, IS, DC)
and <class> is one of (S, W, A, B, C, D, E), except that classes C,
D, and E are not defined for DC and class E is not defined for IS and UA.
The "VERSION=VEC" option is used for selecting the vectorized
versions of BT and LU.
Class D for IS (Integer Sort) requires a compiler/system that
supports the "long" type in C to be 64-bit. As examples, the SGI
MIPS compiler for the SGI Origin using the "-64" compilation flag and
the Intel compiler for IA64 are known to work.
In order to build the class E version of CG, the integer type
needs to be promoted to 64-bit, which is usually done through
compilation flag (such as "-i8" for FFLAGS in config/make.def).
To build a suite of benchmarks, one can create the file
"config/suite.def", which contains a list of executables to build.
Each line in the file contains the name of a benchmark and the class,
separated by spaces or tabs (see suite.def.template for an example).
Then
make suite
================================
The "RAND" variable in make.def
--------------------------------
Most of the NPBs use a random number generator. In two of the NPBs (FT
and EP) the computation of random numbers is included in the timed
part of the calculation, and it is important that the random number
generator be efficient. The default random number generator package
provided is called "randi8" and should be used where possible. It has
the following requirements:
randi8:
1. Uses integer*8 arithmetic. Compiler must support integer*8
2. Uses the Fortran 90 IAND intrinsic. Compiler must support IAND.
3. Assumes overflow bits are discarded by the hardware. In particular,
that the lowest 46 bits of a*b are always correct, even if the
result a*b is larger than 2^64.
Since randi8 may not work on all machines, we supply the following
alternatives:
randi8_safe
1. Uses integer*8 arithmetic
2. Uses the Fortran 90 IBITS intrinsic.
3. Does not make any assumptions about overflow. Should always
work correctly if compiler supports integer*8 and IBITS.
randdp
1. Uses double precision arithmetic (to simulate integer*8 operations).
Should work with any system with support for 64-bit floating
point arithmetic.
randdpvec
1. Similar to randdp but written to be easier to vectorize.
2. Execution
The executable is named <benchmark-name>.<class>.x and is placed
in the bin subdirectory (or in the directory BINDIR specified in
make.def, if you've defined it). Folllowing is an example of running
a benchmark in csh:
setenv OMP_NUM_THREADS 4
bin/bt.A.x > BT.A_out.4
It runs BT Class A problem on 4 threads and the output is stored
in BT.A_out.4.
Each benchmark includes a set of additional timers for profiling purpose
(reporting timing for selected code blocks). By default, these timers
are disabled. To enable the timers, create a dummy file 'timer.flag'
in the current working directory (not necessarily where the executable
is located) before running a benchmark.
The printed number of threads is the activated threads during the run,
which may not be the same as what is requested.
3. Known issues
NPB-OMP assumes 'deterministic' static scheduling at run-time to
ensure the correctness of the results. Verification in some
benchmarks might fail if this condition is not met.
For larger problem sizes, the default stack size for slave threads
may need to be increased on certain platforms. For example on SGI
Origin 3000, the following command can be used:
setenv MP_SLAVE_STACKSIZE 50000000 (to about 50MB)
On SGI Altix using the Intel compiler, the runtime variable would be
setenv KMP_STACKSIZE 50m (for 50MB)
In order to build the class E version of CG, the integer type
needs to be promoted to 64-bit, which is usually done through
compilation flag (such as "-i8" for FFLAGS in config/make.def).
4. Notes on the implementation
- Based on NPB3.0-SER, except that FT was kept closer to
the original version in NPB2.3-serial.
- OpenMP directives were added to the outer-most parallel loops.
No nested parallelism was considered.
- Extra loops were added in the beginning of most of the benchmarks
to touch data pages. This is to set up a data layout based on the
'first touch' policy.
- For LU, the pipeline algorithm outperforms the hyperplane algorithm
consistently on most modern platforms. So, only the pipeline
implementation is included.
- The IS OpenMP benchmark enables bucket sort by default. To disable
bucket sort, comment out the line in IS/is.c:
#define USE_BUCKETS
See IS/README.carefully for additional information.
- For Unstructured Adaptive (UA) and DC benchmarks, please see
UA/README or DC/README for additional instruction.