blob: 7fee72c9f00cf3552f3bb3f7a7a8f99cbbf8d2b2 [file] [log] [blame]
Date: Oct 19, 1994
This is the directory for the second release of the Stanford Parallel
Applications for Shared-Memory (SPLASH-2) programs. For further
information contact splash@mojave.stanford.edu.
PLEASE NOTE: Due to our limited resources, we will be unable to spend
much time answering questions about the applications.
splash.tar contains the tared version of all the files. Grabbing this
file will get you everything you need. We also keep the files
individually untared for partial retrieval. The splash.tar file is not
compressed, but the large files in it are. We attempted to compress the
splash.tar file to reduce the file size further, but this resulted in
a negative compression ratio.
DIFFERENCES BETWEEN SPLASH AND SPLASH-2:
----------------------------------------
The SPLASH-2 suite contains two types of codes: full applications and
kernels. Each of the codes utilizes the Argonne National Laboratories
(ANL) parmacs macros for parallel constructs. Unlike the codes in the
original SPLASH release, each of the codes assumes the use of a
"lightweight threads" model (which we hereafter refer to as the "threads"
model) in which child processes share the same virtual address space as
their parent process. In order for the codes to function correctly,
the CREATE macro should call the proper Unix system routine (e.g. "sproc"
in the Silicon Graphics IRIX operating system) instead of the "fork"
routine that was used for SPLASH. The difference is that processes
created with the Unix fork command receive their own private copies of
all global variables. In the threads model, child processes share the
same virtual address space, and hence all global data. Some of the
codes function correctly when the Unix "fork" command is used for child
process creation as well. Comments in the code header denote those
applications which function correctly with "fork."
MACROS:
-------
Macros for the previous release of the SPLASH application suite can be
obtained via anonymous ftp to www-flash.stanford.edu. The macros are
contained in the pub/old_splash/splash/macros subdirectory. HOWEVER,
THE MACRO FILES MUST BE MODIFIED IN ORDER TO BE USED WITH SPLASH-2 CODES.
The CREATE macros must be changed so that they call the proper process
creation routine (See DIFFERENCES section above) instead of "fork."
In this macros subdirectory, macros and sample makefiles are provided
for three machines:
Encore Multimax (CMU Mach 2.5: C and Fortran)
SGI 4D/240 (IRIX System V Release 3.3: C only)
Alliant FX/8 (Alliant Rev. 5.0: C and Fortran)
These macros work for us with the above operating systems. Unfortunately,
our limited resources prevent us from supporting them in any way or
even fielding questions about them. If they don't work for you, please
contact Argonne National Labs for a version that will. An e-mail address
to try might be monitor-users-request@mcs.anl.gov. An excerpt from
a message, received from Argonne, concerning obtaining the macros follows:
"The parmacs package is in the public domain. Approximately 15 people at
Argonne (or associated with Argonne or students) have worked on the
parmacs package at one time or another. The parmacs package is
implemented via macros using the M4 macropreprocessor (standard on most
Unix systems). Current distribution of the software is somewhat ad hoc.
Most C versions can be obtained from netlib (send electronic mail to
netlib@ornl.gov with the message send index from parmacs). Fortran
versions have been emailed directly or sent on tape. The primary
documentation for the parmacs package is the book ``Portable Programs for
Parallel Processors'' by Lusk, et al, Holt, Rinehart, and Winston 1987."
The makefiles provided in the individual program directories specify
a null macro set that will turn the parallel programs into sequential
ones. Note that we do not have a null macro set for FORTRAN.
CODE ENHANCEMENTS:
------------------
All of the codes are designed for shared address space multiprocessors
with physically distributed main memory. For these types of machines,
process migration and poor data distribution can decrease performance
to suboptimal levels. In the applications, comments indicating potential
enhancements can be found which will improve performance. Each potential
enhancement is denoted by a comment beginning with "POSSIBLE ENHANCEMENT".
The potential enhancements which we identify are:
(1) Data Distribution
Comments are placed in the code indicating where directives should
be placed so that data can be migrated to the local memories of
nodes, thus allowing for remote communication to be minimized.
(2) Process-to-Processor Assignment
Comments are placed in the code indicating where directives should
be placed so that processes can be "pinned" to processors,
preventing them from migrating from processor to processor.
In addition, to facilitate simulation studies, we note points in the
codes where statistics gathering routines should be turned on so that
cold-start and initialization effects can be avoided.
As previously mentioned, processes are assumed to be created through calls
to a "threads" model creation routine. One important side effect is that
this model causes all global variables to be shared (whereas the fork model
causes all processes to get their own private copy of global variables).
In order to mimic the behavior of global variables in the fork model, many
of the applications provide arrays of structures that can be accessed by
process ID, such as:
struct per_process_info {
char pad[PAD_LENGTH];
unsigned start_time;
unsigned end_time;
char pad[PAD_LENGTH];
} PPI[MAX_PROCS];
In these structures, padding is inserted to ensure that the structure
information associated with each process can be placed on a different
page of memory, and can thus be explicitly migrated to that processor's
local memory system. We follow this strategy for certain variables since
these data really belong to a process and should be allocated in its local
memory. A programming model that had the ability to declare global private
data would have automatically ensured that these data were private, and
that false sharing did not occur across different structures in the
array. However, since the threads model does not provide this capability,
it is provided by explicitly introducing arrays of structures with padding.
The padding constants used in the programs (PAD_LENGTH in this example)
can easily be changed to suit the particular characteristics of a given
system. The actual data that is manipulated by individual applications
(e.g. grid points, particle data, etc) is not padded, however.
Finally, for some applications we provide less-optimized versions of the
codes. The less-optimized versions utilize data structures that lead to
simpler implementations, but which do not allow for optimal data
distribution (and can thus generate false-sharing).
REPORT:
-------
A report will be put together shortly describing the structure, function,
and performance characteristics of each application. The report will be
similar to the original SPLASH report (see the original report for the
issues discussed). The report will provide quantitative data (for two
different cache line size) for characteristics such as working set size
and miss rates (local versus remote, etc.). In addition, the report
will discuss cache behavior and synchronization behavior of the
applications as well. In the mean time, each application directory has
a README file that describes how to run each application. In addition,
most applications have comments in their headers describing how to run
each application.
README FILES:
-------------
Each application has an associated README file. It is VERY important to
read these files carefully, as they discuss the important parameters to
supply for each application, as well as other issues involved in running
the programs. In each README file, we discuss the impact of explicitly
distributing data on the Stanford DASH Multiprocessor. Unless otherwise
specified, we assume that the default data distribution mechanism is
through round-robin page allocation.
PROBLEM SIZES:
--------------
For each application, the README file describes a recommended problem
size that is a reasonable base problem size that both can be simulated
and is not too small for reality on a machine with up to 64 processors.
For the purposes of studying algorithm performance, the parameters
associated with each application can be varied. However, for the
purposes of comparing machine architectures, the README files describe
which parameters can be varied, and which should remain constant (or at
their default values) for comparability. If the specific "base"
parameters that are specified are not used, then results which are
reported should explicitly state which parameters were changed, what
their new values are, and address why they were changed.
CORE PROGRAMS:
--------------
Since the number of programs has increased over SPLASH, and since not
everyone may be able to use all the programs in a given study, we
identify some of the programs as "core" programs that should be used
in most studies for comparability. In the currently available set, these
core programs include:
(1) Ocean Simulation
(2) Hierarchical Radiosity
(3) Water Simulation with Spatial data structure
(4) Barnes-Hut
(5) FFT
(6) Blocked Sparse Cholesky Factorization
(7) Radix Sort
The less optimized versions of the programs, when provided, should be
used only in addition to these.
MAILING LIST:
-------------
Please send a note to splash@mojave.stanford.edu if you have copied over
the programs, so that we can put you on a mailing list for update reports.
AUTHORSHIP:
-----------
The applications provided in the SPLASH-2 suite were developed by a number
of people. The report lists authors primarily responsible for the
development of each application code. The codes were made ready for
distribution and the README files were prepared by Steven Cameron Woo and
Jaswinder Pal Singh.
CODE CHANGES:
-------------
If modifications are made to the codes which improve their performance,
we would like to hear about them. Please send email to
splash@mojave.stanford.edu detailing the changes.
UPDATE REPORTS:
---------------
Watch this file for information regarding changes to codes and additions
to the application suite.
CHANGES:
-------
10-21-94: Ocean code, contiguous partitions, line 247 of slave1.C changed
from
t2a[0][0] = hh3*t2a[0][0]+hh1*psi[procid][1][0][0];
to
t2a[0][0] = hh3*t2a[0][0]+hh1*t2c[0][0];
This change does not affect correctness; it is an optimization
that was performed elsewhere in the code but overlooked here.
11-01-94: Barnes, file code_io.C, line 55 changed from
in_real(instr, tnow);
to
in_real(instr, &tnow);
11-01-94: Raytrace, file main.C, lines 216-223 changed from
if ((pid == 0) || (dostats))
CLOCK(end);
gm->partime[0] = (end - begin) & 0x7FFFFFFF;
if (pid == 0) gm->par_start_time = begin;
/* printf("Process %ld elapsed time %lu.\n", pid, lapsed); */
}
to
if ((pid == 0) || (dostats)) {
CLOCK(end);
gm->partime[pid] = (end - begin) & 0x7FFFFFFF;
if (pid == 0) gm->par_start_time = begin;
}
11-13-94: Raytrace, file memory.C
The use of the word MAIN_INITENV in a comment in memory.c causes
m4 to expand this macro, and some implementations may get confused
and generate the wrong C code.
11-13-94: Radiosity, file rad_main.C
rad_main.C uses the macro CREATE_LITE. All three instances of
CREATE_LITE should be changed to CREATE.
11-13-94: Water-spatial and Water-nsquared, file makefile
makefiles were changed so that the compilation phases included the
CFLAGS options instead of the CCOPTS options, which did not exist.
11-17-94: FMM, file particle.C
Comment regarding data distribution of particle_array data
structure is incorrect. Round-robin allocation should be used.
11-18-94: OCEAN, contiguous partitions, files main.C and linkup.C
Eliminated a problem which caused non-doubleword aligned
accesses to doublewords for the uniprocessor case.
main.C: Added lines 467-471:
if (nprocs%2 == 1) { /* To make sure that the actual data
starts double word aligned, add an extra
pointer */
d_size += sizeof(double ***);
}
Added same lines in file linkup.C at line numbers 100 and 159.
07-30-95: RADIX has been changed. A tree-structured parallel prefix
computation is now used instead of a linear one.
LU had been modified. A comment describing how to distribute
data (one of the POSSIBLE ENHANCEMENTS) was incorrect for the
contiguous_blocks version of LU. Also, a modification was made
that reduces false sharing at line 206 of lu.C:
last_malloc[i] = (double *) (((unsigned) last_malloc[i]) + PAGE_SIZE -
((unsigned) last_malloc[i]) % PAGE_SIZE);
A subdirectory shmem_files was added under the codes directory.
This directory contains a file that can be compiled on SGI machines
which replaces the libsgi.a file distributed in the original SPLASH
release.
09-26-95: Fixed a bug in LU. Line 201 was changed from
last_malloc[i] = (double *) G_MALLOC(proc_bytes[i])
to
last_malloc[i] = (double *) G_MALLOC(proc_bytes[i] + PAGE_SIZE)
Fixed similar bugs in WATER-NSQUARED and WATER-SPATIAL. Both
codes needed a barrier added into the mdmain.C files. In both
codes, the line
BARRIER(gl->start, NumProcs);
was added. In WATER-NSQUARED, it was added in mdmain.C at line
84. In WATER-SPATIAL, it was added in mdmain.C at line 107.