src/parsec/disk-image/parsec/parsec-benchmark/ext/splash2/README.SPLASH2 - public/gem5-resources - Git at Google

 Date:  Oct 19, 1994

 This is the directory for the second release of the Stanford Parallel
 Applications for Shared-Memory (SPLASH-2) programs. For further
 information contact splash@mojave.stanford.edu.

 PLEASE NOTE:  Due to our limited resources, we will be unable to spend
 much time answering questions about the applications.

 splash.tar contains the tared version of all the files.  Grabbing this
 file will get you everything you need.  We also keep the files
 individually untared for partial retrieval.  The splash.tar file is not
 compressed, but the large files in it are.  We attempted to compress the
 splash.tar file to reduce the file size further, but this resulted in
 a negative compression ratio.


 DIFFERENCES BETWEEN SPLASH AND SPLASH-2:
 ----------------------------------------

 The SPLASH-2 suite contains two types of codes: full applications and
 kernels.  Each of the codes utilizes the Argonne National Laboratories
 (ANL) parmacs macros for parallel constructs.  Unlike the codes in the
 original SPLASH release, each of the codes assumes the use of a
 "lightweight threads" model (which we hereafter refer to as the "threads"
 model) in which child processes share the same virtual address space as
 their parent process.  In order for the codes to function correctly,
 the CREATE macro should call the proper Unix system routine (e.g. "sproc"
 in the Silicon Graphics IRIX operating system) instead of the "fork"
 routine that was used for SPLASH.  The difference is that processes
 created with the Unix fork command receive their own private copies of
 all global variables.  In the threads model, child processes share the
 same virtual address space, and hence all global data.  Some of the
 codes function correctly when the Unix "fork" command is used for child
 process creation as well.  Comments in the code header denote those
 applications which function correctly with "fork."


 MACROS:
 -------

 Macros for the previous release of the SPLASH application suite can be
 obtained via anonymous ftp to www-flash.stanford.edu.  The macros are
 contained in the pub/old_splash/splash/macros subdirectory.  HOWEVER,
 THE MACRO FILES MUST BE MODIFIED IN ORDER TO BE USED WITH SPLASH-2 CODES.
 The CREATE macros must be changed so that they call the proper process
 creation routine (See DIFFERENCES section above) instead of "fork."

 In this macros subdirectory, macros and sample makefiles are provided
 for three machines:

 Encore Multimax (CMU Mach 2.5: C and Fortran)
 SGI 4D/240      (IRIX System V Release 3.3: C only)
 Alliant FX/8    (Alliant Rev. 5.0: C and Fortran)

 These macros work for us with the above operating systems.  Unfortunately,
 our limited resources prevent us from supporting them in any way or
 even fielding questions about them.   If they don't work for you, please
 contact Argonne National Labs for a version that will.  An e-mail address
 to try might be monitor-users-request@mcs.anl.gov. An excerpt from
 a message, received from Argonne, concerning obtaining the macros follows:

 "The parmacs package is in the public domain.  Approximately 15 people at
 Argonne (or associated with Argonne or students) have worked on the
 parmacs package at one time or another.  The parmacs package is
 implemented via macros using the M4 macropreprocessor (standard on most
 Unix systems).  Current distribution of the software is somewhat ad hoc.
 Most C versions can be obtained from netlib (send electronic mail to
 netlib@ornl.gov with the message send index from parmacs).  Fortran
 versions have been emailed directly or sent on tape.  The primary
 documentation for the parmacs package is the book ``Portable Programs for
 Parallel Processors'' by Lusk, et al, Holt, Rinehart, and Winston 1987."

 The makefiles provided in the individual program directories specify
 a null macro set that will turn the parallel programs into sequential
 ones.  Note that we do not have a null macro set for FORTRAN.


 CODE ENHANCEMENTS:
 ------------------

 All of the codes are designed for shared address space multiprocessors
 with physically distributed main memory.  For these types of machines,
 process migration and poor data distribution can decrease performance
 to suboptimal levels.  In the applications, comments indicating potential
 enhancements can be found which will improve performance.  Each potential
 enhancement is denoted by a comment beginning with "POSSIBLE ENHANCEMENT".
 The potential enhancements which we identify are:

   (1) Data Distribution

       Comments are placed in the code indicating where directives should
       be placed so that data can be migrated to the local memories of
       nodes, thus allowing for remote communication to be minimized.

   (2) Process-to-Processor Assignment

       Comments are placed in the code indicating where directives should
       be placed so that processes can be "pinned" to processors,
       preventing them from migrating from processor to processor.

 In addition, to facilitate simulation studies, we note points in the
 codes where statistics gathering routines should be turned on so that
 cold-start and initialization effects can be avoided.

 As previously mentioned, processes are assumed to be created through calls
 to a "threads" model creation routine.  One important side effect is that
 this model causes all global variables to be shared (whereas the fork model
 causes all processes to get their own private copy of global variables).
 In order to mimic the behavior of global variables in the fork model, many
 of the applications provide arrays of structures that can be accessed by
 process ID, such as:

      struct per_process_info {
        char pad[PAD_LENGTH];
        unsigned start_time;
        unsigned end_time;
        char pad[PAD_LENGTH];
      } PPI[MAX_PROCS];

 In these structures, padding is inserted to ensure that the structure
 information associated with each process can be placed on a different
 page of memory, and can thus be explicitly migrated to that processor's
 local memory system.  We follow this strategy for certain variables since
 these data really belong to a process and should be allocated in its local
 memory.  A programming model that had the ability to declare global private
 data would have automatically ensured that these data were private, and
 that false sharing did not occur across different structures in the
 array.  However, since the threads model does not provide this capability,
 it is provided by explicitly introducing arrays of structures with padding.
 The padding constants used in the programs (PAD_LENGTH in this example)
 can easily be changed to suit the particular characteristics of a given
 system.  The actual data that is manipulated by individual applications
 (e.g. grid points, particle data, etc) is not padded, however.

 Finally, for some applications we provide less-optimized versions of the
 codes.  The less-optimized versions utilize data structures that lead to
 simpler implementations, but which do not allow for optimal data
 distribution (and can thus generate false-sharing).


 REPORT:
 -------

 A report will be put together shortly describing the structure, function,
 and performance characteristics of each application.  The report will be
 similar to the original SPLASH report (see the original report for the
 issues discussed).  The report will provide quantitative data (for two
 different cache line size) for characteristics such as working set size
 and miss rates (local versus remote, etc.).  In addition, the report
 will discuss cache behavior and synchronization behavior of the
 applications as well.  In the mean time, each application directory has
 a README file that describes how to run each application.  In addition,
 most applications have comments in their headers describing how to run
 each application.


 README FILES:
 -------------

 Each application has an associated README file.  It is VERY important to
 read these files carefully, as they discuss the important parameters to
 supply for each application, as well as other issues involved in running
 the programs.  In each README file, we discuss the impact of explicitly
 distributing data on the Stanford DASH Multiprocessor.  Unless otherwise
 specified, we assume that the default data distribution mechanism is
 through round-robin page allocation.


 PROBLEM SIZES:
 --------------

 For each application, the README file describes a recommended problem
 size that is a reasonable base problem size that both can be simulated
 and is not too small for reality on a machine with up to 64 processors.
 For the purposes of studying algorithm performance, the parameters
 associated with each application can be varied.  However, for the
 purposes of comparing machine architectures, the README files describe
 which parameters can be varied, and which should remain constant (or at
 their default values) for comparability.  If the specific "base"
 parameters that are specified are not used, then results which are
 reported should explicitly state which parameters were changed, what
 their new values are, and address why they were changed.


 CORE PROGRAMS:
 --------------

 Since the number of programs has increased over SPLASH, and since not
 everyone may be able to use all the programs in a given study, we
 identify some of the programs as "core" programs that should be used
 in most studies for comparability.  In the currently available set, these
 core programs include:

 (1) Ocean Simulation
 (2) Hierarchical Radiosity
 (3) Water Simulation with Spatial data structure
 (4) Barnes-Hut
 (5) FFT
 (6) Blocked Sparse Cholesky Factorization
 (7) Radix Sort

 The less optimized versions of the programs, when provided, should be
 used only in addition to these.


 MAILING LIST:
 -------------

 Please send a note to splash@mojave.stanford.edu if you have copied over
 the programs, so that we can put you on a mailing list for update reports.


 AUTHORSHIP:
 -----------

 The applications provided in the SPLASH-2 suite were developed by a number
 of people.  The report lists authors primarily responsible for the
 development of each application code.  The codes were made ready for
 distribution and the README files were prepared by Steven Cameron Woo and
 Jaswinder Pal Singh.


 CODE CHANGES:
 -------------

 If modifications are made to the codes which improve their performance,
 we would like to hear about them.  Please send email to
 splash@mojave.stanford.edu detailing the changes.


 UPDATE REPORTS:
 ---------------

 Watch this file for information regarding changes to codes and additions
 to the application suite.


 CHANGES:
 -------

 10-21-94: Ocean code, contiguous partitions, line 247 of slave1.C changed
           from

           t2a[0][0] = hh3*t2a[0][0]+hh1*psi[procid][1][0][0];

           to

           t2a[0][0] = hh3*t2a[0][0]+hh1*t2c[0][0];

           This change does not affect correctness; it is an optimization
           that was performed elsewhere in the code but overlooked here.

 11-01-94: Barnes, file code_io.C, line 55 changed from

           in_real(instr, tnow);

           to

           in_real(instr, &tnow);

 11-01-94: Raytrace, file main.C, lines 216-223 changed from

           if ((pid == 0) || (dostats))
           CLOCK(end);

           gm->partime[0] = (end - begin) & 0x7FFFFFFF;
           if (pid == 0) gm->par_start_time = begin;

 /*      printf("Process %ld elapsed time %lu.\n", pid, lapsed); */

           }

           to

           if ((pid == 0) || (dostats)) {
             CLOCK(end);
             gm->partime[pid] = (end - begin) & 0x7FFFFFFF;
             if (pid == 0) gm->par_start_time = begin;
           }

 11-13-94: Raytrace, file memory.C

           The use of the word MAIN_INITENV in a comment in memory.c causes
           m4 to expand this macro, and some implementations may get confused
           and generate the wrong C code.

 11-13-94: Radiosity, file rad_main.C

           rad_main.C uses the macro CREATE_LITE.  All three instances of
           CREATE_LITE should be changed to CREATE.

 11-13-94: Water-spatial and Water-nsquared, file makefile

           makefiles were changed so that the compilation phases included the
           CFLAGS options instead of the CCOPTS options, which did not exist.

 11-17-94: FMM, file particle.C

           Comment regarding data distribution of particle_array data
           structure is incorrect.  Round-robin allocation should be used.

 11-18-94: OCEAN, contiguous partitions, files main.C and linkup.C

           Eliminated a problem which caused non-doubleword aligned
           accesses to doublewords for the uniprocessor case.

           main.C: Added lines 467-471:

           if (nprocs%2 == 1) {         /* To make sure that the actual data
                                           starts double word aligned, add an extra
                                           pointer */
             d_size += sizeof(double ***);
           }

           Added same lines in file linkup.C at line numbers 100 and 159.

 07-30-95: RADIX has been changed.  A tree-structured parallel prefix
           computation is now used instead of a linear one.

           LU had been modified.  A comment describing how to distribute
           data (one of the POSSIBLE ENHANCEMENTS) was incorrect for the
           contiguous_blocks version of LU.  Also, a modification was made
           that reduces false sharing at line 206 of lu.C:

           last_malloc[i] = (double *) (((unsigned) last_malloc[i]) + PAGE_SIZE -
                            ((unsigned) last_malloc[i]) % PAGE_SIZE);

           A subdirectory shmem_files was added under the codes directory.
           This directory contains a file that can be compiled on SGI machines
           which replaces the libsgi.a file distributed in the original SPLASH
           release.

 09-26-95: Fixed a bug in LU. Line 201 was changed from

             last_malloc[i] = (double *) G_MALLOC(proc_bytes[i])

           to

             last_malloc[i] = (double *) G_MALLOC(proc_bytes[i] + PAGE_SIZE)

           Fixed similar bugs in WATER-NSQUARED and WATER-SPATIAL.  Both
           codes needed a barrier added into the mdmain.C files. In both
           codes, the line

             BARRIER(gl->start, NumProcs);

           was added.  In WATER-NSQUARED, it was added in mdmain.C at line
           84.  In WATER-SPATIAL, it was added in mdmain.C at line 107.
	Date: Oct 19, 1994

	This is the directory for the second release of the Stanford Parallel
	Applications for Shared-Memory (SPLASH-2) programs. For further
	information contact splash@mojave.stanford.edu.

	PLEASE NOTE: Due to our limited resources, we will be unable to spend
	much time answering questions about the applications.

	splash.tar contains the tared version of all the files. Grabbing this
	file will get you everything you need. We also keep the files
	individually untared for partial retrieval. The splash.tar file is not
	compressed, but the large files in it are. We attempted to compress the
	splash.tar file to reduce the file size further, but this resulted in
	a negative compression ratio.


	DIFFERENCES BETWEEN SPLASH AND SPLASH-2:
	----------------------------------------

	The SPLASH-2 suite contains two types of codes: full applications and
	kernels. Each of the codes utilizes the Argonne National Laboratories
	(ANL) parmacs macros for parallel constructs. Unlike the codes in the
	original SPLASH release, each of the codes assumes the use of a
	"lightweight threads" model (which we hereafter refer to as the "threads"
	model) in which child processes share the same virtual address space as
	their parent process. In order for the codes to function correctly,
	the CREATE macro should call the proper Unix system routine (e.g. "sproc"
	in the Silicon Graphics IRIX operating system) instead of the "fork"
	routine that was used for SPLASH. The difference is that processes
	created with the Unix fork command receive their own private copies of
	all global variables. In the threads model, child processes share the
	same virtual address space, and hence all global data. Some of the
	codes function correctly when the Unix "fork" command is used for child
	process creation as well. Comments in the code header denote those
	applications which function correctly with "fork."


	MACROS:
	-------

	Macros for the previous release of the SPLASH application suite can be
	obtained via anonymous ftp to www-flash.stanford.edu. The macros are
	contained in the pub/old_splash/splash/macros subdirectory. HOWEVER,
	THE MACRO FILES MUST BE MODIFIED IN ORDER TO BE USED WITH SPLASH-2 CODES.
	The CREATE macros must be changed so that they call the proper process
	creation routine (See DIFFERENCES section above) instead of "fork."

	In this macros subdirectory, macros and sample makefiles are provided
	for three machines:

	Encore Multimax (CMU Mach 2.5: C and Fortran)
	SGI 4D/240 (IRIX System V Release 3.3: C only)
	Alliant FX/8 (Alliant Rev. 5.0: C and Fortran)

	These macros work for us with the above operating systems. Unfortunately,
	our limited resources prevent us from supporting them in any way or
	even fielding questions about them. If they don't work for you, please
	contact Argonne National Labs for a version that will. An e-mail address
	to try might be monitor-users-request@mcs.anl.gov. An excerpt from
	a message, received from Argonne, concerning obtaining the macros follows:

	"The parmacs package is in the public domain. Approximately 15 people at
	Argonne (or associated with Argonne or students) have worked on the
	parmacs package at one time or another. The parmacs package is
	implemented via macros using the M4 macropreprocessor (standard on most
	Unix systems). Current distribution of the software is somewhat ad hoc.
	Most C versions can be obtained from netlib (send electronic mail to
	netlib@ornl.gov with the message send index from parmacs). Fortran
	versions have been emailed directly or sent on tape. The primary
	documentation for the parmacs package is the book ``Portable Programs for
	Parallel Processors'' by Lusk, et al, Holt, Rinehart, and Winston 1987."

	The makefiles provided in the individual program directories specify
	a null macro set that will turn the parallel programs into sequential
	ones. Note that we do not have a null macro set for FORTRAN.


	CODE ENHANCEMENTS:
	------------------

	All of the codes are designed for shared address space multiprocessors
	with physically distributed main memory. For these types of machines,
	process migration and poor data distribution can decrease performance
	to suboptimal levels. In the applications, comments indicating potential
	enhancements can be found which will improve performance. Each potential
	enhancement is denoted by a comment beginning with "POSSIBLE ENHANCEMENT".
	The potential enhancements which we identify are:

	(1) Data Distribution

	Comments are placed in the code indicating where directives should
	be placed so that data can be migrated to the local memories of
	nodes, thus allowing for remote communication to be minimized.

	(2) Process-to-Processor Assignment

	Comments are placed in the code indicating where directives should
	be placed so that processes can be "pinned" to processors,
	preventing them from migrating from processor to processor.

	In addition, to facilitate simulation studies, we note points in the
	codes where statistics gathering routines should be turned on so that
	cold-start and initialization effects can be avoided.

	As previously mentioned, processes are assumed to be created through calls
	to a "threads" model creation routine. One important side effect is that
	this model causes all global variables to be shared (whereas the fork model
	causes all processes to get their own private copy of global variables).
	In order to mimic the behavior of global variables in the fork model, many
	of the applications provide arrays of structures that can be accessed by
	process ID, such as:

	struct per_process_info {
	char pad[PAD_LENGTH];
	unsigned start_time;
	unsigned end_time;
	char pad[PAD_LENGTH];
	} PPI[MAX_PROCS];

	In these structures, padding is inserted to ensure that the structure
	information associated with each process can be placed on a different
	page of memory, and can thus be explicitly migrated to that processor's
	local memory system. We follow this strategy for certain variables since
	these data really belong to a process and should be allocated in its local
	memory. A programming model that had the ability to declare global private
	data would have automatically ensured that these data were private, and
	that false sharing did not occur across different structures in the
	array. However, since the threads model does not provide this capability,
	it is provided by explicitly introducing arrays of structures with padding.
	The padding constants used in the programs (PAD_LENGTH in this example)
	can easily be changed to suit the particular characteristics of a given
	system. The actual data that is manipulated by individual applications
	(e.g. grid points, particle data, etc) is not padded, however.

	Finally, for some applications we provide less-optimized versions of the
	codes. The less-optimized versions utilize data structures that lead to
	simpler implementations, but which do not allow for optimal data
	distribution (and can thus generate false-sharing).


	REPORT:
	-------

	A report will be put together shortly describing the structure, function,
	and performance characteristics of each application. The report will be
	similar to the original SPLASH report (see the original report for the
	issues discussed). The report will provide quantitative data (for two
	different cache line size) for characteristics such as working set size
	and miss rates (local versus remote, etc.). In addition, the report
	will discuss cache behavior and synchronization behavior of the
	applications as well. In the mean time, each application directory has
	a README file that describes how to run each application. In addition,
	most applications have comments in their headers describing how to run
	each application.


	README FILES:
	-------------

	Each application has an associated README file. It is VERY important to
	read these files carefully, as they discuss the important parameters to
	supply for each application, as well as other issues involved in running
	the programs. In each README file, we discuss the impact of explicitly
	distributing data on the Stanford DASH Multiprocessor. Unless otherwise
	specified, we assume that the default data distribution mechanism is
	through round-robin page allocation.


	PROBLEM SIZES:
	--------------

	For each application, the README file describes a recommended problem
	size that is a reasonable base problem size that both can be simulated
	and is not too small for reality on a machine with up to 64 processors.
	For the purposes of studying algorithm performance, the parameters
	associated with each application can be varied. However, for the
	purposes of comparing machine architectures, the README files describe
	which parameters can be varied, and which should remain constant (or at
	their default values) for comparability. If the specific "base"
	parameters that are specified are not used, then results which are
	reported should explicitly state which parameters were changed, what
	their new values are, and address why they were changed.


	CORE PROGRAMS:
	--------------

	Since the number of programs has increased over SPLASH, and since not
	everyone may be able to use all the programs in a given study, we
	identify some of the programs as "core" programs that should be used
	in most studies for comparability. In the currently available set, these
	core programs include:

	(1) Ocean Simulation
	(2) Hierarchical Radiosity
	(3) Water Simulation with Spatial data structure
	(4) Barnes-Hut
	(5) FFT
	(6) Blocked Sparse Cholesky Factorization
	(7) Radix Sort

	The less optimized versions of the programs, when provided, should be
	used only in addition to these.


	MAILING LIST:
	-------------

	Please send a note to splash@mojave.stanford.edu if you have copied over
	the programs, so that we can put you on a mailing list for update reports.


	AUTHORSHIP:
	-----------

	The applications provided in the SPLASH-2 suite were developed by a number
	of people. The report lists authors primarily responsible for the
	development of each application code. The codes were made ready for
	distribution and the README files were prepared by Steven Cameron Woo and
	Jaswinder Pal Singh.


	CODE CHANGES:
	-------------

	If modifications are made to the codes which improve their performance,
	we would like to hear about them. Please send email to
	splash@mojave.stanford.edu detailing the changes.


	UPDATE REPORTS:
	---------------

	Watch this file for information regarding changes to codes and additions
	to the application suite.


	CHANGES:
	-------

	10-21-94: Ocean code, contiguous partitions, line 247 of slave1.C changed
	from

	t2a[0][0] = hh3t2a[0][0]+hh1psi[procid][1][0][0];

	to

	t2a[0][0] = hh3t2a[0][0]+hh1t2c[0][0];

	This change does not affect correctness; it is an optimization
	that was performed elsewhere in the code but overlooked here.

	11-01-94: Barnes, file code_io.C, line 55 changed from

	in_real(instr, tnow);

	to

	in_real(instr, &tnow);

	11-01-94: Raytrace, file main.C, lines 216-223 changed from

	if ((pid == 0) \|\| (dostats))
	CLOCK(end);

	gm->partime[0] = (end - begin) & 0x7FFFFFFF;
	if (pid == 0) gm->par_start_time = begin;

	/* printf("Process %ld elapsed time %lu.\n", pid, lapsed); */

	}

	to

	if ((pid == 0) \|\| (dostats)) {
	CLOCK(end);
	gm->partime[pid] = (end - begin) & 0x7FFFFFFF;
	if (pid == 0) gm->par_start_time = begin;
	}

	11-13-94: Raytrace, file memory.C

	The use of the word MAIN_INITENV in a comment in memory.c causes
	m4 to expand this macro, and some implementations may get confused
	and generate the wrong C code.

	11-13-94: Radiosity, file rad_main.C

	rad_main.C uses the macro CREATE_LITE. All three instances of
	CREATE_LITE should be changed to CREATE.

	11-13-94: Water-spatial and Water-nsquared, file makefile

	makefiles were changed so that the compilation phases included the
	CFLAGS options instead of the CCOPTS options, which did not exist.

	11-17-94: FMM, file particle.C

	Comment regarding data distribution of particle_array data
	structure is incorrect. Round-robin allocation should be used.

	11-18-94: OCEAN, contiguous partitions, files main.C and linkup.C

	Eliminated a problem which caused non-doubleword aligned
	accesses to doublewords for the uniprocessor case.

	main.C: Added lines 467-471:

	if (nprocs%2 == 1) { /* To make sure that the actual data
	starts double word aligned, add an extra
	pointer */
	d_size += sizeof(double ***);
	}

	Added same lines in file linkup.C at line numbers 100 and 159.

	07-30-95: RADIX has been changed. A tree-structured parallel prefix
	computation is now used instead of a linear one.

	LU had been modified. A comment describing how to distribute
	data (one of the POSSIBLE ENHANCEMENTS) was incorrect for the
	contiguous_blocks version of LU. Also, a modification was made
	that reduces false sharing at line 206 of lu.C:

	last_malloc[i] = (double *) (((unsigned) last_malloc[i]) + PAGE_SIZE -
	((unsigned) last_malloc[i]) % PAGE_SIZE);

	A subdirectory shmem_files was added under the codes directory.
	This directory contains a file that can be compiled on SGI machines
	which replaces the libsgi.a file distributed in the original SPLASH
	release.

	09-26-95: Fixed a bug in LU. Line 201 was changed from

	last_malloc[i] = (double *) G_MALLOC(proc_bytes[i])

	to

	last_malloc[i] = (double *) G_MALLOC(proc_bytes[i] + PAGE_SIZE)

	Fixed similar bugs in WATER-NSQUARED and WATER-SPATIAL. Both
	codes needed a barrier added into the mdmain.C files. In both
	codes, the line

	BARRIER(gl->start, NumProcs);

	was added. In WATER-NSQUARED, it was added in mdmain.C at line
	84. In WATER-SPATIAL, it was added in mdmain.C at line 107.