 GENERAL INFORMATION:

 The LU program factors a dense matrix into the product of a lower
 triangular and an upper triangular matrix.  The factorization uses
 blocking to exploit temporal locality on individual submatrix elements.
 The algorithm used in this implementation is described in

 Woo, S. C., Singh, J. P., and Hennessy, J. L.  The Performance Advantages
      of Integrating Block Data Transfer in Cache-Coherent Multiprocessors.
      Proceedings of the 6th International Conference on Architectural
      Support for Programming Languages and Operating Systems (ASPLOS-VI),
      October 1994.
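
 As a rough illustration of what blocking means here, the following is a
 minimal sequential sketch of a blocked LU factorization without pivoting.
 It is not the SPLASH-2 source: the function name lu_blocked, the macro A,
 and the row-major layout are assumptions made for this example, and the
 real program distributes the block operations across processors.

```c
#include <stddef.h>

/* Hypothetical accessor: element (i, j) of a row-major n x n matrix a. */
#define A(i, j) a[(size_t)(i) * (size_t)n + (size_t)(j)]

/*
 * Sequential blocked LU factorization without pivoting (illustrative only).
 * On return, the strict lower triangle of a holds L (unit diagonal implied)
 * and the upper triangle holds U.  Returns 0 on success, -1 on a zero pivot.
 */
int lu_blocked(double *a, int n, int b)
{
    for (int k = 0; k < n; k += b) {
        int kend = (k + b < n) ? k + b : n;

        /* Step 1: factor the diagonal block in place. */
        for (int p = k; p < kend; p++) {
            if (A(p, p) == 0.0)
                return -1;
            for (int i = p + 1; i < kend; i++) {
                A(i, p) /= A(p, p);
                for (int j = p + 1; j < kend; j++)
                    A(i, j) -= A(i, p) * A(p, j);
            }
        }

        /* Step 2a: blocks to the right become U12 = inv(L11) * A12. */
        for (int j = kend; j < n; j++)
            for (int p = k; p < kend; p++)
                for (int i = p + 1; i < kend; i++)
                    A(i, j) -= A(i, p) * A(p, j);

        /* Step 2b: blocks below become L21 = A21 * inv(U11). */
        for (int i = kend; i < n; i++)
            for (int p = k; p < kend; p++) {
                A(i, p) /= A(p, p);
                for (int j = p + 1; j < kend; j++)
                    A(i, j) -= A(i, p) * A(p, j);
            }

        /* Step 3: update the trailing submatrix one b x b block at a time,
         * so each block is reused while it is still in the cache. */
        for (int ii = kend; ii < n; ii += b) {
            int iend = (ii + b < n) ? ii + b : n;
            for (int jj = kend; jj < n; jj += b) {
                int jend = (jj + b < n) ? jj + b : n;
                for (int i = ii; i < iend; i++)
                    for (int p = k; p < kend; p++) {
                        double l = A(i, p);
                        for (int j = jj; j < jend; j++)
                            A(i, j) -= l * A(p, j);
                    }
            }
        }
    }
    return 0;
}

#undef A
```

 Each trailing block is touched b times per outer step while it is small
 enough to stay in the cache, which is the temporal locality the blocking
 is meant to exploit.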

 Two implementations are provided in the SPLASH-2 distribution:

   (1) Non-contiguous block allocation

       This implementation (contained in the non_contiguous_blocks
       subdirectory) implements the matrix to be factored with a
       two-dimensional array.  This data structure prevents blocks from
       being allocated contiguously, but leads to a conceptually simple
       programming implementation.

   (2) Contiguous block allocation

       This implementation (contained in the contiguous_blocks
       subdirectory) implements the matrix to be factored as an array
       of blocks.  This data structure allows blocks to be allocated
       contiguously and entirely in the local memory of processors that
       "own" them, thus enhancing data locality properties.
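
 The difference between the two layouts can be sketched as index
 arithmetic.  The functions below are hypothetical helpers written for this
 illustration, not code from either subdirectory.  They assume row-major
 storage, a matrix size n that is a multiple of the block size b, and
 blocks laid out one after another in block-row order; the actual program
 may order or group blocks differently (for example, by owning processor).

```c
#include <stddef.h>

/* Layout (1): a single row-major n x n array.  Consecutive rows of the
 * same block are n elements apart, so a block is not contiguous. */
size_t index_2d(int n, int i, int j)
{
    return (size_t)i * (size_t)n + (size_t)j;
}

/* Layout (2): an array of b x b blocks.  All b*b elements of a block are
 * adjacent in memory.  Assumes n is a multiple of b. */
size_t index_blocked(int n, int b, int i, int j)
{
    size_t blocks_per_row = (size_t)(n / b);
    size_t block = (size_t)(i / b) * blocks_per_row + (size_t)(j / b);
    size_t within = (size_t)(i % b) * (size_t)b + (size_t)(j % b);
    return block * (size_t)b * (size_t)b + within;
}
```

 With n=512 and b=16, the first block occupies offsets 0 through 255 in
 layout (2), whereas in layout (1) its last element sits at offset
 15*512+15 = 7695.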

 These programs work under both the Unix FORK and SPROC models.

 RUNNING THE PROGRAM:

 For instructions on running the program, see the comment at the top of
 the file lu.C, or run the application with the "-h" command line option.
 Three parameters may be specified on the command line, of which the
 ones that are normally changed are the matrix size and the number of
 processors.  It is suggested that the block size be kept at the value
 B=16, since this value works well in practice.  If this parameter is
 changed, the new value should be reported in any results that are
 presented.

 BASE PROBLEM SIZE:

 The base problem size for a machine with up to 64 processors is a
 512x512 matrix with a block size of B=16.
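
 To make the base problem size concrete, the sketch below works out the
 block counts and memory footprint.  The struct and function names are
 hypothetical, and the element type is assumed to be double; this README
 does not state the element type.

```c
#include <stddef.h>

/* Derived quantities for an n x n matrix split into b x b blocks. */
struct lu_footprint {
    int blocks_per_dim;      /* blocks along one dimension */
    int total_blocks;        /* blocks in the whole matrix */
    size_t bytes_per_block;  /* storage for one block      */
    size_t matrix_bytes;     /* storage for the matrix     */
};

/* Assumes n is a multiple of b and that elements are doubles. */
struct lu_footprint lu_compute_footprint(int n, int b)
{
    struct lu_footprint f;
    f.blocks_per_dim = n / b;
    f.total_blocks = f.blocks_per_dim * f.blocks_per_dim;
    f.bytes_per_block = (size_t)b * (size_t)b * sizeof(double);
    f.matrix_bytes = (size_t)n * (size_t)n * sizeof(double);
    return f;
}
```

 For n=512 and B=16 this gives 32 blocks per dimension and 1024 blocks in
 total.  If the blocks were divided evenly among 64 processors, each
 processor would own 16 of them.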

 DATA DISTRIBUTION:

 The "POSSIBLE ENHANCEMENT" comments in the source code indicate where and
 how one might want to distribute data.  Data distribution has a small
 impact on performance on the Stanford DASH multiprocessor.