| GENERAL INFORMATION: |
| |
| The LU program factors a dense matrix into the product of a lower |
| triangular and an upper triangular matrix. The factorization uses |
| blocking to exploit temporal locality on individual submatrix elements. |
| The algorithm used in this implementation is described in |
| |
| Woo, S. C., Singh, J. P., and Hennessy, J. L. The Performance Advantages |
| of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. |
| Proceedings of the 6th International Conference on Architectural |
| Support for Programming Languages and Operating Systems (ASPLOS-VI), |
| October 1994. |
| |
| Two implementations are provided in the SPLASH-2 distribution: |
| |
| (1) Non-contiguous block allocation |
| |
| This implementation (contained in the non_contiguous_blocks |
| subdirectory) implements the matrix to be factored with a |
| two-dimensional array. This data structure prevents blocks from |
| being allocated contiguously, but leads to a conceptually simple |
| programming implementation. |
| |
| (2) Contiguous block allocation |
| |
| This implementation (contained in the contiguous_blocks |
| subdirectory) implements the matrix to be factored as an array |
| of blocks. This data structure allows blocks to be allocated |
| contiguously and entirely in the local memory of processors that |
| "own" them, thus enhancing data locality properties. |
| |
| These programs work under both the Unix FORK and SPROC models. |
| |
| RUNNING THE PROGRAM: |
| |
| To see how to run the program, please see the comment at the top of the |
| file lu.C, or run the application with the "-h" command line option. |
| Three parameters may be specified on the command line, of which the |
| ones that are normally changed are the matrix size and the number of |
| processors. It is suggested that the block size be kept at the value |
| B=16, since this value works well in practice. If this parameter is |
| changed, the new value should be reported in any results that are |
| presented. |
| |
| BASE PROBLEM SIZE: |
| |
| The base problem size for an upto-64 processor machine is a 512x512 matrix |
| with a block size of B=16. |
| |
| DATA DISTRIBUTION: |
| |
| Our "POSSIBLE ENHANCEMENT" comments in the source code tell where one |
| might want to distribute data and how. Data distribution has a small |
| impact on performance on the Stanford DASH multiprocessor. |
| |