| GENERAL INFORMATION: |
| |
| The FFT program is a complex, one-dimensional version of the "Six-Step" |
| FFT described in |
| |
| Bailey, D. H. FFTs in External or Hierarchical Memory. |
| Journal of Supercomputing, 4(1):23-35, March 1990. |
| |
| Specific optimizations in this implementation include: (1) Performing |
| staggered, blocked transposes that exploit cache-line reuse; (2) The |
| roots of unity data structure is arranged and distributed for only local |
| accesses during application of the roots of unity step; (3) A small set of |
| roots of unity elements are replicated locally at each processor for |
| computation of the 1D FFTs; and (4) The matrix data structures are padded |
| to reduce cache mapping conflicts. The algorithm used in this |
| implementation is described in: |
| |
| Woo, S. C., Singh, J. P., and Hennessy, J. L. The Performance Advantages |
| of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. |
| Proceedings of the 6th International Conference on Architectural |
| Support for Programming Languages and Operating Systems (ASPLOS-VI), |
| October 1994. |
| |
| This program works under both the Unix FORK and SPROC models. |
| |
| RUNNING THE PROGRAM: |
| |
| To see how to run the program, please see the comment at the top of the |
| file fft.C, or run the application with the "-h" command line option. |
| Four command-line parameters MUST be specified: The number of points |
| to transform, the number of processors, the log base 2 of the cache |
| line size, and the number of cache lines. |
| |
| The number of complex data points to be transformed and the number of |
| processors being used can be varied according to the following rules. |
| The number of data points must be an even power of 2 (2**16, 2**18, |
| etc). The command-line option "-mM" specifies the even power of two (M), |
| so for 65,536 data points, the command line option is "-m16." The number |
| of processors must be a power of 2. |
| |
| The main data structures are padded to reduce cache mapping conflicts. |
| The constant PAGE_SIZE should be changed to reflect the page size (in |
| bytes) of the target system. In addition, the program uses a blocking |
| technique to exploit cache line reuse. Both the log base 2 of the cache |
| line size and the number of cache lines in the cache should be specified |
| at the command line in order to allow the blocking algorithm to work |
| effectively. |
| |
| BASE PROBLEM SIZE: |
| |
| The base problem size for an upto-64 processor machine is 65,536 complex |
| data points (M=16). The PAGE_SIZE constant should be changed to reflect |
| the page size (in bytes) of the target system. In addition, the log base |
| 2 of the cache line size, the number of cache lines, and the number of |
| processors must be specified. |
| |
| DATA DISTRIBUTION: |
| |
| Our "POSSIBLE ENHANCEMENT" comments in the source code tell where one |
| might want to distribute data and how. Data distribution has an impact |
| on performance on the Stanford DASH multiprocessor. |
| |