| \documentclass[fleqn,12pt]{article} |
| % |
| \setlength{\oddsidemargin}{-0.25in} |
| \setlength{\textwidth}{7.0in} |
| \setlength{\topmargin}{-0.25in} |
| \setlength{\textheight}{9.5in} |
| % |
| \usepackage{algorithmic} |
| \newenvironment{algorithm}{\begin{quote} %\vspace{1em} |
| \begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em} |
| \end{quote}} |
| \newcommand{\Real}{\mathop{\mathrm{Re}}} |
| \newcommand{\Imag}{\mathop{\mathrm{Im}}} |
| |
| \begin{document} |
| \title{FFT Algorithms} |
| \author{Brian Gough, {\tt bjg@network-theory.co.uk}} |
| \date{May 1997} |
| \maketitle |
| |
| \section{Introduction} |
| |
Fast Fourier Transforms (FFTs) are efficient algorithms for
calculating the discrete Fourier transform (DFT),
| % |
| \begin{eqnarray} |
| h_a &=& \mathrm{DFT}(g_b) \\ |
| &=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\ |
| &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N) |
| \end{eqnarray} |
| % |
The DFT usually arises as an approximation to the continuous Fourier
transform when functions are sampled at discrete intervals in space or
time. The naive evaluation of the discrete Fourier transform is a
matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
| $O(N^2)$ operations for $N$ data-points. The general principle of the |
| Fast Fourier Transform algorithms is to use a divide-and-conquer |
| strategy to factorize the matrix $W$ into smaller sub-matrices, |
| typically reducing the operation count to $O(N \sum f_i)$ if $N$ can |
| be factorized into smaller integers, $N=f_1 f_2 \dots f_n$. |
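As a point of reference, the naive $O(N^2)$ evaluation can be written
directly from the definition. The following is an illustrative sketch in
Python (not part of GSL):

```python
import cmath

def dft(g):
    """Naive O(N^2) evaluation of h_a = sum_b g_b W_N^{ab}."""
    N = len(g)
    W = cmath.exp(-2j * cmath.pi / N)   # W_N = exp(-2 pi i / N)
    return [sum(g[b] * W ** (a * b) for b in range(N)) for a in range(N)]
```

Each of the $N$ outputs is a sum over $N$ inputs, which is exactly the
$O(N^2)$ matrix-vector product that the FFT algorithms below avoid.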
| |
| This chapter explains the algorithms used in the GSL FFT routines |
| and provides some information on how to extend them. To learn more about |
| the FFT you should read the review article {\em Fast Fourier |
| Transforms: A Tutorial Review and A State of the Art} by Duhamel and |
| Vetterli~\cite{duhamel90}. There are several introductory books on the |
| FFT with example programs, such as {\em The Fast Fourier Transform} by |
| Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms} |
| by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a |
| compendium of carefully-reviewed Fortran FFT programs in {\em Programs |
| for Digital Signal Processing}~\cite{committee79} which is a useful |
| reference for implementations of many different FFT algorithms. If you |
| are interested in using DSPs then the {\em Handbook of Real-Time Fast |
| Fourier Transforms}~\cite{smith95} provides detailed information on |
| the algorithms and hardware needed to design, build and test DSP |
| applications. Many FFT algorithms rely on results from number theory. |
| These results are covered in the books {\em Fast transforms: |
| algorithms, analyses, applications}, by Elliott and |
| Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal |
| Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital |
| Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is |
| also an annotated bibliography of papers on the FFT and related topics |
| by Burrus~\cite{burrus-note}. |
| |
| \section{Families of FFT algorithms} |
| % |
| There are two main families of FFT algorithms: the Cooley-Tukey |
| algorithm and the Prime Factor algorithm. These differ in the way they |
| map the full FFT into smaller sub-transforms. Of the Cooley-Tukey |
| algorithms there are two types of routine in common use: mixed-radix |
| (general-$N$) algorithms and radix-2 (power of 2) algorithms. Each |
| type of algorithm can be further classified by additional characteristics, |
| such as whether it operates in-place or uses additional scratch space, |
| whether its output is in a sorted or scrambled order, and whether it |
| uses decimation-in-time or -frequency iterations. |
| |
Mixed-radix algorithms work by factorizing the length of the data
vector into shorter lengths. These can then be transformed by small-$N$ FFTs.
| Typical programs include FFTs for small prime factors, such as 2, 3, |
| 5, \dots which are highly optimized. The small-$N$ FFT modules act as |
| building blocks and can be multiplied together to make longer |
| transforms. By combining a reasonable set of modules it is possible to |
| compute FFTs of many different lengths. If the small-$N$ modules are |
| supplemented by an $O(N^2)$ general-$N$ module then an FFT of any |
length can be computed, in principle. Of course, any length which
contains large prime factors would still require $O(N^2)$ operations.
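The selection of factors can be sketched as a simple greedy
factorization over the available modules. This is an illustrative
Python sketch (the module set shown is an assumption for the example,
not necessarily the set used by any particular library):

```python
def factorize(N, modules=(7, 6, 5, 4, 3, 2)):
    """Split N into factors drawn from the available small-N modules.
    Any remaining factor (e.g. a large prime) is left for a
    general-N O(N^2) module."""
    factors = []
    for f in modules:
        while N % f == 0:
            factors.append(f)
            N //= f
    if N > 1:
        factors.append(N)   # handled by the general-N module
    return factors
```

For example, `factorize(60)` yields `[6, 5, 2]`, while `factorize(13)`
falls back to the single factor `[13]`.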
| |
| Radix-2 algorithms, or ``power of two'' algorithms, are simplified |
| versions of the mixed-radix algorithm. They are restricted to lengths |
| which are a power of two. The basic radix-2 FFT module only involves |
| addition and subtraction, so the algorithms are very simple. Radix-2 |
| algorithms have been the subject of much research into optimizing the |
| FFT. Many of the most efficient radix-2 routines are based on the |
| ``split-radix'' algorithm. This is actually a hybrid which combines |
| the best parts of both radix-2 and radix-4 (``power of 4'') |
| algorithms~\cite{sorenson86,duhamel86}. |
| |
| The prime factor algorithm (PFA) is an alternative form of general-$N$ |
| algorithm based on a different way of recombining small-$N$ FFT |
| modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing |
| scheme which makes it attractive. However it only works in the case |
| where all factors are mutually prime. This requirement makes it more |
| suitable as a specialized algorithm for given lengths. |
| |
| |
| \begin{subsection}{FFTs of prime lengths} |
| % |
| Large prime lengths cannot be handled efficiently by any of these |
algorithms. However it may still be possible to compute the DFT, by using
| results from number theory. Rader showed that it is possible to |
| convert a length-$p$ FFT (where $p$ is prime) into a convolution of |
| length-($p-1$). There is a simple identity between the convolution of |
| length $N$ and the FFT of the same length, so if $p-1$ is easily |
| factorizable this allows the convolution to be computed efficiently |
| via the FFT. The idea is presented in the original paper by |
| Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but |
| for more details see the theoretical books mentioned earlier. |
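As an illustration of Rader's idea (a sketch for exposition, not GSL's
implementation), the following Python routine computes a prime-length
DFT through a cyclic convolution of length $p-1$. The convolution is
evaluated naively here; in a real implementation it would itself be
computed via FFTs of length $p-1$. A primitive root $g$ modulo $p$ must
be supplied (for example $g=3$ for $p=7$), and the three-argument
`pow` with a negative exponent (modular inverse) requires Python 3.8+:

```python
import cmath

def naive_dft(x):
    N = len(x)
    return [sum(x[b] * cmath.exp(-2j * cmath.pi * a * b / N) for b in range(N))
            for a in range(N)]

def rader_dft(x, g):
    """DFT of prime length p = len(x), via a length-(p-1) cyclic
    convolution.  g must be a primitive root modulo p."""
    p = len(x)
    M = p - 1
    W = cmath.exp(-2j * cmath.pi / p)
    a = [x[pow(g, q, p)] for q in range(M)]       # permuted inputs  x_{g^q}
    b = [W ** pow(g, -q, p) for q in range(M)]    # permuted twiddles W^{g^{-q}}
    # naive cyclic convolution; this is the part an FFT would accelerate
    conv = [sum(a[q] * b[(k - q) % M] for q in range(M)) for k in range(M)]
    h = [0j] * p
    h[0] = sum(x)                                 # DC term handled separately
    for k in range(M):
        h[pow(g, -k, p)] = x[0] + conv[k]
    return h
```

Comparing `rader_dft(x, 3)` against a naive DFT for length 7 confirms
the identity $X_{g^{-k}} = x_0 + \sum_q x_{g^q} W^{g^{q-k}}$.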
| \end{subsection} |
| |
| %\subsection{In-place algorithms vs algorithms using scratch space} |
| %% |
| %For simple algorithms, such as the Radix-2 algorithms, the FFT can be |
| %performed in-place, without additional memory. For this to be possible |
| %the index calculations must be simple enough that the calculation of |
| %the result to be stored in index $k$ can be carried out at the same |
| %time as the data in $k$ is used, so that no temporary storage is |
| %needed. |
| |
| %The mixed-radix algorithm is too complicated for this. It requires an |
| %area of scratch space sufficient to hold intermediate copies of the |
| %input data. |
| |
| %\subsection{Self-sorting algorithms vs scrambling algorithms} |
| %% |
| %A self-sorting algorithm At each iteration of the FFT internal permutations are included |
| %which leave the final iteration in its natural order. |
| |
| |
| %The mixed-radix algorithm can be made self-sorting. The additional |
| %scratch space helps here, although it is in fact possible to design |
| %self-sorting algorithms which do not require scratch |
| %space. |
| |
| |
| %The in-place radix-2 algorithm is not self-sorting. The data is |
| %permuted, and transformed into bit-reversed order. Note that |
| %bit-reversal only refers to the order of the array, not the values for |
| %each index which are of course not changed. More precisely, the data |
| %stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location |
| %$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a |
| %given index. Placing the data in bit reversed order is easily done, |
| %using right shifts to extract the least significant bits of the index |
| %and left-shifts to feed them into a register for the bit-reversed |
| %location. Simple radix-2 FFT usually recompute the bit reverse |
| %ordering in the naive way for every call. For repeated calls it might |
| %be worthwhile to precompute the permutation and cache it. There are |
| %also more sophisticated algorithms which reduce the number of |
| %operations needed to perform bit-reversal~\cite{bit-reversal}. |
| |
| |
| %\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)} |
| |
| %\end{subsection} |
| |
| |
| \subsection{Optimization} |
| % |
| There is no such thing as the single fastest FFT {\em algorithm}. FFT |
| algorithms involve a mixture of floating point calculations, integer |
| arithmetic and memory access. Each of these operations will have |
| different relative speeds on different platforms. The performance of |
| an algorithm is a function of the hardware it is implemented on. The |
| goal of optimization is thus to choose the algorithm best suited to |
| the characteristics of a given hardware platform. |
| |
| For example, the Winograd Fourier Transform (WFTA) is an algorithm |
| which is designed to reduce the number of floating point |
| multiplications in the FFT. However, it does this at the expense of |
| using many more additions and data transfers than other algorithms. |
| As a consequence the WFTA might be a good candidate algorithm for |
| machines where data transfers occupy a negligible time relative to |
| floating point arithmetic. However on most modern machines, where the |
| speed of data transfers is comparable to or slower than floating point |
| operations, it would be outperformed by an algorithm which used a |
| better mix of operations (i.e. more floating point operations but |
| fewer data transfers). |
| |
| For a study of this sort of effect in detail, comparing the different |
| algorithms on different platforms consult the paper {\it Effects of |
| Architecture Implementation on DFT Algorithm Performance} by Mehalic, |
| Rustan and Route~\cite{mehalic85}. The paper was written in the early |
1980s and has data for super- and mini-computers which you are
| unlikely to see today, except in a museum. However, the methodology is |
| still valid and it would be interesting to see similar results for |
| present day computers. |
| |
| |
| \section{FFT Concepts} |
| % |
| Factorization is the key principle of the mixed-radix FFT divide-and-conquer |
| strategy. If $N$ can be factorized into a product of $n_f$ integers, |
| % |
| \begin{equation} |
| N = f_1 f_2 ... f_{n_f} , |
| \end{equation} |
| % |
| then the FFT itself can be divided into smaller FFTs for each factor. |
| More precisely, an FFT of length $N$ can be broken up into, |
| % |
| \begin{quote} |
| \begin{tabular}{l} |
| $(N/f_1)$ FFTs of length $f_1$, \\ |
| $(N/f_2)$ FFTs of length $f_2$, \\ |
| \dots \\ |
| $(N/f_{n_f})$ FFTs of length $f_{n_f}$. |
| \end{tabular} |
| \end{quote} |
| % |
The total number of operations for these sub-transforms will be $O(
N(f_1 + f_2 + ... + f_{n_f}))$. When the factors of $N$ are all small
| integers this will be substantially less than $O(N^2)$. For example, |
| when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m |
| N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations. Here is a |
| demonstration which shows this: |
| |
| We start with the full DFT, |
| % |
| \begin{eqnarray} |
| h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N) |
| \end{eqnarray} |
| % |
| and split the sum into even and odd terms, |
| % |
| \begin{eqnarray} |
| \phantom{h_a} |
| % &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\ |
| &=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} + |
| \sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}. |
| \end{eqnarray} |
| % |
| This converts the original DFT of length $N$ into two DFTs of length |
| $N/2$, |
| % |
| \begin{equation} |
| h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab} + |
| W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab} |
| \end{equation} |
| % |
The first term is a DFT of the even elements of $g$. The second term
is a DFT of the odd elements of $g$, premultiplied by an exponential
factor $W_N^a$ (known as a {\em twiddle factor}).
%
\begin{equation}
\mathrm{DFT}(h) = \mathrm{DFT}(g_{even}) + W_N^a \mathrm{DFT}(g_{odd})
\end{equation}
| % |
| By splitting the DFT into its even and odd parts we have reduced the |
| operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$ |
| (for two DFTs of length $N/2$). The cost of the splitting is that we |
| need an additional $O(N)$ operations to multiply by the twiddle factor |
$W_N^a$ and recombine the two sums.
| |
| We can repeat the splitting procedure recursively $\log_2 N$ times |
| until the full DFT is reduced to DFTs of single terms. The DFT of a |
| single value is just the identity operation, which costs nothing. |
| However since $O(N)$ operations were needed at each stage to recombine |
| the even and odd parts the total number of operations to obtain the |
| full DFT is $O(N \log_2 N)$. If we had used a length which was a |
| product of factors $f_1$, $f_2$, $\dots$ we could have split the sum |
| in a similar way. First we would split terms corresponding to the |
| factor $f_1$, instead of the even and odd terms corresponding to a |
| factor of two. Then we would repeat this procedure for the subsequent |
| factors. This would lead to a final operation count of $O(N \sum |
| f_i)$. |
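The even/odd splitting translates directly into a recursive procedure.
Here is a minimal Python sketch of the idea (for illustration only, not
an efficient implementation):

```python
import cmath

def fft(g):
    """Recursive radix-2 FFT: DFT(g) = DFT(g_even) + W_N^a DFT(g_odd).
    len(g) must be a power of two."""
    N = len(g)
    if N == 1:
        return list(g)              # DFT of a single value is the identity
    even = fft(g[0::2])             # DFT of the even-indexed elements
    odd = fft(g[1::2])              # DFT of the odd-indexed elements
    W = cmath.exp(-2j * cmath.pi / N)
    h = [0j] * N
    for a in range(N // 2):
        t = W ** a * odd[a]         # twiddle factor W_N^a
        h[a] = even[a] + t
        h[a + N // 2] = even[a] - t # uses W_N^{a + N/2} = -W_N^a
    return h
```

For example, `fft([1, 2, 3, 4])` returns `[10, -2+2j, -2, -2-2j]` up to
rounding error, matching the naive DFT.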
| |
| This procedure gives some motivation for why the number of operations |
| in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$. |
| It does not give a good explanation of how to implement the algorithm |
| in practice which is what we shall do in the next section. |
| |
| \section{Radix-2 Algorithms} |
| % |
| For radix-2 FFTs it is natural to write array indices in binary form |
| because the length of the data is a power of two. This is nicely |
| explained in the article {\em The FFT: Fourier Transforming One Bit at |
| a Time} by P.B. Visscher~\cite{visscher96}. A binary representation |
| for indices is the key to deriving the simplest efficient radix-2 |
| algorithms. |
| |
We can write an index $b$ ($0 \leq b < 2^{n}$) in binary
| representation like this, |
| % |
| \begin{equation} |
| b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 . |
| \end{equation} |
| % |
| Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of |
| $b$. |
| |
| Using this notation the original definition of the DFT |
| can be rewritten as a sum over the bits of $b$, |
| % |
| \begin{equation} |
| h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) |
| \end{equation} |
| % |
| to give an equivalent summation like this, |
| % |
| \begin{equation} |
| h([a_{n-1} \dots a_1 a_0]) = |
| \sum_{b_0=0}^{1} |
| \sum_{b_1=0}^{1} |
| \dots |
| \sum_{b_{n-1}=0}^{1} |
| g([b_{n-1} \dots b_1 b_0]) W_N^{ab} |
| \end{equation} |
| % |
| where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$. |
| |
| To reduce the number of operations in the sum we will use the |
| periodicity of the exponential term, |
| % |
| \begin{equation} |
| W_N^{x+N} = W_N^{x}. |
| \end{equation} |
| % |
| Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By |
| making use of this periodicity they can all be collapsed down into the |
| range $0\dots N-1$. This allows us to reduce the number of operations |
| by combining common terms, modulo $N$. Using this idea we can derive |
| decimation-in-time or decimation-in-frequency algorithms, depending on |
| how we break the DFT summation down into common terms. We'll first |
| consider the decimation-in-time algorithm. |
| |
| \subsection{Radix-2 Decimation-in-Time (DIT)} |
| % |
To derive the decimation-in-time algorithm we start by separating
| out the most significant bit of the index $b$, |
| % |
| \begin{equation} |
| [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0] |
| \end{equation} |
| % |
| Now we can evaluate the innermost sum of the DFT without any |
| dependence on the remaining bits of $b$ in the exponential, |
| % |
| \begin{eqnarray} |
| h([a_{n-1} \dots a_1 a_0]) &=& |
| \sum_{b_0=0}^{1} |
| \sum_{b_1=0}^{1} |
| \dots |
| \sum_{b_{n-1}=0}^{1} |
| g(b) |
| W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\ |
| &=& |
| \sum_{b_0=0}^{1} |
| \sum_{b_1=0}^{1} |
| \dots |
| \sum_{b_{n-2}=0}^{1} |
| W_N^{a[b_{n-2} \dots b_1 b_0]} |
| \sum_{b_{n-1}=0}^{1} |
| g(b) |
| W_N^{a(2^{n-1}b_{n-1})} |
| \end{eqnarray} |
| % |
Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can
remove most of the dependence on $a$ as well, by using the periodicity
of the exponential,
| % |
| \begin{eqnarray} |
| W_N^{a(2^{n-1}b_{n-1})} &=& |
| \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\ |
| &=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\ |
| &=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\ |
| &=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\ |
| &=& W_2^{a_0 b_{n-1}} |
| \end{eqnarray} |
| % |
| Thus the innermost exponential term simplifies so that it only |
| involves the highest order bit of $b$ and the lowest order bit of $a$, |
| and the sum can be reduced to, |
| % |
| \begin{equation} |
| h([a_{n-1} \dots a_1 a_0]) |
| = |
| \sum_{b_0=0}^{1} |
| \sum_{b_1=0}^{1} |
| \dots |
| \sum_{b_{n-2}=0}^{1} |
| W_N^{a[b_{n-2} \dots b_1 b_0]} |
| \sum_{b_{n-1}=0}^{1} |
| g(b) |
| W_2^{a_0 b_{n-1}}. |
| \end{equation} |
| % |
We can repeat this procedure for the next most significant bit of
$b$, $b_{n-2}$, using a similar identity,
| % |
| \begin{eqnarray} |
| W_N^{a(2^{n-2}b_{n-2})} |
| &=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\ |
| &=& W_4^{ [a_1 a_0] b_{n-2}}. |
| \end{eqnarray} |
| % |
| to give a formula with even less dependence on the bits of $a$, |
| % |
| \begin{equation} |
| h([a_{n-1} \dots a_1 a_0]) |
| = |
| \sum_{b_0=0}^{1} |
| \sum_{b_1=0}^{1} |
| \dots |
| \sum_{b_{n-3}=0}^{1} |
| W_N^{a[b_{n-3} \dots b_1 b_0]} |
| \sum_{b_{n-2}=0}^{1} |
| W_4^{[a_1 a_0] b_{n-2}} |
| \sum_{b_{n-1}=0}^{1} |
| g(b) |
| W_2^{a_0 b_{n-1}}. |
| \end{equation} |
| % |
| If we repeat the process for all the remaining bits we obtain a |
| simplified DFT formula which is the basis of the radix-2 |
| decimation-in-time algorithm, |
| % |
| \begin{eqnarray} |
| h([a_{n-1} \dots a_1 a_0]) &=& |
| \sum_{b_0=0}^{1} |
| W_{N}^{[a_{n-1} \dots a_1 a_0]b_0} |
| %\sum_{b_1=0}^{1} |
| %W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1} |
| \dots |
| \sum_{b_{n-2}=0}^{1} |
| W_4^{ [a_1 a_0] b_{n-2}} |
| \sum_{b_{n-1}=0}^{1} |
| W_2^{a_0 b_{n-1}} |
| g(b) |
| \end{eqnarray} |
| % |
| To convert the formula to an algorithm we expand out the sum |
| recursively, evaluating each of the intermediate summations, which we |
| denote by $g_1$, $g_2$, \dots, $g_n$, |
| % |
| \begin{eqnarray} |
| g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) |
| &=& |
| \sum_{b_{n-1}=0}^{1} |
| W_2^{a_0 b_{n-1}} |
| g([b_{n-1} b_{n-2} b_{n-3} \dots b_1 b_0]) \\ |
| g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0) |
| &=& |
| \sum_{b_{n-2}=0}^{1} |
| W_4^{[a_1 a_0] b_{n-2}} |
| g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\ |
| g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0) |
| &=& |
| \sum_{b_{n-3}=0}^{1} |
W_8^{[a_2 a_1 a_0] b_{n-3}}
| g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\ |
| \dots &=& \dots \\ |
| g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1}) |
| &=& |
| \sum_{b_{0}=0}^{1} |
| W_N^{[a_{n-1} \dots a_1 a_0]b_0} |
| g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0) |
| \end{eqnarray} |
| % |
| After the final sum, we can obtain the transform $h$ from $g_n$, |
| % |
| \begin{equation} |
| h([a_{n-1} \dots a_1 a_0]) |
| = |
| g_n(a_0, a_1, \dots, a_{n-1}) |
| \end{equation} |
| % |
| Note that we left the storage arrangements of the intermediate sums |
| unspecified by using the bits as function arguments and not as an |
| index. The storage of intermediate sums is different for the |
| decimation-in-time and decimation-in-frequency algorithms. |
| |
| Before deciding on the best storage scheme we'll show that the results |
| of each stage, $g_1$, $g_2$, \dots, can be carried out {\em |
| in-place}. For example, in the case of $g_1$, the inputs are, |
| % |
| \begin{equation} |
| g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0]) |
| \end{equation} |
| % |
| for $b_{n-1}=(0,1)$, and the corresponding outputs are, |
| % |
| \begin{equation} |
| g_1(\underline{a_0},b_{n-2}, b_{n-3}, \dots, b_1, b_0) |
| \end{equation} |
| % |
| for $a_0=(0,1)$. It's clear that if we hold $b_{n-2}, b_{n-3}, \dots, |
| b_1, b_0$ fixed and compute the sum over $b_{n-1}$ in memory for both |
values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
location which originally had $b_{n-1}=0$ and the result for $a_0=1$ in
the location which originally had $b_{n-1}=1$. The two inputs and two
| outputs are known as {\em dual node pairs}. At each stage of the |
| calculation the sums for each dual node pair are independent of the |
| others. It is this property which allows an in-place calculation. |
| |
| So for an in-place pass our storage has to be arranged so that the two |
| outputs $g_1(a_0,\dots)$ overwrite the two input terms |
| $g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the |
| natural order of $b$. i.e. the least significant bit of $a$ |
| replaces the most significant bit of $b$. This is inconvenient |
| because $a$ occurs in its natural order in all the exponentials, |
$W^{ab}$. We could keep track of both $a$ and its bit-reversed
counterpart at all times, but there is a neat trick
| which avoids this: if we bit-reverse the order of the input data $g$ |
| before we start the calculation we can also bit-reverse the order of |
| $a$ when storing intermediate results. Since the storage involving $a$ |
| was originally in bit-reversed order the switch in the input $g$ now |
| allows us to use normal ordered storage for $a$, the same ordering |
| that occurs in the exponential factors. |
| |
| This is complicated to explain, so here is an example of the 4 passes |
| needed for an $N=16$ decimation-in-time FFT, with the initial data |
| stored in bit-reversed order, |
| % |
| \begin{eqnarray} |
| g_1([b_0 b_1 b_2 a_0]) |
| &=& |
| \sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3]) |
| \\ |
| g_2([b_0 b_1 a_1 a_0]) |
| &=& |
| \sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0]) |
| \\ |
| g_3([b_0 a_2 a_1 a_0]) |
| &=& |
| \sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0]) |
| \\ |
| h(a) = g_4([a_3 a_2 a_1 a_0]) |
| &=& |
| \sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0]) |
| \end{eqnarray} |
| % |
| We compensate for the bit reversal of the input data by accessing $g$ |
| with the bit-reversed form of $b$ in the first stage. This ensures |
| that we are still carrying out the same calculation, using the same |
| data, and not accessing different values. Only single bits of $b$ ever |
| occur in the exponential so we never need the bit-reversed form of |
| $b$. |
| |
| Let's examine the third pass in detail, |
| % |
| \begin{equation} |
| g_3([b_0 a_2 a_1 a_0]) |
| = |
| \sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0]) |
| \end{equation} |
| % |
| First note that only one bit, $b_1$, varies in each summation. The |
| other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially |
| ``spectators'' -- we must loop over all combinations of these bits and |
| carry out the same basic calculation for each, remembering to update |
| the exponentials involving $W_8$ appropriately. If we are storing the |
results in-place (with $g_3$ overwriting $g_2$) we will need to compute
| the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously. |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\ |
| g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}} |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_0 1 a_1 a_0])
| \end{array} |
| \right) |
| \end{equation} |
| % |
| We can write this in a more symmetric form by simplifying the exponential, |
| % |
| \begin{equation} |
| W_8^{[a_2 a_1 a_0]} |
| = W_8^{4 a_2 + [a_1 a_0]} |
| = (-1)^{a_2} W_8^{[a_1 a_0]} |
| \end{equation} |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\ |
| g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}} |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0])
| \end{array} |
| \right) |
| \end{equation} |
| % |
| The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle |
factors}. The form of this calculation, a symmetrical sum and
difference involving a twiddle factor, is called a {\em butterfly}.
| It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$ |
| would be drawn like this, |
| % |
| \begin{center} |
| \setlength{\unitlength}{1cm} |
| \begin{picture}(9,4) |
| % |
| %\put(0,0){\line(1,0){8}} |
| %\put(0,0){\line(0,1){4}} |
| %\put(8,4){\line(0,-1){4}} |
| %\put(8,4){\line(-1,0){8}} |
| % |
| \put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$} |
| \put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$} |
| \put(1,1){\vector(1,0){0.5}} |
| \put(1.5,1){\line(1,0){0.5}} |
| \put(1.5,0.5){$W^a_8$} |
| \put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}} |
| \put(2,1){\circle*{0.1}} |
| \put(2,3){\circle*{0.1}} |
| \put(2,1){\vector(1,1){2}} |
| \put(2,1){\vector(1,0){1}} |
| \put(3,1){\line(1,0){1}} |
| \put(3,0.5){$-1$} |
| \put(2,3){\vector(1,-1){2}} |
| \put(2,3){\vector(1,0){1}} |
| \put(3,3){\line(1,0){1}} |
| \put(4,1){\circle*{0.1}} |
| \put(4,3){\circle*{0.1}} |
| \end{picture} |
| \end{center} |
| % |
| The inputs are shown on the left and the outputs on the right. The |
| outputs are computed by multiplying the incoming lines by their |
| accompanying factors (shown next to the lines) and summing the results |
| at each node. |
| |
| In general, denoting the bit for dual-node pairs by $\Delta$ and the |
| remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the |
| butterfly is, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g({\hat b} + {\hat a}) \\ |
| g({\hat b} + \Delta + {\hat a}) \\ |
| \end{array} |
| \right) |
| \leftarrow |
| \left( |
| \begin{array}{c} |
| g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\ |
| g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a}) |
| \end{array} |
| \right) |
| \end{equation} |
| % |
where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/(2\Delta) -
1)2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
| second pass and $2^{n-1}$ on the $n$-th pass. Each pass requires |
| $N/2$ in-place computations, each involving two input locations and |
| two output locations. |
| |
| In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and |
| ${\hat b} = [b_0 0 0 0]$. |
| |
| This leads to the canonical radix-2 decimation-in-time FFT algorithm |
| for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$. |
| % |
| \begin{algorithm} |
| \STATE bit-reverse ordering of $g$ |
| \STATE {$\Delta \Leftarrow 1$} |
| \FOR {$\mbox{pass} = 1 \dots n$} |
| \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$} |
| \FOR {$(a = 0 ; a < \Delta ; a{++})$} |
| \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$} |
| \STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$} |
| \STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$} |
| \STATE{$g(b+a) \Leftarrow t_0$} |
| \STATE{$g(b+\Delta+a) \Leftarrow t_1$} |
| \ENDFOR |
| \ENDFOR |
| \STATE{$\Delta \Leftarrow 2\Delta$} |
| \ENDFOR |
| \end{algorithm} |
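The algorithm above transcribes almost line for line into code. The
following is an illustrative Python sketch (not the GSL routine); for
simplicity the bit-reversal step is done naively with index strings
rather than with a fast in-place method:

```python
import cmath

def bit_reverse(g):
    """Return a copy of g permuted into bit-reversed index order
    (len(g) must be a power of two)."""
    N = len(g)
    n = N.bit_length() - 1              # number of index bits
    out = list(g)
    for i in range(N):
        j = int(format(i, '0{}b'.format(n))[::-1], 2)
        out[j] = g[i]
    return out

def fft_dit(g):
    """Radix-2 decimation-in-time FFT, following the pseudocode above."""
    g = bit_reverse(g)
    N = len(g)
    delta = 1
    while delta < N:                    # one iteration per pass
        W = cmath.exp(-2j * cmath.pi / (2 * delta))
        for a in range(delta):
            Wa = W ** a                 # twiddle factor W^a
            for b in range(0, N, 2 * delta):
                t0 = g[b + a] + Wa * g[b + delta + a]
                t1 = g[b + a] - Wa * g[b + delta + a]
                g[b + a] = t0
                g[b + delta + a] = t1
        delta *= 2
    return g
```

For example, `fft_dit([1, 2, 3, 4])` gives `[10, -2+2j, -2, -2-2j]` up
to rounding error.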
| % |
| %This algorithm appears in Numerical Recipes as the routine {\tt |
| %four1}, where the implementation is attributed to N.M. Brenner. |
| % |
| \subsection{Details of the Implementation} |
| It is straightforward to implement a simple radix-2 decimation-in-time |
| routine from the algorithm above. Some optimizations can be made by |
| pulling the special case of $a=0$ out of the loop over $a$, to avoid |
| unnecessary multiplications when $W^a=1$, |
| % |
| \begin{algorithm} |
| \FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$} |
| \STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$} |
| \STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$} |
| \STATE{$g(b) \Leftarrow t_0$} |
| \STATE{$g(b+\Delta) \Leftarrow t_1$} |
| \ENDFOR |
| \end{algorithm} |
| % |
| There are several algorithms for doing fast bit-reversal. We use the |
| Gold-Rader algorithm, which is simple and does not require any working |
| space, |
| % |
| \begin{algorithm} |
\STATE {$j \Leftarrow 0$}
\FOR{$i = 0 \dots N - 2$}
\STATE {$ k \Leftarrow N / 2 $}
| \IF {$i < j$} |
| \STATE {swap $g(i)$ and $g(j)$} |
| \ENDIF |
| |
| \WHILE {$k \leq j$} |
| \STATE{$j \Leftarrow j - k$} |
| \STATE{$k \Leftarrow k / 2$} |
| \ENDWHILE |
| |
| \STATE{$j \Leftarrow j + k$} |
| \ENDFOR |
| \end{algorithm} |
| % |
| The Gold-Rader algorithm is typically twice as fast as a naive |
| bit-reversal algorithm (where the bit reversal is carried out by |
| left-shifts and right-shifts on the index). The library also has a |
| routine for the Rodriguez bit reversal algorithm, which also does not |
| require any working space~\cite{rodriguez89}. There are faster bit |
| reversal algorithms, but they all use additional scratch |
| space~\cite{rosel89}. |
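In Python the Gold-Rader scheme could be written as follows (a sketch
for illustration; the index $j$ tracks the bit-reverse of $i$ and is
updated by a reversed carry propagation, so no scratch space is needed):

```python
def gold_rader_bit_reverse(g):
    """In-place Gold-Rader bit-reversal permutation of g,
    for len(g) a power of two."""
    N = len(g)
    j = 0
    for i in range(N - 1):              # i = 0 .. N-2
        k = N // 2
        if i < j:
            g[i], g[j] = g[j], g[i]     # swap each pair exactly once
        # add 1 to j "in reverse": propagate the carry from the top bit
        while k <= j:
            j -= k
            k //= 2
        j += k
    return g
```

For example, `gold_rader_bit_reverse(list(range(8)))` produces
`[0, 4, 2, 6, 1, 5, 3, 7]`.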
| |
| Within the loop for $a$ we can compute $W^a$ using a trigonometric |
| recursion relation, |
| % |
| \begin{eqnarray} |
| W^{a+1} &=& W W^a \\ |
&=& (\cos(2\pi/2\Delta) - i \sin(2\pi/2\Delta)) W^a
| \end{eqnarray} |
| % |
This requires only $2 \log_2 N$ trigonometric calls, to compute the
initial values of $\exp(-2\pi i /2\Delta)$ for each pass.
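This bookkeeping can be sketched as follows (illustrative Python; one
complex exponential, i.e. one cosine and one sine, per pass):

```python
import cmath

def twiddles(delta):
    """Generate W^a for a = 0 .. delta-1 by the recursion
    W^{a+1} = W * W^a, with W = exp(-2 pi i / (2*delta))."""
    W = cmath.exp(-2j * cmath.pi / (2 * delta))   # one trig evaluation
    Wa = 1 + 0j                                   # W^0
    out = []
    for a in range(delta):
        out.append(Wa)
        Wa *= W                                   # advance the recursion
    return out
```

For long passes the recursion slowly accumulates rounding error, which
is why some implementations periodically refresh $W^a$ from the
trigonometric functions directly.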
| |
| \subsection{Radix-2 Decimation-in-Frequency (DIF)} |
| % |
| To derive the decimation-in-frequency algorithm we start by separating |
| out the lowest order bit of the index $a$. Here is an example for the |
| decimation-in-frequency $N=16$ DFT. |
| % |
| \begin{eqnarray} |
| W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]} |
| &=& |
| W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3 |
| 0 0 0]} \\ |
| &=& |
| W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0 |
| b_3} \\ |
| &=& |
| W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3} |
| \end{eqnarray} |
| % |
By repeating the same type of expansion on the term,
| % |
| \begin{equation} |
| W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} |
| \end{equation} |
| % |
| we can reduce the transform to an alternative simple form, |
| % |
| \begin{equation} |
| h(a) = |
| \sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0} |
| \sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]} |
| \sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]} |
| \sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b) |
| \end{equation} |
| % |
| To implement this we can again write the sum recursively. In this case |
| we do not have the problem of the order of $a$ being bit reversed -- |
| the calculation can be done in-place using the natural ordering of |
| $a$ and $b$, |
| % |
| \begin{eqnarray} |
| g_1([a_0 b_2 b_1 b_0]) |
| &=& |
| W_{16}^{a_0 [b_2 b_1 b_0]} |
| \sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\ |
| g_2([a_0 a_1 b_1 b_0]) |
| &=& |
| W_{8}^{a_1 [b_1 b_0]} |
| \sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\ |
| g_3([a_0 a_1 a_2 b_0]) |
| &=& |
| W_{4}^{a_2 b_0} |
| \sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\ |
| h(a) |
| = |
| g_4([a_0 a_1 a_2 a_3]) |
| &=& |
| \sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0]) |
| \end{eqnarray} |
| % |
| The final pass leaves the data for $h(a)$ in bit-reversed order, but |
| this is easily fixed by a final bit-reversal of the ordering. |
| |
| The basic in-place calculation or butterfly for each pass is slightly |
| different from the decimation-in-time version, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g({\hat a} + {\hat b}) \\ |
| g({\hat a} + \Delta + {\hat b}) \\ |
| \end{array} |
| \right) |
| \leftarrow |
| \left( |
| \begin{array}{c} |
| g({\hat a} + {\hat b}) + g({\hat a} + \Delta + {\hat b})\\ |
W_{2\Delta}^{\hat b}
| \left( g({\hat a} + {\hat b}) - g({\hat a} + \Delta + {\hat b}) \right) |
| \end{array} |
| \right) |
| \end{equation} |
| % |
In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
a}$ runs through $0, 2\Delta, \dots, N-2\Delta$. On the first
pass we start with $\Delta=8$, and on subsequent passes $\Delta$ takes
the values $4, 2, 1$.
| |
| This leads to the canonical radix-2 decimation-in-frequency FFT |
| algorithm for $2^n$ data points stored in the array $g(0) \dots |
| g(2^n-1)$. |
| % |
| \begin{algorithm} |
| \STATE {$\Delta \Leftarrow 2^{n-1}$} |
| \FOR {$\mbox{pass} = 1 \dots n$} |
| \STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$} |
| \FOR {$(b = 0 ; b < \Delta ; b++)$} |
| \FOR{$(a = 0 ; a < N ; a += 2*\Delta)$} |
\STATE{$t_0 \Leftarrow g(a+b) + g(a+\Delta+b)$}
| \STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$} |
| \STATE{$g(a+b) \Leftarrow t_0$} |
| \STATE{$g(a+\Delta+b) \Leftarrow t_1$} |
| \ENDFOR |
| \ENDFOR |
| \STATE{$\Delta \Leftarrow \Delta/2$} |
| \ENDFOR |
| \STATE bit-reverse ordering of $g$ |
| \end{algorithm} |
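The pseudocode above can be transcribed almost line for line. The
following Python sketch (an illustration, not the GSL implementation;
the function name is ours) performs the $n$ passes and the final
bit-reversal:

```python
import cmath

def fft_radix2_dif(g):
    """Radix-2 decimation-in-frequency FFT of a length 2**n list,
    transcribed from the pseudocode above (illustrative, not GSL code)."""
    g = list(g)
    N = len(g)
    n = N.bit_length() - 1
    delta = N // 2
    for _ in range(n):                       # one pass per factor of 2
        W = cmath.exp(-2j * cmath.pi / (2 * delta))
        for b in range(delta):
            Wb = W ** b                      # twiddle factor W^b
            for a in range(0, N, 2 * delta):
                t0 = g[a + b] + g[a + delta + b]
                t1 = Wb * (g[a + b] - g[a + delta + b])
                g[a + b] = t0
                g[a + delta + b] = t1
        delta //= 2
    # final bit-reversal of the ordering
    for i in range(N):
        j = int(format(i, '0%db' % n)[::-1], 2)
        if j > i:
            g[i], g[j] = g[j], g[i]
    return g
```

Comparing the output against a direct $O(N^2)$ evaluation of the DFT
sum provides a simple correctness check.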
| % |
| |
| \section{Self-Sorting Mixed-Radix Complex FFTs} |
| % |
| This section is based on the review article {\em Self-sorting |
| Mixed-Radix Fast Fourier Transforms} by Clive |
| Temperton~\cite{temperton83}. You should consult his article for full |
| details of all the possible algorithms (there are many |
| variations). Here I have annotated the derivation of the simplest |
| mixed-radix decimation-in-frequency algorithm. |
| |
For general-$N$ FFT algorithms the simple binary notation of radix-2
algorithms is no longer useful. The mixed-radix FFT has to be built
| up using products of matrices acting on a data vector. The aim is to |
| take the full DFT matrix $W_N$ and factor it into a set of small, |
| sparse matrices corresponding to each factor of $N$. |
| |
| |
| We'll denote the components of matrices using either subscripts or |
| function notation, |
| % |
| \begin{equation} |
| M_{ij} = M(i,j) |
| \end{equation} |
| % |
| with (C-like) indices running from 0 to $N-1$. Matrix products will be |
| denoted using square brackets, |
| % |
| \begin{equation} |
| [AB]_{ij} = \sum_{k} A_{ik} B_{kj} |
| \end{equation} |
| % |
| % |
| Three special matrices will be needed in the mixed-radix factorization |
| of the DFT: the identity matrix, $I$, a permutation matrix, $P$ and a |
| matrix of twiddle factors, $D$, as well as the normal DFT matrices |
| $W_n$. |
| |
| We write the identity matrix of order $r$ as $I_r(n,m)$, |
| % |
| \begin{equation} |
| I_r(n,m) = \delta_{nm} |
| \end{equation} |
| % |
| for $0 \leq n,m \leq r-1$. |
| |
| We also need to define a permutation matrix $P^a_b$ that performs |
| digit reversal of the ordering of a vector. If the index of a vector |
| $j= 0\dots N-1$ is factorized into $j = la +m$, with $0 \leq l \leq |
| b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will |
exchange positions $la+m$ and $mb+l$ in the vector (this sort of
digit-reversal is the generalization of bit-reversal to a number
system with bases $a$ and $b$).
| |
| In mathematical terms $P$ is a square matrix of size $ab \times ab$ |
| with the property, |
| % |
| \begin{eqnarray} |
| P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r \\ |
| &=& 0 ~\mbox{otherwise} |
| \end{eqnarray} |
| % |
| |
| Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$, |
| for the trigonometric sums. $D^a_b$ is a diagonal square matrix of |
| size $ab \times ab$ with the definition, |
| % |
| \begin{eqnarray} |
| D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r \\ |
| &=& 0 ~\mbox{otherwise} |
| \end{eqnarray} |
| % |
| where $\omega_{ab} = e^{-2\pi i/ab}$. |
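Both matrices are straightforward to construct explicitly. The
following Python sketch (hypothetical helper names) builds dense
versions of $P^a_b$ and $D^a_b$ directly from the definitions above:

```python
import cmath

def perm_matrix(a, b):
    """P^a_b: P(j,k) = 1 if j = r*a + s and k = s*b + r, else 0."""
    N = a * b
    P = [[0] * N for _ in range(N)]
    for r in range(b):
        for s in range(a):
            P[r * a + s][s * b + r] = 1
    return P

def twiddle_matrix(a, b):
    """D^a_b: diagonal, D(j,j) = omega_{ab}**(s*r) for j = s*b + r."""
    N = a * b
    w = cmath.exp(-2j * cmath.pi / N)
    D = [[0] * N for _ in range(N)]
    for s in range(a):
        for r in range(b):
            D[s * b + r][s * b + r] = w ** (s * r)
    return D
```

Applying {\tt perm\_matrix(a, b)} to a vector moves the entry at
position $mb+l$ to position $la+m$, which is exactly the digit
reversal described above.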
| |
| |
| \subsection{The Kronecker Product} |
| The Kronecker matrix product plays an important role in all the |
| algorithms for combining operations on different subspaces. The |
| Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of |
| sizes $a \times a$ and $b \times b$ respectively, is a square matrix |
| of size $a b \times a b$, defined as, |
| % |
| \begin{equation} |
| [A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s) |
| \end{equation} |
| % |
| where $0 \leq u,s < b$ and $0 \leq t,r < a$. Let's examine a specific |
| example. If we take a $2 \times 2$ matrix and a $3 |
| \times 3$ matrix, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| A = |
| \left( |
| \begin{array}{cc} |
| a_{11} & a_{12} \\ |
| a_{21} & a_{22} |
| \end{array} |
| \right) |
| & |
| B = |
| \left( |
| \begin{array}{ccc} |
| b_{11} & b_{12} & b_{13} \\ |
| b_{21} & b_{22} & b_{23} \\ |
| b_{31} & b_{32} & b_{33} |
| \end{array} |
| \right) |
| \end{array} |
| \end{equation} |
| % |
| then the Kronecker product $A \otimes B$ is, |
| % |
| \begin{eqnarray} |
| A \otimes B &= & |
| \left( |
| \begin{array}{cc} |
| a_{11} B & a_{12} B \\ |
a_{21} B & a_{22} B
| \end{array} |
| \right) \\ |
| &=& |
| \left( |
| \begin{array}{cccccc} |
| a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} & |
| a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\ |
| a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} & |
| a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\ |
| a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} & |
| a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\ |
| a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} & |
| a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\ |
| a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} & |
| a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\ |
| a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} & |
| a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33} |
| \end{array} |
| \right) |
| \end{eqnarray} |
| % |
| When the Kronecker product $A \otimes B$ acts on a vector of length |
| $ab$, each matrix operates on a different subspace of the vector. |
Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
and $0\leq t\leq a-1$, we can see this explicitly by looking at components,
| % |
| \begin{eqnarray} |
| [(A \otimes B) v]_{(tb+u)} |
| & = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1} |
| [A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\ |
| & = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'} |
| \end{eqnarray} |
| % |
| The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and |
| the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$. |
| % |
| The most important property needed for deriving the FFT factorization |
| is that the matrix product of two Kronecker products is the Kronecker |
| product of the two matrix products, |
| % |
| \begin{equation} |
| (A \otimes B)(C \otimes D) = (AC \otimes BD) |
| \end{equation} |
| % |
| This follows straightforwardly from the original definition of the |
| Kronecker product. |
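The mixed-product property is easy to check numerically; here is a
small pure-Python sketch (helper names are ours):

```python
import random

def kron(A, B):
    """[A (x) B](t*b+u, r*b+s) = A[t][r] * B[u][s]."""
    a, b = len(A), len(B)
    return [[A[t][r] * B[u][s] for r in range(a) for s in range(b)]
            for t in range(a) for u in range(b)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

random.seed(1)
A = [[random.random() for _ in range(2)] for _ in range(2)]
C = [[random.random() for _ in range(2)] for _ in range(2)]
B = [[random.random() for _ in range(3)] for _ in range(3)]
D = [[random.random() for _ in range(3)] for _ in range(3)]

# (A (x) B)(C (x) D) should equal (AC (x) BD)
lhs = matmul(kron(A, B), kron(C, D))
rhs = kron(matmul(A, C), matmul(B, D))
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(6) for j in range(6))
assert err < 1e-12
```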
| |
| \subsection{Two factor case, $N=ab$} |
| % |
| First consider the simplest possibility, where the data length $N$ can |
| be divided into two factors, $N=ab$. The aim is to reduce the DFT |
| matrix $W_N$ into simpler matrices corresponding to each factor. To |
| make the derivation easier we will start from the known factorization |
| and verify it (the initial factorization can be guessed by |
| generalizing from simple cases). Here is the factorization we are |
| going to prove, |
| % |
| \begin{equation} |
| W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b). |
| \end{equation} |
| % |
| We can check it by expanding the product into components, |
| % |
| \begin{eqnarray} |
| \lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)} \\ |
| & = & |
| \sum_{u=0}^{b-1} \sum_{t=0}^{a-1} |
| [(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s) |
| \end{eqnarray} |
| % |
where we have split the indices to match the Kronecker product, $0 \leq
m, r \leq a-1$, $0 \leq l, s \leq b-1$. The first term in the sum can
| easily be reduced to its component form, |
| % |
| \begin{eqnarray} |
| [(W_b \otimes I_a)](la+m,ua+t) |
| &=& W_b(l,u) I_a(m,t) \\ |
| &=& \omega_b^{lu} \delta_{mt} |
| \end{eqnarray} |
| % |
| The second term is more complicated. We can expand the Kronecker |
| product like this, |
| \begin{eqnarray} |
| (W_a \otimes I_b)(tb+u,rb+s) |
&=& W_a(t,r) I_b(u,s) \\
| &=& \omega_a^{tr} \delta_{us} |
| \end{eqnarray} |
| % |
| and use this term to build up the product, $P^a_b D^a_b (W_a \otimes |
| I_b)$. We first multiply by $D^a_b$, |
| % |
| \begin{equation} |
| [D^a_b (W_a \otimes I_b)](tb+u,rb+s) |
| = |
| \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su} |
| \end{equation} |
| % |
| and then apply the permutation matrix, $P^a_b$, which digit-reverses |
| the ordering of the first index, to obtain, |
| % |
| \begin{equation} |
| [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s) |
| = |
| \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su} |
| \end{equation} |
| % |
| Combining the two terms in the matrix product we can obtain the full |
| expansion in terms of the exponential $\omega$, |
| % |
| \begin{eqnarray} |
| [(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s) |
| &=& |
| \sum_{u=0}^{b-1} \sum_{t=0}^{a-1} |
| \omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su} |
| \end{eqnarray} |
| % |
| If we evaluate this sum explicitly we can make the connection between |
| the product involving $W_a$ and $W_b$ (above) and the expansion of the |
| full DFT matrix $W_{ab}$, |
| % |
| \begin{eqnarray} |
| \sum_{u=0}^{b-1} \sum_{t=0}^{a-1} |
| \omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su} |
| &=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\ |
| &=& \omega^{als + ms +bmr}_{ab} \\ |
| &=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\ |
| &=& \omega^{(la+m)(rb+s)}_{ab} \\ |
| &=& W_{ab}(la+m,rb+s) |
| \end{eqnarray} |
| % |
The final line shows that the matrix product given above is identical to
the full two-factor DFT matrix, $W_{ab}$.
| % |
| Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be |
| broken down into a product of sub-transforms, $W_a$ and $W_b$, plus |
| permutations, $P$, and twiddle factors, $D$, according to the formula, |
| % |
| \begin{equation} |
| W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b). |
| \end{equation} |
| % |
| This relation is the foundation of the general-$N$ mixed-radix FFT algorithm. |
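The factorization can also be verified numerically for a specific
case. The following sketch checks the identity for $a=3$, $b=4$ using
dense matrices (clearly not how an FFT would be implemented, but a
direct test of the formula; all helper names are ours):

```python
import cmath

def W_dft(n):
    """DFT matrix W_n(j,k) = omega_n**(j*k), omega_n = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def kron(A, B):
    a, b = len(A), len(B)
    return [[A[t][r] * B[u][s] for r in range(a) for s in range(b)]
            for t in range(a) for u in range(b)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def perm_P(a, b):
    """P^a_b: P(j,k) = 1 if j = r*a + s and k = s*b + r."""
    N = a * b
    P = [[0] * N for _ in range(N)]
    for r in range(b):
        for s in range(a):
            P[r * a + s][s * b + r] = 1
    return P

def twiddle_D(a, b):
    """D^a_b: diagonal, D(j,j) = omega_{ab}**(s*r) for j = s*b + r."""
    N = a * b
    w = cmath.exp(-2j * cmath.pi / N)
    D = [[0] * N for _ in range(N)]
    for s in range(a):
        for r in range(b):
            D[s * b + r][s * b + r] = w ** (s * r)
    return D

a, b = 3, 4
rhs = matmul(matmul(kron(W_dft(b), identity(a)), perm_P(a, b)),
             matmul(twiddle_D(a, b), kron(W_dft(a), identity(b))))
lhs = W_dft(a * b)
err = max(abs(lhs[i][j] - rhs[i][j])
          for i in range(a * b) for j in range(a * b))
assert err < 1e-9
```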
| |
| \subsection{Three factor case, $N=abc$} |
| % |
| The result for the two-factor expansion can easily be generalized to |
| three factors. We first consider $abc$ as being a product of two |
| factors $a$ and $(bc)$, and then further expand the product $(bc)$ into |
| $b$ and $c$. The first step of the expansion looks like this, |
| % |
| \begin{eqnarray} |
| W_{abc} &=& W_{a(bc)}\\ |
| &=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) . |
| \end{eqnarray} |
| % |
| And after using the two-factor result to expand out $W_{bc}$ we obtain |
| the factorization of $W_{abc}$, |
| % |
| \begin{eqnarray} |
| W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a ) |
| P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\ |
&=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc})
| \end{eqnarray} |
| % |
| We can write this factorization in a product form, with one term for |
| each factor, |
| % |
| \begin{equation} |
| W_{abc} = T_3 T_2 T_1 |
| \end{equation} |
| % |
| where we read off $T_1$, $T_2$ and $T_3$, |
| % |
| \begin{eqnarray} |
| T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\ |
| T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\ |
| T_3 &=& (W_c \otimes I_{ab} ) |
| \end{eqnarray} |
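The three-factor result can be checked numerically in the same
dense-matrix style. This sketch verifies $W_{abc} = T_3 T_2 T_1$ for
$a=2$, $b=3$, $c=2$ (pure Python, illustrative helper names):

```python
import cmath

def W_dft(n):
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def kron(A, B):
    a, b = len(A), len(B)
    return [[A[t][r] * B[u][s] for r in range(a) for s in range(b)]
            for t in range(a) for u in range(b)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def perm_P(a, b):
    N = a * b
    P = [[0] * N for _ in range(N)]
    for r in range(b):
        for s in range(a):
            P[r * a + s][s * b + r] = 1
    return P

def twiddle_D(a, b):
    N = a * b
    w = cmath.exp(-2j * cmath.pi / N)
    D = [[0] * N for _ in range(N)]
    for s in range(a):
        for r in range(b):
            D[s * b + r][s * b + r] = w ** (s * r)
    return D

a, b, c = 2, 3, 2
N = a * b * c
T1 = matmul(perm_P(a, b * c),
            matmul(twiddle_D(a, b * c), kron(W_dft(a), identity(b * c))))
T2 = matmul(kron(matmul(perm_P(b, c), twiddle_D(b, c)), identity(a)),
            kron(W_dft(b), identity(a * c)))
T3 = kron(W_dft(c), identity(a * b))
prod = matmul(T3, matmul(T2, T1))
ref = W_dft(N)
err = max(abs(prod[i][j] - ref[i][j]) for i in range(N) for j in range(N))
assert err < 1e-9
```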
| % |
| |
| |
| \subsection{General case, $N=f_1 f_2 \dots f_{n_f}$} |
| % |
| If we continue the procedure that we have used for two- and |
| three-factors then a general pattern begins to emerge in the |
| factorization of $W_{f_1 f_2 \dots f_{n_f}}$. To see the beginning of |
| the pattern we can rewrite the three factor case as, |
| % |
| \begin{eqnarray} |
| T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\ |
| T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\ |
| T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} ) |
| \end{eqnarray} |
| % |
| using the special cases $P^c_1 = D^c_1 = I_c$. |
| % |
| In general, we can write the factorization of $W_N$ for $N= f_1 f_2 |
| \dots f_{n_f}$ as, |
| % |
| \begin{equation} |
| W_N = T_{n_f} \dots T_2 T_1 |
| \end{equation} |
| % |
| where the matrix factors $T_i$ are, |
| % |
| \begin{equation} |
| T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i} |
| \otimes I_{m_i}) |
| \end{equation} |
| % |
| We have defined the following three additional variables $p$, $q$ and |
| $m$ to denote different partial products of the factors, |
| % |
| \begin{eqnarray} |
| p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\ |
| q_i &=& N / p_i \\ |
| m_i &=& N / f_i |
| \end{eqnarray} |
| % |
| Note that the FFT modules $W$ are applied before the permutations $P$, |
| which makes this a decimation-in-frequency algorithm. |
| |
| \subsection{Implementation} |
| % |
| Now to the implementation of the algorithm. We start with a vector of |
| data, $z$, as input and want to apply the transform, |
| % |
| \begin{eqnarray} |
| x &=& W_N z \\ |
| &=& T_{n_f} \dots T_2 T_1 z |
| \end{eqnarray} |
| % |
| where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( |
| W_{f_i} \otimes I_{m_i})$. |
| |
| The outer structure of the implementation will be a loop over the |
| $n_f$ factors, applying each matrix $T_i$ to the vector in turn to |
| build up the complete transform. |
| % |
| \begin{algorithm} |
| \FOR{$(i = 1 \dots n_f)$} |
| \STATE{$v \Leftarrow T_i v $} |
| \ENDFOR |
| \end{algorithm} |
| % |
| The order of the factors is not important. Now we examine the iteration |
| $v \Leftarrow T_i v$, which we'll write as, |
| % |
| \begin{equation} |
| v' = |
| (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~ |
| ( W_{f_i} \otimes I_{m_i}) v |
| \end{equation} |
| % |
| There are two Kronecker product matrices in this iteration. The |
| rightmost matrix, which is the first to be applied, is a DFT of length |
| $f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$, |
| since it will be a temporary array, |
| % |
| \begin{equation} |
| t = (W_{f_i} \otimes I_{m_i}) v |
| \end{equation} |
| % |
| The second matrix applies a permutation and the exponential |
| twiddle-factors. We'll call this $v'$, since it is the result of the |
| full iteration on $v$, |
| % |
| \begin{equation} |
| v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t |
| \end{equation} |
| |
| The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by |
| an example. Suppose the factor is $f_i = 3$, and the length of the FFT |
| is $N=6$, then the relevant Kronecker product is, |
| % |
| \begin{equation} |
| t = (W_3 \otimes I_2) v |
| \end{equation} |
| % |
| which expands out to, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| t_0 \\ |
| t_1 \\ |
| t_2 \\ |
| t_3 \\ |
| t_4 \\ |
| t_5 |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{cccccc} |
| W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) & 0 \\ |
| 0 & W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) \\ |
| W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) & 0 \\ |
| 0 & W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) \\ |
| W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) & 0 \\ |
| 0 & W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{c} |
| v_0 \\ |
| v_1 \\ |
| v_2 \\ |
| v_3 \\ |
| v_4 \\ |
| v_5 |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| We can rearrange the components in a computationally convenient form, |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| t_0 \\ |
| t_2 \\ |
| t_4 \\ |
| t_1 \\ |
| t_3 \\ |
| t_5 |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{cccccc} |
| W_3(1,1) & W_3(1,2) & W_3(1,3) & 0 & 0 & 0 \\ |
| W_3(2,1) & W_3(2,2) & W_3(2,3) & 0 & 0 & 0 \\ |
| W_3(3,1) & W_3(3,2) & W_3(3,3) & 0 & 0 & 0 \\ |
| 0 & 0 & 0 & W_3(1,1) & W_3(1,2) & W_3(1,3) \\ |
| 0 & 0 & 0 & W_3(2,1) & W_3(2,2) & W_3(2,3) \\ |
| 0 & 0 & 0 & W_3(3,1) & W_3(3,2) & W_3(3,3) |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{c} |
| v_0 \\ |
| v_2 \\ |
| v_4 \\ |
| v_1 \\ |
| v_3 \\ |
| v_5 |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| which clearly shows that we just need to apply the $3\times 3$ DFT |
| matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$, |
| and independently to the remaining sub-vector $(v_1, v_3, v_5)$. |
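We can demonstrate this stride pattern directly: applying $W_3$ to
each interleaved sub-vector reproduces the action of the full
Kronecker product matrix. A short Python sketch (variable names are
ours):

```python
import cmath

def W_dft(n):
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

f, m = 3, 2                        # t = (W_3 (x) I_2) v, with N = 6
v = [complex(k + 1, -k) for k in range(f * m)]
W = W_dft(f)

# One length-f DFT on each of the m interleaved subsets (stride m).
t = [0] * (f * m)
for mu in range(m):
    for lam in range(f):
        t[lam * m + mu] = sum(W[lam][lp] * v[lp * m + mu] for lp in range(f))

# Direct evaluation of the Kronecker product for comparison.
ref = [sum((W[i // m][j // m] if i % m == j % m else 0) * v[j]
           for j in range(f * m)) for i in range(f * m)]
err = max(abs(x - y) for x, y in zip(t, ref))
assert err < 1e-9
```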
| |
| In the general case, if we index $t$ as $t_k = t(\lambda,\mu) = |
| t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within |
| each transform of length $f$ and $\mu = 0 \dots m-1$ labels the |
| independent subsets of data. We can see this by showing the |
| calculation with all indices present, |
| % |
| \begin{equation} |
| t = (W_f \otimes I_m) z |
| \end{equation} |
| % |
| becomes, |
| % |
| \begin{eqnarray} |
| t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1} |
| (W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')} |
z_{\lambda'm + \mu'} \\
| &=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'} |
| z_{\lambda'm+\mu'} \\ |
| &=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu} |
| \end{eqnarray} |
| % |
| The DFTs on the index $\lambda$ will be computed using |
| special optimized modules for each $f$. |
| |
| To calculate the next stage, |
| % |
| \begin{equation} |
| v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t |
| \end{equation} |
| % |
we note that the Kronecker product performs $p_{i-1}$ independent
multiplications by $PD$, one on each of $p_{i-1}$ different subsets of
$t$. The index $\mu$ of $t(\lambda,\mu)$, which runs from 0 to $m-1$,
can be split further because $m=p_{i-1}q$, i.e. we can write $\mu = a
p_{i-1} + b$, where $a = 0 \dots q-1$ and $b=0 \dots p_{i-1}-1$,
| % |
| \begin{eqnarray} |
| \lambda m + \mu &=& \lambda m + a p_{i-1} + b \\ |
| &=& (\lambda q + a) p_{i-1} + b. |
| \end{eqnarray} |
| % |
| Now we can expand the second stage, |
| % |
| \begin{eqnarray} |
| v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\ |
| v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'} |
| (P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')} |
| t_{\lambda' m + \mu'} \\ |
| v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'} |
| ( |
| P^f_q D^f_q \otimes I_{p_{i-1}} |
| )_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')} |
| t_{(\lambda' q + a')p_{i-1} +b'} |
| \end{eqnarray} |
| % |
| The first step in removing redundant indices is to take advantage of |
| the identity matrix $I$ and separate the subspaces of the Kronecker |
| product, |
| % |
| \begin{equation} |
| ( |
| P^f_q D^f_q \otimes I_{p_{i-1}} |
| )_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')} |
| = |
| (P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} |
| \delta_{bb'} |
| \end{equation} |
| % |
| This eliminates one sum, leaving us with, |
| % |
| \begin{equation} |
| v'_{(\lambda q + a) p_{i-1} + b} |
| = |
| \sum_{\lambda' a' } |
| (P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b} |
| \end{equation} |
| % |
| We can insert the definition of $D^f_q$ to give, |
| % |
| \begin{equation} |
| \phantom{v'_{(\lambda q + a) p_{i-1} + b}} |
| = \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')} |
| \omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b} |
| \end{equation} |
| % |
| Using the definition of $P^f_q$, which exchanges an index of $\lambda |
| q + a$ with $a f + \lambda$, we get a final result with no matrix |
| multiplication, |
| % |
| \begin{equation} |
| v'_{(a f + \lambda) p_{i-1} + b} |
| = \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b} |
| \end{equation} |
| % |
| All we have to do is premultiply each element of the temporary vector |
| $t$ by an exponential twiddle factor and store the result in another |
| index location, according to the digit reversal permutation of $P$. |
| |
| Here is the algorithm to implement the mixed-radix FFT, |
| % |
| \begin{algorithm} |
| \FOR{$i = 1 \dots n_f$} |
| \FOR{$a = 0 \dots q-1$} |
| \FOR{$b = 0 \dots p_{i-1} - 1$} |
| \FOR{$\lambda = 0 \dots f-1$} |
| \STATE{$t_\lambda \Leftarrow |
| \sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$} |
| \COMMENT{DFT matrix-multiply module} |
| \ENDFOR |
| \FOR{$\lambda = 0 \dots f-1$} |
| \STATE{$v'_{(af+\lambda)p_{i-1}+b} |
| \Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \STATE{$v \Leftarrow v'$} |
| \ENDFOR |
| \end{algorithm} |
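Transcribed into Python the algorithm looks like the following sketch
(a scratch vector for each pass and a naive $O(f^2)$ DFT module for
every factor; the real implementation substitutes the optimized
small-$N$ modules, and the factorization helper is illustrative):

```python
import cmath

def factorize(n):
    """Trial-division factorization (illustrative only)."""
    f, d = [], 2
    while n > 1:
        while n % d == 0:
            f.append(d)
            n //= d
        d += 1
    return f

def fft_mixed_radix(v):
    """Self-sorting mixed-radix DIF FFT, transcribed from the algorithm
    above, with a generic DFT module for each factor."""
    v = list(v)
    N = len(v)
    p = 1                                  # p_{i-1}, product of factors done
    for f in factorize(N):
        m = N // f                         # m_i = N / f_i
        q = N // (p * f)                   # q_i = N / p_i
        q_prev = f * q                     # q_{i-1} = f_i q_i
        w = cmath.exp(-2j * cmath.pi / q_prev)
        wf = cmath.exp(-2j * cmath.pi / f)
        vp = [0] * N                       # scratch vector v'
        for a in range(q):
            for b in range(p):
                # DFT matrix-multiply module of length f
                t = [sum(wf ** (lam * lp) * v[b + lp * m + a * p]
                         for lp in range(f)) for lam in range(f)]
                # twiddle factor and digit-reversed store
                for lam in range(f):
                    vp[(a * f + lam) * p + b] = w ** (lam * a) * t[lam]
        v = vp
        p *= f
    return v
```

Because the permutations are built into each pass the output appears
in natural order; comparing with the direct DFT sum for, say,
$N = 2 \cdot 3 \cdot 5$ confirms this.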
| % |
| \subsection{Details of the implementation} |
| % |
| First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates |
| $n$ elements of scratch space (to hold the vector $v'$ for each |
| iteration) and $n$ elements for a trigonometric lookup table of |
| twiddle factors. |
| |
| Then the length $n$ must be factorized. There is a general |
| factorization function {\tt gsl\_fft\_factorize} which takes a list of |
preferred factors. It first factors out the preferred factors and then
extracts any remaining prime factors.
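A sketch of this two-stage factorization (the function name and the
preferred-factor list here are illustrative, not the exact GSL
values):

```python
def fft_factorize(n, preferred=(7, 6, 5, 4, 3, 2)):
    """Two-stage factorization sketch: pull out preferred factors first,
    then any remaining primes (illustrative, not the GSL routine)."""
    factors = []
    for f in preferred:
        while n % f == 0:
            factors.append(f)
            n //= f
    d = 2
    while n > 1:                     # remaining general prime factors
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    return factors
```

For example 30 factorizes as $6 \times 5$ using the preferred list,
while a length with a large prime such as 26 leaves the prime 13 for
the general-$N$ routine.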
| |
| The algorithm used to generate the trigonometric lookup table is |
| % |
| \begin{algorithm} |
| \FOR {$a = 1 \dots n_f$} |
| \FOR {$b = 1 \dots f_i - 1$} |
| \FOR {$c = 1 \dots q_i$} |
| \STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{a-1}/N)$ |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
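A direct transcription of this table generation (a sketch; the
factorization helper is illustrative):

```python
import cmath

def factorize(n):
    f, d = [], 2
    while n > 1:
        while n % d == 0:
            f.append(d)
            n //= d
        d += 1
    return f

def trig_table(n):
    """Twiddle-factor lookup table, one contiguous segment per factor."""
    trig = []
    p = 1                                # p_{a-1}
    for f in factorize(n):
        q = n // (p * f)                 # q_a
        for b in range(1, f):
            for c in range(1, q + 1):
                trig.append(cmath.exp(-2j * cmath.pi * b * c * p / n))
        p *= f
    return trig
```

For example {\tt len(trig\_table(16))} is 15, matching the count
$n-1$ derived below.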
| % |
Note that $\sum_{1}^{n_f} \sum_{1}^{f_i-1} \sum_{1}^{q_i} =
\sum_{1}^{n_f} (f_i-1)q_i = n - 1$ so $n$ elements are always
sufficient to store the lookup table. This layout is chosen because we
need to compute $\omega_{q_{i-1}}^{\lambda a} t_\lambda$ in
the FFT. In terms of the lookup table we can write this as,
| % |
| \begin{eqnarray} |
| \omega_{q_{i-1}}^{\lambda a} t_\lambda |
| &=& \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\ |
| &=& \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\ |
| &=& \left\{ |
| \begin{array}{ll} |
| t_\lambda & a=0 \\ |
| \mbox{trig}[\mbox{twiddle[i]}+\lambda q+(a-1)] t_\lambda & a\not=0 |
| \end{array} |
| \right. |
| \end{eqnarray} |
| % |
| \begin{sloppypar} |
| \noindent |
| The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig} |
| for the starting points for the outer loop. The core of the |
| implementation is {\tt gsl\_fft\_complex}. This function loops over |
| the chosen factors of $N$, computing the iteration $v'=T_i v$ for each |
pass. When the DFT for a factor is implemented, the iteration is
handed off to a dedicated small-$N$ module, such as {\tt
| gsl\_fft\_complex\_pass\_3} or {\tt |
| gsl\_fft\_complex\_pass\_5}. Unimplemented factors are handled |
| by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The |
| structure of one of the small-$N$ modules is a simple transcription of |
| the basic algorithm given above. Here is an example for {\tt |
| gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to |
| calculate the following expression, |
| \end{sloppypar}% |
| \begin{equation} |
| v'_{(a f + \lambda) p_{i-1} + b} |
| = |
| \sum_{\lambda' = 0,1,2} |
| \omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3} |
| v_{b + \lambda' m + a p_{i-1}} |
| \end{equation} |
| % |
| for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda = |
| 0, 1, 2$. This is implemented as, |
| % |
| \begin{algorithm} |
| \FOR {$a = 0 \dots q-1$} |
| \FOR {$b = 0 \dots p_{i-1} - 1$} |
| \STATE {$ |
| \left( |
| \begin{array}{c} |
| t_0 \\ t_1 \\ t_2 |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{ccc} |
| W^{0}_3 & W^{0}_3 & W^{0}_3 \\ |
| W^{0}_3 & W^{1}_3 & W^{2}_3 \\ |
| W^{0}_3 & W^{2}_3 & W^{4}_3 |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{l} |
| v_{b + a p_{i-1}} \\ |
| v_{b + a p_{i-1} + m} \\ |
| v_{b + a p_{i-1} +2m} |
| \end{array} |
| \right)$} |
| \STATE {$ v'_{a p_{i} + b} = t_0$} |
| \STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$} |
| \STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$} |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
| % |
| In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2} |
| to index the input locations, |
| % |
| \begin{eqnarray} |
| \mbox{\tt from0} &=& b + a p_{i-1} \\ |
| \mbox{\tt from1} &=& b + a p_{i-1} + m \\ |
| \mbox{\tt from2} &=& b + a p_{i-1} + 2m |
| \end{eqnarray} |
| % |
| and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output |
| locations in the scratch vector $v'$, |
| % |
| \begin{eqnarray} |
| \mbox{\tt to0} &=& b + a p_{i} \\ |
| \mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\ |
| \mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1} |
| \end{eqnarray} |
| % |
| The DFT matrix multiplication is computed using the optimized |
| sub-transform modules given in the next section. The twiddle factors |
| $\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array. |
| |
| To compute the inverse transform we go back to the definition of the |
| fourier transform and note that the inverse matrix is just the complex |
| conjugate of the forward matrix (with a factor of $1/N$), |
| % |
| \begin{equation} |
| W^{-1}_N = W^*_N / N |
| \end{equation} |
| % |
| Therefore we can easily compute the inverse transform by conjugating |
| all the complex elements of the DFT matrices and twiddle factors that |
| we apply. (An alternative strategy is to conjugate the input data, |
| take a forward transform, and then conjugate the output data). |
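In code the conjugation strategy takes only a few lines; here is a
sketch with a naive forward DFT standing in for the FFT:

```python
import cmath

def dft(z):
    """Naive forward DFT, standing in for any forward FFT routine."""
    n = len(z)
    return [sum(z[b] * cmath.exp(-2j * cmath.pi * a * b / n)
                for b in range(n)) for a in range(n)]

def inverse_dft(x):
    """Inverse transform via W^{-1} = W^* / N: conjugate the input,
    run the forward transform, conjugate and scale the output."""
    n = len(x)
    y = dft([c.conjugate() for c in x])
    return [c.conjugate() / n for c in y]
```

Round-tripping a vector through {\tt dft} and {\tt inverse\_dft}
recovers the original data to machine precision.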
| |
| \section{Fast Sub-transform Modules} |
| % |
| To implement the mixed-radix FFT we still need to compute the |
| small-$N$ DFTs for each factor. Fortunately many highly-optimized |
| small-$N$ modules are available, following the work of Winograd who |
| showed how to derive efficient small-$N$ sub-transforms by number |
| theoretic techniques. |
| |
| The algorithms in this section all compute, |
| % |
| \begin{equation} |
| x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b |
| \end{equation} |
| % |
| The sub-transforms given here are the ones recommended by Temperton |
| and differ slightly from the canonical Winograd modules. According to |
| Temperton~\cite{temperton83} they are slightly more robust against |
| rounding errors and trade off some additions for multiplications. |
| % |
| For the $N=2$ DFT, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| x_0 = z_0 + z_1, & |
| x_1 = z_0 - z_1. |
| \end{array} |
| \end{equation} |
| % |
| For the $N=3$ DFT, |
| % |
| \begin{equation} |
| \begin{array}{lll} |
| t_1 = z_1 + z_2, & |
| t_2 = z_0 - t_1/2, & |
| t_3 = \sin(\pi/3) (z_1 - z_2), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| x_0 = z_0 + t_1, & |
| x_1 = t_2 + i t_3, & |
| x_2 = t_2 - i t_3. |
| \end{array} |
| \end{equation} |
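Transcribing the module and comparing it with the direct sum is a
useful check. Note that with the signs exactly as printed the module
evaluates $\sum_b \exp(+2\pi i ab/3)\, z_b$; conjugating $i$
throughout (the sketch below takes a sign parameter, an addition of
ours) gives the forward convention $W_3 = \exp(-2\pi i/3)$ defined at
the start of this section:

```python
import cmath
import math

def dft3_module(z, sign=+1):
    """N = 3 module as printed (sign = +1 matches the formulas above,
    which evaluate sum_b exp(+2*pi*i*a*b/3) z_b; sign = -1 conjugates i,
    giving the forward convention W_3 = exp(-2*pi*i/3))."""
    t1 = z[1] + z[2]
    t2 = z[0] - t1 / 2
    t3 = math.sin(math.pi / 3) * (z[1] - z[2])
    return [z[0] + t1, t2 + sign * 1j * t3, t2 - sign * 1j * t3]

z = [1 + 2j, 3 - 1j, -2 + 0.5j]
for sign in (+1, -1):
    ref = [sum(z[b] * cmath.exp(sign * 2j * cmath.pi * a * b / 3)
               for b in range(3)) for a in range(3)]
    err = max(abs(u - v) for u, v in zip(dft3_module(z, sign), ref))
    assert err < 1e-12
```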
| % |
| The $N=4$ transform involves only additions and subtractions, |
| % |
| \begin{equation} |
| \begin{array}{llll} |
| t_1 = z_0 + z_2, & |
| t_2 = z_1 + z_3, & |
| t_3 = z_0 - z_2, & |
| t_4 = z_1 - z_3, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| x_0 = t_1 + t_2, & |
| x_1 = t_3 + i t_4, & |
| x_2 = t_1 - t_2, & |
| x_3 = t_3 - i t_4. |
| \end{array} |
| \end{equation} |
| % |
| For the $N=5$ DFT, |
| % |
| \begin{equation} |
| \begin{array}{llll} |
| t_1 = z_1 + z_4, & |
| t_2 = z_2 + z_3, & |
| t_3 = z_1 - z_4, & |
| t_4 = z_2 - z_3, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| t_5 = t_1 + t_2, & |
| t_6 = (\sqrt{5}/4) (t_1 - t_2), & |
| t_7 = z_0 - t_5/4, \\ |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| t_8 = t_7 + t_6, & |
| t_9 = t_7 - t_6, \\ |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, & |
| t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| x_0 = z_0 + t_5, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| x_1 = t_8 + i t_{10}, & |
| x_2 = t_9 + i t_{11}, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{llll} |
| x_3 = t_9 - i t_{11}, & |
| x_4 = t_8 - i t_{10}. |
| \end{array} |
| \end{equation} |
| % |
| The DFT matrix for $N=6$ can be written as a combination of $N=3$ and |
| $N=2$ transforms with index permutations, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| x_0 \\ |
| x_4 \\ |
| x_2 \\ |
| \hline x_3 \\ |
| x_1 \\ |
| x_5 |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{ccc|ccc} |
| & & & & & \\ |
| &W_3& & &W_3& \\ |
| & & & & & \\ |
| \hline & & & & & \\ |
| &W_3& & &-W_3& \\ |
| & & & & & |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{c} |
| z_0 \\ |
| z_2 \\ |
| z_4 \\ |
| \hline z_3 \\ |
| z_5 \\ |
| z_1 |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| This simplification is an example of the Prime Factor Algorithm, which |
| can be used because the factors 2 and 3 are mutually prime. For more |
| details consult one of the books on number theory for |
| FFTs~\cite{elliott82,blahut}. We can take advantage of the simple |
| indexing scheme of the PFA to write the $N=6$ DFT as, |
| % |
| \begin{equation} |
| \begin{array}{lll} |
| t_1 = z_2 + z_4, & |
| t_2 = z_0 - t_1/2, & |
| t_3 = \sin(\pi/3) (z_2 - z_4), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| t_4 = z_5 + z_1, & |
| t_5 = z_3 - t_4/2, & |
| t_6 = \sin(\pi/3) (z_5 - z_1), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| t_7 = z_0 + t_1, & |
| t_8 = t_2 + i t_3, & |
| t_9 = t_2 - i t_3, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| t_{10} = z_3 + t_4, & |
| t_{11} = t_5 + i t_6, & |
| t_{12} = t_5 - i t_6, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| x_0 = t_7 + t_{10}, & |
| x_4 = t_8 + t_{11}, & |
| x_2 = t_9 + t_{12}, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| x_3 = t_7 - t_{10}, & |
| x_1 = t_8 - t_{11}, & |
| x_5 = t_9 - t_{12}. |
| \end{array} |
| \end{equation} |
| |
| For any remaining general factors we use Singleton's efficient method |
| for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$ |
| algorithm it does reduce the number of multiplications by a factor of |
4 compared with a naive evaluation of the DFT. If we look at the
general structure of a DFT matrix, shown schematically below,
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| h_0 \\ |
| h_1 \\ |
| h_2 \\ |
| \vdots \\ |
| h_{N-2} \\ |
| h_{N-1} |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{cccccc} |
| 1 & 1 & 1 & \cdots & 1 & 1 \\ |
| 1 & W & W & \cdots & W & W \\ |
| 1 & W & W & \cdots & W & W \\ |
| \vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\ |
| 1 & W & W & \cdots & W & W \\ |
| 1 & W & W & \cdots & W & W |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{c} |
| g_0 \\ |
| g_1 \\ |
| g_2 \\ |
| \vdots \\ |
| g_{N-2} \\ |
| g_{N-1} |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| we see that the outer elements of the DFT matrix are all unity. We can |
| remove these trivial multiplications but we will still be left with an |
| $(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear |
to require $(N-1)^2$ complex multiplications. Singleton's method
uses symmetries of the DFT matrix to convert the complex
| multiplications to an equivalent number of real multiplications. We |
| start with the definition of the DFT in component form, |
| % |
| \begin{equation} |
a_k + i b_k = \sum_{j=0}^{f-1} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
| \end{equation} |
| % |
| The zeroth component can be computed using only additions, |
| % |
| \begin{equation} |
a_0 + i b_0 = \sum_{j=0}^{f-1} (x_j + i y_j)
| \end{equation} |
| % |
| We can rewrite the remaining components as, |
| % |
| \begin{eqnarray} |
| a_k + i b_k & = x_0 + i y_0 & + |
| \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) |
| + (y_j - y_{f-j}) \sin(2\pi jk/f) \\ |
| & & + i\sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) |
| - (x_j - x_{f-j}) \sin(2\pi jk/f) |
| \end{eqnarray} |
| % |
| by using the following trigonometric identities, |
| % |
| \begin{eqnarray} |
| \cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\ |
| \sin(2\pi(f-j)k/f) &=& -\sin(2\pi jk/f) |
| \end{eqnarray} |
| % |
| These remaining components can all be computed using four partial |
| sums, |
| % |
| \begin{eqnarray} |
| a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\ |
| a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k) |
| \end{eqnarray} |
| % |
| for $k = 1, 2, \dots, (f-1)/2$, where, |
| % |
| \begin{eqnarray} |
| a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\ |
| a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\ |
| b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\ |
| b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f) |
| \end{eqnarray} |
| % |
| Note that the higher components $k'=f-k$ can be obtained directly |
| without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are |
| known. There are $4 \times (f-1)/2$ different sums, each involving |
| $(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real |
| multiplications instead of $(f-1)^2$ complex multiplications. |
| |
| To implement Singleton's method we make use of the input and output |
| vectors $v$ and $v'$ as scratch space, copying data back and forth |
| between them to obtain the final result. First we use $v'$ to store |
| the terms of the symmetrized and anti-symmetrized vectors of the form |
$x_j + x_{f-j}$ and $x_j - x_{f-j}$. Then we multiply these by the
| appropriate trigonometric factors to compute the partial sums $a^+$, |
| $a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k} |
| + i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any |
| necessary twiddle factors and place the results in $v'$. |
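As a concrete illustration of Singleton's method, here is a minimal Python sketch (the function name and structure are ours, not the GSL routine) which computes a DFT of odd length $f$ through the four partial sums $a^+$, $a^-$, $b^+$ and $b^-$:

```python
import math

def singleton_dft(x, y):
    """DFT of complex data x + iy of odd length f via Singleton's
    partial sums a+, a-, b+, b- (illustrative sketch)."""
    f = len(x)
    a = [0.0] * f
    b = [0.0] * f
    a[0] = sum(x)          # zeroth component needs only additions
    b[0] = sum(y)
    for k in range(1, (f - 1) // 2 + 1):
        ap, am, bp, bm = x[0], 0.0, y[0], 0.0
        for j in range(1, (f - 1) // 2 + 1):
            c = math.cos(2 * math.pi * j * k / f)
            s = math.sin(2 * math.pi * j * k / f)
            ap += (x[j] + x[f - j]) * c    # symmetrized terms
            am -= (y[j] - y[f - j]) * s    # anti-symmetrized terms
            bp += (y[j] + y[f - j]) * c
            bm -= (x[j] - x[f - j]) * s
        a[k], b[k] = ap - am, bp + bm          # a_k + i b_k
        a[f - k], b[f - k] = ap + am, bp - bm  # a_{f-k} + i b_{f-k}, for free
    return a, b
```

Each of the four inner accumulations uses only real multiplications, matching the $(f-1)^2$ count derived above.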
| |
| \section{FFTs for real data} |
| % |
| This section is based on the articles {\em Fast Mixed-Radix Real |
| Fourier Transforms} by Clive Temperton~\cite{temperton83real} and |
| {\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen, |
| Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real |
| sequence has a special symmetry, called a {\em conjugate-complex} or |
| {\em half-complex} symmetry, |
| % |
| \begin{equation} |
| h(a) = h(N-a)^* |
| \end{equation} |
| % |
| The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also |
| real. It is straightforward to prove the symmetry, |
| % |
| \begin{eqnarray} |
| h(a) &=& \sum W^{ab}_N g(b) \\ |
| h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^* \\ |
| &=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad{(W^N_N=1)} \\ |
| &=& \sum W^{ab}_N g(b) |
| \end{eqnarray} |
| % |
Real-valued data is very common in practice (perhaps more common than
| complex data) so it is worth having efficient FFT routines for real |
| data. In principle an FFT for real data should need half the |
| operations of an FFT on the equivalent complex data (where the |
| imaginary parts are set to zero). There are two different strategies |
| for computing FFTs of real-valued data: |
| |
| One strategy is to ``pack'' the real data (of length $N$) into a |
| complex array (of length $N/2$) by index transformations. A complex |
| FFT routine can then be used to compute the transform of that array. |
By further index transformations the result can actually be
| ``unpacked'' to the FFT of the original real data. It is also possible |
| to do two real FFTs simultaneously by packing one in the real part and |
| the other in the imaginary part of the complex array. These |
| techniques have some disadvantages. The packing and unpacking |
| procedures always add $O(N)$ operations, and packing a real array of |
| length $N$ into a complex array of length $N/2$ is only possible if |
| $N$ is even. In addition, if two unconnected datasets with very |
| different magnitudes are packed together in the same FFT there could |
| be ``cross-talk'' between them due to a loss of precision. |
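The two-transforms-for-one packing can be sketched in a few lines of Python (an illustration of the strategy, with a naive DFT standing in for a complex FFT routine):

```python
import cmath

def dft(z):
    """Naive complex DFT, standing in for a complex FFT routine."""
    N = len(z)
    return [sum(z[j] * cmath.exp(-2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]

def two_real_dfts(x, y):
    """Transform two real sequences with one complex DFT by packing one
    into the real part and the other into the imaginary part."""
    N = len(x)
    Z = dft([complex(a, b) for a, b in zip(x, y)])
    # Unpack using the half-complex symmetry of each component:
    #   X_k = (Z_k + Z_{N-k}^*)/2,   Y_k = (Z_k - Z_{N-k}^*)/(2i)
    X = [(Z[k] + Z[-k % N].conjugate()) / 2 for k in range(N)]
    Y = [(Z[k] - Z[-k % N].conjugate()) / 2j for k in range(N)]
    return X, Y
```

The unpacking stage shown in the comments is exactly the $O(N)$ overhead mentioned above.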
| |
| A more straightforward strategy is to start with an FFT algorithm, |
| such as the complex mixed-radix algorithm, and prune out all the |
| operations involving the zero imaginary parts of the initial data. The |
| FFT is linear so the imaginary part of the data can be decoupled from |
| the real part. This procedure leads to a dedicated FFT for real-valued |
| data which works for any length and does not perform any unnecessary |
| operations. It also allows us to derive a corresponding inverse FFT |
| routine which transforms a half-complex sequence back into real data. |
| |
| \subsection{Radix-2 FFTs for real data} |
| % |
| Before embarking on the full mixed-radix real FFT we'll start with the |
| radix-2 case. It contains all the essential features of the |
| general-$N$ algorithm. To make it easier to see the analogy between |
| the two we will use the mixed-radix notation to describe the |
| factors. The factors are all 2, |
| % |
| \begin{equation} |
| f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2 |
| \end{equation} |
| % |
| and the products $p_i$ are powers of 2, |
| % |
| \begin{eqnarray} |
| p_0 & = & 1 \\ |
| p_1 & = & f_1 = 2 \\ |
| p_2 & = & f_1 f_2 = 4 \\ |
| \dots &=& \dots \\ |
| p_i & = & f_1 f_2 \dots f_i = 2^i |
| \end{eqnarray} |
| % |
| Using this notation we can rewrite the radix-2 decimation-in-time |
| algorithm as, |
| % |
| \begin{algorithm} |
| \STATE bit-reverse ordering of $g$ |
| \FOR {$i = 1 \dots n$} |
| \FOR {$a = 0 \dots p_{i-1} - 1$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$ |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) \\ |
| g(b p_i + p_{i-1} + a) |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\ |
| g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a) |
| \end{array} |
| \right) $} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
| % |
| where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of |
| the original definition of $b$ ($b \to b p_i$). |
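A direct transcription of this pass structure into Python may make the notation concrete (an illustrative sketch with our own function name, not the GSL code):

```python
import cmath

def radix2_dit(g):
    """Complex radix-2 decimation-in-time FFT in the mixed-radix
    notation above: pass i works on indices b*p_i + a, with
    p_i = 2^i and q_i = N/p_i."""
    N = len(g)
    n = N.bit_length() - 1
    assert 1 << n == N
    # bit-reverse ordering of g
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            g[i], g[j] = g[j], g[i]
    for i in range(1, n + 1):
        p, p_prev = 1 << i, 1 << (i - 1)    # p_i and p_{i-1}
        q = N // p                          # q_i
        for a in range(p_prev):
            w = cmath.exp(-2j * cmath.pi * a / p)   # W^a_{p_i}
            for b in range(q):
                t0 = g[b * p + a]
                t1 = w * g[b * p + p_prev + a]
                g[b * p + a] = t0 + t1
                g[b * p + p_prev + a] = t0 - t1
    return g
```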
| |
| If we go back to the original recurrence relations we can see how to |
| write the intermediate results in a way which make the |
| real/half-complex symmetries explicit at each step. The first pass is |
| just a set of FFTs of length-2 on real values, |
| % |
| \begin{equation} |
| g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3]) |
| \end{equation} |
| % |
| Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality |
| condition, |
| % |
| \begin{eqnarray} |
| g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\ |
| g_1([b_0 b_1 b_2 1]) &=& \mbox{real'} |
| \end{eqnarray} |
| % |
| In the next pass we have a set of length-4 FFTs on the original data, |
| % |
| \begin{eqnarray} |
g_2([b_0 b_1 a_1 a_0])
| &=& |
| \sum_{b_2}\sum_{b_3} |
| W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2 |
| g([b_0 b_1 b_2 b_3]) \\ |
| &=& |
| \sum_{b_2}\sum_{b_3} |
| W^{[a_1 a_0][b_3 b_2]}_4 |
| g([b_0 b_1 b_2 b_3]) |
| \end{eqnarray} |
| % |
| This time symmetry gives us the following conditions on the |
| transformed data, |
| % |
| \begin{eqnarray} |
| g_2([b_0 b_1 0 0]) &=& \mbox{real} \\ |
| g_2([b_0 b_1 0 1]) &=& x + i y \\ |
| g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\ |
| g_2([b_0 b_1 1 1]) &=& x - i y |
| \end{eqnarray} |
| % |
| We can see a pattern emerging here: the $i$-th pass computes a set of |
| independent length-$2^i$ FFTs on the original real data, |
| % |
| \begin{eqnarray} |
| g_i ( b p_i + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a') |
| \quad |
| \mbox{for $b = 0 \dots q_i - 1$} |
| \end{eqnarray} |
| % |
As a consequence we can apply the symmetry for an FFT of real data
| to all the intermediate results -- not just the final result. |
| In general after the $i$-th pass we will |
| have the symmetry, |
| % |
| \begin{eqnarray} |
| g_i(b p_i) &=& \mbox{real} \\ |
| g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1 \\ |
| g_i(b p_i + p_{i}/2) &=& \mbox{real'} |
| \end{eqnarray} |
| % |
| In the next section we'll show that this is a general property of |
| decimation-in-time algorithms. The same is not true for the |
| decimation-in-frequency algorithm, which does not have any simple |
| symmetries in the intermediate results. |
| |
| Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$ |
| from the values for $a < p_i/2$ we can cut our computation and |
| storage in half compared with the full-complex case. |
| % |
| We can easily rewrite the algorithm to show how the computation can be |
| halved, simply by limiting all terms to involve only values for $a |
| \leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a > |
| p_{i}/2$ we rewrite it in terms of its complex symmetry partner, |
| $g_i(b p_i + a')^*$, where $a' = p_i - a$. The butterfly computes two |
| values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$, |
| so we actually only need to compute from $a = 0$ to $p_{i-1}/2$. This |
| gives the following algorithm, |
| % |
| \begin{algorithm} |
| \FOR {$a = 0 \dots p_{i-1}/2$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$ |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) \\ |
| g(b p_i + p_{i-1} - a)^* |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
| g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\ |
| g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a) |
| \end{array} |
| \right) $} |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
| % |
| Although we have halved the number of operations we also need a |
| storage arrangement which will halve the memory requirement. The |
| algorithm above is still formulated in terms of a complex array $g$, |
| but the input to our routine will naturally be an array of $N$ real |
| values which we want to use in-place. |
| |
| Therefore we need a storage scheme which lays out the real and |
| imaginary parts within the real array, in a natural way so that there |
| is no need for complicated index calculations. In the radix-2 |
| algorithm we do not have any additional scratch space. The storage |
| scheme has to be designed to accommodate the in-place calculation |
| taking account of dual node pairs. |
| |
| Here is a scheme which takes these restrictions into account: On the |
| $i$-th pass we store the real part of $g(b p_i + a)$ in location $b |
| p_i + a$. We store the imaginary part in location $b p_i + p_i - |
| a$. This is the redundant location which corresponds to the conjugate |
| term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed. When |
the results are purely real (as in the cases $a = 0$ and $a = p_i/2$)
we store only the real part and drop the zero imaginary part.
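On the final pass ($p_n = N$, $b = 0$) this scheme packs the half-complex result into $N$ real locations. A small Python sketch of that packing (names ours):

```python
import cmath

def pack_halfcomplex(h):
    """Pack a half-complex DFT h of real data (even length N) into N
    reals: Re h_a at location a, Im h_a at the redundant location N - a."""
    N = len(h)
    out = [0.0] * N
    out[0] = h[0].real            # purely real, imaginary part dropped
    for a in range(1, N // 2):
        out[a] = h[a].real
        out[N - a] = h[a].imag
    out[N // 2] = h[N // 2].real  # purely real, imaginary part dropped
    return out
```

For example the DFT of $(1,2,3,4)$ is $(10,\,-2+2i,\,-2)$ and packs into the four locations $(10, -2, -2, 2)$.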
| |
| This storage scheme has to work in-place, because the radix-2 routines |
| should not use any scratch space. We will now check the in-place |
| property for each butterfly. A crucial point is that the scheme is |
| pass-dependent. Namely, when we are computing the result for pass $i$ |
| we are reading the results of pass $i-1$, and we must access them |
| using the scheme from the previous pass, i.e. we have to remember that |
| the results from the previous pass were stored using $b p_{i-1} + a$, |
| not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b |
| p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into |
| account we'll write the right hand side of the iteration, $g_{i-1}$, |
| in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs |
| naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i = |
| 2 p_{i-1}$. |
| |
| Let's start with the butterfly for $a = 0$, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g(b p_i) \\ |
| g(b p_i + p_{i-1})^* |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
| g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\ |
| g(2 b p_{i-1}) - g((2 b + 1) p_{i-1}) |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} |
| - a)^*$ all the inputs are purely real. The input $g(2 b p_{i-1})$ is |
| read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read |
| from the location $(2 b + 1) p_{i-1}$. Here is the full breakdown, |
| % |
| \begin{center} |
| \renewcommand{\arraystretch}{1.5} |
| \begin{tabular}{|l|lll|} |
| \hline Term & & Location & \\ |
| \hline |
| $g(2 b p_{i-1})$ |
| & real part & $2 b p_{i-1} $ &$= b p_i$ \\ |
| & imag part & --- & \\ |
| $g((2 b+1) p_{i-1})$ |
| & real part & $(2 b + 1) p_{i-1} $&$= b p_i + p_{i-1} $ \\ |
| & imag part & --- & \\ |
| \hline |
| $g(b p_{i})$ |
| & real part & $b p_i$ &\\ |
| & imag part & --- & \\ |
| $g(b p_{i} + p_{i-1})$ |
| & real part & $b p_i + p_{i-1}$& \\ |
| & imag part & --- & \\ |
| \hline |
| \end{tabular} |
| \end{center} |
| % |
| The conjugation of the output term $g(b p_i + p_{i-1})^*$ is |
| irrelevant here since the results are purely real. The real results |
| are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which |
| overwrites the inputs in-place. |
| |
| The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) \\ |
| g(b p_i + p_{i-1} - a)^* |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
| g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\ |
| g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a) |
| \end{array} |
| \right) |
| \end{equation} |
| % |
All the terms are complex. To store a conjugated term like $g(b' p_i +
a')^*$, where $0 < a' < p_i/2$, we take the real part and store it in
location $b' p_i + a'$, then take the imaginary part, negate it, and
store the result in location $b' p_i + p_i - a'$.
| |
| Here is the full breakdown of the inputs and outputs from the |
| butterfly, |
| % |
| \begin{center} |
| \renewcommand{\arraystretch}{1.5} |
| \begin{tabular}{|l|lll|} |
| \hline Term & & Location & \\ |
| \hline |
| $g(2 b p_{i-1} + a)$ |
| & real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\ |
| & imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\ |
| $g((2 b+1) p_{i-1} + a)$ |
| & real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\ |
| & imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\ |
| \hline |
| $g(b p_{i} + a)$ |
| & real part & $b p_i + a$ &\\ |
| & imag part & $b p_i + p_i - a$& \\ |
| $g(b p_{i} + p_{i-1} - a)$ |
| & real part & $b p_i + p_{i-1} - a$& \\ |
| & imag part & $b p_i + p_{i-1} + a$& \\ |
| \hline |
| \end{tabular} |
| \end{center} |
| % |
| By comparing the input locations and output locations we can see |
| that the calculation is done in place. |
| |
| The final butterfly for $a = p_{i-1}/2$ is, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{c} |
| g(b p_i + p_{i-1}/2) \\ |
| g(b p_i + p_{i-1} - p_{i-1}/2)^* |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{c} |
| g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\ |
| g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2) |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| where we have substituted for the twiddle factor, $W^a_{p_i} = -i$, |
| % |
| \begin{eqnarray} |
W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i (p_{i-1}/2)/p_i) \\
| &=& \exp(-2\pi i /4) \\ |
| &=& -i |
| \end{eqnarray} |
| % |
| For this butterfly the second line is just the conjugate of the first, |
| because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to |
| consider the first line. The breakdown of the inputs and outputs is, |
| % |
| \begin{center} |
| \renewcommand{\arraystretch}{1.5} |
| \begin{tabular}{|l|lll|} |
| \hline Term & & Location & \\ |
| \hline |
| $g(2 b p_{i-1} + p_{i-1}/2)$ |
| & real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\ |
| & imag part & --- & \\ |
| $g((2 b + 1) p_{i-1} + p_{i-1}/2)$ |
| & real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\ |
| & imag part & --- & \\ |
| \hline |
| $g(b p_{i} + p_{i-1}/2)$ |
| & real part & $b p_i + p_{i-1}/2$ &\\ |
| & imag part & $b p_i + p_i - p_{i-1}/2$& \\ |
| \hline |
| \end{tabular} |
| \end{center} |
| % |
| By comparing the locations of the inputs and outputs with the |
| operations in the butterfly we find that this computation is very |
| simple: the effect of the butterfly is to negate the location $b p_i + |
| p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly |
| an in-place operation. |
| |
| Here is the radix-2 algorithm for real data, in full, with the cases |
| of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate |
| blocks, |
| % |
| \begin{algorithm} |
| \STATE bit-reverse ordering of $g$ |
| \FOR {$i = 1 \dots n$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$\left( |
| \begin{array}{c} |
| g(b p_i) \\ |
| g(b p_i + p_{i-1}) |
| \end{array} |
| \right) |
| \Leftarrow |
| \left( |
| \begin{array}{c} |
| g(b p_{i}) + g(b p_{i} + p_{i-1}) \\ |
| g(b p_{i}) - g(b p_{i} + p_{i-1}) |
| \end{array} |
| \right)$} |
| \ENDFOR |
| |
| \FOR {$a = 1 \dots p_{i-1}/2 - 1$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$(\Real z_0, \Imag z_0) \Leftarrow |
| (g(b p_i + a), g(b p_i + p_{i-1} - a))$} |
| \STATE{$(\Real z_1, \Imag z_1) \Leftarrow |
| (g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$} |
| \STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$} |
| \STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$} |
\STATE{$(g(b p_{i} + a), g(b p_{i} + p_i - a)) \Leftarrow
(\Real t_0, \Imag t_0)$}
| \STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a)) |
| \Leftarrow |
| (\Real t_1, -\Imag t_1)$} |
| \ENDFOR |
| \ENDFOR |
| |
| \FOR{$b = 0 \dots q_i - 1$} |
\STATE{$g(b p_{i} + p_i - p_{i-1}/2) \Leftarrow -g(b p_{i} + p_i - p_{i-1}/2)$}
| \ENDFOR |
| |
| \ENDFOR |
| \end{algorithm} |
| % |
| We split the loop over $a$ into three parts, $a=0$, $a=1\dots |
| p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency. When $a=0$ we have |
| $W^a_{p_i}=1$ so we can eliminate a complex multiplication within the |
| loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does |
| not require a full complex multiplication either. |
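The complete algorithm can be transcribed directly into Python (an illustrative sketch, not the GSL routine; the negated location in the final block follows the in-place analysis of the $a = p_{i-1}/2$ butterfly):

```python
import cmath

def real_radix2_fft(g):
    """In-place radix-2 FFT of real data, N a power of 2.  On output
    Re h_a is at location a and Im h_a at N - a, with the purely real
    h_0 and h_{N/2} at locations 0 and N/2."""
    N = len(g)
    n = N.bit_length() - 1
    assert 1 << n == N
    # bit-reverse ordering of g
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            g[i], g[j] = g[j], g[i]
    for i in range(1, n + 1):
        p, p_prev = 1 << i, 1 << (i - 1)
        q = N // p
        for b in range(q):                 # a = 0: no multiplication
            t0 = g[b * p] + g[b * p + p_prev]
            t1 = g[b * p] - g[b * p + p_prev]
            g[b * p], g[b * p + p_prev] = t0, t1
        for a in range(1, p_prev // 2):    # general butterflies
            w = cmath.exp(-2j * cmath.pi * a / p)   # W^a_{p_i}
            for b in range(q):
                z0 = complex(g[b * p + a], g[b * p + p_prev - a])
                z1 = complex(g[b * p + p_prev + a], g[b * p + p - a])
                t0, t1 = z0 + w * z1, z0 - w * z1
                g[b * p + a], g[b * p + p - a] = t0.real, t0.imag
                g[b * p + p_prev - a], g[b * p + p_prev + a] = t1.real, -t1.imag
        if p_prev >= 2:                    # a = p_{i-1}/2: twiddle is -i
            for b in range(q):
                g[b * p + p - p_prev // 2] = -g[b * p + p - p_prev // 2]
    return g
```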
| |
| |
| \subsubsection{Calculating the Inverse} |
| % |
| The inverse FFT of complex data was easy to calculate, simply by |
| taking the complex conjugate of the DFT matrix. The input data and |
| output data were both complex and did not have any special |
| symmetry. For real data the inverse FFT is more complicated because |
| the half-complex symmetry of the transformed data is |
| different from the purely real input data. |
| |
| We can compute an inverse by stepping backwards through the forward |
| transform. To simplify the inversion it's convenient to write the |
| forward algorithm with the butterfly in matrix form, |
| % |
| \begin{algorithm} |
| \FOR {$i = 1 \dots n$} |
| \FOR {$a = 0 \dots p_{i-1}/2$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$ |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) \\ |
| g(b p_i + p_{i-1} + a) |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{cc} |
| 1 & W^a_{p_{i}} \\ |
| 1 & -W^a_{p_{i}} |
| \end{array} |
| \right) |
| \left( |
| \begin{array}{c} |
| g(2 b p_{i-1} + a) \\ |
| g((2 b + 1) p_{i-1} + a) |
| \end{array} |
| \right) $} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
| % |
| To invert the algorithm we run the iterations backwards and invert the |
| matrix multiplication in the innermost loop, |
| % |
| \begin{algorithm} |
| \FOR {$i = n \dots 1$} |
| \FOR {$a = 0 \dots p_{i-1}/2$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$ |
| \left( |
| \begin{array}{c} |
| g(2 b p_{i-1} + a) \\ |
| g((2 b + 1) p_{i-1} + a) |
| \end{array} |
| \right) |
| = |
| \left( |
| \begin{array}{cc} |
| 1 & W^a_{p_{i}} \\ |
| 1 & -W^a_{p_{i}} |
| \end{array} |
| \right)^{-1} |
| \left( |
| \begin{array}{c} |
| g(b p_i + a) \\ |
| g(b p_i + p_{i-1} + a) |
| \end{array} |
| \right) $} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \end{algorithm} |
| % |
| There is no need to reverse the loops over $a$ and $b$ because the |
| result is independent of their order. The inverse of the matrix that |
| appears is, |
| % |
| \begin{equation} |
| \left( |
| \begin{array}{cc} |
| 1 & W^a_{p_{i}} \\ |
| 1 & -W^a_{p_{i}} |
| \end{array} |
| \right)^{-1} |
| = |
| {1 \over 2} |
| \left( |
| \begin{array}{cc} |
| 1 & 1 \\ |
| W^{-a}_{p_{i}} & -W^{-a}_{p_{i}} |
| \end{array} |
| \right) |
| \end{equation} |
| % |
| To save divisions we remove the factor of $1/2$ inside the loop. This |
| computes the unnormalized inverse, and the normalized inverse can be |
| retrieved by dividing the final result by $N = 2^n$. |
| |
| Here is the radix-2 half-complex to real inverse FFT algorithm, taking |
| into account the radix-2 storage scheme, |
| % |
| \begin{algorithm} |
| \FOR {$i = n \dots 1$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$\left( |
| \begin{array}{c} |
| g(b p_i) \\ |
| g(b p_i + p_{i-1}) |
| \end{array} |
| \right) |
| \Leftarrow |
| \left( |
| \begin{array}{c} |
| g(b p_{i}) + g(b p_{i} + p_{i-1}) \\ |
| g(b p_{i}) - g(b p_{i} + p_{i-1}) |
| \end{array} |
| \right)$} |
| \ENDFOR |
| |
| \FOR {$a = 1 \dots p_{i-1}/2 - 1$} |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$(\Real z_0, \Imag z_0) |
| \Leftarrow |
| (g(b p_i + a), g(b p_i + p_{i} - a))$} |
| \STATE{$(\Real z_1, \Imag z_1) |
| \Leftarrow |
| (g(b p_i + p_{i-1} - a), -g(b p_i + p_{i-1} + a))$} |
| \STATE{$t_0 \Leftarrow z_0 + z_1$} |
| \STATE{$t_1 \Leftarrow z_0 - z_1$} |
| \STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a)) |
| \Leftarrow |
| (\Real t_0, \Imag t_0) $} |
| \STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a)) |
| \Leftarrow |
(\Real(W^{-a}_{p_i}t_1), \Imag(W^{-a}_{p_i}t_1))$}
| \ENDFOR |
| \ENDFOR |
| |
| \FOR{$b = 0 \dots q_i - 1$} |
| \STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$} |
| \STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$} |
| \ENDFOR |
| |
| \ENDFOR |
| \STATE bit-reverse ordering of $g$ |
| \end{algorithm} |
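A matching Python sketch of this inverse (illustrative, not the GSL code) uses the twiddle factor $W^{-a}_{p_i}$, as given by the inverted butterfly matrix, and returns the unnormalized result, $N$ times the original data:

```python
import cmath

def real_radix2_ifft(g):
    """In-place inverse of the radix-2 real FFT (unnormalized).  Input is
    the half-complex layout (Re h_a at a, Im h_a at N - a); output is
    N times the original real sequence."""
    N = len(g)
    n = N.bit_length() - 1
    assert 1 << n == N
    for i in range(n, 0, -1):
        p, p_prev = 1 << i, 1 << (i - 1)
        q = N // p
        for b in range(q):                 # a = 0 block
            t0 = g[b * p] + g[b * p + p_prev]
            t1 = g[b * p] - g[b * p + p_prev]
            g[b * p], g[b * p + p_prev] = t0, t1
        for a in range(1, p_prev // 2):    # general butterflies
            w = cmath.exp(2j * cmath.pi * a / p)   # W^{-a}_{p_i}
            for b in range(q):
                z0 = complex(g[b * p + a], g[b * p + p - a])
                z1 = complex(g[b * p + p_prev - a], -g[b * p + p_prev + a])
                t0, t1 = z0 + z1, z0 - z1
                g[b * p + a], g[b * p + p_prev - a] = t0.real, t0.imag
                wt1 = w * t1
                g[b * p + p_prev + a], g[b * p + p - a] = wt1.real, wt1.imag
        if p_prev >= 2:                    # a = p_{i-1}/2 block
            for b in range(q):
                g[b * p + p_prev // 2] *= 2.0
                g[b * p + p_prev + p_prev // 2] *= -2.0
    # bit-reverse ordering of g
    j = 0
    for i in range(1, N):
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            g[i], g[j] = g[j], g[i]
    return g
```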
| |
| |
| |
| \subsection{Mixed-Radix FFTs for real data} |
| % |
| As discussed earlier the radix-2 decimation-in-time algorithm had the |
| special property that its intermediate passes are interleaved fourier |
| transforms of the original data, and this generalizes to the |
| mixed-radix algorithm. The complex mixed-radix algorithm that we |
| derived earlier was a decimation-in-frequency algorithm, but we can |
| obtain a decimation-in-time version by taking the transpose of the |
| decimation-in-frequency DFT matrix like this, |
| % |
| \begin{eqnarray} |
| W_N &=& W_N^T \\ |
| &=& (T_{n_f} \dots T_2 T_1)^T \\ |
| &=& T_1^T T_2^T \dots T_{n_f}^T |
| \end{eqnarray} |
| % |
| with, |
| % |
| \begin{eqnarray} |
| T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) |
| (W_{f_i} \otimes I_{m_i}) \right)^T \\ |
| &=& (W_{f_i} \otimes I_{m_i}) |
| ( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}). |
| \end{eqnarray} |
| % |
| We have used the fact that $W$, $D$ and $I$ are symmetric and that the |
| permutation matrix $P$ obeys, |
| % |
| \begin{equation} |
| (P^a_b)^T = P^b_a. |
| \end{equation} |
| % |
| From the definitions of $D$ and $P$ we can derive the following identity, |
| % |
| \begin{equation} |
| D^a_b P^b_a = P^b_a D^b_a. |
| \end{equation} |
| % |
| This allows us to put the transpose into a simple form, |
| % |
| \begin{equation} |
| T_i^T = (W_{f_i} \otimes I_{m_i}) |
| (P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}). |
| \end{equation} |
| % |
| The transposed matrix, $T^T$ applies the digit-reversal $P$ before the |
| DFT $W$, giving the required decimation-in-time algorithm. The |
| transpose reverses the order of the factors --- $T_{n_f}$ is applied |
| first and $T_1$ is applied last. For convenience we can reverse the |
| order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2 |
| \leftrightarrow f_{n_f-1}$, \dots and make the corresponding |
| substitution $p_{i-1} \leftrightarrow q_i$. These substitutions give |
| us a decimation-in-time algorithm with the same ordering as the |
| decimation-in-frequency algorithm, |
| % |
| \begin{equation} |
| W_N = T_{n_f} \dots T_2 T_1 |
| \end{equation} |
| % |
| \begin{equation} |
| T_i = (W_{f_i} \otimes I_{m_i}) |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) |
| \end{equation} |
| % |
| where $p_i$, $q_i$ and $m_i$ now have the same meanings as before, |
| namely, |
| % |
| \begin{eqnarray} |
| p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\ |
| q_i &=& N / p_i \\ |
| m_i &=& N / f_i |
| \end{eqnarray} |
| % |
| Now we would like to prove that the iteration for computing $x = W z = |
T_{n_f} \dots T_2 T_1 z$ has the special interleaving
property. If we write the result of each intermediate pass as
| $v^{(i)}$, |
| % |
| \begin{eqnarray} |
| v^{(0)} &=& z \\ |
| v^{(1)} &=& T_1 v^{(0)} \\ |
| v^{(2)} &=& T_2 v^{(1)} \\ |
| \dots &=& \dots \\ |
| v^{(i)} &=& T_i v^{(i-1)} |
| \end{eqnarray} |
| % |
| then we will show that the intermediate results $v^{(i)}$ on any pass |
| can be written as, |
| % |
| \begin{equation} |
| v^{(i)} = (W_{p_i} \otimes I_{q_i}) z |
| \end{equation} |
| % |
| Each intermediate stage will be a set of $q_i$ interleaved fourier |
| transforms, each of length $p_i$. We can prove this result by |
| induction. First we assume that the result is true for $v^{(i-1)}$, |
| % |
| \begin{equation} |
| v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)} |
| \end{equation} |
| % |
| And then we examine the next iteration using this assumption, |
| % |
| \begin{eqnarray} |
| v^{(i)} &=& T_i v^{(i-1)} \\ |
| &=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\ |
| &=& (W_{f_i} \otimes I_{m_i}) |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) |
| (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction} |
| \end{eqnarray} |
| % |
| Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as |
| $I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$. By combining these |
| with the basic matrix identity, |
| % |
| \begin{equation} |
| I_{ab} = I_a \otimes I_b |
| \end{equation} |
| % |
| we can rewrite $v^{(i)}$ in the following form, |
| % |
| \begin{equation} |
| v^{(i)} = (((W_{f_i} \otimes I_{p_{i-1}}) |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i}) |
| (W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z . |
| \end{equation} |
| % |
| The first part of this matrix product is the two-factor expansion of |
| $W_{ab}$, for $a = p_{i-1}$ and $b = f_i$, |
| % |
| \begin{equation} |
| W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}}) |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i}) |
| (W_{p_{i-1}} \otimes I_{f_i})). |
| \end{equation} |
| % |
| If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we |
| obtain, |
| % |
| \begin{equation} |
| v^{(i)} = (W_{p_i} \otimes I_{q_i}) z |
| \end{equation} |
| % |
| which is the desired result. The case $i=1$ can be verified |
| explicitly, and induction then shows that the result is true for all |
| $i$. As discussed for the radix-2 algorithm this result is important |
| because if the initial data $z$ is real then each intermediate pass is |
| a set of interleaved fourier transforms of $z$, having half-complex |
| symmetries (appropriately applied in the subspaces of the Kronecker |
| product). Consequently only $N$ real numbers are needed to store the |
| intermediate and final results. |
| |
| \subsection{Implementation} |
| % |
| The implementation of the mixed-radix real FFT algorithm follows the |
| same principles as the full complex transform. Some of the steps are |
| applied in the opposite order because we are dealing with a decimation |
| in time algorithm instead of a decimation in frequency algorithm, but |
| the basic outer structure of the algorithm is the same. We want to |
| apply the factorized version of the DFT matrix $W_N$ to the input |
| vector $z$, |
| % |
| \begin{eqnarray} |
| x &=& W_N z \\ |
| &=& T_{n_f} \dots T_2 T_1 z |
| \end{eqnarray} |
| % |
| We loop over the $n_f$ factors, applying each matrix $T_i$ to the |
| vector in turn to build up the complete transform, |
| % |
| \begin{algorithm} |
| \FOR{$(i = 1 \dots n_f)$} |
| \STATE{$v \Leftarrow T_i v $} |
| \ENDFOR |
| \end{algorithm} |
| % |
| In this case the definition of $T_i$ is different because we have |
| taken the transpose, |
| % |
| \begin{equation} |
| T_i = |
| (W_{f_i} \otimes I_{m_i}) |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}). |
| \end{equation} |
| % |
| We'll define a temporary vector $t$ to denote the results of applying the |
| rightmost matrix, |
| % |
| \begin{equation} |
| t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v |
| \end{equation} |
| % |
| If we expand this out into individual components, as before, we find a |
| similar simplification, |
| % |
| \begin{eqnarray} |
| t_{aq+b} |
| &=& |
| \sum_{a'b'} |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')} |
| v_{a'q+b'} \\ |
| &=& |
| \sum_{a'} |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'} |
| v_{a'q+b} |
| \end{eqnarray} |
| % |
| We have factorized the indices into the form $aq+b$, with $0 \leq a < |
| p_{i}$ and $0 \leq b < q$. Just as in the decimation in frequency |
| algorithm we can split the index $a$ to remove the matrix |
| multiplication completely. We have to write $a$ as $\mu f + \lambda$, |
| where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$, |
| % |
| \begin{eqnarray} |
| t_{(\mu f + \lambda)q+b} |
| &=& |
| \sum_{\mu'\lambda'} |
| (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')} |
| v_{(\mu' f + \lambda')q+b} |
| \\ |
| &=& |
| \sum_{\mu'\lambda'} |
| (P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')} |
| \omega^{\mu'\lambda'}_{p_{i}} |
| v_{(\mu' f + \lambda')q+b} |
| \end{eqnarray} |
| % |
| The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f + |
| \lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final |
| result of, |
| % |
| \begin{eqnarray} |
| t_{(\lambda p_{i-1} + \mu) q + b} |
| = |
\omega^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q +b}
| \end{eqnarray} |
| % |
| To calculate the next stage, |
| % |
| \begin{equation} |
| v' = (W_{f_i} \otimes I_{m_i}) t, |
| \end{equation} |
| % |
| we temporarily rearrange the index of $t$ to separate the $m_{i}$ |
| independent DFTs in the Kronecker product, |
| % |
| \begin{equation} |
| v'_{(\lambda p_{i-1} + \mu) q + b} |
| = |
| \sum_{\lambda' \mu' b'} |
| (W_{f_i} \otimes I_{m_i})_{ |
| ((\lambda p_{i-1} + \mu) q + b) |
| ((\lambda' p_{i-1} + \mu') q + b')} |
| t_{(\lambda' p_{i-1} + \mu') q + b'} |
| \end{equation} |
| % |
If we use the identity $m_i = p_{i-1} q_i$ to rewrite the index of $t$
like this,
%
\begin{equation}
t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m_i + (\mu q + b)}
\end{equation}
| % |
| we can split the Kronecker product, |
| % |
| \begin{eqnarray} |
| v'_{(\lambda p_{i-1} + \mu) q + b} |
| &=& |
| \sum_{\lambda' \mu' b'} |
| (W_{f_i} \otimes I_{m_i})_{ |
| ((\lambda p_{i-1} + \mu) q + b) |
| ((\lambda' p_{i-1} + \mu') q + b')} |
| t_{(\lambda' p_{i-1} + \mu') q + b'}\\ |
| &=& |
| \sum_{\lambda'} |
| (W_{f_i})_{\lambda \lambda'} |
| t_{\lambda' m_i + (\mu q + b)} |
| \end{eqnarray} |
| % |
| If we switch back to the original form of the index in the last line we obtain, |
| % |
| \begin{eqnarray} |
| \phantom{v'_{(\lambda p_{i-1} + \mu) q + b}} |
| &=& |
| \sum_{\lambda'} |
| (W_{f_i})_{\lambda \lambda'} |
t_{(\lambda' p_{i-1} + \mu) q + b}
| \end{eqnarray} |
| % |
| which allows us to substitute our previous result for $t$, |
| % |
| \begin{equation} |
| v'_{(\lambda p_{i-1} + \mu) q + b} |
| = |
| \sum_{\lambda'} |
| (W_{f_i})_{\lambda \lambda'} |
\omega^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
| \end{equation} |
| % |
| This gives us the basic decimation-in-time mixed-radix algorithm for |
| complex data which we will be able to specialize to real data, |
| % |
| \begin{algorithm} |
| \FOR{$i = 1 \dots n_f$} |
| \FOR{$\mu = 0 \dots p_{i-1} - 1$} |
| \FOR{$b = 0 \dots q-1$} |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$t_\lambda \Leftarrow
\omega^{\mu\lambda}_{p_{i}} v_{(\mu f + \lambda)q + b}$}
| \ENDFOR |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} \Leftarrow
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
| \COMMENT{DFT matrix-multiply module} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \STATE{$v \Leftarrow v'$} |
| \ENDFOR |
| \end{algorithm} |
| % |
| We are now at the point where we can convert an algorithm formulated |
| in terms of complex variables to one in terms of real variables by |
| choosing a suitable storage scheme. We will adopt the FFTPACK storage |
| convention. FFTPACK uses a scheme where individual FFTs are |
| contiguous, not interleaved, and real-imaginary pairs are stored in |
| neighboring locations. This has better locality than was possible for |
| the radix-2 case. |
| |
| The interleaving of the intermediate FFTs results from the Kronecker |
| product, $W_p \otimes I_q$. The FFTs can be made contiguous if we |
| reorder the Kronecker product on the intermediate passes, |
| % |
| \begin{equation} |
| W_p \otimes I_q \Rightarrow I_q \otimes W_p |
| \end{equation} |
| % |
This can be implemented by a simple change of indexing. On pass $i$
we store element $v_{a q_i + b}$ in location $v_{b p_i + a}$. We
| compensate for this change by making the same transposition when |
| reading the data. Note that this only affects the indices of the |
| intermediate passes. On the zeroth iteration the transposition has no |
| effect because $p_0 = 1$. Similarly there is no effect on the last |
| iteration, which has $q_{n_f} = 1$. This is how the algorithm above |
| looks after this index transformation, |
| % |
| \begin{algorithm} |
| \FOR{$i = 1 \dots n_f$} |
| \FOR{$\mu = 0 \dots p_{i-1} - 1$} |
| \FOR{$b = 0 \dots q-1$} |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$t_\lambda \Leftarrow
\omega^{\mu\lambda}_{p_{i}} v_{(\lambda q + b)p_{i-1} + \mu}$}
| \ENDFOR |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} \Leftarrow
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
| \COMMENT{DFT matrix-multiply module} |
| \ENDFOR |
| \ENDFOR |
| \ENDFOR |
| \STATE{$v \Leftarrow v'$} |
| \ENDFOR |
| \end{algorithm} |
| % |
| We transpose the input terms by writing the index in the form $a |
| q_{i-1} + b$, to take account of the pass-dependence of the scheme, |
| % |
| \begin{equation} |
| v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b} |
| \end{equation} |
| % |
| We used the identity $q_{i-1} = f q$ to split the index. Note that in |
| this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs |
| from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition |
| for the input terms then gives, |
| % |
| \begin{equation} |
| v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow v_{(\lambda'q + b) p_{i-1} + \mu} |
| \end{equation} |
| % |
| In the FFTPACK scheme the intermediate output data have the same |
| half-complex symmetry as the radix-2 example, namely, |
| % |
| \begin{equation} |
| v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)} |
| \end{equation} |
| % |
and the input data from the previous pass has the symmetry,
| % |
| \begin{equation} |
| v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q + |
| b) p_{i-1} + (p_{i-1} - \mu)} |
| \end{equation} |
| % |
| Using these symmetries we can halve the storage and computation |
| requirements for each pass. Compared with the radix-2 algorithm we |
| have more freedom because the computation does not have to be done in |
| place. The storage scheme adopted by FFTPACK places elements |
| sequentially with real and imaginary parts in neighboring |
| locations. Imaginary parts which are known to be zero are not |
| stored. Here are the full details of the scheme, |
| % |
| \begin{center} |
| \renewcommand{\arraystretch}{1.5} |
| \begin{tabular}{|l|lll|} |
| \hline Term & & Location & \\ |
| \hline |
| $g(b p_i)$ |
| & real part & $b p_{i} $ & \\ |
| & imag part & --- & \\ |
| \hline |
| $g(b p_i + a)$ |
| & real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\ |
| & imag part & $b p_{i} + 2a$ & \\ |
| \hline |
| $g(b p_{i} + p_{i}/2)$ |
| & real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\ |
| & imag part & --- & \\ |
| \hline |
| \end{tabular} |
| \end{center} |
| % |
| The real element for $a=0$ is stored in location $bp$. The real parts |
| for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the |
| imaginary parts are stored in locations $b p + 2 a$. When $p$ is even |
| the term for $a = p/2$ is purely real and we store it in location $bp |
| + p - 1$. The zero imaginary part is not stored. |
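The packing rules in the table can be sketched as a small C helper.
The function name {\tt fftpack\_pack} and its interface are
illustrative assumptions, not part of FFTPACK or GSL; it packs the
independent terms of one half-complex block of length $p$ into the
locations described above.

```c
#include <complex.h>

/* Hypothetical helper illustrating the FFTPACK packing above: given the
   independent terms g[0..p/2] of a half-complex sequence
   (g[a] = conj(g[p-a])), store block b of length p in the packed
   real array `out'. */
void fftpack_pack(double *out, const double complex *g, int b, int p)
{
    out[b * p] = creal(g[0]);               /* a = 0: imaginary part is zero */
    for (int a = 1; a <= (p - 1) / 2; a++) {
        out[b * p + 2 * a - 1] = creal(g[a]);   /* location b p + 2a - 1 */
        out[b * p + 2 * a]     = cimag(g[a]);   /* location b p + 2a     */
    }
    if (p % 2 == 0)
        out[b * p + p - 1] = creal(g[p / 2]);   /* a = p/2: purely real */
}
```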
| |
| When we compute the basic iteration, |
| % |
| \begin{equation} |
| v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'} |
| W^{\lambda \lambda'}_f |
| \omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu} |
| \end{equation} |
| % |
| we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we |
| did in the radix-2 algorithm. Whenever we need to store a term with $a |
| > p_{i}/2$ we consider instead the corresponding conjugate term with |
| $a' = p - a$. Similarly when reading data we replace terms with $\mu > |
| p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} - |
| \mu$. |
| |
Since the input data on each stage has half-complex symmetry we only
need to consider the range $\mu = 0 \dots p_{i-1}/2$. We now consider
the best way to implement the basic iteration for each value of $\mu$
in this range.
| |
For the first value, $\mu=0$, we access elements which are the zeroth
components of the independent transforms $W_{p_{i-1}} \otimes
I_{q_{i-1}}$, and are therefore purely real.
| % |
| We can code the pass with $\mu=0$ as a special case for real input |
| data, and conjugate-complex output. When $\mu=0$ the twiddle factors |
| $\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving. |
| We can obtain small-$N$ real-data DFT modules by removing the |
| redundant operations from the complex modules. |
| % |
| For example the $N=3$ module was, |
| % |
| \begin{equation} |
| \begin{array}{lll} |
| t_1 = z_1 + z_2, & |
| t_2 = z_0 - t_1/2, & |
| t_3 = \sin(\pi/3) (z_1 - z_2), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| x_0 = z_0 + t_1, & |
| x_1 = t_2 + i t_3, & |
| x_2 = t_2 - i t_3. |
| \end{array} |
| \end{equation} |
| % |
| In the complex case all the operations were complex, for complex input |
| data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are |
all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
symmetry $x_1 = t_2 + i t_3 = x^*_2$ means that we do not have to
compute $x_2$ once we have computed $x_1$.
| |
| For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is |
| complex and we have to compute full complex DFTs using the same |
| modules as in the complex case. Note that the inputs are all of the |
| form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we |
| never need to use the symmetry to access the conjugate elements with |
| $\mu > p_{i-1}/2$. |
| |
| If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$, |
| which is another special case. The input data in this case is purely |
| real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code |
| this as a special case, using real inputs and real-data DFT modules as |
| we did for $\mu=0$. However, for $\mu = p_{i-1}/2$ the twiddle factors |
| are not unity, |
| % |
| \begin{eqnarray} |
| \omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\ |
| &=& \exp(-i\pi\lambda'/f_i) |
| \end{eqnarray} |
| % |
| These twiddle factors introduce an additional phase which modifies the |
| symmetry of the outputs. Instead of the conjugate-complex symmetry |
| which applied for $\mu=0$ there is a shifted conjugate-complex |
| symmetry, |
| % |
| \begin{equation} |
| t_\lambda = t^*_{f-(\lambda+1)} |
| \end{equation} |
| % |
| This is easily proved, |
| % |
| \begin{eqnarray} |
| t_\lambda |
| &=& |
| \sum e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\ |
| t_{f - (\lambda + 1)} |
| &=& |
| \sum e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\ |
| &=& |
| \sum e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\ |
| &=& t^*_\lambda |
| \end{eqnarray} |
| % |
| The symmetry of the output means that we only need to compute half of |
| the output terms, the remaining terms being conjugates or zero |
| imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i |
| y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs |
| are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By |
| combining the twiddle factors and DFT matrix we can make a combined |
| module which applies both at the same time. By starting from the |
| complex DFT modules and bringing in twiddle factors we can derive |
optimized modules. Here are the modules given by
Temperton~\cite{temperton83real} for $z = W \Omega x$, where $x$ is
real and $z$ has the shifted conjugate-complex symmetry. The matrix of
twiddle factors, $\Omega$, is given by,
| % |
| \begin{equation} |
| \Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f}) |
| \end{equation} |
| % |
| We write $z$ in terms of two real vectors $z = a + i b$. |
| % |
| For $N=2$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| a_0 = x_0, & |
| b_0 = - x_1. |
| \end{array} |
| \end{equation} |
| % |
| For $N=3$, |
| % |
| \begin{equation} |
| \begin{array}{l} |
| t_1 = x_1 - x_2, |
| \end{array} |
| \end{equation} |
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1/2, & b_0 = - \sin(\pi/3) (x_1 + x_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
a_1 = x_0 - t_1
\end{array}
\end{equation}
| % |
| For $N=4$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2}, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_0 = x_0 + t_1, & b_0 = -x_2 - t_2, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_1 = x_0 - t_1, & b_1 = x_2 - t_2. |
| \end{array} |
| \end{equation} |
| % |
| For $N=5$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| t_1 = x_1 - x_4, & t_2 = x_1 + x_4, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_3 = x_2 - x_3, & t_4 = x_2 + x_3, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
\begin{array}{l}
t_7 = (\sqrt{5}/4)(t_1 + t_3),
\end{array}
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4, |
| \end{array} |
| \end{equation} |
\begin{equation}
\begin{array}{l}
a_2 = x_0 - t_5 .
\end{array}
\end{equation}
| % |
| For $N=6$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_3 = x_2 - x_4, & t_4 = x_1 + x_5, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_0 = t_5 - t_1, & b_0 = t_6 - t_2, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_1 = x_0 - t_3, & b_1 = x_3 - t_4, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| a_2 = t_5 + t_1, & b_2 = t_6 + t_2 |
| \end{array} |
| \end{equation} |
| |
| \section{Computing the mixed-radix inverse for real data} |
| % |
| To compute the inverse of the mixed-radix FFT on real data we step |
| through the algorithm in reverse and invert each operation. |
| |
| This gives the following algorithm using FFTPACK indexing, |
| % |
| \begin{algorithm} |
| \FOR{$i = n_f \dots 1$} |
| \FOR{$\mu = 0 \dots p_{i-1} - 1$} |
| \FOR{$b = 0 \dots q-1$} |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$t_\lambda \Leftarrow
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda')
v_{b p + (\lambda' p_{i-1} + \mu)}$}
| \COMMENT{DFT matrix-multiply module} |
| \ENDFOR |
| \FOR{$\lambda = 0 \dots f-1$} |
\STATE{$v'_{(\lambda q + b)p_{i-1} + \mu} \Leftarrow
\omega^{-\mu\lambda}_{p_{i}} t_\lambda$}
| \ENDFOR |
| |
| \ENDFOR |
| \ENDFOR |
| \STATE{$v \Leftarrow v'$} |
| \ENDFOR |
| \end{algorithm} |
| % |
| When $\mu=0$ we are applying an inverse DFT to half-complex data, |
| giving a real result. The twiddle factors are all unity. We can code |
this as a special case, just as we did for the forward routine. We
start with the complex module and eliminate the redundant terms. In this
| case it is the final result which has the zero imaginary part, and we |
| eliminate redundant terms by using the half-complex symmetry of the |
| input data. |
| |
| When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on |
| complex data, so no simplification is possible. |
| |
| When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we |
| have a combination of twiddle factors and DFTs on data with the |
| shifted half-complex symmetry which give a real result. We implement |
| this as a special module, essentially by inverting the system of |
| equations given for the forward case. We use the modules given by |
| Temperton, appropriately modified for our version of the algorithm. He |
| uses a slightly different convention which differs by factors of two |
| for some terms (consult his paper for details~\cite{temperton83real}). |
| |
| For $N=2$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| x_0 = 2 a_0, & x_1 = - 2 b_0 . |
| \end{array} |
| \end{equation} |
| % |
| For $N=3$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_0, \\
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{lll} |
| x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2 |
| \end{array} |
| \end{equation} |
| % |
| For $N=4$, |
| \begin{equation} |
| \begin{array}{ll} |
| t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1 , |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1). |
| \end{array} |
| \end{equation} |
| % |
| For $N=5$, |
| % |
| \begin{equation} |
| \begin{array}{ll} |
| t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_3 = (\sqrt{5}/2) (a_0 - a_1), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{l} |
| t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1), |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{l} |
t_5 = 2(\sin(2\pi/5) b_0 - \sin(2\pi/10) b_1),
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| t_6 = t_3 + t_2, & t_7 = t_3 - t_2, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| x_0 = t_1 + a_2, & x_1 = t_6 - t_4 , |
| \end{array} |
| \end{equation} |
| \begin{equation} |
| \begin{array}{ll} |
| x_2 = t_7 - t_5, & x_3 = - t_7 - t_5, |
| \end{array} |
| \end{equation} |
| \begin{equation} |
\begin{array}{l}
x_4 = -t_6 - t_4.
\end{array}
| \end{equation} |
| |
| \section{Conclusions} |
| % |
| We have described the basic algorithms for one-dimensional radix-2 and |
| mixed-radix FFTs. It would be nice to have a pedagogical explanation |
| of the split-radix FFT algorithm, which is faster than the simple |
| radix-2 algorithm we used. We could also have a whole chapter on |
| multidimensional FFTs. |
| % |
| %\section{Multidimensional FFTs} |
| %\section{Testing FFTs, Numerical Analysis} |
| |
| %\nocite{*} |
| \bibliographystyle{unsrt} |
| \bibliography{fftalgorithms} |
| |
| \end{document} |
| |
| |
| |
| |