\documentclass[fleqn,12pt]{article}
%
\setlength{\oddsidemargin}{-0.25in}
\setlength{\textwidth}{7.0in}
\setlength{\topmargin}{-0.25in}
\setlength{\textheight}{9.5in}
%
\usepackage{algorithmic}
\newenvironment{algorithm}{\begin{quote} %\vspace{1em}
\begin{algorithmic}\samepage}{\end{algorithmic} %\vspace{1em}
\end{quote}}
\newcommand{\Real}{\mathop{\mathrm{Re}}}
\newcommand{\Imag}{\mathop{\mathrm{Im}}}
\begin{document}
\title{FFT Algorithms}
\author{Brian Gough, {\tt bjg@network-theory.co.uk}}
\date{May 1997}
\maketitle
\section{Introduction}
Fast Fourier Transforms (FFTs) are efficient algorithms for
calculating the discrete Fourier transform (DFT),
%
\begin{eqnarray}
h_a &=& \mathrm{DFT}(g_b) \\
&=& \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N) \qquad 0 \leq a \leq N-1\\
&=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N= \exp(-2\pi i/N)
\end{eqnarray}
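As a reference point for the algorithms that follow, here is a direct
transcription of this definition into C (an illustrative sketch, not
part of GSL; the function name {\tt naive\_dft} is hypothetical). It
evaluates the matrix-vector product naively in $O(N^2)$ operations,
%
\begin{verbatim}
/* Naive O(N^2) DFT of the definition above (illustrative sketch,
   not part of GSL).  g is the input, h the output, both length N. */
#include <complex.h>
#include <math.h>

void naive_dft (const double complex *g, double complex *h, int N)
{
  for (int a = 0; a < N; a++)
    {
      h[a] = 0.0;
      for (int b = 0; b < N; b++)
        {
          /* W_N^{ab} = exp(-2 pi i a b / N) */
          h[a] += g[b] * cexp (-2 * M_PI * I * a * b / N);
        }
    }
}
\end{verbatim}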
%
The DFT usually arises as an approximation to the continuous Fourier
transform when functions are sampled at discrete intervals in space or
time. The naive evaluation of the discrete fourier transform is a
matrix-vector multiplication ${\mathbf W}\vec{g}$, and would take
$O(N^2)$ operations for $N$ data-points. The general principle of the
Fast Fourier Transform algorithms is to use a divide-and-conquer
strategy to factorize the matrix $W$ into smaller sub-matrices,
typically reducing the operation count to $O(N \sum f_i)$ if $N$ can
be factorized into smaller integers, $N=f_1 f_2 \dots f_n$.
This chapter explains the algorithms used in the GSL FFT routines
and provides some information on how to extend them. To learn more about
the FFT you should read the review article {\em Fast Fourier
Transforms: A Tutorial Review and A State of the Art} by Duhamel and
Vetterli~\cite{duhamel90}. There are several introductory books on the
FFT with example programs, such as {\em The Fast Fourier Transform} by
Brigham~\cite{brigham74} and {\em DFT/FFT and Convolution Algorithms}
by Burrus and Parks~\cite{burrus84}. In 1979 the IEEE published a
compendium of carefully-reviewed Fortran FFT programs in {\em Programs
for Digital Signal Processing}~\cite{committee79} which is a useful
reference for implementations of many different FFT algorithms. If you
are interested in using DSPs then the {\em Handbook of Real-Time Fast
Fourier Transforms}~\cite{smith95} provides detailed information on
the algorithms and hardware needed to design, build and test DSP
applications. Many FFT algorithms rely on results from number theory.
These results are covered in the books {\em Fast transforms:
algorithms, analyses, applications}, by Elliott and
Rao~\cite{elliott82}, {\em Fast Algorithms for Digital Signal
Processing} by Blahut~\cite{blahut} and {\em Number Theory in Digital
Signal Processing} by McClellan and Rader~\cite{mcclellan79}. There is
also an annotated bibliography of papers on the FFT and related topics
by Burrus~\cite{burrus-note}.
\section{Families of FFT algorithms}
%
There are two main families of FFT algorithms: the Cooley-Tukey
algorithm and the Prime Factor algorithm. These differ in the way they
map the full FFT into smaller sub-transforms. Of the Cooley-Tukey
algorithms there are two types of routine in common use: mixed-radix
(general-$N$) algorithms and radix-2 (power of 2) algorithms. Each
type of algorithm can be further classified by additional characteristics,
such as whether it operates in-place or uses additional scratch space,
whether its output is in a sorted or scrambled order, and whether it
uses decimation-in-time or -frequency iterations.
Mixed-radix algorithms work by factorizing the data vector into
shorter lengths. These can then be transformed by small-$N$ FFTs.
Typical programs include FFTs for small prime factors, such as 2, 3,
5, \dots which are highly optimized. The small-$N$ FFT modules act as
building blocks and can be multiplied together to make longer
transforms. By combining a reasonable set of modules it is possible to
compute FFTs of many different lengths. If the small-$N$ modules are
supplemented by an $O(N^2)$ general-$N$ module then an FFT of any
length can be computed, in principle. Of course, any lengths which
contain large prime factors would perform only as $O(N^2)$.
Radix-2 algorithms, or ``power of two'' algorithms, are simplified
versions of the mixed-radix algorithm. They are restricted to lengths
which are a power of two. The basic radix-2 FFT module only involves
addition and subtraction, so the algorithms are very simple. Radix-2
algorithms have been the subject of much research into optimizing the
FFT. Many of the most efficient radix-2 routines are based on the
``split-radix'' algorithm. This is actually a hybrid which combines
the best parts of both radix-2 and radix-4 (``power of 4'')
algorithms~\cite{sorenson86,duhamel86}.
The prime factor algorithm (PFA) is an alternative form of general-$N$
algorithm based on a different way of recombining small-$N$ FFT
modules~\cite{temperton83pfa,temperton85}. It has a very simple indexing
scheme which makes it attractive. However it only works in the case
where all factors are mutually prime. This requirement makes it more
suitable as a specialized algorithm for given lengths.
\begin{subsection}{FFTs of prime lengths}
%
Large prime lengths cannot be handled efficiently by any of these
algorithms. However it may still be possible to compute a DFT by using
results from number theory. Rader showed that it is possible to
convert a length-$p$ FFT (where $p$ is prime) into a convolution of
length-($p-1$). There is a simple identity between the convolution of
length $N$ and the FFT of the same length, so if $p-1$ is easily
factorizable this allows the convolution to be computed efficiently
via the FFT. The idea is presented in the original paper by
Rader~\cite{raderprimes} (also reprinted in~\cite{mcclellan79}), but
for more details see the theoretical books mentioned earlier.
\end{subsection}
%\subsection{In-place algorithms vs algorithms using scratch space}
%%
%For simple algorithms, such as the Radix-2 algorithms, the FFT can be
%performed in-place, without additional memory. For this to be possible
%the index calculations must be simple enough that the calculation of
%the result to be stored in index $k$ can be carried out at the same
%time as the data in $k$ is used, so that no temporary storage is
%needed.
%The mixed-radix algorithm is too complicated for this. It requires an
%area of scratch space sufficient to hold intermediate copies of the
%input data.
%\subsection{Self-sorting algorithms vs scrambling algorithms}
%%
%A self-sorting algorithm At each iteration of the FFT internal permutations are included
%which leave the final iteration in its natural order.
%The mixed-radix algorithm can be made self-sorting. The additional
%scratch space helps here, although it is in fact possible to design
%self-sorting algorithms which do not require scratch
%space.
%The in-place radix-2 algorithm is not self-sorting. The data is
%permuted, and transformed into bit-reversed order. Note that
%bit-reversal only refers to the order of the array, not the values for
%each index which are of course not changed. More precisely, the data
%stored in location $[b_n \dots b_2 b_1 b_0]$ is moved to location
%$[b_0 b_1 b_2 \dots b_n]$, where $b_0 \dots b_n$ are the raw bits of a
%given index. Placing the data in bit reversed order is easily done,
%using right shifts to extract the least significant bits of the index
%and left-shifts to feed them into a register for the bit-reversed
%location. Simple radix-2 FFT usually recompute the bit reverse
%ordering in the naive way for every call. For repeated calls it might
%be worthwhile to precompute the permutation and cache it. There are
%also more sophisticated algorithms which reduce the number of
%operations needed to perform bit-reversal~\cite{bit-reversal}.
%\begin{subsection}{Decimation-in-time (DIT) vs Decimation-in-Frequency (DIF)}
%\end{subsection}
\subsection{Optimization}
%
There is no such thing as the single fastest FFT {\em algorithm}. FFT
algorithms involve a mixture of floating point calculations, integer
arithmetic and memory access. Each of these operations will have
different relative speeds on different platforms. The performance of
an algorithm is a function of the hardware it is implemented on. The
goal of optimization is thus to choose the algorithm best suited to
the characteristics of a given hardware platform.
For example, the Winograd Fourier Transform (WFTA) is an algorithm
which is designed to reduce the number of floating point
multiplications in the FFT. However, it does this at the expense of
using many more additions and data transfers than other algorithms.
As a consequence the WFTA might be a good candidate algorithm for
machines where data transfers occupy a negligible time relative to
floating point arithmetic. However on most modern machines, where the
speed of data transfers is comparable to or slower than floating point
operations, it would be outperformed by an algorithm which used a
better mix of operations (i.e. more floating point operations but
fewer data transfers).
For a detailed study of this sort of effect, comparing the different
algorithms on different platforms, consult the paper {\it Effects of
Architecture Implementation on DFT Algorithm Performance} by Mehalic,
Rustan and Route~\cite{mehalic85}. The paper was written in the early
1980's and has data for super- and mini-computers which you are
unlikely to see today, except in a museum. However, the methodology is
still valid and it would be interesting to see similar results for
present day computers.
\section{FFT Concepts}
%
Factorization is the key principle of the mixed-radix FFT divide-and-conquer
strategy. If $N$ can be factorized into a product of $n_f$ integers,
%
\begin{equation}
N = f_1 f_2 ... f_{n_f} ,
\end{equation}
%
then the FFT itself can be divided into smaller FFTs for each factor.
More precisely, an FFT of length $N$ can be broken up into,
%
\begin{quote}
\begin{tabular}{l}
$(N/f_1)$ FFTs of length $f_1$, \\
$(N/f_2)$ FFTs of length $f_2$, \\
\dots \\
$(N/f_{n_f})$ FFTs of length $f_{n_f}$.
\end{tabular}
\end{quote}
%
The total number of operations for these sub-transforms will be $O(
N(f_1 + f_2 + ... + f_{n_f}))$. When the factors of $N$ are all small
integers this will be substantially less than $O(N^2)$. For example,
when $N$ is a power of 2 an FFT of length $N=2^m$ can be reduced to $m
N/2$ FFTs of length 2, or $O(N\log_2 N)$ operations. Here is a
demonstration which shows this:
We start with the full DFT,
%
\begin{eqnarray}
h_a &=& \sum_{b=0}^{N-1} g_b W_N^{ab} \qquad W_N=\exp(-2\pi i/N)
\end{eqnarray}
%
and split the sum into even and odd terms,
%
\begin{eqnarray}
\phantom{h_a}
% &=& \left(\sum_{b=0,2,4,\dots} + \sum_{b=1,3,5,\dots}\right) g_b W_N^{ab}\\
&=& \sum_{b=0}^{N/2-1} g_{2b} W_N^{a(2b)} +
\sum_{b=0}^{N/2-1} g_{2b+1} W_N^{a(2b+1)}.
\end{eqnarray}
%
This converts the original DFT of length $N$ into two DFTs of length
$N/2$,
%
\begin{equation}
h_a = \sum_{b=0}^{N/2-1} g_{2b} W_{(N/2)}^{ab} +
W_N^a \sum_{b=0}^{N/2-1} g_{2b+1} W_{(N/2)}^{ab}
\end{equation}
%
The first term is a DFT of the even elements of $g$. The second term
is a DFT of the odd elements of $g$, premultiplied by an exponential
factor $W_N^a$ (known as a {\em twiddle factor}).
%
\begin{equation}
\mathrm{DFT}(h) = \mathrm{DFT}(g_{even}) + W_N^a \mathrm{DFT}(g_{odd})
\end{equation}
%
By splitting the DFT into its even and odd parts we have reduced the
operation count from $N^2$ (for a DFT of length $N$) to $2 (N/2)^2$
(for two DFTs of length $N/2$). The cost of the splitting is that we
need an additional $O(N)$ operations to multiply by the twiddle factor
$W_N^a$ and recombine the two sums.
We can repeat the splitting procedure recursively $\log_2 N$ times
until the full DFT is reduced to DFTs of single terms. The DFT of a
single value is just the identity operation, which costs nothing.
However since $O(N)$ operations were needed at each stage to recombine
the even and odd parts the total number of operations to obtain the
full DFT is $O(N \log_2 N)$. If we had used a length which was a
product of factors $f_1$, $f_2$, $\dots$ we could have split the sum
in a similar way. First we would split terms corresponding to the
factor $f_1$, instead of the even and odd terms corresponding to a
factor of two. Then we would repeat this procedure for the subsequent
factors. This would lead to a final operation count of $O(N \sum
f_i)$.
This procedure gives some motivation for why the number of operations
in a DFT can in principle be reduced from $O(N^2)$ to $O(N \sum f_i)$.
It does not explain how to implement the algorithm
in practice, which is what we shall do in the next section.
\section{Radix-2 Algorithms}
%
For radix-2 FFTs it is natural to write array indices in binary form
because the length of the data is a power of two. This is nicely
explained in the article {\em The FFT: Fourier Transforming One Bit at
a Time} by P.B. Visscher~\cite{visscher96}. A binary representation
for indices is the key to deriving the simplest efficient radix-2
algorithms.
We can write an index $b$ ($0 \leq b \leq 2^{n}-1$) in binary
representation like this,
%
\begin{equation}
b = [b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + \dots + 2 b_1 + b_0 .
\end{equation}
%
Each of the $b_0, b_1, \dots, b_{n-1}$ are the bits (either 0 or 1) of
$b$.
Using this notation the original definition of the DFT
can be rewritten as a sum over the bits of $b$,
%
\begin{equation}
h(a) = \sum_{b=0}^{N-1} g_b \exp(-2\pi i a b /N)
\end{equation}
%
to give an equivalent summation like this,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0]) =
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
g([b_{n-1} \dots b_1 b_0]) W_N^{ab}
\end{equation}
%
where the bits of $a$ are $a=[a_{n-1} \dots a_1 a_0]$.
To reduce the number of operations in the sum we will use the
periodicity of the exponential term,
%
\begin{equation}
W_N^{x+N} = W_N^{x}.
\end{equation}
%
Most of the products $ab$ in $W_N^{ab}$ are greater than $N$. By
making use of this periodicity they can all be collapsed down into the
range $0\dots N-1$. This allows us to reduce the number of operations
by combining common terms, modulo $N$. Using this idea we can derive
decimation-in-time or decimation-in-frequency algorithms, depending on
how we break the DFT summation down into common terms. We'll first
consider the decimation-in-time algorithm.
\subsection{Radix-2 Decimation-in-Time (DIT)}
%
To derive the decimation-in-time algorithm we start by separating
out the most significant bit of the index $b$,
%
\begin{equation}
[b_{n-1} \dots b_1 b_0] = 2^{n-1}b_{n-1} + [b_{n-2} \dots b_1 b_0]
\end{equation}
%
Now we can evaluate the innermost sum of the DFT without any
dependence on the remaining bits of $b$ in the exponential,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-1}=0}^{1}
g(b)
W_N^{a(2^{n-1}b_{n-1}+[b_{n-2} \dots b_1 b_0])} \\
&=&
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
g(b)
W_N^{a(2^{n-1}b_{n-1})}
\end{eqnarray}
%
Looking at the term $W_N^{a(2^{n-1}b_{n-1})}$ we see that we can
remove most of the dependence on $a$ as well, by using the periodicity of the
exponential,
%
\begin{eqnarray}
W_N^{a(2^{n-1}b_{n-1})} &=&
\exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-1} b_{n-1}/ 2^n )\\
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] b_{n-1}/ 2 )\\
&=& \exp(-2\pi i ( 2^{n-2}a_{n-1} + \dots + a_1 + (a_0/2)) b_{n-1} )\\
&=& \exp(-2\pi i a_0 b_{n-1}/2 ) \\
&=& W_2^{a_0 b_{n-1}}
\end{eqnarray}
%
Thus the innermost exponential term simplifies so that it only
involves the highest order bit of $b$ and the lowest order bit of $a$,
and the sum can be reduced to,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-2}=0}^{1}
W_N^{a[b_{n-2} \dots b_1 b_0]}
\sum_{b_{n-1}=0}^{1}
g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
We can repeat this procedure for the next most significant bit of
$b$, $b_{n-2}$, using a similar identity,
%
\begin{eqnarray}
W_N^{a(2^{n-2}b_{n-2})}
&=& \exp(-2\pi i [a_{n-1} \dots a_1 a_0] 2^{n-2} b_{n-2}/ 2^n )\\
&=& W_4^{ [a_1 a_0] b_{n-2}}.
\end{eqnarray}
%
to give a formula with even less dependence on the bits of $a$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
\sum_{b_0=0}^{1}
\sum_{b_1=0}^{1}
\dots
\sum_{b_{n-3}=0}^{1}
W_N^{a[b_{n-3} \dots b_1 b_0]}
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
g(b)
W_2^{a_0 b_{n-1}}.
\end{equation}
%
If we repeat the process for all the remaining bits we obtain a
simplified DFT formula which is the basis of the radix-2
decimation-in-time algorithm,
%
\begin{eqnarray}
h([a_{n-1} \dots a_1 a_0]) &=&
\sum_{b_0=0}^{1}
W_{N}^{[a_{n-1} \dots a_1 a_0]b_0}
%\sum_{b_1=0}^{1}
%W_{N/2}^{[a_{n-1} \dots a_1 a_0]b_1}
\dots
\sum_{b_{n-2}=0}^{1}
W_4^{ [a_1 a_0] b_{n-2}}
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g(b)
\end{eqnarray}
%
To convert the formula to an algorithm we expand out the sum
recursively, evaluating each of the intermediate summations, which we
denote by $g_1$, $g_2$, \dots, $g_n$,
%
\begin{eqnarray}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-1}=0}^{1}
W_2^{a_0 b_{n-1}}
g([b_{n-1} b_{n-2} b_{n-3} \dots b_1 b_0]) \\
g_2(a_0, {}_{\phantom{-2}} a_{1}, b_{n-3}, \dots, b_1, b_0)
&=&
\sum_{b_{n-2}=0}^{1}
W_4^{[a_1 a_0] b_{n-2}}
g_1(a_0, b_{n-2}, b_{n-3}, \dots, b_1, b_0) \\
g_3(a_0, {}_{\phantom{-2}} a_{1}, {}_{\phantom{-3}} a_{2}, \dots, b_1, b_0)
&=&
\sum_{b_{n-3}=0}^{1}
W_8^{[a_2 a_1 a_0] b_{n-3}}
g_2(a_0, a_1, b_{n-3}, \dots, b_1, b_0) \\
\dots &=& \dots \\
g_n(a_0, a_{1}, a_{2}, \dots, a_{n-2}, a_{n-1})
&=&
\sum_{b_{0}=0}^{1}
W_N^{[a_{n-1} \dots a_1 a_0]b_0}
g_{n-1}(a_0, a_1, a_2, \dots, a_{n-2}, b_0)
\end{eqnarray}
%
After the final sum, we can obtain the transform $h$ from $g_n$,
%
\begin{equation}
h([a_{n-1} \dots a_1 a_0])
=
g_n(a_0, a_1, \dots, a_{n-1})
\end{equation}
%
Note that we left the storage arrangements of the intermediate sums
unspecified by using the bits as function arguments and not as an
index. The storage of intermediate sums is different for the
decimation-in-time and decimation-in-frequency algorithms.
Before deciding on the best storage scheme we'll show that the
computation of each stage, $g_1$, $g_2$, \dots, can be carried out {\em
in-place}. For example, in the case of $g_1$, the inputs are,
%
\begin{equation}
g([\underline{b_{n-1}} b_{n-2} b_{n-3} \dots b_1 b_0])
\end{equation}
%
for $b_{n-1}=(0,1)$, and the corresponding outputs are,
%
\begin{equation}
g_1(\underline{a_0},b_{n-2}, b_{n-3}, \dots, b_1, b_0)
\end{equation}
%
for $a_0=(0,1)$. It's clear that if we hold $b_{n-2}, b_{n-3}, \dots,
b_1, b_0$ fixed and compute the sum over $b_{n-1}$ in memory for both
values of $a_0 = 0,1$ then we can store the result for $a_0=0$ in the
location which originally held the term with $b_{n-1}=0$, and the result
for $a_0=1$ in the location which originally held $b_{n-1}=1$. The two
inputs and two
outputs are known as {\em dual node pairs}. At each stage of the
calculation the sums for each dual node pair are independent of the
others. It is this property which allows an in-place calculation.
So for an in-place pass our storage has to be arranged so that the two
outputs $g_1(a_0,\dots)$ overwrite the two input terms
$g([b_{n-1},\dots])$. Note that the order of $a$ is reversed from the
natural order of $b$, i.e.\ the least significant bit of $a$
replaces the most significant bit of $b$. This is inconvenient
because $a$ occurs in its natural order in all the exponentials,
$W^{ab}$. We could keep track of both $a$ and its bit-reversed
counterpart at all times, but there is a neat trick
which avoids this: if we bit-reverse the order of the input data $g$
before we start the calculation we can also bit-reverse the order of
$a$ when storing intermediate results. Since the storage involving $a$
was originally in bit-reversed order the switch in the input $g$ now
allows us to use normal ordered storage for $a$, the same ordering
that occurs in the exponential factors.
This is complicated to explain, so here is an example of the 4 passes
needed for an $N=16$ decimation-in-time FFT, with the initial data
stored in bit-reversed order,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 a_0])
&=&
\sum_{b_3=0}^{1} W_2^{a_0 b_3} g([b_0 b_1 b_2 b_3])
\\
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2=0}^{1} W_4^{[a_1 a_0] b_2} g_1([b_0 b_1 b_2 a_0])
\\
g_3([b_0 a_2 a_1 a_0])
&=&
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\\
h(a) = g_4([a_3 a_2 a_1 a_0])
&=&
\sum_{b_0=0}^{1} W_{16}^{[a_3 a_2 a_1 a_0] b_0} g_3([b_0 a_2 a_1 a_0])
\end{eqnarray}
%
We compensate for the bit reversal of the input data by accessing $g$
with the bit-reversed form of $b$ in the first stage. This ensures
that we are still carrying out the same calculation, using the same
data, and not accessing different values. Only single bits of $b$ ever
occur in the exponential so we never need the bit-reversed form of
$b$.
Let's examine the third pass in detail,
%
\begin{equation}
g_3([b_0 a_2 a_1 a_0])
=
\sum_{b_1=0}^{1} W_8^{[a_2 a_1 a_0] b_1} g_2([b_0 b_1 a_1 a_0])
\end{equation}
%
First note that only one bit, $b_1$, varies in each summation. The
other bits of $b$ ($b_0$) and of $a$ ($a_1 a_0$) are essentially
``spectators'' -- we must loop over all combinations of these bits and
carry out the same basic calculation for each, remembering to update
the exponentials involving $W_8$ appropriately. If we are storing the
results in-place (with $g_3$ overwriting $g_2$) we will need to compute
the sums involving $b_1=0,1$ and $a_2=0,1$ simultaneously.
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[0 a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) + W_8^{[1 a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
We can write this in a more symmetric form by simplifying the exponential,
%
\begin{equation}
W_8^{[a_2 a_1 a_0]}
= W_8^{4 a_2 + [a_1 a_0]}
= (-1)^{a_2} W_8^{[a_1 a_0]}
\end{equation}
%
\begin{equation}
\left(
\begin{array}{c}
g_3([b_0 0 a_1 a_0]) \vphantom{W_8^{[]}} \\
g_3([b_0 1 a_1 a_0]) \vphantom{W_8^{[]}}
\end{array}
\right)
=
\left(
\begin{array}{c}
g_2([b_0 0 a_1 a_0]) + W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0]) \\
g_2([b_0 0 a_1 a_0]) - W_8^{[a_1 a_0]} g_2([b_0 1 a_1 a_0])
\end{array}
\right)
\end{equation}
%
The exponentials $W_8^{[a_1 a_0]}$ are referred to as {\em twiddle
factors}. The form of this calculation, a symmetrical sum and
difference involving a twiddle factor is called {\em a butterfly}.
It is often shown diagrammatically, and in the case $b_0=a_0=a_1=0$
would be drawn like this,
%
\begin{center}
\setlength{\unitlength}{1cm}
\begin{picture}(9,4)
%
%\put(0,0){\line(1,0){8}}
%\put(0,0){\line(0,1){4}}
%\put(8,4){\line(0,-1){4}}
%\put(8,4){\line(-1,0){8}}
%
\put(0,1){$g_2(4)$} \put(4.5,1){$g_3(4)=g_2(0) - W^a_8 g_2(4)$}
\put(0,3){$g_2(0)$} \put(4.5,3){$g_3(0)=g_2(0) + W^a_8 g_2(4)$}
\put(1,1){\vector(1,0){0.5}}
\put(1.5,1){\line(1,0){0.5}}
\put(1.5,0.5){$W^a_8$}
\put(1,3){\vector(1,0){0.5}}\put(1.5,3){\line(1,0){0.5}}
\put(2,1){\circle*{0.1}}
\put(2,3){\circle*{0.1}}
\put(2,1){\vector(1,1){2}}
\put(2,1){\vector(1,0){1}}
\put(3,1){\line(1,0){1}}
\put(3,0.5){$-1$}
\put(2,3){\vector(1,-1){2}}
\put(2,3){\vector(1,0){1}}
\put(3,3){\line(1,0){1}}
\put(4,1){\circle*{0.1}}
\put(4,3){\circle*{0.1}}
\end{picture}
\end{center}
%
The inputs are shown on the left and the outputs on the right. The
outputs are computed by multiplying the incoming lines by their
accompanying factors (shown next to the lines) and summing the results
at each node.
In general, denoting the bit for dual-node pairs by $\Delta$ and the
remaining bits of $a$ and $b$ by ${\hat a}$ and ${\hat b}$, the
butterfly is,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) \\
g({\hat b} + \Delta + {\hat a}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat b} + {\hat a}) + W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})\\
g({\hat b} + {\hat a}) - W_{2\Delta}^{\hat a} g({\hat b} + \Delta + {\hat a})
\end{array}
\right)
\end{equation}
%
where ${\hat a}$ runs from $0 \dots \Delta-1$ and ${\hat b}$ runs
through $0 \times 2\Delta$, $1\times 2\Delta$, $\dots$, $(N/(2\Delta) -
1)\,2\Delta$. The value of $\Delta$ is 1 on the first pass, 2 on the
second pass and $2^{n-1}$ on the $n$-th pass. Each pass requires
$N/2$ in-place computations, each involving two input locations and
two output locations.
In the example above $\Delta = [100] = 4$, ${\hat a} = [a_1 a_0]$ and
${\hat b} = [b_0 0 0 0]$.
This leads to the canonical radix-2 decimation-in-time FFT algorithm
for $2^n$ data points stored in the array $g(0) \dots g(2^n-1)$.
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\STATE {$\Delta \Leftarrow 1$}
\FOR {$\mbox{pass} = 1 \dots n$}
\STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
\FOR {$(a = 0 ; a < \Delta ; a{++})$}
\FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(b+a) + W^a g(b+\Delta+a)$}
\STATE{$t_1 \Leftarrow g(b+a) - W^a g(b+\Delta+a)$}
\STATE{$g(b+a) \Leftarrow t_0$}
\STATE{$g(b+\Delta+a) \Leftarrow t_1$}
\ENDFOR
\ENDFOR
\STATE{$\Delta \Leftarrow 2\Delta$}
\ENDFOR
\end{algorithm}
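As a concrete illustration, the algorithm above can be transcribed
almost line-by-line into C. The following is only a sketch, not the GSL
source: the names are hypothetical, the {\tt bit\_reverse} routine is
the one discussed below, and $W^a$ is built up by the recursion
$W^{a+1}=W\,W^a$ described later in this section,
%
\begin{verbatim}
/* Radix-2 decimation-in-time FFT for N = 2^n complex points
   (illustrative sketch, not the GSL source). */
#include <complex.h>
#include <math.h>

void bit_reverse (double complex *g, int N);  /* given below */

void fft_radix2_dit (double complex *g, int N)
{
  bit_reverse (g, N);            /* put g into bit-reversed order */
  for (int delta = 1; delta < N; delta *= 2)
    {
      /* W = exp(-2 pi i / (2 delta)) for this pass */
      double complex W = cexp (-M_PI * I / delta);
      double complex Wa = 1.0;                /* W^a */
      for (int a = 0; a < delta; a++)
        {
          for (int b = 0; b < N; b += 2 * delta)
            {
              double complex t0 = g[b + a] + Wa * g[b + delta + a];
              double complex t1 = g[b + a] - Wa * g[b + delta + a];
              g[b + a] = t0;
              g[b + delta + a] = t1;
            }
          Wa *= W;                            /* W^{a+1} = W W^a */
        }
    }
}
\end{verbatim}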
%
%This algorithm appears in Numerical Recipes as the routine {\tt
%four1}, where the implementation is attributed to N.M. Brenner.
%
\subsection{Details of the Implementation}
It is straightforward to implement a simple radix-2 decimation-in-time
routine from the algorithm above. Some optimizations can be made by
pulling the special case of $a=0$ out of the loop over $a$, to avoid
unnecessary multiplications when $W^a=1$,
%
\begin{algorithm}
\FOR{$(b = 0 ; b < N ; b {+=} 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(b) + g(b+\Delta)$}
\STATE{$t_1 \Leftarrow g(b) - g(b+\Delta)$}
\STATE{$g(b) \Leftarrow t_0$}
\STATE{$g(b+\Delta) \Leftarrow t_1$}
\ENDFOR
\end{algorithm}
%
There are several algorithms for doing fast bit-reversal. We use the
Gold-Rader algorithm, which is simple and does not require any working
space,
%
\begin{algorithm}
\STATE {$j \Leftarrow 0$}
\FOR{$i = 0 \dots N - 2$}
\STATE {$k \Leftarrow N / 2$}
\IF {$i < j$}
\STATE {swap $g(i)$ and $g(j)$}
\ENDIF
\WHILE {$k \leq j$}
\STATE{$j \Leftarrow j - k$}
\STATE{$k \Leftarrow k / 2$}
\ENDWHILE
\STATE{$j \Leftarrow j + k$}
\ENDFOR
\end{algorithm}
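In C the Gold-Rader permutation might look as follows (again a sketch
rather than the library code; note that the counter $j$ must start at
zero),
%
\begin{verbatim}
/* Gold-Rader bit-reversal permutation of N = 2^n complex values
   (illustrative sketch). */
#include <complex.h>

void bit_reverse (double complex *g, int N)
{
  int j = 0;                      /* bit-reversed counterpart of i */
  for (int i = 0; i < N - 1; i++)
    {
      if (i < j)                  /* swap each pair only once */
        {
          double complex tmp = g[i];
          g[i] = g[j];
          g[j] = tmp;
        }
      /* add one to j in bit-reversed arithmetic: clear leading
         ones from the top bit down, then set the first zero bit */
      int k = N / 2;
      while (k <= j)
        {
          j -= k;
          k /= 2;
        }
      j += k;
    }
}
\end{verbatim}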
%
The Gold-Rader algorithm is typically twice as fast as a naive
bit-reversal algorithm (where the bit reversal is carried out by
left-shifts and right-shifts on the index). The library also has a
routine for the Rodriguez bit reversal algorithm, which also does not
require any working space~\cite{rodriguez89}. There are faster bit
reversal algorithms, but they all use additional scratch
space~\cite{rosel89}.
Within the loop for $a$ we can compute $W^a$ using a trigonometric
recursion relation,
%
\begin{eqnarray}
W^{a+1} &=& W W^a \\
&=& (\cos(2\pi/2\Delta) - i \sin(2\pi/2\Delta)) W^a
\end{eqnarray}
%
This requires only $2 \log_2 N$ trigonometric calls, to compute the
initial values of $W = \exp(-2\pi i /2\Delta)$ for each pass.
\subsection{Radix-2 Decimation-in-Frequency (DIF)}
%
To derive the decimation-in-frequency algorithm we start by separating
out the lowest order bit of the index $a$. Here is an example for the
decimation-in-frequency $N=16$ DFT.
%
\begin{eqnarray}
W_{16}^{[a_3 a_2 a_1 a_0][b_3 b_2 b_1 b_0]}
&=&
W_{16}^{[a_3 a_2 a_1 a_0][b_2 b_1 b_0]} W_{16}^{[a_3 a_2 a_1 a_0] [b_3
0 0 0]} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} W_2^{a_0
b_3} \\
&=&
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]} W_{16}^{a_0 [b_2 b_1 b_0]} (-1)^{a_0 b_3}
\end{eqnarray}
%
By repeating the same type of expansion on the term,
%
\begin{equation}
W_8^{[a_3 a_2 a_1][b_2 b_1 b_0]}
\end{equation}
%
we can reduce the transform to an alternative simple form,
%
\begin{equation}
h(a) =
\sum_{b_0=0}^1 (-1)^{a_3 b_0} W_4^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} W_8^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g(b)
\end{equation}
%
To implement this we can again write the sum recursively. In this case
we do not have the problem of the order of $a$ being bit reversed --
the calculation can be done in-place using the natural ordering of
$a$ and $b$,
%
\begin{eqnarray}
g_1([a_0 b_2 b_1 b_0])
&=&
W_{16}^{a_0 [b_2 b_1 b_0]}
\sum_{b_3=0}^1 (-1)^{a_0 b_3} g([b_3 b_2 b_1 b_0]) \\
g_2([a_0 a_1 b_1 b_0])
&=&
W_{8}^{a_1 [b_1 b_0]}
\sum_{b_2=0}^1 (-1)^{a_1 b_2} g_1([a_0 b_2 b_1 b_0]) \\
g_3([a_0 a_1 a_2 b_0])
&=&
W_{4}^{a_2 b_0}
\sum_{b_1=0}^1 (-1)^{a_2 b_1} g_2([a_0 a_1 b_1 b_0]) \\
h(a)
=
g_4([a_0 a_1 a_2 a_3])
&=&
\sum_{b_0=0}^1 (-1)^{a_3 b_0} g_3([a_0 a_1 a_2 b_0])
\end{eqnarray}
%
The final pass leaves the data for $h(a)$ in bit-reversed order, but
this is easily fixed by a final bit-reversal of the ordering.
The basic in-place calculation or butterfly for each pass is slightly
different from the decimation-in-time version,
%
\begin{equation}
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) \\
g({\hat a} + \Delta + {\hat b}) \\
\end{array}
\right)
\leftarrow
\left(
\begin{array}{c}
g({\hat a} + {\hat b}) + g({\hat a} + \Delta + {\hat b})\\
W_{2\Delta}^{\hat b}
\left( g({\hat a} + {\hat b}) - g({\hat a} + \Delta + {\hat b}) \right)
\end{array}
\right)
\end{equation}
%
In each pass ${\hat b}$ runs from $0 \dots \Delta-1$ and ${\hat
a}$ runs through $0, 2\Delta, \dots, (N/(2\Delta) -1)\,2\Delta$. On the
first pass we start with $\Delta=8$, and on subsequent passes $\Delta$
takes the values $4, 2, 1$.
This leads to the canonical radix-2 decimation-in-frequency FFT
algorithm for $2^n$ data points stored in the array $g(0) \dots
g(2^n-1)$.
%
\begin{algorithm}
\STATE {$\Delta \Leftarrow 2^{n-1}$}
\FOR {$\mbox{pass} = 1 \dots n$}
\STATE {$W \Leftarrow \exp(-2 \pi i / 2\Delta)$}
\FOR {$(b = 0 ; b < \Delta ; b++)$}
\FOR{$(a = 0 ; a < N ; a += 2*\Delta)$}
\STATE{$t_0 \Leftarrow g(b+a) + g(a+\Delta+b)$}
\STATE{$t_1 \Leftarrow W^b \left( g(a+b) - g(a+\Delta+b) \right)$}
\STATE{$g(a+b) \Leftarrow t_0$}
\STATE{$g(a+\Delta+b) \Leftarrow t_1$}
\ENDFOR
\ENDFOR
\STATE{$\Delta \Leftarrow \Delta/2$}
\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}
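The corresponding C transcription (again an illustrative sketch, using
the {\tt bit\_reverse} routine from before) differs from the
decimation-in-time version only in the pass structure and the placement
of the twiddle factor and bit-reversal,
%
\begin{verbatim}
/* Radix-2 decimation-in-frequency FFT for N = 2^n complex points
   (illustrative sketch, not the GSL source). */
#include <complex.h>
#include <math.h>

void bit_reverse (double complex *g, int N);  /* as before */

void fft_radix2_dif (double complex *g, int N)
{
  for (int delta = N / 2; delta >= 1; delta /= 2)
    {
      /* W = exp(-2 pi i / (2 delta)) for this pass */
      double complex W = cexp (-M_PI * I / delta);
      double complex Wb = 1.0;                /* W^b */
      for (int b = 0; b < delta; b++)
        {
          for (int a = 0; a < N; a += 2 * delta)
            {
              double complex t0 = g[a + b] + g[a + delta + b];
              double complex t1 = Wb * (g[a + b] - g[a + delta + b]);
              g[a + b] = t0;
              g[a + delta + b] = t1;
            }
          Wb *= W;
        }
    }
  bit_reverse (g, N);            /* final reordering */
}
\end{verbatim}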
%
\section{Self-Sorting Mixed-Radix Complex FFTs}
%
This section is based on the review article {\em Self-sorting
Mixed-Radix Fast Fourier Transforms} by Clive
Temperton~\cite{temperton83}. You should consult his article for full
details of all the possible algorithms (there are many
variations). Here I have annotated the derivation of the simplest
mixed-radix decimation-in-frequency algorithm.
For general-$N$ FFT algorithms the simple binary-notation of radix-2
algorithms is no longer useful. The mixed-radix FFT has to be built
up using products of matrices acting on a data vector. The aim is to
take the full DFT matrix $W_N$ and factor it into a set of small,
sparse matrices corresponding to each factor of $N$.
We'll denote the components of matrices using either subscripts or
function notation,
%
\begin{equation}
M_{ij} = M(i,j)
\end{equation}
%
with (C-like) indices running from 0 to $N-1$. Matrix products will be
denoted using square brackets,
%
\begin{equation}
[AB]_{ij} = \sum_{k} A_{ik} B_{kj}
\end{equation}
%
%
Three special matrices will be needed in the mixed-radix factorization
of the DFT: the identity matrix, $I$, a permutation matrix, $P$ and a
matrix of twiddle factors, $D$, as well as the normal DFT matrices
$W_n$.
We write the identity matrix of order $r$ as $I_r(n,m)$,
%
\begin{equation}
I_r(n,m) = \delta_{nm}
\end{equation}
%
for $0 \leq n,m \leq r-1$.
We also need to define a permutation matrix $P^a_b$ that performs
digit reversal of the ordering of a vector. If the index of a vector
$j= 0\dots N-1$ is factorized into $j = la +m$, with $0 \leq l \leq
b-1$ and $0 \leq m \leq a-1$ then the operation of the matrix $P$ will
exchange positions $la+m$ and $mb+l$ in the vector (this sort of
digit-reversal is the generalization of bit-reversal to a number
system with bases $a$ and $b$).
In mathematical terms $P$ is a square matrix of size $ab \times ab$
with the property,
%
\begin{eqnarray}
P^a_b(j,k) &=& 1 ~\mbox{if}~ j=ra+s ~\mbox{and}~ k=sb+r \\
&=& 0 ~\mbox{otherwise}
\end{eqnarray}
%
Finally the FFT algorithm needs a matrix of twiddle factors, $D^a_b$,
for the trigonometric sums. $D^a_b$ is a diagonal square matrix of
size $ab \times ab$ with the definition,
%
\begin{eqnarray}
D^a_b(j,k) &=& \omega^{sr}_{ab} ~\mbox{if}~ j=k=sb+r \\
&=& 0 ~\mbox{otherwise}
\end{eqnarray}
%
where $\omega_{ab} = e^{-2\pi i/ab}$, $0 \leq s \leq a-1$ and $0 \leq r
\leq b-1$.
\subsection{The Kronecker Product}
The Kronecker matrix product plays an important role in all the
algorithms for combining operations on different subspaces. The
Kronecker product $A \otimes B$ of two square matrices $A$ and $B$, of
sizes $a \times a$ and $b \times b$ respectively, is a square matrix
of size $a b \times a b$, defined as,
%
\begin{equation}
[A \otimes B] (tb+u, rb+s) = A(t,r) B(u,s)
\end{equation}
%
where $0 \leq u,s < b$ and $0 \leq t,r < a$. Let's examine a specific
example. If we take a $2 \times 2$ matrix and a $3
\times 3$ matrix,
%
\begin{equation}
\begin{array}{ll}
A =
\left(
\begin{array}{cc}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{array}
\right)
&
B =
\left(
\begin{array}{ccc}
b_{11} & b_{12} & b_{13} \\
b_{21} & b_{22} & b_{23} \\
b_{31} & b_{32} & b_{33}
\end{array}
\right)
\end{array}
\end{equation}
%
then the Kronecker product $A \otimes B$ is,
%
\begin{eqnarray}
A \otimes B &= &
\left(
\begin{array}{cc}
a_{11} B & a_{12} B \\
a_{12} B & a_{22} B
\end{array}
\right) \\
&=&
\left(
\begin{array}{cccccc}
a_{11} b_{11} & a_{11} b_{12} & a_{11} b_{13} &
a_{12} b_{11} & a_{12} b_{12} & a_{12} b_{13} \\
a_{11} b_{21} & a_{11} b_{22} & a_{11} b_{23} &
a_{12} b_{21} & a_{12} b_{22} & a_{12} b_{23} \\
a_{11} b_{31} & a_{11} b_{32} & a_{11} b_{33} &
a_{12} b_{31} & a_{12} b_{32} & a_{12} b_{33} \\
a_{21} b_{11} & a_{21} b_{12} & a_{21} b_{13} &
a_{22} b_{11} & a_{22} b_{12} & a_{22} b_{13} \\
a_{21} b_{21} & a_{21} b_{22} & a_{21} b_{23} &
a_{22} b_{21} & a_{22} b_{22} & a_{22} b_{23} \\
a_{21} b_{31} & a_{21} b_{32} & a_{21} b_{33} &
a_{22} b_{31} & a_{22} b_{32} & a_{22} b_{33}
\end{array}
\right)
\end{eqnarray}
%
When the Kronecker product $A \otimes B$ acts on a vector of length
$ab$, each matrix operates on a different subspace of the vector.
Writing the index $i$ as $i=t b + u$, with $0\leq u \leq b-1$
and $0\leq t\leq a-1$, we can see this explicitly by looking at components,
%
\begin{eqnarray}
[(A \otimes B) v]_{(tb+u)}
& = & \sum_{t'=0}^{a-1} \sum_{u'=0}^{b-1}
[A \otimes B]_{(tb+u,t'b+u')} v_{t'b+u'} \\
& = & \sum_{t'u'} A_{tt'} B_{uu'} v_{t'b+u'}
\end{eqnarray}
%
The matrix $B$ operates on the ``index'' $u'$, for all values of $t'$, and
the matrix $A$ operates on the ``index'' $t'$, for all values of $u'$.
%
The most important property needed for deriving the FFT factorization
is that the matrix product of two Kronecker products is the Kronecker
product of the two matrix products,
%
\begin{equation}
(A \otimes B)(C \otimes D) = (AC \otimes BD)
\end{equation}
%
This follows straightforwardly from the original definition of the
Kronecker product.
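For reference, the component definition translates directly into code.
The following sketch (illustrative only; the name {\tt kronecker} is
hypothetical) forms the full Kronecker product as a dense matrix, which
is only practical for small sizes but useful for checking the
identities in this section numerically,
%
\begin{verbatim}
/* Dense Kronecker product C = A (x) B of an a-by-a matrix A and a
   b-by-b matrix B, all stored row-major (illustrative sketch). */
#include <complex.h>

void kronecker (int a, int b, const double complex *A,
                const double complex *B, double complex *C)
{
  int n = a * b;
  for (int t = 0; t < a; t++)
    for (int r = 0; r < a; r++)
      for (int u = 0; u < b; u++)
        for (int s = 0; s < b; s++)
          /* [A (x) B](t b + u, r b + s) = A(t,r) B(u,s) */
          C[(t * b + u) * n + (r * b + s)]
            = A[t * a + r] * B[u * b + s];
}
\end{verbatim}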
\subsection{Two factor case, $N=ab$}
%
First consider the simplest possibility, where the data length $N$ can
be divided into two factors, $N=ab$. The aim is to reduce the DFT
matrix $W_N$ into simpler matrices corresponding to each factor. To
make the derivation easier we will start from the known factorization
and verify it (the initial factorization can be guessed by
generalizing from simple cases). Here is the factorization we are
going to prove,
%
\begin{equation}
W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
\end{equation}
%
We can check it by expanding the product into components,
%
\begin{eqnarray}
\lefteqn{[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)} \\
& = &
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
[(W_b \otimes I_a)](la+m,ua+t) [P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
\end{eqnarray}
%
where we have split the indices to match the Kronecker product, $0 \leq
m, r \leq a-1$, $0 \leq l, s \leq b-1$. The first term in the sum can
easily be reduced to its component form,
%
\begin{eqnarray}
[(W_b \otimes I_a)](la+m,ua+t)
&=& W_b(l,u) I_a(m,t) \\
&=& \omega_b^{lu} \delta_{mt}
\end{eqnarray}
%
The second term is more complicated. We can expand the Kronecker
product like this,
\begin{eqnarray}
(W_a \otimes I_b)(tb+u,rb+s)
&=& W_a(t,r) I_b(u,s) \\
&=& \omega_a^{tr} \delta_{us}
\end{eqnarray}
%
and use this term to build up the product, $P^a_b D^a_b (W_a \otimes
I_b)$. We first multiply by $D^a_b$,
%
\begin{equation}
[D^a_b (W_a \otimes I_b)](tb+u,rb+s)
=
\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{equation}
%
and then apply the permutation matrix, $P^a_b$, which digit-reverses
the ordering of the first index, to obtain,
%
\begin{equation}
[P^a_b D^a_b (W_a \otimes I_b)](ua+t,rb+s)
=
\omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{equation}
%
Combining the two terms in the matrix product we can obtain the full
expansion in terms of the exponential $\omega$,
%
\begin{eqnarray}
[(W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b)](la+m,rb+s)
&=&
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
\end{eqnarray}
%
If we evaluate this sum explicitly we can make the connection between
the product involving $W_a$ and $W_b$ (above) and the expansion of the
full DFT matrix $W_{ab}$,
%
\begin{eqnarray}
\sum_{u=0}^{b-1} \sum_{t=0}^{a-1}
\omega_b^{lu} \delta_{mt} \omega^{tu}_{ab} \omega^{tr}_{a} \delta_{su}
&=& \omega^{ls}_b \omega^{ms}_{ab} \omega^{mr}_a \\
&=& \omega^{als + ms +bmr}_{ab} \\
&=& \omega^{als + ms +bmr}_{ab} \omega^{lrab}_{ab} \quad\mbox{using~} \omega^{ab}_{ab} =1\\
&=& \omega^{(la+m)(rb+s)}_{ab} \\
&=& W_{ab}(la+m,rb+s)
\end{eqnarray}
%
The final line shows that the matrix product given above is identical to
the full two-factor DFT matrix, $W_{ab}$.
%
Thus the full DFT matrix $W_{ab}$ for two factors $a$, $b$ can be
broken down into a product of sub-transforms, $W_a$ and $W_b$, plus
permutations, $P$, and twiddle factors, $D$, according to the formula,
%
\begin{equation}
W_{ab} = (W_b \otimes I_a) P^a_b D^a_b (W_a \otimes I_b).
\end{equation}
%
This relation is the foundation of the general-$N$ mixed-radix FFT algorithm.
\subsection{Three factor case, $N=abc$}
%
The result for the two-factor expansion can easily be generalized to
three factors. We first consider $abc$ as being a product of two
factors $a$ and $(bc)$, and then further expand the product $(bc)$ into
$b$ and $c$. The first step of the expansion looks like this,
%
\begin{eqnarray}
W_{abc} &=& W_{a(bc)}\\
&=& (W_{bc} \otimes I_a) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) .
\end{eqnarray}
%
And after using the two-factor result to expand out $W_{bc}$ we obtain
the factorization of $W_{abc}$,
%
\begin{eqnarray}
W_{abc} &=& (((W_c \otimes I_b) P^b_c D^b_c (W_b \otimes I_c)) \otimes I_a )
P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
&=& (W_c \otimes I_{ab}) (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) P^a_{bc} D^a_{bc} (W_a \otimes I_{bc})
\end{eqnarray}
%
We can write this factorization in a product form, with one term for
each factor,
%
\begin{equation}
W_{abc} = T_3 T_2 T_1
\end{equation}
%
where we read off $T_1$, $T_2$ and $T_3$,
%
\begin{eqnarray}
T_1 &=& P^a_{bc} D^a_{bc} (W_a \otimes I_{bc}) \\
T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
T_3 &=& (W_c \otimes I_{ab} )
\end{eqnarray}
%
\subsection{General case, $N=f_1 f_2 \dots f_{n_f}$}
%
If we continue the procedure that we have used for two- and
three-factors then a general pattern begins to emerge in the
factorization of $W_{f_1 f_2 \dots f_{n_f}}$. To see the beginning of
the pattern we can rewrite the three factor case as,
%
\begin{eqnarray}
T_1 &=& (P^a_{bc} D^a_{bc} \otimes I_1) (W_a \otimes I_{bc}) \\
T_2 &=& (P^b_c D^b_c \otimes I_a) (W_b \otimes I_{ac}) \\
T_3 &=& (P^c_1 D^c_1 \otimes I_{ab}) (W_c \otimes I_{ab} )
\end{eqnarray}
%
using the special cases $P^c_1 = D^c_1 = I_c$.
%
In general, we can write the factorization of $W_N$ for $N= f_1 f_2
\dots f_{n_f}$ as,
%
\begin{equation}
W_N = T_{n_f} \dots T_2 T_1
\end{equation}
%
where the matrix factors $T_i$ are,
%
\begin{equation}
T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ( W_{f_i}
\otimes I_{m_i})
\end{equation}
%
We have defined the following three additional variables $p$, $q$ and
$m$ to denote different partial products of the factors,
%
\begin{eqnarray}
p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
q_i &=& N / p_i \\
m_i &=& N / f_i
\end{eqnarray}
%
Note that the FFT modules $W$ are applied before the permutations $P$,
which makes this a decimation-in-frequency algorithm.
\subsection{Implementation}
%
Now to the implementation of the algorithm. We start with a vector of
data, $z$, as input and want to apply the transform,
%
\begin{eqnarray}
x &=& W_N z \\
&=& T_{n_f} \dots T_2 T_1 z
\end{eqnarray}
%
where $T_i = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) (
W_{f_i} \otimes I_{m_i})$.
The outer structure of the implementation will be a loop over the
$n_f$ factors, applying each matrix $T_i$ to the vector in turn to
build up the complete transform.
%
\begin{algorithm}
\FOR{$(i = 1 \dots n_f)$}
\STATE{$v \Leftarrow T_i v $}
\ENDFOR
\end{algorithm}
%
The order of the factors is not important. Now we examine the iteration
$v \Leftarrow T_i v$, which we'll write as,
%
\begin{equation}
v' =
(P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) ~
( W_{f_i} \otimes I_{m_i}) v
\end{equation}
%
There are two Kronecker product matrices in this iteration. The
rightmost matrix, which is the first to be applied, is a DFT of length
$f_i$ applied to $N/f_i$ subsets of the data. We'll call this $t$,
since it will be a temporary array,
%
\begin{equation}
t = (W_{f_i} \otimes I_{m_i}) v
\end{equation}
%
The second matrix applies a permutation and the exponential
twiddle-factors. We'll call this $v'$, since it is the result of the
full iteration on $v$,
%
\begin{equation}
v' = (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}}) t
\end{equation}
The effect of the matrix $(W_{f_i} \otimes I_{m_i})$ is best seen by
an example. Suppose the factor is $f_i = 3$, and the length of the FFT
is $N=6$, then the relevant Kronecker product is,
%
\begin{equation}
t = (W_3 \otimes I_2) v
\end{equation}
%
which expands out to,
%
\begin{equation}
\left(
\begin{array}{c}
t_0 \\
t_1 \\
t_2 \\
t_3 \\
t_4 \\
t_5
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) & 0 \\
0 & W_3(1,1) & 0 & W_3(1,2) & 0 & W_3(1,3) \\
W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) & 0 \\
0 & W_3(2,1) & 0 & W_3(2,2) & 0 & W_3(2,3) \\
W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3) & 0 \\
0 & W_3(3,1) & 0 & W_3(3,2) & 0 & W_3(3,3)
\end{array}
\right)
\left(
\begin{array}{c}
v_0 \\
v_1 \\
v_2 \\
v_3 \\
v_4 \\
v_5
\end{array}
\right)
\end{equation}
%
We can rearrange the components in a computationally convenient form,
\begin{equation}
\left(
\begin{array}{c}
t_0 \\
t_2 \\
t_4 \\
t_1 \\
t_3 \\
t_5
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
W_3(1,1) & W_3(1,2) & W_3(1,3) & 0 & 0 & 0 \\
W_3(2,1) & W_3(2,2) & W_3(2,3) & 0 & 0 & 0 \\
W_3(3,1) & W_3(3,2) & W_3(3,3) & 0 & 0 & 0 \\
0 & 0 & 0 & W_3(1,1) & W_3(1,2) & W_3(1,3) \\
0 & 0 & 0 & W_3(2,1) & W_3(2,2) & W_3(2,3) \\
0 & 0 & 0 & W_3(3,1) & W_3(3,2) & W_3(3,3)
\end{array}
\right)
\left(
\begin{array}{c}
v_0 \\
v_2 \\
v_4 \\
v_1 \\
v_3 \\
v_5
\end{array}
\right)
\end{equation}
%
which clearly shows that we just need to apply the $3\times 3$ DFT
matrix $W_3$ twice, once to the sub-vector of elements $(v_0, v_2, v_4)$,
and independently to the remaining sub-vector $(v_1, v_3, v_5)$.
In the general case, if we index $t$ as $t_k = t(\lambda,\mu) =
t_{\lambda m + \mu}$ then $\lambda = 0 \dots f-1$ is an index within
each transform of length $f$ and $\mu = 0 \dots m-1$ labels the
independent subsets of data. We can see this by showing the
calculation with all indices present,
%
\begin{equation}
t = (W_f \otimes I_m) z
\end{equation}
%
becomes,
%
\begin{eqnarray}
t_{\lambda m + \mu} &=& \sum_{\lambda'=0}^{f-1} \sum_{\mu'=0}^{m-1}
(W_f \otimes I_m)_{(\lambda m + \mu)(\lambda' m + \mu')}
z_{\lambda'm + \mu'} \\
&=& \sum_{\lambda'\mu'} (W_f)_{\lambda\lambda'} \delta_{\mu\mu'}
z_{\lambda'm+\mu'} \\
&=& \sum_{\lambda'} (W_f)_{\lambda\lambda'} z_{\lambda'm+\mu}
\end{eqnarray}
%
The DFTs on the index $\lambda$ will be computed using
special optimized modules for each $f$.
To calculate the next stage,
%
\begin{equation}
v'=(P^f_q D^f_q \otimes I_{p_{i-1}}) t
\end{equation}
%
we note that the Kronecker product has the property of performing
$p_{i-1}$ independent multiplications of $PD$ on $p_{i-1}$ different
subsets of $t$. The index $\mu$ of $t(\lambda,\mu)$, which runs from 0
to $m-1$, will include $q_i$ copies of each $PD$ operation because
$m=p_{i-1}q$, i.e.\ we can split the index $\mu$ further into $\mu = a
p_{i-1} + b$, where $a = 0 \dots q-1$ and $b=0 \dots p_{i-1}-1$,
%
\begin{eqnarray}
\lambda m + \mu &=& \lambda m + a p_{i-1} + b \\
&=& (\lambda q + a) p_{i-1} + b.
\end{eqnarray}
%
Now we can expand the second stage,
%
\begin{eqnarray}
v'&=& (P^f_q D^f_q \otimes I_{p_{i-1}}) t \\
v'_{\lambda m + \mu} &=& \sum_{\lambda' \mu'}
(P^f_q D^f_q \otimes I_{p_{i-1}})_{(\lambda m + \mu)(\lambda' m + \mu')}
t_{\lambda' m + \mu'} \\
v'_{(\lambda q + a) p_{i-1} + b} &=& \sum_{\lambda' a' b'}
(
P^f_q D^f_q \otimes I_{p_{i-1}}
)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
t_{(\lambda' q + a')p_{i-1} +b'}
\end{eqnarray}
%
The first step in removing redundant indices is to take advantage of
the identity matrix $I$ and separate the subspaces of the Kronecker
product,
%
\begin{equation}
(
P^f_q D^f_q \otimes I_{p_{i-1}}
)_{((\lambda q+ a)p_{i-1} + b)((\lambda' q+ a')p_{i-1} + b')}
=
(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')}
\delta_{bb'}
\end{equation}
%
This eliminates one sum, leaving us with,
%
\begin{equation}
v'_{(\lambda q + a) p_{i-1} + b}
=
\sum_{\lambda' a' }
(P^f_q D^f_q)_{(\lambda q + a)(\lambda' q + a')} t_{(\lambda'q+a')p_{i-1} + b}
\end{equation}
%
We can insert the definition of $D^f_q$ to give,
%
\begin{equation}
\phantom{v'_{(\lambda q + a) p_{i-1} + b}}
= \sum_{\lambda'a'} (P^f_q)_{(\lambda q + a)(\lambda'q + a')}
\omega^{\lambda'a'}_{q_{i-1}} t_{(\lambda'q+a')p_{i-1}+b}
\end{equation}
%
Using the definition of $P^f_q$, which exchanges an index of $\lambda
q + a$ with $a f + \lambda$, we get a final result with no matrix
multiplication,
%
\begin{equation}
v'_{(a f + \lambda) p_{i-1} + b}
= \omega^{\lambda a}_{q_{i-1}} t_{(\lambda q + a)p_{i-1} + b}
\end{equation}
%
All we have to do is premultiply each element of the temporary vector
$t$ by an exponential twiddle factor and store the result in another
index location, according to the digit reversal permutation of $P$.
Here is the algorithm to implement the mixed-radix FFT,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$a = 0 \dots q-1$}
\FOR{$b = 0 \dots p_{i-1} - 1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') v_{b+\lambda'm+ap_{i-1}}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(af+\lambda)p_{i-1}+b}
\Leftarrow \omega^{\lambda a}_{q_{i-1}} t_\lambda$}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
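To make the index manipulations concrete, here is a C sketch of a
single pass $v'=T_i v$ using a naive $O(f^2)$ sub-transform in place of
the optimized small-$N$ modules discussed below. The names and layout
are illustrative, not the GSL routines,
%
\begin{verbatim}
/* One mixed-radix pass v' = T_i v for factor f, where
   N = p_prev * f * q with p_prev = p_{i-1}
   (illustrative sketch, not the GSL source). */
#include <complex.h>
#include <math.h>

void pass_naive (const double complex *v, double complex *vp,
                 int N, int f, int p_prev)
{
  int m = N / f;                  /* m_i = N / f_i */
  int q = m / p_prev;             /* q_i = N / p_i */

  for (int a = 0; a < q; a++)
    for (int b = 0; b < p_prev; b++)
      for (int lam = 0; lam < f; lam++)
        {
          /* DFT module: t = sum over lam' of
             W_f(lam,lam') v[b + lam' m + a p_prev] */
          double complex t = 0.0;
          for (int lamp = 0; lamp < f; lamp++)
            t += cexp (-2 * M_PI * I * lam * lamp / f)
                 * v[b + lamp * m + a * p_prev];
          /* twiddle factor omega_{q_{i-1}}^{lam a}, q_{i-1} = f q */
          vp[(a * f + lam) * p_prev + b]
            = cexp (-2 * M_PI * I * lam * a / (double) (f * q)) * t;
        }
}
\end{verbatim}
%
The full transform then loops over the factors, calling this pass with
$p_{i-1}$ built up as the running product of the factors processed so
far, and swapping the roles of $v$ and $v'$ after each pass.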
%
\subsection{Details of the implementation}
%
First the function {\tt gsl\_fft\_complex\_wavetable\_alloc} allocates
$n$ elements of scratch space (to hold the vector $v'$ for each
iteration) and $n$ elements for a trigonometric lookup table of
twiddle factors.
Then the length $n$ must be factorized. There is a general
factorization function {\tt gsl\_fft\_factorize} which takes a list of
preferred factors. It first extracts the preferred factors and then
factors out the remaining general prime factors.
The algorithm used to generate the trigonometric lookup table is
%
\begin{algorithm}
\FOR {$a = 1 \dots n_f$}
\FOR {$b = 1 \dots f_i - 1$}
\FOR {$c = 1 \dots q_i$}
\STATE $\mbox{trig[k++]} = \exp(- 2\pi i b c p_{a-1}/N)$
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
Note that $\sum_{i=1}^{n_f} \sum_{b=1}^{f_i-1} \sum_{c=1}^{q_i} 1 =
\sum_{i=1}^{n_f} (f_i-1)q_i = n - 1$, so $n$ elements are always
sufficient to store the lookup table. This size is chosen because we need
to compute $\omega_{q_{i-1}}^{\lambda a} t_\lambda$ in
the FFT. In terms of the lookup table we can write this as,
%
\begin{eqnarray}
\omega_{q_{i-1}}^{\lambda a} t_\lambda
&=& \exp(-2\pi i \lambda a/q_{i-1}) t_\lambda \\
&=& \exp(-2\pi i \lambda a p_{i-1}/N) t_\lambda \\
&=& \left\{
\begin{array}{ll}
t_\lambda & a=0 \\
\mbox{trig}[\mbox{twiddle[i]}+\lambda q+(a-1)] t_\lambda & a\not=0
\end{array}
\right.
\end{eqnarray}
%
\begin{sloppypar}
\noindent
The array {\tt twiddle[i]} maintains a set of pointers into {\tt trig}
for the starting points for the outer loop. The core of the
implementation is {\tt gsl\_fft\_complex}. This function loops over
the chosen factors of $N$, computing the iteration $v'=T_i v$ for each
pass. When the DFT for a factor is implemented the iteration is
handed-off to a dedicated small-$N$ module, such as {\tt
gsl\_fft\_complex\_pass\_3} or {\tt
gsl\_fft\_complex\_pass\_5}. Unimplemented factors are handled
by the general-$N$ routine {\tt gsl\_fft\_complex\_pass\_n}. The
structure of one of the small-$N$ modules is a simple transcription of
the basic algorithm given above. Here is an example for {\tt
gsl\_fft\_complex\_pass\_3}. For a pass with a factor of 3 we have to
calculate the following expression,
\end{sloppypar}%
\begin{equation}
v'_{(a f + \lambda) p_{i-1} + b}
=
\sum_{\lambda' = 0,1,2}
\omega^{\lambda a}_{q_{i-1}} W^{\lambda \lambda'}_{3}
v_{b + \lambda' m + a p_{i-1}}
\end{equation}
%
for $b = 0 \dots p_{i-1} - 1$, $a = 0 \dots q_{i} - 1$ and $\lambda =
0, 1, 2$. This is implemented as,
%
\begin{algorithm}
\FOR {$a = 0 \dots q-1$}
\FOR {$b = 0 \dots p_{i-1} - 1$}
\STATE {$
\left(
\begin{array}{c}
t_0 \\ t_1 \\ t_2
\end{array}
\right)
=
\left(
\begin{array}{ccc}
W^{0}_3 & W^{0}_3 & W^{0}_3 \\
W^{0}_3 & W^{1}_3 & W^{2}_3 \\
W^{0}_3 & W^{2}_3 & W^{4}_3
\end{array}
\right)
\left(
\begin{array}{l}
v_{b + a p_{i-1}} \\
v_{b + a p_{i-1} + m} \\
v_{b + a p_{i-1} +2m}
\end{array}
\right)$}
\STATE {$ v'_{a p_{i} + b} = t_0$}
\STATE {$ v'_{a p_{i} + b + p_{i-1}} = \omega^{a}_{q_{i-1}} t_1$}
\STATE {$ v'_{a p_{i} + b + 2 p_{i-1}} = \omega^{2a}_{q_{i-1}} t_2$}
\ENDFOR
\ENDFOR
\end{algorithm}
%
In the code we use the variables {\tt from0}, {\tt from1}, {\tt from2}
to index the input locations,
%
\begin{eqnarray}
\mbox{\tt from0} &=& b + a p_{i-1} \\
\mbox{\tt from1} &=& b + a p_{i-1} + m \\
\mbox{\tt from2} &=& b + a p_{i-1} + 2m
\end{eqnarray}
%
and the variables {\tt to0}, {\tt to1}, {\tt to2} to index the output
locations in the scratch vector $v'$,
%
\begin{eqnarray}
\mbox{\tt to0} &=& b + a p_{i} \\
\mbox{\tt to1} &=& b + a p_{i} + p_{i-1} \\
\mbox{\tt to2} &=& b + a p_{i} + 2 p_{i-1}
\end{eqnarray}
%
The DFT matrix multiplication is computed using the optimized
sub-transform modules given in the next section. The twiddle factors
$\omega^a_{q_{i-1}}$ are taken out of the {\tt trig} array.
To compute the inverse transform we go back to the definition of the
Fourier transform and note that the inverse matrix is just the complex
conjugate of the forward matrix (with a factor of $1/N$),
%
\begin{equation}
W^{-1}_N = W^*_N / N
\end{equation}
%
Therefore we can easily compute the inverse transform by conjugating
all the complex elements of the DFT matrices and twiddle factors that
we apply. (An alternative strategy is to conjugate the input data,
take a forward transform, and then conjugate the output data).
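The alternative strategy is particularly easy to express in code; for
example (a sketch, reusing the {\tt fft\_radix2\_dit} routine sketched
earlier),
%
\begin{verbatim}
/* Inverse FFT via conjugation: z <- conj(FFT(conj(z))) / N
   (illustrative sketch). */
#include <complex.h>

void fft_radix2_dit (double complex *g, int N);  /* forward FFT */

void inverse_fft (double complex *z, int N)
{
  for (int i = 0; i < N; i++) z[i] = conj (z[i]);
  fft_radix2_dit (z, N);          /* any forward transform will do */
  for (int i = 0; i < N; i++) z[i] = conj (z[i]) / N;
}
\end{verbatim}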
\section{Fast Sub-transform Modules}
%
To implement the mixed-radix FFT we still need to compute the
small-$N$ DFTs for each factor. Fortunately many highly-optimized
small-$N$ modules are available, following the work of Winograd who
showed how to derive efficient small-$N$ sub-transforms by number
theoretic techniques.
The algorithms in this section all compute,
%
\begin{equation}
x_a = \sum_{b=0}^{N-1} W_N^{ab} z_b
\end{equation}
%
The sub-transforms given here are the ones recommended by Temperton
and differ slightly from the canonical Winograd modules. According to
Temperton~\cite{temperton83} they are slightly more robust against
rounding errors and trade off some additions for multiplications.
%
For the $N=2$ DFT,
%
\begin{equation}
\begin{array}{ll}
x_0 = z_0 + z_1, &
x_1 = z_0 - z_1.
\end{array}
\end{equation}
%
For the $N=3$ DFT,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_1 + z_2, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_1 - z_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = z_0 + t_1, &
x_1 = t_2 + i t_3, &
x_2 = t_2 - i t_3.
\end{array}
\end{equation}
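Written out in C, the $N=3$ module is only a few lines. This is a
sketch of the arithmetic in isolation; in the library essentially the
same operations are embedded in the mixed-radix pass described earlier,
%
\begin{verbatim}
/* N = 3 sub-transform module (illustrative sketch). */
#include <complex.h>
#include <math.h>

void dft3 (const double complex z[3], double complex x[3])
{
  double complex t1 = z[1] + z[2];
  double complex t2 = z[0] - t1 / 2.0;
  double complex t3 = sin (M_PI / 3.0) * (z[1] - z[2]);
  x[0] = z[0] + t1;
  x[1] = t2 + I * t3;
  x[2] = t2 - I * t3;
}
\end{verbatim}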
%
The $N=4$ transform involves only additions and subtractions,
%
\begin{equation}
\begin{array}{llll}
t_1 = z_0 + z_2, &
t_2 = z_1 + z_3, &
t_3 = z_0 - z_2, &
t_4 = z_1 - z_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_0 = t_1 + t_2, &
x_1 = t_3 + i t_4, &
x_2 = t_1 - t_2, &
x_3 = t_3 - i t_4.
\end{array}
\end{equation}
%
For the $N=5$ DFT,
%
\begin{equation}
\begin{array}{llll}
t_1 = z_1 + z_4, &
t_2 = z_2 + z_3, &
t_3 = z_1 - z_4, &
t_4 = z_2 - z_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_5 = t_1 + t_2, &
t_6 = (\sqrt{5}/4) (t_1 - t_2), &
t_7 = z_0 - t_5/4, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_8 = t_7 + t_6, &
t_9 = t_7 - t_6, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
t_{10} = \sin(2\pi/5) t_3 + \sin(2\pi/10) t_4, &
t_{11} = \sin(2\pi/10) t_3 - \sin(2\pi/5) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_0 = z_0 + t_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_1 = t_8 + i t_{10}, &
x_2 = t_9 + i t_{11},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{llll}
x_3 = t_9 - i t_{11}, &
x_4 = t_8 - i t_{10}.
\end{array}
\end{equation}
%
The DFT matrix for $N=6$ can be written as a combination of $N=3$ and
$N=2$ transforms with index permutations,
%
\begin{equation}
\left(
\begin{array}{c}
x_0 \\
x_4 \\
x_2 \\
\hline x_3 \\
x_1 \\
x_5
\end{array}
\right)
=
\left(
\begin{array}{ccc|ccc}
& & & & & \\
&W_3& & &W_3& \\
& & & & & \\
\hline & & & & & \\
&W_3& & &-W_3& \\
& & & & &
\end{array}
\right)
\left(
\begin{array}{c}
z_0 \\
z_2 \\
z_4 \\
\hline z_3 \\
z_5 \\
z_1
\end{array}
\right)
\end{equation}
%
This simplification is an example of the Prime Factor Algorithm, which
can be used because the factors 2 and 3 are mutually prime. For more
details consult one of the books on number theory for
FFTs~\cite{elliott82,blahut}. We can take advantage of the simple
indexing scheme of the PFA to write the $N=6$ DFT as,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_2 + z_4, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_2 - z_4),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_4 = z_5 + z_1, &
t_5 = z_3 - t_4/2, &
t_6 = \sin(\pi/3) (z_5 - z_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_7 = z_0 + t_1, &
t_8 = t_2 + i t_3, &
t_9 = t_2 - i t_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
t_{10} = z_3 + t_4, &
t_{11} = t_5 + i t_6, &
t_{12} = t_5 - i t_6,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = t_7 + t_{10}, &
x_4 = t_8 + t_{11}, &
x_2 = t_9 + t_{12},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_3 = t_7 - t_{10}, &
x_1 = t_8 - t_{11}, &
x_5 = t_9 - t_{12}.
\end{array}
\end{equation}
For any remaining general factors we use Singleton's efficient method
for computing a DFT~\cite{singleton}. Although it is an $O(N^2)$
algorithm it does reduce the number of multiplications by a factor of
4 compared with a naive evaluation of the DFT. If we look at the
general structure of a DFT matrix, shown schematically below,
%
\begin{equation}
\left(
\begin{array}{c}
h_0 \\
h_1 \\
h_2 \\
\vdots \\
h_{N-2} \\
h_{N-1}
\end{array}
\right)
=
\left(
\begin{array}{cccccc}
1 & 1 & 1 & \cdots & 1 & 1 \\
1 & W & W & \cdots & W & W \\
1 & W & W & \cdots & W & W \\
\vdots & \vdots & \vdots & \cdots & \vdots & \vdots \\
1 & W & W & \cdots & W & W \\
1 & W & W & \cdots & W & W
\end{array}
\right)
\left(
\begin{array}{c}
g_0 \\
g_1 \\
g_2 \\
\vdots \\
g_{N-2} \\
g_{N-1}
\end{array}
\right)
\end{equation}
%
we see that the outer elements of the DFT matrix are all unity. We can
remove these trivial multiplications but we will still be left with an
$(N-1) \times (N-1)$ sub-matrix of complex entries, which would appear
to require $(N-1)^2$ complex multiplications. Singleton's method
uses symmetries of the DFT matrix to convert the complex
multiplications to an equivalent number of real multiplications. We
start with the definition of the DFT in component form,
%
\begin{equation}
a_k + i b_k = \sum_{j=0}^{f-1} (x_j+iy_j)(\cos(2\pi jk/f) - i\sin(2\pi jk/f))
\end{equation}
%
The zeroth component can be computed using only additions,
%
\begin{equation}
a_0 + i b_0 = \sum_{j=0}^{f-1} (x_j + i y_j)
\end{equation}
%
We can rewrite the remaining components as,
%
\begin{eqnarray}
a_k + i b_k & = x_0 + i y_0 & +
\sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f)
+ (y_j - y_{f-j}) \sin(2\pi jk/f) \\
& & + i\sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f)
- (x_j - x_{f-j}) \sin(2\pi jk/f)
\end{eqnarray}
%
by using the following trigonometric identities,
%
\begin{eqnarray}
\cos(2\pi(f-j)k/f) &=& \phantom{-}\cos(2\pi jk/f) \\
\sin(2\pi(f-j)k/f) &=& -\sin(2\pi jk/f)
\end{eqnarray}
%
These remaining components can all be computed using four partial
sums,
%
\begin{eqnarray}
a_k + i b_k & = & (a^+_k - a^-_k) + i (b^+_k + b^-_k) \\
a_{f-k} + i b_{f-k} & = & (a^+_k + a^-_k) + i (b^+_k - b^-_k)
\end{eqnarray}
%
for $k = 1, 2, \dots, (f-1)/2$, where,
%
\begin{eqnarray}
a^+_k &=& x_0 + \sum_{j=1}^{(f-1)/2} (x_j + x_{f-j}) \cos(2\pi jk/f) \\
a^-_k &=& \phantom{x_0} - \sum_{j=1}^{(f-1)/2} (y_j - y_{f-j}) \sin(2\pi jk/f) \\
b^+_k &=& y_0 + \sum_{j=1}^{(f-1)/2} (y_j + y_{f-j}) \cos(2\pi jk/f) \\
b^-_k &=& \phantom{y_0} - \sum_{j=1}^{(f-1)/2} (x_j - x_{f-j}) \sin(2\pi jk/f)
\end{eqnarray}
%
Note that the higher components $k'=f-k$ can be obtained directly
without further computation once $a^+$, $a^-$, $b^+$ and $b^-$ are
known. There are $4 \times (f-1)/2$ different sums, each involving
$(f-1)/2$ real multiplications, giving a total of $(f-1)^2$ real
multiplications instead of $(f-1)^2$ complex multiplications.
To implement Singleton's method we make use of the input and output
vectors $v$ and $v'$ as scratch space, copying data back and forth
between them to obtain the final result. First we use $v'$ to store
the terms of the symmetrized and anti-symmetrized vectors of the form
$x_j + x_{f-j}$ and $x_j - y_{f-j}$. Then we multiply these by the
appropriate trigonometric factors to compute the partial sums $a^+$,
$a^-$, $b^+$ and $b^-$, storing the results $a_k + i b_k$ and $a_{f-k}
+ i b_{f-k}$ back in $v$. Finally we multiply the DFT output by any
necessary twiddle factors and place the results in $v'$.
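As an illustration, here is a sketch in C of Singleton's method for a
single odd factor $f$. For clarity it computes the partial sums
$a^{\pm}_k$ and $b^{\pm}_k$ directly, rather than staging them through
the scratch vectors $v$ and $v'$; the function name is ours and we
assume the POSIX constant {\tt M\_PI},
%
\begin{verbatim}
#include <math.h>

/* DFT of f complex points x[j] + i y[j] by Singleton's method,
   for odd f.  Results in (xr, yr).  Uses (f-1)^2 real multiplies. */
static void dft_singleton(const double x[], const double y[],
                          double xr[], double yr[], int f)
{
  double a0 = 0.0, b0 = 0.0;
  for (int j = 0; j < f; j++) { a0 += x[j]; b0 += y[j]; }
  xr[0] = a0; yr[0] = b0;                   /* additions only */

  for (int k = 1; k <= (f - 1) / 2; k++) {
    double ap = x[0], am = 0.0;             /* a+_k and a-_k */
    double bp = y[0], bm = 0.0;             /* b+_k and b-_k */
    for (int j = 1; j <= (f - 1) / 2; j++) {
      double c = cos(2 * M_PI * j * k / f);
      double s = sin(2 * M_PI * j * k / f);
      ap += (x[j] + x[f - j]) * c;
      am -= (y[j] - y[f - j]) * s;
      bp += (y[j] + y[f - j]) * c;
      bm -= (x[j] - x[f - j]) * s;
    }
    xr[k] = ap - am;      yr[k] = bp + bm;      /* term k   */
    xr[f - k] = ap + am;  yr[f - k] = bp - bm;  /* term f-k */
  }
}
\end{verbatim}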
\section{FFTs for real data}
%
This section is based on the articles {\em Fast Mixed-Radix Real
Fourier Transforms} by Clive Temperton~\cite{temperton83real} and
{\em Real-Valued Fast Fourier Transform Algorithms} by Sorensen,
Jones, Heideman and Burrus~\cite{burrus87real}. The DFT of a real
sequence has a special symmetry, called a {\em conjugate-complex} or
{\em half-complex} symmetry,
%
\begin{equation}
h(a) = h(N-a)^*
\end{equation}
%
The element $h(0)$ is real, and when $N$ is even $h(N/2)$ is also
real. It is straightforward to prove the symmetry,
%
\begin{eqnarray}
h(a) &=& \sum W^{ab}_N g(b) \\
h(N-a)^* &=& \sum W^{-(N-a)b}_N g(b)^* \\
&=& \sum W^{-Nb}_N W^{ab}_N g(b) \qquad{(W^N_N=1)} \\
&=& \sum W^{ab}_N g(b)
\end{eqnarray}
%
Real-valued data is very common in practice (perhaps more common than
complex data) so it is worth having efficient FFT routines for real
data. In principle an FFT for real data should need half the
operations of an FFT on the equivalent complex data (where the
imaginary parts are set to zero). There are two different strategies
for computing FFTs of real-valued data:
One strategy is to ``pack'' the real data (of length $N$) into a
complex array (of length $N/2$) by index transformations. A complex
FFT routine can then be used to compute the transform of that array.
By further index transformations the result can then be
``unpacked'' to the FFT of the original real data. It is also possible
to do two real FFTs simultaneously by packing one in the real part and
the other in the imaginary part of the complex array. These
techniques have some disadvantages. The packing and unpacking
procedures always add $O(N)$ operations, and packing a real array of
length $N$ into a complex array of length $N/2$ is only possible if
$N$ is even. In addition, if two unconnected datasets with very
different magnitudes are packed together in the same FFT there could
be ``cross-talk'' between them due to a loss of precision.
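For example, the packing of two real sequences into a single complex
FFT might be sketched in C as follows, where {\tt complex\_fft} stands
for any in-place complex FFT routine (a hypothetical name, not a real
library call),
%
\begin{verbatim}
#include <complex.h>

extern void complex_fft(double complex z[], int N); /* hypothetical */

/* FFTs of two real sequences a and b of length N using one complex
   FFT, unpacked with A_k = (Z_k + conj(Z_{N-k}))/2 and
   B_k = (Z_k - conj(Z_{N-k}))/(2i). */
void two_real_ffts(const double a[], const double b[],
                   double complex A[], double complex B[],
                   double complex z[], int N)
{
  for (int j = 0; j < N; j++)
    z[j] = a[j] + I * b[j];          /* pack: z = a + i b */
  complex_fft(z, N);
  for (int k = 0; k < N; k++) {
    double complex zc = conj(z[(N - k) % N]);
    A[k] = (z[k] + zc) / 2.0;        /* transform of a */
    B[k] = (z[k] - zc) / (2.0 * I);  /* transform of b */
  }
}
\end{verbatim}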
A more straightforward strategy is to start with an FFT algorithm,
such as the complex mixed-radix algorithm, and prune out all the
operations involving the zero imaginary parts of the initial data. The
FFT is linear so the imaginary part of the data can be decoupled from
the real part. This procedure leads to a dedicated FFT for real-valued
data which works for any length and does not perform any unnecessary
operations. It also allows us to derive a corresponding inverse FFT
routine which transforms a half-complex sequence back into real data.
\subsection{Radix-2 FFTs for real data}
%
Before embarking on the full mixed-radix real FFT we'll start with the
radix-2 case. It contains all the essential features of the
general-$N$ algorithm. To make it easier to see the analogy between
the two we will use the mixed-radix notation to describe the
factors. The factors are all 2,
%
\begin{equation}
f_1 = 2, f_2 = 2, \dots, f_{n_f} = 2
\end{equation}
%
and the products $p_i$ are powers of 2,
%
\begin{eqnarray}
p_0 & = & 1 \\
p_1 & = & f_1 = 2 \\
p_2 & = & f_1 f_2 = 4 \\
\dots &=& \dots \\
p_i & = & f_1 f_2 \dots f_i = 2^i
\end{eqnarray}
%
Using this notation we can rewrite the radix-2 decimation-in-time
algorithm as,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
\FOR {$a = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{c}
g(b p_i + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
g(b p_i + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
where we have used $p_i = 2 \Delta$, and factored $2 \Delta$ out of
the original definition of $b$ ($b \to b p_i$).
If we go back to the original recurrence relations we can see how to
write the intermediate results in a way which makes the
real/half-complex symmetries explicit at each step. The first pass is
just a set of length-2 FFTs on real values,
%
\begin{equation}
g_1([b_0 b_1 b_2 a_0]) = \sum_{b_3} W^{a_0 b_3}_2 g([b_0 b_1 b_2 b_3])
\end{equation}
%
Using the symmetry $FFT(x)_k = FFT(x)^*_{N-k}$ we have the reality
condition,
%
\begin{eqnarray}
g_1([b_0 b_1 b_2 0]) &=& \mbox{real} \\
g_1([b_0 b_1 b_2 1]) &=& \mbox{real'}
\end{eqnarray}
%
In the next pass we have a set of length-4 FFTs on the original data,
%
\begin{eqnarray}
g_2([b_0 b_1 a_1 a_0])
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0]b_2}_4 W^{a_0 b_3}_2
g([b_0 b_1 b_2 b_3]) \\
&=&
\sum_{b_2}\sum_{b_3}
W^{[a_1 a_0][b_3 b_2]}_4
g([b_0 b_1 b_2 b_3])
\end{eqnarray}
%
This time symmetry gives us the following conditions on the
transformed data,
%
\begin{eqnarray}
g_2([b_0 b_1 0 0]) &=& \mbox{real} \\
g_2([b_0 b_1 0 1]) &=& x + i y \\
g_2([b_0 b_1 1 0]) &=& \mbox{real'} \\
g_2([b_0 b_1 1 1]) &=& x - i y
\end{eqnarray}
%
We can see a pattern emerging here: the $i$-th pass computes a set of
independent length-$2^i$ FFTs on the original real data,
%
\begin{eqnarray}
g_i ( b p_i + a ) = \sum_{a' = 0}^{p_i-1} W_{p_i}^{aa'} g(b p_i + a')
\quad
\mbox{for $b = 0 \dots q_i - 1$}
\end{eqnarray}
%
As a consequence we can apply the symmetry for an FFT of real data
to all the intermediate results -- not just the final result.
In general after the $i$-th pass we will
have the symmetry,
%
\begin{eqnarray}
g_i(b p_i) &=& \mbox{real} \\
g_i(b p_i + a) &=& g_i(b p_i + p_i - a)^* \qquad a = 1 \dots p_{i}/2 - 1 \\
g_i(b p_i + p_{i}/2) &=& \mbox{real'}
\end{eqnarray}
%
In the next section we'll show that this is a general property of
decimation-in-time algorithms. The same is not true for the
decimation-in-frequency algorithm, which does not have any simple
symmetries in the intermediate results.
Since we can obtain the values of $g_i(b p_i + a)$ for $a > p_i/2$
from the values for $a < p_i/2$ we can cut our computation and
storage in half compared with the full-complex case.
%
We can easily rewrite the algorithm to show how the computation can be
halved, simply by limiting all terms to involve only values for $a
\leq p_{i}/2$. Whenever we encounter a term $g_i(b p_i + a)$ with $a >
p_{i}/2$ we rewrite it in terms of its complex symmetry partner,
$g_i(b p_i + a')^*$, where $a' = p_i - a$. The butterfly computes two
values for each value of $a$, $b p_i + a$ and $b p_i + p_{i-1} - a$,
so we actually only need to compute from $a = 0$ to $p_{i-1}/2$. This
gives the following algorithm,
%
\begin{algorithm}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} - a)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(b p_{i} + a) + W^a_{p_i} g(b p_i + p_{i-1} + a) \\
g(b p_{i} + a) - W^a_{p_i} g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\end{algorithm}
%
Although we have halved the number of operations we also need a
storage arrangement which will halve the memory requirement. The
algorithm above is still formulated in terms of a complex array $g$,
but the input to our routine will naturally be an array of $N$ real
values which we want to use in-place.
Therefore we need a storage scheme which lays out the real and
imaginary parts within the real array, in a natural way so that there
is no need for complicated index calculations. In the radix-2
algorithm we do not have any additional scratch space. The storage
scheme has to be designed to accommodate the in-place calculation
taking account of dual node pairs.
Here is a scheme which takes these restrictions into account: On the
$i$-th pass we store the real part of $g(b p_i + a)$ in location $b
p_i + a$. We store the imaginary part in location $b p_i + p_i -
a$. This is the redundant location which corresponds to the conjugate
term $g(b p_i + a)^* = g(b p_i + p_i -a)$, so it is not needed. When
the results are purely real (as in the cases $a = 0$ and $a = p_i/2$)
we store only the real part and drop the zero imaginary part.
This storage scheme has to work in-place, because the radix-2 routines
should not use any scratch space. We will now check the in-place
property for each butterfly. A crucial point is that the scheme is
pass-dependent. Namely, when we are computing the result for pass $i$
we are reading the results of pass $i-1$, and we must access them
using the scheme from the previous pass, i.e. we have to remember that
the results from the previous pass were stored using $b p_{i-1} + a$,
not $b p_i + a$, and the symmetry for these results will be $g_{i-1}(b
p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1} - a)^*$. To take this into
account we'll write the right hand side of the iteration, $g_{i-1}$,
in terms of $p_{i-1}$. For example, instead of $b p_i$, which occurs
naturally in $g_i(b p_i + a)$ we will use $2 b p_{i-1}$, since $p_i =
2 p_{i-1}$.
Let's start with the butterfly for $a = 0$,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1}) + g((2 b + 1) p_{i-1}) \\
g(2 b p_{i-1}) - g((2 b + 1) p_{i-1})
\end{array}
\right)
\end{equation}
%
By the symmetry $g_{i-1}(b p_{i-1} + a) = g_{i-1}(b p_{i-1} + p_{i-1}
- a)^*$ all the inputs are purely real. The input $g(2 b p_{i-1})$ is
read from location $2 b p_{i-1}$ and $g((2 b + 1) p_{i-1})$ is read
from the location $(2 b + 1) p_{i-1}$. Here is the full breakdown,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1})$
& real part & $2 b p_{i-1} $ &$= b p_i$ \\
& imag part & --- & \\
$g((2 b+1) p_{i-1})$
& real part & $(2 b + 1) p_{i-1} $&$= b p_i + p_{i-1} $ \\
& imag part & --- & \\
\hline
$g(b p_{i})$
& real part & $b p_i$ &\\
& imag part & --- & \\
$g(b p_{i} + p_{i-1})$
& real part & $b p_i + p_{i-1}$& \\
& imag part & --- & \\
\hline
\end{tabular}
\end{center}
%
The conjugation of the output term $g(b p_i + p_{i-1})^*$ is
irrelevant here since the results are purely real. The real results
are stored in locations $b p_i$ and $b p_i + p_{i-1}$, which
overwrites the inputs in-place.
The general butterfly for $a = 1 \dots p_{i-1}/2 - 1$ is,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} - a)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) + W^a_{p_i} g((2 b + 1) p_{i-1} + a) \\
g(2 b p_{i-1} + a) - W^a_{p_i} g((2 b + 1) p_{i-1} + a)
\end{array}
\right)
\end{equation}
%
All the terms are complex. To store a conjugated term like $g(b' p_i +
a')^*$ where $a' < p_i/2$ we take the real part and store it in
location $b' p_i + a'$, then take the imaginary part, negate it, and
store the result in location $b' p_i + p_i - a'$.
Here is the full breakdown of the inputs and outputs from the
butterfly,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1} + a)$
& real part & $2 b p_{i-1} + a $ &$= b p_i + a$ \\
& imag part & $2 b p_{i-1} + p_{i-1} - a$ &$= b p_i + p_{i-1} - a$ \\
$g((2 b+1) p_{i-1} + a)$
& real part & $(2 b+1) p_{i-1} + a $&$= b p_i + p_{i-1} + a $ \\
& imag part & $(2 b+1) p_{i-1} + p_{i-1} - a $&$= b p_i + p_i - a$\\
\hline
$g(b p_{i} + a)$
& real part & $b p_i + a$ &\\
& imag part & $b p_i + p_i - a$& \\
$g(b p_{i} + p_{i-1} - a)$
& real part & $b p_i + p_{i-1} - a$& \\
& imag part & $b p_i + p_{i-1} + a$& \\
\hline
\end{tabular}
\end{center}
%
By comparing the input locations and output locations we can see
that the calculation is done in place.
The final butterfly for $a = p_{i-1}/2$ is,
%
\begin{equation}
\left(
\begin{array}{c}
g(b p_i + p_{i-1}/2) \\
g(b p_i + p_{i-1} - p_{i-1}/2)^*
\end{array}
\right)
=
\left(
\begin{array}{c}
g(2 b p_{i-1} + p_{i-1}/2) - i g((2 b + 1) p_{i-1} + p_{i-1}/2) \\
g(2 b p_{i-1} + p_{i-1}/2) + i g((2 b + 1) p_{i-1} + p_{i-1}/2)
\end{array}
\right)
\end{equation}
%
where we have substituted for the twiddle factor, $W^a_{p_i} = -i$,
%
\begin{eqnarray}
W^{p_{i-1}/2}_{p_i} &=& \exp(-2\pi i (p_{i-1}/2)/p_i) \\
&=& \exp(-2\pi i /4) \\
&=& -i
\end{eqnarray}
%
For this butterfly the second line is just the conjugate of the first,
because $p_{i-1} - p_{i-1}/2 = p_{i-1}/2$. Therefore we only need to
consider the first line. The breakdown of the inputs and outputs is,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(2 b p_{i-1} + p_{i-1}/2)$
& real part & $2 b p_{i-1} + p_{i-1}/2 $ &$= b p_i + p_{i-1}/2$ \\
& imag part & --- & \\
$g((2 b + 1) p_{i-1} + p_{i-1}/2)$
& real part & $(2 b + 1) p_{i-1} + p_{i-1}/2 $&$= b p_i + p_{i} - p_{i-1}/2 $ \\
& imag part & --- & \\
\hline
$g(b p_{i} + p_{i-1}/2)$
& real part & $b p_i + p_{i-1}/2$ &\\
& imag part & $b p_i + p_i - p_{i-1}/2$& \\
\hline
\end{tabular}
\end{center}
%
By comparing the locations of the inputs and outputs with the
operations in the butterfly we find that this computation is very
simple: the effect of the butterfly is to negate the location $b p_i +
p_i - p_{i-1}/2$ and leave other locations unchanged. This is clearly
an in-place operation.
Here is the radix-2 algorithm for real data, in full, with the cases
of $a=0$, $a=1\dots p_{i-1}/2 - 1$ and $a = p_{i-1}/2$ in separate
blocks,
%
\begin{algorithm}
\STATE bit-reverse ordering of $g$
\FOR {$i = 1 \dots n$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})
\end{array}
\right)
\Leftarrow
\left(
\begin{array}{c}
g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
g(b p_{i}) - g(b p_{i} + p_{i-1})
\end{array}
\right)$}
\ENDFOR
\FOR {$a = 1 \dots p_{i-1}/2 - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$(\Real z_0, \Imag z_0) \Leftarrow
(g(b p_i + a), g(b p_i + p_{i-1} - a))$}
\STATE{$(\Real z_1, \Imag z_1) \Leftarrow
(g(b p_i + p_{i-1} + a), g(b p_i + p_{i} - a))$}
\STATE{$t_0 \Leftarrow z_0 + W^a_{p_i} z_1$}
\STATE{$t_1 \Leftarrow z_0 - W^a_{p_i} z_1$}
\STATE{$(g(b p_{i} + a), g(b p_{i} + p_i - a)) \Leftarrow
(\Real t_0, \Imag t_0)$}
\STATE{$(g(b p_{i} + p_{i-1} - a), g(b p_{i} + p_{i-1} + a))
\Leftarrow
(\Real t_1, -\Imag t_1)$}
\ENDFOR
\ENDFOR
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$g(b p_{i} + p_{i} - p_{i-1}/2) \Leftarrow -g(b p_{i} + p_{i} - p_{i-1}/2)$}
\ENDFOR
\ENDFOR
\end{algorithm}
%
We split the loop over $a$ into three parts, $a=0$, $a=1\dots
p_{i-1}/2-1$ and $a = p_{i-1}/2$ for efficiency. When $a=0$ we have
$W^a_{p_i}=1$ so we can eliminate a complex multiplication within the
loop over $b$. When $a=p_{i-1}/2$ we have $W^a_{p_i} = -i$ which does
not require a full complex multiplication either.
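The three cases translate directly into C. The following sketch is our
own transcription, not the GSL routine itself; it assumes $N$ is a
power of two and that the caller has already applied the bit-reversal,
%
\begin{verbatim}
#include <math.h>

/* In-place radix-2 FFT of N real values, N a power of 2, using the
   storage scheme described in the text.  The array g must already
   be in bit-reversed order. */
void real_radix2_fft(double g[], int N)
{
  for (int p = 2, p1 = 1; p <= N; p1 = p, p *= 2) {
    int q = N / p;
    for (int b = 0; b < q; b++) {          /* a = 0: all real */
      double t0 = g[b*p] + g[b*p + p1];
      double t1 = g[b*p] - g[b*p + p1];
      g[b*p] = t0;
      g[b*p + p1] = t1;
    }
    for (int a = 1; a < p1 / 2; a++) {     /* general butterflies */
      double wr = cos(2 * M_PI * a / p);   /* W^a_p */
      double wi = -sin(2 * M_PI * a / p);
      for (int b = 0; b < q; b++) {
        double z0r = g[b*p + a],      z0i = g[b*p + p1 - a];
        double z1r = g[b*p + p1 + a], z1i = g[b*p + p - a];
        double wz1r = wr * z1r - wi * z1i; /* W^a_p z_1 */
        double wz1i = wr * z1i + wi * z1r;
        g[b*p + a]      = z0r + wz1r;      /* Re t_0 */
        g[b*p + p - a]  = z0i + wz1i;      /* Im t_0 */
        g[b*p + p1 - a] = z0r - wz1r;      /* Re t_1 */
        g[b*p + p1 + a] = -(z0i - wz1i);   /* -Im t_1 */
      }
    }
    if (p1 >= 2)                           /* a = p1/2: W = -i */
      for (int b = 0; b < q; b++)
        g[b*p + p - p1/2] = -g[b*p + p - p1/2];
  }
}
\end{verbatim}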
\subsubsection{Calculating the Inverse}
%
The inverse FFT of complex data was easy to calculate, simply by
taking the complex conjugate of the DFT matrix. The input data and
output data were both complex and did not have any special
symmetry. For real data the inverse FFT is more complicated because
the half-complex symmetry of the transformed data is
different from the purely real input data.
We can compute an inverse by stepping backwards through the forward
transform. To simplify the inversion it's convenient to write the
forward algorithm with the butterfly in matrix form,
%
\begin{algorithm}
\FOR {$i = 1 \dots n$}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) \\
g((2 b + 1) p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
To invert the algorithm we run the iterations backwards and invert the
matrix multiplication in the innermost loop,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
\FOR {$a = 0 \dots p_{i-1}/2$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$
\left(
\begin{array}{c}
g(2 b p_{i-1} + a) \\
g((2 b + 1) p_{i-1} + a)
\end{array}
\right)
=
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)^{-1}
\left(
\begin{array}{c}
g(b p_i + a) \\
g(b p_i + p_{i-1} + a)
\end{array}
\right) $}
\ENDFOR
\ENDFOR
\ENDFOR
\end{algorithm}
%
There is no need to reverse the loops over $a$ and $b$ because the
result is independent of their order. The inverse of the matrix that
appears is,
%
\begin{equation}
\left(
\begin{array}{cc}
1 & W^a_{p_{i}} \\
1 & -W^a_{p_{i}}
\end{array}
\right)^{-1}
=
{1 \over 2}
\left(
\begin{array}{cc}
1 & 1 \\
W^{-a}_{p_{i}} & -W^{-a}_{p_{i}}
\end{array}
\right)
\end{equation}
%
To save divisions we remove the factor of $1/2$ inside the loop. This
computes the unnormalized inverse, and the normalized inverse can be
retrieved by dividing the final result by $N = 2^n$.
Here is the radix-2 half-complex to real inverse FFT algorithm, taking
into account the radix-2 storage scheme,
%
\begin{algorithm}
\FOR {$i = n \dots 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$\left(
\begin{array}{c}
g(b p_i) \\
g(b p_i + p_{i-1})
\end{array}
\right)
\Leftarrow
\left(
\begin{array}{c}
g(b p_{i}) + g(b p_{i} + p_{i-1}) \\
g(b p_{i}) - g(b p_{i} + p_{i-1})
\end{array}
\right)$}
\ENDFOR
\FOR {$a = 1 \dots p_{i-1}/2 - 1$}
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$(\Real z_0, \Imag z_0)
\Leftarrow
(g(b p_i + a), g(b p_i + p_{i} - a))$}
\STATE{$(\Real z_1, \Imag z_1)
\Leftarrow
(g(b p_i + p_{i-1} - a), -g(b p_i + p_{i-1} + a))$}
\STATE{$t_0 \Leftarrow z_0 + z_1$}
\STATE{$t_1 \Leftarrow z_0 - z_1$}
\STATE{$(g(b p_{i} + a), g(b p_{i} + p_{i-1} - a))
\Leftarrow
(\Real t_0, \Imag t_0) $}
\STATE{$(g(b p_{i} + p_{i-1} + a),g(b p_{i} + p_{i} - a))
\Leftarrow
(\Real(W^{-a}_{p_i}t_1), \Imag(W^{-a}_{p_i}t_1))$}
\ENDFOR
\ENDFOR
\FOR{$b = 0 \dots q_i - 1$}
\STATE{$g(b p_{i} + p_{i-1}/2) \Leftarrow 2 g(b p_{i} + p_{i-1}/2)$}
\STATE{$g(b p_{i} + p_{i-1} + p_{i-1}/2) \Leftarrow -2 g(b p_{i} + p_{i-1} + p_{i-1}/2)$}
\ENDFOR
\ENDFOR
\STATE bit-reverse ordering of $g$
\end{algorithm}
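In the same spirit, here is a sketch in C of the unnormalized inverse
(again our own transcription; divide the output by $N$ to obtain the
normalized inverse),
%
\begin{verbatim}
#include <math.h>

/* In-place half-complex to real inverse of real_radix2_fft,
   unnormalized by a factor of N.  Bit-reversal is applied last. */
void real_radix2_ifft(double g[], int N)
{
  for (int p = N, p1 = N/2; p >= 2; p = p1, p1 /= 2) {
    int q = N / p;
    for (int b = 0; b < q; b++) {          /* a = 0 */
      double t0 = g[b*p] + g[b*p + p1];
      double t1 = g[b*p] - g[b*p + p1];
      g[b*p] = t0;
      g[b*p + p1] = t1;
    }
    for (int a = 1; a < p1 / 2; a++) {
      double wr = cos(2 * M_PI * a / p);   /* W^{-a}_p = wr + i wi */
      double wi = sin(2 * M_PI * a / p);
      for (int b = 0; b < q; b++) {
        double z0r = g[b*p + a],      z0i = g[b*p + p - a];
        double z1r = g[b*p + p1 - a], z1i = -g[b*p + p1 + a];
        double t0r = z0r + z1r, t0i = z0i + z1i;
        double t1r = z0r - z1r, t1i = z0i - z1i;
        g[b*p + a]      = t0r;
        g[b*p + p1 - a] = t0i;
        g[b*p + p1 + a] = wr * t1r - wi * t1i; /* Re(W^{-a} t_1) */
        g[b*p + p - a]  = wr * t1i + wi * t1r; /* Im(W^{-a} t_1) */
      }
    }
    if (p1 >= 2)
      for (int b = 0; b < q; b++) {        /* a = p1/2 */
        g[b*p + p1/2]      *=  2.0;
        g[b*p + p1 + p1/2] *= -2.0;
      }
  }
  /* bit-reverse ordering of g (omitted here) */
}
\end{verbatim}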
\subsection{Mixed-Radix FFTs for real data}
%
As discussed earlier, the radix-2 decimation-in-time algorithm has the
special property that its intermediate passes are interleaved fourier
transforms of the original data, and this generalizes to the
mixed-radix algorithm. The complex mixed-radix algorithm that we
derived earlier was a decimation-in-frequency algorithm, but we can
obtain a decimation-in-time version by taking the transpose of the
decimation-in-frequency DFT matrix like this,
%
\begin{eqnarray}
W_N &=& W_N^T \\
&=& (T_{n_f} \dots T_2 T_1)^T \\
&=& T_1^T T_2^T \dots T_{n_f}^T
\end{eqnarray}
%
with,
%
\begin{eqnarray}
T_i^T &=& \left( (P^{f_i}_{q_i} D^{f_i}_{q_i} \otimes I_{p_{i-1}})
(W_{f_i} \otimes I_{m_i}) \right)^T \\
&=& (W_{f_i} \otimes I_{m_i})
( D^{f_i}_{q_i} (P^{f_i}_{q_i})^T \otimes I_{p_{i-1}}).
\end{eqnarray}
%
We have used the fact that $W$, $D$ and $I$ are symmetric and that the
permutation matrix $P$ obeys,
%
\begin{equation}
(P^a_b)^T = P^b_a.
\end{equation}
%
From the definitions of $D$ and $P$ we can derive the following identity,
%
\begin{equation}
D^a_b P^b_a = P^b_a D^b_a.
\end{equation}
%
This allows us to put the transpose into a simple form,
%
\begin{equation}
T_i^T = (W_{f_i} \otimes I_{m_i})
(P^{q_i}_{f_i} D^{q_i}_{f_i} \otimes I_{p_{i-1}}).
\end{equation}
%
The transposed matrix $T_i^T$ applies the digit-reversal $P$ before the
DFT $W$, giving the required decimation-in-time algorithm. The
transpose reverses the order of the factors --- $T_{n_f}$ is applied
first and $T_1$ is applied last. For convenience we can reverse the
order of the factors, $f_1 \leftrightarrow f_{n_f}$, $f_2
\leftrightarrow f_{n_f-1}$, \dots and make the corresponding
substitution $p_{i-1} \leftrightarrow q_i$. These substitutions give
us a decimation-in-time algorithm with the same ordering as the
decimation-in-frequency algorithm,
%
\begin{equation}
W_N = T_{n_f} \dots T_2 T_1
\end{equation}
%
\begin{equation}
T_i = (W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
\end{equation}
%
where $p_i$, $q_i$ and $m_i$ now have the same meanings as before,
namely,
%
\begin{eqnarray}
p_i &=& f_1 f_2 \dots f_i \quad (p_0 = 1) \\
q_i &=& N / p_i \\
m_i &=& N / f_i
\end{eqnarray}
%
Now we would like to prove that the iteration for computing $x = W z =
T_{n_f} \dots T_2 T_1 z$ has the special interleaving
property. If we write the result of each intermediate pass as
$v^{(i)}$,
%
\begin{eqnarray}
v^{(0)} &=& z \\
v^{(1)} &=& T_1 v^{(0)} \\
v^{(2)} &=& T_2 v^{(1)} \\
\dots &=& \dots \\
v^{(i)} &=& T_i v^{(i-1)}
\end{eqnarray}
%
then we will show that the intermediate results $v^{(i)}$ on any pass
can be written as,
%
\begin{equation}
v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
\end{equation}
%
Each intermediate stage will be a set of $q_i$ interleaved fourier
transforms, each of length $p_i$. We can prove this result by
induction. First we assume that the result is true for $v^{(i-1)}$,
%
\begin{equation}
v^{(i-1)} = (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \qquad \mbox{(assumption)}
\end{equation}
%
And then we examine the next iteration using this assumption,
%
\begin{eqnarray}
v^{(i)} &=& T_i v^{(i-1)} \\
&=& T_i (W_{p_{i-1}} \otimes I_{q_{i-1}}) z \\
&=& (W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})
(W_{p_{i-1}} \otimes I_{q_{i-1}}) z \label{dit-induction}
\end{eqnarray}
%
Using the relation $m_i = p_{i-1} q_i$, we can write $I_{m_i}$ as
$I_{p_{i-1} q_i}$ and $I_{q_{i-1}}$ as $I_{f_i q_i}$. By combining these
with the basic matrix identity,
%
\begin{equation}
I_{ab} = I_a \otimes I_b
\end{equation}
%
we can rewrite $v^{(i)}$ in the following form,
%
\begin{equation}
v^{(i)} = (((W_{f_i} \otimes I_{p_{i-1}})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
(W_{p_{i-1}} \otimes I_{f_i})) \otimes I_{q_i}) z .
\end{equation}
%
The first part of this matrix product is the two-factor expansion of
$W_{ab}$, for $a = p_{i-1}$ and $b = f_i$,
%
\begin{equation}
W_{p_{i-1} f_i} = ((W_{f_i} \otimes I_{p_{i-1}})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})
(W_{p_{i-1}} \otimes I_{f_i})).
\end{equation}
%
If we substitute this result, remembering that $p_i = p_{i-1} f_i$, we
obtain,
%
\begin{equation}
v^{(i)} = (W_{p_i} \otimes I_{q_i}) z
\end{equation}
%
which is the desired result. The case $i=1$ can be verified
explicitly, and induction then shows that the result is true for all
$i$. As discussed for the radix-2 algorithm this result is important
because if the initial data $z$ is real then each intermediate pass is
a set of interleaved fourier transforms of $z$, having half-complex
symmetries (appropriately applied in the subspaces of the Kronecker
product). Consequently only $N$ real numbers are needed to store the
intermediate and final results.
\subsection{Implementation}
%
The implementation of the mixed-radix real FFT algorithm follows the
same principles as the full complex transform. Some of the steps are
applied in the opposite order because we are dealing with a decimation
in time algorithm instead of a decimation in frequency algorithm, but
the basic outer structure of the algorithm is the same. We want to
apply the factorized version of the DFT matrix $W_N$ to the input
vector $z$,
%
\begin{eqnarray}
x &=& W_N z \\
&=& T_{n_f} \dots T_2 T_1 z
\end{eqnarray}
%
We loop over the $n_f$ factors, applying each matrix $T_i$ to the
vector in turn to build up the complete transform,
%
\begin{algorithm}
\FOR{$(i = 1 \dots n_f)$}
\STATE{$v \Leftarrow T_i v $}
\ENDFOR
\end{algorithm}
%
In this case the definition of $T_i$ is different because we have
taken the transpose,
%
\begin{equation}
T_i =
(W_{f_i} \otimes I_{m_i})
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}).
\end{equation}
%
We'll define a temporary vector $t$ to denote the results of applying the
rightmost matrix,
%
\begin{equation}
t = (P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i}) v
\end{equation}
%
If we expand this out into individual components, as before, we find a
similar simplification,
%
\begin{eqnarray}
t_{aq+b}
&=&
\sum_{a'b'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i} \otimes I_{q_i})_{(aq+b)(a'q+b')}
v_{a'q+b'} \\
&=&
\sum_{a'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{a a'}
v_{a'q+b}
\end{eqnarray}
%
We have factorized the indices into the form $aq+b$, with $0 \leq a <
p_{i}$ and $0 \leq b < q$. Just as in the decimation in frequency
algorithm we can split the index $a$ to remove the matrix
multiplication completely. We have to write $a$ as $\mu f + \lambda$,
where $0 \leq \mu < p_{i-1}$ and $0 \leq \lambda < f$,
%
\begin{eqnarray}
t_{(\mu f + \lambda)q+b}
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i} D^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
v_{(\mu' f + \lambda')q+b}
\\
&=&
\sum_{\mu'\lambda'}
(P^{p_{i-1}}_{f_i})_{(\mu f + \lambda)(\mu' f + \lambda')}
\omega^{\mu'\lambda'}_{p_{i}}
v_{(\mu' f + \lambda')q+b}
\end{eqnarray}
%
The matrix $P^{p_{i-1}}_{f_i}$ exchanges an index of $(\mu f +
\lambda) q + b$ with $(\lambda p_{i-1} + \mu) q + b$, giving a final
result of,
%
\begin{eqnarray}
t_{(\lambda p_{i-1} + \mu) q + b}
=
\omega^{\mu\lambda}_{p_i} v_{(\mu f + \lambda)q + b}
\end{eqnarray}
%
To calculate the next stage,
%
\begin{equation}
v' = (W_{f_i} \otimes I_{m_i}) t,
\end{equation}
%
we temporarily rearrange the index of $t$ to separate the $m_{i}$
independent DFTs in the Kronecker product,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}
\end{equation}
%
If we use the identity $m = p_{i-1} q$ to rewrite the index of $t$
like this,
%
\begin{equation}
t_{(\lambda p_{i-1} + \mu) q + b} = t_{\lambda m + (\mu q + b)}
\end{equation}
%
we can split the Kronecker product,
%
\begin{eqnarray}
v'_{(\lambda p_{i-1} + \mu) q + b}
&=&
\sum_{\lambda' \mu' b'}
(W_{f_i} \otimes I_{m_i})_{
((\lambda p_{i-1} + \mu) q + b)
((\lambda' p_{i-1} + \mu') q + b')}
t_{(\lambda' p_{i-1} + \mu') q + b'}\\
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{\lambda' m_i + (\mu q + b)}
\end{eqnarray}
%
If we switch back to the original form of the index in the last line we obtain,
%
\begin{eqnarray}
\phantom{v'_{(\lambda p_{i-1} + \mu) q + b}}
&=&
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
t_{(\lambda' p_{i-1} + \mu) q + b}
\end{eqnarray}
%
which allows us to substitute our previous result for $t$,
%
\begin{equation}
v'_{(\lambda p_{i-1} + \mu) q + b}
=
\sum_{\lambda'}
(W_{f_i})_{\lambda \lambda'}
\omega^{\mu\lambda'}_{p_i} v_{(\mu f + \lambda')q + b}
\end{equation}
%
This gives us the basic decimation-in-time mixed-radix algorithm for
complex data which we will be able to specialize to real data,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
\omega^{\mu\lambda}_{p_{i}} v_{(\mu f + \lambda)q + b}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda p_{i-1} + \mu)q + b} =
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
We are now at the point where we can convert an algorithm formulated
in terms of complex variables to one in terms of real variables by
choosing a suitable storage scheme. We will adopt the FFTPACK storage
convention. FFTPACK uses a scheme where individual FFTs are
contiguous, not interleaved, and real-imaginary pairs are stored in
neighboring locations. This has better locality than was possible for
the radix-2 case.
The interleaving of the intermediate FFTs results from the Kronecker
product, $W_p \otimes I_q$. The FFTs can be made contiguous if we
reorder the Kronecker product on the intermediate passes,
%
\begin{equation}
W_p \otimes I_q \Rightarrow I_q \otimes W_p
\end{equation}
%
This can be implemented by a simple change in indexing. On pass $i$
we store the element $v_{a q_i + b}$ in location $b p_i + a$. We
compensate for this change by making the same transposition when
reading the data. Note that this only affects the indices of the
intermediate passes. On the zeroth iteration the transposition has no
effect because $p_0 = 1$. Similarly there is no effect on the last
iteration, which has $q_{n_f} = 1$. This is how the algorithm above
looks after this index transformation,
%
\begin{algorithm}
\FOR{$i = 1 \dots n_f$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_\lambda \Leftarrow
\omega^{\mu\lambda}_{p_{i}} v_{(\lambda q + b)p_{i-1} + \mu}$}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{b p + (\lambda p_{i-1} + \mu)} =
\sum_{\lambda'=0}^{f-1} W_f(\lambda,\lambda') t_{\lambda'}$}
\COMMENT{DFT matrix-multiply module}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
We transpose the input terms by writing the index in the form $a
q_{i-1} + b$, to take account of the pass-dependence of the scheme,
%
\begin{equation}
v_{(\mu f + \lambda')q + b} = v_{\mu q_{i-1} + \lambda'q + b}
\end{equation}
%
We used the identity $q_{i-1} = f q$ to split the index. Note that in
this form $\lambda'q + b$ runs from 0 to $q_{i-1} - 1$ as $b$ runs
from 0 to $q-1$ and $\lambda'$ runs from 0 to $f-1$. The transposition
for the input terms then gives,
%
\begin{equation}
v_{\mu q_{i-1} + \lambda'q + b} \Rightarrow v_{(\lambda'q + b) p_{i-1} + \mu}
\end{equation}
%
In the FFTPACK scheme the intermediate output data have the same
half-complex symmetry as the radix-2 example, namely,
%
\begin{equation}
v^{(i)}_{b p + a} = v^{(i)*}_{b p + (p - a)}
\end{equation}
%
and the input data from the previous pass has the symmetry,
%
\begin{equation}
v^{(i-1)}_{(\lambda' q + b) p_{i-1} + \mu} = v^{(i-1)*}_{(\lambda' q +
b) p_{i-1} + (p_{i-1} - \mu)}
\end{equation}
%
Using these symmetries we can halve the storage and computation
requirements for each pass. Compared with the radix-2 algorithm we
have more freedom because the computation does not have to be done in
place. The storage scheme adopted by FFTPACK places elements
sequentially with real and imaginary parts in neighboring
locations. Imaginary parts which are known to be zero are not
stored. Here are the full details of the scheme,
%
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|l|lll|}
\hline Term & & Location & \\
\hline
$g(b p_i)$
& real part & $b p_{i} $ & \\
& imag part & --- & \\
\hline
$g(b p_i + a)$
& real part & $b p_{i} + 2a - 1 $& for $a = 1 \dots p_i/2 - 1$ \\
& imag part & $b p_{i} + 2a$ & \\
\hline
$g(b p_{i} + p_{i}/2)$
& real part & $b p_i + p_{i} - 1$ & if $p_i$ is even\\
& imag part & --- & \\
\hline
\end{tabular}
\end{center}
%
The real element for $a=0$ is stored in location $bp$. The real parts
for $a = 1 \dots p/2 - 1$ are stored in locations $bp + 2a -1$ and the
imaginary parts are stored in locations $b p + 2 a$. When $p$ is even
the term for $a = p/2$ is purely real and we store it in location $bp
+ p - 1$. The zero imaginary part is not stored.
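In C the layout reduces to a pair of index helpers. This is only a
sketch of the convention, with names of our own choosing,
%
\begin{verbatim}
/* Location of Re g(b p + a) in the FFTPACK-style layout. */
static int re_loc(int b, int p, int a)
{
  if (a == 0)    return b*p;           /* purely real term */
  if (2*a == p)  return b*p + p - 1;   /* purely real term */
  return b*p + 2*a - 1;
}

/* Location of Im g(b p + a), valid for 0 < a < p/2 only. */
static int im_loc(int b, int p, int a)
{
  return b*p + 2*a;
}
\end{verbatim}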
When we compute the basic iteration,
%
\begin{equation}
v^{(i)}_{b p + (\lambda p_{i-1} + \mu)} = \sum_{\lambda'}
W^{\lambda \lambda'}_f
\omega^{\mu\lambda'}_{p_i} v^{(i-1)}_{(\lambda' q + b)p_{i-1} + \mu}
\end{equation}
%
we eliminate the redundant conjugate terms with $a > p_{i}/2$ as we
did in the radix-2 algorithm. Whenever we need to store a term with $a
> p_{i}/2$ we consider instead the corresponding conjugate term with
$a' = p - a$. Similarly when reading data we replace terms with $\mu >
p_{i-1}/2$ with the corresponding conjugate term for $\mu' = p_{i-1} -
\mu$.
Since the input data on each stage has half-complex symmetry we only
need to consider the range $\mu=0 \dots p_{i-1}/2$, and we now
consider the best way to implement the basic iteration for each value
of $\mu$ in this range.
When $\mu=0$ we access elements which are the zeroth components of
the independent transforms $W_{p_{i-1}} \otimes I_{q_{i-1}}$, and
these are purely real.
%
We can code the pass with $\mu=0$ as a special case for real input
data, and conjugate-complex output. When $\mu=0$ the twiddle factors
$\omega^{\mu\lambda'}_{p_i}$ are all unity, giving a further saving.
We can obtain small-$N$ real-data DFT modules by removing the
redundant operations from the complex modules.
%
For example the $N=3$ module was,
%
\begin{equation}
\begin{array}{lll}
t_1 = z_1 + z_2, &
t_2 = z_0 - t_1/2, &
t_3 = \sin(\pi/3) (z_1 - z_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = z_0 + t_1, &
x_1 = t_2 + i t_3, &
x_2 = t_2 - i t_3.
\end{array}
\end{equation}
%
In the complex case all the operations were complex, for complex input
data $z_0$, $z_1$, $z_2$. In the real case $z_0$, $z_1$ and $z_2$ are
all real. Consequently $t_1$, $t_2$ and $t_3$ are also real, and the
symmetry $x_1 = t_2 + i t_3 = x^*_2$ means that we do not have to
compute $x_2$ once we have computed $x_1$.
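The resulting real-input module might look like this in C (a sketch,
with a function name of our own),
%
\begin{verbatim}
#include <math.h>

/* N=3 DFT module for real input.  Returns x_0 (real) and x_1;
   x_2 = conj(x_1) is never computed. */
static void dft3_real(const double z[3],
                      double *x0, double *x1_re, double *x1_im)
{
  double t1 = z[1] + z[2];
  double t2 = z[0] - t1 / 2;
  double t3 = sin(M_PI / 3) * (z[1] - z[2]);
  *x0 = z[0] + t1;
  *x1_re = t2;
  *x1_im = t3;
}
\end{verbatim}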
For subsequent passes $\mu = 1 \dots p_{i-1}/2 - 1$ the input data is
complex and we have to compute full complex DFTs using the same
modules as in the complex case. Note that the inputs are all of the
form $v_{(\lambda q + b) p_{i-1} + \mu}$ with $\mu < p_{i-1}/2$ so we
never need to use the symmetry to access the conjugate elements with
$\mu > p_{i-1}/2$.
If $p_{i-1}$ is even then we reach the halfway point $\mu=p_{i-1}/2$,
which is another special case. The input data in this case is purely
real because $\mu = p_{i-1} - \mu$ for $\mu = p_{i-1}/2$. We can code
this as a special case, using real inputs and real-data DFT modules as
we did for $\mu=0$. However, for $\mu = p_{i-1}/2$ the twiddle factors
are not unity,
%
\begin{eqnarray}
\omega^{\mu\lambda'}_{p_i} &=& \omega^{(p_{i-1}/2)\lambda'}_{p_i} \\
&=& \exp(-i\pi\lambda'/f_i)
\end{eqnarray}
%
These twiddle factors introduce an additional phase which modifies the
symmetry of the outputs. Instead of the conjugate-complex symmetry
which applied for $\mu=0$ there is a shifted conjugate-complex
symmetry,
%
\begin{equation}
t_\lambda = t^*_{f-(\lambda+1)}
\end{equation}
%
This is easily proved,
%
\begin{eqnarray}
t_\lambda
&=&
\sum e^{-2\pi i \lambda\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
t_{f - (\lambda + 1)}
&=&
\sum e^{-2\pi i (f-\lambda-1)\lambda'/f_i} e^{-i\pi \lambda'/f_i} r_{\lambda'} \\
&=&
\sum e^{2\pi i \lambda\lambda'/f_i} e^{i\pi \lambda'/f_i} r_{\lambda'} \\
&=& t^*_\lambda
\end{eqnarray}
%
The symmetry of the output means that we only need to compute half of
the output terms, the remaining terms being conjugates or zero
imaginary parts. For example, when $f=4$ the outputs are $(x_0 + i
y_0, x_1 + i y_1, x_1 - i y_1, x_0 - i y_0)$. For $f=5$ the outputs
are $(x_0 + i y_0, x_1 + i y_1, x_2, x_1 - i y_1, x_0 - i y_0)$. By
combining the twiddle factors and DFT matrix we can make a combined
module which applies both at the same time. By starting from the
complex DFT modules and bringing in twiddle factors we can derive
optimized modules. Here are the modules given by Temperton for $z = W
\Omega x$ where $x$ is real and $z$ has the shifted conjugate-complex
symmetry. The matrix of twiddle factors, $\Omega$, is given by,
%
\begin{equation}
\Omega = \mathrm{diag}(1, e^{-i\pi/f}, e^{-2\pi i/f}, \dots, e^{-i\pi(f-1)/f})
\end{equation}
%
We write $z$ in terms of two real vectors $z = a + i b$.
%
For $N=2$,
%
\begin{equation}
\begin{array}{ll}
a_0 = x_0, &
b_0 = - x_1.
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{l}
t_1 = x_1 - x_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1/2, & b_0 = - \sin(\pi/3) (x_1 + x_2),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
a_1 = x_0 - t_1
\end{array}
\end{equation}
%
For $N=4$,
%
\begin{equation}
\begin{array}{ll}
t_1 = (x_1 - x_3)/\sqrt{2}, & t_2 = (x_1 + x_3)/\sqrt{2},
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = x_0 + t_1, & b_0 = -x_2 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_1, & b_1 = x_2 - t_2.
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = x_1 - x_4, & t_2 = x_1 + x_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_3, & t_4 = x_2 + x_3,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = t_1 - t_3, & t_6 = x_0 + t_5 / 4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_7 = (\sqrt{5}/4)(t_1 + t_3) &
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_6 + t_7, & b_0 = -\sin(2\pi/10) t_2 - \sin(2\pi/5) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = t_6 - t_7, & b_1 = -\sin(2\pi/5) t_2 + \sin(2\pi/10) t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_2 = x_0 - t_5 &
\end{array}
\end{equation}
%
For $N=6$,
%
\begin{equation}
\begin{array}{ll}
t_1 = \sin(\pi/3)(x_5 - x_1), & t_2 = \sin(\pi/3) (x_2 + x_4),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = x_2 - x_4, & t_4 = x_1 + x_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_5 = x_0 + t_3 / 2, & t_6 = -x_3 - t_4 / 2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_0 = t_5 - t_1, & b_0 = t_6 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_1 = x_0 - t_3, & b_1 = x_3 - t_4,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
a_2 = t_5 + t_1, & b_2 = t_6 + t_2
\end{array}
\end{equation}
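%
As a concrete example, the $N=4$ module above transcribes into C as
follows (a sketch, with a function name of our own),
%
\begin{verbatim}
#include <math.h>

/* N=4 module for z = W Omega x with real input x; the output has
   the shifted conjugate-complex symmetry, so only a_0 + i b_0 and
   a_1 + i b_1 are returned. */
static void dft4_shifted(const double x[4],
                         double *a0, double *b0,
                         double *a1, double *b1)
{
  double t1 = (x[1] - x[3]) / sqrt(2.0);
  double t2 = (x[1] + x[3]) / sqrt(2.0);
  *a0 = x[0] + t1;   *b0 = -x[2] - t2;
  *a1 = x[0] - t1;   *b1 =  x[2] - t2;
}
\end{verbatim}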
\section{Computing the mixed-radix inverse for real data}
%
To compute the inverse of the mixed-radix FFT on real data we step
through the algorithm in reverse and invert each operation.
This gives the following algorithm using FFTPACK indexing,
%
\begin{algorithm}
\FOR{$i = n_f \dots 1$}
\FOR{$\mu = 0 \dots p_{i-1} - 1$}
\FOR{$b = 0 \dots q-1$}
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$t_{\lambda} =
\sum_{\lambda'=0}^{f-1} W^{-1}_f(\lambda,\lambda')
v_{b p + (\lambda' p_{i-1} + \mu)}$}
\COMMENT{inverse DFT matrix-multiply module}
\ENDFOR
\FOR{$\lambda = 0 \dots f-1$}
\STATE{$v'_{(\lambda q + b)p_{i-1} + \mu} \Leftarrow
\omega^{-\mu\lambda}_{p_{i}} t_\lambda$}
\ENDFOR
\ENDFOR
\ENDFOR
\STATE{$v \Leftarrow v'$}
\ENDFOR
\end{algorithm}
%
When $\mu=0$ we are applying an inverse DFT to half-complex data,
giving a real result. The twiddle factors are all unity. We can code
this as a special case, just as we did for the forward routine. We
start with the complex module and eliminate the redundant terms. In this
case it is the final result which has the zero imaginary part, and we
eliminate redundant terms by using the half-complex symmetry of the
input data.
When $\mu=1 \dots p_{i-1}/2 - 1$ we have full complex transforms on
complex data, so no simplification is possible.
When $\mu = p_{i-1}/2$ (which occurs only when $p_{i-1}$ is even) we
have a combination of twiddle factors and DFTs on data with the
shifted half-complex symmetry which give a real result. We implement
this as a special module, essentially by inverting the system of
equations given for the forward case. We use the modules given by
Temperton, appropriately modified for our version of the algorithm. He
uses a slightly different convention which differs by factors of two
for some terms (consult his paper for details~\cite{temperton83real}).
For $N=2$,
%
\begin{equation}
\begin{array}{ll}
x_0 = 2 a_0, & x_1 = - 2 b_0 .
\end{array}
\end{equation}
%
For $N=3$,
%
\begin{equation}
\begin{array}{ll}
t_1 = a_0 - a_1, & t_2 = \sqrt{3} b_0, \\
\end{array}
\end{equation}
\begin{equation}
\begin{array}{lll}
x_0 = 2 a_0 + a_1, & x_1 = t_1 - t_2, & x_2 = - t_1 - t_2
\end{array}
\end{equation}
%
For $N=4$,
\begin{equation}
\begin{array}{ll}
t_1 = \sqrt{2} (b_0 + b_1), & t_2 = \sqrt{2} (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = 2(a_0 + a_1), & x_1 = t_2 - t_1 ,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = 2(b_1 - b_0), & x_3 = -(t_2 + t_1).
\end{array}
\end{equation}
%
For $N=5$,
%
\begin{equation}
\begin{array}{ll}
t_1 = 2 (a_0 + a_1), & t_2 = t_1 / 4 - a_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_3 = (\sqrt{5}/2) (a_0 - a_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_4 = 2(\sin(2\pi/10) b_0 + \sin(2\pi/5) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{l}
t_5 = 2(\sin(2\pi/5) b_0 - \sin(2\pi/10) b_1),
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
t_6 = t_3 + t_2, & t_7 = t_3 - t_2,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_0 = t_1 + a_2, & x_1 = t_6 - t_4 ,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_2 = t_7 - t_5, & x_3 = - t_7 - t_5,
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
x_4 = -t_6 - t_4.
\end{array}
\end{equation}
\section{Conclusions}
%
We have described the basic algorithms for one-dimensional radix-2 and
mixed-radix FFTs. It would be nice to have a pedagogical explanation
of the split-radix FFT algorithm, which is faster than the simple
radix-2 algorithm we used. We could also have a whole chapter on
multidimensional FFTs.
%
%\section{Multidimensional FFTs}
%\section{Testing FFTs, Numerical Analysis}
%\nocite{*}
\bibliographystyle{unsrt}
\bibliography{fftalgorithms}
\end{document}