PFB_32, A 32-pnt Biplex Pipelined Polyphase Filter Bank
Parsons, A., Werthimer, D.
Background
This document is intended to be a guide for implementing and customizing an
architecture for performing a polyphase filter bank (PFB) in a Xilinx FPGA
using a blockset designed under the Matlab/Simulink/System Generator toolset.
These designs are the property of the UC Berkeley SETI group, copyright 2004.
The latest revision of the files in question is
here.
The PFB was invented to account for the fact that the Discrete Fourier
Transform (DFT) is an imperfect approximation of a set of band-pass filters.
Ideally, each sample out of a DFT would represent the response of a perfectly
rectangular band-pass filter, centered on an integral multiple Fn/N (the
Nyquist frequency divided by size of the DFT), with a width of exactly
Fn/N. In reality, each sample out of a DFT has a sin(x)/x frequency response.
The PFB squares off this filter shape by first convolving the input with a
sin(x)/x function over a long time window before following up with an DFT.
The length of this window determines the steepness of the edges on the filter;
typically, it may be 4 to 8 times the length of the DFT.
View a plot illustrating the improved filter shape of a frequency sample using
a PFB vs. a DFT
The FIR Filter
In the architecture implemented here, the convolution is performed in an FIR
filter. The taps of this filter draw signal samples spaced out in time in
increments of the length of the DFT. That is, for a 32 point DFT, each tap of
the corresponding FIR filter is seperated by a 32 sample delay.
Each tap is fed by a rotating cycle of coefficients. These coefficients are
determined by taking a length pi*T (where T is the number of FIR taps) interval
of a sin(x)/x function, centered on zero, and divided into T*L (where L is the
length of the DFT window) samples. Each tap is fed by one length of L of these
samples, arranged so that the first input sample is multiplied by the sin(x)/x
sample corresponding to the highest x, and the last input sample is paired
with the lowest-x sin(x)/x sample.
Filter response can be modified (generally by trading steepness of
rolloff for flatness in the passband) by applying a windowing function
to the sin(x)/x coefficients. Two commonly used windows are the Hamming window
(1/2 + cos(x)/2) and the Parzen window (1 + x for negative x, 1 - x for positive
x). Thus for an 8 tap FIR with a Hamming window, the coefficients correspond
to the function:
(1/2 + cos(x/4)/2) * sin(x)/x on the interval [-4*pi, 4*pi).
View a Simulink implementation of an 8 Tap FIR filter,
sans coefficient generator.
The windowing function and, to a lesser extent, the number of taps determines
signal gain through the FIR filter. The gain varies from sample to sample, but
repeats over the window length of the DFT. It makes the most sense to talk
about the gain of a signal through the entire PFB. Parcival's theorum states
that the power into a DFT is equal to the power out of one (barring any
additional rescaling added to it), so the gain of a PFB is equal to the FIR
gain, summed over the window length of the DFT. Thus, one can calculate
the gain of specific filter to noise by summing the squares of all the
FIR coefficients and dividing by the length of the DFT window. An 8 tap
FIR without an additional windowing function has a gain of -0.11 dB. Applying
a Hamming window reduces gain to -0.47 dB, and a Parzen window further reduces
gain to -0.93 dB.
FIR Implementation
When considering trade-offs between hardware resources and performance, it
can be useful to evaluate the effects of filter taps and finite precision
arithmetic on the filter response of a PFB bin. However, some of the following
plots can be misleading. Finite precision arithmetic places a noise floor on
plots which can easily be misinterpreted as a floor of the filter response.
For example, a plot may show that a high frequency signal causes a -40 dB
response it a PFB bin. However, if this related to finite precision arethmentic
(i.e. if this is caused by rounding in the least significant bit) then this
does not mean that a +40 dB signal at that frequency will cause a
response of +0 dB in that bin. Rather, the features of the filter shape below
the noise floor have been obscured by rounding error. The features are
there--the plots just do not illustrate them.
In considering the effects of various parameters on the filter response
of a PFB bin, the standard design with be a 32-pnt PFB with 8 FIR taps, 8 bit
input samples, 8 bit FIR coefficients, 18 bit samples into an FFT, and 18 bit
samples out of the FFT.
View a plot illustating the dependence of filter shape on the # of FIR taps.
View a plot illustating the dependence of filter shape on the bit width of
input samples.
View a plot illustating the dependence of filter shape on the bit width of
FIR coefficients
As an interesting aside--the means by which an input signal is converted to
a fixed point number can have a substantial effect on quiescent power in th
0th (DC) frequency bin. If an input signal is floored (truncated) to 8
bits, as is typically done by ADC's, a sizeable DC offset is introduced to
the signal, and all frequencies tend to leak power into that bin. However,
if the input signal is rounded to the nearest bit value, the DC bin displays
the expected response. The lesson here is that it may be
necessary to account for the flooring nature of the ADC by adding a small
DC bias (if the ADC has no other DC bias, this will only be 1/2 the value of
the least significant bit (LSB)) back into the signal.
View an overlay of all 32 PFB bins with a floored conversion of the input
signal to a fixed point number.
View an overlay of all 32 PFB bins with a rounded conversion of the input
signal to a fixed point number.
Implementing the FIR portion of the PFB requires the Simulink library file
"pfb_library.mdl". The simplest solution
is to instantiate a "pfb_fir" block, which implements a variable tap FIR filter
with a windowing function of your choice.
Bit widths of the input samples, output
samples, and coefficients may be set via this block's mask parameters. This
block handles the real and imaginary components of 2 independent input signals,
and uses one set of coefficients for all FIR filters.
P1_in and P2_in are the complex inputs for two independent data streams. The
upper half of the bits of these signals is interpreted as the real component
of the complex signal, and the lower half is interpreted as the imaginary
component. The "Sync_in" input
is used to reset the rotation of coefficients into the FIR filters. This
signal is a boolean vector warning signal, which is to say, it should be
asserted one clock before valid data appears on the data inputs. (P1_in, P2_in).
Only one pulse of this signal is necessary to set the rotation of coefficients
for the entire operation of the PFB, but since the rotation of coefficients
returns to its initial state every N clks (where N is the length of the DFT),
additional pulses can occur every N clks without interrupting the operation of
the PFB. "Sync_out" is an echo of the vector warning signal which appears one
clock before the output samples are valid.
One implementation quirk is that the "pfb_fir" block breaks its link with
the "pfb_library" block in order to modify its own contents. It may be
necessary to restore that link (which it will promptly break again) in order
to propagate the latest library revisions to the model block. Additionally,
the block only updates itself when the number of taps is changed (many of
its sub-blocks also have this property), so it is sometimes necessary to change
the number of taps to some arbitrary different number, and back again, in order
to force the block to reimplement itself.
The Radix-2 Biplex Pipelined FFT
The Fast Fourier Transform is an arithmetically optimized algorithm for
performing a DFT. It makes use of intermediate computations to ensure that
redundant computations are not performed when integrating an input signal
against sine waves of various frequencies. However, there are various
mathematically equivalent ways of implementing an FFT in hardware. A typical
solution, paralleling a
Radix-2 butterfly diagram, is to buffer
an entire window of input samples and for each butterfly drawn, implement a
separate Radix-2 butterfly in hardware (a Radix-2
butterfly, by the way, takes 2 inputs (a, b) and a coefficient (w) and
forms the results (a + b * w) and (a - b * w)).
This is an ideal solution if all input samples are arriving simultaneously, but
is inefficient if they show up one at a time, as they do when digitizing
a signal from a radio telescope. The Biplex Pipelined architecture reuses
butterflies in hardware and minimizes the buffering of data. This provides a
modest increase in hardware efficiency for a single polarization of data, but
the true power of this architecture lies in its ability to perform the FFT of
a second polarization at no additional cost in hardware.
View a biplex pipelined implementation of an 8 point FFT.
Implementing a biplex pipelined FFT in Simulink requires the Simulink library
file "fft_library.mdl". Instantiate an
"fft" block, and set its length via the 'Length of FFT' parameter. As described
in the FIR Implementation section, this block breaks its own link to the
"fft_library" block, and may require a link restoration to propagate the latest
revisions to the implemented block. Also, similar the the FIR tap contingency,
it is sometimes necessary to change the FFT length to some different value and
back again in order to force the block to reimplement itself. The "fft" block
has several parameters available to customize its implementation. 'Bitwidth'
adjusts the quiescent data width into and out of each FFT stage. The
'Implement Delays In BRAM' option specifies the cut-off on when a sample delay
should be implemented in BRAM vs. Distributed RAM. Similarly, the 'Use BRAM
for Coeff LUTs' parameter controls the cut-off for when coefficient look-up
tables should be implemented in BRAM. 'Shift Every ? Stages' controls how
often samples are divided by 2 in order to prevent overflow in finite precision
arithmetic as power is accumulated into frequency bins. Typically, gaussian
noise will require a shift every second stage (equivalent to rescaling data
by root-two every stage), and a pure frequency will double the power in
selected bins every stage, requiring a down-shift every stage to prevent
overflow. Finally, 'Cap Coefficient LUT Depth' sets a limit on how many
coefficients a butterfly is allowed to use. For large FFTs, later stages
may require vast numbers of coefficients which do not differ largely from one
another, and usually can perform nearly as well with far fewer coefficients.