Open Access

Implementation of a reconfigurable ASIP for high throughput low power DFT/DCT/FIR engine

EURASIP Journal on Embedded Systems20122012:3

DOI: 10.1186/1687-3963-2012-3

Received: 19 June 2011

Accepted: 5 April 2012

Published: 5 April 2012


In this article we present an ASIP design for a discrete fourier transform (DFT)/discrete cosine transform (DCT)/finite impulse response filters (FIR) engine. The engine is intended for use in an accelerator-chain implementation of wireless communication systems. The engine offers a very high degree of flexibility, accepting and accelerating performance approaches that of any-number DFT and inverse discrete fourier transform, one and two dimension DCT, and even general implementations of FIR equations. Performance approaches that of dedicated implementations of such algorithms. A customized yet flexible redundant memory map allows processor-like access while maintaining the pipeline full in a dedicated architecture-like manner. The engine is supported by a proprietary software tool that automatically sets the rounding pattern for the accelerator rounder to maintain a required signal to quantization noise or output RMS for any given algorithm. Programming of the processor is done through a mid-level language that combines register-specific instructions with DFT/DCT/FIR specific-instructions. Overall the engine allows users to program a very wide range of applications with software-like ease, while delivering performance very close to hardware. This puts the engine in an excellent spot in the current wireless communications environment with its profusion of multi-mode and emerging standards.


DFT DCT FIR ASIP reconfigurable hardware

1 Introduction

The rapid increase in the performance demand of wireless communication systems combined with the proliferation of standards both finalized and unfinalized has increased the need for a paradigm shift in the design of communication system blocks. Recent trends favor Software Defined Radio (SDR) systems due to their scalability and the ability to support multiple standards on the same platform. However, keeping performance within acceptable levels while doing this is a challenging research question.

Different approaches have been taken to address this question. Authors of [13] used Digital Signal Processors (DSPs) owing to their high configurability and adaptive capabilities. Although DSP performance is improving, it is still impractical due to its high power consumption and low throughput. On the other hand [4, 5] used configurable HW systems due to the high performance afforded by such platforms. However, these designs fail to catch up with the rapid growth in communication standards; they only support a limited class of algorithms for which they are specifically designed. Application specific instruction processors (ASIPs) offer an interesting position between the two approaches, allowing programming-like flexibility for a certain class of applications under speed and power constraints.

Different approaches to ASIPs offer different levels of flexibility. For example: [68] proposed an ASIP design which has the reconfigurability to support all/some functions of the physical layer Orthogonal Frequency Division (OFDM) receiver chain including OFDM Modulation/Demodulation, channel estimation, turbo decoder, etc. This reconfigurability between non-similar functions has a severe effect on performance, lowering throughput, raising power, or both. Realizing that these blocks operate simultaneously in a pipeline in an OFDM receiver, a different approach to partitioning the problem can be taken.

The work presented provides a limited class of MICRO-CODED programmable solutions to support a large class of OFDM wireless applications. The receiver chain is divided to four main ASIP processors seen in Figure 1. Each block has enough flexibility to support an extensive set of applications and configurations within its class while at the same time preserving hardwired-like performance.
Figure 1

Physical layer OFDM receiver model.

This chapter proposes the OFDM Modulation/Demodulation block which is basically based on Discrete Fourier Transform (DFT) and extended to support similar transformations like Discrete Cosine Transform (DCT) and finite impulse response filters (FIR). DFTs, DCTs, and FIRs are used in innumerable communication and signal processing applications. For example: the DFT is commonly used in high data rate Orthogonal Frequency Division Multiplexing (OFDM) systems such as Long Term Evolution (LTE), WiMax, WiLAN, DVB-T, etc; one of the main reasons is to increase robustness against frequency selective fading and narrow-band interference. One and two dimensional DCT are often used in audio and image processing systems such as interactive multimedia, digital TV-NTSC, low bit rate video conferencing, etc; owing to its compaction of energy into the lower frequencies. Finally FIR, is commonly used in digital signal processing applications that have a frequency spectrum with a wide range of frequency to filter frequency components by isolation, rejection or attenuation depending on system implementation.

1.1 Paper overview

We build on previous studies in [9, 10] where we presented a memory based architecture controlled by an instruction set processor. In this study we combine all elements of the design: performing further optimization on the processing elements (PE) to increase their flexibility and performance; as well as presenting a complete implementation including the full memory map and the programming front-end.

The supported mathematical algorithms are discussed in Section 2, This is followed by the system architecture and embedded processor in Section 3. The hardware (HW) accelerators in Section 4, and engine programing with coding example in Section 5. Section 6 details ASIC results and comparison among previously published designs. Section 7 concludes the article.

2 Supported algorithms

The engine can support multiple algorithms some of these algorithms are listed below.

2.1 DFT

N-point Discrete Fourier Transform is defined as:
DFT ( x n ) = n = 0 N - 1 x ( n ) W N k n

where: k = 0 , N - 1 W N = e - 2 π i / N

The direct implementation of Equation (1) is O(N2) which makes it difficult to meet typical throughput requirements. Common DFT symbol length in different communication and signal processing standard is in form 2 x except LTE down link which supports length 1536 = 29 × 3. Thus optimizing the throughput of a 2 x × 3 y -point DFT is our main concern.

Cooley-Tukey [11] proposed radix-r algorithms, which reduce the N-point DFT computational complexity to O(N log r N). The main principle of these algorithms is decomposing the computation of the discrete fourier transform of a sequence of length N into smaller discrete fourier transforms see Figure 2.
Figure 2

Flow graph of the decimation-in-frequency decomposition of an N -point DFT computation into four (N/4)-point DFT computations ( N = 16).

For lower computation cycle counts, Higher radix algorithm should be used. In practice, the radix-2 algorithm throughput requires four times the number of cycles than the radix-4 algorithm and radix-4 algorithm requires four times the number of cycles of the radix-8 algorithm. On the other hand, higher radix implementations have big butterflies thus they consume higher power and need more complex address generators to handle data flow.

From this trade of between the radix-r algorithm throughput and used butterfly size. We defined the parameter power efficiency which introduces how much power is taken to have certain throughput. Table 1 shows a comparison between the three radix butterflies. For fair comparison we toke the following assumptions:
Table 1

Energy consumed for N-point FFT vs.





Number of butterflies




Number of stages




Number of butterflies


N 2

N 4

N 8

Number of clock





Total number of clock

cycles for N -point FFT

N 4 log 2 ( N ) N 4 ( x )

N 4 log 4 ( N ) N 4 x 2

N 4 log 8 ( N ) × 2 N 4 x 3

Normalized power




Power efficiency

0.5 × N × x

0.375 × N × x

0.43 × N × x

As N = 2 x


Radix-r algorithms

  • Fix the address generators complexity, by assuming the data are read from memory 4 samples by 4 samples.

  • Normalize butterfly power by number of complex multipliers on it, which is the the dominant power consumer in the butterfly.
    Power efficiency = Power Throughput No of multipliers 1 / No of cycles to end
From Table 1 The Radix-4 algorithm have a lowest power consumption in addition to its regularity, it more interested specially in memory based architectures. Radix-4 algorithm supports only 4 z -point DFTs, So radix-2 and radix-3 algorithms are required to support all symbol lengths in the form of 2 x × 3 y . Radix-4, 2 and 3 butterflies are shown in Figures 3 and 4.
Figure 3

Radix-4 butterfly.

Figure 4

The other supported Radix-r butterflies (a) Radix-2 butterfly. (b) Radix-3 butterfly.

2.2 Inverse DFT

Swapping the real and imaginary parts of input and output data of DFT, we can get the N-point Inverse Discrete Fourier Transform (IDFT) (Equation 3) of a sequence X(K) scaled by N (Equation 4).
IDFT ( X k ) = 1 N n = 0 N - 1 X ( k ) W N - k n , k = 0 , , N - 1
IDFT ( X k ) * N = n = 0 N - 1 X ( k ) W N - k n = k = 0 N - 1 X * T ( k ) W N k n * T IDFT ( X k ) * T N scale factor = k = 0 N - 1 X * T ( k ) W N k n DFT of x * T

2.3 DCT

Several types of the DCT of a sequence x(n) are defined in [12]. The most popular being type II which is defined as:
DCT ( x n ) = ω ( k ) n = 0 N - 1 x ( n ) cos ( 2 n + 1 ) π k 2 N , ω ( k ) = 1 / N k 0 2 / N k = 0
Braganza and Leeser [13] proposed an implemention to get a real DCT from the DFT by constructing a sequence v(n) from real input data x(n) as follows:
v ( n ) = x ( n ) n = 0 N - 1 x ( 2 N - n - 1 ) n = N 2 N - 1

Then the output of DFT(v n ) is multiplied by 2 ω ( k ) e - i 2 π k 2 N .

2.4 Inverse DCT

The inverse DCT of type II is type III which is defined as:
IDCT ( x k ) = k = 0 N - 1 ω ( k ) X k cos ( 2 n + 1 ) π k 2 N , ω ( k ) = 2 / N
For the IDCT, we reverse the above steps. First, X (k) is rearranged to form a complex hermitian symmetric sequence V(k):
V ( k ) = 1 2 e j π k 2 N [ x ( k ) - j x ( N - k ) ] , k = 0 , 1 , 2 N - 1

Then construct v(n) by getting the IDFT of V (k), finally rearrange v(n) to get x(n).

2-Dimension modes

For 2D modes, the 1D mode is performed two times: one time in all rows of input frame then another time on the columns of the result Figure 5.
Figure 5


2.5 FIR

The FIR filter Equation (9) is handled using multiply accumulate (MAC) operations and accelerated by using Multiple computing units.
y t = k = 0 N - 1 x ( k ) a t - k

where a's are the filter coefficients.

2.6 Other transformations

Other transformations like any-point DFT can also be handled using basic operations like MAC, accumulator and vector operations.

3 ASIP processor

Embedded architectures are divided to pipelined [14] and memory based architectures (iterative designs) [5, 15]. The pipelined architectures are constructed from long chain from butterflies connected to individual memories. For example to support 4K-DFT by pipelined architecture like Radix-4 Singlepath Delay Feedback (R4SDF) [16] (seen in Figure 6). It needs six pipeline radix-4 butterflies (three complex multipliers) connected to six dual port memories. The memories have a read and write operation in each clock cycle. While The memory based architectures usually consist of one butterfly with only two dual port memories. The memories in the based architectures have also a read and write operation in each clock cycle which is approximately similar to the memory transactions in the pipelined architectures. From this discussion we prefer to use memory-based architecture and we prove our selection in Section 6 by comparing our results versus anther publish pipeline architecture.
Figure 6

R4SDF pipeline architecture.

The first step in the design of a flexible and efficient ASIP is to identify the common set of operations in the class of operations which must be supported. The computationally intensive operations are defined as coefficient-generation, address-generation, and PE. These operations are supported by HW acceleration.

To meet the high throughput demand, data operations are handled through vector instructions. Synchronization in the processing pipeline is handled through handshakes between the system blocks. This greatly reduces the load on decoders, allowing continuous flow in the pipeline and providing dedicated design-like throughput.

The critical path in the PE is relatively short. This simplicity combined with the high throughput of the pipeline allows the user to greatly under clock the circuit, thus allowing significant power scaling with application.

When a valid configuration radix-r stage is received, the HW accelerators are configured to operate on a user selected DFT/IDFT size. The read address generator is responsible for generating data addresses with their memory enables and giving its state to the coefficient generator to maintain synchronization between data and coefficients. The data and coefficients are handed to the PE which is configured to apply radix-r calculations. Upon finishing, the PE enables the write address generator and finally the processed data is saved in the 2nd memory Figure 7.
Figure 7

Pipeline process.

To allow instantaneous reading and writing and to keep the pipeline full, two N-word memories are used one for reading data and another for writing results. The source and destination memories are exchanged each stage. Each memory contains four dual port banks and has four input and output complex data buses to match the configurable memory requirements. The memory bus controller is responsible for applying the input and output data to the corresponding memory banks depending on its bank number and the memory state (read or write). Memory architecture is shown in Figure 8.
Figure 8

Memory architecture.

In the embedded processor architecture seen in Figure 9, input/output signals handle the interface between the decoder and the external environment. Depending on the external environment state, the decoder enables data transmission, importing, exporting or both. The I/O data bus contains four complex word buses, two for importing data and the other for exporting.
Figure 9

DFT/DCT/FIR processor.

The boot-loading memory consists of a non-volatile bank responsible for initializing the processor RAMs with the required micro-code. The engine is controlled by a non-pipelined decoder with 16 registers in the register file and a 26-bit instruction set with 66 instructions.
  1. (1)

    The register file is divided into even and odd sets, the real parts of complex words are saved in the even registers and the imaginary parts in following odd registers. Complex words are called by their real register number while a real word may be saved in any register and called by its index.

  2. (2)

    The instruction set is divided into five classes:

  • Radix instructions like: Radix-2/3/4, Inverse Radix-2/3/4 used for DFT.

  • MAC instructions for FIR: multiply two data vectors and accumulate, multiply data vector by coefficient and accumulate.

  • Vector Multiplications instructions For DCT/IDCT: multiply by coefficient,

  • Vector instructions like: accumulate, power, energy, addition, subtraction, multiplication, multiply by coefficient used to perform general vector arithmetic.

  • Word instructions like: shift, set, load, store, complex or word addition, subtraction, multiplication used mostly for control.

  • Data transmission instructions like: data arrangement, data importing and exporting.

  • Control instructions like: compare, conditional/unconditional branches, disable/enable dealing with imaginary part.

All vector instructions are applicable on complex words and have the ability to define the order in which data is read or written. MAC instructions are used for general implementations of FIR equations. MAC allows multiplication of data by data or data by stored or generated coefficients. MAC and Vector multiplication instructions allow multiplication by coefficients or their inverse for general transformations purpose.

4 Hardware accelerators

4.1 Processing element

The PE is the primary computational unit of the engine see Figure 10. The PE can be set to perform two radix-2 butterflies, one radix-3/4 butterfly For DFT implementations, multiply For DCT/IDCT multiplication stage, multiply accumulate for general FIR implementations in addition to other operations like accumulate, addition and subtraction. It is divided into four units: Constant multiplier unit, Addition unit, Multiplication unit, and finally Rounder unit. To increase utilization we time-share the complex multiplier to perform constant multiplication functions, that is to say constant multiplier CM and multiplier 1 M 1 in Figure 10 use the same multipliers. Data width naturally grows with processing, this is a major question in fixed-point ASIP applications. A rounder unit is placed at the final stage to re-fit data in a constant number of bits (word length). Stage scale factors can be set by the programmer and a proprietary software tool automatically generates the necessary scale factors for a given application. Complex multipliers are configured to multiply input 1 by input 2 or input 2 conjugate. Adder 3 is responsible for accumulate operations, so it is provided by a scalable truncator to prevent overflow. Multiplexers at the input and output data pins are used to swap their real and imaginary parts for inverse operations. The additional multiplexers configure the butterfly and bypass some stages like the multiplication stage.
Figure 10

Processing element.

4.2 Coefficient generator

The coefficient generator generates needed coefficients in two modes.

Mode one: Generates twiddle factors needed for Radix-r and DCT/IDCT Multiplication stage calculations. The first N/4 coefficients are stored in RAM and the remaining coefficients are generated by using the even and odd symmetry properties in the phase and amplitude of twiddle factor (Equations 10 and 11).
e j 2 π n N = e j 2 π n N e j 2 π x × N / 4 N = e j 2 π n N × E ( x ) , n = 0 , , N / 4
E ( x ) = e j 2 π x 4 , x = 0 , 1 , 2 , 3 = 1 , j , - 1 , - j

For e - j 2 π n N we invert the imaginary part's sign. For radix-4 computations we need to generate three twiddle factors at a time, so we use two memories, the first memory is a dual-port RAM and is used to generate e - j 2 π n N and e - j 2 π 3 n N . The second memory is a single-port RAM which is used to generate e - j 2 π 2 n N . For frame lengths with x > 2, the 2nd memory addresses are always even. So we remove all odd entries. This reduction adds a negligible noise in x = 2 case. For frame lengths with x = 1 we replace N by N'(N' = 4 × 3 y ). Consequently we save the first N'/ 4 = 3 y coefficients in RAM. To save power the 2nd memory is enabled only in radix-4 stage.

This method reduces coefficient memory size to 18% of a direct LUT implementation.

Mode two: Read stored coefficients from the first RAM starting from selected address and going in ascending or descending order depending on selected mode. This is more suitable for FIR transformation and direct implementations of general filters.

4.3 Read and write address generators

Generate continuous write and read addresses depending on their modes. The address bus is divided into four partitions: real part enable, imaginary part enable, bank number and bank index see Figure 11.
Figure 11

Address structure.

Each generator is connected to a single port RAM to get off-line generated addresses. Read and write address memories hold two addresses in each entry. To enable reading four sequential addresses in one clock cycle, write address memory is divided into two single port RAMs, one for odd entries and another for even entries.

The address generation modes are defined as:

Mode one: Generate addresses for different radix-r stages. Radix-4, 2 and 3 need to read 4, 4 (tworadix-2 handled in parallel), 3 data samples respectively for their computations.

This can be handled in several ways: Read data from memory 2 samples by 2 samples with 2 clock latency for each radix operation, double memory clock frequency and read 2 samples by 2 samples with 1 clock latency for each radix operation at the expense of double memory power, or use 4-port memories. Each of the above techniques have drawbacks to different degrees like lower throughput, power or both. In [10] we proposed an address scheme to solve the above problem with conflict-free memory access. The scheme is contingent on partitioning the memory to 4 dual-port memory banks as well as the specific way data is distributed between the banks. This guarantees that at any stage we have at most two accesses to the same memory bank.

Initially data is saved and distributed between memory banks to be ready for the first radix stage (radix-4 or 3). As N = 2 x × 3 y , (x ≠ 1), if x is even (integer stages from radix-4) the butterfly performs radix-4 computations till the end then switches to perform radix-3 stages. Else (if x odd) the butterfly performs ( x - 1 2 ) radix-4 stages followed by radix-2 then switches to perform radix-3 stages. Switching to radix-3 stages consumes a one-time additional stage to rearrange data in memory banks. At last radix-r stage, radix output is saved in the same locations of radix inputs.

Samples at any stage are saved in memory depending on the current radix stage (r), current DFT frame length (N), DFT frame number (f), and sample index inside the DFT frame (n) see Figure 12. The bank number results from accessing the bank Look Up Table (LUT) (Table 1) by signal bank t , and the data index in the bank (Equation 12).
Figure 12

Signal flow graph of 32-point DFT.

ban k t = floor n N / r
Bank index = n × mod N / r + f × N r

Mode two: Generate addresses for DCT/IDCT modes to arrange data in v n and V k order.

For DCT inputs are saved in the shuffled order shown in Figure 13. Data is distributed in the memory banks to allow direct starting for the next radix stage.
Figure 13

Re-ordering pattern for v n sequence.

In IDCT computation sequence data is ordered in V k order then multiplied by coefficients in the next stage. In order to save data arrangement time, data is saved in shuffle order shown in Figure 14 then at multiplication stage the multiplier is configured to multiply the coefficients by data conjugate to construct true V k sequence.
Figure 14

Re-ordering pattern for V k sequence.

Each input sample is saved in two locations in memory and the data is imported 2 samples by 2 samples in order to reduce data transmission time. So, data is distributed in memory banks to prevent memory conflict.

Mode three: Generate addresses for other vector instructions like MAC. It generates two addresses for two data vectors or one data vector (two samples at time) in-order sequence or get them form memory. With start addresses and data length as an input parameters.

5 Engine programming

The embedded processor programming passes through three phases: Simulation, testing, and verification. We will discuss 1024-point DFT, 8 × 8 DCT and 64 tap FIR as case studies.

5.1 Simulation

The goal of these simulations is to find best values of our design which are: the scale factor which we divide on each radix-r stage, word length and coefficient factors length.

5.1.1 Scale factor

Due to the nature of DFT operation the output data range is growth with the radix stages. So, the data must be scaled after each radix-r stage to refit in fixed number of bits. If this scale is large the data will be lost, on the other hand if it is small many overflows will occur, so stage scale must be well chosen. We considerate this point and designed an optimum scales generator tool to select the best scale on each stage with two modes:
  1. (1)

    Select the Highest SQNR.

  2. (2)

    Guaranteed output RMS, to keep signal peaks which are needed in same applications.


The tool are designed by Matlab software, it generate all possible scale factors with corresponding signal to quantization noise (SQNR) and the RMS of the output then select the best scale vector Depending on the input mode. Using the first mode for our example reveals scale factors of (4 4 2 2 2) for 1024 point five stages, giving the highest SQNR with a Gaussian input.

5.1.2 Word and coefficient lengths

Then, Fixed-point simulations of a 1024 point DFT in WiMAX see Figure 15 reveal that a 26 bit (13 real and 13 imaginary) complex word length and 20 bit complex twiddle factors are sufficient to keep quantization noise power under system noise by 15 dB at 10-3 Bit Error Rate (BER).
Figure 15

BER curve WIMAX system with 64QAM modulation, fading rate = 1 / 2, number of sub carriers = 240. Using quantized input to quantized DFT module.

5.2 Testing

Code for the application is written using custom mnemonics that combine HW-specific instructions with application-specific instructions. This then passes through a assembly compiler (designed by Matlab software) which generates the boot-loading and program object files.

When processing begins, the decoder accesses address zero in the boot loading ROM and reads initialization instructions. These instructions are mainly used for loading data and instructions from flash memory to the corresponding RAM memory in the system. Upon finishing, the decoder jumps to program memory and starts processing.

Figure 16 shows boot loading ROM initialization instructions. The initialization process may include preloading some or all of: program memory instructions (pm), coefficients memory (ws_mem1, ws_mem2), coefficients memory length (s_reg), read address memory (r_mem) and write addresses ram (w_mem). Boot-loading is necessary if the engine is to switch modes or standards on-the-fly. Otherwise program RAMs can be replaced by ROMs carrying the required instructions.
Figure 16

Boot loading initialization instructions.

5.2.1 1024-point DFT

Figure 17 shows code example for 1024-point DFT. In/Out operations read two words at a time, therefore for N words it takes only N/2 clock cycles. To save on processing overhead special control signals like r2 = N/radix (used by address generator) are inserted directly to reduce computational load (by adding this instruction we save the power and area of a full divider). After each stage these parameters are modified, and loop for the next radix stage. The twiddle factors in the last stage in DFT calculations are ones so we add choice (Scape multi, multi) to disable the twiddle factors generator and bypass multiplication stage. Thus the last radix instruction is separated from the loop. Then apply io instruction to export the processed symbol and import a new one. finally, jump to the first radix stage and so on.
Figure 17

Implementation code example for 1024-point DFT.

8 × 8 DCT

Figure 18 shows code example for 8 × 8 DCT. Data is read, row by row, saving each row in v n order discussed in Section 2. Then radix stages are applied until DFT calculations on all rows are completed. The data is multiplied by the twiddle factors, by getting addresses from read address memory (to arrange data after DFT operation and exchange row by column). Writing the result is in v n order (construct v n for new DCT-1D operation). The imaginary parts of result are set to zero by disabling writing of imaginary results. Then, the radix and multiplication stages are applied once more. Finally, the result is output in order and the new data is simultaneously loaded.
Figure 18

Implementation code example for 8 × 8 DCT.

5.2.2 FIR filter

Figure 19 shows code example for a 64 tap FIR. Data is read in order. Multiply accumulate operation are applied on the data to generate first output y(0). increment output index and apply MAC for next output and so on. Till the last output (N - 1) is generated. Finally, the result is output in order and the new data is simultaneously loaded.
Figure 19

Implementation code example for 64 tap FIR.

5.3 Verification

Verification of these and other examples is through bit-matching the results of random input patterns with fixed-point results from fixed point golden files. The golden files are verified and tested against a floating point model to make sure they perform the needed tasks. The golden files are used to verify the RTL design by generating test cases, both directed and random.

6 Implementation results and performance evaluation

6.1 Implementation

The engine is fully designed by the authors, using Verilog Hardware Description language and tested by applying various programming codes. Synthesis has been carried out using Cadence first encounter using IBM 130 nm CMOS technology. The post layout synthesis results report of the entire design with 26 bit complex word length, 20 bit complex twiddle factors and support for up to 8K-point DFT include system memories has been summarized in Table 2. The table also maintain all synthesis constraints. The engine parameters like the number of bits, memories size and types are parametrized to meet different requirements.
Table 2

Synthesis results (with memories)

Up to 8K point-DFT 1D symbol 26 complex word length


IBM 130 nm CMOS technology (6 layers)


1.08 V


Gates libraries: Typical (55°c)


Fast library(125°c) used for worst case conditions


Memories library: (125°c)

Number of Cells

57,906 cell


0.612 × 0.6 (0.36) mm2


56 mw at 100 MHz

Max frequency

700 MHz

6.2 Performance evaluation

Tables 3, 4, and 5 show a summary of features of our proposed embedded processor.
Table 3

Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host

N-point DFT

Cycles per

Latency @

100 MHz


Scale factor

Scale factor




(μ s)

s 1

s 2

s 3


s 5

s 6

s 7
















































































Table 4

Number of clock cycles for 1D-DCT including data transfer times between the embedded engine and the host

N-point DCT

Cyles per DCT

Latency @ 100 MHz (μ s)













Table 5

Number of clock cycles for 2D-DCT including data transfer times between the embedded engine and the host

N × N-point DCT

Cycles per DCT

Latency @ 100 MHz (μ s)

8 × 8



16 × 16



32 × 32



64 × 64



Table 6 has a list of power consumption values for previously published articles. To eliminate the process factor to make the comparisons as fair as possible, the power consumption of each design has been normalized to 130 nm technology, 1.08 V and engine throughput by Equation (13) [17]. We define the parameter power efficiency which introduces how much power is taken to have certain throughput to make fair comparisons between the engines power in the case of they have same throughput. This shows, at the very least, that the proposed engine has a significant advantage in power consumption.
Table 6

Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host














Time to end

(μ s)










Pipeline HW











Configurable HW











Configurable HW











Configurable HW











Configurable HW











Configurable HW




































































[21]: Present the power consumption of functional unit and data address generator only

Normalized power = Power × 130 Technology × 1 . 08 Volt 2 Power efficiency = Normalized power Throughput = Normalized power 1 / Time to end

6.3 Discussion

Weidong and Wanhammar [14] proposed an pipeline ASIC for pipeline FFT processor. Here we prove our discussion in Section 3, the pipeline architecture have a higher throughput but loss on power efficiency.

The authors of [5, 18, 19] proposed memory based Application-Specific Integrated Circuit (ASIC) for scalable DFT engine. The proposed engine in [5] enables runtime configuration of the DFT length, where the supported lengths vary only from 16-points to 4096. while the proposed engine in [18] enables reconfigurable FFT Processor, the FFT lengths vary only from 128-points to 8192. and [19] can perform 64 2048-point FFT. This engines have high throughput rates. But, they only support certain kinds of algorithms for which they are designed.

In contrast, [2] used digital signal processors owing to their high reconfigurability and adaptive capabilities. Although DSP performance is improving, it is still unsuitable due to its high power consumption and low throughput. Hsu and Lin [2] proposed an approach for DFT implementation on DSP with low-memory reference and high flexibility, however it is optimized for 2 x -point DFT, It needs 40,338 cycles to complete one 1024-point DFT.

The third solution, [20, 21] is the ASIP which compromises between the above solutions. Zhong et al. [20] proposed an DFT/IDFT processor based on multi-processor rings. This engine presents four processor rings (8, 16-Point FFT) and supports DFT lengths from 16-points to 4096. Guan et al. [21] proposed an ASIP scalable architecture of any-point DFT at the expense of a large PE (contains an 8-point butterfly). the authors present only the power consumption of functional unit and data address generator so we did not include it in the table.

From our investigation, Figure 20 shows comparison between implementation techniques throughput.
Figure 20

Number of clock cycles per one 1024-point DFT vs. implementation techniques.

Shah et al. [22] presents a pipelined scalable any-point DFT 1D/2D engine which requires 256 clock cycles for (16 × 16)-DFT 2D, while [23] and this design require 512 cycles. Nevertheless, Sohil Shah's proposal has higher area.

For DCT-1D, we use the mathematical algorithm in [12] which implements ASIC DCT-1D bulting blocks common with DFT. The engine has a throughput of one 512-point DCT per 1,771 cycles, and one 1024-point DFT per 3435 cycles.

For DCT-2D existence designs, the engine in [24] has been tailored to a particular application needing 80 cycle for (8 × 8)-DCT 2D, and programmable DSP [1] supports scalable (N × N)-DCT 2D as N = 4-64. needs 2,538 cycles for (16 × 16)-DCT 2D,

The proposed engines are more power efficient than most of other proposed architectures in the literature. Engine features:
  • More power efficient than most of other proposed architectures in the literature.

  • Could be support many OFDM Systems with relatively low power.

  • High reconfigurability which allows users to program a very wide range of applications with softwarelike ease.

  • Support peripheral operations beside the main processes like CP remover which was need in the proposed WiMAX demo.

  • Simple interfaces (FIFO interface) which handle data transfer between the engine and asynchronous blocks with different clock domains.

  • The engine parameters like the number of bits, memories size and types are parameterized to meet different requirements and higher symbol lengths

The features that helped to get a high throughput which helped to get good power efficiency are:
  • A new address generation scheme allows reading and writing the butterfly data in one clock cycle which allow performing 1 butterfly operation each clock. This reduce processing time by 50% without doubling the clock frequency no loss on power.

  • The selection of radix-4 algorithm which have best power efficiency.

  • Using HW accelerators accelerate the processing and reduce the complicity of the decoder.

  • Using pipeline processing of the vector instructions is also accelerate the processing.

  • Using simultaneously input and output data transformations with four data buses which reduce data transformations time by 75%.

  • Reduce time to market by supporting a compiler tool for the engine with a simple instruction set

  • The use of classified engines allows high degree of optimization.

7 Conclusion

In this article, we propose an ASIP design for low-power configurable embedded processor capable supporting DFT, DCT, FIR among other things. The defining feature of our processor is its reconfigurability supporting multiple transformations for many communication and signal processing standards with simple SW instructions, high SQNR, and relatively high throughput. The engine overall performance allows users to program a very wide range of applications with software-like ease, while delivering performance very close to HW. This puts the engine in an excellent spot in the current wireless communications environment with its profusion of multi-mode and emerging standards. The proposed embedded processor is synthesized in IBM 130 nm CMOS technology. The 8k-point DFT can 56 mW with a 1.08 V supply voltage to end in 13 μ s with SQNR of 95.25 dB. Table 7 shows some applications which can be supported.
Table 7

Applications that can be supported and the corresponding estimated clock frequency

DFT-1D applications

90 MHz


DCT-2D applications

60 MHz

Low bit rate video conferencing, basic video telephony, interactive multimedia and digital TV-NTSC



This study was part of a project supported by a grant from STDF, Egypt (Science and Technology Development Fund).

Authors’ Affiliations

Center for Wireless Studies, Faculty of Engineering, Cairo University


  1. Liu X, Wang Y: Memory Access Reduction Method for efficient implementation of Vector-Radix 2D fast cosine transform pruning on DSP. Proceedings of the IEEE SoutheastCon 2010, 68-72.Google Scholar
  2. Hsu YP, Lin SY: Implementation of Low-Memory Reference FFT on Digital Signal Processor. Journal of Computer Science 2008, 7: 545-549.Google Scholar
  3. Frigo M, Johnson SG: The Design and Implementation of FFTW3. Proceedings of the IEEE 2005, 93: 216-231.View ArticleGoogle Scholar
  4. Jo BG, Sunwoo MH: New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy. IEEE Transactions on Circuits and Systems 2005, 52(5):911-919.MathSciNetView ArticleGoogle Scholar
  5. Jacobson AT, Truong DN, Baas BM: The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor. IEEE International Symposium on Circuits and Systems ISCAS 2009, 1133-1136.Google Scholar
  6. Hangpei T, Deyuan G, Yian Z: Gaining Flexibility and Performance of Computing Using Application-Specific Instructions and Reconfigurable Architecture. International Journal of Hybrid Information Technology 2009, 2: 324-329.Google Scholar
  7. Poon ASY: An Energy-Efficient Reconfigurable Baseband Processor for Wireless Communications. (IEEE) Trans VLSI 2007, 15(3):319-327.View ArticleGoogle Scholar
  8. Iacono DL, Zory J, Messina E, Piazzese N, Saia G, Bettinelli A: ASIP Architecture for Multi-Standard Wireless Terminals. Design, Automation and Test in Europe (DATE '06) 2006, 2: 1-6.View ArticleGoogle Scholar
  9. Hassan HM, Shalash AF, Hamed HM: Design architecture of generic DFT/DCT 1D and 2D engine controlled by SW instructions. Asia Pacific Conference on Circuits and Systems APCCAS 2010 2010, 84-87.View ArticleGoogle Scholar
  10. Hassan HM, Shalash AF, Mohamed K: FPGA Implementation of an ASIP for high throughput DFT/DCT 1D/2D engine. IEEE International Symposium on Circuits and Systems (ISCAS) 2011 2011, 1255-1258.View ArticleGoogle Scholar
  11. Cooley JW, Tukey JW: An Algorithm for Machine Computation of Complex Fourier Series. Mathematics of Computation 1965, 19: 297-301.MATHMathSciNetView ArticleGoogle Scholar
  12. Nguyen T, Koilpillai RD: The theory and Design of Aribitrary-length cosine-modulated filter Banks and wavelets, satisfying perfect reconstruction. IEEE Transaction on signal processing 1996, 44(3):473-483.View ArticleGoogle Scholar
  13. Braganza S, Leeser M: The 1D Discrete Cosine Transform for Large Point Sizes Implemented on Reconfigurable Hardware. IEEE International Conference on Application-specific Systems, Architectures and Processors ASAP 2007, 101-106.Google Scholar
  14. Weidong Li, Wanhammar L: A PIPELINE FFT PROCESSOR. IEEE Workshop on Signal Processing Systems, 1999. SiPS 99 1999, 19: 654-662.Google Scholar
  15. Chidambaram R, Leuken RV, Quax M, Held I, Huisken J: A multistandard FFT processor for wireless system-on-chip implementations. Proc International Symposium on Circuits and Systems 2006, 47.Google Scholar
  16. He S, Torkelson M: Design and Implementation of a 1024-point Pipeline FFT Processor. Proceedings of the IEEE 1998 Custom Integrated Circuits Conference 1998, 131-134.Google Scholar
  17. Lin JM, Yu HY, Wu YJ, Ma HP: A Power Efficient Baseband Engine for Multiuser Mobile MIMOOFDMA Communications. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI 2010, 57: 1779-1792.MathSciNetView ArticleGoogle Scholar
  18. Sung TY, Hsin HC, Ko LT: Reconfigurable VLSI Architecture for FFT Processor. WSEAS TRANSACTIONS on CIRCUITS and SYSTEMS 2009., 8:Google Scholar
  19. Lee YH, Yu TH, Huang KK, Wu AY: Rapid IP Design of Variable-length Cached-FFT Processor for OFDM-based Communication Systems. IEEE Workshop on Signal Processing Systems Design and Implementation, 2006. SIPS '06 2006, 62-65.View ArticleGoogle Scholar
  20. Zhong G, Xu F, Willson AN Jr: A power-scalable reconfigurable FFT/IFFT IC based on a multi-processorring. IEEE Journal of Solid-State Circuits (JSSC) 2006, 41: 483-495.View ArticleGoogle Scholar
  21. Guan X, Lin H, Fei Y: Design of an Application-specific Instruction Set Processor for High-throughput and Scalable FFT. IEEE International Symposium on Circuits and Systems ISCAS 2009, 2513-2516.Google Scholar
  22. Shah S, Venkatesan P, Sundar D, Kannan M: Low Latency, High Throughput, and Less Complex VLSI Architecture for 2D-DFT. International Conference on Signal Processing, Communications and Networking ICSCN 2008, 349-353.Google Scholar
  23. Shah S, Venkatesan P, Sundar D, Kannan M: A Fingerprint Recognition Algorithm Using Phase-BasedImage Matching for Low-Quality Fingerprints. IEEE International Conference on the Image Processing 2005, 33-36.Google Scholar
  24. Tumeo A, Monchiero M, Palermo G, Ferrandi F, Sciuto D: A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs. IEEE Computer Society Annual Symposium on VLSI 2007, 331-336.View ArticleGoogle Scholar
  25. Cho YJ, Yu CL, Yu TH, Zhan CZ, Wu AYA: Efficient Fast Fourier Transform Processor Design for DVB-H System. proc VLSI/CAD symposium 2007.Google Scholar
  26. sung TY: Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipeline CORDIC rotations. IEEE proceedings, Image Signal Process 2006, 153: 405-410.View ArticleGoogle Scholar


© Hassan et al; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.