Open Access

Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes

  • Massimo Rovini,
  • Giuseppe Gentile,
  • Francesco Rossi and
  • Luca Fanucci
EURASIP Journal on Embedded Systems 2009, 2009:723465

DOI: 10.1155/2009/723465

Received: 4 March 2009

Accepted: 27 July 2009

Published: 15 September 2009

Abstract

The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement (about 2x) in the convergence speed of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also reinforce the effect of the loops in the code, thus spoiling the error-correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

1. Introduction

Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.

In this context, low-density parity-check (LDPC) codes have gained momentum in the scientific community and have recently been adopted as forward error correction (FEC) codes by several communication standards, such as second-generation digital video broadcasting (DVB-S2, [1]), wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), wireless local area networks (WLANs, IEEE 802.11n, [3]), and 10 Gbit Ethernet (10GBASE-T, IEEE 802.3an).

LDPC codes were first discovered by Gallager in the early 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very large-scale integration (VLSI) technology, rediscovered them in the mid-1990s [5]. The renewed interest in and success of LDPC codes are due to (i) their remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, which is very suitable for hardware parallelization, and last but not least (iv) the advent of structured or architecture-aware (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture, and so allow the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.

Nowadays, data services and user applications impose severe low-complexity and low-power constraints on the design of practical decoders while demanding very high throughput. A fully parallel decoder architecture achieves impressive throughput but is so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).

So, to counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, by pipelining the decoding process to shorten the iteration time. The first matter can be solved with the turbo-decoding message-passing (TDMP) [6] or the layered decoding algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.

However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder, because the processing units do not always work on the most recently updated messages. This issue, known as pipeline "hazard", arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay the decoding process with "idle" cycles until newer data are available.

As an improvement to similar state-of-the-art works [10–13], this paper proposes three systematic techniques to optimally reschedule the decoding process so as to minimize the number of idle cycles and achieve the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.

Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of the pipeline hazards; they only describe the need to properly scramble the sequence of data in order to clear some memory conflicts.

Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards (cut-edge conflict) and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.

Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates, but again, they do not investigate the dependence between elaborations.

In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipeline.

This paper is organised as follows. Section 2 recalls the basics of LDPC and of AA-LDPC codes and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial LDPC-layered decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performances of the related decoder are analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with the comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.

2. Architecture-Aware Block-LDPC Codes

LDPC codes are linear block-codes described by a parity-check matrix establishing a certain number of (even) parity constraints on the bits of a codeword. Figure 1 shows the parity-check matrix of a very simple LDPC code with length bits and with parity constraints. LDPC codes are also effectively described in a graphical way through a Tanner graph [15], where each bit in the codeword is represented with a circle, known as variable-node (VN), and each parity-check constraint with a square, known as check-node (CN).
Figure 1

Tanner graph of a simple base-matrix and principle of vectorization.

Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized as (i) the arrangement of the parity-check matrix in square subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as block-LDPC codes [8].

The pattern used within blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder and can be based either on permutations, as in [6] and for the class of π-rotation codes [17], or on circulants or cyclic shifts of the identity matrix, as in [8] and in every recent standard [1–3].

AA-LDPC codes are defined by the number of block-columns $n_c$, the number of block-rows $n_r$, and the block-size $B$, which is the size of the component submatrices. Their parity-check matrix $\mathbf{H}$ can be conveniently viewed as the expansion of a base-matrix $\mathbf{H}_{\text{base}}$ with size $n_r \times n_c$. The expansion is accomplished by replacing the 1's in $\mathbf{H}_{\text{base}}$ with permutations or circulants, and the 0's with null subblocks. Thus, the block-size $B$ is also referred to as expansion-factor, for a codeword length of the resulting LDPC code equal to $N = n_c \cdot B$ and code rate $R = 1 - n_r/n_c$.

A simple example of expansion or vectorization of a base-matrix is shown in Figure 1. The size, number, and location of the nonnull blocks in the code are the key parameters to get good error-correction performance and low-complexity of the related decoder.
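The expansion of a base-matrix into a full parity-check matrix can be sketched in a few lines of Python. This is only an illustration: the function names `circulant` and `expand`, the use of -1 to mark a null block, and the tiny 2x3 base-matrix are assumptions made here and are not taken from Figure 1.

```python
def circulant(size, shift):
    """Identity matrix of the given size, cyclically shifted by `shift` columns."""
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


def expand(base, size):
    """Expand a base-matrix into a full parity-check matrix.

    Assumed encoding: -1 marks a null block; an entry s >= 0 is replaced
    by the identity matrix cyclically shifted by s.
    """
    full = []
    for base_row in base:
        blocks = [circulant(size, s) if s >= 0
                  else [[0] * size for _ in range(size)]
                  for s in base_row]
        for r in range(size):
            # Concatenate row r of every block in this block-row.
            full.append([bit for block in blocks for bit in block[r]])
    return full


# Hypothetical base-matrix with 2 block-rows and 3 block-columns.
base = [[0, 1, -1],
        [-1, 2, 0]]
H = expand(base, 3)
```

With a block-size of 3, the resulting matrix has 6 rows and 9 columns, and every nonnull block contributes exactly one 1 per row and per column.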

3. Decoding of LDPC Codes

LDPC codes are decoded with the belief propagation (BP) or message-passing (MP) algorithm, which belongs to the broader class of maximum a posteriori (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as the schedule, considerably affects the achievable performance.

The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all parity-check nodes are updated first, followed by all variable nodes.

A different approach, taking the distribution of closed paths and girths in the code into account, has been described by Xiao and Banihashemi in [19]. Although probabilistic schedules are shown to outperform deterministic schedules, the random activation strategy of the processing nodes is not well suited to hardware implementation and adds significant complexity overhead.

The most attractive schedule is the shuffled or layered decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and for cycle-free codes [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or layers [9], exchanging intermediate reliability messages. Specifically, a posteriori messages are made available to the next layers immediately after computation and not at the next iteration as in a conventional flooding schedule.

Layers can be any set of either CNs or VNs, and, accordingly, CN-centric (or horizontal) or VN-centric (or vertical) algorithms have been analyzed in [18, 20]. However, CN-centric solutions are preferable since they can exploit serial, flexible, and low-complexity CN processors.

The horizontal layered decoding (HLD) is summarized in Algorithm 1 and consists in the exchange of probabilistic reliability messages along the edges of the Tanner graph (see Figure 1) in the form of logarithms of likelihood ratios (LLRs); given the binary random variable $x$, its LLR is defined as
$$\mathrm{LLR}(x) = \log \frac{\Pr(x = 0)}{\Pr(x = 1)}. \tag{1}$$

Algorithm 1: Horizontal layered decoding.

input: a-priori LLR ,

output: a-posteriori hard-decisions

(1) // Messages initialization

(2) , , , ,

   ;

(3) while ( & !Convergence) do

(4) // Loop on all layers

(5) for     to     do

(6) // Check-node update

(7) forall     do

(8) // Sign update

(9) ;

(10) // Magnitude update

(11) ;

(12) // Soft-output update

(13)

(14) end

(15) end

(16) ;

(17) end

In Algorithm 1, is the th a priori LLR of the received bits, with and the length of the codeword, is the overall number of parity-check constraints, and the number of decoding iterations. Also, is the set of VNs connected to the th CN, represents the check-to-variable (c2v) reliability message sent from CN to VN at iteration , and is the total information or soft-output (SO) of the th bit in the codeword (see Figure 1).

For the sake of an easier notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, SOs are refined with the involved c2v message, as shown in line 13, and thanks to this mechanism, faster convergence is achieved.

Magnitudes are updated with the binary operator [21] defined as for . Following an approach similar to Jones et al. [22], the updating rule of magnitudes is further simplified with the method described in [23], which proved to yield very good performance. Here, only two values are computed and propagated for the magnitude of c2v messages; specifically, if we define
(2)
the index of the smallest variable-to-check (v2c) message entering CN , then a dedicated c2v message is computed in response to VN :
(3)
while all the remaining VNs receive one common, nonmarginalized value for magnitude given by
(4)
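The structure of the horizontal layered schedule can be illustrated with a short Python sketch. Note the assumptions: the magnitude update below uses the plain min-sum rule (smallest and second-smallest input magnitudes) as a stand-in, not the operator of [21] nor the approximation of [23]; a layer is a single parity check; a positive LLR is taken to favour bit 0; and all names (`layered_min_sum`, `H_rows`, `so`, `c2v`) are illustrative.

```python
def layered_min_sum(H_rows, llr, max_iter=20):
    """Horizontal layered decoding sketch with a min-sum magnitude update.

    H_rows: one list of variable-node indices per parity check (degree >= 2).
    llr:    a priori LLRs, positive values favouring bit 0 (assumed convention).
    """
    so = list(llr)                               # soft outputs
    c2v = [[0.0] * len(row) for row in H_rows]   # check-to-variable messages
    hard = [0 if y >= 0 else 1 for y in so]
    for _ in range(max_iter):
        for m, row in enumerate(H_rows):         # loop on all layers
            # Variable-to-check messages: soft output minus the old c2v.
            v2c = [so[n] - c2v[m][i] for i, n in enumerate(row)]
            mags = [abs(v) for v in v2c]
            i_min = min(range(len(row)), key=mags.__getitem__)
            min1 = mags[i_min]
            min2 = min(mags[i] for i in range(len(row)) if i != i_min)
            parity = 1
            for v in v2c:
                if v < 0:
                    parity = -parity
            for i, n in enumerate(row):
                # Product of the signs of all the *other* inputs.
                sign = parity * (-1 if v2c[i] < 0 else 1)
                # Only two magnitudes are ever propagated.
                mag = min2 if i == i_min else min1
                c2v[m][i] = sign * mag
                so[n] = v2c[i] + c2v[m][i]       # soft-output update
        hard = [0 if y >= 0 else 1 for y in so]
        if all(sum(hard[n] for n in row) % 2 == 0 for row in H_rows):
            break                                # all parity checks satisfied
    return hard, so


# Toy code with three checks; bit 2 is received with the wrong sign.
hard, so = layered_min_sum([[0, 1, 2], [2, 3, 4], [4, 5, 0]],
                           [2.0, 2.0, -1.0, 2.0, 2.0, 2.0])
```

The refreshed soft output of each bit is used by the very next layer that touches it, which is the property the layered schedule relies on and that a pipelined implementation must preserve.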

4. Decoding Pipelining and Idling

The data-flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote with the number of nonnull blocks in layer , that is, the degree of layer , then the processor takes clock cycles to serially load its inputs. Then, refined values are written back in memory (after scrambling or permutation) with the latency of clock cycles, and this operation takes again clock cycles. Overall, the processing time of layer is then clock cycles, as shown in Figure 3(a).
Figure 2

Outline of the flow of soft-outputs in an LDPC-layered decoder with serial processing units.

Figure 3

Pipelined and not pipelined data-flow. (a) Not pipelined. (b) Pipelined.

If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every clock cycles (see Figure 3(b)).
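The benefit of pipelining can be quantified with a rough cycle count. The sketch below assumes the simplified timing described above: a layer of degree d takes d cycles to load, a fixed data-path latency, and d cycles to write back, while in pipeline a new layer starts every d cycles plus any idle cycles. The names `iteration_cycles`, `degrees`, `latency`, and `idle` are illustrative, and the pipeline fill and flush time is ignored.

```python
def iteration_cycles(degrees, latency, pipelined, idle=None):
    """Approximate clock cycles for one decoding iteration.

    degrees:   number of nonnull blocks of each layer, in decoding order.
    latency:   data-path latency in clock cycles (assumed constant).
    pipelined: whether reading, elaboration, and writing overlap.
    idle:      optional idle cycles inserted after each layer (pipelined only).
    """
    if not pipelined:
        # Load, wait for the latency, then write back, layer after layer.
        return sum(2 * d + latency for d in degrees)
    idle = idle or [0] * len(degrees)
    # Steady state: a new layer is started every d cycles, plus idle cycles.
    return sum(d + i for d, i in zip(degrees, idle))


plain = iteration_cycles([7, 7, 8], latency=5, pipelined=False)
piped = iteration_cycles([7, 7, 8], latency=5, pipelined=True)
stalled = iteration_cycles([7, 7, 8], latency=5, pipelined=True, idle=[1, 0, 2])
```

For these hypothetical degrees the pipelined iteration is well under half as long as the non-pipelined one, and each idle cycle adds directly to the iteration time.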

Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration may not always be up to date, as newer values may still be in the pipeline. This issue, known as pipeline hazard, prevents the use and so the propagation of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.

The solution investigated in this paper is to insert null or idle cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.

Three different strategies are described in this section to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free, LDPC codes.

4.1. System Notation

Without loss of generality, let us identify a layer with one single parity-check node and, focusing on the set of soft-outputs participating in layer , let us define the following subsets:

(i) , the set of SOs in common with layer ;

(ii) , the set of SOs in common with layer and not in ;

(iii) , the set of SOs in common with both layers and ;

(iv) , the set of SOs in common with layer and not in or ;

(v) , the set of SOs in common with layer but not in , , ;

(vi) , the set of SOs in common with both layers and , but not in or ;

(vii) , the set of remaining SOs.

In the definitions above the notation means the relative complement of in or the set-theoretic difference of and . Let us also define the following cardinalities: (degree of layer ), , , , , , , .

4.2. Equal Output Processing

First, let us consider a very straightforward and implementation friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in.

In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in , and (ii) output the messages in as soon as possible to let the next layer start earlier. Actually, the last constraint only holds when does not include any message common to layer , that is, when ; otherwise, the set could be acquired at any time before .

Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here, is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer , the set cannot be assigned to any specific position within , since the whole is acquired according to the same order used by layer to output (and so also acquire) the sets and . For this reason, the situation plotted in Figure 4 is only for the sake of a clearer drawing.
Figure 4

Input and output data streams in an NP with EOP.

With reference to Figure 4, pipeline hazards are cleared if idle cycles are spent between layer and so that
(5)
with for and otherwise. This means that if is empty, then the messages in do not need to be waited for. The solution to (5) with minimum latency is
(6)

Note that (5) and (6) only hold under the hypothesis of leading within . If this is not the case, up to extra idle cycles could be added if is output last within .

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer and . Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers and and between layers and .

4.3. Reversed Output Processing

Depending on the particular structure of the parity-check matrix , it may occur that most of the messages of layer in common with layer are also shared with layer , that is, and . If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in can be both acquired last and output first.

Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set is acquired second-last, just before , so that it is available earlier for layer .
Figure 5

Organization of the input and output data stream in an NP with ROP. (a) Pipeline hazards in the update of two consecutive layers. (b) Pipeline hazards in the update of three consecutive layers. Messages of and not in are shown in dark grey.

Following a reasoning similar to EOP, the situation sketched in Figure 5(a) where is delivered first within is just for an easier representation, and the condition for hazard-free layered decoding is now
(7)
Indeed, when , one could output first in , and so get rid of the term . However, since is actually left floating within , (7) represents again a best-case scenario, and up to extra idle cycles could be required. From (7), the minimum latency solution is
(8)

Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets , , and are managed similarly to , and . The ROP strategy is then instructed to acquire the set later and to output earlier. However, the situation is complicated by the fact that the set may not entirely coincide with ; rather it is , since some of the messages in can be found in . This is highlighted in Figure 5(b), where those messages of and not delivered to are shown in dark grey.

To clear the hazards between three layers, additional idle cycles are added in the number of
(9)
where is the acquisition margin on layer , and is the writing-out margin on layer . These can be computed under the assumption of no hazard between layer and (i.e., is aligned with thanks to as shown in Figure 5(b)) and are given by
(10)

The margin is actually nonnull only if ; otherwise, under the hypothesis that (i) the set is output first within , and (ii) within , the messages not in are output last.

Overall, the number of idle cycles of ROP is given by
(11)

4.4. Unconstrained Output Processing

Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer can still delay the acquisition of the messages updated by layer (i.e., messages in ) as usual, but at the same time the messages common to layer (i.e., in ) can also be delivered earlier.

The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when
(12)
which yields
(13)
Figure 6

Input and output data streams in an NP with UOP.

Regarding the interaction between three consecutive layers, if the messages common to layer (i.e., in ) are output just after , and if on layer , the set is taken just before , then there is no risk of pipeline hazard between layer and .
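The effect of the output order on the number of idle cycles can be explored with a small schedule simulator. This is a sketch under stated assumptions, not the closed-form expressions of this section: a layer reads one soft-output per cycle, its j-th output becomes readable `d + latency + j` cycles after the layer starts loading, and a soft-output may be read in the same cycle in which it becomes available. Only a single pass over the layers is simulated, and the constraint on decreasing degrees is ignored. All names are illustrative.

```python
def min_idle_cycles(layers, latency, write_order):
    """Idle cycles needed before each layer to avoid stale reads.

    layers:      read order of each layer, as lists of soft-output ids.
    latency:     data-path latency in clock cycles (assumed constant).
    write_order: function mapping a layer's read order to its output order.
    """
    available = {}   # soft-output id -> first cycle its new value can be read
    start = 0        # cycle at which the current layer would start loading
    idle = []
    for reads in layers:
        # Smallest delay such that no input is read before it is available.
        wait = max([available.get(n, 0) - (start + i)
                    for i, n in enumerate(reads)] + [0])
        idle.append(wait)
        start += wait
        d = len(reads)
        for j, n in enumerate(write_order(reads)):
            available[n] = start + d + latency + j
        start += d   # the next layer may start loading right after
    return idle


def eop(reads):
    """Equal output processing: same order as the inputs."""
    return list(reads)


def rop(reads):
    """Reversed output processing: inputs are delivered in reverse order."""
    return list(reversed(reads))


def uop_shared_first(reads):
    """Unconstrained order: a hand-picked order for the first toy layer."""
    return [2, 3, 0, 1] if list(reads) == [0, 1, 2, 3] else list(reads)


# Two toy layers sharing soft-outputs 2 and 3, acquired last by the second layer.
layers = [[0, 1, 2, 3], [4, 5, 2, 3]]
idle_eop = min_idle_cycles(layers, 2, eop)
idle_rop = min_idle_cycles(layers, 2, rop)
idle_uop = min_idle_cycles(layers, 2, uop_shared_first)
```

In this toy case the idle cycles before the second layer drop from 2 with EOP to 1 with ROP and to 0 with an unconstrained order, which mirrors the qualitative ranking discussed above; the actual figures depend on the code and on the latency.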

4.5. Decoding of Irregular Codes

A serial processor cannot process consecutive layers with decreasing degrees, , as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is nothing but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with idle cycles.

Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes
(14)

with being computed according to (6), (11), or (13).
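The constraint on decreasing degrees can be made explicit under a simplified timing model. The derivation below is a sketch with illustrative symbols: layer $k$ has degree $d_k$, starts loading at cycle $s_k$, and begins writing out $d_k + L$ cycles later, with $L$ the data-path latency; $I_k$ denotes the idle cycles required by the reordering strategy, and $I$ the idle cycles actually inserted before layer $k+1$.

```latex
\begin{align*}
\text{output window of layer } k &: \; [\, s_k + d_k + L,\; s_k + 2 d_k + L \,) \\
s_{k+1} &= s_k + d_k + I \\
\text{no overlap} &\iff s_{k+1} + d_{k+1} + L \;\ge\; s_k + 2 d_k + L \\
&\iff I \;\ge\; d_k - d_{k+1} \\
I'_k &= \max\{\, I_k,\; d_k - d_{k+1} \,\}
\end{align*}
```

When $d_{k+1} \ge d_k$ the second term is nonpositive, so under this model only the reordering strategy determines the idle time; taking the maximum rather than the sum reflects the remark that the same idle cycles serve both purposes.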

4.6. Optimal Sequence of Layers

For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for the decoding. For a code with layers, the optimal sequence of layers minimizing the time spent in idle is given by
(15)

where is the number of idle cycles between layer and for the generic permutation and is given by (14), and is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix , whose layers are scrambled according to the optimal permutation . Then, within each layer of , the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.
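A brute-force search of this kind can be sketched as follows. The sketch assumes that the idle time depends only on the pair of consecutive layers and is supplied by a hypothetical cost function `idle_between(a, b)`; interactions between three layers are ignored. Since the sequence of layers repeats at every iteration, the last layer is followed by the first one, and one layer can be pinned to the first position without loss of optimality.

```python
from itertools import permutations


def best_layer_sequence(num_layers, idle_between):
    """Exhaustive search for the layer order with the fewest idle cycles.

    idle_between(a, b): idle cycles needed when layer b follows layer a
                        (hypothetical cost function supplied by the caller).
    """
    best_order, best_cost = None, None
    # The schedule is cyclic, so layer 0 is pinned to the first position.
    for tail in permutations(range(1, num_layers)):
        order = (0,) + tail
        cost = sum(idle_between(order[i], order[(i + 1) % num_layers])
                   for i in range(num_layers))
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


# Toy cost table for three layers: entry [a][b] is the idle time of b after a.
table = [[0, 3, 1],
         [1, 0, 3],
         [3, 1, 0]]
order, cost = best_layer_sequence(3, lambda a, b: table[a][b])
```

Pinning one layer reduces the search from n! to (n-1)! candidates, which stays tractable for the small number of layers found in block-structured codes but grows quickly beyond that.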

4.7. Summary and Results

The three methods proposed in this section are not equally effective at minimizing the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case-example, results will be shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the costs of its hardware implementation, which is the objective of Section 5. In particular, UOP generally requires higher hardware complexity, and EOP or ROP can be preferred for particular codes.

5. Decoder Architectures

Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and to this end, semi-parallel architectures are widely recognised as the best design choice.

As shown in [6, 8, 12] to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor of the base-matrix . Therefore, the HLD algorithm described in Section 3 must be intended in a vectorized form as well, and in order to exploit the code structure, a layer counts consecutive parity-check nodes. Layers (in the number of ) are updated in sequence by the check-node units (CNUs), and an array of SOs ( ) and of c2v messages ( ) are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.

The CNUs are designed to serially update the c2v magnitudes according to (3) and (4), and any arbitrary order of the c2v messages (and so of SOs, see line 13 of Algorithm 1) can be easily achieved by properly multiplexing between the two values as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.

Three VLSI architectures of a layered decoder will be described, which differ in the management of the memory units of both SO and c2v, and so result in different implementation costs in terms of memory (RAM and ROM) and logic.

5.1. Local Variable-to-Check Buffer

The most straightforward architecture of a vectorized layered decoder is shown in Figure 7. Here, the arrays of v2c messages entering the CNUs during the update of layer , are computed on-the-fly as with , and both the arrays of c2v and SO messages are retrieved from memory.
Figure 7

Layered decoder architecture with variable-to-check buffer.

Then, the updated c2v messages are used to refine every array of SOs belonging to layer : according to line 13 of Algorithm 1, this is done by adding the new c2v array to the input v2c array . Since the CNUs work in pipeline, while the update of layer is still in progress, the array of the v2c messages belonging to layer is already being computed as , with . For this reason, needs to be temporarily stored in a local buffer as shown in Figure 7. The buffer is vectorized as well and stores messages, with the maximum CN degree in the code.

Before being stored back in memory, the array is circularly shifted and made ready for its next use, by applying compound or incremental rotations [12]; this operation is carried out by the circular shifting network of Figure 7, and more details about its architecture are available in [24].

The v2c buffer is the key element that allows the architecture to work in pipeline. It has to sustain one reading and one writing access concurrently and can be efficiently implemented with shift-register based architectures for EOP (first-in, first-out, FIFO buffer) and ROP (last-in, first-out, LIFO buffer). On the contrary, UOP needs to map the buffer onto a dual-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).

5.2. Double Memory Access

The buffer of the architecture in Section 5.1 can be removed if the v2c messages are computed twice on-the-fly, as shown in Figure 8: the first time to feed the array of CNUs, and then to update the SOs. To this aim, a further reading is required to get the arrays and from memory, and so recompute the array on the CNUs output.
Figure 8

Layered decoder with three-port SO and c2v memories.

It follows that three-port memories are needed for both SO and c2v messages since three concurrent accesses have to be supported: two readings (see ports and in Figure 8) and one writing. This memory can be implemented by distributing data on several banks of customary dual-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder the same memory location needs to be accessed several times per iteration and concurrently with several other data, so that resorting to only two memory banks would be infeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.

The proposed solution is sketched in Figure 9 and is based on only two banks (A and B) but, to clear access conflicts, some data are redundantly stored in both the banks (see elements C1 and C2 in the example of Figure 9).
Figure 9

Three-port memory: data partitioning and architecture.

The most trivial and expensive solution is achieved when both banks are a full copy or a mirror of the original memory as in [11], which corresponds to 100% redundancy. Alternatively, data can be selectively assigned to the two banks through a computer search aiming at minimum redundancy.

Roughly speaking, if we denote by the cardinality of the set of data (SO or c2v messages) read concurrently to the th data for , then the higher is (for a given ), the higher is the expected redundancy. So, a small redundancy is experienced by the c2v memory, since each c2v message can collide with at most two other data (i.e., ), while a higher redundancy is associated to the SO memory, since every SO can face up to conflicts, with being the degree of the th variable node, typically greater than (especially for low-rate codes).

Indeed, the issue of memory partitioning and the reordering techniques described in Section 4 are linked to each other: whenever the CNUs are in idle, only one reading is performed. Therefore, an overall system optimization aiming at minimizing the iteration latency and the amount of memory redundancy at the same time could be pursued; however, due to the huge optimization space, this task is almost infeasible and is not considered in this work.
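The partitioning of data onto two banks with selective redundancy can be illustrated with a simple greedy heuristic. This is not the minimum-redundancy computer search mentioned above, only a sketch of the underlying constraint: any two data read in the same clock cycle must be retrievable from different banks, and a datum stored in both banks never causes a conflict. The names `assign_banks` and `conflicts` are illustrative.

```python
def assign_banks(num_items, conflicts):
    """Greedy two-bank assignment with duplication on conflict.

    conflicts: iterable of (i, j) pairs of data read in the same clock cycle.
    Returns, for each datum, the set of banks ("A", "B") holding a copy.
    """
    neighbours = [set() for _ in range(num_items)]
    for i, j in conflicts:
        if i != j:
            neighbours[i].add(j)
            neighbours[j].add(i)
    banks = [None] * num_items
    for item in range(num_items):
        # Banks already forced by neighbours that live in a single bank.
        taken = {next(iter(banks[n])) for n in neighbours[item]
                 if banks[n] is not None and len(banks[n]) == 1}
        if taken == {"A", "B"}:
            banks[item] = {"A", "B"}   # stored redundantly in both banks
        elif taken == {"A"}:
            banks[item] = {"B"}
        else:
            banks[item] = {"A"}
    return banks


# Three data that are pairwise read together: one of them must be duplicated.
triangle = assign_banks(3, [(0, 1), (1, 2), (0, 2)])
# A chain of conflicts needs no redundancy at all.
chain = assign_banks(3, [(0, 1), (1, 2)])
```

Because each datum is placed after looking at its already-placed neighbours, every conflicting pair ends up either in different banks or with at least one duplicated member; the amount of duplication, however, depends on the visiting order and is generally not minimal.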

5.3. Storage of Variable-to-Check Messages

During the elaboration of a generic layer, a certain v2c message is needed twice; to provide it, a local buffer was used in the architecture of Section 5.1 and multiple memory reading operations in the architecture of Section 5.2.

A third way of solving the problem is computing the array of v2c messages only once per iteration, as in the architecture of Section 5.1, but instead of using a local buffer, the v2c messages are precomputed and stored in the SO memory ready for the next use, as sketched in Figure 10. A similar architecture is used in [10, 16], but the issue of decoding pipeline is not clearly stated there.
Figure 10

Layered decoder with v2c three-port memory.

Figure 11

Parity-check base-matrix of the block-LDPC code for IEEE 802.11n with codeword size and rate . Black squares correspond to cyclic shifts s of the identity matrix ( ), also indicated in the square, while empty squares correspond to all-zero submatrices.

In this way, the SO memory turns into a v2c memory with the following meaning: the array updated by layer is stored in memory after marginalization with the c2v message , with being the index of the next layer reusing the same array of SOs, . In other words, the array of v2c messages involved in the next update of the same block-column is precomputed. Therefore, the data stored in the v2c memory are used twice, first to feed the array of CNUs, and then for the SOs update.

Similarly to the architecture of Section 5.2, a three-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the v2c memory onto two banks with some redundancy can be found. Note that, as opposed to the architecture of Section 5.2, a customary dual-port memory is enough for c2v messages.

As far as the complexity is concerned, at first glance this solution seems to be preferable to the architecture of Section 5.2 since it needs only two stages of parallel adders while the c2v memory is not split. However, the management of the reading ports of the v2c memory introduces significant overheads, since after the update of the soft outputs by layer , the memory controller must know which layer is the next one to use the same soft outputs . This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multilength, multirate decoder.

6. A Case Study: The IEEE 802.11n LDPC Codes

6.1. LDPC Code Construction

The WLAN standard [3] defines AA-LDPC codes based on circulants of the identity matrix. Three different codeword lengths are supported, N = 648, 1296, and 1944, each coming with four code rates, 1/2, 2/3, 3/4, and 5/6, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, B = 27, 54, and 81, respectively; accordingly, every code counts N/B = 24 block-columns, while the block-rows (layers) number 12, 8, 6, and 4 for code rates 1/2, 2/3, 3/4, and 5/6, respectively.
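As a quick check of the code parameters, the sketch below derives the number of block-columns and layers from the codeword lengths, block sizes, and rates listed for the WLAN codes in Table 3. The helper names are illustrative, and `circulant` simply builds a cyclically shifted identity matrix; the direction of the shift is an assumption, since the convention is not spelled out in this section.

```python
from fractions import Fraction

LENGTHS = (648, 1296, 1944)          # codeword lengths N
BLOCK_SIZES = (27, 54, 81)           # block sizes B, one per length
RATES = (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(5, 6))


def code_shape(n, b, rate):
    """Return (block_columns, layers) of the base matrix."""
    block_columns = n // b
    layers = int(block_columns * (1 - rate))
    return block_columns, layers


def circulant(b, shift):
    """b x b identity matrix with its columns cyclically shifted by `shift`
    (assumed convention: row r has its one in column (r + shift) mod b)."""
    return [[1 if (row + shift) % b == col else 0 for col in range(b)]
            for row in range(b)]


for n, b in zip(LENGTHS, BLOCK_SIZES):
    for r in RATES:
        print(n, r, code_shape(n, b, r))
```

Using exact fractions avoids rounding surprises when the number of layers is derived from the rate.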

An example of the base-matrix for the code with length and rate is shown in Figure 11.

6.2. Multiframe Decoder Architecture

In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to the largest block-size, 81. This means that two thirds of the processors would remain unused with the shortest codes.

In the latter case, the throughput can be increased thanks to a multiframe approach, where frames of the code with block-size are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the 2 nonoverlapped phases of the flooding algorithm. Here, frames are decoded concurrently, and more specifically, three different frames of the shortest code can be assigned to a cluster of 27 CNUs each.

Note that to work properly, the circular shifting network must support concurrent subrotations as described in [24].
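The idea of concurrent subrotations can be illustrated in software. The sketch below is only a behavioural model under stated assumptions: the function name `subrotate`, the left-rotation convention, and the example shift values are hypothetical, and it says nothing about how the network of [24] is built in hardware.

```python
def subrotate(values, block_size, shifts):
    """Cyclically rotate each consecutive sub-vector of `block_size` elements
    by its own shift (left rotation), leaving the sub-vectors in place."""
    if len(values) != block_size * len(shifts):
        raise ValueError("values must split evenly into len(shifts) blocks")
    out = []
    for i, s in enumerate(shifts):
        block = values[i * block_size:(i + 1) * block_size]
        s %= block_size
        out.extend(block[s:] + block[:s])
    return out


# Three frames of the shortest code (block-size 27) sharing an 81-wide datapath.
datapath = list(range(81))
rotated = subrotate(datapath, 27, shifts=[1, 0, 26])
print(rotated[:3], rotated[27:30], rotated[54:57])
```

With a single shift and a block size of 81, the same function models the ordinary full-width rotation used for the longest codes.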

7. Decoder Performance

To give a practical example of the reordering strategies described in Section 4, Figure 12 shows the data flow related to the update of layer 0 for the WLAN code of Figure 11. While 6 idle cycles are required following the original, natural order of updates (see Figure 12(a)), EOP needs 5 cycles (see Figure 12(b)), ROP reduces them to 1 (see Figure 12(c)), while no idle cycle is used by UOP (see Figure 12(d)). The subsets defined in Section 4.1 are also shown in Figure 5, along with the optimal sequence of layers followed for decoding.
Figure 12

An example of optimization of the base-matrix of the LDPC code IEEE 802.11n with and with EOP, ROP and UOP. Critical propagations are highlighted in dark gray. (a) Original base-matrix (sequence of layers: 0,1,2,3,4,5,6,7). (b) EOP (optimised sequence of layers: 0,5,6,7,4,2,3,1). (c) ROP (optimised sequence of layers: 0,2,7,5,6,3,4,1). (d) UOP (optimised sequence of layers: 0,4,7,6,1,2,6,3).

7.1. Latency and Throughput

The latency of a pipelined LDPC decoder can be expressed as

$$t_d = T_{clk} \cdot \left[ N_{it} \cdot \left( N_b + N_{idle} \right) + N_{pipe} + N_{IO} \right] \qquad (16)$$

with $T_{clk}$ being the clock period, $N_{it}$ being the number of iterations, $N_b$ being the number of nonnull blocks in the code, $N_{idle}$ being the number of idle cycles per iteration, $N_{pipe}$ being the cycles to empty the decoder pipeline, and, finally, $N_{IO}$ being the cycles for the input/output interface. Among the parameters above, $N_{it}$ is set for good error-correction performance, $N_b$ is a code-dependent parameter, and $N_{IO}$ is fixed by the I/O management; thus, for a minimum latency, the designer can only act on $N_{idle}$, whose value can be optimised with the techniques of Section 4.

Focusing on the IEEE 802.11n codes, Table 1 shows the overall number of cycles for 12 iterations, the number of idle cycles per iteration ($N_{idle}$), the percentage of idle cycles with respect to the total (idling %), and the throughput at the clock frequency of 240 MHz.
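Table 1

Performance of an LDPC decoder for IEEE 802.11n with 12 iterations and a clock frequency of 240 MHz. Column headings give the code length and the code rate.

| Technique | Metric | 648, 1/2 | 648, 2/3 | 648, 3/4 | 648, 5/6 | 1296, 1/2 | 1296, 2/3 | 1296, 3/4 | 1296, 5/6 | 1944, 1/2 | 1944, 2/3 | 1944, 3/4 | 1944, 5/6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | Cycles | 2299 | 1763 | 1779 | 1486 | 2106 | 1715 | 1886 | 1653 | 2107 | 1775 | 1752 | 1603 |
| Original | Idle cycles per iteration | 91 | 46 | 47 | 22 | 81 | 46 | 60 | 43 | 77 | 47 | 48 | 41 |
| Original | Idling % | 47% | 31% | 31% | 17% | 46% | 32% | 38% | 31% | 44% | 31% | 32% | 30% |
| Original | Throughput (Mbps) | 101 | 176 | 197 | 262 | 74 | 121 | 124 | 157 | 111 | 175 | 200 | 243 |
| EOP | Cycles | 1927 | 1691 | 1575 | 1462 | 1819 | 1643 | 1527 | 1377 | 1855 | 1691 | 1538 | 1352 |
| EOP | Idle cycles per iteration | 60 | 40 | 30 | 20 | 57 | 40 | 30 | 20 | 56 | 40 | 30 | 20 |
| EOP | Idling % | 37% | 28% | 23% | 16% | 37% | 29% | 23% | 17% | 36% | 28% | 23% | 17% |
| EOP | Throughput (Mbps) | 121 | 184 | 222 | 266 | 85 | 126 | 153 | 188 | 126 | 184 | 228 | 288 |
| ROP | Cycles | 1308 | 1216 | 1290 | 1403 | 1223 | 1168 | 1239 | 1330 | 1283 | 1228 | 1243 | 1305 |
| ROP | Idle cycles per iteration | 8 | 0 | 6 | 15 | 7 | 0 | 6 | 16 | 8 | 1 | 5 | 16 |
| ROP | Idling % | 7.3% | 0% | 5.5% | 13% | 6.8% | 0% | 5.5% | 14% | 7.4% | 1% | 4.8% | 14% |
| ROP | Throughput (Mbps) | 178 | 256 | 271 | 277 | 127 | 178 | 188 | 195 | 182 | 253 | 282 | 298 |
| UOP | Cycles | 1308 | 1216 | 1243 | 1380 | 1187 | 1168 | 1195 | 1260 | 1259 | 1216 | 1195 | 1164 |
| UOP | Idle cycles per iteration | 8 | 0 | 2 | 13 | 4 | 0 | 2 | 10 | 6 | 0 | 1 | 4 |
| UOP | Idling % | 7.3% | 0% | 1.9% | 11% | 4% | 0% | 2% | 9.3% | 5.6% | 0% | 0.9% | 4% |
| UOP | Throughput (Mbps) | 178 | 256 | 282 | 282 | 131 | 178 | 195 | 206 | 185 | 256 | 293 | 334 |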

The throughput is expressed in information bits decoded per time unit and is also referred to as net throughput:

$$\Gamma_n = \frac{F \cdot K}{t_d} \qquad (17)$$

where $K$ is the number of information bits per codeword, $t_d$ is the decoding latency of (16), and $F$ is the number of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes ($F = 3$) are very similar to those for the long codes; on the contrary, the middle codes do not benefit from the same mechanism (i.e., $F = 1$) and their throughput is scaled down by a factor 2/3.

Table 1 reports the results for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case example, where at most 14% and 11% of the decoding time is spent idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 µs with EOP, and further to 5.3 µs with ROP and 5.1 µs with UOP. This behaviour can be explained by considering that for the WLAN codes the term found in (6) for EOP is significantly nonnull, while comparing (8) to (13), ROP and UOP basically differ for the term , which is negligible for the WLAN codes.
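The throughput and average decoding-time figures can be cross-checked from the cycle counts of Table 1. The sketch below assumes that the net throughput is the number of information bits of all frames decoded in parallel divided by the decoding latency, and that the latency is the cycle count times the clock period at 240 MHz. The variable names are illustrative, only the UOP cycle counts are used, and the column order (lengths 648, 1296, 1944, each with rates 1/2, 2/3, 3/4, 5/6) is inferred from the throughput values rather than read from the table headings.

```python
from fractions import Fraction

F_CLK_MHZ = 240                       # clock frequency in MHz
MAX_BLOCK = 81                        # number of CNUs
RATES = (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(5, 6))
BLOCK_SIZE = {648: 27, 1296: 54, 1944: 81}

# Overall cycles for 12 iterations with UOP, copied from Table 1.
UOP_CYCLES = {
    648: (1308, 1216, 1243, 1380),
    1296: (1187, 1168, 1195, 1260),
    1944: (1259, 1216, 1195, 1164),
}


def net_throughput_mbps(n, rate, cycles):
    """Information bits per microsecond (= Mbps) over all parallel frames."""
    frames = MAX_BLOCK // BLOCK_SIZE[n]   # 3, 1, 1 for the three lengths
    info_bits = n * rate
    return float(frames * info_bits * F_CLK_MHZ / cycles)


for n, row in UOP_CYCLES.items():
    print(n, [round(net_throughput_mbps(n, r, c)) for r, c in zip(RATES, row)])

all_cycles = [c for row in UOP_CYCLES.values() for c in row]
mean_time_us = sum(all_cycles) / len(all_cycles) / F_CLK_MHZ
print(f"average decoding time: {mean_time_us:.1f} us")
```

At 240 MHz, cycle counts in the range of Table 1 correspond to decoding times of a few microseconds.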

7.2. Error-Correction Performance

Figure 13 compares the floating point frame error rate (FER) after 12 decoding iterations of a pipelined decoder using EOP, ROP, and UOP with a reference curve obtained by simulating the original parity-check matrix before optimization, in a nonpipelined decoder. Two simulations were run for each strategy, one with the proper number of idle cycles (curves with full markers), and the other without idle cycles and referred to as full pipeline mode (curves with empty markers).
Figure 13

Error-correction performance of the IEEE 802.11n, , rate- LDPC code after 12 decoding iterations.

As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. Then, in the case of full pipeline (no idle cycles), the performance of EOP is spoiled, while ROP and UOP only pay about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers and only a few hazards arise without idle cycles.

Similarly to EOP, no received codeword is successfully decoded even at high SNRs (i.e., ) if the original code descriptors are simulated in full pipeline. This confirms once more the importance of idle cycles in a pipelined HLD decoder and motivates the need for an optimization technique.

Considering the same scenario of Figure 13, Figure 14 shows the convergence speed, measured in average number of iterations, of the layered decoding algorithm. The curves confirm that HLD needs one half of the number of iterations of the flooding schedule, on average, and show that the full pipeline mode is also penalized in terms of speed.
Figure 14

IEEE 802.11n, , rate- LDPC code: average decoding speed for a maximum of 100 iterations.

8. Implementation Results

The complexity of an LDPC decoder for IEEE 802.11n codes was derived through logic synthesis on a low-power 65 nm CMOS technology targeting 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error correction performance, input LLRs and c2v messages were represented on 5 bits, while internal SO and v2c messages were represented on 7 bits.

Table 2 summarizes the complexity of the different designs in terms of logic, measured in equivalent Kgates, and number of RAM and ROM bits. Equivalent gates are counted by referring to the low-drive, 2-input NAND cell, whose area is 2.08 µm² for the target technology library. Arch. V-A needs the highest number of memory bits due to the local variable-to-check buffer, but its logic is smaller since it requires no additional hardware resources (adders) and fewer configuration bits.
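Table 2

IEEE 802.11n LDPC decoder complexity analysis.

| Architecture | Resource | EOP | ROP | UOP |
|---|---|---|---|---|
| Arch. V-A | Logic (Kgates) | 71.29 | 71.62 | 74.65 |
| Arch. V-A | RAM bits | 61,722 | 61,722 | 61,722 |
| Arch. V-A | ROM bits | 23,159 | 23,159 | 40,788 |
| Arch. V-B | Logic (Kgates) | 75.45 | 75.75 | 77.99 |
| Arch. V-B | RAM bits | 53,622 | 54,837 | 57,024 |
| Arch. V-B | SO memory redundancy | 29.2% | 29.2% | 33.3% |
| Arch. V-B | c2v memory redundancy | 1.1% | 4.6% | 9.1% |
| Arch. V-B | ROM bits | 36,582 | 36,582 | 51,849 |
| Arch. V-C | Logic (Kgates) | 71.83 | 72.14 | 74.60 |
| Arch. V-C | RAM bits | 53,217 | 53,217 | 53,784 |
| Arch. V-C | v2c memory redundancy | 29.2% | 29.2% | 33.3% |
| Arch. V-C | ROM bits | 34,508 | 34,508 | 43,553 |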

Because of the partitioning of both the SO and the c2v memories, Arch. V-B needs more logic resources and more memory bits than Arch. V-C (both for data and configuration). The redundancy ratios of the SO and c2v memories in Arch. V-B, and of the v2c memory in Arch. V-C, are also reported in Table 2.

As a matter of fact, the three architectures are very similar in complexity and performance, and, for a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level against the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.

Table 3 compares the design of a decoder for IEEE 802.11n based on Arch. V-C with UOP with similar state-of-the-art implementations: a parallel decoder by Blanksby and Howland [7], a 2048-bit rate 1/2 TDMP decoder by Mansour and Shanbhag [25], a similar design for WLAN by Gunnam et al. [10], and a decoder for WiMAX by Brack et al. [26]. Here, for a fair comparison, the throughput is expressed in channel bits decoded per time unit; that is, it is the channel throughput .
Table 3

State-of-the-art LDPC decoder implementations.

| | [this] | [7] | [10] | [25] | [26] |
|---|---|---|---|---|---|
| Technology | 65 nm CMOS | 0.16 µm CMOS 5-LM | 0.13 µm TSMC CMOS | 0.18 µm 1.8 V TSMC CMOS | 0.13 µm CMOS |
| Algorithm | layered | flooding | layered | TDMP | flooding/layered |
| CPU arch. | serial | parallel | serial | parallel | serial |
| Nb. of CPUs | 81 | 1536 | 81 | 64 | 96 |
| Msg. width (c2v + SO) | 5 + 7 | 4 + 4 | 5 + 6 | 4 + 5 | 6 |
| Clock frequency (MHz) | 240 | 64 | 500 | 125 | 333 |
| Rates | 1/2, 2/3, 3/4, 5/6 | | 1/2, 2/3, 3/4, 5/6 | | |
| Codeword length, N | 648, 1296, 1944 | 1024 | 648, 1296, 1944 | 2048 | |
| Block size, B | 27, 54, 81 | 1 | 27, 54, 81 | 64 | |
| Nb. of blocks | 79–88 | 4.33 | 79–88 | 96 | 76–88 |
| Speed: iterations | 12 | 64 | 5 | 10 | 16 |
| Speed: channel throughput (Mbps) | 262–401 | 1,024 | 541–1,618 | 640 | 177–999 |
| Area: Kgates (mm²) | 100.7 (0.207) | 1750 (52.5) | 99.9 (1.85) | 220 (14.3) | 489.9 (2.964) |
| Area: RAM bits | 56,376 | | 55,344 | 51,680 | NA |
| Power consumption (W) | 0.162 | 0.69 | 0.238 | 0.787 | NA |
| Architectural efficiency (cycle/bit/iter) | 1.103–1.306 | 0.231 | 1.361–1.521 | 0.417 | 1.01–1.31 |
| Energy efficiency (pJ/bit/iter) | 33.7–51.5 | 10.5 | | 123 | |

For the comparison, we focused on the architectural efficiency defined as

$$\eta = \frac{N_{cycles}}{N_{it} \cdot N_b} \qquad (18)$$

where $N_{cycles}$ is the overall number of decoding cycles, $N_{it}$ is the number of iterations, and $N_b$ is the number of nonnull blocks in the code; $\eta$ represents the average number of clock cycles to update one block of the parity-check matrix. In decoders based on serial functional units it is $\eta \geq 1$, and the higher $\eta$ is, the less efficient the architecture is. Actually, $\eta$ can reach 1 only when the dependence between consecutive layers is solved at the code design level. This is the case of two WiMAX codes (specifically, class 1/2 and class 2/3B codes) which are hazard-free (or layered) "by construction", thus explaining the very low value of $\eta$ achieved by [26]. However, [26] is as efficient as our design ( ) on the remaining nonlayered WiMAX codes, but the authors do not perform layered decoding on such codes.

For decoders with parallel processing units (see [7, 25]) the architectural efficiency becomes a measure of the parallelization used in the processing units and it can be expressed as with being the average check node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages, divided by the number of functional units. If E is the number of edges in the code, then , which is an index of the parallelization used in the processors.

The different designs were also compared in terms of energy efficiency, defined as the energy spent per coded bit and per decoding iteration. This is computed as

$$E_{eff} = \frac{E_d}{N \cdot N_{it}} = \frac{P \cdot t_d}{N \cdot N_{it}} \qquad (19)$$

with $E_d$ being the decoding energy, $P$ being the power consumption, $t_d$ the decoding time, $N$ the number of coded bits, and $N_{it}$ the number of iterations. The power consumption was estimated with Synopsys Power Compiler, was averaged over three different SNRs (corresponding to different convergence speeds), and includes the power dissipated in the memory units (about of the total). In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
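The efficiency figures of Table 3 can be reproduced from the other entries of the same table. The sketch below assumes that (18) amounts to the number of clock cycles spent per iteration and per block, and that (19) amounts to the power divided by the channel throughput and by the number of iterations; both readings are checked against the tabulated numbers rather than quoted from the equations, and the function names are illustrative.

```python
def arch_efficiency(f_clk_mhz, n_bits, throughput_mbps, iterations, blocks):
    """Clock cycles per block and per iteration."""
    cycles = f_clk_mhz * n_bits / throughput_mbps   # cycles per codeword
    return cycles / (iterations * blocks)


def energy_efficiency_pj(power_w, throughput_mbps, iterations):
    """Energy per coded bit and per iteration, in picojoules."""
    joule_per_bit = power_w / (throughput_mbps * 1e6)
    return joule_per_bit / iterations * 1e12


# Entries of Table 3 for the decoder of [25]: 125 MHz, 2048 bits, 640 Mbps,
# 10 iterations, 96 blocks, 0.787 W.
print(round(arch_efficiency(125, 2048, 640, 10, 96), 3))
print(round(energy_efficiency_pj(0.787, 640, 10)))

# Entries for the parallel decoder of [7]: 64 MHz, 1024 bits, 1,024 Mbps,
# 64 iterations, 0.69 W.
print(round(energy_efficiency_pj(0.69, 1024, 64), 1))
```

Working from the channel throughput keeps the two functions independent of how many frames a given decoder processes in parallel.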

Since the design in [10] is for the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer inspection is compulsory. Thanks to the idle optimization, our solution is more efficient in terms of throughput, the saving in efficiency ranging from to . Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 3 since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error correction performance are not available in [10], penalties are expected in view of the lower precision used to represent c2v messages (5 bits) and SOs (6 bits).

9. Conclusions

An effective method to counteract the pipeline hazards typical of block-serial layered decoders of LDPC codes has been presented in this paper. This method is based on the rearrangement of the decoding elaborations in order to minimize the number of idle cycles inserted between updates and resulted in three different strategies named equal, reversed, and unconstrained output (EOP, ROP, and UOP) processing.

Then, different semi-parallel VLSI architectures of a layered decoder for architecture-aware LDPC codes supporting the methods above have been described and applied to the design of a decoder for IEEE 802.11n LDPC codes.

The synthesis of the proposed decoder on a 65 nm low-power CMOS technology reached the clock frequency of 240 MHz, which corresponds to a net throughput ranging from 131 to 334 Mbps with UOP and 12 decoding iterations, outperforming similar designs.

This work has proved that the layered decoding algorithm can be extended, without modifications or approximations, to every LDPC code, regardless of the interconnections in its parity-check matrix, provided that idle cycles are used to maintain the dependencies between the updates in the algorithm.

Also, the paradigm of code-decoder codesign has been reinforced in this work: not only have the described techniques proved very effective in counteracting the pipeline hazards, but they also provide useful guidelines for the design of good, hazard-free LDPC codes. In this respect, the assumption that consecutive layers must not share soft outputs, as in the WiMAX class 1/2 and 2/3B codes, is overcome, thus leaving more room for the optimization of the code performance at the level of the code design.

Authors’ Affiliations

(1)
Department of Information Engineering, University of Pisa

References

  1. Satellite digital video broadcasting of second generation (DVB-S2). ETSI Standard EN302307, February 2005.
  2. IEEE Computer Society: Air Interface for Fixed and Mobile Broadband Wireless Access Systems. IEEE Std 802.16e-2005, February 2006.
  3. IEEE P802.11n/D1.06: Draft amendment to Standard for high throughput. 802.11 Working Group, November 2006.
  4. Gallager R: Low-density parity-check codes, Ph.D. dissertation. Massachusetts Institute of Technology; 1960.
  5. MacKay D, Neal R: Good codes based on very sparse matrices. Proceedings of the 5th IMA Conference on Cryptography and Coding, 1995.
  6. Mansour MM, Shanbhag NR: High-throughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2003, 11(6):976-996.
  7. Blanksby A, Howland C: A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits 2002, 37(3):404-412. 10.1109/4.987093
  8. Zhong H, Zhang T: Block-LDPC: a practical LDPC coding system design approach. IEEE Transactions on Circuits and Systems I 2005, 52(4):766-775.
  9. Hocevar DE: A reduced complexity decoder architecture via layered decoding of LDPC codes. Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '04), 2004, 107-112.
  10. Gunnam K, Choi G, Wang W, Yeary M: Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007, 1645-1648.
  11. Bhatt T, Sundaramurthy V, Stolpman V, McCain D: Pipelined block-serial decoder architecture for structured LDPC codes. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), April 2006, 4:225-228.
  12. Fewer CP, Flanagan MF, Fagan AD: A versatile variable rate LDPC codec architecture. IEEE Transactions on Circuits and Systems I 2007, 54(10):2240-2251.
  13. Boutillon E, Tousch J, Guilloud F: LDPC decoder, corresponding method, system and computer program. US patent no. 7,174,495 B2, February 2007.
  14. Rovini M, Rossi F, Ciao P, L'Insalata N, Fanucci L: Layered decoding of non-layered LDPC codes. Proceedings of the 9th Euromicro Conference on Digital System Design (DSD '06), August-September 2006.
  15. Tanner R: A recursive approach to low complexity codes. IEEE Transactions on Information Theory 1981, 27(5):533-547. 10.1109/TIT.1981.1056404
  16. Zhang H, Zhu J, Shi H, Wang D: Layered approx-regular LDPC: code construction and encoder/decoder design. IEEE Transactions on Circuits and Systems I 2008, 55(2):572-585.
  17. Echard R, Chang S-C: The π-rotation low-density parity check codes. Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '01), November 2001, 980-984.
  18. Guilloud F, Boutillon E, Tousch J, Danger J-L: Generic description and synthesis of LDPC decoders. IEEE Transactions on Communications 2006, 55(11):2084-2091.
  19. Xiao H, Banihashemi AH: Graph-based message-passing schedules for decoding LDPC codes. IEEE Transactions on Communications 2004, 52(12):2098-2105. 10.1109/TCOMM.2004.838730
  20. Sharon E, Litsyn S, Goldberger J: Efficient serial message-passing schedules for LDPC decoding. IEEE Transactions on Information Theory 2007, 53(11):4076-4091.
  21. Zarkeshvari F, Banihashemi A: On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes. Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '02), November 2002, 2:1349-1353.
  22. Jones C, Valles E, Smith M, Villasenor J: Approximate-MIN constraint node updating for LDPC code decoding. Proceedings of the IEEE Military Communications Conference (MILCOM '03), October 2003, 1:157-162.
  23. Rovini M, Rossi F, L'Insalata N, Fanucci L: High-precision LDPC codes decoding at the lowest complexity. Proceedings of the 14th European Signal Processing Conference (EUSIPCO '06), September 2006.
  24. Rovini M, Gentile G, Fanucci L: Multi-size circular shifting networks for decoders of structured LDPC codes. Electronics Letters 2007, 43(17):938-940. 10.1049/el:20071157
  25. Mansour MM, Shanbhag NR: A 640-Mb/s 2048-bit programmable LDPC decoder chip. IEEE Journal of Solid-State Circuits 2006, 41(3):684-698. 10.1109/JSSC.2005.864133
  26. Brack T, Alles M, Kienle F, Wehn N: A synthesizable IP core for WiMax 802.16E LDPC code decoding. Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '06), September 2006, 1-5.

Copyright

© Massimo Rovini et al. 2009

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.