Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes
 Massimo Rovini^{1},
 Giuseppe Gentile^{1},
 Francesco Rossi^{1} and
 Luca Fanucci^{1}
DOI: 10.1155/2009/723465
© Massimo Rovini et al. 2009
Received: 4 March 2009
Accepted: 27 July 2009
Published: 15 September 2009
Abstract
The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement in the convergence speed (2×) of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error-correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.
1. Introduction
Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.
In this context, low-density parity-check (LDPC) codes have gained momentum in the scientific community and have recently been adopted as forward error correction (FEC) codes by several communication standards, such as the second-generation digital video broadcasting (DVB-S2, [1]), the wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), the wireless local area networks (WLANs, IEEE 802.11n, [3]), and the 10 Gbit Ethernet (10GBASE-T, IEEE 802.3an).
LDPC codes were first discovered by Gallager in the early 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very-large-scale integration (VLSI) technology, rediscovered them in the early 1990s [5]. The renewed interest in and the success of LDPC codes are due to (i) the remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, very suitable for hardware parallelization, and last but not least (iv) the advent of structured or architecture-aware (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture and so allow the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.
Nowadays, data services and user applications impose severe low-complexity and low-power constraints on the design of practical decoders and demand very high throughput. The adoption of a fully parallel decoder architecture leads to impressive throughput but unfortunately is also so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).
So, to counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, by pipelining the decoding process to shorten the iteration time. The first matter can be solved with the turbo-decoding message-passing (TDMP) [6] or the layered decoding algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.
However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder by letting the processing units not always work on the most updated messages. This issue, known as pipeline "hazard", arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay the decoding process with "idle" cycles until newer data are available.
As an improvement to similar state-of-the-art works [10–13], this paper proposes three systematic techniques to optimally reschedule the decoding process so as to minimize the number of idle cycles and achieve the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.
Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of the pipeline hazards; only the need to properly scramble the sequence of data in order to clear some memory conflicts is described.
Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards (cut-edge conflict) and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.
Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates, but again, they do not investigate the dependence between elaborations.
In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipelining.
This paper is organised as follows. Section 2 recalls the basics of LDPC and of AA-LDPC codes and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial layered LDPC decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performance of the related decoder is analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with the comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.
2. Architecture-Aware Block-LDPC Codes
Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized into (i) the arrangement of the parity-check matrix in square subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as block-LDPC codes [8].
The pattern used within blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder and can be based either on permutations, as in [6] and for the class of rotation codes [17], or on circulants or cyclic shifts of the identity matrix, as in [8] and in all recent standards [1–3].
AA-LDPC codes are defined by the number of block-columns $n_c$, the number of block-rows $n_r$, and the block-size $B$, which is the size of the component submatrices. Their parity-check matrix $\mathbf{H}$ can be conveniently viewed as the expansion of a base-matrix $\mathbf{H}_\text{base}$ with size $n_r \times n_c$. The expansion is accomplished by replacing the 1's in $\mathbf{H}_\text{base}$ with permutations or circulants, and the 0's with null subblocks. Thus, the block-size $B$ is also referred to as expansion-factor, for a codeword length of the resulting LDPC code equal to $N = n_c B$ and code rate $R = 1 - n_r / n_c$.
A simple example of expansion or vectorization of a base-matrix is shown in Figure 1. The size, number, and location of the non-null blocks in the code are the key parameters to get good error-correction performance and low complexity of the related decoder.
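The expansion of a base-matrix with circulants can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `circulant` and `expand`, the use of `None` to mark a null block, and the toy base-matrix are hypothetical and not taken from the paper; each non-null entry is assumed to hold the cyclic shift applied to the identity matrix.

```python
def circulant(size, shift):
    """size x size identity matrix with its columns cyclically shifted by `shift`."""
    return [[1 if (row + shift) % size == col else 0 for col in range(size)]
            for row in range(size)]


def expand(base, size):
    """Expand a base-matrix into a binary parity-check matrix (list of rows).

    base[i][j] is None for a null block, or the cyclic shift of a circulant.
    """
    null_block = [[0] * size for _ in range(size)]
    matrix = []
    for base_row in base:
        blocks = [null_block if shift is None else circulant(size, shift)
                  for shift in base_row]
        for r in range(size):
            # Row r of the block-row concatenates row r of every block.
            matrix.append([bit for block in blocks for bit in block[r]])
    return matrix


# Hypothetical toy base-matrix: 2 block-rows, 3 block-columns, block-size 3.
base = [[0, 1, None],
        [None, 2, 0]]
H = expand(base, 3)
```

With this toy input the expanded matrix has 6 rows and 9 columns, and every row has weight 2, one entry per non-null block in its block-row.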
3. Decoding of LDPC Codes
LDPC codes are decoded with the belief propagation (BP) or message-passing (MP) algorithm, which belongs to the broader class of maximum a posteriori (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as schedule, considerably affects the achievable performance.
The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all parity-check nodes are updated first, followed by all variable nodes.
A different approach, taking the distribution of closed paths and girths in the code into account, has been described by Xiao and Banihashemi in [19]. Although probabilistic schedules are shown to outperform deterministic schedules, the random activation strategy of the processing nodes is not very suitable for hardware implementation and adds significant complexity overheads.
The most attractive schedule is the shuffled or layered decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and for cycle-free codes [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or layers [9], exchanging intermediate reliability messages. Specifically, a posteriori messages are made available to the next layers immediately after computation and not at the next iteration as in a conventional flooding schedule.
Layers can be any set of either check nodes (CNs) or variable nodes (VNs), and, accordingly, CN-centric (or horizontal) or VN-centric (or vertical) algorithms have been analyzed in [18, 20]. However, CN-centric solutions are preferable since they can exploit serial, flexible, and low-complexity CN processors.
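Algorithm 1: Horizontal layered decoding.
input: a priori LLRs $\lambda_n$, $n = 0, 1, \ldots, N-1$
output: a posteriori hard decisions $\hat{y}_n = \mathrm{sgn}(y_n)$
(1) // Messages initialization
(2) $q = 0$, $y_n = \lambda_n$, $\epsilon_{m,n}^{(0)} = 0$, $\forall n = 0, \ldots, N-1$, $\forall m = 0, \ldots, M-1$;
(3) while ($q < N_\text{it}$ & !Convergence) do
(4)   // Loop on all layers
(5)   for $m = 0$ to $M-1$ do
(6)     // Check-node update
(7)     forall $n \in \mathcal{N}_m$ do
(8)       // Sign update
(9)       $\mathrm{sgn}(\epsilon_{m,n}^{(q+1)}) = \prod_{j \in \mathcal{N}_m \setminus n} \mathrm{sgn}(y_j - \epsilon_{m,j}^{(q)})$;
(10)      // Magnitude update
(11)      $|\epsilon_{m,n}^{(q+1)}| = 2 \tanh^{-1} \big( \prod_{j \in \mathcal{N}_m \setminus n} \tanh( |y_j - \epsilon_{m,j}^{(q)}| / 2 ) \big)$;
(12)      // Soft-output update
(13)      $y_n = y_n - \epsilon_{m,n}^{(q)} + \epsilon_{m,n}^{(q+1)}$
(14)    end
(15)  end
(16)  $q = q + 1$;
(17) end
In Algorithm 1, $\lambda_n$ is the $n$th a priori LLR of the received bits, with $n = 0, 1, \ldots, N-1$ and $N$ the length of the codeword, $M$ is the overall number of parity-check constraints, and $N_\text{it}$ the number of decoding iterations. Also, $\mathcal{N}_m$ is the set of VNs connected to the $m$th CN, $\epsilon_{m,n}^{(q)}$ represents the check-to-variable (c2v) reliability message sent from CN $m$ to VN $n$ at iteration $q$, and $y_n$ is the total information or soft-output (SO) of the $n$th bit in the codeword (see Figure 1).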
For ease of notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, SOs are refined with the involved c2v message, as shown in line 13, and thanks to this mechanism, faster convergence is achieved.
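The horizontal layered schedule can be sketched in Python as follows. This is a minimal floating-point model, not the authors' hardware algorithm: it assumes one layer per parity check, uses the exact belief-propagation check-node rule rather than any approximation, and takes a positive LLR to mean that bit 0 is more likely. The names `check_to_variable` and `layered_decode` and the toy code below are hypothetical.

```python
import math


def check_to_variable(others):
    """Exact BP check-node rule applied to the v2c messages of the other VNs."""
    sign = 1.0
    product = 1.0
    for v in others:
        if v < 0:
            sign = -sign
        product *= math.tanh(abs(v) / 2.0)
    # Clamp to keep atanh finite when all inputs are very reliable.
    product = min(product, 1.0 - 1e-12)
    return sign * 2.0 * math.atanh(product)


def layered_decode(llr, checks, max_iter=10):
    """Horizontal layered decoding; checks[m] lists the VNs connected to CN m."""
    y = list(llr)                                    # soft outputs (SOs)
    eps = [{n: 0.0 for n in row} for row in checks]  # c2v messages
    for iteration in range(1, max_iter + 1):
        for m, row in enumerate(checks):
            # v2c messages: remove the old c2v contribution from the SOs.
            v2c = {n: y[n] - eps[m][n] for n in row}
            for n in row:
                eps[m][n] = check_to_variable([v2c[j] for j in row if j != n])
                # Refined SO, immediately visible to the next layer.
                y[n] = v2c[n] + eps[m][n]
        hard = [0 if value >= 0 else 1 for value in y]
        if all(sum(hard[n] for n in row) % 2 == 0 for row in checks):
            return hard, iteration
    return hard, max_iter


# Toy example: all-zero codeword with one unreliable, wrongly signed LLR.
checks = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
decoded, iterations = layered_decode([2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0], checks)
```

Because the refined SOs are written back inside the loop on layers, the second and third checks already work on the values corrected by the first one, which is the mechanism behind the faster convergence.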
4. Decoding Pipelining and Idling
If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every $d$ clock cycles, with $d$ the degree of the layer (see Figure 3(b)).
Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration may not always be up to date, as newer values may still be in the pipeline. This issue, known as pipeline hazard, prevents the use and so the propagation of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.
The solution investigated in this paper is to insert null or idle cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.
Three different strategies are described in this section to reduce the dependence between consecutive updates in the horizontal layered decoding (HLD) algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free LDPC codes.
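The effect of the input and output order on the number of idle cycles can be illustrated with a toy timing model. The sketch below is not the closed-form analysis of this section: the function `idle_cycles`, the timing assumptions listed in its docstring, and the numerical example are all hypothetical and only meant to show why reading shared messages late and writing them early reduces idling.

```python
def idle_cycles(out_order_prev, in_order_next, pipeline_latency):
    """Minimum idle cycles between two consecutive layers (toy timing model).

    Assumptions of this sketch:
      * the previous layer reads its d SOs at cycles 0 .. d-1;
      * it writes them back one per cycle, in `out_order_prev`, starting at
        cycle d + pipeline_latency;
      * an SO written at cycle t can be read from cycle t + 1;
      * with no idle cycles the next layer would start reading at cycle d.
    """
    d_prev = len(out_order_prev)
    out_pos = {so: i for i, so in enumerate(out_order_prev)}
    idle = 0
    for in_pos, so in enumerate(in_order_next):
        if so in out_pos:  # SO shared by the two layers
            write_cycle = d_prev + pipeline_latency + out_pos[so]
            read_cycle = d_prev + in_pos
            idle = max(idle, write_cycle - read_cycle + 1)
    return idle


# Hypothetical example: SOs 7 and 12 are shared by the two layers.
same_order = idle_cycles([3, 7, 9, 12], [7, 12, 20, 21], pipeline_latency=2)
shared_read_last = idle_cycles([3, 7, 9, 12], [20, 21, 7, 12], pipeline_latency=2)
shared_written_first = idle_cycles([7, 12, 3, 9], [20, 21, 7, 12], pipeline_latency=2)
```

In this example the three schedules need 5, 3, and 1 idle cycles, respectively: postponing the acquisition of the shared SOs helps, and delivering them first helps further.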
4.1. System Notation
Without loss of generality, let us identify a layer with one single parity-check node and, focusing on the set of soft-outputs participating in layer , let us define the following subsets:
(i) , the set of SOs in common with layer ;
(ii) , the set of SOs in common with layer and not in ;
(iii) , the set of SOs in common with both layers and ;
(iv) , the set of SOs in common with layer and not in or ;
(v) , the set of SOs in common with layer but not in , , ;
(vi) , the set of SOs in common with both layers and , but not in or ;
(vii) , the set of remaining SOs.
In the definitions above, the notation $X \setminus Y$ means the relative complement of $Y$ in $X$, or the set-theoretic difference of $X$ and $Y$. Let us also define the following cardinalities: (degree of layer ), , , , , , , .
4.2. Equal Output Processing
First, let us consider a very straightforward and implementation-friendly architecture of the node processor that updates (and so delivers) the soft-output messages in the same order used to take them in.
In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in , and (ii) output the messages in as soon as possible to let the next layer start earlier. Actually, the last constraint only holds when does not include any message common to layer , that is, when ; otherwise, the set could be acquired at any time before .
Note that (5) and (6) only hold under the hypothesis of leading within . If this is not the case, up to extra idle cycles could be added if is output last within .
So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer and . Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers and and between layers and .
4.3. Reversed Output Processing
Depending on the particular structure of the parity-check matrix , it may occur that most of the messages of layer in common with layer are also shared with layer , that is, and . If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in can be both acquired last and output first.
Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets , , and are managed similarly to , and . The ROP strategy is then instructed to acquire the set later and to output earlier. However, the situation is complicated by the fact that the set may not entirely coincide with ; rather it is , since some of the messages in can be found in . This is highlighted in Figure 5(b), where those messages of and not delivered to are shown in dark grey.
The margin is actually nonnull only if ; otherwise, under the hypothesis that (i) the set is output first within , and (ii) within , the messages not in are output last.
4.4. Unconstrained Output Processing
Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer can still delay the acquisition of the messages updated by layer (i.e., messages in ) as usual, but at the same time the messages common to layer (i.e., in ) can also be delivered earlier.
Regarding the interaction between three consecutive layers, if the messages common to layer (i.e., in ) are output just after , and if on layer , the set is taken just before , then there is no risk of pipeline hazard between layer and .
4.5. Decoding of Irregular Codes
A serial processor cannot process consecutive layers with decreasing degrees, , as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is nothing but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with idle cycles.
with being computed according to (6), (11), or (13).
4.6. Optimal Sequence of Layers
where is the number of idle cycles between layer and for the generic permutation and is given by (14), and is the set of the possible permutations of layers.
The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix , whose layers are scrambled according to the optimal permutation . Then, within each layer of , the order to update the non-null subblocks is given by the strategy in use among EOP, ROP, and UOP.
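A brute-force search over the layer permutations can be sketched as below. The sketch makes simplifying assumptions that are not taken from the paper: the idle cycles are given as a precomputed pairwise table `idle[i][j]` (layer `j` processed right after layer `i`), interactions among three consecutive layers are ignored, and the sequence is treated as cyclic because iterations repeat. The function name and the cost table are hypothetical.

```python
from itertools import permutations


def best_layer_order(idle):
    """Exhaustive search of the layer sequence with the fewest idle cycles.

    idle[i][j] is the (assumed, precomputed) number of idle cycles needed
    when layer j is processed right after layer i.
    """
    n = len(idle)
    best_perm, best_cost = None, None
    # Layer 0 is kept first: rotations of a cyclic schedule are equivalent.
    for rest in permutations(range(1, n)):
        perm = (0,) + rest
        cost = sum(idle[perm[k]][perm[(k + 1) % n]] for k in range(n))
        if best_cost is None or cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Hypothetical table of pairwise idle cycles for a 4-layer code.
idle = [[0, 3, 1, 2],
        [2, 0, 4, 1],
        [1, 2, 0, 3],
        [3, 1, 2, 0]]
natural_cost = sum(idle[k][(k + 1) % 4] for k in range(4))
order, cost = best_layer_order(idle)
```

For this table the natural order costs 13 idle cycles per iteration while the best permutation costs 7. The search visits (L-1)! sequences for L layers, which stays tractable for the small number of block-rows of structured codes.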
4.7. Summary and Results
The three methods proposed in this section differ in how effectively they minimize the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case example, results will be shown in Section 7 for the WLAN LDPC codes.
However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the costs of its hardware implementation, which is the objective of Section 5. In particular, UOP generally requires higher hardware complexity, so EOP or ROP can be preferred for particular codes.
5. Decoder Architectures
Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and to this end, semi-parallel architectures are widely recognised as the best design choice.
As shown in [6, 8, 12] to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor of the base-matrix . Therefore, the HLD algorithm described in Section 3 must be intended in a vectorized form as well, and in order to exploit the code structure, a layer counts consecutive parity-check nodes. Layers (in the number of ) are updated in sequence by the check-node units (CNUs), and an array of SOs ( ) and of c2v messages ( ) are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.
The CNUs are designed to serially update the c2v magnitudes according to (3) and (4), and any arbitrary order of the c2v messages (and so of SOs, see line 13 of Algorithm 1) can be easily achieved by properly multiplexing between the two values as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.
Three VLSI architectures of a layered decoder will be described, that differ in the management of the memory units of both SO and c2v, and so result in different implementation costs in terms of memory (RAM and ROM) and logic.
5.1. Local Variable-to-Check Buffer
Then, the updated c2v messages are used to refine every array of SOs belonging to layer : according to line 13 of Algorithm 1, this is done by adding the new c2v array to the input v2c array . Since the CNUs work in pipeline, while the update of layer is still in progress, the array of the v2c messages belonging to layer is already being computed as , with . For this reason, needs to be temporarily stored in a local buffer as shown in Figure 7. The buffer is vectorized as well and stores messages, with the maximum CN degree in the code.
Before being stored back in memory, the array is circularly shifted and made ready for its next use, by applying compound or incremental rotations [12]; this operation is carried out by the circular shifting network of Figure 7, and more details about its architecture are available in [24].
The v2c buffer is the key element that allows the architecture to work in pipeline. This buffer has to sustain one reading and one writing access concurrently and can be efficiently implemented with shift-register based architectures for EOP (first-in, first-out, FIFO buffer) and ROP (last-in, first-out, LIFO buffer). On the contrary, UOP needs to map the buffer onto a dual-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).
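The three buffer disciplines can be mimicked in software to show the order in which the stored v2c messages come back out. The sketch below models only the ordering, not the concurrent read and write accesses of the hardware buffer; the function `buffer_read_order` and the `read_addresses` table, which stands in for the contents of the configuration ROM, are hypothetical.

```python
from collections import deque


def buffer_read_order(written, strategy, read_addresses=None):
    """Order in which buffered v2c messages are read back for each strategy."""
    if strategy == "EOP":            # FIFO: first in, first out
        fifo = deque(written)
        return [fifo.popleft() for _ in range(len(written))]
    if strategy == "ROP":            # LIFO: last in, first out
        lifo = list(written)
        return [lifo.pop() for _ in range(len(written))]
    if strategy == "UOP":            # addressed memory, order taken from a table
        return [written[address] for address in read_addresses]
    raise ValueError("unknown strategy: " + strategy)


messages = ["v0", "v1", "v2", "v3"]
eop = buffer_read_order(messages, "EOP")
rop = buffer_read_order(messages, "ROP")
uop = buffer_read_order(messages, "UOP", read_addresses=[2, 0, 3, 1])
```

Only the UOP case needs an explicit address per access, which is where the extra configuration memory comes from.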
5.2. Double Memory Access
It follows that three-port memories are needed for both SO and c2v messages since three concurrent accesses have to be supported: two readings (see ports and in Figure 8) and one writing. This memory can be implemented by distributing data on several banks of customary dual-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder the same memory location needs to be accessed several times per iteration and concurrently with several other data, so that resorting to only two memory banks would be unfeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.
The most trivial and expensive solution is achieved when both banks are a full copy or a mirror of the original memory as in [11], which corresponds to 100% redundancy. As an alternative to this route, data can be selectively assigned to the two banks through a computer search aiming at minimum redundancy.
Roughly speaking, if we denote by the cardinality of the set of data (SO or c2v messages) read concurrently to the th data for , then the higher is (for a given ), the higher is the expected redundancy. So, a small redundancy is experienced by the c2v memory, since each c2v message can collide with at most two other data (i.e., ), while a higher redundancy is associated with the SO memory, since every SO can face up to conflicts, with being the degree of the th variable node, typically greater than (especially for low-rate codes).
Indeed, the issue of memory partitioning and the reordering techniques described in Section 4 are linked to each other: whenever the CNUs are in idle, only one reading is performed. Therefore, an overall system optimization aiming at minimizing the iteration latency and the amount of memory redundancy at the same time could be pursued; however, due to the huge optimization space, this task is almost unfeasible and is not considered in this work.
5.3. Storage of Variable-to-Check Messages
During the elaboration of a generic layer, a certain v2c message is needed twice, and a local buffer or multiple memory reading operations were implemented in Arch. VA and Arch. VB, respectively.
In this way, the SO memory turns into a v2c memory with the following meaning: the array updated by layer is stored in memory after marginalization with the c2v message , with being the index of the next layer reusing the same array of SOs, . In other words, the array of v2c messages involved in the next update of the same block-column is precomputed. Therefore, the data stored in the v2c memory are used twice, first to feed the array of CNUs, and then for the SOs update.
Similarly to Arch. VB, a three-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the v2c memory onto two banks with some redundancy can be found. Note that, as opposed to Arch. VB, a customary dual-port memory is enough for c2v messages.
As far as the complexity is concerned, at first glance this solution seems to be preferable to Arch. VB since it needs only two stages of parallel adders while the c2v memory is not split. However, the management of the reading ports of the v2c memory introduces significant overheads, since after the update of the soft outputs by layer , the memory controller must be aware of what is the next layer using the same soft outputs . This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multi-length, multi-rate decoder.
6. A Case Study: The IEEE 802.11n LDPC Codes
6.1. LDPC Code Construction
The WLAN standard [3] defines AA-LDPC codes based on circulants of the identity matrix. Three different codeword lengths are supported, $N = 648$, $1296$, and $1944$, each coming with four code rates, $R = 1/2$, $2/3$, $3/4$, and $5/6$, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, $B = 27$, $54$, and $81$, respectively; accordingly, every code counts 24 block-columns, while the block-rows (layers) are in the number of 12, 8, 6, and 4 for code rates $1/2$, $2/3$, $3/4$, and $5/6$, respectively.
An example of the basematrix for the code with length and rate is shown in Figure 11.
6.2. Multiframe Decoder Architecture
In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to 81, the largest block-size. This means that two thirds of the processors would remain unused with the shortest codes.
In the latter case, the throughput can be increased thanks to a multiframe approach, where several frames of a code with smaller block-size are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the two non-overlapped phases of the flooding algorithm. Here, frames are decoded concurrently, and more specifically, three different frames of the shortest code can be assigned to a cluster of 27 CNUs each.
Note that to work properly, the circular shifting network must support concurrent subrotations as described in [24].
7. Decoder Performance
7.1. Latency and Throughput
with being the clock period, being the number of iterations, being the number of non-null blocks in the code, being the number of idle cycles per iteration, being the cycles to empty the decoder pipeline, and finally, being the cycles for the input/output interface. Among the parameters above, is set for good error-correction performance, is a code-dependent parameter, and is fixed by the I/O management; thus, for a minimum latency, the designer can only act on , whose value can be optimised with the techniques of Section 4.
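Table 1: Performance of an LDPC decoder for IEEE 802.11n with 12 iterations and a clock frequency of 240 MHz.
Strategy | Metric | N = 648 | N = 648 | N = 648 | N = 648 | N = 1296 | N = 1296 | N = 1296 | N = 1296 | N = 1944 | N = 1944 | N = 1944 | N = 1944
Code rate | | 1/2 | 2/3 | 3/4 | 5/6 | 1/2 | 2/3 | 3/4 | 5/6 | 1/2 | 2/3 | 3/4 | 5/6
Original | Latency (clock cycles) | 2299 | 1763 | 1779 | 1486 | 2106 | 1715 | 1886 | 1653 | 2107 | 1775 | 1752 | 1603
Original | Idle cycles per iteration | 91 | 46 | 47 | 22 | 81 | 46 | 60 | 43 | 77 | 47 | 48 | 41
Original | Idling % | 47% | 31% | 31% | 17% | 46% | 32% | 38% | 31% | 44% | 31% | 32% | 30%
Original | Throughput (Mbps) | 101 | 176 | 197 | 262 | 74 | 121 | 124 | 157 | 111 | 175 | 200 | 243
EOP | Latency (clock cycles) | 1927 | 1691 | 1575 | 1462 | 1819 | 1643 | 1527 | 1377 | 1855 | 1691 | 1538 | 1352
EOP | Idle cycles per iteration | 60 | 40 | 30 | 20 | 57 | 40 | 30 | 20 | 56 | 40 | 30 | 20
EOP | Idling % | 37% | 28% | 23% | 16% | 37% | 29% | 23% | 17% | 36% | 28% | 23% | 17%
EOP | Throughput (Mbps) | 121 | 184 | 222 | 266 | 85 | 126 | 153 | 188 | 126 | 184 | 228 | 288
ROP | Latency (clock cycles) | 1308 | 1216 | 1290 | 1403 | 1223 | 1168 | 1239 | 1330 | 1283 | 1228 | 1243 | 1305
ROP | Idle cycles per iteration | 8 | 0 | 6 | 15 | 7 | 0 | 6 | 16 | 8 | 1 | 5 | 16
ROP | Idling % | 7.3% | 0% | 5.5% | 13% | 6.8% | 0% | 5.5% | 14% | 7.4% | 1% | 4.8% | 14%
ROP | Throughput (Mbps) | 178 | 256 | 271 | 277 | 127 | 178 | 188 | 195 | 182 | 253 | 282 | 298
UOP | Latency (clock cycles) | 1308 | 1216 | 1243 | 1380 | 1187 | 1168 | 1195 | 1260 | 1259 | 1216 | 1195 | 1164
UOP | Idle cycles per iteration | 8 | 0 | 2 | 13 | 4 | 0 | 2 | 10 | 6 | 0 | 1 | 4
UOP | Idling % | 7.3% | 0% | 1.9% | 11% | 4% | 0% | 2% | 9.3% | 5.6% | 0% | 0.9% | 4%
UOP | Throughput (Mbps) | 178 | 256 | 282 | 282 | 131 | 178 | 195 | 206 | 185 | 256 | 293 | 334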
where is the number of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes are very similar to those for the long codes ( ); on the contrary, the middle codes do not benefit from the same mechanism (i.e., ) and their throughput is scaled down by a factor 2/3.
The results of Table 1 are for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case example, where at most 14% and 11% of the decoding time is spent in idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 μs with EOP and even to 5.3 μs with ROP and 5.1 μs with UOP. This behaviour can be explained by considering that for the WLAN codes the term found in (6) for EOP is significantly non-null, while comparing (8) to (13), ROP and UOP basically differ for the term , which is negligible for the WLAN codes.
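The throughput and idling figures of Table 1 can be cross-checked numerically. The sketch below rests on assumptions inferred from the table rather than on formulas stated in the text: the throughput is taken as information bits per decoding latency, the number of frames decoded in parallel is taken as the integer ratio between 81 CNUs and the block-size, and the idling percentage is taken as the share of the latency spent in idle over 12 iterations at 240 MHz. All function names are hypothetical.

```python
def frames_in_parallel(block_size, n_cnu=81):
    """Frames decoded concurrently on n_cnu CNUs (assumed integer ratio)."""
    return n_cnu // block_size


def throughput_mbps(n, rate, block_size, latency_cycles, f_clk_mhz=240.0):
    """Information throughput: decoded information bits per decoding latency."""
    info_bits = frames_in_parallel(block_size) * n * rate
    return info_bits * f_clk_mhz / latency_cycles  # bits * MHz / cycles = Mbps


def idling_percent(idle_per_iteration, latency_cycles, iterations=12):
    """Share of the decoding latency spent in idle cycles."""
    return 100.0 * iterations * idle_per_iteration / latency_cycles


# Three entries of Table 1: original N = 648 and N = 1296 at rate 1/2,
# and UOP with N = 1944 at rate 5/6.
short_original = throughput_mbps(648, 1 / 2, 27, 2299)
middle_original = throughput_mbps(1296, 1 / 2, 54, 2106)
long_uop = throughput_mbps(1944, 5 / 6, 81, 1164)
short_original_idling = idling_percent(91, 2299)
```

Rounded to the nearest integer, these evaluate to 101, 74, and 334 Mbps and to 47% of idling, in agreement with the corresponding cells of Table 1.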
7.2. Error-Correction Performance
As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. Then, in case of full pipeline ( ), the performance of EOP is spoiled, while ROP and UOP only pay about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers and only few hazards arise without idle cycles.
Similarly to EOP, no received codeword is successfully decoded even at high SNRs (i.e., ) if the original code descriptors are simulated in full pipeline. This confirms once more the importance of idle cycles in a pipelined HLD decoder and motivates the need for an optimization technique.
8. Implementation Results
The complexity of an LDPC decoder for IEEE 802.11n codes was derived through logic synthesis on a low-power 65 nm CMOS technology targeting 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error-correction performance, input LLRs and c2v messages were represented on 5 bits, while internal SO and v2c messages on 7 bits.
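Table 2: IEEE 802.11n LDPC decoder complexity analysis.
Architecture | Resource | EOP | ROP | UOP
Arch. VA | Logic (Kgates) | 71.29 | 71.62 | 74.65
Arch. VA | RAM bits | 61,722 | 61,722 | 61,722
Arch. VA | ROM bits | 23,159 | 23,159 | 40,788
Arch. VB | Logic (Kgates) | 75.45 | 75.75 | 77.99
Arch. VB | RAM bits | 53,622 | 54,837 | 57,024
Arch. VB | SO memory redundancy | 29.2% | 29.2% | 33.3%
Arch. VB | c2v memory redundancy | 1.1% | 4.6% | 9.1%
Arch. VB | ROM bits | 36,582 | 36,582 | 51,849
Arch. VC | Logic (Kgates) | 71.83 | 72.14 | 74.60
Arch. VC | RAM bits | 53,217 | 53,217 | 53,784
Arch. VC | v2c memory redundancy | 29.2% | 29.2% | 33.3%
Arch. VC | ROM bits | 34,508 | 34,508 | 43,553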
Because of the partitioning of both the SO and the c2v memories, Arch. VB needs more logic resources and more memory bits than Arch. VC (both for data and configuration). The redundancy ratios and of the SO and c2v memory in Arch. VB, respectively, and of the v2c memory in Arch. VC, are also reported in Table 2.
As a matter of fact, the three architectures are very similar in complexity and performance, and, for a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level against the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.
Table 3: State-of-the-art LDPC decoder implementations.
Parameter | [this] | [7] | [10] | [25] | [26]
Technology | 65 nm CMOS | 0.16 μm CMOS 5LM | 0.13 μm TSMC CMOS | 0.18 μm 1.8 V TSMC CMOS | 0.13 μm CMOS
Algorithm | layered | flooding | layered | TDMP | flooding/layered
CPU arch. | serial | parallel | serial | parallel | serial
Nb. of CPUs | 81 | 1536 | 81 | 64 | 96
Msg. width (c2v + SO) | 5 + 7 | 4 + 4 | 5 + 6 | 4 + 5 | 6
Clock frequency (MHz) | 240 | 64 | 500 | 125 | 333
Rates | 1/2, 2/3, 3/4, 5/6 | | 1/2, 2/3, 3/4, 5/6 | | 1/2, 2/3, 3/4, 5/6
Codeword length, N | 648, 1296, 1944 | 1024 | 648, 1296, 1944 | 2048 |
Block-size, B | 27, 54, 81 | 1 | 27, 54, 81 | 64 |
Nb. of blocks | 79–88 | 4.33 | 79–88 | 96 | 76–88
Iterations | 12 | 64 | 5 | 10 | 16
Throughput (Mbps) | 262–401 | 1,024 | 541–1,618 | 640 | 177–999
Area, Kgates (mm²) | 100.7 (0.207) | 1750 (52.5) | 99.9 (1.85) | 220 (14.3) | 489.9 (2.964)
RAM bits | 56,376 | - | 55,344 | 51,680 | NA
Power consumption (W) | 0.162 | 0.69 | 0.238 | 0.787 | NA
Architectural efficiency (cycles/block/iter) | 1.103–1.306 | 0.231 | 1.361–1.521 | 0.417 | 1.01–1.31
Energy efficiency (pJ/bit/iter) | 33.7–51.5 | 10.5 | - | 123 | -
The comparison also reports the architectural efficiency, which represents the average number of clock cycles needed to update one block of the parity-check matrix. In decoders based on serial functional units it is at least 1, and the higher it is, the less efficient the architecture. In fact, it can reach 1 only when the dependence between consecutive layers is resolved at the code design level. This is the case for two WiMAX codes (specifically, the class 1/2 and class 2/3B codes), which are hazard-free (or layered) "by construction", thus explaining the very low value achieved by [26]. However, [26] is as efficient as our design on the remaining non-layered WiMAX codes, but the authors do not perform layered decoding on such codes.
For decoders with parallel processing units (see [7, 25]), the architectural efficiency becomes a measure of the parallelization used in the processing units, and it can be related to the average check node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages divided by the number of functional units. With E the number of edges in the code, this ratio is an index of the parallelization used in the processors.
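As an illustrative check, the efficiency figures in the comparison table appear consistent with dividing the clock cycles spent per codeword and per iteration by the number of blocks. The sketch below assumes this relation, that is, efficiency = f_clk · N / (throughput · iterations · blocks). Both the formula and the function name `architectural_efficiency` are assumptions inferred from the tabulated values, not quoted from the text.

```python
def architectural_efficiency(f_clk_hz, n_bits, throughput_bps, iterations, n_blocks):
    """Average clock cycles per block and per iteration (assumed relation)."""
    # Clock cycles needed to decode one codeword at the given throughput.
    cycles_per_codeword = f_clk_hz * n_bits / throughput_bps
    # Share of those cycles spent in a single decoding iteration.
    cycles_per_iteration = cycles_per_codeword / iterations
    return cycles_per_iteration / n_blocks


# Clock, codeword length, throughput, iterations, and block count
# as listed in the comparison table for the parallel decoders [7] and [25].
eta_ref7 = architectural_efficiency(64e6, 1024, 1024e6, 64, 4.33)
eta_ref25 = architectural_efficiency(125e6, 2048, 640e6, 10, 96)

print(f"[7]:  {eta_ref7:.3f} cycles per block per iteration")
print(f"[25]: {eta_ref25:.3f} cycles per block per iteration")
```

Under this assumption, the two parallel designs fall well below 1, whereas the serial designs in the table stay above it.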
The energy efficiency is also reported, expressed as the decoding energy per decoded bit and per iteration and derived from the power consumption. The latter was estimated with Synopsys Power Compiler, was averaged over three different SNRs (corresponding to different convergence speeds), and includes the power dissipated in the memory units. In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
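The tabulated energy figures can be reproduced, to rounding, by dividing the power consumption by the product of throughput and number of iterations. The following sketch assumes that relation. The function name `energy_per_bit_per_iteration_pj` is hypothetical, and the formula is inferred from the table rather than quoted from the text.

```python
def energy_per_bit_per_iteration_pj(power_w, throughput_bps, iterations):
    """Decoding energy in pJ per bit and per iteration (assumed relation)."""
    # Energy per decoded bit, spread over the iterations run per codeword.
    joules = power_w / (throughput_bps * iterations)
    return joules * 1e12  # convert J to pJ


# Power, throughput, and iteration counts as listed in the comparison table.
ours_best = energy_per_bit_per_iteration_pj(0.162, 401e6, 12)
ours_worst = energy_per_bit_per_iteration_pj(0.162, 262e6, 12)
ref7 = energy_per_bit_per_iteration_pj(0.69, 1024e6, 64)
ref25 = energy_per_bit_per_iteration_pj(0.787, 640e6, 10)

print(f"this work: {ours_best:.1f} to {ours_worst:.1f} pJ/bit/iter")
print(f"[7]: {ref7:.1f} pJ/bit/iter, [25]: {ref25:.1f} pJ/bit/iter")
```

Normalizing by the number of iterations keeps the comparison fair across decoders that run very different iteration counts (from 10 to 64 among the designs with a reported energy figure).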
Since the design in [10] targets the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer inspection is warranted. Thanks to the idle optimization, our solution is more efficient in terms of throughput, with an architectural efficiency of 1.103–1.306 against 1.361–1.521 cycle/bit/iter. Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 2, since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error correction performance are not available in [10], penalties are expected in view of the smaller accuracy used to represent c2v (5 bits) and SO (6 bits) messages.
9. Conclusions
An effective method to counteract the pipeline hazards typical of block-serial layered decoders of LDPC codes has been presented in this paper. The method rearranges the decoding operations in order to minimize the number of idle cycles inserted between updates, and it results in three different strategies named equal, reversed, and unconstrained output processing (EOP, ROP, and UOP).
Then, different semi-parallel VLSI architectures of a layered decoder for architecture-aware LDPC codes supporting the methods above have been described and applied to the design of a decoder for IEEE 802.11n LDPC codes.
The synthesis of the proposed decoder on a 65 nm low-power CMOS technology reached a clock frequency of 240 MHz, which corresponds to a net throughput ranging from 131 to 334 Mbps with UOP and 12 decoding iterations, outperforming similar designs.
This work has proved that the layered decoding algorithm can be extended, with neither modifications nor approximations, to every LDPC code, regardless of the interconnections in its parity-check matrix, provided that idle cycles are used to maintain the dependencies between the updates in the algorithm.
Also, the paradigm of code-decoder codesign has been reinforced in this work: not only have the described techniques proved very effective in counteracting the pipeline hazards, but they also provide useful guidelines for the design of good, hazard-free LDPC codes. In this respect, the assumption that consecutive layers must not share soft outputs, as is the case for the WiMAX class 1/2 and 2/3B codes, is overcome, thus leaving more room for the optimization of the code performance at the level of the code design.
References
[1] Satellite digital video broadcasting of second generation (DVB-S2). ETSI Standard EN 302 307, February 2005.
[2] IEEE Computer Society: Air Interface for Fixed and Mobile Broadband Wireless Access Systems. IEEE Std 802.16e-2005, February 2006.
[3] IEEE P802.11n/D1.06: Draft amendment to Standard for high throughput. 802.11 Working Group, November 2006.
[4] Gallager R: Low-density parity-check codes, Ph.D. dissertation. Massachusetts Institute of Technology; 1960.
[5] MacKay D, Neal R: Good codes based on very sparse matrices. Proceedings of the 5th IMA Conference on Cryptography and Coding, 1995.
[6] Mansour MM, Shanbhag NR: High-throughput LDPC decoders. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2003, 11(6):976–996.
[7] Blanksby A, Howland C: A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits 2002, 37(3):404–412. doi:10.1109/4.987093
[8] Zhong H, Zhang T: Block-LDPC: a practical LDPC coding system design approach. IEEE Transactions on Circuits and Systems I 2005, 52(4):766–775.
[9] Hocevar DE: A reduced complexity decoder architecture via layered decoding of LDPC codes. Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '04), 2004, 107–112.
[10] Gunnam K, Choi G, Wang W, Yeary M: Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007, 1645–1648.
[11] Bhatt T, Sundaramurthy V, Stolpman V, McCain D: Pipelined block-serial decoder architecture for structured LDPC codes. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), April 2006, 4: 225–228.
[12] Fewer CP, Flanagan MF, Fagan AD: A versatile variable rate LDPC codec architecture. IEEE Transactions on Circuits and Systems I 2007, 54(10):2240–2251.
[13] Boutillon E, Tousch J, Guilloud F: LDPC decoder, corresponding method, system and computer program. US patent no. 7,174,495 B2, February 2007.
[14] Rovini M, Rossi F, Ciao P, L'Insalata N, Fanucci L: Layered decoding of non-layered LDPC codes. Proceedings of the 9th Euromicro Conference on Digital System Design (DSD '06), August-September 2006.
[15] Tanner R: A recursive approach to low complexity codes. IEEE Transactions on Information Theory 1981, 27(5):533–547. doi:10.1109/TIT.1981.1056404
[16] Zhang H, Zhu J, Shi H, Wang D: Layered approx-regular LDPC: code construction and encoder/decoder design. IEEE Transactions on Circuits and Systems I 2008, 55(2):572–585.
[17] Echard R, Chang SC: The rotation low-density parity check codes. Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '01), November 2001, 980–984.
[18] Guilloud F, Boutillon E, Tousch J, Danger JL: Generic description and synthesis of LDPC decoders. IEEE Transactions on Communications 2006, 55(11):2084–2091.
[19] Xiao H, Banihashemi AH: Graph-based message-passing schedules for decoding LDPC codes. IEEE Transactions on Communications 2004, 52(12):2098–2105. doi:10.1109/TCOMM.2004.838730
[20] Sharon E, Litsyn S, Goldberger J: Efficient serial message-passing schedules for LDPC decoding. IEEE Transactions on Information Theory 2007, 53(11):4076–4091.
[21] Zarkeshvari F, Banihashemi A: On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes. Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '02), November 2002, 2: 1349–1353.
[22] Jones C, Valles E, Smith M, Villasenor J: Approximate-MIN constraint node updating for LDPC code decoding. Proceedings of the IEEE Military Communications Conference (MILCOM '03), October 2003, 1: 157–162.
[23] Rovini M, Rossi F, L'Insalata N, Fanucci L: High-precision LDPC codes decoding at the lowest complexity. Proceedings of the 14th European Signal Processing Conference (EUSIPCO '06), September 2006.
[24] Rovini M, Gentile G, Fanucci L: Multi-size circular shifting networks for decoders of structured LDPC codes. Electronics Letters 2007, 43(17):938–940. doi:10.1049/el:20071157
[25] Mansour MM, Shanbhag NR: A 640-Mb/s 2048-bit programmable LDPC decoder chip. IEEE Journal of Solid-State Circuits 2006, 41(3):684–698. doi:10.1109/JSSC.2005.864133
[26] Brack T, Alles M, Kienle F, Wehn N: A synthesizable IP core for WiMax 802.16E LDPC code decoding. Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '06), September 2006, 1–5.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.