A faster way to run Reed-Solomon decoders
Keywords: Reed-Solomon decoder, VHDL, time multiplexing
Over the last 20 years, Reed-Solomon (RS) codes have become commonplace in communications channels, and they have proved particularly useful in channels prone to bursts of errors. RS codes are also commonly used concatenated with a convolutional code.
Now, there is a new RS decoder architecture that can process multiple symbols per clock cycle. The number of bytes processed per clock cycle, along with the other variables defining the RS code, is implemented as generics in the VHDL design to provide a choice between optimizing for area and optimizing for speed. These generics let system designers make trade-offs quickly without redesigning the decoder.
The choice of what code to use is wide open. Most of those used today are based on the Galois Field GF(256). However, there are examples of the use of GF(128), GF(512) and others. Likewise, there are multitudes of generator polynomials, as well as primitive elements, to choose from, though only a few are commonly used.
To add to the choices, there is more than one algorithm that can be used to decode RS codes; the most common are Euclid's algorithm and the Berlekamp-Massey algorithm. Since different systems require different RS codes, the engineer should choose the decoder that fits best in the system being designed. Some systems have set requirements as to which code must be used; in other cases it is useful to experiment with various codes, and in yet other systems more than one code is required. The configurable programmable decoder gives system designers the flexibility to adapt the decoder to specific system requirements without redesign. It also allows users to change codes on the fly where more than one RS code is needed in the system.
Decoding process
In the architecture described, Euclid's algorithm is used as the decoding method. In decoding an RS code there are six steps to go through. These are: 1) generate syndromes; 2) generate error magnitude polynomial; 3) generate error location polynomial; 4) evaluate error location polynomial; 5) evaluate error magnitude polynomial and 6) correct errors.
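Steps 2 and 3 can be sketched compactly in software. The following is an illustrative Python sketch (not the article's VHDL implementation) of Euclid's algorithm producing the error-magnitude polynomial Omega(x) and the error-location polynomial Lambda(x) from the syndromes. GF(256) with primitive polynomial 0x11D is an assumption for illustration; polynomials are coefficient lists, lowest power first, and R is assumed even.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(256) elements (carry-less product, reduced mod poly)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return p

def gf_pow(a, n):
    p = 1
    for _ in range(n):
        p = gf_mul(p, a)
    return p

def gf_inv(a):
    return gf_pow(a, 254)  # a^254 = a^-1 in GF(256)

def poly_deg(p):
    for d in range(len(p) - 1, -1, -1):
        if p[d]:
            return d
    return -1  # zero polynomial

def poly_add(a, b):
    # Addition and subtraction are both XOR in GF(2^m).
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) ^ (b[i] if i < len(b) else 0)
            for i in range(n)]

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out

def poly_divmod(a, b):
    """Long division of polynomial a by b over GF(256)."""
    rem, db = a[:], poly_deg(b)
    q = [0] * (max(poly_deg(a) - db, 0) + 1)
    inv_lead = gf_inv(b[db])
    while poly_deg(rem) >= db:
        d = poly_deg(rem)
        c = gf_mul(rem[d], inv_lead)
        q[d - db] = c
        for i in range(db + 1):
            rem[d - db + i] ^= gf_mul(c, b[i])
    return q, rem

def euclid(s, r):
    """Extended Euclid on x^r and the syndrome polynomial S(x).

    Iterates until the remainder degree drops below r/2, then returns
    (Lambda, Omega) -- the error-location and error-magnitude
    polynomials, up to a common scale factor.
    """
    r_prev, r_cur = [0] * r + [1], s[:]   # x^r and S(x)
    t_prev, t_cur = [0], [1]
    while poly_deg(r_cur) >= r // 2:
        q, rem = poly_divmod(r_prev, r_cur)
        r_prev, r_cur = r_cur, rem
        t_prev, t_cur = t_cur, poly_add(t_prev, poly_mul(q, t_cur))
    return t_cur, r_cur
```

For a single error of magnitude e at position p, the syndromes are S_i = e·(alpha^p)^i, and the sketch yields a degree-1 Lambda whose root is alpha^(-p), locating the error.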
Several ways to accomplish these steps have been published and patented. One is to build a full pipeline in which a separate piece of hardware is dedicated to each step, so that six pieces of hardware work on six different data blocks at once. This is hardware-intensive, but it allows data to be processed at the rate at which it comes in, as long as the time it takes to perform Steps 2 and 3 is less than the time it takes to do Step 1.
Another approach is to use one piece of hardware to do all steps and "time-share." Obviously, this will reduce the data rate as well as the size of the hardware. Between these two extremes are a host of combinations that imply size and delay solutions between the two described.
In any of the above-mentioned approaches, partial syndromes are generated as the input data is fed in one byte at a time. There are R circuits generating R partial syndromes (where R is the number of parity bytes used in the code). Traditionally, this has limited the data rate of the circuit to 1 byte per clock, assuming there is enough hardware to handle all the decode steps listed above.
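The byte-serial partial-syndrome circuits can be sketched as follows. This is an illustrative Python model, not the article's VHDL: each of the R accumulators mirrors one syndrome circuit applying Horner's rule, S <- S·alpha^i + byte, once per incoming byte. GF(256) with primitive polynomial 0x11D and first root alpha^0 are assumptions for illustration; the article's decoder leaves such parameters programmable.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(256) elements (carry-less product, reduced mod poly)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return p

def gf_pow(a, n):
    """Raise a GF(256) element to the n-th power by repeated multiplication."""
    p = 1
    for _ in range(n):
        p = gf_mul(p, a)
    return p

def syndromes(block, r, first_root=0):
    """Compute the R partial syndromes S_i = r(alpha^i), one byte per 'clock'.

    Bytes arrive highest-order coefficient first; each accumulator
    performs one Horner step S <- S*alpha^i + byte per byte.
    """
    roots = [gf_pow(2, first_root + i) for i in range(r)]  # alpha = 2
    s = [0] * r
    for byte in block:
        for i in range(r):
            s[i] = gf_mul(s[i], roots[i]) ^ byte
    return s
```

A block with all syndromes zero is a valid codeword; any nonzero syndrome indicates errors to be located and corrected in the later steps.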
In a new approach to generating syndromes, the partial syndromes are updated using two input bytes at a time. This doubles the data rate. Similar improvements can be made in evaluating the polynomials (Steps 4 and 5); thus, the entire decode process requires about half the number of clocks compared with previous implementations. The trade-offs of size and clock speed will be discussed later. It should be noted that extending this approach to 4 or even 8 bytes per clock is possible.
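The two-bytes-per-clock update folds two Horner steps into one: S <- S·alpha^(2i) + b1·alpha^i + b2, so each step retires two input bytes while producing the same syndromes. Below is an illustrative Python sketch (again assuming GF(256) with polynomial 0x11D, and an even block length), with the byte-serial version included for comparison:

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(256) elements (carry-less product, reduced mod poly)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return p

def gf_pow(a, n):
    """Raise a GF(256) element to the n-th power by repeated multiplication."""
    p = 1
    for _ in range(n):
        p = gf_mul(p, a)
    return p

def syndromes_serial(block, r, first_root=0):
    """Reference model: one byte per step (S <- S*alpha^i + byte)."""
    roots = [gf_pow(2, first_root + i) for i in range(r)]
    s = [0] * r
    for byte in block:
        for i in range(r):
            s[i] = gf_mul(s[i], roots[i]) ^ byte
    return s

def syndromes_2x(block, r, first_root=0):
    """Two bytes per step: halves the step count, same syndromes."""
    roots = [gf_pow(2, first_root + i) for i in range(r)]
    roots_sq = [gf_mul(x, x) for x in roots]  # precomputed alpha^(2i)
    s = [0] * r
    for j in range(0, len(block), 2):
        b1, b2 = block[j], block[j + 1]
        for i in range(r):
            s[i] = gf_mul(s[i], roots_sq[i]) ^ gf_mul(b1, roots[i]) ^ b2
    return s
```

Extending to 4 or 8 bytes per clock follows the same pattern, with precomputed alpha^(4i) or alpha^(8i) factors and additional correction terms per step.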
In the past, most decoders have been code-specific and each application required specialized hardware, but the decoder designed around the new architecture separates the code variables into two groups. First is the at-design-time group; second is the at-execution-time group.
"At design time" means that when the designers implement their system, they choose a set of parameters that will bound the operation of the decoder. "At execution time" means that during normal operation, the decoder can be reprogrammed to change the code as needed.
At design time, the user can choose the following variables: the number of bytes to process per clock; the maximum number of bits per byte; the maximum number of parity bytes; a fixed or programmable primitive polynomial; and the maximum block length.
The first of these is the bytes-per-clock choice described earlier; the second picks the maximum size of the Galois Field to use; the third gives a range for R. Just as with the bits-per-byte choice, the decoder can handle any code where R is smaller than or equal to the maximum. The fourth choice determines whether the decoder works with only one specific GF or with any GF smaller than or equal to the size set by the second choice.
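The relationship between the design-time bounds and a runtime-programmed code can be modeled as below. This is an illustrative Python sketch; the field names are invented for the example and are not the actual VHDL generic names.

```python
from dataclasses import dataclass

@dataclass
class DecoderGenerics:
    """Hypothetical design-time parameter set bounding the decoder."""
    bytes_per_clock: int = 2       # symbols processed per clock (1, 2, 4, ...)
    max_bits_per_symbol: int = 8   # bounds the field at GF(2^8) = GF(256)
    max_parity_symbols: int = 16   # upper bound on R
    programmable_poly: bool = True # fixed vs. programmable primitive polynomial
    max_block_length: int = 255

    def accepts(self, bits_per_symbol, parity_symbols, block_length):
        """True if a code programmed at execution time fits the bounds."""
        return (bits_per_symbol <= self.max_bits_per_symbol
                and parity_symbols <= self.max_parity_symbols
                and block_length <= self.max_block_length
                and block_length < 2 ** bits_per_symbol)  # n <= 2^m - 1
```

Any code satisfying these checks can then be selected at execution time without touching the hardware; a code exceeding any bound requires a new design-time configuration.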
A possible macro architecture for this implementation could follow a time-multiplexing architecture. Other alternatives are available that speed up the decode process even further. As mentioned earlier, one can implement the decoder as a pipeline. If the syndrome techniques described earlier for reducing the number of clocks per byte are combined with this macro architecture, the decoder is able to decode two or four blocks in the time it takes for one block to be transmitted.
Torjell Berg and Aaron Brennan
American Microsystems Inc.