EE Times-Asia

Achieve better DSP code from compilers

Posted: 18 Jun 2007

Keywords: compilers, C language, C code, modulo addressing, DSP

Compiling DSP application code is not a push-button process, at least not unless you're willing to settle for inefficient code. Signal processing algorithms (and the processors commonly used to run them) have specialized characteristics, and compilers usually can't generate efficient code for them without some level of programmer intervention.

Learning how to coax efficient signal processing object code out of a compiler is an important skill and can reduce (or eliminate) the amount of time you'll spend optimizing at the assembly level. This article will explain how to get the best performance out of whatever compiler you're using and how to avoid getting blindsided by common compiler pitfalls.

Learning by disassembling
A useful tool for understanding compilers' strengths and weaknesses is the disassembler. This tool takes object code and generates the corresponding sequence of assembly language instructions, allowing you to see exactly how the compiler implemented your code. You'll be able to tell whether it did a good job of using specialized processor features and parallelism and whether the resulting code looks more or less as expected.

You'll often find some surprising results. It is not uncommon to find compilers generating incorrect code or overlooking seemingly obvious optimizations. Sometimes, assemblers even alter hand-coded assembly files, which you may not realize unless you use a disassembler to view the final code. This can happen if, for example, you unknowingly use a pseudo-instruction that expands into a sequence of multiple native instructions.

DSPs (and many general-purpose processors) have specialized hardware or instructions to speed up common signal processing algorithms (such as filters and FFTs). These include, for example, single-cycle multiply-accumulates (MACs), specialized addressing modes (such as modulo and bit-reversed addressing), zero-overhead loops and saturation.

When compiling signal processing application code, figure out which (if any) of these instructions and hardware the compiler is capable of using and under what circumstances. This will allow you to write your C code in a way that helps the compiler recognize opportunities to use specialized hardware features.

Experiment with the C code and use the disassembler to observe the effect on the compiler's ability to create efficient object code. Each compiler has its own quirks and it's worth the effort to spend some time learning how to help it do a good job.

Be careful with data types
When defining data types in C (rather than in assembly), you must understand how the compiler will implement them on your target processor because this can significantly affect the efficiency of the compiled code.

The C standard defines several data types, but the sizes of these types are not fixed; they differ across processors. From a code-performance perspective, the key thing to understand is that the size used by the compiler won't necessarily fit the native data word width of the processor. For this reason, if you use the wrong data type in C, you may incur a huge penalty in the compiled code. If your processor only supports 16-bit integers, for example, you don't want to define the data in your inner loop as 64-bit doubles.

The C data types are as follows:

- int is the primary data type for indexing and counting.
- long provides at least 32 bits (mandated by the C standard). On most processors whose native word is not 32 bits, long arithmetic requires library support.
- long long provides at least 64 bits. This type is not supported on most 16-bit processors.
- short is 16 bits on many processors, but not all.
- char is the smallest addressable unit. Many C programs assume that a char is 8 bits, which can be problematic because on a DSP it usually is not. Moreover, note that the sizeof operator in C returns sizes in units of char, which, again, may not be 8 bits.

Table 1: The int and char sizes for a selection of DSP processors.

To further complicate matters, signal processing code implemented on fixed-point processors relies heavily on fractional data types, such as Q.15, in which a 16-bit word represents a fractional value between -1 and just under 1. DSPs are designed for efficient operations on fractional data, but ANSI C doesn't recognize fractional types. If you stick with ANSI C, you're likely to use integer data types and shifts to implement fractional arithmetic. But when the compiler encounters this, the resulting code can be extremely inefficient. To address this, many DSP compilers support fractional data types via C language extensions.

Signal processing algorithms are often initially developed using floating-point data types and then ported to fixed-point processors. If you specify a floating-point data type and the target processor doesn't natively support floating-point operations (as is true with most DSPs), then the compiled code will emulate floating-point math in software, which is extremely slow.

C language DSP extensions
Some compilers support DSP-oriented C language extensions. Typical extensions include support for DSP-oriented data types, such as fractional and complex data, and for specifying multiple memory spaces, including separate instruction and data memories. They may also support common DSP features such as modulo addressing.

But these extensions are not standardized across vendors, so using them can sacrifice portability. (ISO has developed DSP-oriented extensions to C as part of "Embedded C," but Embedded C has not yet been widely adopted.) And in general, you'll need to supervise the compiler closely to verify that the extensions behave as expected.

Using optimization switches
It's common in signal processing applications to find that optimizations which improve speed come at the cost of additional memory use. Hence, the programmer or compiler must decide how to trade off speed vs. memory use. Most compilers allow the programmer to set compiler switches that govern how aggressively they want the code to be optimized and whether to optimize for maximum speed or minimum code size.

Compiler switches are quite useful, but do not use them blindly. Directing the compiler to speed-optimize the entire application may speed up small sections of code viewed in isolation yet slow down overall performance. How is that possible? If the compiler's optimizations increase code size to the point where key portions no longer fit in L1 memory, the cost of repeatedly paging in the needed code (or thrashing the cache) may offset any localized gains. In practice, you must profile the application and select compiler optimization levels on a file-by-file basis, balancing localized optimizations against overall image size relative to available memory.

Using intrinsics
Intrinsics are meta-instructions that are embedded within C code and are translated by the compiler into a predefined sequence of assembly instructions. (Most intrinsics translate into a single assembly instruction, but occasionally you'll come across one that requires multiple instructions.) Using intrinsics gives the programmer a way to access specialized processor features without having to write assembly code.

Many compilers support intrinsics. If you use them, it is important to verify that the compiler's output is as expected. For example, with one compiler, a single intrinsic resulted in three assembly instructions: one add and two instructions that set mode bits. Though the latter two instructions needed to be executed only once, they were placed within an inner loop alongside the add, causing a serious performance penalty.

Inline assembly
Many processor compilers support the use of inline assembly code, using the asm() construct within a C program. This feature causes the compiler to insert the specified assembly code into the compiler's assembly code output. Inline assembly is a good way to get access to specialized processor features and it may execute faster than calling assembly code in a separate function. However, in some circumstances the use of inline assembly may adversely affect the performance of the surrounding C code.

The problem is that compiler optimizations often depend on the compiler "understanding" the intent of the code, and inline assembly can interfere with that process. For example, an inserted assembly instruction might store data to memory, so the compiler may have to assume that all variables could be modified by the inline code. This can interfere with the compiler's ability to keep variables in registers.

Many compilers do not optimize code contained within an asm() statement. Moreover, on some compilers, use of inline assembly disables most optimization in the C code surrounding the asm() statement.

Programmers sometimes insert inline assembly code that uses specific processor registers, assuming this will not conflict with the compiler's register use. If a conflict does arise, however, the compiler will almost never detect it, and this can introduce harmful bugs. Unfortunately, interactions between inline assembly and the compiler are not always well-documented.

Accessing parallelism
Because many DSP algorithms are highly parallel, most processors intended for DSP can execute multiple operations or instructions in parallel. Unfortunately, C is by nature a sequential language; thus, compilers often have a difficult time recognizing opportunities to parallelize operations.

Many DSPs and high-performance general-purpose processors also use single-instruction, multiple data (SIMD) operations to improve their parallelism. Although SIMD is effective for speeding up signal processing code, it is difficult for compilers to use SIMD features well. Most compilers do not even try to use SIMD, instead leaving it to the programmer to use assembly code, intrinsics or off-the-shelf software components for the inner loops where SIMD tends to be most useful.

Using a compiler to create signal processing software requires a different set of skills than using a compiler for other types of code. Getting the compiler to produce good, efficient signal processing code requires a solid understanding of the compiler, its DSP-oriented extensions and the target processor architecture, and of how these things interrelate. For the best results, spend some time with a disassembler to get a feel for the compiler's capabilities and to become aware of potential compiler pitfalls. And as a final word, don't take anything for granted!

This article was contributed by BDTI.
