Combine C and assembly code for maximum DSP performance
Keywords: C coding, assembly coding, DSP, Modulo Mechanism
Assembly coding has the following advantages:
- Assembly code can take advantage of a processor's unique instructions and its specialized hardware resources. C code, on the other hand, is generic and must support many hardware platforms, so it is difficult for C to express platform-specific code.
- The assembly programmer is usually very familiar with the application and can make assumptions that are unavailable to the compiler.
- The assembly programmer can use human creativity; the compiler, advanced as it may be, is merely an automatic program.
On the other hand, assembly coding has the following disadvantages:
- The assembly programmer has to handle time-consuming machine-level issues such as register allocation and instruction scheduling. With C code, these issues are taken care of by the compiler.
- Assembly coding requires specialized knowledge of the DSP architecture and its instruction set, whereas C coding only requires knowledge of the C language, which is rather common.
- With assembly code, it is extremely difficult and time-consuming to port applications from one platform to another. Porting is relatively easy for C applications.
The listing demonstrates the use of dedicated hardware mechanisms for highly optimized assembly code. The C implementation on the left side creates a cyclic buffer p1 using modulo arithmetic. In the highly optimized assembly code on the right, an equivalent buffer is created using the Modulo Mechanism of the CEVA-TeakLite-III DSP Core. The Modulo Mechanism automatically performs the modulo arithmetic whenever there is an update to the buffer pointer (r0 in this case). This arithmetic occurs in the same cycle as the pointer update, so the assembly code is much more efficient than the C code, which would generate separate instructions for the modulo arithmetic.
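As a minimal sketch of the C side of this comparison, a cyclic buffer that wraps with explicit modulo arithmetic might look like the following. Only the buffer name p1 comes from the article; `cyclic_buf_t`, `cyclic_put` and `BUF_LEN` are illustrative assumptions and are not taken from the original listing.

```c
#include <stdint.h>

#define BUF_LEN 8u /* hypothetical buffer length, chosen for illustration */

/* Cyclic buffer: idx always stays in the range [0, BUF_LEN). */
typedef struct {
    int16_t  data[BUF_LEN];
    uint32_t idx; /* position of the next write */
} cyclic_buf_t;

/* Store one sample and advance the index with explicit modulo arithmetic.
   On a generic target the wrap-around costs extra instructions; a hardware
   modulo mechanism would fold it into the pointer update itself. */
void cyclic_put(cyclic_buf_t *p1, int16_t sample)
{
    p1->data[p1->idx] = sample;
    p1->idx = (p1->idx + 1u) % BUF_LEN;
}
```

Every call to `cyclic_put` pays for the wrap-around separately from the store, which is the overhead that the hardware mechanism described above removes.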
The question is where to draw the line between C code and assembly code. The DSP engineer needs to define clear objectives for the application. Typically, the objectives are cycle count, code size and data size. Once these are defined, the application should be written and built entirely in C, and a profiler should then be used to analyze its performance.
In some rare cases, mostly in control applications, C-level coding is sufficient.
In most cases, the initial C-level version of the application does not meet one or more of the objectives. There are measures that can be taken at the C level to improve performance before resorting to assembly coding. Once all C-level measures have been exhausted and assembly coding has begun, it is highly recommended to keep the original C implementation. This eases debugging and makes it possible to return to the C implementation once the conditions are right.
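One possible way to keep the original C implementation alongside its assembly replacement is a build-time switch, sketched below. The macro `USE_ASM_DOT_PRODUCT` and the functions `dot_product_c` and `dot_product_asm` are hypothetical names chosen for illustration; the article does not prescribe a particular mechanism.

```c
#include <stdint.h>

/* Original C implementation, kept as the reference version. */
int32_t dot_product_c(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

#ifdef USE_ASM_DOT_PRODUCT
/* Hand-written version living in a separate assembly file (not shown).
   It has to follow the compiler's calling convention for this prototype. */
extern int32_t dot_product_asm(const int16_t *a, const int16_t *b, int n);
#define dot_product dot_product_asm
#else
/* Default: callers use the C reference implementation. */
#define dot_product dot_product_c
#endif
```

Callers only ever use `dot_product`, so switching between the two versions is a matter of defining or removing one macro at build time.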
Listing: The C implementation on the left side creates a cyclic buffer p1 using modulo arithmetic. In the highly optimized assembly code on the right, an equivalent buffer is created.
Critical functions
The assembly portion of the code should be kept to a minimum. For this purpose, the performance results reported by the profiler should be analyzed and the critical functions of the application should be identified. Critical functions are those that consume the most execution time and ought to be rewritten in assembly to meet performance objectives. Once the two or three most critical functions have been rewritten, it is time to take another performance measurement. If the application still does not meet its objectives, additional critical functions should be defined and rewritten in assembly. This process iterates until the performance objectives are met.
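The selection step can be sketched in C as follows, assuming the profiler's report has already been loaded into a table of per-function cycle counts. The type `profile_entry_t` and the functions `rank_by_cycles` and `top_share` are hypothetical helpers, not part of any particular profiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const char *name;   /* function name as reported by the profiler */
    uint64_t    cycles; /* cycles spent in that function */
} profile_entry_t;

/* qsort comparator: larger cycle counts sort first. */
static int by_cycles_desc(const void *lhs, const void *rhs)
{
    const profile_entry_t *a = (const profile_entry_t *)lhs;
    const profile_entry_t *b = (const profile_entry_t *)rhs;
    if (a->cycles < b->cycles) return 1;
    if (a->cycles > b->cycles) return -1;
    return 0;
}

/* Sort entries so that the most expensive functions come first. */
void rank_by_cycles(profile_entry_t *entries, size_t count)
{
    qsort(entries, count, sizeof entries[0], by_cycles_desc);
}

/* Fraction (0.0 to 1.0) of all cycles spent in the first top_n entries. */
double top_share(const profile_entry_t *entries, size_t count, size_t top_n)
{
    uint64_t total = 0, top = 0;
    for (size_t i = 0; i < count; i++) {
        total += entries[i].cycles;
        if (i < top_n)
            top += entries[i].cycles;
    }
    return total ? (double)top / (double)total : 0.0;
}
```

After ranking, `top_share(entries, count, 3)` gives a rough indication of how much execution time rewriting the three most expensive functions could affect.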
When writing assembly code that will later be combined with C code, the assembly programmer has to be aware of compiler conventions and assumptions. One of the important compiler conventions is the function calling convention, also known as the function argument passing convention. This convention describes how the compiler passes arguments when one function calls another. For an assembly function to be called successfully from a C function and vice versa, the assembly function must receive and pass its arguments through the hardware resources defined by the function calling convention, which are usually registers or stack memory.
Usage convention
The assembly programmer must also know the compiler's register usage convention. This convention divides the hardware registers into callee-saved (or caller-used) and callee-used (or caller-saved) registers. The compiler assumes that callee-saved registers maintain their value across function calls. If assembly programmers want to use such registers, they must back them up first, and then restore their contents before returning to C code. In contrast, callee-used registers are not assumed to maintain their values across function calls. This means that assembly programmers can use these registers without a backup. However, they need to bear in mind that when their assembly functions call C functions, these registers can be overwritten by the callee.
In addition to the calling convention and the register usage convention, which are defined for every compiler, some compilers may make further assumptions regarding hand-written assembly code. These assumptions are often specific to the compiler and should be well documented by the compiler's provider.
One such assumption concerns the location of specific instructions in hand-written assembly code.
Most compilers for embedded platforms, especially those intended for DSP programming, have a rich set of features for connectivity between C and assembly code. Most of these features are not part of the standard C language and therefore are referred to as C language extensions. These include inline assembly, binding a hardware register to a C variable, section attribute, user-defined calling convention, compiler intrinsics and assembly intrinsics.
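As a hedged illustration of a compiler intrinsic, the sketch below counts leading zero bits, an operation often used when normalizing fixed-point data. It uses `__builtin_clz`, which is a GCC and Clang builtin and not an intrinsic of the CEVA tool chain; the article does not list the vendor's intrinsics, so the wrapper `leading_zeros32` is purely an assumption made for demonstration.

```c
#include <stdint.h>

/* Count leading zero bits of a 32-bit value; returns 32 for x == 0.
   Assumes unsigned int is 32 bits wide when the builtin path is used. */
static inline int leading_zeros32(uint32_t x)
{
    if (x == 0u)
        return 32; /* __builtin_clz is undefined for a zero argument */
#if defined(__GNUC__) || defined(__clang__)
    /* Intrinsic path: typically maps to a single machine instruction. */
    return __builtin_clz(x);
#else
    /* Portable fallback in standard C: shift until the top bit is set. */
    int n = 0;
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

The caller sees an ordinary C function in both cases, which is what makes intrinsics less intrusive than moving the surrounding code to assembly.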
Figure 2: The performance improvement throughout the optimization process of a critical function of the H.264 encoder.
Debugging
The first step when debugging a mixed application is to isolate the problem. Assuming the C-level implementation of the assembly code has been maintained, and assuming the C-level implementation works correctly, it is relatively easy to swap out assembly functions for their C implementations and re-test the application. To pinpoint problems rapidly, the programmer can use an iterative process: at each step, half of the suspected functions are switched to their C implementations, so that at each step the programmer is testing only half as many functions as in the previous step.
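The halving process can be sketched as a binary search over the optimized functions. This is a minimal sketch that assumes exactly one assembly function is faulty and that the all-C configuration passes; `run_test_fn`, `find_faulty_function`, `MAX_FUNCS` and the stand-in `demo_run_test` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 64 /* hypothetical upper bound on optimized functions */

/* Hypothetical test hook: configures the application so that function i
   uses its assembly version when use_asm[i] is true and its C version
   otherwise, runs the test, and returns true if the test passes. */
typedef bool (*run_test_fn)(const bool *use_asm, size_t count);

/* Returns the index of the faulty assembly function, or count if the
   input is out of range. */
size_t find_faulty_function(run_test_fn run_test, size_t count)
{
    bool use_asm[MAX_FUNCS];
    size_t lo = 0, hi = count; /* current suspects are [lo, hi) */

    if (count == 0 || count > MAX_FUNCS)
        return count;

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        /* Keep assembly only for the first half of the suspects;
           every other function runs its C implementation. */
        for (size_t i = 0; i < count; i++)
            use_asm[i] = (i >= lo && i < mid);
        if (run_test(use_asm, count))
            lo = mid; /* first half is clean: fault is in [mid, hi) */
        else
            hi = mid; /* failure reproduced: fault is in [lo, mid) */
    }
    return lo;
}

/* Stand-in for a real test run, for demonstration only: pretends that
   the assembly version of function 5 is broken. */
bool demo_run_test(const bool *use_asm, size_t count)
{
    return !(count > 5 && use_asm[5]);
}
```

With eight optimized functions, the search needs three test runs instead of up to eight individual swaps.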
The problematic assembly function should be investigated for standalone assembly issues and for C and assembly connectivity issues. Debugging standalone assembly issues is quite straightforward for assembly programmers, but C and assembly connectivity issues are somewhat puzzling. Unlike standalone assembly issues, C and assembly connectivity issues are not viewable when looking at the assembly function itself. To find these problems, the programmer must inspect compiler conventions such as calling convention and register usage convention.
The programmer must also check compiler assumptions such as the expected location of specific assembly instructions. To reduce debugging time, the programmer should verify that all compiler conventions and assumptions are followed when the assembly function is first implemented.
Real improvement
The H.264 video encoder is very demanding in terms of processing power (usually measured in MHz) and other resources. CEVA uses the CEVA-X16xx DSP Core family and its MM2000 multimedia platform to provide the processing power required by this encoder.
The critical functions of this encoder were identified using advanced profiling techniques and then optimized. The optimization of the encoder's critical functions was gradual. First, the functions were fully optimized in C using advanced features such as assembly intrinsics. Then, the assembly code generated by the compiler was further optimized at the assembly level.
Figure 2 shows the performance improvement throughout the optimization process of a critical function of the encoder. Only the last optimization stage involved full-scale assembly coding. All other stages were based on C code with assembly intrinsics. The assembly intrinsics were mostly used for Single Instruction Multiple Data (SIMD) operations like avg_acW_acX_acZ_4b. This instruction performs byte averaging on eight input bytes, producing four result bytes. Such SIMD operations are very useful for video codecs that perform many calculations at the byte level.
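To give a feel for what such a byte-averaging operation computes, the sketch below emulates it in plain C on two 32-bit words that each pack four unsigned bytes. It is not the CEVA intrinsic: the function names `avg_4b` and `avg_4b_ref` are hypothetical, and the rounding (each result is rounded down) is an assumption, since the article does not state how the instruction rounds.

```c
#include <stdint.h>

/* Average four pairs of unsigned bytes packed into two 32-bit words.
   Each result byte is floor((x + y) / 2). Relies on the identity
   x + y = 2 * (x & y) + (x ^ y); masking with 0xFEFEFEFE before the
   shift stops bits from leaking between neighboring bytes. */
uint32_t avg_4b(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Straightforward byte-by-byte reference, useful for checking avg_4b. */
uint32_t avg_4b_ref(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        out |= ((x + y) >> 1) << (8 * i);
    }
    return out;
}
```

The reference loop handles one byte pair per iteration, while the packed version produces all four results with a handful of word-wide operations; a dedicated SIMD instruction reduces this further.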
With a high quality software development tool chain offering various C and assembly features, DSP programmers can reach impressive performance results without implementing their entire application in assembly. Writing a combination of C and assembly code is not a trivial exercise. However, the various features discussed here make it easier for the DSP engineer to handle this task.
- Eran Balaish
Senior Compiler Project Manager
CEVA Inc.