EE Times-Asia > Embedded

Optimizing performance of DSPs

Posted: 02 May 2003

Keywords: DSP application, memory, CPU, DMA, multiple pipelines

Many of today's DSP applications are subject to real-time constraints. And it seems many applications eventually grow to a point where they stress the available CPU and memory resources.

The fundamental rule in computer design, as well as in programming real-time systems, is: "Make the common case fast, and favor the frequent case." This is really just Amdahl's Law, which says that the performance improvement gained from a faster mode of execution is limited by the fraction of time that mode can be used. So do not spend time optimizing a piece of code that will hardly ever run. Instead, eliminate even one cycle from a loop that executes thousands of times, and you will see a far bigger impact on the bottom line.

Memory can be a severe bottleneck in embedded system architectures. The problem can be reduced by keeping the most frequently referenced items in fast on-chip memory and leaving the rest in slower off-chip memory. The difficulty is that moving data from external memory to on-chip memory takes time, and if the CPU is busy moving data, it cannot be performing other, more important tasks.

The fastest and most expensive memory is generally the registers on-chip. There never seems to be enough of this valuable resource, and managing it is paramount to improving performance. The next-fastest memory is usually the cache that holds the instructions or data the processor hopes to execute in the near future.

The slowest memory is generally found off-chip and is referred to as external memory. As a real-time programmer, you want to reduce accesses to external memory, because each access can take a long time and cause large processing delays. The CPU pipeline must "stall" while it waits for the external access to complete. Use of on-chip memory, therefore, is one of the most effective ways to increase performance.

Multiple pipelines

Hardware architecture techniques have been used to enhance the performance of processors using pipelining concepts. To improve performance even more, multiple pipelines can be used. This approach, called "superscalar," exploits further the concept of parallelism.

One way to control multiple execution units and other resources on the processor is to issue multiple instructions simultaneously. Each instruction word in a very long instruction word (VLIW) machine can control multiple execution units on the processor. For example, each very long instruction word on the TI 6200 DSP holds eight instructions, one for each of the eight potentially available execution units. Again, the key is parallelism. In practice, however, it is hard to keep all these execution units full all the time because of various data dependencies. The possible performance improvement using a VLIW processor is excellent, especially for some DSP applications.

A superscalar architecture offers more parallelism than a pipelined processor. But unless there is an algorithm or function that can exploit this parallelism, the extra pipe can go unused, reducing the amount of parallelism that can be achieved. An algorithm that is written to run fast on a pipelined processor may not run nearly as efficiently on a superscalar processor.

DMA is another option for speeding up DSP execution rates. A peripheral device transfers data directly to and from memory, taking the burden off the CPU. The DMA controller is, in effect, a simple processor whose only function is to move data around quickly. The advantage is that the CPU can issue a few instructions to the DMA to move data and then go back to what it was doing. This is just another way of exploiting the parallelism built into the device. The DMA is most useful for copying larger blocks of data. Smaller blocks do not pay off, because of the setup and overhead time the DMA requires; the CPU can handle these itself. But when used smartly, the DMA can save huge amounts of time.

A common use for the DMA is to stage data on- and off-chip. The CPU can access on-chip memory much faster than off-chip or external memory. Having as much data as possible on-chip is the best way to improve performance. If the data being processed cannot all fit on-chip at the same time, as with large arrays, then the data can be staged on- and off-chip in blocks using the DMA. All of the data transfers can happen in the background while the CPU is actually crunching the data. Smart management and layout of on-chip memory can reduce the number of times data has to be staged on- and off-chip.

It is worth the time and effort to develop a smart plan for how to use the on-chip memory. In general, the rule is to stage the data in and out of on-chip memory using the DMA, and generate the results on-chip. For cost and space reasons, most DSPs do not have a lot of on-chip memory. This requires the programmer to coordinate the algorithms in a way that efficiently uses the available on-chip memory.

- Rob Oshana

Engineering Manager

Texas Instruments Inc.
