EE Times-Asia

Design low-power multiprocessor chips

Posted: 26 Feb 2007

Keywords: SoC, multiprocessor, processor, multiprocessors, processors

By Shinya Fujimoto
LSI Logic Corp.

Just 10 years ago, the challenge chip designers faced was to design logic blocks with as few gates as possible to fit all the functions into a target die size. Today, advancements in semiconductor process technologies let designers easily pack millions of gates and complex mixed-signal components into a single chip. Nonetheless, chip designers still face the challenge of reducing gate count and implementing efficient architectures, not only to reach a target size but also to reduce total system power consumption.

From cellphones to portable media players, more products in the market are battery-powered, and low power consumption obviously leads to longer battery life. In the consumer electronics market, even ac-powered devices can benefit from reduced power consumption. Lower power draw leads to lower system cost by enabling the use of less-expensive chip packaging and the reduction or elimination of heat-dissipation components (e.g., fans and heat sinks).

But the demand for higher performance and more features has only served to increase total system power consumption.

The challenge is to achieve higher integration and performance within the system power budget. To reach that goal, chip designers are moving away from single-processor architectures and toward architectures that distribute tasks among multiple processors or cores. The cores can be either symmetric or heterogeneous, depending on the application.

For systems that require the lowest power consumption and best cost/performance, chip designers prefer to use heterogeneous multicore architectures, with task-dedicated processors that operate concurrently.

Care and forethought are required when designing multiprocessor chips; otherwise, a design can easily run into data-bandwidth bottlenecks that constrain system performance. Designers must architect chips to handle data transactions efficiently and to minimize stall cycles as multiple masters access the external memory.

Diminishing returns
Most embedded systems today still use a generic single-processor architecture in which one central processing unit is charged with handling the multiple tasks required in the application. For example, the CPU may be required to decode multimedia audio/video content, generate graphics for the user interface and manage peripheral devices. All of those tasks need to be done simultaneously on top of the general computing functions, such as running the operating system.


To meet the ever-increasing need for more processing muscle, SoC architects and designers have traditionally relied on Moore's law for faster transistors to increase system frequency and consequently obtain higher performance. The faster transistors allowed designers to come up with complex branch prediction circuitry and add pipeline stages as needed to keep pace with the performance requirements. But this approach has created a trend of designing bigger and more-power-hungry processors with every new generation of process technology.

In the past, when people were designing with 0.18µm technologies, it was acceptable to increase system performance by adding functions and running the processor at a higher frequency, since those techniques yielded a sufficient return in exchange for the slight increase in power consumption.

As designs migrated to 0.13µm process technology and beyond, however, that approach did not yield as much gain in performance, in light of the amount of power the chips now dissipated.

In addition, the market's preference in many cases has shifted from raw performance improvements to greater power efficiency. That change has caused designers to move away from using more gates and running the system at higher frequencies; instead, they're looking for alternatives that achieve optimal cost and performance under tight power budgets.

Parallel computing
To meet the goals of higher performance and reduced power consumption, designers have begun to integrate multiple identical cores into chips. The most notable example of this trend, advanced by Intel, is to integrate multiple CPU cores on a single die, as opposed to trying to increase the frequency of a single CPU by adding more pipeline stages.

A similar approach has been taken in custom chips used in the latest generation of videogame consoles, such as the Sony PlayStation 3 and Microsoft Xbox 360. Both systems use architectures that have a single main processor that integrates multiple cores of the same type into a single chip.

This symmetrical multicore approach is effective for systems that require flexible computing capabilities, such as PCs. It is also an appropriate approach for the new videogame consoles, which are attempting to become "the PC in the living room."

This type of architecture, however, is not suitable for products and applications that need to perform dedicated tasks with the lowest power consumption and optimal cost/performance.

Heterogeneous multiprocessor systems integrate a number of unique processors or cores into a system. Each processor supports a specific task and does it more efficiently with fewer gates than a generic CPU core. One implementation of such a system might include a fast video processor tasked to perform decoding of compressed video, a graphics processor dedicated to handle rendering of images for the user interface, a sound processor to process audio and add sound effects, and a CPU to handle low-level control such as running a real-time operating system.

All of these tasks could be performed with multiple CPUs in a symmetrical multiprocessor system or even by using a single CPU running at a high frequency. But for a system that is required to operate with very low power consumption and at low cost, developing a chip with the smallest die size that can operate at the slowest possible frequency becomes paramount.
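The power argument above can be made concrete with the standard dynamic-power relation for CMOS logic, P = aCV²f: several small, slow, task-dedicated cores can draw far less power than one large, fast generic CPU doing the same work. The sketch below uses entirely hypothetical capacitance, voltage and frequency numbers chosen only to illustrate the shape of the trade-off, not measured values for any real chip.

```python
# Illustrative dynamic-power comparison: P_dyn = a * C * V^2 * f.
# All numbers are hypothetical, chosen to show the trade-off only.

def dynamic_power_mw(cap_nf, volt, freq_mhz, activity=0.2):
    """Dynamic power in mW: P = a * C * V^2 * f (C in nF, f in MHz)."""
    return activity * cap_nf * volt**2 * freq_mhz

# One generic CPU clocked fast enough to handle every task.
single_cpu = dynamic_power_mw(cap_nf=2.0, volt=1.2, freq_mhz=600)

# Several small task-dedicated cores, each clocked much slower.
dedicated = sum(
    dynamic_power_mw(cap_nf=c, volt=1.2, freq_mhz=150)
    for c in (0.6, 0.5, 0.4, 0.3)  # e.g. video, graphics, audio, control
)

print(f"single fast CPU : {single_cpu:.0f} mW")
print(f"dedicated cores : {dedicated:.0f} mW")
```

Because frequency enters the equation linearly (and often permits a lower supply voltage, which enters quadratically), clocking each dedicated core at the slowest frequency that meets its task is the dominant power lever.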

An example implementation of a heterogeneous multiprocessing architecture is the Zevio 1020 multimedia application processor from LSI Logic Corp. The Zevio 1020 integrates a single ARM9 processor and a DSP that runs at 150MHz, a 3-D graphics processor, a 3-D sound processor and a 2-D display processor for direct output to LCD panels and televisions.

Granted, the raw performance of this chip does not match that of the latest videogame consoles, but it enables performance similar to that of second-generation videogame consoles for products that retail for under $100 while keeping the power consumption to less than 300mW.

That price point and power efficiency would have been difficult to achieve if LSI had chosen to implement the design using generic CPUs for each of the targeted tasks.

There are inherent challenges in designing a well-balanced multiprocessor system, however. One challenge is to reduce the bottleneck when accessing the external memory. To achieve the highest performance for a given power budget, it is essential that a memory controller provide very efficient arbitration as well as high data throughput. At the same time, to bring down system costs, the memory controller must be optimized to work with 16-bit-wide memories.

Bus arbitration
Another challenge of a multiprocessor system is to improve the arbitration inefficiencies created by standard bus protocols such as ARM's AMBA AHB (Advanced High-performance Bus), which is commonly used in SoCs. Inefficient arbitration can cause stall cycles and leave each master idle, reducing performance and increasing power consumption.


In the Zevio architecture, the external memory bus is designed to work with 16-bit-wide memories by running the memory clock at twice the system clock frequency. The patent-pending arbitration algorithm takes advantage of the faster memory clock by fixing the DRAM burst size to two (2 x 16 bits) and the CAS latency to two cycles, so that SDRAM commands can be issued efficiently in a pipeline. This allows the memory controller to achieve much higher bus efficiency than memory controllers found in typical designs.
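The peak-bandwidth arithmetic behind this narrow-but-fast interface is worth spelling out: a 16-bit bus clocked at twice the system frequency moves the same number of bytes per second as a 32-bit bus at the system frequency, and a burst of two 16-bit beats delivers exactly one 32-bit word. The check below uses the article's 150MHz system clock; the comparison itself is simple arithmetic.

```python
# Peak-bandwidth check for a 16-bit memory bus at 2x the system clock
# vs. a 32-bit bus at 1x. One beat is transferred per memory clock.

SYSTEM_CLK_MHZ = 150

def peak_bw_mb_s(bus_bits, clk_mhz):
    """Peak transfer rate in MB/s (one beat per clock cycle)."""
    return bus_bits / 8 * clk_mhz

narrow_fast = peak_bw_mb_s(bus_bits=16, clk_mhz=2 * SYSTEM_CLK_MHZ)
wide_slow = peak_bw_mb_s(bus_bits=32, clk_mhz=SYSTEM_CLK_MHZ)

assert narrow_fast == wide_slow  # same peak bandwidth, fewer pins
print(f"16-bit @ {2 * SYSTEM_CLK_MHZ} MHz: {narrow_fast:.0f} MB/s")
```

The narrow bus therefore gives up no peak bandwidth while reducing pin count and package cost, which is why efficient arbitration (keeping the bus busy) matters more than raw width.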

For example, consider a DRAM transaction stream in which a read request for two 32-bit words is followed successively by three- and four-word reads. Running that stream through a typical memory controller and through the memory controller designed for the Zevio architecture, the Zevio controller achieves a bus-efficiency improvement of 60 percent over the typical design.
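A toy cycle-count model shows where this kind of gain comes from. Here bus efficiency is defined as data-transfer cycles divided by total cycles; the per-request command overhead of five cycles is an assumption for illustration, not the Zevio controller's actual timing, and the model makes no attempt to reproduce the 60 percent figure exactly.

```python
# Toy model: why pipelining SDRAM commands raises bus efficiency.
# Efficiency = cycles spent moving data / total cycles.
# Overhead cycles per request are assumed values, for illustration only.

requests = [2, 3, 4]  # read lengths in 32-bit words, as in the example

def efficiency(requests, overhead_per_request, pipelined):
    data = sum(requests)  # one data cycle per 32-bit word
    if pipelined:
        # Command issue for request N overlaps the data phase of
        # request N-1, so only the first request pays full overhead.
        total = overhead_per_request + data
    else:
        # Every request stalls for the full command overhead.
        total = len(requests) * overhead_per_request + data
    return data / total

typical = efficiency(requests, overhead_per_request=5, pipelined=False)
pipelined = efficiency(requests, overhead_per_request=5, pipelined=True)

print(f"non-pipelined: {typical:.0%}, pipelined: {pipelined:.0%}")
```

Even this crude model shows that packing commands back-to-back, rather than serializing overhead with data, is what converts wasted wait cycles into useful transfers.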

It is important to note that the improvement is a direct result of how efficiently the SDRAM commands are issued to the memory. In a typical memory controller implementation, the system wastes many cycles while waiting for the next command to be sent, whereas in the Zevio memory controller, many of the commands are tightly pipelined and packed together, resulting in higher efficiency.

Another area of concern for memory controllers is the ability to interface multiple bus masters with minimal gate count. In a complex SoC, it is common to find more than 10 master blocks requesting memory access. If the memory controller does not have enough ports to support them, extra logic needs to be added outside the memory controller.

That causes an increase in gate count to support the two levels of arbitration. It also eliminates the ability to arbitrate fairly across all masters, since the memory controller only can arbitrate among the masters connected directly to it.
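The fairness point above is easiest to see in a sketch of the alternative the text argues for: a memory controller with one port per master and a single round-robin arbiter, so every master is visible to the same arbitration decision and no external pre-arbiter is needed. The interface below is hypothetical, a behavioral model rather than any particular controller's design.

```python
# Behavioral sketch of single-level, fair round-robin arbitration
# across all bus masters (one controller port per master).

class RoundRobinArbiter:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.last = num_ports - 1  # last port granted

    def grant(self, requests):
        """requests: set of port indices currently requesting.
        Returns the granted port, or None if nothing is requesting.
        The search starts just after the previously granted port,
        so no master can starve the others."""
        for i in range(1, self.num_ports + 1):
            port = (self.last + i) % self.num_ports
            if port in requests:
                self.last = port
                return port
        return None

arb = RoundRobinArbiter(num_ports=10)
print(arb.grant({0, 3, 7}))  # grants port 0
print(arb.grant({0, 3, 7}))  # then port 3, not port 0 again
```

With two-level arbitration, by contrast, the masters hidden behind the external pre-arbiter share a single port and cannot be weighed individually against the directly connected masters, which is exactly the fairness loss the text describes.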

The trend in SoC design has shifted from designing for pure performance with a single fast-running processor to multiprocessor architectures that meet the expected performance requirements at the lowest cost and power consumption.

The most efficient course is to use a heterogeneous multiprocessor system with specialized cores optimized to handle specific tasks efficiently, as opposed to a generic processor that accomplishes the same task by simply running at a faster frequency.

In integrating multiple cores in a single chip, improving the overall bus efficiency is a key factor in reducing an SoC's power consumption. By optimizing the external memory access and bus arbitration protocols, the overall bus efficiency can be improved by 70 percent in some cases. A system architecture that is highly efficient, with good bus data throughput, gives designers the option to run the system at a slower frequency or even to consider slower but more power-efficient process technologies while still achieving the target performance goals.
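The frequency headroom this creates follows directly from the definitions: useful throughput is (bus efficiency x clock), so at constant throughput the clock can drop by the same factor the efficiency rises, and dynamic power scales roughly linearly with clock (more if the supply voltage can drop too). The baseline efficiency below is a hypothetical figure for illustration.

```python
# Back-of-envelope frequency/power headroom from a bus-efficiency gain.
# Useful throughput = efficiency * clock, so at constant throughput the
# required clock scales by (old efficiency / new efficiency).
# Baseline efficiency is an assumed value for illustration.

base_eff, base_clk_mhz = 0.40, 150   # hypothetical baseline controller
improved_eff = base_eff * 1.70       # 70% better bus efficiency

# Clock needed to deliver the same useful bandwidth:
new_clk_mhz = base_clk_mhz * base_eff / improved_eff

print(f"same throughput at {new_clk_mhz:.0f} MHz "
      f"instead of {base_clk_mhz} MHz")
print(f"dynamic power scales to ~{new_clk_mhz / base_clk_mhz:.0%} "
      f"of baseline")
```

In other words, a 70 percent efficiency gain buys roughly a 40 percent clock reduction at the same delivered bandwidth, which is the slack that permits slower, cheaper, lower-leakage process options.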

Designing an efficient multiprocessor system requires careful forethought and planning. Blindly stitching cores together can lead to chips with many cores struggling to hit the required performance because of a lack of overall bus bandwidth.

Moreover, low power consumption is not something that can be achieved using a single technique. Rather, it calls for a combination of factors, including simplified system design, an efficient memory controller and the use of multiple processors that are optimized for specific tasks.

About the author
Shinya Fujimoto
is principal architect for the Consumer Products Division of LSI Logic Corp. He earned a bachelor's degree in electrical and computer engineering and computer science from Carnegie Mellon University in 1995.
