CPUs take parallel turn at Hot Chips
Keywords: parallel architecture, Hot Chips
Parallel architectures emerged as the locus of attention in chip design at last August's Hot Chips 17 conference in Palo Alto, California. Hot Chips used to be about bigger and faster processors and ever-higher clock frequencies. But this year, the predominant theme among conference papers was not raw speed but parallelism: finding it, exploiting it with hardware and avoiding the bottlenecks that threaten to undermine those efforts.
The reason for this shift, a gradual process over recent years, was clear. On the transistor and circuit level, chips aren't going to whiz along much faster than they do now. The conspiracy of increasing parasitics, deteriorating transistors and spiraling power dissipation will make further progress in uniprocessor systems slow and painful. So architects are turning to multiprocessing, in any form they can adapt to their applications. And in one possibly prophetic paper, from IBM Corp.'s Zurich Research Laboratory, they are turning away from instruction-driven processors altogether.
While legacy architectural thinking, hardware implementation issues and, especially, programming support tend to obscure it, chip architecture has become a simple battle to preserve and exploit whatever parallelism exists in an algorithm and its data. This new reality is opening a profound gulf between embedded systems, where the algorithms and the structure of the data are known in advance, and fully programmable systems, in which they are not.
In the embedded world, the first question is whether the parallelism exists in the data, in the algorithms or both. So far, data parallelism has been the most rewarding situation for chip designers. If a transformation can be applied independently to many groups of data, the degree of parallelism is in theory limited only by the amount of data available at one time and the complexity of the transform. You can still run out of transistors.
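To make the idea concrete, here is a minimal C sketch of a data-parallel transform; the function names, gain operation and block size are hypothetical, not taken from any chip discussed here. Because each block is processed independently, the work could be spread across as many vector units or cores as the die provides.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block transform: each block of samples is scaled
 * independently, so no block depends on the result of any other. */
static void scale_block(int16_t *block, size_t len, int16_t gain)
{
    for (size_t i = 0; i < len; i++) {
        int32_t v = (int32_t)block[i] * gain;   /* widen to avoid overflow */
        if (v > INT16_MAX) v = INT16_MAX;       /* saturate */
        if (v < INT16_MIN) v = INT16_MIN;
        block[i] = (int16_t)v;
    }
}

/* Because the blocks are independent, this loop could be split across as
 * many vector units or cores as the chip provides; the only limits are the
 * amount of data on hand and the number of transistors available. */
void scale_frame(int16_t *samples, size_t n_blocks, size_t block_len, int16_t gain)
{
    for (size_t b = 0; b < n_blocks; b++)
        scale_block(samples + b * block_len, block_len, gain);
}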
At one extreme, where transforms can be expressed as vector arithmetic with a minimum of control complexity, high-end chips almost become arrays of multiply-accumulate units. Telairity Semiconductor Inc., for example, presented the Telairity-1 chip, intended for H.264 encoding. It has a total of 220 sixteen-bit vector function units and 30 scalar function units on the die. Cradle Technologies Inc. described the CT3616, with 16 DSP cores and eight general-purpose cores on the die.
As control complexity increases and the relative importance of simple computational loops wanes, the chips begin to look more like conventional CPUs with DSP extensions. Tensilica Inc. (Santa Clara), for example, demonstrated that a wide range of audio-processing applications could be handled very economically by adding specialized signal-processing instructions, in this case about 300 different operations, to the vanilla Xtensa RISC core.
If the parallelism exists in the algorithm rather than the data, often the resulting chip will take on a pipelined appearance. One such example was a novel processor from Intel Corp. intended for efficiently handling the huge streams of very small packets ("milliflows," in Intel's parlance) associated with voice-over-Internet Protocol and similar applications. Intel's Magpie architecture aims to aggregate thousands of these small packets, route them into a DSP farm for processing and then deliver them in a useful manner.
To do this, the chip employs three extended MIPS cores, one each for packet ingress, traffic management and packet egress, along with two unenhanced MIPS cores for managing the network stack and scheduling, respectively. In this way, functions that are logically separable occur on different processors; incoming packets see, in effect, a pipeline of functional units.
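The division of labor is easier to see in code. The sketch below is a minimal C illustration of the idea, not Intel's Magpie design: the stage functions, packet fields and sizes are hypothetical, and on the real chip each stage would run concurrently on its own core rather than back to back in one loop.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor; field names are illustrative only. */
struct packet {
    uint32_t flow_id;
    uint16_t len;
    uint8_t  payload[256];
};

/* Each stage is logically separable, so each could be pinned to its own
 * core (as with the three extended MIPS cores above); packets then see a
 * pipeline of functional units rather than a single time-shared CPU. */
static bool ingress(struct packet *p)        { /* classify, validate */ return p->len > 0; }
static void traffic_manage(struct packet *p) { /* queue per flow_id, shape */ (void)p; }
static void egress(const struct packet *p)   { /* reassemble, transmit */ (void)p; }

void process(struct packet *pkts, size_t n)
{
    /* On one core the stages run back to back; on the pipelined chip each
     * stage works concurrently on a different packet. */
    for (size_t i = 0; i < n; i++) {
        if (!ingress(&pkts[i]))
            continue;
        traffic_manage(&pkts[i]);
        egress(&pkts[i]);
    }
}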
The real problem comes when the end system is programmable, rather than embedded. In this case, only generalizations can be made about the degree of parallelism available in either the data or the algorithms. The ideal of simply routing whatever might come along through one sufficiently fast CPU is no longer an option, so architects resort to another approach.
This scheme, as illustrated in several papers on the IBM, Sony and Toshiba Cell architecture and one paper on the Microsoft Xbox, might be called virtualization. The architect tries to extract accurate generalizations about the data flows and tasks the system will face, and then build a network of processing elements, memories and buses that can serve as a medium in which to implement the transforms and connections required by each flow of data as it passes through the system.
Thus, in the system, one data flow might be processed by one set of processors under one operating system, while another flow takes a quite different path, under control of an entirely different OS. The underlying hardware is unchanged, but it is employed differently for different tasks and flows.
This emphasis on data flows makes both the Cell and Xbox processors resemble networking chips to a remarkable degree. Both architectures are described in terms of their general-purpose CPUs and specialized vector units. But both depend for their operation on extensive behind-the-scenes hardware and organization to, in IBM's words, orchestrate the movement of data. The idea is to provide a multiprocessing platform that will appear to the game developer to be a single CPU augmented with powerful vector-processing instructions and hardware threading.
This thinking played out very differently in the IBM and Microsoft camps. The Cell uses a single Power CPU to deal with control flows and a bank of eight rather specialized vector floating-point processors to process data flows. The Xbox, in contrast, does not separate control from data flows so firmly. Instead, it provides three PowerPC cores with extensive SIMD instruction enhancements. Both architectures depend upon external graphics processors and I/O processors.
The problem they will face was discussed at length in a keynote by David Kirk, chief scientist at graphics chip maker Nvidia Corp. (Santa Clara). Kirk warned that PC game performance was increasingly limited by the throughput of the CPU, which was unable to supply vertices fast enough to saturate the graphics chip. Rather than delivering faster CPUs, he said, Intel and AMD are fielding dual-core processors. "But virtually no games today can benefit from multicore processors," Kirk said. "Efforts to multithread them have not been successful. In fact, we have seen cases where these efforts . . . actually made the system run slower on two cores than on one."
While multiprocessing characterized most of the papers at Hot Chips, one, from IBM Zurich, suggested a future in which stored-program processors did not play a role. The team described a pattern-matching processor based not on a CPU but on a finite state machine with alterable state-transition rules. By compiling search strings and regular expressions directly into state-transition rules, the design sidesteps software, instruction storage, fetches and decodes altogether. It thus can achieve deterministic throughput of over 20 Gbits/second, independent of the number of patterns for which it is searching.
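The principle is the same one behind table-driven string matching in software. The minimal C sketch below is only an illustration of that principle, not IBM's design: the names, pattern and table limits are hypothetical, and it handles a single literal string rather than the regular expressions the Zurich chip supports. It "compiles" the pattern into a transition table and then consumes exactly one input byte per step, which is what makes the throughput deterministic regardless of what is being searched for.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_STATES 32   /* pattern length + 1; illustrative limit */

/* "Compile" a search string into state-transition rules: dfa[state][byte]
 * gives the next state (standard KMP-style DFA construction). */
static void compile_pattern(const char *pat, uint8_t dfa[][256], int *n_states)
{
    int m = (int)strlen(pat);
    memset(dfa, 0, (size_t)(m + 1) * 256);
    dfa[0][(uint8_t)pat[0]] = 1;
    for (int x = 0, s = 1; s < m; s++) {
        for (int c = 0; c < 256; c++)
            dfa[s][c] = dfa[x][c];                   /* copy fallback transitions */
        dfa[s][(uint8_t)pat[s]] = (uint8_t)(s + 1);  /* match transition */
        x = dfa[x][(uint8_t)pat[s]];
    }
    *n_states = m + 1;
}

/* The scan consumes exactly one byte per step, no instruction fetch or
 * decode in the inner logic; this is what makes a hardware version's
 * throughput deterministic. */
static int scan(const uint8_t dfa[][256], int accept, const uint8_t *buf, int len)
{
    int s = 0;
    for (int i = 0; i < len; i++) {
        s = dfa[s][buf[i]];
        if (s == accept)
            return i;   /* offset where the match completes */
    }
    return -1;
}

int main(void)
{
    static uint8_t dfa[MAX_STATES][256];
    int n_states;
    compile_pattern("attack", dfa, &n_states);
    const uint8_t data[] = "no attack pattern here";
    printf("match ends at byte %d\n", scan(dfa, n_states - 1, data, (int)sizeof data - 1));
    return 0;
}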
In a way, the paper illustrated a contentious remark made in a panel discussion. Nick Tredennick, editor of the Gilder Technology Report, commented that the invention of the microprocessor essentially brought innovation in logic design to a halt. But now, with progress in conventional CPU throughput grinding to a halt, there may be an effort to revisit the fundamentals of logic design that thrived before the microprocessor was born.
- Ron Wilson
EE Times