EE Times-Asia

Demystifying multithreading, multicore

Posted: 02 Oct 2007

Keywords: multithreaded processors, multicore chips, multitasking, embedded SoC designs

By Kevin D. Kissell
RapidIO Trade Association

Is multithreading better than multicore? Is multicore better than multithreading? The fact is that the best vehicle for a given application might have one, the other or both. Or neither. They are independent (but complementary) design decisions. As multithreaded processors and multicore chips become the norm, architects and designers of digital systems need to understand their respective attributes, advantages and disadvantages.

Both multithreading and multicore approaches exploit the concurrency in a computational workload. The cost, in silicon, energy and complexity, of making a CPU run a single instruction stream ever faster goes up nonlinearly and eventually hits a wall imposed by the physical limitations of circuit technology. That wall keeps moving out a little farther every year, but cost- and power-sensitive designs are constrained to follow the bleeding edge from a safe distance. Fortunately, virtually all computer applications have some degree of concurrency: At least some of the time, two or more independent tasks need to be performed simultaneously. Taking advantage of concurrency to improve computing performance and efficiency isn't always trivial, but it's certainly easier than violating the laws of physics.

Multicore systems
Multi-processor or multicore systems exploit concurrency to spread work around a system. A system can run as many software tasks at the same time as it has processors. This flexibility can be used to improve absolute performance, cost or power/performance. Clearly, once one has built the fastest single processor possible in a given technology, the only way to get even more computing power is to use more than one of these processors. More subtly, if a load that would saturate a 1GHz processor could be spread evenly across four processors, those processors could be run at roughly 250MHz each. If each 250MHz processor is less than a quarter the size of the 1GHz processor, or consumes less than a quarter of the power (either of which may be the case because of the nonlinear cost of higher operating frequencies), the multicore system might be more economical.
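The frequency argument above can be sketched numerically. The snippet below assumes, purely for illustration, that per-core power grows with the cube of clock frequency (a common rule of thumb when supply voltage is scaled along with frequency). The exponent and the function name relative_power are assumptions, not figures from this article.

```python
def relative_power(num_cores, freq_ghz, exponent=3.0, ref_freq_ghz=1.0):
    """Total power relative to one core running at ref_freq_ghz.

    Assumes per-core power ~ frequency ** exponent (illustrative model only).
    """
    per_core = (freq_ghz / ref_freq_ghz) ** exponent
    return num_cores * per_core


single = relative_power(1, 1.0)    # one core at 1GHz
quad = relative_power(4, 0.25)     # four cores at 250MHz, same aggregate clock rate

# Break-even case: if power were strictly linear in frequency, nothing is gained.
quad_linear = relative_power(4, 0.25, exponent=1.0)

print(f"single core: {single:.4f}")
print(f"four cores (cubic model): {quad:.4f}")
print(f"four cores (linear model): {quad_linear:.4f}")
```

Under the linear model the two configurations break even, which is why the multicore advantage depends on the cost of frequency being nonlinear.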

Many designers of embedded SoCs are already exploiting concurrency with multiple cores. Unlike general-purpose workstations and servers, whose workloads are variable and unknowable to system designers, embedded devices have fixed sets of functions that can often be analyzed and decomposed into specialized tasks. Consequently, tasks can be assigned across multiple processors, each of which has a specific responsibility, and each of which can be specified and configured optimally for that specific job.

Multithreaded chips
Multithreaded processors also exploit the concurrency of multiple tasks, but in a different way and for a different reason. Instead of a system-level technique to spread CPU load, multithreading is a processor-level optimization to improve area and energy efficiency. Multithreaded architecture is driven to a large degree by the realization that single-threaded, high-performance processors spend a surprising amount of time doing nothing. When the results of a memory access are required for a program to advance, and that access must reference RAM whose cycle time is tens of times slower than that of the processor, a single-threaded processor can do nothing but stall until the data is returned.

Multithreading can be described thus: If latencies prevent a single task from keeping a processor pipeline busy, a single pipeline should be able to complete more than one concurrent task in less time than it would take to run the tasks serially. This means running more than one task's instruction stream, or thread, at a time, which in turn means that the processor has to have more than one program counter and more than one set of programmable registers. Replicating those resources is far less costly than replicating an entire processor. In the MIPS32 34K processor, which implements the MIPS MT multithreading architecture, an increase in area of 14 percent can buy an increase of throughput of 60 percent relative to a comparable single-threaded core (as measured using the EEMBC PKFLOW and OSPF benchmarks, run sequentially on a MIPS32 24KE core vs. concurrently on a dual-threaded MIPS32 34K core).

In theory, multi-processor architectures are infinitely scalable. No matter how many processors are used, it is always easy to imagine adding another, although only a limited class of problems can make practical use of thousands of CPUs. Each additional processor core on an SoC adds to the area of the chip at least as much as it adds to the performance.

Multithreading a single processor can only improve performance up to the level where the execution units are saturated. However, up to that limit, it can provide a "superlinear" payback for the investment in die size.
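The "superlinear" payback can be made concrete by comparing throughput per unit of area. The sketch below uses the 14 percent area and 60 percent throughput figures quoted above for the 34K. The second-core comparison is an assumed best case (double the area for double the throughput, with interconnect overhead ignored), and the function name is hypothetical.

```python
def throughput_per_area(throughput_gain, area_gain):
    """Throughput per unit area relative to a baseline core (1.0 = unchanged)."""
    return (1.0 + throughput_gain) / (1.0 + area_gain)


# Figures quoted in the article: +60% throughput for +14% area.
multithreaded = throughput_per_area(0.60, 0.14)

# Assumed best case for adding a second identical core: +100% for +100%.
second_core = throughput_per_area(1.00, 1.00)

print(f"multithreaded core: {multithreaded:.2f}x throughput per area")
print(f"second core:        {second_core:.2f}x throughput per area")
```

Any ratio above 1.0 means the added silicon returns more than its share of throughput, which a duplicated core cannot do under this model.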

Common requirement
Although the means and the motives are different, multicore systems and multithreaded cores have the common requirement that concurrency in the workload be expressed explicitly by software. If the system has already been coded in terms of multiple tasks running on a multitasking OS, there may be no more work to be done. Monolithic, single-threaded applications need to be reworked and decomposed either into sub-programs or explicit software threads.

This work must be done for both multithreaded and multicore systems, and once it is completed, either system can exploit the exposed concurrency. This is another reason why the two techniques are often confused, but it is also what makes them highly complementary.
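As a minimal sketch of what "expressing concurrency explicitly" means, the example below runs two independent tasks first as one monolithic sequence and then as explicit software threads. The task names and data are hypothetical. Note also that in standard CPython the global interpreter lock limits truly parallel execution of threads, so this shows the structure of the decomposition rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum_packets(packets):
    """Hypothetical task A: byte-sum checksum of each packet."""
    return [sum(p) % 256 for p in packets]


def update_routes(costs):
    """Hypothetical task B: pick the cheapest next hop per destination."""
    return {dest: min(hops, key=hops.get) for dest, hops in costs.items()}


def run_serial(packets, costs):
    # Monolithic form: one instruction stream, tasks run back to back.
    return checksum_packets(packets), update_routes(costs)


def run_threaded(packets, costs):
    # Decomposed form: each task is an explicit software thread that the OS
    # may place on another core or another hardware thread context.
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(checksum_packets, packets)
        b = pool.submit(update_routes, costs)
        return a.result(), b.result()


packets = [b"\x01\x02", b"\xff\x02"]
costs = {"net1": {"eth0": 3, "eth1": 1}}
print(run_serial(packets, costs) == run_threaded(packets, costs))
```

The decomposed version is the same whether the threads end up on separate cores or on separate thread contexts of one core, which is the point made above.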

For embedded SoC designs, a multicore design makes the most sense when the functions of the SoC decompose cleanly into subsystems with a limited need for communication and coordination between them. Instead of running all code on a single, large, high-frequency core connected to a single, large, high-bandwidth memory, assigning tasks to several simpler, slower cores allows code and data to be stored in per-processor memories, each of which has lower requirements for both capacity and bandwidth. That design normally translates into power savings and, potentially, area savings as well, if the lower bandwidth requirement allows for physically smaller RAM cells to be used.

If the concurrent functions of an SoC cannot be statically decomposed at system design time, an alternative approach is to emulate general-purpose computers and build a coherent SMP cluster of processor cores. Within such a cluster, multiple processors are available as a pool to run the available tasks, which are assigned to processors "on the fly." The price to be paid for this flexibility is that it requires a sophisticated interconnect between the cores and shared main memory, and the shared main memory needs to be relatively large and high-bandwidth. This negates the area and power advantages alluded to above for functionally partitioned multicore systems, but it can still be a good trade-off.
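The difference between a fixed functional partition and an SMP-style pool can be sketched with a toy scheduler. Task durations are hypothetical and in arbitrary time units, and the model deliberately ignores the interconnect and shared-memory costs described above.

```python
import heapq


def static_makespan(assignment):
    """Finish time when each core runs a fixed, design-time list of task durations."""
    return max(sum(tasks) for tasks in assignment)


def pooled_makespan(durations, num_cores):
    """Finish time when each task goes to whichever core becomes free first."""
    free_at = [0] * num_cores              # min-heap of times at which cores go idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)     # earliest available core
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Workload the partition was designed for: function A on core 0, B on core 1.
balanced_static = static_makespan([[4, 4], [4, 4]])
balanced_pooled = pooled_makespan([4, 4, 4, 4], 2)

# Unanticipated workload: only function A is active, so core 1 sits idle.
skewed_static = static_makespan([[4, 4, 4, 4], []])
skewed_pooled = pooled_makespan([4, 4, 4, 4], 2)

print(balanced_static, balanced_pooled, skewed_static, skewed_pooled)
```

When the workload matches the design-time analysis, the static partition loses nothing. The pool only pays off when the mix of work cannot be predicted.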

Every core represents additional die area, and, even in a "powered down" standby state, each core in a multicore configuration consumes some amount of leakage current, so the number of cores in an SoC design should in general be kept to the minimum necessary to run the target application. There is no point in building a multicore design if the problem can be handled by a single core within the system's design constraints.

Multithreading makes sense whenever an application with some degree of concurrency needs to be run on a processor that would otherwise find itself stalled a significant portion of the time while waiting for instructions and operands. This is a function of core frequency, memory technology, and program memory reference behavior. Well-behaved, real-world programs in a typical single-threaded SoC processor/memory environment might be stalled as little as 30 percent of the time at 500MHz, but less cache-friendly codes may be stalled a whopping 75 percent of the time in the same environment. Systems where the speeds of processor and memory are so well matched that there is no loss of efficiency due to latency will not get any significant bandwidth improvement from multithreading.
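A rough upper bound on what multithreading can recover follows directly from the stall fraction. The model below is an idealization, not a measurement: it assumes every thread stalls the same fraction of the time, that another thread is always ready to fill a stalled cycle, and that switching threads is free. The function name is hypothetical.

```python
def pipeline_utilization(stall_fraction, num_threads):
    """Idealized upper bound on the fraction of cycles doing useful work."""
    busy_per_thread = 1.0 - stall_fraction
    # The execution units cannot be more than fully busy.
    return min(1.0, num_threads * busy_per_thread)


for stall in (0.0, 0.30, 0.75):
    row = [pipeline_utilization(stall, n) for n in (1, 2, 4)]
    print(f"stalled {stall:.0%}: " + ", ".join(f"{u:.2f}" for u in row))
```

With no stalls, a second thread adds nothing. With the 75 percent stall figure quoted above, the bound is not reached until four threads share the pipeline. Real gains will be lower, since threads compete for caches and their stalls overlap.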

Going beyond multicore
The additional resources of a multithreaded processor can be used for more than simply recovering lost bandwidth, if the multithreading architecture provides for it. A multithreaded processor can thus have capabilities that have no equivalent in a multicore system based on conventional processors. For example, in a conventional processor, when an external interrupt event needs to be serviced, the processor takes an interrupt exception, where instruction fetch and execution suddenly restart at an exception vector. Interrupt vector code must save the current program state before invoking the interrupt service code, and must restore the program context before returning from the exception.

A multithreaded processor, by definition, can switch between two program contexts in hardware, without the need for decoding an exception or saving/restoring state in software. A multithreaded architecture targeted for real-time applications can potentially exploit this and allow for threads of execution to be suspended and then unblocked directly by external signals to the core, providing for zero-latency handling of interrupt events.
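The contrast between the two interrupt paths can be written down as a simple cycle-count model. Every parameter here is a placeholder to be filled in for a specific core. The example values are hypothetical and do not describe any real processor, and both function names are assumptions.

```python
def software_interrupt_overhead(vector_cycles, regs_saved, cycles_per_reg):
    """Overhead cycles around a handler on a conventional core.

    Counts vectoring to the exception handler plus a software save and a
    software restore of the interrupted context.
    """
    save = regs_saved * cycles_per_reg
    restore = regs_saved * cycles_per_reg
    return vector_cycles + save + restore


def hardware_thread_overhead(wake_cycles=0):
    """Overhead when a suspended hardware thread is unblocked by the signal.

    The handler's context is already resident in its own register set, so
    nothing is saved or restored in software.
    """
    return wake_cycles


# Hypothetical example values, for illustration only.
conventional = software_interrupt_overhead(vector_cycles=5, regs_saved=16, cycles_per_reg=1)
threaded = hardware_thread_overhead()

print(conventional, threaded)
```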

Arguably, from the standpoint of area and energy efficiency, the optimal SoC processor solution would use multithreaded cores as basic processing elements and replicate them in a multicore configuration if the application demands more performance than a single core can provide.

About the author
Kevin D. Kissell is Principal Architect at MIPS Technologies. His work at MIPS includes serving as principal architect of the MIPS MT multithreading architecture and the SmartMIPS extensions for smart cards. He holds a degree in computer science from the University of California at Berkeley.
