EE Times-Asia

Single-chip multiprocessing steps up

Posted: 16 Sep 2008

Keywords: symmetric multiprocessing, MIPS32 system, IC design, processor

Driving individual processor performance to the limit in a given implementation technology is never easy or efficient. Faster clocks, deeper pipelines and bigger caches carry silicon area and power dissipation costs that result in diminishing returns for that last 10 percent of performance. Sometimes there is no alternative but to turn up the clock and upgrade the power and cooling subsystems. Splitting a workload across multiple processors, however, raises the ceiling on total performance, and each processor's design can be simpler and more efficient.

Many embedded SoC designs today leverage multiple processors in an application-specific or loosely coupled manner. Until recently, SoC design options for software-friendly multiprocessing were limited. Now, SoC design components, such as the MIPS32 1004K coherent processing system, mean on-chip symmetric multiprocessing (SMP) under a single OS is a real option.

While parallel programming can cause software engineers to be apprehensive (because not all existing code was written for a parallel processing platform), there are several paradigms for parallel software, some of which are already familiar to software designers.

Handling data
Data-parallel algorithms carve up data sets to leverage more than one processor, up to a large number of CPUs. The textbook large data set is a large input file or data array, but in embedded systems, it can mean high I/O and event service bandwidth. In some SoC architectures, multiple input data sources (e.g. network interface ports) can be statically assigned to multiple processors running the same driver/router code for natural data parallelism.

When the performance of multiple processors is applied to a single data array or input stream, data-parallel algorithms that divide and conquer the data are common. Such algorithms are generally suboptimal on a single processor, but they compensate for that inefficiency by scaling to more computational bandwidth as processors are added. Converting a working sequential program to a data-parallel algorithm may be trivial, difficult or impossible, depending on factors such as the program's dependency characteristics. For more performance, system designers would probably look to implement data-parallel algorithms explicitly if most of the application's computation is done in a small number of long-running, regular computational loops.
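
The divide-and-conquer pattern described above can be sketched in a few lines. This is a minimal illustration rather than a tuned implementation: the function names (chunk, partial_sum_squares, parallel_sum_squares) and the sum-of-squares workload are assumptions chosen for the example, and Python's ThreadPoolExecutor is used only to keep the sketch short. In CPython, a process pool would typically be needed to get a real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, parts):
    """Split data into at most `parts` contiguous slices of near-equal size."""
    size, extra = divmod(len(data), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            slices.append(data[start:end])
        start = end
    return slices


def partial_sum_squares(block):
    """Work done independently by one worker on its own slice."""
    return sum(x * x for x in block)


def parallel_sum_squares(data, workers=4):
    """Divide the data, process the slices concurrently, combine the results."""
    blocks = chunk(data, workers)
    if not blocks:
        return 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_squares, blocks))


if __name__ == "__main__":
    values = list(range(1000))
    print(parallel_sum_squares(values, workers=4))
```

The extra splitting and combining steps are the overhead that makes such an algorithm slightly worse than a plain loop on one processor, while the independent slices are what let it scale.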

The emergence of multicore x86 chips for PCs, workstations and servers has generated new libraries and toolkits that make parallel algorithms easier to exploit on modest numbers of processors. Many are open source and portable to embedded architectures such as MIPS. OpenMP extensions to the GNU Compiler Collection (GCC) for data-parallel C/C++ and Fortran are becoming part of the standard GCC.

Dividing tasks
Control-parallel programming splits work by task rather than input. If an automobile factory where 100 workers are each building a car is a metaphor for a 100-way data parallel algorithm, the metaphor for a control parallel program is an assembly line with 100 worker stations, each performing 1/100th of the work. The assembly line is generally more efficient, but the work of assembling a car can only be divided so much. This limitation is significant for scientific codes scaling to thousands of processors but is not generally an issue with modestly parallel SoC architectures.

Software engineers often break programs into phases for easier coding, debugging and maintenance, and for reduced instruction memory and cache pressure. Often, the control parallel decomposition is already at the level of OS-visible tasks. The single command "cc" on a Unix-like system invokes, sequentially, a C language preprocessor, compiler, assembler and linker. Several of these can be run simultaneously, with each successive program using output of the previous phase as input, using files or the software pipes in Unix-like OS.

When a program has not already been decomposed into independently run tasks, some software engineering is needed to make application phases visible to the OS and underlying hardware, and to pass data explicitly between tasks. But there shouldn't be a need to rework the algorithms of the constituent phases. Coarse-grain task decomposition can be done in terms of processes communicating via files, sockets or pipes. For finer-grained control, the POSIX thread API, pthreads, is supported by many OS, including Linux, Microsoft Windows and several RTOS.
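
As a rough sketch of such a control-parallel decomposition, the following runs each phase in its own thread and passes data explicitly between phases through queues. Python's threading module stands in here for the pthreads C API, and the names stage, run_pipeline and _DONE are hypothetical helpers invented for this example. Error handling is omitted: a phase that raises an exception would stall the pipeline.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Run one pipeline phase: read items, transform them, pass them on."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # tell the next phase the stream has ended
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Connect the phases with queues and run each phase in its own thread."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["  int x;  ", " return 0; "], [str.strip, str.upper]))
```

Note that the phase functions themselves (str.strip and str.upper here) are unchanged; only the plumbing between them is new.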

Complex, modular, multitasking embedded software systems often exhibit serendipitous concurrency. The overall system mission may involve multiple tasks with distinct responsibilities responding to distinct inputs. Without a time-sharing OS, each task must run on a separate processor. On a time-sharing uniprocessor, they run in alternating time slices. On a multiprocessor with an SMP OS, they run concurrently across available processors.

Figure: On a time-sharing uniprocessor, tasks run in alternating time slices on one CPU. On a multiprocessor with an SMP OS, they run concurrently across processors.

Client-server programming
Distributed computing, typically in the form of network client-server models, is so common that it is sometimes not thought of as "parallel." Client-server programming is basically a form of control-flow decomposition. Rather than perform all computation itself, a program task sends work requests to specialized system tasks designated for specific jobs. Client-server programming is most commonly done across LANs and WANs, but an SMP SoC follows the same paradigm. Unmodified client-server binaries can communicate by TCP/IP via on-chip or null loopback network interfaces, or more efficiently with local communications protocols that pass data buffers in memory.
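
To make the loopback case concrete, here is a minimal sketch of a client handing a work request to a server task over TCP/IP on 127.0.0.1. The "square a number" service, the newline-terminated text protocol and the function names serve_once and demo_square are all assumptions made for illustration. On an SMP system the OS is free to run the two tasks on different processors.

```python
import socket
import threading


def serve_once(listener):
    """Server task: accept one client, square the number it sends, reply."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rw") as stream:
        n = int(stream.readline())
        stream.write(f"{n * n}\n")
        stream.flush()


def demo_square(n):
    """Start a server thread, then act as the client over the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(target=serve_once, args=(listener,))
        server.start()
        with socket.create_connection(("127.0.0.1", port)) as client, \
                client.makefile("rw") as stream:
            stream.write(f"{n}\n")
            stream.flush()
            reply = stream.readline()
        server.join()
    return int(reply)


if __name__ == "__main__":
    print(demo_square(12))
```

The same client code would work unchanged against a server on another machine, which is the point of the paradigm: only the address changes.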

These techniques may be used alone or in combination to leverage SMP power. One could even construct a data parallel array of distributed SMP servers, each implementing a control-flow pipeline.

In SoC systems where parallelism by static physical decomposition of tasks onto processors is possible, assignment of parallel tasks to processors can be done in hardware. This reduces software overhead and footprint, but provides no flexibility.

If an embedded application can be statically decomposed into clients and servers communicating across an on-chip interconnect, the only system software needed to tie the system together is message-passing code implementing a common protocol. The message-passing protocol provides a level of abstraction that enables configurations with more or fewer processors to run a common base of application code, but for any configuration, load balancing between processors is as static as the hardware partitioning. More flexible parallel system programming is enabled by software distribution of tasks across a multiprocessor system with shared resources.

OS differences
In an SMP OS, all processors see the same memory, I/O devices and global OS state, making program migration between processors simple and efficient, and load balancing easy. With no additional programming or system administration, a set of programs that multitasks on a single CPU using time slicing will run concurrently on the available CPUs of an SMP system. An SMP scheduler, like that of Linux, switches programs on and off of processors.

A Linux application running as multiple processes needs no modification to exploit SMP parallelism, and no recompilation is generally required. An SMP Linux environment provides numerous tools for tuning how tasks share available processors: raising or lowering task priorities, or restricting tasks to run on arbitrary processor subsets. Appropriate kernel support enables use of different real-time scheduling regimes.

Unix-like OS always offer applications some control over relative task scheduling priority, even in uniprocessor time-sharing systems. The traditional nice shell command and system call are augmented in Linux with more elaborate mechanisms to manipulate the priority of individual tasks, groups of tasks or the tasks of specific system users. Also, in multiprocessor configurations, every Linux task has a parameter that specifies the set of processors on which the task may be scheduled. The default is the full set of system processors, but this CPU affinity is controllable.
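
The CPU affinity and priority controls described above are exposed to programs as well as to the shell. The sketch below uses Python's os.sched_getaffinity, os.sched_setaffinity and os.nice, which wrap the corresponding Linux system calls. The helper name pin_to_cpus is hypothetical, and the demonstration is guarded because the affinity calls are not available on every platform.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a task (pid 0 means the caller) to the given set of CPU numbers.

    Returns the previous affinity set so the caller can restore it.
    Linux-only: os.sched_setaffinity does not exist on every platform.
    """
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, cpus)
    return previous


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)       # default: every CPU the task may use
    old = pin_to_cpus({min(allowed)})       # restrict to a single CPU
    print("now restricted to", os.sched_getaffinity(0))
    pin_to_cpus(old)                        # restore the original set
    os.nice(1)                              # lower this process's priority one step
```

The taskset and nice shell commands offer the same controls without any change to the program itself.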

An SMP paradigm requires that all processors see all memory at the same addresses. For low-performance processors, this is accomplished by placing the instruction fetch and load/store traffic of all processors on a common memory and I/O bus. This model breaks down quickly with more processors, however, as the bus becomes a bottleneck. Even in uniprocessor systems, bandwidth requirements for instructions and data of high-performance embedded cores dictate that cache memories be used between main memory and processor.

A system with independent per-processor caches is no longer naturally SMP. When one processor's cache contains the only copy of the most recent location value in memory, there's asymmetry. Cache coherence protocols must be added to restore symmetry.

In simple systems where all processors are connected to a common bus, cache controllers can monitor the bus to see which cache owns the latest version of a given memory location. In more-advanced systems, processors are connected to memory using point-to-point connections to a switching fabric, so cache coherence requires more-sophisticated support. A coherence management unit should impose global order on memory transactions and generate intervention signals to maintain cache coherence among processor cores.
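
The bus-monitoring ("snooping") idea can be illustrated with a toy model. This is a simplified write-invalidate scheme at single-word granularity, written only to show why caches must watch each other's traffic. It is not a description of the 1004K's coherence manager or of any particular protocol, and the class and method names are invented for the example.

```python
class SnoopingBus:
    """Toy shared bus with per-CPU caches kept coherent by snooping."""

    def __init__(self, n_caches):
        self.memory = {}
        # Each cache maps address -> (value, dirty flag).
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr in cache:
            return cache[addr][0]            # hit: no bus traffic needed
        # Miss: every other cache snoops the request; a cache holding a
        # dirty copy writes it back so memory has the latest value.
        for other in self.caches:
            if addr in other and other[addr][1]:
                self.memory[addr] = other[addr][0]
                other[addr] = (other[addr][0], False)
        value = self.memory.get(addr, 0)
        cache[addr] = (value, False)
        return value

    def write(self, cpu, addr, value):
        # Invalidate every other copy so this cache owns the only one.
        for i, other in enumerate(self.caches):
            if i != cpu:
                other.pop(addr, None)
        self.caches[cpu][addr] = (value, True)


if __name__ == "__main__":
    bus = SnoopingBus(2)
    bus.write(0, 0x10, 5)      # CPU 0 holds the only, dirty copy
    print(bus.read(1, 0x10))   # CPU 1 still sees the latest value
```

Without the snoop on the read miss, CPU 1 would fetch a stale value from memory, which is exactly the asymmetry that coherence hardware exists to remove.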

An SMP OS such as Linux can freely migrate tasks and dynamically balance processor loads. In an embedded SoC, a substantial portion of overall computation can be spent in interrupt service. Good load balancing and performance tuning require control of where interrupt service is performed. The Linux OS has an interrupt request (IRQ) affinity control interface that lets users and programs specify which processors service a given interrupt.
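
On Linux, one form of this interface is the file /proc/irq/<N>/smp_affinity, which takes a hexadecimal bitmask with one bit per CPU. The sketch below shows how such a mask could be built and written. The helper names cpu_mask and set_irq_affinity are hypothetical, the write requires root privileges, and the single-word mask format shown here assumes a system with no more than 32 CPUs.

```python
def cpu_mask(cpus):
    """Encode a collection of CPU numbers as a hex bitmask (bit n = CPU n)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus):
    """Write the mask to /proc/irq/<irq>/smp_affinity (needs root on Linux)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_mask(cpus))


if __name__ == "__main__":
    # Steering an interrupt to CPUs 0 and 1 would use this mask.
    print(cpu_mask({0, 1}))
```

A call such as set_irq_affinity(irq, {2, 3}) would then confine service of that interrupt to CPUs 2 and 3, with the IRQ number taken from /proc/interrupts on the target system.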

Cache coherence infrastructure is useful, not only between processors for SMP, but between processors and I/O DMA channels. Maintaining that coherence in software requires the CPU to process DMA buffers (typically by writing back or invalidating their cached copies) before or after each I/O DMA operation. This can have a significant performance impact on I/O-intensive applications. Using I/O coherence hardware to connect I/O DMA to memory allows DMA traffic to be ordered and integrated with the coherent load/store flow, eliminating the software overhead.

Increasing efficiency
A cache coherence management unit should impose order on memory traffic among processors, I/O and memory. This adds cycles to a processor's memory access time and results in lost processor cycles via pipeline stalls. However, techniques such as hardware multithreading in each core allow a single core to execute concurrent instruction streams to increase pipeline efficiency.

Each hardware thread in a core looks just like a full-blown CPU to OS software, including having independent interrupt inputs. The threads of a core share the same cache and functional units and interleave their pipeline execution. If one thread stalls, another can execute, taking back cycles otherwise lost to coherent memory subsystem latency. The same SMP OS that manages multiple cores can manage their constituent hardware threads. Software written to exploit SMP naturally exploits multithreading, and vice versa.

Two threads competing for a single pipeline will achieve lower performance than two threads on independent cores, so the SMP Linux kernel's load balancing should be optimized with this in mind. For power consumption optimization, the scheduler can load work onto the virtual processors of one core at a time, leaving others in a low-power state. For performance optimization, it can spread work across the cores, and then load multiple threads per core once all cores have an active task.
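
The two placement strategies differ only in the order in which hardware threads are filled, which a small model makes visible. This is an illustrative sketch, not the Linux scheduler's actual algorithm: the function place_tasks, its parameters and the policy names "power" and "performance" are assumptions made for the example.

```python
def place_tasks(n_tasks, cores, threads_per_core, policy):
    """Return the (core, thread) slot chosen for each of n_tasks.

    "power": fill every hardware thread of one core before waking the next.
    "performance": give each core one task before doubling up on any core.
    """
    if n_tasks > cores * threads_per_core:
        raise ValueError("more tasks than hardware threads")
    if policy == "power":
        slots = [(c, t) for c in range(cores) for t in range(threads_per_core)]
    elif policy == "performance":
        slots = [(c, t) for t in range(threads_per_core) for c in range(cores)]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return slots[:n_tasks]


if __name__ == "__main__":
    for policy in ("power", "performance"):
        print(policy, place_tasks(3, cores=2, threads_per_core=2, policy=policy))
```

With three tasks on two dual-threaded cores, the power policy leaves one thread of the second core idle, while the performance policy puts a task on each core before any pipeline is shared.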

On-chip multiprocessing can be leveraged for high SoC performance. SMP platforms and software provide a flexible, high-performance computing platform that can deliver significant speedup relative to single processors, often with little or no application code modification.

- Mark Throndson
Director of Product Marketing, Processor Business Group
MIPS Technologies Inc.
