EE Times-Asia > EDA/IP

GPUs tapped for general computing

Posted: 01 Jun 2005

Keywords: GPU, CPU, chip, SIMD, Moore's law

Over the past 10 years, graphics processing units (GPUs) have become ubiquitous in desktop computers. GPUs feature enormous computing power and high memory bandwidth, both essential to sustaining high performance on 3D graphics applications. Historically, GPU computation has been specialized for these applications, but modern GPUs now allow substantial user programmability that lets the GPU be retasked for more general-purpose computation. First, however, general-purpose computation must be efficiently mapped onto the GPU.

Three-dimensional graphics computations are organized into a graphics pipeline that outlines the series of computation stages between the scene input and the image output. The input to the graphics pipeline is a scene consisting of a list of geometry (defined as connected vertices) and graphics instructions to compute the scene from the geometry. The GPU then processes and transforms those vertices into screen-space geometry, which in turn is divided into pixel-sized fragments (in a process called rasterization) according to which pixels are covered by that geometry. Each fragment is then associated with a pixel position on the screen. Finally, the fragments are processed and assembled into an image made of pixels.
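The stages above can be sketched in miniature. The following Python toy model (illustrative only; real GPUs implement these stages in dedicated hardware, and the function names `transform`, `rasterize`, and `shade` are hypothetical) follows the same flow: scene vertices are transformed into screen space, rasterized into pixel-sized fragments, and each fragment is shaded into a pixel of the output image.

```python
def transform(vertex, offset):
    """Vertex stage: map a scene-space vertex into screen space
    (here, a simple translation stands in for the full transform)."""
    x, y = vertex
    return (x + offset[0], y + offset[1])

def rasterize(v0, v1):
    """Rasterization: divide screen-space geometry (here an axis-aligned
    rectangle spanning v0..v1) into pixel-sized fragments."""
    (x0, y0), (x1, y1) = v0, v1
    return [(x, y) for y in range(y0, y1) for x in range(x0, x1)]

def shade(fragment):
    """Fragment stage: compute a color for one pixel position."""
    x, y = fragment
    return (x * 16 % 256, y * 16 % 256, 128)  # arbitrary color function

# Scene input: two vertices defining a rectangle, plus a graphics
# instruction (the offset) describing how to place it.
verts = [transform(v, offset=(1, 1)) for v in [(0, 0), (3, 2)]]
frags = rasterize(*verts)                 # one fragment per covered pixel
image = {f: shade(f) for f in frags}      # fragments assembled into pixels
```

Each stage consumes the previous stage's output, mirroring the pipeline organization described above.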

The pipeline contains on the order of a dozen stages, each of which is implemented on the GPU as a separate processor on the same die. The typical input to the pipeline is tens to hundreds of thousands of vertices, each of which can be processed in parallel on the GPU. The complex graphics pipeline is thus divided in space, with separate processors on the GPU running each stage in parallel. This organization, which allows each processor to be specialized to its task, differs from that of a CPU, which features only a single processor. Implementing the graphics pipeline on a CPU divides this complex task in time: The CPU processes the first stage of the pipeline on its processor, then begins the next stage only when the first stage is complete.

From the perspective of a general-purpose programmer, the GPU features two stages of interest, both user-programmable. The vertex stage runs a user-specified program on each vertex input; the fragment stage runs a user-specified program on every fragment.

Both stages feature a similar programming model and are most efficient when run on long lists of inputs (hundreds to thousands of vertices or fragments). Each of those inputs, however, is processed independently, so the computation of one vertex or fragment never affects another vertex or fragment. In this way, many vertices or fragments can be processed in parallel. Each input is processed with the same fragment or vertex program. The programs are currently most efficient when run under single-instruction, multiple-data (SIMD) control, meaning that each vertex or fragment is computed in parallel with the same sequence of instructions controlling each computation. However, recent hardware has begun to allow more complex control flow operations in such programs.
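The independence property described above is what makes the model parallel: because no input's result depends on another's, applying the same program to every element of a list is order-independent. A minimal Python sketch (the `fragment_program` here is hypothetical, standing in for a user-specified shader):

```python
def fragment_program(x):
    # The same instruction sequence runs for every input, as in
    # SIMD control: no branches depend on other fragments' results.
    return x * x + 1.0

inputs = [float(i) for i in range(8)]

# Since each computation reads only its own input, a plain map is
# equivalent to running all eight in parallel on separate processors.
outputs = [fragment_program(x) for x in inputs]
```

On the GPU, the hardware performs this map across many fragment units at once; the programmer supplies only the per-element program.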

So what do the fragment or vertex programs look like? First, they both support only IEEE single-precision floating-point operations. Second, the programs can read from any location in global memory, but cannot write to arbitrary global memory. Instead, the output from a vertex program is a single vertex, and the output from a fragment program is just one fragment at the fragment's pixel position. While these limitations are significant from the point of view of the general-purpose programmer, they allow the GPU to support many functional units that can evaluate the vertex or fragment programs on many data elements in parallel. The aggregate arithmetic rate of the fragment programs in Nvidia's GeForce 6800, for example, is more than 50 billion floating-point operations per second across its 16 parallel fragment processors.
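The read/write asymmetry just described is the gather-versus-scatter distinction. A short sketch (the buffer names are hypothetical): a fragment program may read ("gather") from arbitrary, even data-dependent, locations in a global texture, but its single write always lands at its own pixel position; it cannot "scatter" writes to addresses of its choosing.

```python
texture = [10.0, 20.0, 30.0, 40.0]   # read-only global memory

def fragment_program(pixel_index):
    # Gather: reads from arbitrary locations in global memory are allowed.
    left = texture[(pixel_index - 1) % len(texture)]
    right = texture[(pixel_index + 1) % len(texture)]
    return 0.5 * (left + right)

# No scatter: the hardware, not the program, decides where each result
# is written. Fragment i's output always becomes pixel i of the image.
output_image = [fragment_program(i) for i in range(len(texture))]
```

Fixing the write location is exactly what lets many functional units run these programs in parallel without write conflicts.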

Given the programmability of the vertex and fragment programs, general-purpose applications can then be run on the GPU. Most efforts to date have concentrated on the fragment stage because it features more processors and higher performance than the vertex stage. The key to making these applications efficient is to map the necessary computation onto the SIMD parallel programming model of the fragment processing hardware.

One typical method of structuring a general-purpose program on graphics hardware follows:

- A large instance of geometry, such as a screen-sized rectangle, is specified. This geometry covers a substantial portion of the screen and thus generates a large number of fragments.

- Each fragment generated by this geometry can be processed in parallel. Modern GPUs can process 16 fragments in parallel with the same instruction (with many times that number in flight at the same time). Each fragment computation can read from arbitrary locations in global memory (in vector terminology, a "gather") but cannot write to arbitrary global memory (a "scatter").

- The output of the fragment program is an image in global memory, with each fragment computation corresponding to a separate pixel in that image.

- Some simple applications can be expressed in a single pass of the graphics pipeline, but many complex ones cannot. These more-complex applications use multiple passes through the graphics pipeline by using the global image output of one pass in the computation of subsequent passes.
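The multipass structure above can be sketched as follows. In this Python model (a Jacobi-style averaging step is used as the fragment program; it is a common GPGPU example, not one taken from the article), each pass runs one fragment computation per pixel of a buffer, gathering from the previous pass's output image, and the resulting image becomes the read-only input to the next pass.

```python
def render_pass(src):
    """One full-screen pass: one fragment computation per pixel.
    Each fragment gathers from the previous pass's image `src` and
    writes only to its own position in the new image."""
    n = len(src)
    return [(src[(i - 1) % n] + src[(i + 1) % n]) / 2.0 for i in range(n)]

image = [0.0, 0.0, 4.0, 0.0]     # initial data uploaded as a texture
for _ in range(3):               # multiple passes through the pipeline
    image = render_pass(image)   # pass N's output feeds pass N+1
```

On real hardware each pass renders into a buffer that is then bound as a texture for the next pass, since a pass cannot read and write the same memory.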

With its superior computation rate and memory bandwidth compared with a CPU, the GPU has the potential to excel at applications that map well to its programming model. The challenge to programmers is to efficiently map their applications of interest to the graphics metaphors and the restrictive programming model of the graphics hardware.

John Owens

Assistant Professor, Department of Electrical and Computer Engineering

University of California, Davis
