Simulating fluid flow in a graphics chip
Keywords: GPU, CPU, chip, SIMD, 2D
Using the GPU as a general-purpose processor, a pursuit that goes under the acronym GPGPU, has become an increasingly popular research topic in computer graphics. As the processing power of the GPU evolves, many parallelizable techniques gain a substantial speedup when their processing is moved from the CPU to the GPU. To see how certain classes of general-purpose algorithms can be implemented on the GPU, however, it is important to understand modern graphics hardware architectures. In this article, an example application focused on incompressible fluid simulation using the Navier-Stokes equations demonstrates how well GPUs match the demands of fluid simulation.
The graphics processor is highly optimized to render 3D geometry and has several parallel vertex pipes that process multiple vertices simultaneously. A triangle scan converter then generates pixels from the processed vertices, and a number of parallel pixel pipelines produce the final image.
Overall, the graphics pipeline is designed so that there is little or no dependent processing among different vertices and pixels. Hence, individual vertices and pixels can be processed independently, with their results combined in later pipeline stages.
To benefit from the processing power of the GPU, an algorithm should have characteristics that coincide with those of 3D geometry rendering. These characteristics include:
• Uniform grid data output: The output of a rendering algorithm is a 2D image, so algorithms that operate on a uniform sample grid benefit most from GPU processing.
• Localized computation: For each processed value, the GPU performs an independent set of data-gathering and processing operations. Global operations, such as computing the maximum value of a large 2D array, require multiple processing passes over the data and are not especially efficient on a GPU. Algorithms that require independent computation on one sample at a time are better suited to graphics hardware because they are inherently parallelizable.
• No data scattering: GPUs do not support scatter operations; a pixel shader cannot choose where its output is written, since the output location is predetermined by rasterization.
To best utilize the hardware functionality, GPGPU algorithms often store input data in one or more off-chip floating-point texture images and perform several rendering passes in which the computations are executed. One can think of these texture images as 2D floating-point arrays. Most GPGPU algorithms employ processing passes that render quadrilaterals spanning the entire rectangular range of a renderable texture.
During these rendering passes, the programmable pixel-shader programs read the input textures, process data and write results. Complex algorithms, such as fluid simulations, can be implemented using several such passes to process data.
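To make the pattern concrete, here is a minimal CPU-style sketch of what one such rendering pass amounts to conceptually (the names, grid size and example kernel are illustrative assumptions, not actual graphics-API code): every output cell is computed independently by gathering from a read-only input buffer, and successive passes typically swap ("ping-pong") the roles of the input and output buffers.

    /* Conceptual CPU analogue of one GPGPU rendering pass: each output
     * cell is computed independently by gathering from read-only input. */
    enum { W = 256, H = 256 };

    /* One "pass": src plays the role of an input texture, dst the render
     * target. There is no scatter: cell (x, y) writes only dst[y*W + x]. */
    static void run_pass(const float *src, float *dst,
                         float (*kernel)(const float *src, int x, int y))
    {
        for (int y = 0; y < H; ++y)
            for (int x = 0; x < W; ++x)
                dst[y * W + x] = kernel(src, x, y);
    }

    /* Example kernel: average a cell with its four neighbors, clamping
     * at the grid border (the GPU equivalent is a clamped texture fetch). */
    static float box_average(const float *src, int x, int y)
    {
        int xl = x > 0 ? x - 1 : 0, xr = x < W - 1 ? x + 1 : x;
        int yd = y > 0 ? y - 1 : 0, yu = y < H - 1 ? y + 1 : y;
        return 0.2f * (src[y * W + x] + src[y * W + xl] + src[y * W + xr] +
                       src[yd * W + x] + src[yu * W + x]);
    }

A complete simulation step chains several such passes; in the fluid algorithm described below, the density, velocity and pressure textures play the roles of these input and output buffers.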
Recently, GPUs have been used for various algorithms such as image and volume processing, which have the characteristics described above.
The well-known Navier-Stokes equations describe the behavior of fluids. These equations provide a means of computing changes in density and velocity over a small period of time.
The first two equations compute the change in velocity and density: both fields are advected by the velocity field (first term), slowly diffused over time (second term) and acted upon by external forces (third term). The third equation is enforced in a later step, called the projection step, which updates the velocity field so that it remains mass-conserving.
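For reference, in the stable-fluids formulation followed here (Stam, 1999), with $\mathbf{u}$ the velocity field, $\rho$ the density, $\nu$ and $\kappa$ the viscosity and diffusion constants, and $\mathbf{f}$ and $S$ the external forces and density sources, the equations can be written as

$$\frac{\partial \mathbf{u}}{\partial t} = -(\mathbf{u} \cdot \nabla)\mathbf{u} + \nu \nabla^2 \mathbf{u} + \mathbf{f}, \qquad \frac{\partial \rho}{\partial t} = -(\mathbf{u} \cdot \nabla)\rho + \kappa \nabla^2 \rho + S, \qquad \nabla \cdot \mathbf{u} = 0.$$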
To model these continuous equations in the discrete world of the graphics processor, we follow the semi-Lagrangian technique presented by Jos Stam in 1999. Since that technique performs all computation locally without requiring data scatter, it maps well to the GPU. We first divide the 2D space into a square grid of small cells (i.e., samples) and perform all computations locally, updating a sample's density and velocity based on the values of itself and its neighbors in the previous iteration. We perform all derivative computations via finite differencing between neighboring samples.
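For example, with grid spacing $h$, the x-derivative of a quantity such as the pressure $p$ at cell $(i, j)$ is approximated by the central difference

$$\left(\frac{\partial p}{\partial x}\right)_{i,j} \approx \frac{p_{i+1,j} - p_{i-1,j}}{2h},$$

which on the GPU amounts to two texture fetches from the cell's horizontal neighbors.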
To store the density and velocity of the fluid on a 2D grid, we use two floating-point texture images. We also use additional scratch images for divergence and pressure, and we perform multiple rendering passes to compute a stable solution for each simulation time step. The steps of the algorithm are as follows:
• External forces: First, we perform two rendering passes that increment the density and velocity textures to account for external forces that inject fluid.
• Diffusion: Diffusion of the velocity field is performed via local relaxation: each sample fetches the velocity values of its neighbors and updates its own value accordingly (a per-cell sketch of this neighbor-gathering pattern appears after this list).
• Advection: For a given sample, we advect flow by first fetching the velocity at that sample. Based on that velocity, we then sample the data that should be carried to the current cell's location. The velocity field is used to advect both the density and the velocity field itself.
• Mass conservation: In the final step, we must ensure that the velocity field, which has been modified by the previous steps, remains mass-conserving. Computing the mass-conserving component of the velocity field involves solving a Poisson equation, which can be approximated using local relaxation over several rendering passes.
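As a rough illustration of the per-cell work these passes perform, the sketch below (continuing the illustrative C code above; the grid layout, time step and boundary handling are assumptions, not the article's actual shader code) shows semi-Lagrangian advection of one field and a single Jacobi relaxation iteration of the pressure Poisson solve; the diffusion step gathers from the same four neighbors in the same manner.

    /* Grids are W*H floats; u and v hold the x and y velocity components.
     * idx() clamps to the border, mimicking a clamped texture fetch. */
    static int idx(int x, int y)
    {
        if (x < 0) x = 0;
        if (x > W - 1) x = W - 1;
        if (y < 0) y = 0;
        if (y > H - 1) y = H - 1;
        return y * W + x;
    }

    /* Semi-Lagrangian advection: trace backward from cell (x, y) along the
     * velocity for one time step dt, then bilinearly sample the advected
     * field q at that departure point (a bilinear texture fetch on the GPU). */
    static float advect(const float *q, const float *u, const float *v,
                        int x, int y, float dt)
    {
        float px = (float)x - dt * u[idx(x, y)];
        float py = (float)y - dt * v[idx(x, y)];
        if (px < 0.0f) px = 0.0f;
        if (py < 0.0f) py = 0.0f;
        int x0 = (int)px, y0 = (int)py;
        float fx = px - (float)x0, fy = py - (float)y0;
        float bot = q[idx(x0, y0)]     * (1.0f - fx) + q[idx(x0 + 1, y0)]     * fx;
        float top = q[idx(x0, y0 + 1)] * (1.0f - fx) + q[idx(x0 + 1, y0 + 1)] * fx;
        return bot * (1.0f - fy) + top * fy;
    }

    /* One Jacobi relaxation iteration of the pressure Poisson equation
     * (laplacian(p) = divergence of velocity), with grid spacing h: average
     * the four neighboring pressures and subtract the scaled divergence. */
    static float pressure_iteration(const float *p, const float *div,
                                    int x, int y, float h)
    {
        float nb = p[idx(x - 1, y)] + p[idx(x + 1, y)] +
                   p[idx(x, y - 1)] + p[idx(x, y + 1)];
        return (nb - h * h * div[idx(x, y)]) * 0.25f;
    }

After enough pressure iterations, a final pass subtracts the pressure gradient from the velocity to yield the mass-conserving field.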
Since the mass-conservation step can be fairly expensive, we would like to allocate GPU resources more efficiently by concentrating computation in regions of higher pressure. We do this by exploiting early-z culling, a GPU feature that was created for efficient graphics processing but can be cleverly repurposed to optimize general-purpose computation.
Prior to pixel processing, the graphics hardware checks each pixel's interpolated depth against the z-value stored in the z-buffer. If this test fails, the pixel is discarded before the pixel shader runs. This early-z culling, used in real-time rendering to skip hidden pixels, can be exploited to optimize our fluid simulation algorithm.
The expensive mass-conservation step is performed as a series of rendering passes to solve a linear system using a relaxation method. When approximating the solution to this linear system, the higher the number of iterations (rendering passes), the more accurate the result. Instead of performing the same number of passes on all cells, we use early-z-culling to perform more passes on high-pressure regions and fewer passes on regions with little to no pressure. This is accomplished by performing an additional inexpensive rendering pass that sets the z-value of each cell in the simulation based on the maximum value of that cell and its nearby cells from the pressure buffer. Then, when performing the expensive pressure computation passes, we set the depth-compare state to "less than or equal to" and linearly increase the z-value on each of the projection passes.
As these passes are rendered, the number of cells that are processed gradually decreases, improving the speed of the simulation.
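A CPU-style analogue of this adaptive scheme might look as follows (continuing the sketch above and reusing W, H and pressure_iteration(); the cell_depth buffer, the linear pass-depth schedule and the copy-back loop are illustrative assumptions rather than the hardware's exact behavior):

    /* cell_depth holds, per cell, the value written by the cheap extra
     * pass: the local maximum pressure magnitude mapped into [0, 1]. */
    static void adaptive_pressure_solve(float *p, float *scratch,
                                        const float *div,
                                        const float *cell_depth,
                                        int max_passes, float h)
    {
        for (int k = 0; k < max_passes; ++k) {
            /* The quad rendered for pass k carries a linearly rising depth. */
            float pass_depth = (float)(k + 1) / (float)max_passes;

            for (int y = 0; y < H; ++y)
                for (int x = 0; x < W; ++x) {
                    int i = y * W + x;
                    /* "Less than or equal" depth test: once the pass depth
                     * exceeds a cell's stored depth, the cell is culled and
                     * does no further relaxation work. */
                    if (pass_depth <= cell_depth[i])
                        scratch[i] = pressure_iteration(p, div, x, y, h);
                    else
                        scratch[i] = p[i];
                }

            /* Copy the new iterate back (the GPU version ping-pongs textures). */
            for (int i = 0; i < W * H; ++i)
                p[i] = scratch[i];
        }
    }

Low-pressure cells stop being processed after only a few passes, while high-pressure regions receive the full number of relaxation iterations.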
- Pedro Sander
Member, Application Research Group
ATI Research Inc.