EE Times-Asia

Improve software through memory layout optimisation

Posted: 25 Nov 2014

Keywords: algorithms, memory optimisation, compiler, SIMD, vectorisation

When a processor reads or writes memory, it often does so at the granularity of the machine's word size, which might be four bytes on a 32bit system. Data alignment is the practice of placing data elements at offsets that are a multiple of the word size so that each field can be accessed efficiently. To align data for a given processor target, users may need to insert padding into their data structures, or the tools may pad data structures automatically according to the underlying ABI and data type conventions.

Alignment can have an impact on compiler and loop optimisations such as vectorisation. For instance, if the compiler is attempting to vectorise computation occurring over multiple arrays within a given loop body, it will need to know whether the data elements are aligned so as to make efficient use of packed SIMD move instructions, and also to know whether certain iterations of the loop nest that execute over non-aligned data elements must be peeled off.
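The peeling described above can be written out by hand; the sketch below mimics what a vectorising compiler generates, assuming 16-byte alignment to match 128-bit SIMD registers (the function name and boundary are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Hand-written sketch of compiler loop peeling: scalar iterations are
   "peeled" off until 'a' reaches a 16-byte boundary, after which the
   main loop is a candidate for aligned packed loads and stores. */
void add_arrays(float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* peel loop: runs over the leading misaligned elements, if any */
    while (i < n && ((uintptr_t)(a + i) % 16) != 0) {
        a[i] += b[i];
        i++;
    }

    /* main loop: 'a + i' is now 16-byte aligned; a vectoriser can emit
       aligned SIMD moves here ('b' may still need unaligned loads) */
    for (; i < n; i++)
        a[i] += b[i];
}
```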

If the compiler cannot determine whether or not the data elements are aligned, it may opt to not vectorise the loop at all, thereby leaving the loop body sequential in schedule. Clearly this is not the desired result for the best-performing executable. Alternatively, the compiler may decide to generate multiple versions of the loop nest with a run-time test to determine at loop execution time whether or not the data elements are aligned. In this case the benefits of a vectorised loop version are obtained; however, the cost of a dynamic test at run-time is incurred and the size of the executable will increase due to multiple versions of the loop nest being inserted by the compiler.
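A hand-written equivalent of this multi-versioning might look as follows, again assuming a 16-byte SIMD alignment requirement; both branches compute the same result, but only the aligned branch is a candidate for aligned packed moves (all names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

void scale16(int16_t *dst, const int16_t *src, size_t n, int16_t k)
{
    if (((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0) {
        /* aligned version: the compiler can vectorise this copy of
           the loop using aligned SIMD loads and stores */
        for (size_t i = 0; i < n; i++)
            dst[i] = (int16_t)(src[i] * k);
    } else {
        /* fallback version: left sequential, or compiled with
           slower unaligned moves */
        for (size_t i = 0; i < n; i++)
            dst[i] = (int16_t)(src[i] * k);
    }
}
```

This makes the trade-off visible: the run-time test costs a few cycles per call, and the duplicated loop body is exactly the code-size growth the compiler incurs.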

Users can often do multiple things to ensure that their data is aligned, for instance padding elements within their data structures and ensuring that various data fields lie on the appropriate word boundaries. Many compilers also support sets of pragmas to denote that a given element is aligned. Alternatively, users can put various asserts within their code to compute at run-time whether or not the data fields are aligned on a given boundary before a particular version of a loop executes.

Selecting data types for big payoffs
In addition to the alignment strategies above, it is important that application developers select the appropriate data types for their performance-critical kernels. Choosing the smallest acceptable data type for a computation can have a number of secondary effects that benefit kernel performance. Consider, for example, a performance-critical kernel that can be implemented in either 32bit or 16bit integer computation, thanks to the application programmer's knowledge of the data range. If the application developer selects 16bit computation using one of the built-in C/C++ language data types such as "short int", then the following benefits may be gained at system run-time.

By selecting 16bit over 32bit data elements, more data elements can fit into a single data cache line. This allows fewer cache line fetches per unit of computation, and should help alleviate the compute-to-memory bottleneck when fetching data elements. In addition, if the target architecture supports SIMD-style computation, it is highly likely that a given ALU within the processor can support multiple 16bit computations in parallel versus their 32bit counterparts.

For example, many commercially available DSP architectures support packed 16bit SIMD operations per ALU, effectively doubling the computational throughput when using 16bit data elements versus 32bit data elements. Given the packed nature of the data elements, whereby additional data elements are packed per cache line or can be placed in user-managed scratchpad memory, coupled with the increased computational efficiency, it may also be possible to improve the power efficiency of the system due to the reduced number of data memory fetches required to fill cache lines.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software Engineering for Embedded Systems, written and edited by Robert Oshana and Mark Kraeling.

About the author
Dr. Michael C. Brogioli is principal and founder at Polymathic Consulting as well as an adjunct professor of computer engineering at Rice University, Houston, Texas. Prior to Polymathic, he was senior member of the technical staff and chief architect at Freescale Semiconductor. He has also served in several roles at TI's Advanced Architecture and Chip Technology Research and Intel's Advanced Microprocessor Research Lab. He holds a PhD/MSc in electrical and computer engineering from Rice as well as a BSc in electrical engineering from Rensselaer Polytechnic Institute.
