Improve software through memory layout optimisation
Keywords:algorithms? memory optimisation? compiler? SIMD? vectorisation?
In determining those portions of the code which dominate the dynamic application run- time, and may be the best candidates for either hand optimisation or hand adjustment for applicability to automated-tool optimisation, application developers typically use a software profiler in conjunction with either the silicon target or software-based system simulation. Intel's VTUNE is one such example of a profiling framework; alternatively the GNU GCC compiler and GPROF are open-source solutions that provide dynamic run-time information. Many silicon vendors such as Freescale Semiconductor and Texas Instruments also offer their own proprietary solutions for use with their respective silicon targets, allowing for either traces collected on software-based simulation platforms, or alternatively larger application-level traces that can be collected on the native silicon target.
Vectorisation and the dynamic code: compute ratio
Vectorisation of loops is an optimisation whereby computation performed across multiple loop iterations can be combined into single vector instructions, effectively increasing the instruction-to-compute ratio within the application's dynamic run-time behaviour. Consider the example in figure 5.
![]() |
Figure 5: Loop level vectorisation example. |
In the first loop nest, we can see that each iteration of the loop contains a single 16bit by 16bit multiply instruction whose result is written to the a[] array as output. One multiplication instruction is performed for each iteration of the loop, resulting in 16 16bit multiplications. The second loop, however, shows pseudocode for how the compiler or application developer might vectorise the loop when targeting an architecture that supports a four-way SIMD multiply instruction over 16bit integer elements.
In this case, the compiler has vectorised multiple iterations of the loop together into the multiply instruction, as denoted by the array[start_range:end_range] syntax denoted in the second loop nest. Note that the loop counter is incremented by the vectorised length for each iteration of the loop now. Clearly only four iterations over the loop are now needed to compute the resulting an output array, as each iteration of the loop now contains a single vector multiply instruction that computes four elements of the output vector in parallel.
There are many benefits to vectorising code in this manner, either by hand if the application developer uses intrinsics that are proprietary with respect to the target architecture, or if the compiler is able to vectorise the code. One such benefit is the increase in performance, as the code now exploits dedicated SIMD hardware, often providing a multiplication in improvement over the vectorised loop on the order of the underlying SIMD vector hardware.
Other benefits are the reduction in code size, as loops are no longer unrolled resulting in explosions in the code size, but rather more dense instructions of vector format are used rather than atomic scalar instructions. This may have secondary benefits in reducing the number of instruction fetch transactions that go out to memory as well. Lastly, the overall ratio of dynamically issued instructions to computation performed within the application is increased as well.
There are a number of challenges to both the development tools and the application developers when trying to vectorise code at the loop level. One such challenge is the code shape of loop nests that are candidate for vectorisation. Typically, build tools need to understand the loop iteration space of a loop, so using constant loop bounds rather than run- time computed values may be beneficial depending on the advancement of the underlying compiler's vectorisation technology.
Secondly, the types of computation performed within the loop nest must be amenable to vectorisation. For example, in the example above simple 16bit integer multiplication is performed for a target architecture supporting a supposed 16bit four-way SIMD multiply instruction. If the underlying target architecture only supports 8bit SIMD multiplication, it may be advantageous to avoid 16bit multiplication wherever possible if vectorisation is desired.
Loop dependence analysis is another concern when vectorising or parallelizing loop nests, as the compiler must be able to prove the safety of loop transformations. Loop dependence analysis is the means by which the compiler or dependence analyser determines whether statements within a loop body form a dependence with respect to array accesses and data modifications, various data reduction patterns, simplification of loop-independent portions of the code and management of various conditional execution statements within the loop body.
As an example, consider the fragment of C-language code in figure 6.
![]() |
Figure 6: Fragment of C language code. |
For the loop above, the compiler's data dependence analyser will attempt to find all dependences between the statements reading the array b[] and writing to the array a[]. The challenge for the data dependence analyser is to find all possible dependences between the statements that write to array a[] and read from array b[]. To ensure safety, the data dependence analyser must ensure that it can explicitly prove safety or, in other words, any dependence that cannot be proven false must be assumed to be true to ensure safety!
Related Articles | Editor's Choice |
Visit Asia Webinars to learn about the latest in technology and get practical design tips.