EE Times-Asia > Memory/Storage

Memory access ordering in complex embedded designs

Posted: 22 Dec 2014

Keywords: embedded systems, processor, Sequential Execution Model (SEM), compilers

There are many opportunities in such a system for the SEM to break down. There are multiple software threads executing on multiple processors, there are caches and buffers and autonomous memory access units such as DMA controllers.

Analyzing everything going on in such a system is beyond the scope of this article but we will look at the most common behaviors and what you need to do about them in software.

There are two distinct effects which we need to consider:
- Compiler behavior
- System behavior

We need to write our software so that the compiler produces what we expect. But we also need to know how the system behaves so we can ensure that output code actually has the desired effect.

Compiler behavior
Compilers are bound by a strict and specific Sequential Execution Model of their own that applies at the level of high-level language statements and is propagated through to output machine instructions, executing on a tightly-defined virtual model of an idealized target machine. But this model breaks down when the Sequential Execution Model doesn't apply to the real machine on which the program is executed. There are lots of reasons why this might be the case.

The use of the "volatile" keyword is nothing more than an indication to the compiler that a particular value, held in memory, may change when it's not looking. The classic example would be a memory-mapped peripheral register representing a FIFO buffer. Every time you read it, you get the next input data item.

volatile int *fifo;
int input;

while (1)
    input = *fifo;

Without the volatile keyword in the declaration of the FIFO item, the compiler would be free to cache the value after reading it once and simply reuse the value without ever reading memory again.

The volatile declaration tells the compiler that it can't cache the value and so has to physically read memory every time. (We are assuming here that there isn't a cache in between the processor and this particular memory location.)

But we also need to apply "volatile" to any global variable which may be changed by an interrupt handler. Likewise, such variables may change when the compiler isn't looking.

    status = 1;
    while (status == 1)
    {
        // do stuff ...
    }

And the interrupt handler simply clears the flag:

    {
        status = 0;
    }

If status is not declared as volatile, this program won't necessarily work, as the compiler is free to assume that it never has to read it in the main loop after setting it in the first line.
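Putting the two fragments together, a complete sketch of the pattern looks like this. The function and handler names are ours, and how the handler is registered with the interrupt controller is target-specific.

```c
/* Shared flag written by an interrupt handler. volatile forces the main
   loop to re-read it from memory on every iteration rather than caching
   it in a register. */
static volatile int status;

/* Interrupt service routine (registration is target-specific). */
void on_interrupt(void)
{
    status = 0;
}

void wait_for_interrupt(void)
{
    status = 1;
    while (status == 1)
    {
        /* do stuff ... */
    }
}
```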

System behavior
We can divide this into the behavior of the processor itself (pipeline and execution units), the memory system (write buffers, caches, external memory systems), and, finally, multiprocessing systems.

Modern processors employ increasingly complex pipelines to maximize instruction throughput. The latest ARM cores, for instance, incorporate superscalar, out-of-order pipelines with multiple execution units. This means that, regardless of what the compiler produces, the processor itself can execute instructions in a different order to that in which they appear in the program.

The processor has to be able to satisfy itself that there are no data, address, or resource dependencies between the instructions but, if it can do that, it's pretty much free to do what it wants. The key restriction is that the Sequential Execution Model has to apply but that the processor can only ensure it applies to what it can see, which is itself and some closely-coupled components. It knows nothing about the rest, so cannot ensure that the model applies when extended outwards into the system. Consider the following sequence of instructions:

add r0, r0, #4
mul r2, r2, r3
str r2, [r0]
ldr r4, [r1]
sub r1, r4, r2
bx lr

If we execute this on a simple in-order processor, we might see something like this:

Cycle Instruction

0 add r0, r0, #4
1 mul r2, r2, r3
2 *stall*
3 str r2, [r0]
4 ldr r4, [r1]
5 *stall*
6 sub r1, r4, r2
7 bx lr

On an out-of-order processor, we might see this:

Cycle Instruction

0 mul r2, r2, r3
1 ldr r4, [r1]
2 str r2, [r0]
3 sub r1, r4, r2
4 bx lr

The processor has re-ordered the execution sequence to allow the LDR to start execution prior to the STR. This gives it more time to complete and reduces the overall latency of the sequence. It can do this provided it is satisfied that there is no dependency between the two instructions. Clearly, in this case, that's true.

Or is it?
