ARM64 vs ARM32: A guide for Linux programmers

Posted: 06 Nov 2015

Keywords: ARM, Linux programmers, just-in-time, RISC, cryptographic

We discovered that we were having a lot more problems due to load/store exclusive instructions in ARM64 than we had previously seen with ARM32. This increase was surprising, as the load/store exclusive instructions didn't seem to have changed in a way that was relevant to us. In fact, I still do not think the load/store exclusive instructions have changed massively, but they have massively increased in use.

Take the random() glibc function as an example. It is called by rand(), which we all know well, and which is used in one of our favorite demo programs.

On AArch64 we see, a few instructions into random(), the following assembly code:

0x0000007fb7df3dd4: mov w1, #0x1 // #1
0x0000007fb7df3dd8: ldaxr w2, [x0]
0x0000007fb7df3ddc: cmp w2, wzr
0x0000007fb7df3de0: b.ne 0x7fb7df3dec
0x0000007fb7df3de4: stxr w3, w1, [x0]
0x0000007fb7df3de8: cbnz w3, 0x7fb7df3dd8
0x0000007fb7df3dec: b.ne 0x7fb7df3e34

The load-acquire exclusive (ldaxr) and store exclusive (stxr) instructions indicate that this code is probably trying to acquire a lock. A quick look at the source code shows that acquiring a lock is exactly what is happening. But the same code compiled for a Cortex-A9 (AArch32) did not do any locking. My guess is that this locking occurs purely to better support multiprocessor systems.

A short explanation of the assembly code is warranted here. ARMv8, as originally specified, does not have a single-instruction atomic read-modify-write. Instead, as we see in the example code, ldxr (load exclusive) and stxr (store exclusive) pairs are used. A load exclusive acquires a sort of lock called an "exclusive access mark." Store exclusive instructions check for this mark: if a different load exclusive has been executed in the meantime, the store exclusive will "fail", meaning it will not update memory. In our JIT we saw store exclusive instructions that would never succeed, and such failure was a common problem because of the increased use of these paired instructions in AArch64.
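To make the retry pattern in the assembly above concrete, here is a minimal C11 sketch of the same kind of lock acquisition. It is not glibc's implementation, and try_lock_word() and unlock_word() are hypothetical names chosen for illustration. On AArch64 without single-instruction atomics, a compiler will typically lower the weak compare-exchange to a load exclusive/store exclusive pair, and the weak form is allowed to fail spuriously, which is why the loop distinguishes "lock is held" from "try again".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: try to change *lock from 0 (free) to 1 (held).
 * Returns true if this caller took the lock, false if it was already held. */
static bool try_lock_word(atomic_int *lock)
{
    int expected = 0;

    /* The weak compare-exchange may fail spuriously, much as a store
     * exclusive can, so keep retrying while the lock still looks free. */
    while (!atomic_compare_exchange_weak_explicit(
               lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        if (expected != 0)
            return false;   /* genuinely held by someone else */
        /* expected is still 0: spurious failure, go round again */
    }
    return true;
}

/* Hypothetical helper: release a lock taken with try_lock_word(). */
static void unlock_word(atomic_int *lock)
{
    assert(atomic_load_explicit(lock, memory_order_relaxed) == 1);
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

A caller would declare `atomic_int lock = 0;`, call try_lock_word(&lock), and fall back to a slow path (as the final branch in the disassembly does) when it returns false.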

The answer to "why" is in the ARM Architecture Reference Manual for ARMv8, which lists things that can cause a store exclusive to fail. This includes, but is not limited to, normal (non-exclusive) loads and stores between the LDXR and STXR.

Our JIT was performing non-exclusive loads and stores between the LDXR and STXR, and thus was failing. Our attempts to debug the problem were making everything worse. The debug code was causing even more to happen in between the load exclusive and store exclusive.

The takeaway from this is to know that you can never predict whether a store exclusive will succeed or not, because pretty much anything can cause it to fail, including thread switches, context switches, etc. Also, do not trust GDB or other debuggers to tell you the truth. GDB in particular appears to do something sneaky to cause sequences to succeed if you single step through them, when really single stepping is so invasive it will cause the STXR to fail.

There's no single or easy solution: it depends on what caused your problem. Code generators should avoid outputting some instructions between the load and store exclusive; the real example above can be awkward, as the instructions are not all in the same basic block. In our JIT, we try to execute load and store exclusive instructions together where they occur as a pair.

For more information on exclusive accesses on ARM, the ARM Architecture Reference Manual is the place to look. I recommend the list of conditions in "Load-Exclusive and Store-Exclusive instruction usage restrictions". There are also some hardcore sections on the types of monitors, and a very useful section explaining why there is an 'a' in the ldaxr (for "acquire") but no 'l' (for "reLease") in the stxr.

Aside: Older ARM cores do have an atomic read-and-write instruction: SWP. However, this instruction has been removed from ARMv8 32-bit, and does not exist on ARM 64-bit. I believe the instruction is absent because SWP would not work well in multiprocessor systems, whereas load/store exclusive instructions assist multiprocessing by exporting information outside of the processor using the memory interface.

Additional pages in memory
The addition of [vvar] and [vdso] with AArch64 is in some ways purely the result of moving to a newer kernel, and it is the most tangential topic that I'll cover. When I cat /proc/self/maps the response shows that the x86 laptop I am writing this on also has [vvar] and [vdso]. This excellent article does a great job of explaining what these pages are all about, which I won't repeat here.
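For readers who would rather check from a program than with cat, the following is a small sketch that scans a maps file for the two special entries. The helper name print_special_maps() is hypothetical, and the code assumes each line of the maps file fits in the 512-byte buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print the [vdso] and [vvar] lines of a maps file
 * (normally "/proc/self/maps") to `out`. Returns how many such lines were
 * found, or -1 if the file could not be opened. */
static int print_special_maps(const char *maps_path, FILE *out)
{
    assert(maps_path != NULL && out != NULL);

    FILE *maps = fopen(maps_path, "r");
    if (maps == NULL)
        return -1;

    char line[512];   /* assumption: maps lines are shorter than this */
    int found = 0;
    while (fgets(line, sizeof line, maps) != NULL) {
        if (strstr(line, "[vdso]") != NULL || strstr(line, "[vvar]") != NULL) {
            fputs(line, out);
            found++;
        }
    }
    fclose(maps);
    return found;
}
```

Calling print_special_maps("/proc/self/maps", stdout) on a Linux system with a reasonably recent kernel should list both mappings; the exact addresses will differ from run to run.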

The problem for us is that [vvar] represents a source of non-determinism, so our record engine must save any data that is read from [vvar]. This need to respond to non-determinism is a well understood problem for us, and the solutions vary depending on the type of map we're using. The simplest (but worst performing) solution is to:

- mprotect() the map to PROT_NONE
- When a fault occurs:
    1. Restore the original protection with mprotect()
    2. Re-execute the access
    3. Save the data read
    4. mprotect() back to PROT_NONE
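The steps above can be sketched in C roughly as follows. This is a simplified illustration under stated assumptions, not our record engine: watch_page(), recorded_read() and the globals are hypothetical names, the watched region is an ordinary page supplied by the caller rather than [vvar], and the "save" and "re-protect" steps run in normal code after the access instead of inside the fault handling. It also relies on calling mprotect() from a signal handler, which works on Linux but is not guaranteed by POSIX.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for the single region being watched. */
static unsigned char *watched;
static size_t watched_len;
static int watched_prot;                      /* original protection */
static volatile sig_atomic_t fault_count;

/* On a fault inside the watched region, restore the original protection.
 * Returning from the handler re-executes the faulting access. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)watched;
    if (addr < base || addr >= base + watched_len)
        _exit(1);                             /* not ours: a real crash */
    fault_count++;
    if (mprotect(watched, watched_len, watched_prot) != 0)
        _exit(1);
}

/* Install the handler and protect the region with PROT_NONE.
 * `page` must be page-aligned and `len` a multiple of the page size. */
static void watch_page(unsigned char *page, size_t len, int prot)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGSEGV, &sa, NULL);
    assert(rc == 0);

    watched = page;
    watched_len = len;
    watched_prot = prot;
    rc = mprotect(page, len, PROT_NONE);
    assert(rc == 0);
    (void)rc;
}

/* Read one byte from the watched region, append it to `log`, and
 * re-protect the region so the next read faults again. */
static unsigned char recorded_read(volatile unsigned char *p,
                                   unsigned char *log, size_t *n)
{
    unsigned char v = *p;        /* faults once, then is re-executed */
    log[(*n)++] = v;             /* save the data read */
    int rc = mprotect(watched, watched_len, PROT_NONE);
    assert(rc == 0);
    (void)rc;
    return v;
}
```

Every call to recorded_read() costs a fault and two mprotect() calls, which illustrates why this is the simplest but worst performing approach.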

On AArch64, one of our tests was consistently failing with EACCES when we tried to re-apply PROT_READ to [vvar] (the restore step above).

We isolated the cause of the problem to a nested local function in the test. I had no idea this nesting was allowed in C until this point, but nesting turns out to be a GNU extension. Importantly, you can pass the address of the nested function outside of the scope in which it is defined. Calls to the nested function then go via a "trampoline", which is explained here.

Here is a simple example program:
