Embracing ARM

Written by Chris Evans, May 16, 2022

Introduction

From AWS Graviton instances to M1 MacBooks, ARM processors are everywhere these days. This blog post illustrates the experiences, challenges, and solutions encountered in the Data Science Engineering team at NextRoll as we have come to embrace the ARM ecosystem.

Docker / Containerization

For many years, the Data Science Engineering team has used Docker to create a consistent running environment for our code: once code is working on one developer’s laptop, it should work on their teammates’ laptops, in our CI test pipelines, and when deployed to our cloud infrastructure.

Everything worked well until recently, when a developer on our team upgraded their work laptop to an M1 MacBook (which uses an ARM CPU architecture). Suddenly, many of our images would not build or run on their laptop, preventing them from developing locally.

The developer could have worked around this, e.g., by developing remotely on an X86 machine, but more team members would be upgrading their laptops over time. So as a team, we decided to make our Docker images compatible with ARM.

First Approach

The first approach we tried was simple emulation: Docker provides an emulation layer via QEMU, which should allow a developer on an ARM laptop to build and run a Docker image targeted for X86.

For some images this worked (albeit with a performance hit), but other images failed to run entirely. Of particular note, in our Fledge Lab project, headful Chrome would immediately segfault when run in an X86 Docker container on an ARM host machine.

Solution: Multi-Arch images

For cases where emulation failed, we went the route of multi-arch images: we build both an X86 and an ARM version of each image. This led to several challenges.

Differing Build Steps

Some packages or tools we installed in the image required different build steps in our Dockerfile. We found the most elegant way to address this was via RUN heredocs: We use one RUN heredoc per package/tool we install and switch on the $TARGETPLATFORM variable provided by Docker.

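# Install a given tool/package
# TARGETPLATFORM is only visible to RUN once it is declared in the build stage
ARG TARGETPLATFORM
RUN <<EOF
  set -e
  if [ "$TARGETPLATFORM" = "linux/arm64" ]
  then
    # Build steps for ARM go here
    echo "Building for ARM"
  else
    # Build steps for X86 go here
    echo "Building for X86"
  fi
EOF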

Code Compatibility Issues

We ran into several issues getting code optimized for X86 to compile and run on ARM.

We had some places in our code where we used X86-specific instructions, including X86 inline assembly and Intel intrinsics. We replaced these with a mix of LLVM IR code, which is architecture agnostic, and code sections where we switch on the target platform.
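As a rough illustration of switching on the target platform in Dlang, the sketch below uses a version block to pick between an architecture-specific path and a portable fallback. The helper name countBits is hypothetical, and the X86 branch simply calls core.bitop.popcnt as a stand-in for whatever intrinsics or inline assembly a real code base might keep there; it is not our production code.

```d
import core.bitop : popcnt;

/// Hypothetical helper: returns the number of set bits in `x`.
uint countBits(ulong x)
{
    version (X86_64)
    {
        // Stand-in for an X86-specific path (intrinsics or inline assembly).
        return cast(uint) popcnt(x);
    }
    else
    {
        // Portable fallback, compiled on ARM and every other target.
        uint n = 0;
        for (; x != 0; x &= x - 1) // each step clears the lowest set bit
            ++n;
        return n;
    }
}

void main()
{
    // Both branches must agree on the result.
    assert(countBits(0) == 0);
    assert(countBits(0b1011) == 3);
    assert(countBits(ulong.max) == 64);
}
```

Only one branch is compiled for a given target, so the X86-specific code never reaches the ARM compiler.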

We also ran into a compatibility issue using epoll: on X86-64 machines, the epoll_event struct is expected to be packed, whereas on ARM machines it is not. This led to odd runtime behavior in which threads would try to read data from the wrong client connection, and it took a while to debug. Once we diagnosed the issue, the fix was simple: we just needed to switch the struct alignment on the machine architecture.

import core.sys.linux.epoll; // EPOLLIN, EPOLLONESHOT, EPOLLRDHUP

struct EpollEvent
{
    uint events = EPOLLIN | EPOLLONESHOT | EPOLLRDHUP;
    version (X86_64)
        align(4) ulong ID; // User data, packed
    else
        ulong ID; // User data, not packed
}
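One way to catch this class of bug at compile time rather than at runtime is to assert that the hand-written struct has the same layout as the system binding. The sketch below assumes a Linux target and druntime's core.sys.linux.epoll module, whose epoll_event struct has the fields events and data; the static asserts are an illustration, not something from our code base.

```d
import core.sys.linux.epoll; // Linux only: epoll_event, EPOLLIN, ...

struct EpollEvent
{
    uint events = EPOLLIN | EPOLLONESHOT | EPOLLRDHUP;
    version (X86_64)
        align(4) ulong ID; // packed: sits directly after `events`
    else
        ulong ID;          // naturally aligned
}

// The custom struct must match the layout the kernel expects.
static assert(EpollEvent.sizeof == epoll_event.sizeof);
static assert(EpollEvent.ID.offsetof == epoll_event.data.offsetof);

// Spell out the two layouts discussed in the text.
version (X86_64)
    static assert(EpollEvent.sizeof == 12 && EpollEvent.ID.offsetof == 4);
else version (AArch64)
    static assert(EpollEvent.sizeof == 16 && EpollEvent.ID.offsetof == 8);

void main()
{
    // Nothing to do at runtime: a successful compile is the check.
}
```

If the alignment is wrong for the target, compilation fails instead of threads silently reading the wrong user data.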

Memory Model Issues

One of the most subtle issues we ran into was the difference between the X86 and ARM memory models. There is an excellent introduction to memory models and code reordering here, but the rough idea is as follows:

Code that is written in a high-level language (Dlang in our case) can be reordered at two stages:

At compile time: The compiler outputs assembly in a different order than the original code.
At runtime: The CPU executes the assembly/machine instructions out of order.

The CPU’s memory model determines what sort of reorderings are allowed at runtime. X86 has a strong memory model that prevents many reorderings of memory access instructions. ARM has a weak memory model, which gives the CPU much more liberty to rearrange memory access instructions.
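To make the difference concrete, here is a minimal sketch of the classic "publish data, then set a flag" pattern, written so that it does not depend on the CPU's memory model. The names payload, ready, producer, and consume are hypothetical; the release store and acquire load from core.atomic are what forbid the problematic reorderings on both X86 and ARM.

```d
import core.atomic;
import core.thread;

__gshared int payload; // plain data, published through `ready`
shared bool ready;     // flag that guards `payload`

void producer()
{
    payload = 42; // ordinary store
    // Release: the store to `payload` cannot be reordered after this store.
    atomicStore!(MemoryOrder.rel)(ready, true);
}

int consume()
{
    atomicStore(ready, false);
    auto t = new Thread(&producer);
    t.start();
    // Acquire: reads after this load cannot be reordered before it.
    while (!atomicLoad!(MemoryOrder.acq)(ready)) {}
    const int seen = payload;
    t.join();
    return seen;
}

void main()
{
    assert(consume() == 42);
}
```

With plain (non-atomic) accesses to the flag, this pattern would have no ordering guarantee at the language level, and on ARM the hardware itself may reorder the two stores.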

To achieve optimal performance for our ad pricing engine, we relax the memory ordering constraints for atomic operations in several places in our multi-threaded code (e.g., when collecting statistics for the 100k ads we price per second). When targeting ARM, we needed to audit our code to ensure we were not implicitly relying on the stronger memory guarantees of X86, lest we introduce a subtle race condition. (As an interesting aside, Apple sidestepped this issue for Rosetta 2 by giving its chips a hardware mode that enforces X86-style memory ordering while translated code runs.)
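As an illustration of a case where relaxed ordering is safe on both architectures, consider a pure statistics counter: only the final total matters, not how the increments are ordered relative to other memory accesses. The sketch below is not our pricing engine; the names adsPriced and countInParallel are hypothetical, and it assumes a druntime recent enough to provide core.atomic.atomicFetchAdd.

```d
import core.atomic;
import core.thread;

shared ulong adsPriced; // hypothetical statistics counter

/// Increments the counter from several threads and returns the total.
ulong countInParallel(size_t numThreads, size_t perThread)
{
    atomicStore(adsPriced, 0UL);
    Thread[] workers;
    foreach (t; 0 .. numThreads)
    {
        workers ~= new Thread({
            foreach (i; 0 .. perThread)
            {
                // Relaxed (raw) ordering: atomic, but imposes no ordering
                // on surrounding loads and stores.
                atomicFetchAdd!(MemoryOrder.raw)(adsPriced, 1);
            }
        });
    }
    foreach (w; workers) w.start();
    foreach (w; workers) w.join(); // join makes all increments visible
    return atomicLoad(adsPriced);
}

void main()
{
    assert(countInParallel(4, 100_000) == 400_000);
}
```

The audit is about the opposite case: code where a relaxed operation is quietly being used to order other memory accesses, which happens to work on X86 but not on ARM.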

A Concrete Example

The following example shows how the different memory models of X86 and ARM can lead to different behavior at runtime.

import core.atomic;
import core.thread;

align(128) __gshared bool mutex1;
align(128) __gshared bool mutex2;

void do_work()
{
    while(true)
    {
        while(atomicExchange(&mutex1, true)){} // block for mutex1
        while(atomicExchange(&mutex2, true)){} // block for mutex2
        mutex2 = false; // release mutex2
        mutex1 = false; // release mutex1
    }
}

void main(string[] args)
{
    auto thread = new Thread(&do_work);
    thread.start();

    while(true)
    {
        while(atomicExchange(&mutex1, true)){} // block for mutex1
        if(atomicLoad(mutex2) == true)
            assert(false);
        atomicStore(mutex1, false); // release mutex1
    }
}

Here a worker thread acquires two mutexes in a nested manner, while the main thread acquires mutex1 and then checks whether mutex2 is still held, which would violate the nesting. When I compile and run this on a pre-M1 MacBook (X86), it runs indefinitely, whereas when I compile and run it on an M1 MacBook (ARM), it hits the assertion almost immediately.

What’s going on? The issue lies in the lines

mutex2 = false; // release mutex2
mutex1 = false; // release mutex1

which compile to

strb    wzr, [x10]
strb    wzr, [x8]

and

mov     byte ptr [rcx], 0
mov     byte ptr [rax], 0

on ARMv8 and X86 respectively.

Even though these instructions are still in order in the compiled output, the ARM memory model allows these stores to be executed out of order, whereas the X86 memory model prevents the reordering of these stores. To fix this, we can instead use

atomicStore(mutex2, false); // release mutex2
atomicStore(mutex1, false); // release mutex1

which compiles to

stlrb   wzr, [x10]
stlrb   wzr, [x8]

and

xor     edx, edx
xchg    byte ptr [rcx], dl
xor     edx, edx
xchg    byte ptr [rax], dl

on ARMv8 and X86 respectively. On ARM, the strb instructions are now stlrb instructions. Each of these is a store-release: a one-way fence that prevents earlier memory accesses from being moved below it. The two stores therefore cannot be reordered, so the intended nesting is guaranteed.

Even for X86, I believe the latter code is more correct. Although the mov instructions are replaced with heavier xchg instructions, we express in the Dlang code itself our intention that these lines not be reordered, and leave it to the compiler to ensure that intention is met. In particular, in the former code, I think the compiler could have reordered these two lines itself, before the runtime ordering guarantees of X86 ever came into play.
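One way to keep this intention in a single place is to wrap the exchange and the releasing store in a pair of small helpers. The following is a minimal sketch under that assumption, not the code we run in production: the names lockHeld, acquire, release, and sumUnderLock are hypothetical, and the release store uses MemoryOrder.rel, which is weaker than the default sequentially consistent store shown above but is enough for an unlock.

```d
import core.atomic;
import core.thread;

shared bool lockHeld;  // hypothetical spinlock flag
__gshared ulong total; // plain data protected by the lock

void acquire()
{
    // Sequentially consistent exchange: spin until the old value was false.
    while (atomicExchange(&lockHeld, true)) {}
}

void release()
{
    // Release store: writes made while holding the lock cannot move below it.
    atomicStore!(MemoryOrder.rel)(lockHeld, false);
}

/// Increments `total` under the lock from several threads.
ulong sumUnderLock(size_t numThreads, size_t perThread)
{
    total = 0;
    Thread[] workers;
    foreach (t; 0 .. numThreads)
    {
        workers ~= new Thread({
            foreach (i; 0 .. perThread)
            {
                acquire();
                ++total; // non-atomic update, safe only inside the lock
                release();
            }
        });
    }
    foreach (w; workers) w.start();
    foreach (w; workers) w.join();
    return total;
}

void main()
{
    assert(sumUnderLock(4, 10_000) == 40_000);
}
```

Replacing the atomicStore in release with a plain assignment would reintroduce the problem from the example above: the code would likely still appear to work on X86 while being free to lose updates on ARM.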

Conclusion

While it took some work (and a fair bit of debugging), we successfully made all of our Data Science Engineering images compatible with both X86 and ARM. And along the way we got to learn some of the subtle differences between the two CPU architectures! If optimizing machine learning systems at a low level catches your interest, check out our careers page.