From AWS Graviton instances to M1 MacBooks, ARM processors are everywhere these days. This blog post describes the experiences, challenges, and solutions of the Data Science Engineering team at NextRoll as we have come to embrace the ARM ecosystem.
Docker / Containerization
For many years, the Data Science Engineering team has used Docker to create a consistent running environment for our code: Once code is working on one developer’s laptop, it should work on their teammates’ laptops, in our CI test pipelines, and when deployed to our cloud infrastructure.
Everything worked great until recently, when a developer on our team upgraded their work laptop to an M1 MacBook (which uses an ARM CPU architecture). Suddenly many of our images would not build or run on their laptop, preventing them from developing locally.
While the developer could work around this, e.g., by developing remotely on an X86 machine, more team members would be upgrading their laptops over time. So as a team, we decided to make our Docker images compatible with ARM.
The first approach we tried was simple emulation: Docker provides an emulation layer via QEMU, which should allow a developer on an ARM laptop to build and run a Docker image targeted at X86.
There were some images where this worked (albeit with a performance hit), but other images failed to run entirely. Of particular note, in our Fledge Lab project, headful Chrome would immediately segfault when run in an X86 Docker container on an ARM host machine.
Solution: Multi-Arch images
For cases where emulation failed, we went the route of multi-arch images: we would build both an X86 and an ARM version of each image, which introduced several challenges.
Differing Build Steps
Some packages or tools we installed in the image required different build steps in our Dockerfile. We found the most elegant way to address this was via RUN heredocs: we use one RUN heredoc per package/tool we install and switch on the $TARGETPLATFORM variable provided by Docker.
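As a sketch of this pattern (the tool name and download URL here are hypothetical, not from our actual images, and we assume curl is available in the base image), such a Dockerfile step might look like:

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04
ARG TARGETPLATFORM

# One RUN heredoc per tool, switching on the platform Docker is building for.
RUN <<EOF
set -eu
case "$TARGETPLATFORM" in
  "linux/amd64") ARCH=x86_64 ;;
  "linux/arm64") ARCH=aarch64 ;;
  *) echo "unsupported platform: $TARGETPLATFORM" >&2; exit 1 ;;
esac
curl -fsSL "https://example.com/some-tool-${ARCH}.tar.gz" | tar -xz -C /usr/local/bin
EOF
```

Both variants of such an image can then be built in one go with docker buildx build --platform linux/amd64,linux/arm64 (heredocs require BuildKit, hence the syntax directive).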
Code Compatibility Issues
We ran into several issues getting code optimized for X86 to compile and run on ARM.
We had some places in our code where we used X86-specific instructions, including X86 inline assembly and Intel intrinsics. We replaced these with a mix of LLVM IR code, which is architecture agnostic, and code sections where we switch on the target platform.
We also ran into a compatibility issue using epoll: on X86 machines, the epoll_event struct is expected to be packed, whereas on ARM machines it is not. This led to odd runtime behavior, in which threads would try to read data from the wrong client connection, and it took a while to debug. Once we diagnosed the issue, the fix was simple: we just needed to switch the struct alignment on the machine architecture.
Memory Model Issues
One of the most subtle issues we ran into was the difference between the X86 and ARM memory models. There is an excellent introduction to memory models and code reordering here, but the rough idea is as follows:
Code that is written in a high-level language (Dlang in our case) can be reordered in many places:
- At compile time: The compiler outputs assembly in a different order than the original code.
- At runtime: The CPU executes the assembly/machine instructions out of order.
The CPU’s memory model determines what sort of reorderings are allowed at runtime. X86 has a strong memory model that prevents many reorderings of memory access instructions. ARM has a weak memory model, which gives the CPU much more liberty to rearrange memory access instructions.
In order to achieve optimal performance for our ad pricing engine, we relax the memory ordering constraints for atomic operations in our multi-threaded code in several places (e.g., in collecting statistics for the 100k ads we price per second). When targeting ARM, we needed to audit our code to ensure we were not implicitly relying on the stronger memory guarantees of X86, lest we introduce a subtle race condition. (As an interesting aside, Apple sidestepped this issue in Rosetta 2 by having it use X86 memory ordering.)
A Concrete Example
The following example shows how the different memory models of X86 and ARM can lead to different behavior at runtime.
Here a worker thread acquires two mutexes in a nested manner, while the main thread checks if mutex1 is free while mutex2 is held, violating the nesting. When I compile and run this on a pre-M1 MacBook (X86), it runs indefinitely, whereas when I compile and run it on an M1 MacBook (ARM), it hits the assertion almost immediately.
What’s going on? The issue lies in the lines
which compile to
Even though these instructions are still in order at compile time, the ARM memory model allows these stores to be executed out of order, whereas the X86 memory model prevents the reordering of these stores. To fix this, we can instead use
which compiles to
on ARMv8 and X86, respectively. In ARM, the strb instructions are now stlrb instructions. These cannot be reordered (they each provide a one-way fence from above), so the intended nesting is guaranteed.
Even for X86, I believe the latter code is more correct. Even though the mov instructions are replaced with heavier xchg instructions, we are expressing our intention that these lines not be rearranged in the Dlang code itself and leaving it to the compiler to ensure our intentions are met. In particular, in the former code, I think the compiler could have rearranged the order of these two lines, even before considering the runtime ordering guarantees of X86.
While it took some work (and a fair bit of debugging), we successfully made all of our Data Science Engineering images compatible with both X86 and ARM. And along the way we got to learn some of the subtle differences between the two CPU architectures! If optimizing machine learning systems at a low level catches your interest, check out our careers page.