Synchronization Scalability Still Depends On Hardware First, Lock Algorithm Second

A review of synchronization on modern hardware

Pick a lock algorithm and benchmark it on your 8-core laptop. Now deploy it to a 96-core EPYC, and the results may no longer make sense. Lock performance does not carry over from one machine to another, because scalability depends more on hardware topology and design than on the choice of algorithm.

That was the central finding of David, Guerraoui, and Trigonakis at SOSP 2013: they ran nine lock algorithms across four architectures and found no consistent winner. [ACM] [PDF] The finding is still very relevant, even though the hardware used in the study is now dated.
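To make "lock algorithm" concrete, here is a minimal sketch of two of the simplest designs that comparisons of this kind tend to include: a test-and-test-and-set spinlock and a ticket lock. It is illustrative C++ written for this post, not code from the paper, and the names TtasLock, TicketLock, and contended_count are made up here.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Test-and-test-and-set: wait on a read-only load and attempt the
// atomic exchange only when the lock looks free.
class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // yield() keeps this demo well behaved on small machines; a
            // production lock would more likely use a pause hint and backoff.
            while (locked_.load(std::memory_order_relaxed)) std::this_thread::yield();
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Ticket lock: each thread takes a ticket and waits for its turn (FIFO order).
class TicketLock {
    std::atomic<std::uint32_t> next_{0};
    std::atomic<std::uint32_t> serving_{0};
public:
    void lock() {
        const std::uint32_t my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my) std::this_thread::yield();
    }
    void unlock() {
        // Only the holder writes serving_, so load-then-store is safe here.
        serving_.store(serving_.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);
    }
};

// Hypothetical helper: `threads` workers each increment a plain counter
// `iters` times under the lock; returns the final counter value.
template <class Lock>
long contended_count(int threads, int iters) {
    Lock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) { lock.lock(); ++counter; lock.unlock(); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Both locks are correct on any machine; how fast they are under contention is the part that depends on where the waiting threads sit relative to the cache line they are fighting over.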

Synchronization in 2025: The Hardware Got Weirder

The die is no longer a socket

In 2013, “NUMA” meant crossing a socket boundary, an expensive but well-understood hop. Today, AMD’s EPYC Turin packages up to 192 cores across twelve Core Complex Dies (CCDs) connected via Infinity Fabric to a central I/O die. Cross-CCD latency within a single socket is now a real design consideration, sitting between intra-CCD and cross-socket cost. You have NUMA inside what the OS thinks is one processor. The original paper did not model a topology like this. Intel’s shift from a ring bus to a mesh interconnect, together with UPI, further changed how coherence is resolved: on modern Intel systems, remote misses are resolved by the home node of the cache line rather than broadcast to all caches.
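One way to see this structure from software is to ask Linux which CPUs share a last-level cache. The sketch below parses the kernel's cpulist format and reads the sysfs cache topology. It assumes a Linux system and assumes that cache index3 is the L3 entry, which is common but should be checked against the neighbouring level file. On recent EPYC parts the L3 domain usually lines up with a CCD. The function names parse_cpu_list and l3_siblings are made up for this example.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Parse a Linux cpulist string such as "0-7,96-103" into individual CPU ids.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string part;
    while (std::getline(ss, part, ',')) {
        // Trim surrounding whitespace, including the trailing newline sysfs adds.
        const auto first = part.find_first_not_of(" \t\n");
        if (first == std::string::npos) continue;
        part = part.substr(first, part.find_last_not_of(" \t\n") - first + 1);
        const auto dash = part.find('-');
        const int lo = std::stoi(part.substr(0, dash));
        const int hi = dash == std::string::npos ? lo : std::stoi(part.substr(dash + 1));
        for (int c = lo; c <= hi; ++c) cpus.push_back(c);
    }
    return cpus;
}

// CPUs that share an L3 with `cpu`, or an empty vector if sysfs is unavailable.
// Assumption: index3 is the L3 cache on this machine.
std::vector<int> l3_siblings(int cpu) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cache/index3/shared_cpu_list");
    std::string line;
    if (!f || !std::getline(f, line)) return {};
    return parse_cpu_list(line);
}
```

If l3_siblings(0) returns far fewer CPUs than the socket has cores, a lock handed between two arbitrary cores will often cross an L3 boundary even though it never leaves the socket.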

Memory now has tiers

The 2013 paper assumed two kinds of memory: local DRAM and remote DRAM. CXL (Compute Express Link) adds a third tier: coherent-but-slower memory, attached over a PCIe 5.0/6.0 physical link. CXL 2.0 supports memory pooling; CXL 3.0 adds peer-to-peer access and multi-level switching. Lock implementations that assume uniform DRAM latency are therefore working from a wrong assumption on a growing fraction of production hardware.
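On Linux, CXL-attached memory typically shows up as a NUMA node that has memory but no CPUs, so a rough tier map can be built from two sysfs files per node: cpulist and distance. The sketch below only does the parsing and classification. Treating every CPU-less node as a slower tier is an assumption of this example (such nodes can also be other kinds of memory), and the names Tier, parse_distances, and classify are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

enum class Tier { Local, RemoteDram, MemoryOnly };

// Parse one row of /sys/devices/system/node/nodeN/distance, e.g. "10 21 28".
// Entry i is the relative distance from node N to node i.
std::vector<int> parse_distances(const std::string& row) {
    std::vector<int> d;
    std::istringstream in(row);
    for (int x; in >> x;) d.push_back(x);
    return d;
}

// Classify `target` as seen from `from`. `target_cpulist` is the content of
// /sys/devices/system/node/node<target>/cpulist; blank means no CPUs.
// Assumption: a CPU-less node is treated as a slower, memory-only tier.
Tier classify(int from, int target, const std::string& target_cpulist) {
    if (from == target) return Tier::Local;
    const bool has_cpus =
        target_cpulist.find_first_not_of(" \t\n") != std::string::npos;
    return has_cpus ? Tier::RemoteDram : Tier::MemoryOnly;
}
```

A lock or its protected data that lands on a MemoryOnly node pays the slower tier's latency on every miss, which is worth knowing before tuning spin or backoff constants.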

ARM broke the x86 monoculture

The paper covered x86 (AMD Opteron, Intel Xeon), SPARC (Sun Niagara 2), and Tilera, but no ARM servers. AWS Graviton, Ampere Altra, and the Neoverse family now run significant datacenter workloads. ARM’s weaker memory model (compared to x86’s TSO) means ordering has to be requested explicitly, through dmb/dsb barriers or acquire/release instructions, and the cost profile of atomics differs in ways that make the 2013 benchmark numbers non-transferable.
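The classic place where the two memory models diverge is message passing: one thread writes data and then sets a flag, another waits for the flag and then reads the data. A minimal portable sketch in C++ follows; Mailbox, publish, consume, and message_passing_demo are names invented for this example.

```cpp
#include <atomic>
#include <thread>

struct Mailbox {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};   // flag that publishes the payload
};

void publish(Mailbox& m, int value) {
    m.payload = value;
    // Release: the payload write cannot be reordered after this store.
    m.ready.store(true, std::memory_order_release);
}

int consume(Mailbox& m) {
    // Acquire: once the flag is seen as true, the payload write is visible too.
    while (!m.ready.load(std::memory_order_acquire)) std::this_thread::yield();
    return m.payload;
}

// Runs one writer and one reader and returns what the reader observed.
int message_passing_demo() {
    Mailbox m;
    int seen = 0;
    std::thread reader([&] { seen = consume(m); });
    std::thread writer([&] { publish(m, 42); });
    writer.join();
    reader.join();
    return seen;
}
```

On x86-64 the release store and acquire load typically compile to ordinary moves, because TSO already provides that ordering. On AArch64, compilers typically emit dedicated store-release and load-acquire instructions (stlr and ldar). Code that used relaxed or non-atomic accesses here can appear to work on x86 and then fail on ARM.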

TSX is dead; NUMA-aware locks are alive

Hardware Transactional Memory via Intel TSX, which the paper flagged as promising, was effectively killed by the TAA (TSX Asynchronous Abort) vulnerability in 2019. In its place, a generation of NUMA-aware software locks has matured: Lock Cohorting, Compact NUMA-Aware Locks (CNA) (EuroSys 2019), and Scalable locks with shuffling (SOSP 2019) all exploit locality-first admission to reduce cross-socket coherence traffic, the exact bottleneck the paper identified.
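Locality-first admission is easier to picture with code. The following is a deliberately simplified sketch in the spirit of lock cohorting: one global lock plus one local lock per node, where the holder hands the global lock to a waiter on the same node a bounded number of times before letting other nodes in. It is not the published algorithm or any library's implementation. The node id is passed in by the caller (a real lock would derive it from the CPU the thread runs on), and kMaxHandoffs = 64 is an arbitrary fairness bound chosen for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

class TicketLock {
    std::atomic<std::uint32_t> next_{0}, serving_{0};
public:
    void lock() {
        const std::uint32_t my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my) std::this_thread::yield();
    }
    // May be called by a different thread than the one that locked.
    void unlock() {
        serving_.store(serving_.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);
    }
    // Only meaningful when called by the current holder.
    bool has_waiters() const {
        return next_.load(std::memory_order_relaxed) -
               serving_.load(std::memory_order_relaxed) > 1;
    }
};

class CohortLock {
    struct alignas(64) Node {
        TicketLock local;
        bool owns_global = false;  // guarded by `local`
        int handoffs = 0;          // guarded by `local`
    };
    static constexpr int kMaxHandoffs = 64;  // arbitrary bound (assumption)
    TicketLock global_;
    std::vector<Node> nodes_;
public:
    explicit CohortLock(int node_count) : nodes_(node_count) {}

    void lock(int node) {
        Node& n = nodes_[node];
        n.local.lock();
        if (!n.owns_global) {      // nobody on this node passed us the global lock
            global_.lock();
            n.owns_global = true;
            n.handoffs = 0;
        }
    }

    void unlock(int node) {
        Node& n = nodes_[node];
        if (n.local.has_waiters() && n.handoffs < kMaxHandoffs) {
            ++n.handoffs;          // keep the global lock on this node
        } else {
            n.owns_global = false; // let other nodes compete
            global_.unlock();
        }
        n.local.unlock();
    }
};

// Hypothetical demo: threads are assigned to two pretend nodes by index.
long cohort_demo(int threads, int iters) {
    CohortLock lock(2);
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            const int node = t % 2;
            for (int i = 0; i < iters; ++i) { lock.lock(node); ++counter; lock.unlock(node); }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

The design choice to notice is that the global lock changes nodes only when the local queue is empty or the handoff budget runs out, so the protected cache lines tend to stay within one node for a run of critical sections.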

The language layer finally caught up

The 2013 paper operated below the language level. C++11, published two years earlier, standardized std::atomic with explicit acquire/release/seq_cst semantics, and those semantics are now the common vocabulary for concurrent code. Rust went further, making data races in safe code a compile-time error through its ownership model and borrow checker. The memory model the paper had to reason about informally is now formally specified and enforceable.

What hasn’t changed

The central insight holds: synchronization scalability is primarily a hardware property. Locality still wins. Contention still kills. Crossing a domain boundary, whether socket, CCD, or CXL tier, can still cost an order of magnitude or more over a local access. The paper remains the right place to build intuition. It’s just that the hardware you’re building intuition about looks nothing like the hardware it measured.
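The cheapest way to apply "locality wins" is to stop sharing in the first place. As a small sketch, the counter below gives every thread its own cache-line-sized slot and sums the slots at the end. The 64-byte line size is an assumption that holds on most current x86 and ARM server parts but not everywhere, and PaddedCounter and sharded_count are names made up for this example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// One counter per cache line, so threads do not invalidate each other's lines.
// Assumption: 64-byte cache lines.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

// Each thread increments only its own shard; the shards are summed after join.
long sharded_count(int threads, int iters) {
    std::vector<PaddedCounter> shards(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&shards, t, iters] {
            for (int i = 0; i < iters; ++i)
                shards[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();  // join orders the writes before the reads below
    long total = 0;
    for (auto& s : shards) total += s.value.load(std::memory_order_relaxed);
    return total;
}
```

Compared with a single shared atomic counter, no cache line here is written by more than one thread, so the hot path generates no cross-core coherence traffic regardless of how the cores are wired together.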
