These examples are from this paper, which inspired the excellent Rust tool loom. The talk Beyond Sequential Consistency is an excellent introduction to this topic.
The following text was written with the help of an LLM.
Example 1: The Result That Shouldn’t Be Possible (But Is)
Consider this program. Two threads, two shared variables.
Quick question: what are the possible outcomes for (r1, r2)?
Most would say: (0,0), (0,1), or (1,0). ThreadA might run before ThreadB, after it, or interleaved — that covers every case. It doesn’t.
The C++11 standard also allows r1 = 1, r2 = 1! This seems impossible, since for it to happen, each thread must read a value that the other thread hasn’t stored yet. No sequential interleaving of the two threads can produce this: there is no ordering of the four statements that yields this result when run in sequence. But the C++ memory model isn’t sequential.
memory_order_relaxed is the weakest ordering available, and it explicitly permits memory operations to different locations to be reordered. A compiler is allowed — in fact, encouraged — to reorder the load and store within each thread to optimize performance. On ARM and PowerPC CPUs, this reordering can happen in hardware even without compiler involvement.
So when you write memory_order_relaxed, you’re not just saying “I don’t need a lock.” You’re saying “I’m fine with any possible reordering of my memory operations.”
Example 2: Acquire/Release Doesn’t Mean What You Think
The standard fix is to use memory_order_acquire and memory_order_release to create synchronization pairs. Let’s modify the previous example:
Now, if r1 = 1 — meaning threadA’s acquire-load read the value stored by threadB’s release-store — then a synchronization relationship is established. The standard guarantees that everything threadB did before the release is visible to threadA after the acquire.
This means: when r1 = 1, threadB must have already executed its x.load() before doing y.store(1, release), and at that point threadA’s x.store(1) could not yet have happened, because it is sequenced after the acquire-load that read threadB’s store. So threadB read 0, and r2 = 0. The possible outcomes are now only (0,0), (0,1), and (1,0). The (1,1) result is gone.
By the way, the acquire-load does not “force” the release-store to happen before it; threadA’s acquire-load doesn’t make threadB’s release-store complete first. It only says: if I read the value written by the other thread’s release-store, then I can safely assume I see everything that happened in that thread up to that point.
Why this is still dangerous
The logic above looks simple but it’s not. It requires you to:
- Identify every pair of atomic operations that need to synchronize.
- Correctly apply acquire/release to every single one of them.
- Reason through the transitive closure of happens-before across the entire execution.
Miss a single pair, or apply the wrong memory order to one operation, and you’re back in the land of (1,1). There’s no compiler warning. The code compiles cleanly. Your tests will almost certainly pass, because the (1,1) outcome might require a specific micro-architectural timing to trigger.
Example 3: The Modification Order Is a Hidden Variable
Here’s a subtler problem. Consider a single atomic variable with two stores and two loads:
The C++11 standard defines a modification order for every atomic object — a total order of all stores to that object that every thread must agree on. This isn’t a lock, and it isn’t sequential consistency — it’s a weaker guarantee that the relative order of stores to a single variable is consistent.
This constraint is invisible in the source code. You can’t annotate it or inspect it at runtime. But it has real consequences. In this example:
- ThreadA stores 1, then 2, to x. Therefore, in the modification order, store A comes before store B.
- If load C reads the value 2 (from store B), load D cannot read the value 1 (from store A).
- Why? Because that would require store A to appear after store B in the modification order — a contradiction.
So an execution where r1 = 2, r2 = 1 is illegal even though both loads are relaxed, even though there’s no acquire/release pair, and even though naive reasoning might suggest any combination is possible.
The problem is that most developers don’t reason about modification order at all. They reason about thread interleavings. But the memory model is not defined in terms of interleavings — it’s defined in terms of relations (reads-from, happens-before, modification order, synchronizes-with). These relations can produce constraints that are deeply non-obvious, and code written without accounting for them silently relies on orderings the model does not guarantee.
Example 4: The “Satisfaction Cycle” — Values Appearing Out of Thin Air
This one doesn’t just produce surprising results. It breaks the foundations of program reasoning.
Question: can r1 = r2 = 42?
Neither x nor y is initialized to 42. Thread 1 stores 42 to y only if it reads 42 from x. Thread 2 stores 42 to x only if it reads 42 from y. The value 42 can only enter the system if it’s already there.
This is a satisfaction cycle — a circular dependency where values manufacture themselves out of nothing. The C++11 standard informally discourages this result but (as of C++11) doesn’t formally prohibit it. The relevant clause (§29.3p9) was discovered to be both over-constraining in some cases and under-constraining in others, and was subsequently removed in C++14 with a note to “just not do this,” without a formal definition of what “this” means.
The paper linked at the beginning of this post found many bugs in other papers and in real-world implementations. These bugs were not data races in the classical sense but violations of the memory model’s modification order constraints: a specific combination of relaxed loads and stores that, under a legal but non-obvious execution, produced an incorrect result. The original authors had not found these bugs through testing, because:
- The triggering execution requires a specific ordering of memory operations that real hardware only produces under particular timing conditions.
- The x86 architecture — where most concurrent C++ is tested — is stronger than the C++ memory model requires, so many illegal executions simply never occur on x86 even when the code is technically wrong.
- The bug would surface on ARM or PowerPC, which honor the weaker semantics the standard actually specifies.
This is the core danger of writing to the C++ memory model: you test on x86, which is well-behaved. Your code ships to ARM, which isn’t. The standard permits ARM’s behavior. Your tests don’t catch it.
What You Should Actually Do
1. Default to memory_order_seq_cst
The default when you write std::atomic<T> and don’t specify a memory order is memory_order_seq_cst — the strongest ordering, equivalent to the intuitive sequential model. It’s slower, but it’s correct. Optimize down from there only when you have a concrete performance need and are willing to formally verify the result.
2. Never use memory_order_relaxed without a proof
memory_order_relaxed gives you no synchronization guarantees whatsoever. It is appropriate only for things like counters where you only care about atomicity of individual operations, not their ordering relative to anything else. If you’re using it to communicate between threads, you almost certainly need something stronger.
3. Use a model checker
Tools like CDSChecker (from the linked paper) and its successors can exhaustively explore every legal execution of a unit test under the C/C++ memory model. If your concurrent data structure can be expressed as a small unit test, running it through a model checker is the only way to be confident it’s correct — not just confident that it passes your tests.
4. Understand that x86 is “different” — it has a stronger memory model
x86 has a stronger memory model than C++ requires. Code that appears correct on x86 may silently break on ARM, PowerPC, or RISC-V. If you write to the C++ standard rather than to x86-specific guarantees, test on architectures that actually exercise the weaker guarantees.