Tag Archives: c++

Synchronization Scalability Still Depends On Hardware First, Lock Algorithm Second

A review of synchronization on modern hardware

Pick a lock algorithm, benchmark it on your laptop that has about 8 cores. Now, deploy it to a 96-core EPYC, and the results may not even make sense. Locking algorithms don’t scale across hardware since it depends more on hardware topology and design than on algorithm choice.

That was the central finding of David, Guerraoui, and Trigonakis at SOSP 2013, when they ran nine lock algorithms across four architectures and found no consistent winner. [ACM] [PDF] This finding is still very relevant even though the hardware they used in the study are either dated.

Synchronization in 2025: The Hardware Got Weirder

The die is no longer a socket

In 2013, “NUMA” meant crossing a socket boundary — an expensive but well-understood hop. Today, AMD’s EPYC Turin packages up to 192 cores across twelve Core Complex Dies (CCDs) connected via Infinity Fabric to a central I/O die. Cross-CCD latency within a single socket is now a real design consideration, sitting between intra-core and cross-socket cost. You have NUMA inside what the OS thinks is one processor. The original paper never modeled this. Intel’s shift from ring bus to mesh NoC and UPI further changed the coherence adjudication model — on modern Intel systems, remote misses are adjudicated by the home node of the cache line, not broadcast to all caches.

Memory now has tiers

The 2013 paper assumed two kinds of memory: local DRAM and remote DRAM. CXL (Compute Express Link) adds a third tier — coherent-but-slower memory, attached over PCIe 5.0/6.0. CXL 2.0 supports memory pooling; CXL 3.0 adds peer-to-peer DMA and multi-level switching. This means lock implementations that assume uniform DRAM latency are now making a wrong assumption on a growing fraction of production hardware.

ARM broke the x86 monoculture

The paper studied only x86. AWS Graviton, Ampere Altra, and the Neoverse family now run significant datacenter workloads. ARM’s weak memory model (compared to x86’s TSO) means more explicit dmb/dsb barrier instructions are required, and the cost profile of atomics differs in ways that make the 2013 benchmark numbers non-transferable.

TSX is dead; NUMA-aware locks are alive

Hardware Transactional Memory via Intel TSX — which the paper flagged as promising — was effectively killed by the TAA vulnerability in 2019. In its place, a generation of NUMA-aware software locks has matured: Lock Cohorting, Compact NUMA-Aware Locks (CNA) (EuroSys 2019), and Scalable locks with shuffling (SOSP 2019) all exploit locality-first admission to reduce cross-socket coherence traffic — the exact bottleneck the paper identified.

The language layer finally caught up

The 2013 paper operated below the language level. Since then, C++11 standardized std::atomic with explicit acquire/release/seq_cst semantics. Rust went further, making data races a compile-time error through its ownership and borrow checker. The memory model the paper had to reason about informally is now formally specified and enforceable.

What hasn’t changed

The central insight holds: synchronization scalability is primarily a hardware property. Locality still wins. Contention still kills. Crossing domain boundaries — whether socket, CCD, or CXL tier — still costs orders of magnitude more than local access. The paper remains the right place to build intuition. It’s just that the hardware you’re building intuition about looks nothing like the hardware it measured.

Further reading:

Notes on the C++ memory model: why lock-free DSA is hard

Examples in post are from this paper which was the inspiration behind this excellent Rust tool loom. This talk Beyond Sequential Consistency is an excellent introduction to this topic.

Following text is written with help of a LLM.

Example 1: The Result That Shouldn’t Be Possible (But Is)

Consider this program. Two threads, two shared variables.

atomic<int> x(0), y(0);
void threadA() {
int r1 = y.load(memory_order_relaxed);
x.store(1, memory_order_relaxed);
}
void threadB() {
int r2 = x.load(memory_order_relaxed);
y.store(1, memory_order_relaxed);
}

Quick question: what are the possible outcomes for (r1, r2)?

Most would say: (0,0), (0,1), or (1,0). ThreadA might run before ThreadB, after it, or interleaved — that covers every case. It doesn’t.

The C++11 standard also allows r1 = 1, r2 = 1! Which seems impossible since for this to happen, both threads must be reading the value that the other thread hasn’t stored yet! This cannot happen under any sequential interleaving of the two threads. There is no ordering of the four statements that produces this result if you run them in sequence. But the C++ memory model isn’t sequential.

memory_order_relaxed is the weakest ordering available, and it explicitly permits memory operations to different locations to be reordered. A compiler is allowed — in fact, encouraged — to reorder the load and store within each thread to optimize performance. On ARM and PowerPC CPUs, this reordering can happen in hardware even without compiler involvement.

So when you write memory_order_relaxed, you’re not just saying “I don’t need a lock.” You’re saying “I’m fine with any possible reordering of my memory operations.”

Example 2: Acquire/Release Doesn’t Mean What You Think

The standard fix is to use memory_order_acquire and memory_order_release to create synchronization pairs. Let’s modify the previous example:

atomic<int> x(0), y(0);
void threadA() {
int r1 = y.load(memory_order_acquire); // changed
x.store(1, memory_order_relaxed);
}
void threadB() {
int r2 = x.load(memory_order_relaxed);
y.store(1, memory_order_release); // changed
}

Now, if r1 = 1 — meaning threadA‘s acquire-load read the value stored by threadB‘s release-store — then a synchronization relationship is established. The standard guarantees that everything threadB did before the release is visible to threadA after the acquire.

This means: when r1 = 1, threadB must have already executed x.load()(which is 0) before doing y.store(1, release). ThreadA‘s load/acquire on y “saw” that store/release, so threadA must observe the effects of everything before it. Therefore r2 = 0. The possible outcomes are now only: (0,0), (0,1), and (1,0). The (1,1) result is gone.

By the way, the load/acquire does not “force” store/release to “happen” before it i.e. threadA load/acquire doesn’t force threadB‘s store/release is complete first. It only says, if I read value from other thread’s store/release then it is safe to assume that I can see everything that happened up to that point in other thread.

Why this is still dangerous

The logic above looks simple but it’s not. It requires you to:

  1. Identify every pair of atomic operations that need to synchronize.
  2. Correctly apply acquire/release to every single one of them.
  3. Reason through the transitive closure of happens-before across the entire execution.

Miss a single pair, or apply the wrong memory order to one operation, and you’re back in the land of (1,1). There’s no compiler warning. The code compiles cleanly. Your tests will almost certainly pass, because the (1,1) outcome might require a specific micro-architectural timing to trigger.

Example 3: The Modification Order Is a Hidden Variable

Here’s a subtler problem. Consider a single atomic variable with two stores and two loads:

atomic<int> x(0);
void threadA() {
x.store(1, memory_order_relaxed); // A
x.store(2, memory_order_relaxed); // B
}
void threadB() {
int r1 = x.load(memory_order_relaxed); // C
int r2 = x.load(memory_order_relaxed); // D
}

The C++11 standard defines a modification order for every atomic object — a total order of all stores to that object that every thread must agree on. This isn’t a lock, and it isn’t sequential consistency — it’s a weaker guarantee that the relative order of stores to a single variable is consistent.

This constraint is invisible in the source code. You can’t annotate it or inspect it at runtime. But it has real consequences. In this example:

  • ThreadA stores 1, then 2, to x. Therefore in the modification order, store A comes before store B.
  • If load C reads the value 2 (from store B), load D cannot read the value 1 (from store A).
  • Why? Because that would require A to appear after B in the modification order — a contradiction.

So an execution where r1 = 2, r2 = 1 is illegal even though both loads are relaxed, even though there’s no acquire/release pair, and even though naive reasoning might suggest any combination is possible.

The problem is that most developers don’t reason about modification order at all. They reason about thread inter-leavings. The memory model is not defined in terms of inter-leavings — it’s defined in terms of relations (reads-from, happens-before, modification order, synchronizes-with). These relations can produce constraints that are deeply non-obvious, and violating them silently yields undefined behavior.

Example 4: The “Satisfaction Cycle” — Values Appearing Out of Thin Air

This one doesn’t just produce surprising results. It breaks the foundations of program reasoning.

// Thread 1
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
y.store(r1, memory_order_relaxed);
// Thread 2
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
x.store(42, memory_order_relaxed);

Question: can r1 = r2 = 42?

Neither x nor y is initialized to 42. Thread 1 stores 42 to y only if it reads 42 from x. Thread 2 stores 42 to x only if it reads 42 from y. The value 42 can only enter the system if it’s already there.

This is a satisfaction cycle — a circular dependency where values manufacture themselves out of nothing. The C++11 standard informally discourages this result but (as of C++11) doesn’t formally prohibit it. The relevant clause (§29.3p9) was discovered to be both over-constraining in some cases and under-constraining in others, and was subsequently removed in C++14 with a note to “just not do this,” without a formal definition of what “this” means.


The paper I linked at the beginning of the paper found many bugs in other papers and real-world implementations. These bugs were not a data race in the classical sense but a violation of the memory model’s modification order constraints: a specific combination of relaxed loads and stores that, under a legal but non-obvious execution, produced an incorrect result. The authors of the original implementation had not found it through testing, because:

  1. The triggering execution requires a specific ordering of memory operations that real hardware only produces under particular timing conditions.
  2. The x86 architecture — where most concurrent C++ is tested — is stronger than the C++ memory model requires, so many illegal executions simply never occur on x86 even when the code is technically wrong.
  3. The bug would surface on ARM or PowerPC, which honor the weaker semantics the standard actually specifies.

This is the core danger of writing to the C++ memory model: you test on x86, which is well-behaved. Your code ships to ARM, which isn’t. The standard permits ARM’s behavior. Your tests don’t catch it.


What You Should Actually Do

1. Default to memory_order_seq_cst

The default when you write std::atomic<T> and don’t specify a memory order is memory_order_seq_cst — the strongest ordering, equivalent to the intuitive sequential model. It’s slower, but it’s correct. Optimize down from there only when you have a concrete performance need and are willing to formally verify the result.

2. Never use memory_order_relaxed without a proof

memory_order_relaxed gives you no synchronization guarantees whatsoever. It is appropriate only for things like counters where you only care about atomicity of individual operations, not their ordering relative to anything else. If you’re using it to communicate between threads, you almost certainly need something stronger.

3. Use a model checker

Tools like CDSChecker (from linked paper, and its successors) can exhaustively explore every legal execution of a unit test under the C/C++ memory model. If your concurrent data structure can be expressed as a small unit test, running it through a model checker is the only way to be confident it’s correct — not just confident it passes your tests.

4. Understand that x86 is “different” — it has stronger memory model

x86 has a stronger memory model than C++ requires. Code that appears correct on x86 may silently break on ARM, PowerPC, or RISC-V. If you write to the C++ standard rather than to x86-specific guarantees, test on architectures that actually exercise the weaker guarantees.

BackupOnDelete: A Windows Minifilter And User Space App

When a file is deleted by any means possible, you want to ensure a backup of the file before its get permanently deleted. This post describes a windows minifilter and a user-space app working in tandem with minifilter to implement a working solution.

It is not a tutorial. You can browse the code and check if the patterns I’ve used make sense for your project. Additionally, you may like to have a look at the build system, packaging (nsis based installer) and application code.

Later I rewrote the application in Rust but that is not part of this project’s GitHub Repository.

Architectural summary

  • The minifilter monitors certain events in filesystem that says ‘delete this file’ (on close!).
  • Instead of letting kernel delete the file, the minifilter ask the kernel to rename the file (move it to a temporary space) and hide it. For example, if file D:\MyDocuments\important.pdf is being deleted, it will be moved to C:\ProgramData\Minifilter\D:^^MyDocument^^__??important.pdf??__ and then marked hidden. Minifilter also ensures that new file is not renamed by anyone.
  • Original filepath is converted to a special valid filename (filename!) e.g. D:\MyDocuments\important.pdf to D__^^MyDocument^^??important.pdf?? . As long as you can recover the original filepath from new filename unambigously, we are good to go. You can also use base64 encoded paths as filename. I used a simpler scheme since decoding base64 encoded path inside a kernel minifilter was quite a lot of work.
  • The new path is sent to user-space application which backs it in a bucket and then send a message a kernel minifilter to delete the file.
  • When delete this file request come from the user-space application, the minifilter doesn’t modify the delete-file event and let is go down the kernel stack. The kernel delete the file for sure this time.

Minifilter

This kernel minifilter is a cmake based project that also create a nsis based installer and sign it using your personal key. Note that you need to get the minifilter signed by MS or MS approved vendors before you can publish it. To load the minifilter into your machine, you should follow the official guide https://learn.microsoft.com/en-us/windows-hardware/drivers/ifs/development-and-testing-tools

  • The minifilter captures required filesystem events:
    CONST FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_CREATE, 0, BackupOnDeleteShield, NULL },
    // IRP_MJ_SET_INFORMATION is sent whenever file is marked for deletion.
    { IRP_MJ_SET_INFORMATION,
    FLTFL_OPERATION_REGISTRATION_SKIP_PAGING_IO,
    BackupOnDeleteShield,
    NULL },
    { IRP_MJ_OPERATION_END }
    };
  • See https://github.com/dilawar/minifilter-backup-before-delete/blob/e61e9c5fbfd9df51749d0de98acfbad7b7a6290b/minifilter/src/main.cpp#L396 for the definition of BackupOnDeleteShield function.
  • If the delete request has come from kenel mode application, let is pass. And if the user-space app is not connected with minifilter (more on it later), don’t do anything.

    if (Data->RequestorMode == KernelMode) {

    // if the requestor is kernel mode, pass the request on uninterrupted

    ret = FLT_PREOP_SUCCESS_NO_CALLBACK;

    goto CleanUp;

    }
        //
    // If no client is connected to create backups then there is no point using
    // the shield.
    //
    if (!IsUserAppConnected()) {
    DFLOG(ERROR, "Shield>" __FUNCTION__ ": No backup client connected.\n");
    ret = FLT_PREOP_SUCCESS_NO_CALLBACK;
    goto CleanUp;
    }
  • For an user-space app to talk to minifilter, the minifilter must create a channel to listen to. The port is opened by minifilter ‣. The channel is created using https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/fltkernel/nf-fltkernel-fltcreatecommunicationport

    PORT_STATE
    InitializeServerPort(IN PUNICODE_STRING CommunicationPortName,
    IN PFLT_FILTER Filter)
    {
    NTSTATUS status = STATUS_SUCCESS;

    gServerPortState = PORT_STATE::UnInitialized;

    PSECURITY_DESCRIPTOR SecurityDescriptor;
    OBJECT_ATTRIBUTES ObjectAttributes;

    PAGED_CODE();

    DFLOG(ERROR,
    "Shield>" __FUNCTION__ ": Trying opening port: '%wZ'.\n",
    CommunicationPortName);

    //
    // Create communication descriptor.
    //
    status = FltBuildDefaultSecurityDescriptor(&SecurityDescriptor,
    FLT_PORT_ALL_ACCESS);

    if (!NT_SUCCESS(status)) {
    DFLOG(
    ERROR,
    "Shield>" __FUNCTION__ " : Port is not initialized. Error "
    "FltBuildDefaultSecurityDescriptor - %X.\n",
    status);
    gServerPortState = PORT_STATE::UnInitialized;
    goto CleanUp;
    }

    InitializeObjectAttributes(&ObjectAttributes,
    CommunicationPortName,
    OBJ_KERNEL_HANDLE | OBJ_CASE_INSENSITIVE,
    NULL,
    SecurityDescriptor);

    status = FltCreateCommunicationPort(
    Filter,
    &gServerPort,
    &ObjectAttributes,
    NULL,
    ShieldConnect, // Connect notify callback
    ShieldDisconnect, // Disconnect notify callback.
    ShieldMessage, // message notify callback.
    MAX_CLIENTS);

    if (!NT_SUCCESS(status)) {
    DFLOG(ERROR,
    "Shield>" __FUNCTION__ " : Port is not initialized. Error "
    "FltCreateCommunicationPort - %X.\n",
    status);
    gServerPortState = PORT_STATE::UnInitialized;
    goto CleanUp;
    }

    DFLOG(ERROR,
    "Shield>" __FUNCTION__ ": opened server port handle 0x%p.\n",
    gServerPort);

    gServerPortState = PORT_STATE::Initialized;

    CleanUp:
    if (SecurityDescriptor)
    FltFreeSecurityDescriptor(SecurityDescriptor);

    return gServerPortState;
    }

User-space app

How to connect a user-space app to communicate with minifilter? We opened a port with a name. Use the same name to connect to the port (minifilter). See the app directory in the repository.

C++
void
ShieldClient::connect()
{
port_ = INVALID_HANDLE_VALUE;
HRESULT hr = S_OK;
## ifdef COMMUNICATION_IN_SYNC_MODE
//
// port in sync mode. We don't have to use Overlapped structure here.
// TODO: Not sure about the performance.
//
PLOGI << "Connecting to " << portname_ << " in SYNC mode.";
hr = FilterConnectCommunicationPort(
portname_.c_str(), FLT_PORT_FLAG_SYNC_HANDLE, NULL, 0, NULL, &port_);
## else
// port in async mode. This is the preferred way. Use with completion port.
PLOGI << "Connecting to " << portname_ << " in ASYNC mode.";
hr = FilterConnectCommunicationPort(
portname_.c_str(), 0, NULL, 0, NULL, &port_);
## endif
if (FAILED(hr)) {
PLOGW << "Failed to connect to Shield: Error: 0x" << std::hex << hr;
connected_ = false;
goto MAIN_EXIT;
}
completion_port_ =
CreateIoCompletionPort(port_, NULL, 0, 2 /* 2 user threads */);
if (NULL == completion_port_) {
hr = HRESULT_FROM_WIN32(GetLastError());
PLOGW << "Error in creating completion port. Code: 0x" << std::hex
<< hr;
goto MAIN_EXIT;
}
PLOGI << " ... Successfully connected.";
connected_ = true;
MAIN_EXIT:
if (!connected_ && INVALID_HANDLE_VALUE != port_) {
CloseHandle(port_);
}
}

Messaging

To keep it simple,

  • Each messages has first two bytes of it designated as message code.
  • Next 4 bytes are the size of the message, and rest is the message.

June 29, 2025: Weekly Note 2025/02

  • For the last couple of weeks, Ookie has been going to a day-care for a few hours. She gets to play with other kids there. “Playing with other kids” has been the reason for sending her there. After a few days of fuss, she now seems to be enjoying her time there. My very friendly neighbors think that she is too young to go to day-care!
  • Somehow I managed to run 400 km so far! My pace has been slowest. I still need to complete 600 km in the second half of the year.
    • I’ve started paying for RUNALYZE. It is nice that more and more services are offering purchase-parity plan for subscription for Indian users. I’ve been using jonasoreland/runnerup: A open source run tracker for a long time to sync my runs with Runalyze/Strava. One dollar=Rs 100, but it is roughly Rs 30 in purchase parity, so you have to reduce the pricing by third for Indian users! If you are doing this, I’d be happy to work for you for a purchase parity adjusted salary😛.
  • I added another small tool that generates QR codes with a logo. You can find it here https://tools.maxflow.in/tool/qrcodes.
  • I had a minor meltdown at work 😭. Not proud of it. It’s hard to keep cool when many co-workers write almost empty email without a subject/body to report bugs. I’ve been trying to get them to use GitLab issues for a few months now. Either, I need to disengage from work a little at this workplace, or find a place where co-workers are adults and not only they come to work but also know how to structure and plan it.
  • I read Programming as theory building : Naur.pdf and found it illuminating. I am pretty sure I’d have yawned reading it some 10 years ago. This article — written a year before I was born — put “programmers” at the center of software development. I don’t think I can reduce this article to soundbites. Do read. I learnt about this article from HN.

naur-prog-as-theory-building.png -- Naur - Programming as theory building
– I wrote another small utility to remind me that a LWN article has finally become open. I am not able to pay for LWN subscription due to HDFC Bank related issues and I forget to revisit the link when it is open. Perhaps I can rent a VPS in this money and read the article two weeks later?! LWN is a great resource and I feel bad for not paying for it though!
– My Wallet from Budgetbakers is no longer syncing with HDFC Bank. Their support is working on the issue. HDFC seem to have changed their login flow again! My another bank, DBS Bank, doesn’t have saving account API🤣. I opened account here thinking that they are “tech-savvy”!
– I’ve been thinking about hiring a lot these days. At my current company which is an early stage startup, they have been struggling to hire a dev for the last 5 months! I was part of a few interviews — some went well and most were meh, but not able to hire for 5 months feels a bit extreme!
– Both mango trees in my street has mangoes this year! Here is my daughter Ookie playing with her friend. Fortunately, like many streets in Bengaluru, this street is a dead end and have no traffic.

signal-2025-06-28-192959_002.jpeg -- Ookie trying to convince her co-worker to use Rust while Kalu the fifth doing her stuff

An example of `boost::fiber`

Fiber is just a thread implemented in user space. Fibers are easier to reason about and have the performance advantage of much cheaper context switching. Fibers are very well-suited for handling concurrent I/O operations where a processor mostly waits for data to become available; threads usually have a pretty high context-switching cost. Therefore, multiple fibers running in a single thread is an effective solution.

Here is a simple program I wrote to explore fibers. You can find the full example below. This program has two functions: print_a prints “a,” and print_b prints “b” and then launches a thread that prints “B” (in detached mode).

void print_a()
{
    cout << "a";
    boost::this_fiber::yield();
}

void print_b()
{
    cout << "b";
    std::thread j([]() { printf("B"); });
    j.detach();
    boost::this_fiber::yield();
}

Following is the main function. We created a shared variable i initialized to 0. We create two detached fibers. The first one keeps calling print_a until i < 20. Similarly, the second one loops on print_b until i < 20. Both increment i by 1. When i = 20, both fibers should be able to join.

int main()
{
    int i = 0;
    boost::fibers::fiber([&]() {
        do {
            print_a();
            i++;
        }
        while (i < 20);
    }).detach();

    boost::fibers::fiber([&]() {
        do {
            i++;
            print_b();
        } while (i < 20);
    }).detach();

    printf("X");
    return 0;
}

Let’s guess the output of this program. It is most likely to be the same as if std::threads were used instead of fibers.

Is “X” printed first? Yes. Note that detach() is called on each fiber, so neither of their functions is called immediately. They are put in the background. Control passes to the fiber manager at return 0;, which is when it asks the fibers to join. In fact, you can put more computations after the printf("X"); statement, and they would be computed before any fiber is called.

As soon as we try to return from main, the fiber manager is asked to join the fibers. The first fiber wakes up, “a” is printed, and the fiber yields control to the manager. The fiber manager then wakes up the second fiber (which was waiting in the queue), which prints “b” and also launches a thread in the background that prints “B”. We cannot be sure if “B” will be printed immediately after “b” (since it is a std::thread). print_b yields control and goes to sleep. The fiber manager wakes up the first fiber again, print_a is called, “a” is printed, and so on. Note that i is incremented every time either of the fibers is called.

When i hits 20, both fibers terminate and join, and the main function executes return 0;. Consequently, print_a is called ten times, and print_b is also called ten times. In the output, we should have ten “a”s, ten “b”s, and ten “B”s. “B” may not strictly follow “b,” but “b” must come after “a.”

Here are a few runs of the program. Note that the location of “B” is not deterministic.

XababBabBabBababBBabBabBabBabBB 
XababBabBabBabBabBabBabaBbBabBB 
XababBabBabBabBabBabBabBabBabBB 
XababBabBabBabBabBabBabBabBabBB 
XababBabBabBBababBabBabBabBabBB 
XababBabBabBabBabBabBabBabBabBB 
XababBabBabBababBBabBabBababBBB 
XababBabBabBababBBabBabBabBabBB 
XababBabBabBababBBabBabBabBabBB 
XababBabBabBabBabBababBBabBabBB 
XababBabBabBabBabBabBabBabBabBB

References A great talk by Nat on Boost fibers: https://youtu.be/e-NUmyBou8Q