I’ve been writing concurrent code for years, but until maybe eight months ago I had a kind of unexamined fear of atomic ordering. I’d seen std::memory_order_acquire in C++ code and Ordering::AcqRel in Rust and atomic.LoadInt64 in Go (whose memory model makes it sequentially consistent), and I knew, vaguely, that these were “memory fences” and that SeqCst was slower and Relaxed was faster and also dangerous. So I always used SeqCst, because it was the safest, and ignored the others.

Eventually I had to care — specifically, on an ARM64 machine where SeqCst was noticeably slower than on x86, and a hot path in a lock-free queue was taking 40% longer than it could. I sat down with a paper notebook and drew some memory ordering diagrams, and it clicked.

Here’s the mental model I ended up with.

Imagine two threads and a shared flag. Thread 1 does some work, then sets the flag to 1. Thread 2 polls the flag, and once it sees 1, reads the shared work.

Thread 1:                    Thread 2:
data = 42;                   while (flag.load() == 0) {}
flag.store(1);               x = data;

The question is: when Thread 2 reads data, does it see 42? If there are no memory ordering guarantees, the CPU can freely reorder Thread 1’s writes. The store to data might get buffered, and the store to flag might go out first. Thread 2 sees flag == 1, reads data, gets garbage.

The fix is memory ordering. Specifically:

  • Thread 1’s flag.store(1) should be a release store. This guarantees that every memory write before the store in program order is visible to any thread whose acquire load observes the stored value.
  • Thread 2’s flag.load() should be an acquire load. This guarantees that no read or write after the load in program order can be reordered before it.

Together, acquire-release pairing creates a happens-before relationship: everything Thread 1 did before the release store is visible to Thread 2 once Thread 2’s acquire load sees the flag.

In Rust:

use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static FLAG: AtomicU32 = AtomicU32::new(0);
static mut DATA: u32 = 0; // non-atomic on purpose: the atomics do the publishing

fn main() {
    let t1 = thread::spawn(|| {
        unsafe { DATA = 42 };
        FLAG.store(1, Ordering::Release); // everything above is visible...
    });
    let t2 = thread::spawn(|| {
        while FLAG.load(Ordering::Acquire) == 0 {} // ...to whoever sees the 1 here
        let x = unsafe { DATA };
        assert_eq!(x, 42);
    });
    t1.join().unwrap();
    t2.join().unwrap();
}

Why not use SeqCst everywhere? Because SeqCst gives you a stronger guarantee — it also imposes a single total order on all SeqCst operations, across all threads. On x86, which is strongly ordered (TSO), plain loads and stores already have acquire/release semantics, so only SeqCst stores pay extra, via a full fence or a locked instruction. On ARM or RISC-V, SeqCst requires explicit barriers or ordered-access instructions (dmb ish, ldar/stlr, etc.) that cost real cycles. For a flag that’s being hit in a hot loop, the difference is measurable.
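
To make the “single total order” concrete, here’s the classic store-buffering litmus test, sketched in Rust (my own example, not from any particular source). Each thread stores to one flag and then loads the other. With Release stores and Acquire loads, both loads are allowed to return false; with SeqCst, at least one thread must see the other’s store.

use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst)
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst)
    });
    let (r1, r2) = (t1.join().unwrap(), t2.join().unwrap());
    // SeqCst forbids (false, false): all four operations sit in one total
    // order, so whichever store comes first is visible to the other load.
    // With Release/Acquire this assert could fail, because acquire/release
    // says nothing about a store followed by a load of a different location.
    assert!(r1 || r2);
}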

Relaxed gives you atomicity (no torn reads or writes) but no ordering. Use it for counters where you don’t care about ordering — e.g., a metric counter that’s summed across threads. If you just want “the total is roughly correct,” relaxed is fine. If you want “after I do X, the counter is at least Y,” you need acquire/release or SeqCst.

// relaxed is fine for a metric
use std::sync::atomic::{AtomicU64, Ordering};

static REQUESTS: AtomicU64 = AtomicU64::new(0);

fn handle() {
    REQUESTS.fetch_add(1, Ordering::Relaxed); // atomic, but no ordering constraints
    // do the work
}
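
The read side can stay Relaxed too; a minimal sketch against the same REQUESTS static:

// we only want an approximately-current total, so no ordering is needed
fn report() -> u64 {
    REQUESTS.load(Ordering::Relaxed)
}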

AcqRel is for read-modify-write operations where you want both acquire and release semantics. A lock-free stack push, for example:

use std::sync::atomic::{AtomicPtr, Ordering};

// sketch: top is an &AtomicPtr<Node>, node is a *mut Node with a `next` field
loop {
    let head = top.load(Ordering::Acquire);
    unsafe { (*node).next = head };
    // success: AcqRel (acquire the head we read, release our write to next);
    // failure: Acquire (resynchronize with the new head and retry)
    if top.compare_exchange_weak(head, node, Ordering::AcqRel, Ordering::Acquire).is_ok() {
        break;
    }
}

The CAS is a read and a write in one operation. AcqRel says: the read half acquires (it sees prior releases from other threads) and the write half releases (any thread that later acquires this atomic sees our writes up to this point).
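
The Acquire half earns its keep on the other end. Here’s a matching pop, as a sketch only: it ignores the ABA problem and memory reclamation (real code needs hazard pointers or epochs), and it assumes the same hypothetical Node/top types as the push above.

fn pop(top: &AtomicPtr<Node>) -> Option<*mut Node> {
    loop {
        let head = top.load(Ordering::Acquire);
        if head.is_null() {
            return None;
        }
        // the Acquire above synchronizes with the pusher's Release, so this
        // read of (*head).next sees the fully written node
        let next = unsafe { (*head).next };
        if top.compare_exchange_weak(head, next, Ordering::AcqRel, Ordering::Acquire).is_ok() {
            return Some(head);
        }
    }
}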

A useful drawing I made in my notebook: imagine memory operations as a partial order. A release store is a dam that stops earlier memory operations from flowing past it; an acquire load is a dam that stops later operations from flowing back before it. Relaxed is “no dam” — the CPU can reorder whatever it wants. SeqCst is “this is a synchronization point for EVERYONE” — all SeqCst operations across all threads agree on a single order.

One real example where this matters: initializing a shared object. Using Once/double-checked-locking style:

use std::sync::atomic::{AtomicBool, Ordering};

static INIT_DONE: AtomicBool = AtomicBool::new(false);
static mut DATA: Option<Thing> = None;

fn get_thing() -> &'static Thing {
    if !INIT_DONE.load(Ordering::Acquire) {
        init(); // takes a mutex, initializes DATA, then stores true with Release
    }
    unsafe { DATA.as_ref().unwrap() }
}

The Acquire on the load is essential. Without it, you might see INIT_DONE == true and still read None from DATA, because the compiler or CPU is free to satisfy the DATA read before the flag load. With Release on the init-side store, any thread that observes true with an Acquire load is guaranteed to see the DATA = Some(...) write.
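
For completeness, the init side could look something like this. A sketch under assumptions: INIT_LOCK and Thing::new() are my stand-ins, not something from a real codebase.

use std::sync::Mutex;

static INIT_LOCK: Mutex<()> = Mutex::new(());

fn init() {
    let _guard = INIT_LOCK.lock().unwrap();
    // re-check under the lock so only one thread runs the initializer
    if !INIT_DONE.load(Ordering::Relaxed) {
        unsafe { DATA = Some(Thing::new()) }; // hypothetical constructor
        // Release pairs with the Acquire load in get_thing(): any thread
        // that sees `true` also sees the DATA write
        INIT_DONE.store(true, Ordering::Release);
    }
}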

Go makes this easier by not giving you ordering primitives — the Go memory model specifies sync/atomic operations as sequentially consistent. That’s simple and safe, but it can be slow on weakly ordered architectures. Rust and C++ give you the full toolkit.

My practical rules:

  • If you’re using atomics to communicate state between threads (flags, once-init, lock-free structures), acquire-release pairs are what you want; stay on SeqCst until you’ve reasoned through the protocol and measured the difference.
  • If you’re using atomics purely as counters (metrics, statistics, anything where ordering doesn’t matter), use Relaxed.
  • If you’re not sure, use SeqCst. The cost is usually invisible, and an ordering that is too strong is never the bug.
  • Read one of the ordering papers — Boehm and Adve’s “Foundations of the C++ Concurrency Memory Model” was the one that finally clicked for me.

I don’t want to pretend I deeply understand the full ARM or RISC-V memory model. I probably don’t. But I’ve got enough of a mental model that I can read a lock-free data structure and make a reasonable call about which orderings are right, and I stopped defaulting to SeqCst out of fear.