In the world of multithreaded C++ programming, sharing data between threads introduces a fundamental challenge: data races. When multiple threads access the same memory location simultaneously, and at least one thread modifies that memory, chaos ensues.
Consider this seemingly innocent code:
int counter = 0;
// Thread 1
counter++;
// Thread 2
counter++;
What should be the final value of counter? You might expect 2, but the reality is far more sinister. The increment operation counter++ isn’t atomic - it’s actually three separate operations:
// What counter++ really does:
1. Load counter value into register
2. Increment the register
3. Store register back to counter
When two threads execute this simultaneously, the interleaving can produce unexpected results:
Thread 1: Load counter (0)
Thread 2: Load counter (0)
Thread 1: Increment (0 + 1 = 1)
Thread 2: Increment (0 + 1 = 1)
Thread 1: Store (counter = 1)
Thread 2: Store (counter = 1)
Final result: 1 (not 2!)
This is a data race, and according to the C++ standard, it results in undefined behavior. Your program might crash, produce wrong results, or appear to work correctly until the worst possible moment.
C++ atomics solve this problem by providing operations that are guaranteed to be atomic - they complete as a single, indivisible unit. No thread can observe an atomic operation in a half-completed state.
#include <atomic>
std::atomic<int> counter{0};
// Thread 1
counter++; // This increment is now atomic!
// Thread 2
counter++; // This increment is now atomic!
With atomics, the final result is guaranteed to be 2, regardless of thread scheduling.
Atomics provide three crucial guarantees: atomicity (each operation completes as a single indivisible unit), visibility (a write eventually becomes visible to other threads), and ordering (you can control how operations are ordered relative to surrounding memory accesses across threads).
Let’s explore what atomicity means in practice:
std::atomic<int> value{10};
// This is atomic - no thread can see a partial write
value.store(42);
// This is atomic - you get either the old or new value, never garbage
int current = value.load();
// This is atomic - exchange returns old value, sets new value
int old_value = value.exchange(100);
// This is atomic - increment and return new value
int new_value = ++value;
The key insight: each individual atomic operation is indivisible, but combinations of operations are not:
std::atomic<int> x{0}, y{0};
// Thread 1
x.store(1); // Atomic
y.store(2); // Atomic
// Thread 2 might see:
// x=0, y=0 (before Thread 1 starts)
// x=1, y=0 (after first store in Thread 1)
// x=1, y=2 (after both stores in Thread 1)
// But never x=0, y=2 or garbage values
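To make that concrete, here is a small sketch (an illustration, not one of the examples above): a check followed by a store is two atomic operations, not one atomic unit, so two threads can both pass the check. A single compare-and-swap (covered later in this article) performs the check and the write indivisibly.
#include <atomic>

std::atomic<int> slots{1};

// NOT atomic as a whole: both threads can see slots == 1 between
// the load and the store, and both believe they took the last slot.
bool take_slot_racy() {
    if (slots.load() == 1) {   // step 1: check
        slots.store(0);        // step 2: act
        return true;
    }
    return false;
}

// Atomic as a whole: exactly one thread can move slots from 1 to 0.
bool take_slot_atomic() {
    int expected = 1;
    return slots.compare_exchange_strong(expected, 0);
}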
Compilers and processors can reorder memory operations for optimization, but these reorderings must preserve the observable behavior within a single thread. However, other threads may observe these reorderings, which can cause synchronization issues. When you modify a flag or variable from one thread, other threads might see surrounding memory operations happen out of order relative to that flag change. This occurs because each processor core works with its own cache, and memory updates may not immediately become visible to other cores.
To establish ordering guarantees between threads, we use std::memory_order_xxx flags with atomic operations. These flags create synchronization relationships - for example, when one thread uses memory_order_release on a store operation, another thread can use memory_order_acquire on the corresponding load to ensure proper ordering. The key point is that both threads participating in synchronization need compatible memory ordering specifications to establish the desired ordering relationship.
Memory ordering controls how reorderings appear across thread boundaries, ensuring that when synchronization is needed, threads can establish the necessary ordering relationships through compatible atomic operations.
Atomicity alone isn’t enough for proper thread synchronization. Consider this scenario:
std::atomic<bool> ready{false};
int data = 0; // Regular variable
// Producer thread
data = 42; // Write data
ready.store(true); // Signal it's ready
// Consumer thread
if (ready.load()) {
std::cout << data; // Will this print 42?
}
Fortunately, this code works correctly! The consumer will always print 42 when it sees ready = true. This is because atomic operations use sequential consistency by default (memory_order_seq_cst), which provides strong ordering guarantees.
However, this raises an important question: what if we need different levels of synchronization? Sometimes we want maximum performance and don’t need such strong guarantees. Other times we need precise control over what gets synchronized and what doesn’t.
This is where memory ordering becomes crucial - it allows us to fine-tune the synchronization behavior of our atomic operations.
To understand why memory ordering matters, imagine if atomic operations provided only atomicity but no ordering guarantees. In such a hypothetical world:
Producer Thread Consumer Thread
=============== ===============
data = 42
ready.store(true) ───► if (ready.load()) // true
│
└─► data might still be 0!
The consumer could observe the atomic store to ready before the regular write to data becomes visible, even though they were written in order by the producer thread.
The default memory_order_seq_cst prevents this problem by giving every atomic operation acquire/release semantics and by enforcing a single total order of all sequentially consistent operations that every thread observes consistently.
Every atomic operation accepts a memory ordering parameter that controls synchronization:
enum memory_order {
memory_order_relaxed,
memory_order_consume, // Discouraged; implementations treat it as acquire
memory_order_acquire,
memory_order_release,
memory_order_acq_rel,
memory_order_seq_cst // Default
};
Let’s examine each ordering level:
std::atomic<int> counter{0};
// Only the counter increment is atomic
// No synchronization with other memory operations
counter.fetch_add(1, std::memory_order_relaxed);
Relaxed ordering provides only atomicity and visibility - no ordering guarantees. Other threads will still agree on the modification order of that specific atomic variable and will eventually see its latest value. Use relaxed ordering for stores when other threads only need to know that the store happened, but don’t need to see the other memory operations around the store in any particular order.
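A typical use case (a sketch with arbitrary thread and iteration counts): event counters that worker threads bump and the main thread reads only after joining the workers. Nothing else is published through the counter, so relaxed ordering is enough; the join itself provides the final synchronization.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> events_processed{0};

void worker() {
    for (int i = 0; i < 100000; ++i) {
        // ... handle one event ...
        events_processed.fetch_add(1, std::memory_order_relaxed); // only the count matters
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();   // join() synchronizes with each worker's completion
    std::cout << events_processed.load(std::memory_order_relaxed) << "\n"; // prints 400000
}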
The most commonly used ordering pair:
std::atomic<bool> ready{false};
int data = 0;
// Producer (Release)
data = 42; // 1
ready.store(true, std::memory_order_release); // 2
// Consumer (Acquire)
if (ready.load(std::memory_order_acquire)) { // 3
assert(data == 42); // Always true! // 4
}
The C++ standard ensures that any memory written before the release store is visible to the thread that performs an acquire load.
Here’s the guarantee: every write made before the release store (line 2), including the write to data on line 1, is visible after the acquire load (line 3). If the consumer sees ready == true, it’s guaranteed to see data == 42.
Producer Thread Consumer Thread
=============== ===============
data = 42 ───┐
│
ready.store() │ ───► if (ready.load())
(release) │ (acquire)
│ │
└───────────┘
Synchronizes-with
Everything above ───► Everything below
the release the acquire
becomes visible can see it
Acquire-Release is the sweet spot of atomic programming - providing strong synchronization guarantees while being more efficient than sequential consistency.
Think of acquire-release as a one-way synchronization barrier:
std::atomic<bool> flag{false};
int shared_data[100];
// Producer Thread
for (int i = 0; i < 100; ++i) {
shared_data[i] = i * i; // Fill array
}
flag.store(true, std::memory_order_release); // "I'm done!"
// Consumer Thread
if (flag.load(std::memory_order_acquire)) { // "Are you done?"
// ALL writes above the release are now visible
for (int i = 0; i < 100; ++i) {
assert(shared_data[i] == i * i); // Always passes!
}
}
Producer Thread Consumer Thread
=============== ===============
shared_data[0] = 0
shared_data[1] = 1
shared_data[2] = 4
...
shared_data[99] = 9801
═══════════════════ ═══════════════════
RELEASE BARRIER ACQUIRE BARRIER
flag.store(true, release) ──→ if(flag.load(acquire))
═══════════════════ ═══════════════════
// All data above is
// guaranteed visible here
process(shared_data)
template<typename T>
class LockFreeQueue {
private:
struct Node {
std::atomic<T*> data{nullptr};
std::atomic<Node*> next{nullptr};
};
std::atomic<Node*> head{new Node};
std::atomic<Node*> tail{head.load()};
public:
void enqueue(T item) {
Node* new_node = new Node;
T* data_ptr = new T(std::move(item));
Node* prev_tail = tail.exchange(new_node, std::memory_order_acq_rel);
prev_tail->data.store(data_ptr, std::memory_order_release);
prev_tail->next.store(new_node, std::memory_order_release);
}
bool dequeue(T& result) {
Node* head_node = head.load(std::memory_order_acquire);
Node* next = head_node->next.load(std::memory_order_acquire);
if (next == nullptr) return false; // Queue empty
T* data = head_node->data.load(std::memory_order_acquire); // enqueue stored the item in the old tail, which is now the head node
if (data == nullptr) return false; // Defensive: enqueue stores data before next
result = *data;
head.store(next, std::memory_order_release);
delete head_node;
delete data;
return true;
}
};
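A brief usage sketch (thread and item counts are arbitrary): in this design the tail exchange serializes producers, so enqueue can be called from several threads, but dequeue updates head with a plain store and deletes nodes immediately, so it should be driven from a single consumer thread.
LockFreeQueue<int> queue;

std::thread producer1([&] { for (int i = 0; i < 1000; ++i) queue.enqueue(i); });
std::thread producer2([&] { for (int i = 1000; i < 2000; ++i) queue.enqueue(i); });

std::thread consumer([&] {            // single consumer (see note above)
    int value;
    int received = 0;
    while (received < 2000) {
        if (queue.dequeue(value)) ++received;
        else std::this_thread::yield();   // queue momentarily empty
    }
});

producer1.join();
producer2.join();
consumer.join();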
Let’s trace through how memory ordering prevents race conditions:
std::atomic<bool> ready{false};
std::string message;
// Thread 1 - WRONG!
message = "Hello World";
ready.store(true, std::memory_order_relaxed); // Too weak!
// Thread 2 - WRONG!
if (ready.load(std::memory_order_relaxed)) { // Too weak!
std::cout << message; // Might print garbage!
}
Problem: Relaxed ordering provides no synchronization. Thread 2 might see ready = true, but message could still be empty or partially written.
std::atomic<bool> ready{false};
std::string message;
// Thread 1 - CORRECT
message = "Hello World"; // 1
ready.store(true, std::memory_order_release); // 2
// Thread 2 - CORRECT
if (ready.load(std::memory_order_acquire)) { // 3
std::cout << message; // Always prints "Hello World" // 4
}
Solution: The release-acquire pair creates a happens-before relationship. Operation 1 happens-before operation 4, guaranteeing Thread 2 sees the complete message.
Thread 1: message = "Hello" ──┐
ready = true (release) ──┼─── synchronizes-with ───┐
│ │
Thread 2: │ ┌─── happens-before ─┘
if (ready) (acquire) ────┘ │
cout << message ────────────────┘
Compare-and-Swap is the fundamental building block of lock-free programming. It's an atomic operation that says: "If the current value equals what I expect, replace it with a new value, otherwise update my expected value to the current value."
std::atomic<int> value{10};
int expected = 10;
int new_value = 20;
// Atomic operation: if (value == expected) { value = new_value; return true; } else { expected = value; return false; }
bool success = value.compare_exchange_strong(expected, new_value); // strong: a single attempt should not fail spuriously
if (success) {
// value is now 20, expected is still 10
std::cout << "Successfully changed value\n";
} else {
// value unchanged, expected now contains actual value
std::cout << "Failed to change, actual value was: " << expected << "\n";
}
C++ provides two variants:
// compare_exchange_strong: Never fails spuriously
bool success = value.compare_exchange_strong(expected, new_value);
// compare_exchange_weak: May fail spuriously (faster on some architectures)
bool success = value.compare_exchange_weak(expected, new_value);
Rule of thumb: Use weak in loops, strong for single attempts.
Instead of using atomic++, let’s implement it manually with CAS:
void atomic_increment(std::atomic<int>& counter) {
int current = counter.load();
while (!counter.compare_exchange_weak(current, current + 1)) {
// CAS failed, current now contains the actual value
// Loop and try again
}
}
CAS Loop in Action
Initial: counter = 5
Thread 1: Thread 2:
========= =========
current = 5 current = 5
CAS(5 → 6) ──┐ CAS(5 → 6) ──┐
│ │
▼ ▼
SUCCESS! FAIL!
counter = 6 current = 6 (updated)
CAS(6 → 7) ──► SUCCESS!
counter = 7
Consider this dangerous sequence:
// Initial state: pointer = A
Node* current = pointer.load(); // current = A
// Another thread: pointer: A → B → A (reused address!)
// This CAS succeeds even though pointer changed!
bool success = pointer.compare_exchange_strong(current, new_node);
Solution: Use a version counter or hazard pointers:
struct VersionedPointer {
Node* ptr;
std::uint64_t version;
};
std::atomic<VersionedPointer> versioned_ptr;
void safe_update(Node* new_node) {
VersionedPointer current = versioned_ptr.load();
VersionedPointer new_value{new_node, current.version + 1};
while (!versioned_ptr.compare_exchange_weak(current, new_value)) {
new_value.version = current.version + 1; // Update version
}
}
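One practical caveat worth adding: a pointer plus a 64-bit counter is 16 bytes on typical 64-bit platforms, so std::atomic<VersionedPointer> is lock-free only where the hardware offers a double-width compare-and-swap (and toolchains may need extra help, e.g. GCC's -mcx16 or linking libatomic); otherwise the library quietly falls back to an internal lock. A quick check:
#include <atomic>
#include <cstdint>
#include <iostream>

struct VersionedPointer {
    void* ptr;
    std::uint64_t version;
};

int main() {
    std::atomic<VersionedPointer> vp{VersionedPointer{nullptr, 0}};
    std::cout << std::boolalpha
              << "always lock-free: "
              << std::atomic<VersionedPointer>::is_always_lock_free   // C++17
              << ", this object lock-free: " << vp.is_lock_free() << "\n";
}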
Here’s a complete lock-free stack using CAS:
template<typename T>
class LockFreeStack {
private:
struct Node {
T data;
Node* next;
Node(T data) : data(std::move(data)), next(nullptr) {}
};
std::atomic<Node*> head{nullptr};
public:
void push(T item) {
Node* new_node = new Node(std::move(item));
// Classic CAS loop pattern
Node* current_head = head.load();
do {
new_node->next = current_head;
} while (!head.compare_exchange_weak(current_head, new_node));
}
bool pop(T& result) {
Node* current_head = head.load();
// Keep trying until we succeed or stack is empty
while (current_head != nullptr) {
if (head.compare_exchange_weak(current_head, current_head->next)) {
result = std::move(current_head->data);
delete current_head;
return true;
}
// current_head was updated by failed CAS, try again
}
return false; // Stack was empty
}
bool empty() const {
return head.load() == nullptr;
}
};
LockFreeStack<int> stack;
// Multiple threads can safely use this
std::thread producer([&]() {
for (int i = 0; i < 1000; ++i) {
stack.push(i);
}
});
std::thread consumer([&]() {
int value;
while (stack.pop(value)) {
std::cout << "Popped: " << value << "\n";
}
});
producer.join();
consumer.join();
Memory fences are explicit synchronization barriers that control memory ordering without performing atomic operations on data. They’re the “pure synchronization” tool in your atomic arsenal.
A fence is a standalone synchronization point that affects the ordering of memory operations around it:
#include <atomic>
std::atomic_thread_fence(std::memory_order_acquire); // Acquire fence
std::atomic_thread_fence(std::memory_order_release); // Release fence
std::atomic_thread_fence(std::memory_order_seq_cst); // Full fence
// Using atomic operations for synchronization
std::atomic<bool> flag{false};
int data = 0;
// Thread 1
data = 42;
flag.store(true, std::memory_order_release);
// Thread 2
if (flag.load(std::memory_order_acquire)) {
assert(data == 42);
}
Equivalent using fences:
// Using fences for synchronization
std::atomic<bool> flag{false};
int data = 0;
// Thread 1
data = 42;
std::atomic_thread_fence(std::memory_order_release); // Release fence
flag.store(true, std::memory_order_relaxed);
// Thread 2
if (flag.load(std::memory_order_relaxed)) {
std::atomic_thread_fence(std::memory_order_acquire); // Acquire fence
assert(data == 42);
}
Fences are particularly useful when you want one barrier to order several relaxed atomic operations at once, when the ordering constraint doesn’t attach naturally to a single atomic variable, or when porting algorithms that were originally written in terms of explicit hardware barriers.
Classic mutual exclusion algorithm made modern:
class DekkerMutex {
private:
std::atomic<bool> flag0{false};
std::atomic<bool> flag1{false};
std::atomic<int> turn{0};
public:
void lock_thread0() {
flag0.store(true, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
while (flag1.load(std::memory_order_relaxed)) {
if (turn.load(std::memory_order_relaxed) != 0) {
flag0.store(false, std::memory_order_relaxed);
while (turn.load(std::memory_order_relaxed) != 0) {
std::this_thread::yield();
}
flag0.store(true, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
}
}
}
void unlock_thread0() {
turn.store(1, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
flag0.store(false, std::memory_order_relaxed);
}
// Similar methods for thread1...
};
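A usage sketch, assuming the symmetric lock_thread1() / unlock_thread1() methods hinted at above are filled in (hypothetical here); note that Dekker’s algorithm only works for exactly two threads with fixed identities.
DekkerMutex mtx;
int shared_value = 0;               // protected by mtx

std::thread t0([&] {
    for (int i = 0; i < 100000; ++i) {
        mtx.lock_thread0();
        ++shared_value;             // critical section
        mtx.unlock_thread0();
    }
});

std::thread t1([&] {
    for (int i = 0; i < 100000; ++i) {
        mtx.lock_thread1();         // hypothetical symmetric method
        ++shared_value;
        mtx.unlock_thread1();       // hypothetical symmetric method
    }
});

t0.join();
t1.join();
// shared_value == 200000 if mutual exclusion holds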
Let’s build a more complete lock-free queue, in the spirit of the classic Michael-Scott queue, that demonstrates all of these concepts:
template<typename T>
class LockFreeQueue {
private:
struct Node {
std::atomic<T*> data;
std::atomic<Node*> next;
Node() : data(nullptr), next(nullptr) {}
};
std::atomic<Node*> head;
std::atomic<Node*> tail;
// Memory reclamation note: a complete implementation needs hazard pointers
// or another deferred-reclamation scheme; that machinery is omitted here
public:
LockFreeQueue() {
Node* dummy = new Node;
head.store(dummy, std::memory_order_relaxed);
tail.store(dummy, std::memory_order_relaxed);
}
void enqueue(T item) {
Node* new_node = new Node;
T* data = new T(std::move(item));
new_node->data.store(data, std::memory_order_relaxed);
Node* last = nullptr;
while (true) {
last = tail.load(std::memory_order_acquire);
Node* next = last->next.load(std::memory_order_acquire);
// Check if tail is still pointing to the last node
if (last == tail.load(std::memory_order_acquire)) {
if (next == nullptr) {
// Try to append new node
if (last->next.compare_exchange_weak(
next, new_node,
std::memory_order_release,
std::memory_order_relaxed)) {
break; // Successfully appended
}
} else {
// Help advance tail pointer
tail.compare_exchange_weak(
last, next,
std::memory_order_release,
std::memory_order_relaxed);
}
}
}
// Advance tail from the node we linked after to the new node
tail.compare_exchange_weak(
last, new_node,
std::memory_order_release,
std::memory_order_relaxed);
}
bool dequeue(T& result) {
while (true) {
Node* first = head.load(std::memory_order_acquire);
Node* last = tail.load(std::memory_order_acquire);
Node* next = first->next.load(std::memory_order_acquire);
// Verify consistency
if (first == head.load(std::memory_order_acquire)) {
if (first == last) {
if (next == nullptr) {
return false; // Queue is empty
}
// Help advance tail
tail.compare_exchange_weak(
last, next,
std::memory_order_release,
std::memory_order_relaxed);
} else {
if (next == nullptr) {
continue; // Inconsistent state, retry
}
// Read data before potential dequeue
T* data = next->data.load(std::memory_order_acquire);
if (data == nullptr) {
continue; // Another thread is dequeuing
}
// Try to advance head
if (head.compare_exchange_weak(
first, next,
std::memory_order_release,
std::memory_order_relaxed)) {
result = *data;
delete data;
// Caution: deleting immediately is only safe if no other thread still holds 'first'; production code defers reclamation (see the memory management discussion below)
delete first;
return true;
}
}
}
}
}
bool empty() const {
Node* first = head.load(std::memory_order_acquire);
Node* last = tail.load(std::memory_order_acquire);
return (first == last) && (first->next.load(std::memory_order_acquire) == nullptr);
}
};
LockFreeQueue<std::string> queue;
// Producer threads
std::vector<std::thread> producers;
for (int i = 0; i < 4; ++i) {
producers.emplace_back([&queue, i]() {
for (int j = 0; j < 1000; ++j) {
queue.enqueue("Producer " + std::to_string(i) + " Item " + std::to_string(j));
}
});
}
// Consumer threads
std::vector<std::thread> consumers;
std::atomic<int> total_consumed{0};
for (int i = 0; i < 2; ++i) {
consumers.emplace_back([&queue, &total_consumed]() {
std::string item;
while (total_consumed.load() < 4000) { // 4 producers × 1000 items
if (queue.dequeue(item)) {
total_consumed.fetch_add(1, std::memory_order_relaxed);
std::cout << "Consumed: " << item << "\n";
} else {
std::this_thread::yield();
}
}
});
}
// Wait for completion
for (auto& producer : producers) producer.join();
for (auto& consumer : consumers) consumer.join();
Lock-free programming is powerful but treacherous. Here are the essential guidelines for writing correct and efficient atomic code.
Always begin with the default (sequential consistency) and optimize only when profiling shows it’s necessary:
// Start with this (safest)
std::atomic<bool> flag{false};
flag.store(true); // Uses seq_cst by default
// Optimize to this only if needed
flag.store(true, std::memory_order_release);
Why: Premature optimization with weak ordering often introduces subtle bugs that are nearly impossible to debug.
For most synchronization needs, acquire-release is the sweet spot:
// The workhorse pattern
std::atomic<bool> ready{false};
Data shared_data;
// Producer
prepare_data(shared_data);
ready.store(true, std::memory_order_release);
// Consumer
if (ready.load(std::memory_order_acquire)) {
process_data(shared_data); // Safe to access
}
Always consider whether your algorithm is vulnerable to ABA:
// Vulnerable to ABA
Node* old_head = head.load();
// ... time passes, head goes A→B→A ...
if (head.compare_exchange_strong(old_head, new_node)) {
// SUCCESS, but old_head might be recycled memory!
}
// Solution: Add version counters
struct VersionedPtr {
Node* ptr;
uint64_t version;
};
Lock-free data structures have complex memory management requirements:
// WRONG: Immediate deletion
Node* old_head = head.load();
head.store(old_head->next);
delete old_head; // Another thread might still be using this!
// BETTER: Use hazard pointers or reference counting
class HazardPointer {
thread_local static std::array<std::atomic<void*>, 8> hazards;
public:
static void protect(void* ptr, int slot) {
hazards[slot].store(ptr, std::memory_order_release);
}
static void clear(int slot) {
hazards[slot].store(nullptr, std::memory_order_release);
}
static bool is_hazardous(void* ptr) {
// Check if any thread is protecting this pointer
// Implementation details omitted for brevity
return false; // Placeholder so the sketch compiles
}
};
// Performance ranking (fastest to slowest)
1. Relaxed atomics // ~1-2x regular memory access
2. Acquire/Release // ~2-5x regular memory access
3. Sequential consistency // ~5-10x regular memory access
4. Full memory barrier // ~10-20x regular memory access
5. Mutex lock/unlock // ~50-100x regular memory access
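These ratios are rough guidelines; they vary widely by CPU, compiler, and contention. To get numbers for your own machine, a minimal single-threaded micro-benchmark sketch (hypothetical helper code, measuring raw instruction cost rather than contention) could look like this:
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>

// Times a callable and returns elapsed nanoseconds.
template <typename F>
long long time_ns(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

int main() {
    constexpr int N = 10'000'000;
    int plain = 0;
    std::atomic<int> counter{0};
    std::mutex m;

    long long t_plain   = time_ns([&] { for (int i = 0; i < N; ++i) ++plain; });
    long long t_relaxed = time_ns([&] {
        for (int i = 0; i < N; ++i) counter.fetch_add(1, std::memory_order_relaxed);
    });
    long long t_seq_cst = time_ns([&] {
        for (int i = 0; i < N; ++i) counter.fetch_add(1, std::memory_order_seq_cst);
    });
    long long t_mutex   = time_ns([&] {
        for (int i = 0; i < N; ++i) { std::lock_guard<std::mutex> lock(m); ++plain; }
    });

    // Print the counters so the compiler cannot discard the loops entirely.
    std::printf("plain=%d atomic=%d\n", plain, counter.load());
    std::printf("plain: %lld ns, relaxed: %lld ns, seq_cst: %lld ns, mutex: %lld ns\n",
                t_plain, t_relaxed, t_seq_cst, t_mutex);
}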
Lock-free doesn’t mean wait-free. Heavy contention can cause performance collapse:
// High contention scenario
std::atomic<int> counter{0};
// 16 threads all doing this simultaneously
for (int i = 0; i < 1000000; ++i) {
counter.fetch_add(1); // Cache line bouncing!
}
Cache line bouncing in action:
Thread 1 (Core 1) Thread 2 (Core 2) Thread 3 (Core 3)
================= ================= =================
counter++ ───┐ │ │
│ │ │
Cache miss │ ◄─────┼────────────────────┤
Invalidate others│ │ │
└────────►│ │
counter++ │
Cache miss │
Invalidate others ─────┼──────┐
│ │
│ ◄────┘
│ counter++
│ Cache miss
│ And so on...
Strategy 1: Thread-Local Accumulation
// Instead of a single shared counter...
std::atomic<int> global_counter{0};
// ...give each thread its own counter slot and combine them only when reading
std::vector<std::atomic<int>> per_thread_counters(std::thread::hardware_concurrency());
void increment() {
static thread_local int thread_id = get_thread_id(); // get_thread_id(): maps this thread to an index 0..N-1 (helper not shown)
per_thread_counters[thread_id].fetch_add(1, std::memory_order_relaxed);
}
int get_total() {
int sum = 0;
for (const auto& counter : per_thread_counters) {
sum += counter.load(std::memory_order_relaxed);
}
return sum;
}
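One refinement to the sketch above: if the per-thread counters sit next to each other in the vector, several slots share a cache line and the bouncing comes back as false sharing. Padding each slot to a full cache line keeps them independent; a minimal sketch, assuming a 64-byte line size:
#include <atomic>
#include <thread>
#include <vector>

// One counter per cache line (64 bytes is common; where available,
// std::hardware_destructive_interference_size is the portable constant).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

std::vector<PaddedCounter> padded_counters(std::thread::hardware_concurrency());

long get_padded_total() {
    long sum = 0;
    for (const auto& slot : padded_counters) {
        sum += slot.value.load(std::memory_order_relaxed);
    }
    return sum;
}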
Strategy 2: Backoff Strategies
class ExponentialBackoff {
int delay = 1;
static constexpr int MAX_DELAY = 1024;
public:
void wait() {
for (int i = 0; i < delay; ++i) {
std::this_thread::yield();
}
delay = std::min(delay * 2, MAX_DELAY);
}
void reset() { delay = 1; }
};
// Use in CAS loops
void optimized_increment(std::atomic<int>& counter) {
ExponentialBackoff backoff;
int current = counter.load(std::memory_order_relaxed);
while (!counter.compare_exchange_weak(
current, current + 1,
std::memory_order_relaxed)) {
backoff.wait();
}
}
Lock-free code is notoriously difficult to test because race conditions are timing-dependent.
template<typename TestFunc>
class StressTester {
TestFunc test_func;
std::atomic<bool> start_flag{false};
std::atomic<bool> stop_flag{false};
std::atomic<int> error_count{0};
public:
StressTester(TestFunc func) : test_func(func) {}
void run_test(int num_threads, std::chrono::seconds duration) {
std::vector<std::thread> threads;
// Create threads but don't start them yet
for (int i = 0; i < num_threads; ++i) {
threads.emplace_back([this, i]() {
// Wait for synchronized start
while (!start_flag.load(std::memory_order_acquire)) {
std::this_thread::yield();
}
// Run test until stopped
while (!stop_flag.load(std::memory_order_acquire)) {
try {
test_func();
} catch (...) {
error_count.fetch_add(1, std::memory_order_relaxed);
}
}
});
}
// Start all threads simultaneously
start_flag.store(true, std::memory_order_release);
// Let them run for specified duration
std::this_thread::sleep_for(duration);
// Stop all threads
stop_flag.store(true, std::memory_order_release);
// Wait for completion
for (auto& thread : threads) {
thread.join();
}
std::cout << "Errors detected: " << error_count.load() << std::endl;
}
};
// Usage
LockFreeStack<int> stack;
StressTester tester([&stack]() {
stack.push(42);
int value;
stack.pop(value);
});
tester.run_test(8, std::chrono::seconds(10));
class LockFreeCounter {
std::atomic<int> value{0};
std::atomic<int> increments{0};
std::atomic<int> decrements{0};
public:
void increment() {
value.fetch_add(1, std::memory_order_relaxed);
increments.fetch_add(1, std::memory_order_relaxed);
}
void decrement() {
value.fetch_sub(1, std::memory_order_relaxed);
decrements.fetch_add(1, std::memory_order_relaxed);
}
// Invariant: value should equal (increments - decrements)
bool check_invariant() const {
int v = value.load(std::memory_order_acquire);
int inc = increments.load(std::memory_order_acquire);
int dec = decrements.load(std::memory_order_acquire);
return v == (inc - dec);
}
};
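A sketch of how such an invariant check might be driven (thread and iteration counts are arbitrary): while threads are still running, the three loads inside check_invariant() happen at slightly different moments, so a transient mismatch is not by itself a bug; check after joining, when the counter is quiescent.
// Assumes the LockFreeCounter class above plus <cassert>, <thread>, <vector>.
LockFreeCounter counter;

std::vector<std::thread> workers;
for (int t = 0; t < 8; ++t) {
    workers.emplace_back([&counter] {
        for (int i = 0; i < 100000; ++i) {
            counter.increment();
            counter.decrement();
        }
    });
}
for (auto& w : workers) w.join();

// All threads have finished, so the invariant must hold exactly.
assert(counter.check_invariant());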
Atomic programming requires a fundamental shift in thinking: instead of reasoning about locks and critical sections, you reason about individual operations, which writes are visible to which threads, and the happens-before relationships that connect them.
Remember: Correct first, fast second. A slightly slower program that works is infinitely better than a fast program that occasionally produces wrong results.
The power of atomics lies not in eliminating all synchronization overhead, but in providing fine-grained control over exactly what synchronization guarantees you need, when you need them.