• Categorica Team

Trust your compiler

Modern C++ has changed... does the received wisdom hold?

Trust your compiler

C++ programmers absorb a lot of performance wisdom by osmosis: fast inverse square root1, XOR swap2, hand-unrolled loops3, Duff’s device4, exceptions are slow, virtual calls are slow, division is slow, lookup tables often beat maths. Even if once true, hardware, software, and compiler advances have drastically changed the landscape to the point where many are not now.

John Carmack’s DEC Alpha and Pentium Pro bear little resemblance to a modern Zen 5 x86_64 core - a piece of silicon who’s branch predictor alone likely has more transistors than all the workstations in the iD offices of the time. On top of that, LLVM’s Clang and GNU’s GCC are advanced enough to write optimal code from the naive input. For instance, both compilers have optimisation passes that essentially check if a code segment is actually just popcount in disguise5. In fact, writing “clever” code may be “worse” code; it could obscure intent from the optimiser, potentially costing you vectorisation, inlining, and target-specific lowering.

In this article we walk through a few old tricks, compare them with the obvious version, and see how they behave. The numbers below were produced on:

  • Ubuntu 24.04 LTS (With updated Linux kernel 6.18.1)
  • AMD Ryzen 9 9950X, 16 cores / 32 threads
  • 128 GB of DDR5/3600 memory
  • Clang 21.1.1, -O3 -ffast-math -mtune=native

The post comes with benchmark code. All benchmarks use Google Benchmark. 6

Content

  1. Part 1: New dogs, old tricks
  1. Part 2: The library
  1. How to use the benchmarks
  2. Closing remarks

Part 1: New dogs, old tricks

Fast inverse square root

The Q_rsqrt from Quake III7, reproduced here:

float Q_rsqrt( float number )
{
	long i;
	float x2, y;
	const float threehalfs = 1.5F;

	x2 = number * 0.5F;
	y  = number;
	i  = * ( long * ) &y;
	i  = 0x5f3759df - ( i >> 1 );
	y  = * ( float * ) &i;
	y  = y * ( threehalfs - ( x2 * y * y ) );
  return y;
}

There are numerous articles explaining how this works1, which we will skip here. In short, late 90s CPUs floating point units were nothing like today’s. During the development of Quake 3 Arena, iD’s developers discovered that a lot of time was being spent determining vertex normals used for their new lighting model. Much of this time was being spent in one operation: the inverse square root. Their approach was to forgo accuracy, and take a “estimate and improve” approach for this one operation. It could beat the naive 1.0f / sqrtf(x) by a wide margin: FSQRT and FDIV could take over 500 cycles in the 8087 FPU, while iD’s method used cheap integer operations plus one Newton iteration 8.

Modern CPUs

Intel introduced rsqrtss and rsqrtps to the x86 family of architectures with SSE 9; AVX later added the vrsqrtss and vrsqrtps forms for wider SIMD widths. These instructions compute an approximate reciprocal square root directly, with a documented error bound and substantially lower latency than x87 fsqrt10. ARMv8/AArch64 has comparable reciprocal-square-root estimate instructions11:

ISAScalarPackedWidth
x86-64 SSErsqrtssrsqrtps4 x f32
x86-64 AVXvrsqrtssvrsqrtps8 x f32
ARMv8 NEONfrsqrte (scalar)frsqrte4 x f32

Why this matters?

Writing this code in modern C++ might look something like:

constexpr float Q_rsqrt(float number) noexcept {
  static_assert(sizeof(float) == sizeof(std::uint32_t));
  auto i = std::bit_cast<std::uint32_t>(number);
  auto magic = 0x5f3759dfu - (i >> 1);
  auto y = std::bit_cast<float>(magic);
  return y * (1.5f - (number * 0.5f * y * y));
}

Compare that with the naive version:

constexpr float naive_rsqrt(float x) noexcept {
    return 1.0f / std::sqrt(x);
}

Now compare the relevant assembly produced with -std=c++23 -O3 -ffast-math -march=znver4 on Compiler Explorer using Clang 21.1.0. The benchmark run used -march=native; the assembly excerpt uses znver4 so the target is explicit. The -ffast-math part matters: it lets the compiler make aggressive, potentially lossy floating-point assumptions1213, so this example assumes positive, finite inputs and does not preserve every strict IEEE floating-point edge case.

Q_rsqrt(float):
        movd    eax, xmm0
        sar     eax
        mov     ecx, 1597463007
        sub     ecx, eax
        mulss   xmm0, dword ptr [rip + .LCPI0_0]
        movd    xmm1, ecx
        movdqa  xmm2, xmm1
        mulss   xmm2, xmm1
        mulss   xmm0, xmm2
        addss   xmm0, dword ptr [rip + .LCPI0_1]
        mulss   xmm0, xmm1
        ret

naive_rsqrt(float):
        vrsqrtss        xmm1, xmm0, xmm0
        vmulss  xmm0, xmm0, xmm1
        vfmadd213ss     xmm0, xmm1, dword ptr [rip + .LCPI1_0]
        vmulss  xmm1, xmm1, dword ptr [rip + .LCPI1_1]
        vmulss  xmm0, xmm1, xmm0
        ret

Benchmarking

This makes sense in theory, but is it actually true now?

BenchmarkTime (scalar)Time (n=1024)Time (n=65536)
Q_sqrt380 ns24.5 ns1865 ns
naive_rsqrt373 ns25.0 ns2161 ns

The scalar case is effectively a draw, with the obvious version slightly ahead. In the array kernels, the Quake version is somewhat faster in this particular run, but not in a way that justifies the trick as a default: the source is less clear, has a narrower useful domain, and is still only competitive because the compiler and CPU are already doing the hard work. Additionally, the naive way provides you with explicit bounds on error.

Take-away

Let the compiler do the work. You get comparable performance, better-defined code, and an implementation that says what it means.

Popcount and bit-twiddling

C++20 gave us <bit>: std::popcount, std::countl_zero, std::countr_zero, std::bit_width, std::has_single_bit, and friends14. On x86, when the relevant target features are enabled, many of these map to a single instruction:

Functionx86-64 (BMI/POPCNT)ARMv8
std::popcountpopcntcnt + addv
std::countl_zerolzcntclz
std::countr_zerotzcntrbit + clz

Compare:

constexpr int modern(std::uint64_t x) noexcept {
  return std::popcount(x);
}

constexpr int kernighan(std::uint64_t x) noexcept {
  int c = 0;
  while (x != 0U) {
    x &= x - 1U;
    ++c;
  }
  return c;
}

constexpr int swar(std::uint64_t x) noexcept {
  x = x - ((x >> 1) & 0x5555'5555'5555'5555ULL);
  x = (x & 0x3333'3333'3333'3333ULL) + ((x >> 2) & 0x3333'3333'3333'3333ULL);
  x = (x + (x >> 4)) & 0x0f0f'0f0f'0f0f'0f0fULL;
  return static_cast<int>((x * 0x0101'0101'0101'0101ULL) >> 56);
}

With -march=native and popcnt available, pop_modern can be one instruction. More interestingly, modern compilers may recognise the old tricks as popcount too: in this benchmark configuration, the Kernighan loop and SWAR version also collapse to popcnt. Without the target instruction, the distinction comes back: Kernighan is a data-dependent loop, while SWAR is a short sequence of shifts and masks. The standard spelling makes the intent explicit either way.

Before C++20 you could use __builtin_popcountll with GCC/Clang or __popcnt64 with MSVC. std::popcount is the portable spelling15, and on targets without a hardware popcnt, the implementation can still provide an efficient software fallback.

Benchmarks

BenchmarkTime
BM_popcount_modern94.5 ns
BM_popcount_kernighan94.5 ns
BM_popcount_swar94.5 ns

Take-away

Use <bit>, unless you’re targeting a freestanding embedded environment that doesn’t ship it.

Numerical recipes and row pointers

Part of the inherited wisdom is that row-pointer access to a matrix is faster than index calculation. That was easier to believe when multiplication was more expensive, address-generation hardware was less capable, and compilers had less room to simplify indexing.

Numerical Recipes in C — the well-known red book by Press et al. — also helped popularise this style. Its dmatrix/nrutil helpers use row indirection so algorithms can be written with [i][j] addressing, preserving mathematical notation and making transcriptions from Fortran more direct. That can be a reasonable interface choice; it is not, by itself, a performance win.

For the interested reader, the benchmark code contains three implementations of a matrix class:

  • flat_matrix (access is done via multiplication, contiguous data)
  • nr_matrix (access is done via row-pointer dereferencing, contiguous data)
  • scattered_matrix (the same as nr_matrix, except the data is not stored contiguously: some junk is allocated between the rows)

nr_matrix and scattered_matrix define operator[] returning a float *. Both of them store an array of pointers - pointing to the row of a matrix, the operator[] then returns the indexed row pointer.

Benchmarking

We run two kernels over these matrices: a row-major sum and a column-major sum. The main result is not “multiplication is free” or “row pointers are always bad”; it is that layout and traversal order dominate. The contiguous flat and Numerical Recipes-style layouts are close to one another. The deliberately scattered layout falls apart, especially once the working set is large.

BenchmarkTime (400 x 400)Time (4000 x 4000)
flat_matrix row-major kernel6062 ns987614 ns
nr_matrix row-major kernel6065 ns987632 ns
scattered_matrix row-major kernel6171 ns2323394 ns
flat_matrix column-major kernel33148 ns11258036 ns
nr_matrix column-major kernel36418 ns11450827 ns
scattered_matrix column-major kernel39054 ns14800609 ns

Take-away

Avoid indirection just because it looks clever or historically familiar. Prefer contiguous storage and traversal patterns that match it. If possible, avoid writing your own linear algebra kernels at all: Eigen, BLAS/LAPACK, GLM, or a vendor-tuned library will usually be a better starting point.

const& everywhere vs forwarding intent

Another habit many of us inherited is: “always pass by const&; copies are expensive”.

That is still a good rule when the function merely observes its argument:

double norm(std::vector<double> const& xs);

But it is the wrong abstraction for wrapper code. If a function’s job is to pass arguments on to something else, const& destroys information. An rvalue becomes a const lvalue, move construction is disabled. overload resolution changes, and the caller’s intent is lost.

Consider the old-school version of an emplace-style wrapper:

template <class T>
class bag {
public:
  template <class... Args>
  T& emplace(Args const&... args) {
    return xs_.emplace_back(args...);
  }

private:
  std::vector<T> xs_;
};

This looks harmless, but it is subtly pessimising. Even if the caller writes:

b.emplace(std::string(1024, 'x'));

inside emplace_bad, the argument is now a std::string const&. The vector cannot move from it. It has to copy.

The modern version preserves the caller’s value category:

template <class T>
class bag {
public:
  template <class... Args>
  T& emplace(Args&&... args) {
    return xs_.emplace_back(std::forward<Args>(args)...);
  }

private:
  std::vector<T> xs_;
};

Now lvalues stay lvalues, rvalues stay rvalues, and overload resolution sees what the caller actually wrote.

The same distinction shows up throughout modern C++ APIs:

// observes: const& is fine
void draw(widget const& w);

// consumes/stores: pass by value can be excellent
void set_name(std::string name) {
  name_ = std::move(name);
}

// forwards construction: forwarding references are the right tool
template <class... Args>
auto make_widget(Args&&... args) {
  return widget(std::forward<Args>(args)...);
}

Benchmarks

BenchmarkTime (n=64)Time (n=1024)Time (n=4096)
Using const &12.2 ns18.6 ns72.1 ns
Using perfect forwarding6.55 ns8.74 ns31.0 ns

We can see that perfect forwarding beats const ref arguments in this example. This is because we can elide a copy, when emplacing the string on the internal vector.

Take-away

This is not to say “never use const&”. It is more precise:

  • use T const& when you observe an existing object;
  • use T by value when you need your own copy and move from it internally;
  • use T&& for explicit sinks;
  • use forwarding references for generic forwarding wrappers.

Old C++ had fewer ways to say what you meant, so const& became a universal reflex. Modern C++ gives us better vocabulary. Use the type system to preserve intent.

Part 2: The library

The modern C++ standard library has improved substantially. Most of us have met engineers who work somewhere the STL is discouraged, wrapped, or forbidden outright. It may be time to reconsider: standard algorithms and containers are more readable, better tested, and increasingly better optimised.

Ranges and algorithms

Consider a simple task:

Given raw voltage samples, calibrate to millivolts, compute residual from a target, square it, apply a weight, and sum the cost

inline double raw_loop(std::span<double const> xs) noexcept {
  double sum = 0.0;

  for (double volts : xs) {
    auto mv  = calibrated_mv(volts);
    auto err = residual(mv);
    sum += weighted_square(err);
  }

  return sum;
}

inline double algorithm_call(std::span<double const> xs) noexcept {
  return std::accumulate(
    xs.begin(),
    xs.end(),
    0.0,
    [](double acc, double volts) {
      auto mv  = calibrated_mv(volts);
      auto err = residual(mv);
      return weighted_square(err) + acc;
    });
}

inline double ranges_pipeline(std::span<double const> xs) noexcept {
  auto costs = xs
    | std::views::transform(calibrated_mv)
    | std::views::transform(residual)
    | std::views::transform(weighted_square);

  return std::ranges::fold_left(costs, 0.0, std::plus<double>{});
}

This is deliberately small. It is also deliberately a good case for the optimiser: the transforms are simple, visible, and inlinable; std::ranges::fold_left is a left fold over the view pipeline16 - equivalent to f(f(f(f(init, x1), x2), ...), xn). In that setting, the abstraction largely disappears.

Benchmarks

BenchmarkTime (n=1024)Time (n=65536)
Raw loop36.0 ns2400 ns
Using <algorithm>36.0 ns2395 ns
Using <ranges>37.9 ns2417 ns

The benchmarks tell the same story: all three versions are essentially equivalent. The ranges pipeline is a little behind in the small case and within noise at the larger size. That is the point: the more expressive version is not paying a dramatic abstraction tax here.

Take-away

Use modern library algorithms. They may not always outperform hand-written loops, but they often match them while making the program easier for the compiler to reason about (and to my eyes, making them easier to read).

Note: I have observed poor performance when using std::views::filter with Clang 21.1.0 and GCC 13.2-era libstdc++ (GLIBCXX_3.4.32). There is nothing conceptually wrong with filter; this is the kind of implementation detail that standard libraries continue to improve.

Exceptions vs std::expected vs error codes

C++23’s std::expected<T, E> gives us a sum type for fallible operations17. The received wisdom says: “exceptions are slow, use error codes”. Reality is more nuanced.

Let’s look at three implementations:


// Case 1: Use exceptions as the error handling mechanism
int parse_throws(std::string_view s) {
  int v{};
  auto const* end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, v);
  if (ec != std::errc{} || ptr != end) {
      throw std::runtime_error("bad parse");
  }
  return v;
}

// Case 2: Return std::expected encoding the success or failure
std::expected<int, std::errc> parse_expected(std::string_view s) noexcept {
  int v{};
  auto const* end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, v);
  if (ec != std::errc{}) return std::unexpected(ec);
  if (ptr != end) return std::unexpected(std::errc::invalid_argument);
  return v;
}

// Case 3: Return error code in an out parameter
int parse_errc(std::string_view s, std::errc& err) noexcept {
  int v{};
  auto const* end = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(s.data(), end, v);
  if (ec != std::errc{}) {
    err = ec;
  }
  else if (ptr != end) {
    err = std::errc::invalid_argument;
  } else {
    err = std::errc{};
  }
  return (err != std::errc{}) ? 0 : v;
}

Analysis

  • Exceptions: Modern Itanium ABI-style exception handling is often described as zero-cost when nothing is thrown18. The unwind tables live out of line, so the common path does not pay for stack unwinding. A successful parse_throws call inlined into a hot loop can look very similar to parse_errc, depending on the caller, optimiser, and exception-handling regions. However, once a call is inside an exception-handling region, or crosses an optimisation boundary the compiler cannot see through, the emitted code can grow landing pads and unwind metadata. That can inhibit inlining and other optimisations.

  • std::expected:

    • The return value can be thought of as a {value, has_value_flag} pair. Crucially, std::expected is noexcept-friendly, and the inliner has an easier path across the boundary. Branching on has_value() also plays nicely with branch prediction and speculative execution.
  • Error codes:

    • Similar to std::expected, but uglier. It also asks more of the optimiser because the out-parameter participates in aliasing analysis. More importantly, the code has no semantic constraint forcing the caller to inspect the result.

Cold path matters

When errors are common — say, parsing HTTP requests — exceptions become problematic: each throw unwinds the stack. std::expected remains ordinary control flow: a tagged-union check and a branch. In all three parsing functions, the source checks both ec and ptr == end; std::from_chars is allowed to stop after a valid prefix, so checking only ec would accept strings such as "123abc"19.

Benchmarks

The benchmarks run the above parse_ functions. For each function, 3 sets of inputs are given: Inputs with 0%, 5%, and 30% out of range/invalid content.

BenchmarkTime (0%)Time (5%)Time (30%)
Exceptions3770 ns26971 ns154762 ns
std::expected3553 ns3403 ns2338 ns
std::errc out-parameter3608 ns3430 ns2347 ns

In this benchmark, the non-throwing error paths do little useful work on failures, which explains why the expected and errc cases become faster as the failure rate rises. Treat that as a benchmark artefact. The exception case shows the important point: throwing is a cold-path mechanism, and it is expensive when it stops being cold. Indeed, both std::expected and std::errc have so little overhead that the effect of having less input to parse (due to to early-exiting) is actually apparent.

Take-away

Even at a 5% exception rate, runtime balloons by roughly an order of magnitude:

  • use std::expected for expected failures: parsing, lookup, conversion, validation
  • use exceptions for exceptional failures: out-of-memory, unrecoverable configuration errors at start-up, and similar

Modern C++ now has a first-class type for ordinary fallibility, which is clearly capable of outperforming exceptions for even low failure-rate scenarios.

Virtual vs static polymorphism

There are many reasons why virtual dispatch may incur a performance penalty. Whereas a plain function call may be a couple of instructions (or none, if it has been inlined), we may need to do some or all of the following in the virtual dispatch case:

  1. Load the vptr from the object (1 load, usually L1-resident)
  2. Load the function pointer from the vtable (1 load, usually L1-resident)
  3. Indirect call (depending on branch prediction 1-20+ cycles)

Each of these steps rolls the dice on a cache miss (or worse yet, a TLB miss). Additionally, unless there has been an opportunity for de-virtualisation, no inlining can occur. This can also preclude a host of other optimisations.

An alternative approach would be the curiously recurring template pattern (CRTP), which gives you a form of polymorphism - often called static polymorphism - but it does not work in every context, e.g., a std::vector of base-types. We can however use std::variant - C++17’s sum type - to handle dispatch in these cases. std::visit over std::variant<A, B, C> is lowered to a switch over the active alternative. Each branch is visible to the optimiser, indeed, dispatch becomes a regular function call, and hence all normal optimisations can apply.

Example

We cut down the implementation of the individual structs (cf. benchmark source code):

struct shape {
  virtual ~shape() = default;
  virtual double area() const = 0;
};

struct circle final : shape { ... };

struct square final : shape { ... };

struct circle_v { ... };
struct square_v { ... };

struct shape_v : std::variant<circle_v, square_v> {
  using variant::variant;

  double area() const {
    return std::visit([](auto &&x) {return x.area();}, *this);
  }
};

Benchmarks

The benchmarks instantiate a std::vector of std::unique_ptr<shape> and a std::vector<shape_v> respectively. Each benchmark then calls ->area() or .area() on every element.

BenchmarkTime (n=1024)Time (n=1000000)
vector of pointers3824 ns4330980 ns
vector of variants142 ns149928 ns

This benchmark does not isolate the dispatch cost. Reading it as “virtual call versus std::visit” would be misleading. What it measures is closer to a very common production pattern: std::vector<std::unique_ptr<base>> in a hot loop, compared with a value-based representation of the same closed set of types.

The virtual version stores owning pointers and chases them through the heap. The variant version stores values contiguously, giving the cache and prefetcher a much easier job. The benchmark therefore measures the whole shape of the design: allocation strategy, layout, locality, inlining opportunities, and dispatch. In this case, layout is probably doing more work than the dispatch mechanism itself.

That is not a flaw in the example; it is the point. “Virtual calls are slow” is rarely encountered as a clean, isolated dispatch problem. In everyday C++, it often appears as pointer-heavy ownership, heap allocation, and an opaque call boundary. If the set is closed, replacing std::vector<std::unique_ptr<base>> with std::vector<std::variant<...>> can improve locality immediately, without reaching for custom allocators or more elaborate object-pool schemes.

Take-away

The dispatch mechanism is almost noise compared with (a) data layout and (b) whether the optimiser can see through the call. Virtual dispatch is fine for coarse-grained polymorphism across translation-unit boundaries — for example, a plugin loaded via dlopen. For fine-grained hot-loop polymorphism over a closed set of similarly sized types, std::variant is often the better modern C++ choice. For open extension points, ABI boundaries, and plugin systems, virtual dispatch remains the right tool.

How to use the benchmarks

If you want to run these tests yourself, you can reproduce them (albeit with different results) by running:

git clone https://github.com/Categorica/blog_artefacts.git
cd 20260626_trust_your_compiler
cmake -B build -DCMAKE_BUILD_TYPE=Release 
  -DCMAKE_CXX_FLAGS="-Wno-nan-infinity-disabled -O3 -ffast-math -funsafe-math-optimizations -march=native -mtune=native"
cmake --build build -j

Note: the nan-infinity-disabled warning comes from Google Benchmark’s JSON reporter when building with fast-math flags.

When running tests, ideally CPU frequency scaling should be disabled for consistent results (although this will incur a significant performance penalty). Additionally, using something like taskset to affine the benchmark process to a specific core - ideally one isolated from regular kernel scheduling - will give the most repeatable results.

Closing remarks

There is a pattern across all of these examples: the optimiser rewards intent and punishes obfuscation. With permissive floating-point flags such as -ffast-math, 1.0f / std::sqrt(x) can become a reciprocal-square-root estimate plus refinement. std::popcount can become popcnt. A chain of simple range transforms can become a tight vector loop. The more directly the code says what it means, the more room the compiler and library have to help.

The corollary is that the first optimisation is often to delete clever code and write the obvious thing. Profile, then optimise the portions that matter. Consider tools like std::expected, std::variant, <bit>, std::bit_cast, ranges, and the standard algorithms. Compilers have improved significantly in the last 20 years, and with modern C++ and the standard library it is easier to express desired intent directly.

If you currently code in C++23 (pre-build of C++26) maybe none of this is a surprise to you. If you are returning to C++ after years away, or are used to older versions, it is surprisingly easy to obtain good performance in modern C++ with straight-forward code.

In short: trust your compiler!

Further reading

Footnotes


  1. https://betterexplained.com/articles/understanding-quakes-fast-inverse-square-root/
  2. https://en.wikipedia.org/wiki/XOR_swap_algorithm
  3. https://en.wikipedia.org/wiki/Loop_unrolling
  4. https://en.wikipedia.org/wiki/Duff%27s_device
  5. https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
  6. Run them on your own hardware: the sample numbers here are indicative, not definitive. Unless otherwise stated, the tables below report Google Benchmark CPU time from one run on this machine. Frequency scaling was enabled on the machine, which introduces some variance to these numbers.
  7. Quake III Arena q_math.c: https://github.com/id-Software/Quake-III-Arena/blob/master/code/game/q_math.c
  8. Historical x86 timing tables: https://kluge.in-chemnitz.de/docs/notes/asm_timing.php
  9. RSQRTSS/VRSQRTSS instruction reference: https://www.felixcloutier.com/x86/rsqrtss
  10. Agner Fog’s instruction tables: https://www.agner.org/optimize/instruction_tables.pdf
  11. Arm A64 FRSQRTE: https://developer.arm.com/documentation/ddi0487/latest
  12. Clang user manual, fast-math mode: https://clang.llvm.org/docs/UsersManual.html
  13. GCC optimisation options, -ffast-math: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
  14. C++ <bit> header: https://en.cppreference.com/w/cpp/header/bit
  15. std::popcount: https://en.cppreference.com/w/cpp/numeric/popcount
  16. std::ranges::fold_left: https://en.cppreference.com/w/cpp/algorithm/ranges/fold_left
  17. std::expected: https://en.cppreference.com/w/cpp/utility/expected
  18. LLVM exception handling documentation: https://llvm.org/docs/ExceptionHandling.html
  19. std::from_chars: https://en.cppreference.com/w/cpp/utility/from_chars