
Conversation

@lemire (Member) commented Dec 23, 2025

The IP benchmark was a bit complicated, so I simplified the code. I have also added a control that shows the speed without number parsing (just_seek_ip_end).

just_seek_ip_end (no parse)              :   2.64 GB/s  164.8 Ma/s   6.07 ns/d 
parse_ip_std_fromchars                   :   0.80 GB/s   49.8 Ma/s  20.06 ns/d 
parse_ip_fastfloat                       :   0.71 GB/s   44.6 Ma/s  22.43 ns/d 
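For context, the control does no numeric conversion at all. A minimal sketch of what such a function might look like (an illustration, not the actual code in this PR):

// Hypothetical sketch of the "no parse" control: advance past the
// dotted-quad characters without converting any digits.
const char *just_seek_ip_end(const char *p, const char *end) {
  while (p != end && (*p == '.' || (*p >= '0' && *p <= '9'))) {
    ++p;
  }
  return p;
}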

@shikharish (Contributor) commented

Ran it on my machine (Apple M1, clang 21):

❯ sudo ./build/benchmarks/bench_ip
just_seek_ip_end (no parse)              :   0.58 GB/s   36.4 Ma/s  27.50 ns/d   1.96 GHz  53.80 c/d  111.31 i/d   3.36 c/b   6.96 i/b   2.07 i/c 
parse_ip_std_fromchars                   :   0.28 GB/s   17.2 Ma/s  58.10 ns/d   1.96 GHz  113.59 c/d  489.60 i/d   7.10 c/b  30.60 i/b   4.31 i/c 
parse_ip_fastfloat                       :   0.33 GB/s   20.4 Ma/s  49.06 ns/d   1.96 GHz  95.99 c/d  362.24 i/d   6.00 c/b  22.64 i/b   3.77 i/c 

I also wrote a simple_bench that just measures throughput:

./simple_bench
just_seek_ip_end (no parse)    :  0.61 GB/s   26.4 ns/ip
std::from_chars                :  0.33 GB/s   49.1 ns/ip
fast_float::from_chars         :  0.52 GB/s   30.8 ns/ip
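For reference, a throughput-only loop in this spirit can be written with std::chrono alone. A rough sketch (the actual simple_bench.cpp is not shown in this thread; the real benchmark parses all four octets of each address, and the names below are placeholders):

#include <charconv>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch: time many passes over the data, accumulating into
// a sink so the parsing work cannot be optimized away.
double ns_per_item(const std::vector<std::string> &items, size_t rounds) {
  uint64_t sink = 0;
  auto start = std::chrono::steady_clock::now();
  for (size_t r = 0; r < rounds; ++r) {
    for (const std::string &s : items) {
      uint8_t octet = 0;
      std::from_chars(s.data(), s.data() + s.size(), octet);
      sink += octet;
    }
  }
  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::cout << "sink=" << sink << "\n"; // printed, as simple_bench does
  return double(ns) / double(rounds * items.size());
}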

@shikharish (Contributor) commented

Actually, I inspected the assembly of the benchmark. It turns out the compiler was not able to inline some function calls. Compiling with -flto increases throughput:

❯ sudo ./build/benchmarks/bench_ip
just_seek_ip_end (no parse)              :   0.60 GB/s   37.8 Ma/s  26.46 ns/d   1.96 GHz  51.77 c/d  109.30 i/d   3.24 c/b   6.83 i/b   2.11 i/c 
parse_ip_std_fromchars                   :   0.34 GB/s   21.1 Ma/s  47.29 ns/d   1.96 GHz  92.52 c/d  402.25 i/d   5.78 c/b  25.14 i/b   4.35 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   28.5 Ma/s  35.07 ns/d   1.93 GHz  67.78 c/d  281.17 i/d   4.24 c/b  17.57 i/b   4.15 i/c 

@lemire (Member, Author) commented Dec 23, 2025

@shikharish The libraries are header-only (both counters and fast_float), so -flto should have no effect. There is just one source file, the benchmark itself.

@lemire (Member, Author) commented Dec 23, 2025

I have pushed a memcpy measurement. So if you run bench_ip, it will give you an estimate of the best memcpy speed on your system.

You can independently measure it with an entirely different program:

Try this C++ file: just save it, compile it with `-O3`, and run it.
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdlib>
#include <memory>

int main(int argc, char* argv[]) {
    const size_t element_count = 15000;
    const size_t element_size = 16;
    const size_t buffer_size = element_count * element_size;  // 240000 bytes

    unsigned int iterations = 1000;
    if (argc > 1) {
        iterations = std::atoi(argv[1]);
    }

    std::unique_ptr<char[]> src = std::make_unique<char[]>(buffer_size);
    std::unique_ptr<char[]> dst = std::make_unique<char[]>(buffer_size);

    // Initialize source buffer (arbitrary data)
    for (size_t i = 0; i < buffer_size; ++i) {
        src[i] = static_cast<char>(i);
    }

    // Warm-up: perform a few copies to fill caches
    for (unsigned int i = 0; i < 10; ++i) {
        std::memcpy(dst.get(), src.get(), buffer_size);
    }

    volatile char sink = 0;

    // Timed measurement
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < iterations; ++i) {
        std::memcpy(dst.get(), src.get(), buffer_size);
    }
    sink += dst[0];  // Prevent optimization

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "duration " << duration_ns << " ns\n";
    double duration_sec = duration_ns / 1e9;

    double bytes_copied = static_cast<double>(buffer_size) * iterations;
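    // Note: bytes per nanosecond is numerically equal to GB/s (1 byte/ns = 1e9 bytes/s).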
    double speed_gbps = bytes_copied / duration_ns;

    std::cout << "Buffer size: " << buffer_size << " bytes (" << buffer_size / 1024.0 << " KiB)\n";
    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "Time: " << duration_sec << " seconds\n";
    std::cout << "Memcpy speed: " << speed_gbps << " GB/s\n";

    return EXIT_SUCCESS;
}

Now, if the memcpy speed measured this naive way turns out to be faster than with the new bench_ip, that will indicate that my measurements indeed carry some unacceptable overhead (they are pessimistic). If so, report your results and we shall verify.

Inlining might be a concern, but I have ensured in the latest push that pretty much everything can get inlined.

@shikharish (Contributor) commented

The memcpy test runs at similar speeds. But there is still a big difference in the actual benchmarks when I compile with Apple LLVM (clang 14).
(Before, I was compiling with clang 21 installed from MacPorts, and using -flto was making a difference there. It makes no difference when I use the system c++, which is Apple clang 14.)

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  33.55 GB/s  2096.8 Mip/s   0.48 ns/ip   1.75 GHz   0.83 c/ip   3.03 i/ip   0.05 c/b   0.19 i/b   3.63 i/c 
just_seek_ip_end (no parse)              :   0.59 GB/s   36.7 Mip/s  27.28 ns/ip   1.96 GHz  53.38 c/ip  123.01 i/ip   3.34 c/b   7.69 i/b   2.30 i/c 
parse_ip_std_fromchars                   :   0.31 GB/s   19.7 Mip/s  50.87 ns/ip   1.96 GHz  99.51 c/ip  411.94 i/ip   6.22 c/b  25.75 i/b   4.14 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   29.0 Mip/s  34.51 ns/ip   1.96 GHz  67.54 c/ip  343.92 i/ip   4.22 c/b  21.49 i/b   5.09 i/c 
❯ c++ -std=c++17 -O3 -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.60 GB/s   26.6 ns/ip
std::from_chars                :  0.35 GB/s   46.2 ns/ip
fast_float::from_chars         :  0.82 GB/s   19.5 ns/ip
sink=2739889660

@shikharish (Contributor) commented

The issue is simply that the counters::bench code is complex and heavily templated, and the compiler is not able to optimize it properly.
I wrote a simpler version of the bench, nothing fancy:

template <class Function>
COUNTERS_FORCE_INLINE event_aggregate bench_simple(Function &&function,
                                                   size_t repeats = 100) {
  static thread_local event_collector collector;
  event_aggregate warm_aggregate{};

  // warmup
  for (size_t i = 0; i < 10; i++) {
    collector.start();
    function();
    warm_aggregate << collector.end();
  }

  // measurement
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeats; i++) {
    collector.start();
    function();
    aggregate << collector.end();
  }
  return aggregate;
}

and got much faster results, matching my simple_bench:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  37.20 GB/s  2325.1 Mip/s   0.43 ns/ip   1.98 GHz   0.85 c/ip   3.03 i/ip   0.05 c/b   0.19 i/b   3.56 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.60 ns/ip   1.96 GHz  52.03 c/ip  123.02 i/ip   3.25 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.35 GB/s   21.8 Mip/s  45.80 ns/ip   1.96 GHz  89.58 c/ip  278.99 i/ip   5.60 c/b  17.44 i/b   3.11 i/c 
parse_ip_fastfloat                       :   0.82 GB/s   51.2 Mip/s  19.51 ns/ip   1.96 GHz  38.21 c/ip  176.01 i/ip   2.39 c/b  11.00 i/b   4.61 i/c 
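For reference, a hypothetical call site for this helper (run_parse and input are placeholders; total_elapsed_ns() is the accessor that also appears in the bench_impl code quoted later in this thread):

// Hypothetical usage sketch, not code from the PR.
constexpr size_t repeats = 100;
event_aggregate agg = bench_simple([&]() { run_parse(input); }, repeats);
double ns_per_call = agg.total_elapsed_ns() / double(repeats);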

@shikharish (Contributor) commented

Here is the macro I defined:

#if defined(__clang__) || defined(__GNUC__)
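// flatten asks GCC/Clang to also inline the calls made inside the annotated function.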
#define COUNTERS_FORCE_INLINE __attribute__((flatten)) inline
#elif defined(_MSC_VER)
#define COUNTERS_FORCE_INLINE __forceinline
#else
#define COUNTERS_FORCE_INLINE inline
#endif

@shikharish (Contributor) commented Dec 23, 2025

Even better: I removed all the compile-time call_ntimes<M> calls in bench.cpp, replaced them with call_ntimes_runtime, and changed it to just a simple loop:

template <typename Func>
COUNTERS_FORCE_INLINE void call_ntimes_runtime(Func &&func, size_t M) {
  for (size_t i = 0; i < M; i++) {
    func();
  }
}

Same results as bench_simple.

It works better for this case; I don't know if it will for other cases. I can open a PR in the counters repo if you want.

@lemire (Member, Author) commented Dec 23, 2025

@shikharish Why are you using LLVM 14? The current Apple LLVM is 17.

The memcpy test runs at similar speeds. But there is still a big difference (...)

If the issue is that the benchmark framework introduces overhead and is therefore likely to produce pessimistic measurements, how do you account for the fact that the fastest possible function over this data size (a memcpy) does not run slower?

The issue is simply that the counters::bench code is complex and heavily templated, and the compiler is not able to optimize it properly.

I do not understand this statement. What do you mean by "optimize properly"?

Even better: I removed all the compile-time call_ntimes calls in bench.cpp, replaced them with call_ntimes_runtime, and changed it to just a simple loop:

For short functions, this will include non-trivial loop overhead.

and got much faster results

The benchmark code you are proposing is not robust for short functions.

In the case of the functions that we are benchmarking in bench_ip, the call_ntimes_runtime template will be called with the parameter 1.

Thus we call

// Compile-time specialized bench implementation for a fixed inner repeat M.
template <size_t M, class Function>
event_aggregate bench_impl(Function &&function, size_t min_repeat,
                           size_t min_time_ns, size_t max_repeat) {
  static thread_local event_collector collector;
  auto fn = std::forward<Function>(function);
  size_t N = min_repeat;
  if (N == 0)
    N = 1;

  // Warm-up
  event_aggregate warm_aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    warm_aggregate << allocate_count;
    if ((i + 1 == N) && (warm_aggregate.total_elapsed_ns() < min_time_ns) &&
        (N < max_repeat)) {
      N *= 10;
    }
  }
  // Measurement
  event_aggregate aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }

  aggregate /= M;
  aggregate.inner_count = M;
  return aggregate;
}

But with M==1. In that case, we have...

template <std::size_t M, typename Func> void call_ntimes(Func &&func) {
  if constexpr (M == 1) {
    func();
    return;
  }
...

If you compile with optimizations, this will surely get inlined.

It works better for this case; I don't know if it will for other cases. I can open a PR in the counters repo if you want.

Statements like "it works better" or "the compiler is not able to optimize properly" are not helpful to me. If you say that the compiler does a poor job (which is entirely possible), then I expect, for example, an analysis of the produced assembly.

The problem is that we end up with statements such as "-flto helps" when we have a single source file (and header-only libraries). I am not saying it is impossible, but it requires evidence because it is unexpected. And evidence is relatively easy to get: we can look at the assembly.

Now, these things can get tricky. In the two minutes I spent in lldb, I got the impression that in the benchmark here, neither std::from_chars nor fastfloat::from_chars is inlined. So we have loops with function calls. That is fair. Good.

Now, you are giving me quite different answers, but you are not explaining the difference. Why is fastfloat::from_chars so much faster relative to std::from_chars in your small benchmark than in this benchmark? A possibility, of course, is that in your benchmark fastfloat::from_chars is inlined whereas std::from_chars is not. Then, of course, it would not be a direct comparison. In such a case, the performance difference might be entirely due to the inlining, and so we would be measuring the benefits of inlining. That is fine, but we have to be clear about what we are measuring.

@lemire (Member, Author) commented Dec 23, 2025

Let me be clearer: we have not optimized the from_chars functions for uint8_t in any way, and I am getting the following results:

parse_ip_std_fromchars                   :   0.81 GB/s   50.7 Mip/s  19.74 ns/ip
parse_ip_fastfloat                       :   0.72 GB/s   44.8 Mip/s  22.34 ns/ip

So I am slightly slower than the standard library.

You are getting that fastfloat is already nearly 3x faster than the standard library.

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Your claim is extraordinary. I'd be very excited if we are 2x faster than the standard library with our relatively routine functions... But that's a claim that would require deep and immediate investigation.
⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

Thus I believe that your results are wrong.

Now, I'd be happy to be proven wrong, but my stance is the reasonable one. It is very difficult to be 2x to 3x faster than the standard library. Possible, certainly, but rarely by accident and without tradeoff.

lemire merged commit 5830594 into main Dec 23, 2025 (71 checks passed).
@shikharish (Contributor) commented

Let me be clearer: we have not optimized the from_chars functions for uint8_t in any way (...) You are getting that fastfloat is already nearly 3x faster than the standard library. (...) Thus I believe that your results are wrong.

The benchmarks I posted above were run after cherry-picking my commits from the "parsing uint8_t" PR. I apologize for not mentioning that before.

@shikharish Why are you using LLVM 14? The current Apple LLVM is 17.

I am using macOS 13.x. I believe I would have to do a system update (which I am avoiding) to get LLVM 17.


Now, these things can get tricky (...) neither std::from_chars nor fastfloat::from_chars is inlined. (...)

Now, you are giving me quite different answers, but you are not explaining the difference. (...) the performance difference might be entirely due to the inlining, and so we would be measuring the benefits of inlining. (...)

Yes, what I meant was: counters::bench is preventing inlining (of both std::from_chars and fast_float::from_chars).

The following benchmarks were run without cherry-picking any commits, that is, with the current fast_float::from_chars implementation and no uint8_t optimization.

So we get this:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  35.31 GB/s  2207.1 Mip/s   0.45 ns/ip   1.97 GHz   0.89 c/ip   3.03 i/ip   0.06 c/b   0.19 i/b   3.38 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.62 ns/ip   1.96 GHz  52.08 c/ip  123.02 i/ip   3.26 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.31 GB/s   19.6 Mip/s  50.94 ns/ip   1.96 GHz  99.66 c/ip  411.96 i/ip   6.23 c/b  25.75 i/b   4.13 i/c 
parse_ip_fastfloat                       :   0.29 GB/s   17.9 Mip/s  55.95 ns/ip   1.96 GHz  109.48 c/ip  478.18 i/ip   6.84 c/b  29.89 i/b   4.37 i/c 

And simple_bench is inlining both:

❯ ./simple_bench
just_seek_ip_end (no parse)    :  0.59 GB/s   26.9 ns/ip
std::from_chars                :  0.35 GB/s   45.9 ns/ip
fast_float::from_chars         :  0.46 GB/s   34.8 ns/ip
sink=2739889660

So inlined fast_float is faster than inlined std, but non-inlined fast_float is slower.

And after I add my commits from the other PR, the inlined fast_float throughput shoots up to 0.82 GB/s.

@shikharish (Contributor) commented

The problem is that we end up with statements such as "-flto helps" when we have a single source file (and header-only libraries). (...) And evidence is relatively easy to get: we can look at the assembly.

It is strange but it is what I am seeing on my machine:

❯ clang++ -std=c++17 -O3 -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.58 GB/s   27.4 ns/ip
std::from_chars                :  0.29 GB/s   55.9 ns/ip
fast_float::from_chars         :  0.33 GB/s   48.2 ns/ip
sink=2739889660

❯ clang++ -std=c++17 -O3 -flto=thin -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.58 GB/s   27.4 ns/ip
std::from_chars                :  0.32 GB/s   50.8 ns/ip
fast_float::from_chars         :  0.40 GB/s   40.4 ns/ip
sink=2739889660

Inspecting the assembly, it turns out -flto inlines some function calls. For example, the number of std::from_chars calls drops from 14 to 3:

❯ grep -c "bl.*subject_seq" simple_bench_nolto.asm simple_bench_lto.asm                 

simple_bench_nolto.asm:14
simple_bench_lto.asm:3

I am not really sure why this happens :/

@shikharish (Contributor) commented Dec 23, 2025

@lemire Please have a look: #1

This adds a stronger hint to inline all functions called inside that function (as well as the function itself).
Results match my simple_bench:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  35.32 GB/s  2207.5 Mip/s   0.45 ns/ip   1.97 GHz   0.89 c/ip   3.03 i/ip   0.06 c/b   0.19 i/b   3.38 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.63 ns/ip   1.96 GHz  52.10 c/ip  123.01 i/ip   3.26 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.35 GB/s   22.1 Mip/s  45.33 ns/ip   1.96 GHz  88.67 c/ip  276.11 i/ip   5.54 c/b  17.26 i/b   3.11 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   28.4 Mip/s  35.16 ns/ip   1.96 GHz  68.78 c/ip  256.54 i/ip   4.30 c/b  16.03 i/b   3.73 i/c 

@lemire (Member, Author) commented Dec 24, 2025

@shikharish

Please see #352
