
Conversation

@lemire (Member) commented Dec 23, 2025

The IP benchmark was a bit complicated, so I simplified the code. I have also added a control that shows the speed without number parsing (just_seek_ip_end).

just_seek_ip_end (no parse)              :   2.64 GB/s  164.8 Ma/s   6.07 ns/d 
parse_ip_std_fromchars                   :   0.80 GB/s   49.8 Ma/s  20.06 ns/d 
parse_ip_fastfloat                       :   0.71 GB/s   44.6 Ma/s  22.43 ns/d 
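For context, the control does no numeric conversion at all. A minimal sketch of what such a function might look like (an illustration, not the actual code in this PR):

// Hypothetical sketch of the "no parse" control: advance past the
// dotted-quad characters without converting any digits.
const char *just_seek_ip_end(const char *p, const char *end) {
  while (p != end && (*p == '.' || (*p >= '0' && *p <= '9'))) {
    ++p;
  }
  return p;
}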

@shikharish (Contributor) commented

Ran it on my machine (Apple M1, clang 21):

❯ sudo ./build/benchmarks/bench_ip
just_seek_ip_end (no parse)              :   0.58 GB/s   36.4 Ma/s  27.50 ns/d   1.96 GHz  53.80 c/d  111.31 i/d   3.36 c/b   6.96 i/b   2.07 i/c 
parse_ip_std_fromchars                   :   0.28 GB/s   17.2 Ma/s  58.10 ns/d   1.96 GHz  113.59 c/d  489.60 i/d   7.10 c/b  30.60 i/b   4.31 i/c 
parse_ip_fastfloat                       :   0.33 GB/s   20.4 Ma/s  49.06 ns/d   1.96 GHz  95.99 c/d  362.24 i/d   6.00 c/b  22.64 i/b   3.77 i/c 

I also wrote a simple_bench that just measures throughput:

./simple_bench
just_seek_ip_end (no parse)    :  0.61 GB/s   26.4 ns/ip
std::from_chars                :  0.33 GB/s   49.1 ns/ip
fast_float::from_chars         :  0.52 GB/s   30.8 ns/ip
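For reference, a throughput-only loop in this spirit can be written with std::chrono alone. A rough sketch (the actual simple_bench.cpp is not shown in this thread; the real benchmark parses all four octets of each address, and the names below are placeholders):

#include <charconv>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch: time many passes over the data, accumulating into
// a sink so the parsing work cannot be optimized away.
double ns_per_item(const std::vector<std::string> &items, size_t rounds) {
  uint64_t sink = 0;
  auto start = std::chrono::steady_clock::now();
  for (size_t r = 0; r < rounds; ++r) {
    for (const std::string &s : items) {
      uint8_t octet = 0;
      std::from_chars(s.data(), s.data() + s.size(), octet);
      sink += octet;
    }
  }
  auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start)
                .count();
  std::cout << "sink=" << sink << "\n"; // printed, as simple_bench does
  return double(ns) / double(rounds * items.size());
}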

@shikharish (Contributor) commented

Actually, I inspected the assembly of the benchmark. It turns out the compiler was not able to inline some function calls. Compiling with -flto increases throughput:

❯ sudo ./build/benchmarks/bench_ip
just_seek_ip_end (no parse)              :   0.60 GB/s   37.8 Ma/s  26.46 ns/d   1.96 GHz  51.77 c/d  109.30 i/d   3.24 c/b   6.83 i/b   2.11 i/c 
parse_ip_std_fromchars                   :   0.34 GB/s   21.1 Ma/s  47.29 ns/d   1.96 GHz  92.52 c/d  402.25 i/d   5.78 c/b  25.14 i/b   4.35 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   28.5 Ma/s  35.07 ns/d   1.93 GHz  67.78 c/d  281.17 i/d   4.24 c/b  17.57 i/b   4.15 i/c 

@lemire (Member, Author) commented Dec 23, 2025

@shikharish The libraries are header-only (both counters and fast_float), so -flto should have no effect. There is just one source file, the benchmark itself.

@lemire (Member, Author) commented Dec 23, 2025

I have pushed a memcpy measurement. So if you run bench_ip, it will give you an estimate of the best memcpy speed on your system.

You can independently measure it with an entirely different program:

Try this C++ file: just save it, compile it with `-O3`, and run it.
#include <iostream>
#include <chrono>
#include <cstring>
#include <cstdlib>
#include <memory>

int main(int argc, char* argv[]) {
    const size_t element_count = 15000;
    const size_t element_size = 16;
    const size_t buffer_size = element_count * element_size;  // 240000 bytes

    unsigned int iterations = 1000;
    if (argc > 1) {
        iterations = std::atoi(argv[1]);
    }

    std::unique_ptr<char[]> src = std::make_unique<char[]>(buffer_size);
    std::unique_ptr<char[]> dst = std::make_unique<char[]>(buffer_size);

    // Initialize source buffer (arbitrary data)
    for (size_t i = 0; i < buffer_size; ++i) {
        src[i] = static_cast<char>(i);
    }

    // Warm-up: perform a few copies to fill caches
    for (unsigned int i = 0; i < 10; ++i) {
        std::memcpy(dst.get(), src.get(), buffer_size);
    }

    volatile char sink = 0;

    // Timed measurement
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < iterations; ++i) {
        std::memcpy(dst.get(), src.get(), buffer_size);
    }
    sink += dst[0];  // Prevent optimization

    auto end = std::chrono::high_resolution_clock::now();

    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "duration " << duration_ns << " ns\n";
    double duration_sec = duration_ns / 1e9;

    double bytes_copied = static_cast<double>(buffer_size) * iterations;
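    // Note: bytes per nanosecond is numerically equal to GB/s (1 byte/ns = 1e9 bytes/s).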
    double speed_gbps = bytes_copied / duration_ns;

    std::cout << "Buffer size: " << buffer_size << " bytes (" << buffer_size / 1024.0 << " KiB)\n";
    std::cout << "Iterations: " << iterations << "\n";
    std::cout << "Time: " << duration_sec << " seconds\n";
    std::cout << "Memcpy speed: " << speed_gbps << " GB/s\n";

    return EXIT_SUCCESS;
}

Now, if the memcpy speed measured this naive way turns out to be faster than with the new bench_ip, that will indicate that my measurements indeed carry some unacceptable overhead (they are pessimistic). If so, report your results and we shall verify.

Inlining might be a concern, but I have ensured in the latest push that pretty much everything can get inlined.

@shikharish (Contributor) commented

The memcpy test runs at similar speeds. But there is still a big difference in the actual benchmarks when I compile with Apple LLVM (clang 14).
(Before, I was compiling with clang 21 installed from MacPorts, and using -flto was making a difference there. It makes no difference when I use the system c++, which is Apple clang 14.)

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  33.55 GB/s  2096.8 Mip/s   0.48 ns/ip   1.75 GHz   0.83 c/ip   3.03 i/ip   0.05 c/b   0.19 i/b   3.63 i/c 
just_seek_ip_end (no parse)              :   0.59 GB/s   36.7 Mip/s  27.28 ns/ip   1.96 GHz  53.38 c/ip  123.01 i/ip   3.34 c/b   7.69 i/b   2.30 i/c 
parse_ip_std_fromchars                   :   0.31 GB/s   19.7 Mip/s  50.87 ns/ip   1.96 GHz  99.51 c/ip  411.94 i/ip   6.22 c/b  25.75 i/b   4.14 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   29.0 Mip/s  34.51 ns/ip   1.96 GHz  67.54 c/ip  343.92 i/ip   4.22 c/b  21.49 i/b   5.09 i/c 
❯ c++ -std=c++17 -O3 -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.60 GB/s   26.6 ns/ip
std::from_chars                :  0.35 GB/s   46.2 ns/ip
fast_float::from_chars         :  0.82 GB/s   19.5 ns/ip
sink=2739889660

@shikharish (Contributor) commented

The issue is simply that the counters::bench code is complex and heavily templated, and the compiler is not able to optimize it properly.
I wrote a simpler version of the bench, nothing fancy:

template <class Function>
COUNTERS_FORCE_INLINE event_aggregate bench_simple(Function &&function,
                                                   size_t repeats = 100) {
  static thread_local event_collector collector;
  event_aggregate warm_aggregate{};

  // warmup
  for (size_t i = 0; i < 10; i++) {
    collector.start();
    function();
    warm_aggregate << collector.end();
  }

  // measurement
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeats; i++) {
    collector.start();
    function();
    aggregate << collector.end();
  }
  return aggregate;
}

and got much faster results, matching my simple_bench:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  37.20 GB/s  2325.1 Mip/s   0.43 ns/ip   1.98 GHz   0.85 c/ip   3.03 i/ip   0.05 c/b   0.19 i/b   3.56 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.60 ns/ip   1.96 GHz  52.03 c/ip  123.02 i/ip   3.25 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.35 GB/s   21.8 Mip/s  45.80 ns/ip   1.96 GHz  89.58 c/ip  278.99 i/ip   5.60 c/b  17.44 i/b   3.11 i/c 
parse_ip_fastfloat                       :   0.82 GB/s   51.2 Mip/s  19.51 ns/ip   1.96 GHz  38.21 c/ip  176.01 i/ip   2.39 c/b  11.00 i/b   4.61 i/c 
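For reference, a hypothetical call site for this helper (run_parse and input are placeholders; total_elapsed_ns() is the accessor that also appears in the bench_impl code quoted later in this thread):

// Hypothetical usage sketch, not code from the PR.
constexpr size_t repeats = 100;
event_aggregate agg = bench_simple([&]() { run_parse(input); }, repeats);
double ns_per_call = agg.total_elapsed_ns() / double(repeats);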

@shikharish (Contributor) commented

Here is the macro I defined:

#if defined(__clang__) || defined(__GNUC__)
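// flatten asks GCC/Clang to also inline the calls made inside the annotated function.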
#define COUNTERS_FORCE_INLINE __attribute__((flatten)) inline
#elif defined(_MSC_VER)
#define COUNTERS_FORCE_INLINE __forceinline
#else
#define COUNTERS_FORCE_INLINE inline
#endif

@shikharish (Contributor) commented Dec 23, 2025

Even better: I removed all the compile-time call_ntimes<M> calls in bench.cpp, replaced them with call_ntimes_runtime, and changed it to just a simple loop:

template <typename Func>
COUNTERS_FORCE_INLINE void call_ntimes_runtime(Func &&func, size_t M) {
  for (size_t i = 0; i < M; i++) {
    func();
  }
}

Same results as bench_simple.

It works better for this case; I don't know if it will for other cases. I can open a PR in the counters repo if you want.

@lemire (Member, Author) commented Dec 23, 2025

@shikharish Why are you using LLVM 14? The current Apple LLVM is 17.

The memcpy test runs at similar speeds. But there is still a big difference (...)

If the issue is that the benchmark framework introduces overhead and is therefore likely to produce pessimistic measurements, how do you account for the fact that the fastest possible function over this data size (a memcpy) does not run slower?

The issue is simply that the counters::bench code is complex and heavily templated, and the compiler is not able to optimize it properly.

I do not understand this statement. What do you mean by "optimize properly"?

Even better: I removed all the compile-time call_ntimes calls in bench.cpp, replaced them with call_ntimes_runtime, and changed it to just a simple loop:

For short functions, this will include non-trivial loop overhead.

and got much faster results

The benchmark code you are proposing is not robust for short functions.

In the case of the functions that we are benchmarking in bench_ip, the call_ntimes_runtime template will be called with the parameter 1.

Thus we call

// Compile-time specialized bench implementation for a fixed inner repeat M.
template <size_t M, class Function>
event_aggregate bench_impl(Function &&function, size_t min_repeat,
                           size_t min_time_ns, size_t max_repeat) {
  static thread_local event_collector collector;
  auto fn = std::forward<Function>(function);
  size_t N = min_repeat;
  if (N == 0)
    N = 1;

  // Warm-up
  event_aggregate warm_aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    warm_aggregate << allocate_count;
    if ((i + 1 == N) && (warm_aggregate.total_elapsed_ns() < min_time_ns) &&
        (N < max_repeat)) {
      N *= 10;
    }
  }
  // Measurement
  event_aggregate aggregate{};
  for (size_t i = 0; i < N; i++) {
    collector.start();
    call_ntimes<M>(fn);
    event_count allocate_count = collector.end();
    aggregate << allocate_count;
  }

  aggregate /= M;
  aggregate.inner_count = M;
  return aggregate;
}

But with M==1. In that case, we have...

template <std::size_t M, typename Func> void call_ntimes(Func &&func) {
  if constexpr (M == 1) {
    func();
    return;
  }
...

If you compile with optimizations, this will surely get inlined.

It works better for this case; I don't know if it will for other cases. I can open a PR in the counters repo if you want.

Statements like "it works better" or "the compiler is not able to optimize properly" are not helpful to me. If you say that the compiler does a poor job (which is entirely possible), then I expect, for example, an analysis of the produced assembly.

The problem is that we end up with statements such as "-flto helps" when we have a single source file (and header-only libraries). I am not saying it is impossible, but it requires evidence because it is unexpected. And evidence is relatively easy to get: we can look at the assembly.

Now, these things can get tricky. In the two minutes I spent in lldb, I got the impression that in the benchmark here, neither std::from_chars nor fastfloat::from_chars is inlined. So we have loops with function calls. That is fair. Good.

Now, you are giving me quite different answers, but you are not explaining the difference. Why is fastfloat::from_chars so much faster relative to std::from_chars in your small benchmark than in this benchmark? A possibility, of course, is that in your benchmark fastfloat::from_chars is inlined whereas std::from_chars is not. Then, of course, it would not be a direct comparison. In such a case, the performance difference might be entirely due to the inlining, and so we would be measuring the benefits of inlining. That is fine, but we have to be clear about what we are measuring.

@lemire (Member, Author) commented Dec 23, 2025

Let me be clearer: we have not optimized the from_chars functions for uint8_t in any way, and I am getting the following results:

parse_ip_std_fromchars                   :   0.81 GB/s   50.7 Mip/s  19.74 ns/ip
parse_ip_fastfloat                       :   0.72 GB/s   44.8 Mip/s  22.34 ns/ip

So I am slightly slower than the standard library.

You are getting that fastfloat is already nearly 3x faster than the standard library.

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️
Your claim is extraordinary. I'd be very excited if we are 2x faster than the standard library with our relatively routine functions... But that's a claim that would require deep and immediate investigation.
⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

Thus I believe that your results are wrong.

Now, I'd be happy to be proven wrong, but my stance is the reasonable one. It is very difficult to be 2x to 3x faster than the standard library. Possible, certainly, but rarely by accident and without tradeoff.

lemire merged commit 5830594 into main Dec 23, 2025 (71 checks passed).
@shikharish (Contributor) commented

Let me be clearer: we have not optimized the from_chars functions for uint8_t in any way (...) You are getting that fastfloat is already nearly 3x faster than the standard library. (...) Thus I believe that your results are wrong.

The benchmarks I posted above were run after cherry-picking my commits from the "parsing uint8_t" PR. I apologize for not mentioning that before.

@shikharish Why are you using LLVM 14? The current Apple LLVM is 17.

I am using macOS 13.x. I believe I would have to do a system update (which I am avoiding) to get LLVM 17.


Now, these things can get tricky (...) neither std::from_chars nor fastfloat::from_chars is inlined. (...)

Now, you are giving me quite different answers, but you are not explaining the difference. (...) the performance difference might be entirely due to the inlining, and so we would be measuring the benefits of inlining. (...)

Yes, what I meant was: counters::bench is preventing inlining (of both std::from_chars and fast_float::from_chars).

The following benchmarks were run without cherry-picking any commits, that is, with the current fast_float::from_chars implementation and no uint8_t optimization.

So we get this:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  35.31 GB/s  2207.1 Mip/s   0.45 ns/ip   1.97 GHz   0.89 c/ip   3.03 i/ip   0.06 c/b   0.19 i/b   3.38 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.62 ns/ip   1.96 GHz  52.08 c/ip  123.02 i/ip   3.26 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.31 GB/s   19.6 Mip/s  50.94 ns/ip   1.96 GHz  99.66 c/ip  411.96 i/ip   6.23 c/b  25.75 i/b   4.13 i/c 
parse_ip_fastfloat                       :   0.29 GB/s   17.9 Mip/s  55.95 ns/ip   1.96 GHz  109.48 c/ip  478.18 i/ip   6.84 c/b  29.89 i/b   4.37 i/c 

And simple_bench is inlining both:

❯ ./simple_bench
just_seek_ip_end (no parse)    :  0.59 GB/s   26.9 ns/ip
std::from_chars                :  0.35 GB/s   45.9 ns/ip
fast_float::from_chars         :  0.46 GB/s   34.8 ns/ip
sink=2739889660

So inlined fast_float is faster than inlined std, but non-inlined fast_float is slower.

And after I add my commits from the other PR, the inlined fast_float throughput shoots up to 0.82 GB/s.

@shikharish (Contributor) commented

The problem is that we end up with statements such as "-flto helps" when we have a single source file (and header-only libraries). (...) And evidence is relatively easy to get: we can look at the assembly.

It is strange but it is what I am seeing on my machine:

❯ clang++ -std=c++17 -O3 -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.58 GB/s   27.4 ns/ip
std::from_chars                :  0.29 GB/s   55.9 ns/ip
fast_float::from_chars         :  0.33 GB/s   48.2 ns/ip
sink=2739889660

❯ clang++ -std=c++17 -O3 -flto=thin -o simple_bench simple_bench.cpp -I../include && ./simple_bench
just_seek_ip_end (no parse)    :  0.58 GB/s   27.4 ns/ip
std::from_chars                :  0.32 GB/s   50.8 ns/ip
fast_float::from_chars         :  0.40 GB/s   40.4 ns/ip
sink=2739889660

Inspecting the assembly, it turns out -flto inlines some function calls. For example, the number of std::from_chars calls drops from 14 to 3:

❯ grep -c "bl.*subject_seq" simple_bench_nolto.asm simple_bench_lto.asm                 

simple_bench_nolto.asm:14
simple_bench_lto.asm:3

I am not really sure why this happens :/

@shikharish (Contributor) commented Dec 23, 2025

@lemire Please have a look: #1

This adds a stronger hint to inline all functions called inside that function (as well as the function itself).
Results match my simple_bench:

❯ sudo ./build/benchmarks/bench_ip
memcpy baseline                          :  35.32 GB/s  2207.5 Mip/s   0.45 ns/ip   1.97 GHz   0.89 c/ip   3.03 i/ip   0.06 c/b   0.19 i/b   3.38 i/c 
just_seek_ip_end (no parse)              :   0.60 GB/s   37.6 Mip/s  26.63 ns/ip   1.96 GHz  52.10 c/ip  123.01 i/ip   3.26 c/b   7.69 i/b   2.36 i/c 
parse_ip_std_fromchars                   :   0.35 GB/s   22.1 Mip/s  45.33 ns/ip   1.96 GHz  88.67 c/ip  276.11 i/ip   5.54 c/b  17.26 i/b   3.11 i/c 
parse_ip_fastfloat                       :   0.46 GB/s   28.4 Mip/s  35.16 ns/ip   1.96 GHz  68.78 c/ip  256.54 i/ip   4.30 c/b  16.03 i/b   3.73 i/c 

@lemire (Member, Author) commented Dec 24, 2025

@shikharish

Please see #352
