Profile-Guided Optimization (PGO) benchmark results #72

Hi!

Yesterday I read a [post](https://www.reddit.com/r/rust/comments/1al8cuc/modular_community_spotlight_outperforming_rust/) about needletail performance. It gave me the idea to try optimizing the library with PGO, as I have already done for many other applications (all the results are available [here](https://github.com/zamazan4ik/awesome-pgo)). I ran some tests and want to share the results.

## Test environment

* Fedora 39
* Linux kernel 6.7.3
* AMD Ryzen 9 5900X
* 48 GiB RAM
* SSD Samsung 980 Pro 2 TiB
* Compiler - rustc 1.76
* needletail version: the latest `master` branch at the time of testing, commit `25e9b931af87d5aed79ecf7a3ff32245b91ce9dc`
* Turbo Boost disabled (for more stable results across benchmark runs)

## Benchmark

The built-in benchmarks are invoked with `cargo bench`. The PGO instrumentation phase on the benchmarks is done with `cargo pgo bench`. The PGO optimization phase is done with `cargo pgo optimize bench`.

All PGO steps are done with the [cargo-pgo](https://github.com/Kobzol/cargo-pgo) tool.
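
For anyone who wants to reproduce this, here is a minimal sketch of the three steps as a small shell helper. The `run` and `pgo_workflow` functions and the `DRY_RUN` variable are names I made up for this sketch, they are not part of cargo-pgo. It assumes cargo-pgo is already installed (`cargo install cargo-pgo`) together with the `llvm-tools-preview` rustup component, and that it is run from a needletail checkout.

```shell
#!/usr/bin/env bash
# Sketch of the PGO benchmark workflow with cargo-pgo.
# Assumption: cargo-pgo and the llvm-tools-preview component are installed.

# Print a command, then execute it unless DRY_RUN=1 is set.
# DRY_RUN is a helper variable for this sketch only, not a cargo-pgo option.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

pgo_workflow() {
    # 1. Baseline: regular optimized build of the benchmarks.
    run cargo bench
    # 2. Instrumentation phase: build with instrumentation; running the
    #    benchmarks collects the profiles.
    run cargo pgo bench
    # 3. Optimization phase: rebuild using the collected profiles and
    #    run the benchmarks again.
    run cargo pgo optimize bench
}
```

Calling `pgo_workflow` runs the three phases in order. With `DRY_RUN=1` set it only prints the commands, which is handy for checking the sequence before a long benchmark run.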

The only caveat I found is that rustc hits an internal bug when LTO and PGO are combined (see [here](https://github.com/rust-lang/rust/issues/115344#issuecomment-1935985808) for more details). However, this should not make the benchmarks less useful - in practice PGO can still bring performance improvements even on top of LTO. I hope the bug will be fixed one day, so that LTO and PGO can be used for needletail simultaneously.

## Results

I got the following results:

* Release: https://gist.github.com/zamazan4ik/be0e175bfb0b8513f5c9cc7d45044e42
* PGO optimized compared to Release: https://gist.github.com/zamazan4ik/2cd369f47356b4c4f4b68c9fb1203b41
* (just for reference) PGO instrumented compared to Release: https://gist.github.com/zamazan4ik/1749e1c5e38cf11ce8cb1437f7d28821

At least in the benchmarks provided by the project, I see measurable performance improvements in many cases. The only interesting case is a regression in the "FASTA parsing/SeqIO" benchmark. It should be investigated further, but my guess is that it is due to the nature of PGO: sometimes optimizing for one hot path pessimizes other cases. In real life, users in such situations can usually build multiple PGO-optimized binaries - one for each workload, each with its own PGO profile.

## Possible further steps

I can suggest the following things to consider:

* Perform more PGO benchmarks in other scenarios. If they show improvements, add a note to the documentation about the possible needletail performance improvements with PGO (I guess somewhere in the README file will be enough).

I will be happy to answer all your questions about PGO.
