Accelerating 'get_detcost' function #34

Merged: 5 commits into main from optimization-cpu on Jun 19, 2025

Conversation

draganaurosgrbic
Contributor

@draganaurosgrbic draganaurosgrbic commented Jun 16, 2025

Performance Optimization: Accelerating the A* Heuristic Function with Data Locality and Early-Exit Logic

This pull request introduces two major optimizations to the get_detcost function, a critical component of Tesseract's A* heuristic. These changes resolve a severe performance bottleneck and deliver up to 5X faster decoding in some configurations.


The Problem: The get_detcost Bottleneck

Tesseract's decoding process relies on an admissible A* heuristic, which requires the precise calculation of a lower-bound cost for each search state. This calculation is performed by the get_detcost function, which aggregates the minimum cost of unblocked errors affecting a given detector.

My profiling consistently identified get_detcost as the primary performance bottleneck across all code families, consuming:

  • Over 60% of total decoding time in Color Codes.
  • Over 70% (and sometimes up to 90%) in Bivariate-Bicycle Codes.
  • Around 40% in Surface Codes and Transversal CNOT Protocols.

The core inefficiency stemmed from high-frequency accesses to elements at arbitrary, non-contiguous indices within two large pre-computed vectors. This scattered memory access pattern caused numerous CPU cache misses and severely degraded performance. A key observation, however, was that the two vectors exhibited a consistent co-access pattern: elements at the same arbitrary index in both vectors were frequently accessed together. The original implementation did not exploit this insight.
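For illustration, the pre-optimization access pattern looked roughly like this (a sketch with illustrative names; `d2e` maps a detector to the errors affecting it, as in the code excerpts later in this thread):

```
// "Before" sketch: two parallel pre-computed vectors, indexed by error
// id, kept in separate allocations.
std::vector<char> error_blocked;        // is error ei blocked?
std::vector<uint32_t> detectors_count;  // fired detectors touching ei

// Inside get_detcost: both vectors are read at the same arbitrary
// index ei, so each iteration can touch two unrelated cache lines.
for (int ei : d2e[d]) {
  if (!error_blocked[ei] && detectors_count[ei] > 0) {
    // ... update the running minimum cost ...
  }
}
```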


The Solution: A Two-Pronged Optimization Strategy

To address this bottleneck, I implemented two key optimizations:

  1. Improved Data Locality:
    Leveraging the co-access pattern, I merged the two conceptual vectors into a single std::vector of a custom struct with two uint32_t fields: one for the blocked-error flag and one for the fired-detector count. This reorganization ensures that co-accessed data resides contiguously in memory. At 8 bytes per struct, the layout also aligns well with typical 64-byte CPU cache lines, maximizing the benefit of hardware prefetching. This optimization builds on the std::vector<bool> to std::vector<char> optimization from PR Replace std::vector<bool> with std::vector<char> for faster computations #25.

  2. Early-Exit Strategy:
    I now pre-compute a lower bound on the cost of each error before decoding begins, and errors are pre-sorted by this bound. When calculating a detector's cost, the loop in get_detcost terminates early as soon as the current minimum cost is less than or equal to the lower bound of the next error to be considered, since no remaining error can improve the minimum. This pruning eliminates a large fraction of iterations. A sketch of both changes follows this list.
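A minimal sketch of the combined layout and the early-exit scan (the struct and field names match this PR; the loop shape, helper names, and the cost formula are illustrative assumptions):

```
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Struct and field names match this PR; everything else is illustrative.
struct DetectorCostTuple {
  uint32_t error_blocked;    // non-zero => error may not be used
  uint32_t detectors_count;  // fired detectors touching this error
};

// `sorted_d2e[d]` is assumed to list the errors affecting detector d,
// pre-sorted in increasing order of `cost_lower_bound`.
double get_detcost_sketch(
    int d, const std::vector<std::vector<int>>& sorted_d2e,
    const std::vector<double>& cost_lower_bound,
    const std::vector<double>& error_cost,
    const std::vector<DetectorCostTuple>& detector_cost_tuples) {
  double min_cost = std::numeric_limits<double>::max();
  for (int ei : sorted_d2e[d]) {
    // Early exit: errors arrive in increasing lower-bound order, so no
    // later error can beat the current minimum.
    if (min_cost <= cost_lower_bound[ei]) break;
    // Both co-accessed fields arrive on a single cache line.
    const DetectorCostTuple& t = detector_cost_tuples[ei];
    if (!t.error_blocked && t.detectors_count > 0)
      min_cost = std::min(min_cost, error_cost[ei] / t.detectors_count);  // illustrative cost formula
  }
  return min_cost;
}
```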


Performance Impact: Up to 5X Faster Decoding

These optimizations yielded remarkable results, validated through two distinct sets of experiments.

  • Initial Benchmarks (Smaller Number of Shots):
    For smaller benchmarks, I specifically analyzed the impact of the optimizations in this PR. This demonstrated a reduction in cache misses in the get_detcost function by over 70% in Color Codes and over 50% in Bivariate-Bicycle Codes. Speedups from these two optimizations alone reached almost 40% in Color Codes and over 50% in Bivariate-Bicycle Codes.

  • Extensive Benchmarks (1000 Shots):
    I then performed extensive benchmarks on 1000 shots to measure the cumulative speedup from both the optimization in PR Replace std::vector<bool> with std::vector<char> for faster computations #25 and the two new optimizations in this PR. The combined effect resulted in massive decoding speedups:

    • Bivariate-Bicycle Codes: 41.2% to 79.6% (up to 5X faster)
    • Surface Codes: 45.4% to 52.3% (up to 2X faster)
    • Transversal CNOT Protocols: 45% to 51.8% (up to 2X faster)
    • Color Codes: 37.2% to 52.1% (up to 2X faster)

The attached graphs provide a detailed breakdown of these performance improvements across various configurations. (A p% reduction in decoding time corresponds to a 1/(1 − p/100) speedup, so a 79.6% reduction is roughly 5X.)


Key Contributions

  • Pinpointed a Critical Bottleneck: Performed extensive profiling to identify get_detcost as the main performance bottleneck, consuming up to 90% of decoding time.
  • Engineered Two Optimizations: Implemented a dual strategy of improving data locality by reorganizing co-accessed data into a single custom struct and introducing an early-exit strategy to prune redundant computations.
  • Validated Performance Improvements: Conducted thorough benchmarks on both small-scale experiments (measuring cache misses and speedup from this PR alone) and large-scale (1000-shot) experiments to demonstrate the cumulative impact.
  • Achieved Massive Speedups: Delivered a cumulative performance boost of up to 5X faster decoding, making a major improvement to the overall efficiency of Tesseract.

Plots for Smaller Benchmarks

[Plots omitted: decoding speedup and cache-miss improvement in Color Codes, Bivariate-Bicycle Codes, NLR5 Bivariate-Bicycle Codes, and NLR10 Bivariate-Bicycle Codes.]

Plots for Broad Benchmarks (1000 shots)

[Plots omitted: decoding speedup in Color Codes, Bivariate-Bicycle Codes, Surface Codes, and Transversal CNOT Protocols.]

Signed-off-by: Dragana Grbic <[email protected]>
@draganaurosgrbic draganaurosgrbic requested a review from LalehB June 16, 2025 23:06
@LalehB
Collaborator

LalehB commented Jun 18, 2025

@draganaurosgrbic amazing plots and speedup! I was wondering whether these cache misses are for the total program or only for the get_detcost function, since the plot titles say "cache misses in the get_detcost function".

@draganaurosgrbic
Contributor Author

draganaurosgrbic commented Jun 18, 2025

@LalehB The cache misses are for the get_detcost function only. When a plot says cache misses for get_detcost are 50%, it means that 50% of all cache misses in the application occurred in get_detcost. On the cache-miss plots, I also annotate each optimization strategy with its relative improvement: if cache misses were initially 35.74% and dropped to 11.25% after the optimization, that is a (35.74 − 11.25) / 35.74 ≈ 68.5% improvement.

Collaborator

@LalehB LalehB left a comment


LGTM!
Thanks Dragana, very nice speedups 👏🏻

@LalehB LalehB merged commit c7a0cf5 into main Jun 19, 2025
3 checks passed
@LalehB LalehB deleted the optimization-cpu branch June 19, 2025 02:32
draganaurosgrbic added a commit that referenced this pull request Jul 25, 2025
…--at-most-two-errors-per-detector flag (#45)

### Fixing the performance issue (and, implicitly, a pre-existing bug)

This PR fixes the performance issue of costly `std::vector` copy operations performed when the `--at-most-two-errors-per-detector` flag is enabled. As discussed in #27, the original `next_next_blocked_errs = next_blocked_errs` line consumed significant decoding time: every time a new search state was explored/constructed, it made a local copy of the blocked errors of the current state, so that changes made to that state would not affect subsequently explored states. In #27, I realized these local copies were only needed when the `--at-most-two-errors-per-detector` flag was enabled, because only then does _Tesseract_ make additional changes to the vector of blocked errors that must be reverted before the next search state is explored. In this PR, I remove the copies in the flag-enabled case as well, since copying entire large vectors for every new search state, only to revert a few changes, is highly inefficient. I achieved this by storing a special value `2` (instead of `true`/`1`) for errors that are blocked due to the `--at-most-two-errors-per-detector` flag and must be unblocked/reverted in the next search state. This is possible now that boolean elements are stored as integers rather than single bits restricted to `1` or `0`.

The cost of these copy operations can be traced through three stages of the code's evolution:
1. Before I started working on _Tesseract_, the `--at-most-two-errors-per-detector` flag frequently triggered copies of `std::vector<bool>`. As discussed in #25, this container improves memory efficiency by packing boolean elements into individual bits, but I replaced it with `std::vector<char>` because _Tesseract_ accesses elements so frequently that the underlying bit-wise operations induced significant overhead, and the switch drastically improved speed.
2. After the optimization in #25, the code copied `std::vector<char>`, which took more time because the elements were larger.
3. Finally, after the optimization in #34, the code copied `std::vector<DetectorCostTuple>`, whose elements are 8 bytes each, requiring even more time.

Since we were not using the `--at-most-two-errors-per-detector` flag, because it affects the accuracy of the decoder, I had focused my optimizations on the flag-disabled path. The algorithm was therefore still copying large vectors whenever the flag was enabled, only to revert a few changes. In this PR, I fix this performance issue with a smarter strategy: storing a special value `2` for the errors that need to be unblocked/reverted in the following iteration of exploring a search state.

Below are graphs evaluating the performance improvement achieved by removing these unnecessary copy operations when the `--at-most-two-errors-per-detector` flag is enabled. I measured against copies of `std::vector<DetectorCostTuple>`, the last data representation implemented before this PR. I also noticed that this PR improves the accuracy of the decoder when the flag is used, because the old code had a bug. The code below:

```
for (int d : edets[ei]) {
  next_detectors[d] = !next_detectors[d];
  int fired = next_detectors[d] ? 1 : -1;
  next_num_detectors += fired;
  for (int oei : d2e[d]) {
    next_detector_cost_tuples[oei].detectors_count += fired;
  }

  if (!next_detectors[d] && config.at_most_two_errors_per_detector) {
    for (size_t oei : d2e[d]) {
      next_next_detector_cost_tuples[oei].error_blocked = true;
    }
  }
}
```

contains the critical loop with the bug. This loop updates the number of fired detectors for each error in `next_detector_cost_tuples`, but blocks errors in `next_next_detector_cost_tuples`. However, when `get_detcost` is called, `next_next_detector_cost_tuples` is passed as the argument. The inconsistency arose only when the `--at-most-two-errors-per-detector` flag was enabled, since only then was `next_next_detector_cost_tuples` constructed and passed to `get_detcost` while the fired-detector updates were applied to `next_detector_cost_tuples`. This explains the extra low-confidence results, as well as the 3 errors observed in a surface code benchmark (r=11, p=0.002, 500 shots).

In this PR, this loop with the bug is replaced with:
```
for (int d : edets[ei]) {
  next_detectors[d] = !next_detectors[d];
  int fired = next_detectors[d] ? 1 : -1;
  next_num_detectors += fired;
  for (int oei : d2e[d]) {
    next_detector_cost_tuples[oei].detectors_count += fired;
  }

  if (!next_detectors[d] && config.at_most_two_errors_per_detector) {
    for (size_t oei : d2e[d]) {
      next_detector_cost_tuples[oei].error_blocked = next_detector_cost_tuples[oei].error_blocked == 1 ? 1 : 2;
    }
  }
}
```

As explained above, this PR removes `next_next_detector_cost_tuples` entirely and replaces it with the smarter revert strategy for blocked errors. Removing that array also fixes the bug: updates to the fired-detector counts and to the blocked-error flags are now always applied to the same `next_detector_cost_tuples` array.
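
For illustration, the revert step under this scheme could look like the following (an assumed shape; the exact integration point in the search loop may differ):

```
// Sketch: when advancing to the next search state, flags that were set
// to 2 by the --at-most-two-errors-per-detector logic are cleared in
// place, instead of restoring the whole vector from a saved copy.
for (DetectorCostTuple& t : next_detector_cost_tuples) {
  if (t.error_blocked == 2) t.error_blocked = 0;  // undo temporary block
}
// Errors blocked with value 1 remain blocked.
```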

**Note: I evaluated the change in accuracy by comparing the number of low-confidence results. I also measured the number of errors when executing shots/simulations; for the benchmarks below, there were no errors either before or after this PR, with one exception: the surface code benchmark (r=11, p=0.002, 500 shots), which produced 3 errors (out of 500 shots) before this PR and 0 after.**

<img width="1778" height="870" alt="Screenshot 2025-07-18 7 40 19 PM"
src="https://github.com/user-attachments/assets/a1f54d20-ee00-43bf-8c28-f29aaf487d80"
/>

<img width="1778" height="876" alt="Screenshot 2025-07-18 7 40 39 PM"
src="https://github.com/user-attachments/assets/87279c31-abae-4860-9bab-749eb02631ee"
/>

<img width="1778" height="877" alt="Screenshot 2025-07-18 7 41 01 PM"
src="https://github.com/user-attachments/assets/07f820fa-c2de-4029-b5c4-52c0e035dc71"
/>

<img width="1778" height="875" alt="Screenshot 2025-07-18 7 41 21 PM"
src="https://github.com/user-attachments/assets/8a4275ed-a787-4733-9e2c-8c609406edbc"
/>

### Analyzing the impact of this flag on performance and accuracy
**Now that the flag's code path is fixed and optimized (on par with the flag-disabled path), we can analyze the flag's impact on the performance and accuracy of the decoder.**

I first analyzed the performance and accuracy impact of this flag using the same benchmarks I used to test the fix above. For these benchmarks, the flag provides somewhat better accuracy but lower performance. Below are graphs comparing accuracy and performance with and without the `--at-most-two-errors-per-detector` flag.

<img width="1778" height="985" alt="Screenshot 2025-07-18 7 41 52 PM"
src="https://github.com/user-attachments/assets/d4e4f5f3-75e8-46e1-9fdf-51537316d652"
/>

<img width="1778" height="874" alt="Screenshot 2025-07-18 7 42 10 PM"
src="https://github.com/user-attachments/assets/b31b28cd-d9a9-4a2a-bae2-88f78a16a5e0"
/>

<img width="1778" height="986" alt="Screenshot 2025-07-18 7 42 27 PM"
src="https://github.com/user-attachments/assets/c6ea54d6-0738-44b3-924f-4632291e41e1"
/>

<img width="1778" height="883" alt="Screenshot 2025-07-18 7 42 44 PM"
src="https://github.com/user-attachments/assets/e5f64154-da58-406d-8e47-2be2a0004a37"
/>

### More data on the performance and accuracy impact of the flag
I performed additional experiments/benchmarks to collect more comprehensive data on the flag's impact on the performance and accuracy of the _Tesseract_ decoder. Below are plots for various groups of codes. They confirm that for most benchmarks the flag provides somewhat better accuracy but lower performance.

<img width="1763" height="980" alt="Screenshot 2025-07-18 1 50 52 PM"
src="https://github.com/user-attachments/assets/3ac277dd-4e9d-418c-956d-dc331ef12019"
/>

<img width="1763" height="981" alt="Screenshot 2025-07-18 1 52 49 PM"
src="https://github.com/user-attachments/assets/9c7e50ef-7bb2-4805-8e8c-d1df4152cc10"
/>

<img width="1762" height="981" alt="Screenshot 2025-07-18 1 55 53 PM"
src="https://github.com/user-attachments/assets/1803cccf-4f25-4b9a-bb2a-3818412f60de"
/>

<img width="1762" height="980" alt="Screenshot 2025-07-18 1 57 34 PM"
src="https://github.com/user-attachments/assets/b5645353-8168-4b39-9473-4c3ed425083c"
/>

<img width="1748" height="981" alt="Screenshot 2025-07-18 2 02 48 PM"
src="https://github.com/user-attachments/assets/1084b196-365a-4e3b-a65d-bacd19929760"
/>

<img width="1748" height="981" alt="Screenshot 2025-07-18 2 04 45 PM"
src="https://github.com/user-attachments/assets/c6cbcf78-26ab-48e2-a9a3-2ff1faf3c5dc"
/>

<img width="1756" height="989" alt="Screenshot 2025-07-18 2 08 19 PM"
src="https://github.com/user-attachments/assets/e5f98227-b5e0-4eba-885f-571908d183a0"
/>

<img width="1746" height="988" alt="Screenshot 2025-07-18 3 15 07 PM"
src="https://github.com/user-attachments/assets/81c6133e-e34d-403d-85fc-320042311120"
/>

**The results show that disabling the flag increases decoding speed by anywhere from roughly 0% to over 40%; only in very rare cases does the flag itself yield a (very small) performance improvement. The flag's accuracy improvement ranges from 0% to over 30%, so it can have a significant positive impact on accuracy.**

### Major contributions of the PR:
- Removes the performance degradation on the flag-enabled path introduced by my earlier optimizations, which targeted configurations that do not use this flag and significantly improved their decoding time
- Completely removes the inefficient/redundant `std::vector` copy operations propagated by the old `next_next_blocked_errs = next_blocked_errs` line (mentioned in PR #27)
- Fixes the performance issue/bug that existed when the `--at-most-two-errors-per-detector` flag is used, where large vectors were copied in every decoding iteration only to revert a few changes (this issue escalated because of the data-representation changes required by the previous optimization strategies)
- Adds extensive experiments/benchmarks evaluating the impact of the performance/bug fix
- Adds extensive experiments/benchmarks evaluating the impact of the flag itself on the performance and accuracy of the decoder

### Does the flag provide better performance on any benchmark now?
I also re-ran a benchmark our team looked at in the last meeting, where using the `--at-most-two-errors-per-detector` flag did appear to provide better performance. Specifically, I ran:

`bazel build src:all && time ./bazel-bin/src/tesseract --pqlimit 200000
--beam 5 --num-det-orders 20 --sample-num-shots 20 --det-order-seed
13267562 --circuit
testdata/colorcodes/r\=9\,d\=9\,p\=0.002\,noise\=si1000\,c\=superdense_color_code_X\,q\=121\,gates\=cz.stim
--sample-seed 717347 --threads 1 --print-stats`

with and without the `--at-most-two-errors-per-detector` flag. The execution time was 69.01 seconds without the flag and 74.23 seconds with it, with no errors or low-confidence results in either run. I believe the benchmark we looked at during our last meeting used an installation of _Tesseract_ that predates my optimization in #34. If so, this shows that my optimizations have a higher impact when the flag is disabled, and that the performance improvement they provide outweighs the flag's initial speedup.

**Conclusion: I am confident that the current version of the _Tesseract_ algorithm is faster without this flag because of the optimizations I implemented in the `get_detcost` function. When `--at-most-two-errors-per-detector` is enabled, more errors are blocked, preventing them from influencing detector costs and hence the work done in `get_detcost`. Since `get_detcost` is now heavily optimized, the speedups this flag used to provide no longer outweigh the gains from #34.**

PR #47 contains the code/scripts I used to benchmark and compare color,
surface, and bicycle codes with and without using the
`--at-most-two-errors-per-detector` flag.

---------

Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>
draganaurosgrbic added a commit that referenced this pull request Jul 30, 2025
### Hashing Syndrome Patterns with `boost::dynamic_bitset`
In this PR, I address a key performance bottleneck: hashing patterns of fired detectors (syndrome patterns). I introduce `boost::dynamic_bitset` from the Boost library, a data structure that combines the memory-saving bit packing of `std::vector<bool>` with highly optimized bit-wise operations, enabling access and modification as fast as `std::vector<char>`. Crucially, `boost::dynamic_bitset` also provides highly optimized, built-in support for hashing sequences of boolean elements.

---

### Initial Optimization: `std::vector<bool>` to `std::vector<char>`
The initial _Tesseract_ implementation, as documented in #25, utilized
`std::vector<bool>` to store patterns of fired detectors and predicates
that block specific errors from being added to the current error
hypothesis. While `std::vector<bool>` optimizes memory usage by packing
elements into individual bits, accessing and modifying its elements is
highly inefficient due to its reliance on proxy objects that perform
costly bit-wise operations (shifting, masking). Given _Tesseract_'s
frequent access and modification of these elements, this caused
significant performance overheads.

In #25, I transitioned from `std::vector<bool>` to `std::vector<char>`.
This change made boolean elements addressable bytes, enabling efficient
and direct byte-level access. Although this increased memory footprint
(as each boolean was stored as a full byte), it delivered substantial
performance gains by eliminating `std::vector<bool>`'s proxy objects and
their associated overheads for element access and modification. Speedups
achieved with this initial optimization were significant:
* For Color Codes, speedups reached 17.2%-32.3%
* For Bivariate-Bicycle Codes, speedups reached 13.0%-22.3%
* For Surface Codes, speedups reached 33.4%-42.5%
* For Transversal CNOT Protocols, speedups reached 12.2%-32.4%

These significant performance gains highlight the importance of choosing
appropriate data structures for boolean sequences, especially in
performance-sensitive applications like _Tesseract_. The remarkable
42.5% speedup achieved in Surface Codes with this initial switch
underscores the substantial overhead caused by unsuitable data
structures. The performance gain from removing `std::vector<bool>`'s
proxy objects and their inefficient operations far outweighed any
overhead from increased memory consumption.
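
For illustration, the difference in element access boils down to this toy example (illustrative only, not code from the PR):

```
#include <vector>

void toy_example() {
  std::vector<bool> packed(1024);  // packs 8 flags per byte
  packed[100] = true;              // goes through a proxy object: a
                                   // masked read-modify-write on a word

  std::vector<char> flat(1024);    // one addressable byte per flag
  flat[100] = 1;                   // plain single-byte store
}
```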

---

### Current Bottleneck: `std::vector<char>` and Hashing
Following the optimizations in #25, _Tesseract_ continued to use
`std::vector<char>` for storing and managing patterns of fired detectors
and predicates that block errors. Subsequently, PR #34 replaced and
merged vectors of blocked errors into the `DetectorCostTuple` structure,
which efficiently stores `error_blocked` and `detectors_count` as
`uint32_t` fields (reasons explained in #34). These changes left vectors
of fired detectors as the sole remaining `std::vector<char>` data
structure in this context.

After implementing and evaluating optimizations in #25, #27, #34, and
#45, profiling _Tesseract_ to analyze remaining bottlenecks revealed
that, aside from the `get_detcost` function, a notable bottleneck
emerged: `VectorCharHash` (originally `VectorBoolHash`). This function
is responsible for hashing patterns of fired detectors to prevent
re-exploring previously visited syndrome states. The implementation of
`VectorCharHash` involved iterating through each element, byte by byte,
and accumulating the hash. Even though this function saw significant
speedups with the initial switch from `std::vector<bool>` to
`std::vector<char>`, hashing patterns of fired detectors still consumed
considerable time. Post-optimization profiling (after #25, #27, #34, and
#45) revealed that this hashing function consumed approximately 25% of
decoding time in Surface Codes, 30% in Transversal CNOT Protocols, 10%
in Color Codes, and 2% in Bivariate-Bicycle Codes (`get_detcost`
remained the primary bottleneck for Bivariate-Bicycle Codes). Therefore,
I decided to explore opportunities to further optimize this function and
enhance the decoding speed.

---

### Solution: Introducing `boost::dynamic_bitset`
This PR addresses the performance bottleneck of hashing fired detector
patterns and mitigates the increased memory footprint from the initial
switch to `std::vector<char>` by introducing the `boost::dynamic_bitset`
data structure. The C++ standard library's `std::bitset` offers an ideal
conceptual solution: memory-efficient bit-packed storage (like
`std::vector<bool>`) combined with highly efficient access and
modification operations (like `std::vector<char>`). This data structure
achieves efficient access and modification by employing highly optimized
bit-wise operations, thereby reducing performance overhead stemming from
proxy objects in `std::vector<bool>`. However, `std::bitset` requires a
static size (determined at compile-time), rendering it unsuitable for
_Tesseract_'s dynamically sized syndrome patterns.

The Boost library's `boost::dynamic_bitset` provides exactly this by offering dynamic-sized bit arrays whose dimensions are determined at runtime. It combines the memory efficiency of `std::vector<bool>` (packing elements into individual bits) with the performance benefits of direct element access and modification, similar to `std::vector<char>`. It achieves this by internally storing bits within a contiguous array of fundamental integer types (e.g., `unsigned long` or `uint64_t`) and accessing/modifying elements with highly optimized bit-wise operations, avoiding the overhead of `std::vector<bool>`'s proxy objects. Furthermore, `boost::dynamic_bitset` offers highly optimized, built-in hashing, replacing our custom, less efficient byte-by-byte hashing and resulting in a cleaner, faster implementation.
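
A minimal sketch of a hasher in this spirit (illustrative; it hashes the bitset's packed blocks through Boost's generic hash utilities rather than iterating byte by byte):

```
#include <boost/dynamic_bitset.hpp>
#include <boost/functional/hash.hpp>
#include <cstddef>
#include <vector>

// Hypothetical hasher: hash the bitset's packed machine words instead
// of iterating over the pattern byte by byte.
struct SyndromeHash {
  std::size_t operator()(const boost::dynamic_bitset<>& bits) const {
    std::vector<boost::dynamic_bitset<>::block_type> blocks(bits.num_blocks());
    boost::to_block_range(bits, blocks.begin());  // copy raw blocks
    return boost::hash_range(blocks.begin(), blocks.end());
  }
};
```

With such a hasher, the visited-syndrome set could be declared as `std::unordered_set<boost::dynamic_bitset<>, SyndromeHash>`; the PR's actual integration may differ.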

---

### Performance Evaluation: Individual Impact of Optimization
I performed two types of experiments to evaluate the achieved
performance gains. First, I conducted extensive benchmarks across
various code families and configurations to evaluate the individual
performance gains achieved by this specific optimization. Speedups
achieved include:
* For Surface Codes: 8.0%-24.7%
* For Transversal CNOT Protocols: 12.1%-26.8%
* For Color Codes: 3.6%-7.0%
* For Bivariate-Bicycle Codes: 0.5%-4.8%

These results highlight the highest impact in Surface Codes and
Transversal CNOT Protocols, which aligns with the initial profiling data
that showcased these code families were spending more time in the
original `VectorCharHash` function.

---

#### Speedups in Surface Codes

<img width="1990" height="989" alt="img1"
src="https://github.com/user-attachments/assets/04044da5-a980-4282-a6fe-4debfa815f41"
/>

---

#### Speedups in Transversal CNOT Protocols

<img width="1990" height="989" alt="img2"
src="https://github.com/user-attachments/assets/f79e4d7d-5cfc-4077-be1a-13ef92a2d65a"
/>

<img width="1990" height="989" alt="img3"
src="https://github.com/user-attachments/assets/35a9b672-07d3-45ea-9334-23dd85760925"
/>

---

#### Speedups in Color Codes

<img width="1990" height="989" alt="img4"
src="https://github.com/user-attachments/assets/2b52c4fd-5137-47f0-9bae-7c667c740ff0"
/>

<img width="1990" height="989" alt="img5"
src="https://github.com/user-attachments/assets/e7883dec-5a88-4b2b-914b-3d12a1843d6f"
/>

---

#### Speedups in Bivariate-Bicycle Codes

<img width="1990" height="989" alt="img6"
src="https://github.com/user-attachments/assets/bd530a3b-da17-4ac1-bf68-702aaafe6047"
/>

<img width="1990" height="989" alt="img7"
src="https://github.com/user-attachments/assets/2d2f2576-0b16-4f0a-b8a2-221723250945"
/>

---

### Performance Evaluation: Cumulative Speedup
Following the evaluation of individual performance gains, I analyzed the
cumulative effect of the optimizations implemented across PRs #25, #27,
#34, and #45. The cumulative speedups achieved are:
* For Color Codes: 40.7%-54.8%
* For Bivariate-Bicycle Codes: 41.5%-80.3%
* For Surface Codes: 50.0%-62.4%
* For Transversal CNOT Protocols: 57.8%-63.6%

These results demonstrate that my optimizations achieved over 2x speedup
in Color Codes, over 2.5x speedup in Surface Codes and Transversal CNOT
Protocols, and over 5x speedup in Bivariate-Bicycle Codes.

---

#### Speedups in Color Codes

<img width="1990" height="989" alt="img1"
src="https://github.com/user-attachments/assets/cd81dc98-8599-4740-b00c-4ff396488f69"
/>

<img width="1990" height="989" alt="img2"
src="https://github.com/user-attachments/assets/c337ddcf-44f0-4641-91df-2a6d3c586680"
/>

---

#### Speedups in Bivariate-Bicycle Codes

<img width="1990" height="989" alt="img3"
src="https://github.com/user-attachments/assets/a57cf9e2-4c2c-44e8-8a6e-1860b1544cbd"
/>

<img width="1990" height="989" alt="img4"
src="https://github.com/user-attachments/assets/fde60159-fd7f-4893-b30d-34da844ac452"
/>

---

#### Speedups in Surface Codes

<img width="1990" height="989" alt="img5"
src="https://github.com/user-attachments/assets/57234d33-201b-41a9-b867-15e9ff87e666"
/>

---

#### Speedups in Transversal CNOT Protocols

<img width="1990" height="989" alt="img6"
src="https://github.com/user-attachments/assets/5780843d-2055-4870-9454-50184a268ad1"
/>

---

### Conclusion
These results demonstrate that the `boost::dynamic_bitset` optimization
significantly impacts code families where the original hashing function
(`VectorCharHash`) was a primary bottleneck (Surface Codes and
Transversal CNOT Protocols). The substantial speedups achieved in these
code families validate that `boost::dynamic_bitset` provides
demonstrably more efficient hashing and bit-wise operations. For code
families where hashing was less of a bottleneck (Color Codes and
Bivariate-Bicycle Codes), the speedups were modest, reinforcing that
`std::vector<char>` can remain highly efficient even with increased
memory usage when bit packing is not the primary performance concern.
Crucially, this optimization delivers comparable or superior performance
to `std::vector<char>` while simultaneously reducing memory footprint,
providing additional speedups where hashing performance is critical.

---

### Key Contributions
* Identified the hashing of syndrome patterns as the primary remaining bottleneck in Surface Codes and Transversal CNOT Protocols after the prior optimizations (#25, #27, #34, #45).
* Adopted `boost::dynamic_bitset` as a superior data structure, combining `std::vector<bool>`'s memory efficiency with high-performance bit-wise operations and built-in hashing, enabling access and modification as fast as `std::vector<char>`.
* Replaced `std::vector<char>` with `boost::dynamic_bitset` for storing
syndrome patterns.
* Performed extensive benchmarking to evaluate both the individual
impact of this optimization and its cumulative effect with prior PRs.
* Achieved significant individual speedups (e.g., 8.0%-24.7% in Surface
Codes, 12.1%-26.8% in Transversal CNOT Protocols) and substantial
cumulative speedups (over 2x in Color Codes, over 2.5x in Surface Codes
and Transversal CNOT Protocols, and over 5x in Bivariate-Bicycle Codes).

PR #47 contains the scripts I used for benchmarking and plotting the
results.

---------

Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>
noajshu added a commit that referenced this pull request Aug 7, 2025
…reate-pr

Revert verbose logging refactor