
[QDP] Add SVHN IQP encoding benchmark with PennyLane baseline and QDP pipeline #1186

Open
ryankert01 wants to merge 2 commits into apache:main from ryankert01:svhn-iqp

Conversation

@ryankert01
Member

@ryankert01 ryankert01 commented Mar 15, 2026

1. Introduction

QDP (Quantum Data Plane) is a CUDA-accelerated quantum state encoding library that
converts classical feature vectors into quantum state vectors using GPU kernels. This
report compares QDP's one-shot encoding approach against PennyLane's gate-by-gate
encoding for IQP (Instantaneous Quantum Polynomial) circuits in a variational
classification task on the SVHN dataset.

Key insight: PennyLane embeds the IQP encoding circuit (Hadamard-Phase-Hadamard
gates) inside the quantum node, so it re-executes on every forward and backward pass.
QDP encodes all samples once upfront on GPU, then feeds pre-computed state vectors via
StatePrep during training. This eliminates redundant encoding and enables zero-copy
GPU tensor sharing with lightning.gpu.

2. Method

2.1 Task

Binary classification on SVHN (digit 1 vs 7):

  • SVHN (32x32x3) -> Flatten (3072) -> binary filter -> subsample
  • StandardScaler -> PCA to n_qubits dimensions
  • IQP encoding -> variational layers (Rot + ring CNOT) -> expval(PauliZ(0))
  • Square loss, Adam optimizer, batched training
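The preprocessing steps above can be sketched in plain NumPy. This is an illustrative stand-in, not the benchmark's actual code: the scripts presumably use scikit-learn's StandardScaler and PCA, which this reproduces via standardization and an SVD projection.

```python
import numpy as np

def preprocess(X, n_qubits):
    """StandardScaler -> PCA sketch in plain NumPy (the benchmark scripts
    presumably use the scikit-learn equivalents)."""
    X = np.asarray(X, dtype=np.float64)
    # Standardize: zero mean, unit variance per feature
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # PCA via SVD: project onto the top n_qubits principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_qubits].T

# Toy stand-in for 200 flattened 32x32x3 SVHN images (3072 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3072))
Z = preprocess(X, n_qubits=6)
print(Z.shape)  # (200, 6)
```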

2.2 IQP Encoding

The IQP circuit implements: H^n * Diag(phases) * H^n |0>^n

where phases include single-qubit terms z_i = features[i] and two-qubit terms
z_{ij} = features[i] * features[j] for all pairs i < j. This creates n + n(n-1)/2
parametric gates per sample.
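The same state can be built directly as a dense vector, which is essentially what a one-shot encoder computes. The NumPy sketch below is an illustration of the H^n * Diag(phases) * H^n map; the actual QDP CUDA kernel is assumed to implement the equivalent transformation.

```python
import numpy as np
from itertools import combinations

def iqp_state(features):
    """Build H^n * Diag(phases) * H^n |0>^n as a dense state vector.

    Single-qubit terms z_i = features[i] contribute phase on basis states
    where bit i is set; pair terms z_ij = features[i] * features[j]
    contribute where bits i and j are both set.
    """
    n = len(features)
    dim = 2 ** n
    # First Hadamard layer: H^n |0> is the uniform superposition
    state = np.full(dim, 1 / np.sqrt(dim), dtype=np.complex128)
    # Diagonal phase for each computational basis state
    phase = np.zeros(dim)
    for k in range(dim):
        on = [i for i in range(n) if (k >> i) & 1]
        phase[k] = sum(features[i] for i in on)
        phase[k] += sum(features[i] * features[j] for i, j in combinations(on, 2))
    state *= np.exp(1j * phase)
    # Second Hadamard layer, applied qubit by qubit via tensor contraction
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    psi = state.reshape([2] * n)
    for q in range(n):
        psi = np.moveaxis(np.tensordot(H, psi, axes=([1], [q])), 0, q)
    return psi.reshape(dim)

features = np.array([0.3, -0.7, 1.1, 0.2])
psi = iqp_state(features)
print(round(np.linalg.norm(psi), 6))  # 1.0 (unitary circuit preserves norm)
```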

2.3 Four Configurations

| Config | Encoding | Transfer | Training Backend |
|---|---|---|---|
| PL-CPU | PennyLane IQP gates (CPU, every step) | -- | default.qubit (CPU) |
| PL-GPU | PennyLane IQP gates (GPU, every step) | -- | lightning.gpu (GPU) |
| QDP-CPU | QDP CUDA kernel (one-shot) | GPU->CPU copy | default.qubit (CPU) |
| QDP-GPU | QDP CUDA kernel (one-shot) | Zero copy | lightning.gpu (GPU) |

2.4 Hardware

  • GPU: NVIDIA GeForce RTX 3090 Ti
  • PennyLane 0.44.1, PennyLane-Lightning-GPU 0.44.0
  • QDP 0.2.0 (qumat-qdp, Rust+CUDA)

3. Results

3.1 Accuracy Parity (Experiment 2)

Goal: Prove identical quantum states by showing matching accuracy across all configs.

Fixed parameters: 6 qubits, 200 samples, 200 iterations, batch size 10, 4 layers, seed 42, 1 trial.

| Config | Test Accuracy | Train Time (s) | Throughput (samples/s) |
|---|---|---|---|
| PL-CPU | 0.6750 | 126.03 | 15.9 |
| QDP-CPU | 0.6750 | 127.02 | 15.7 |
| PL-GPU | 0.6750 | 61.66 | 32.4 |
| QDP-GPU | 0.6750 | 52.74 | 37.9 |

All four configurations produce identical test accuracy (0.6750), confirming that
QDP's CUDA-encoded state vectors match PennyLane's gate-by-gate construction. The CPU
configs (PL-CPU, QDP-CPU) produce exact numerical agreement because both use
default.qubit with deterministic autograd.

QDP encoding time: 9.6 ms to encode all 200 samples (one-shot).

3.2 Qubit Scaling (Experiment 1)

Goal: Show QDP's advantage grows with qubit count.

Fixed parameters: 200 samples, 200 iterations, batch size 10, 4 layers, seed 42, 3 trials.

3.2.1 Training Time (mean of 3 trials, seconds)

| Qubits | State Dim | PL-CPU | QDP-CPU | PL-GPU | QDP-GPU |
|---|---|---|---|---|---|
| 4 | 16 | 89.4 | 87.1 | 44.8 | 36.4 |
| 6 | 64 | 134.7 | 114.4 | 65.2 | 43.2 |
| 8 | 256 | 186.5 | 166.9 | 80.7 | 59.4 |
| 10 | 1024 | 253.4 | 217.7 | 100.1 | 71.1 |

3.2.2 Speedup Ratios (PL / QDP)

| Qubits | CPU Speedup (PL-CPU / QDP-CPU) | GPU Speedup (PL-GPU / QDP-GPU) | Cross Speedup (PL-CPU / QDP-GPU) |
|---|---|---|---|
| 4 | 1.03x | 1.23x | 2.46x |
| 6 | 1.18x | 1.51x | 3.12x |
| 8 | 1.12x | 1.36x | 3.14x |
| 10 | 1.16x | 1.41x | 3.56x |

The GPU speedup (PL-GPU vs QDP-GPU) peaks at 1.51x at 6 qubits and remains
significant across all tested qubit counts. The cross speedup (PL-CPU vs QDP-GPU)
reaches 3.56x at 10 qubits, combining the benefits of one-shot encoding with GPU
training.
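The ratios in the table above follow directly from the mean training times in Section 3.2.1 and can be reproduced with a few lines of arithmetic:

```python
# Recompute the Section 3.2.2 speedup ratios from the Section 3.2.1 timings (s).
pl_cpu  = {4: 89.4, 6: 134.7, 8: 186.5, 10: 253.4}
qdp_cpu = {4: 87.1, 6: 114.4, 8: 166.9, 10: 217.7}
pl_gpu  = {4: 44.8, 6: 65.2, 8: 80.7, 10: 100.1}
qdp_gpu = {4: 36.4, 6: 43.2, 8: 59.4, 10: 71.1}

for q in (4, 6, 8, 10):
    print(q,
          round(pl_cpu[q] / qdp_cpu[q], 2),   # CPU speedup
          round(pl_gpu[q] / qdp_gpu[q], 2),   # GPU speedup
          round(pl_cpu[q] / qdp_gpu[q], 2))   # cross speedup
```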

3.2.3 QDP One-Shot Encoding Time

| Qubits | State Dim | QDP Encode (ms) | Fraction of Total |
|---|---|---|---|
| 4 | 16 | 10.0 | < 0.1% |
| 6 | 64 | 9.4 | < 0.1% |
| 8 | 256 | 10.7 | < 0.1% |
| 10 | 1024 | 12.2 | < 0.1% |

QDP's encoding time is essentially constant (~10 ms) regardless of qubit count
for this dataset size, thanks to GPU parallelism. This is negligible compared to
training time (36-253 seconds).

3.2.4 Throughput (samples/sec, mean of 3 trials)

| Qubits | PL-CPU | QDP-CPU | PL-GPU | QDP-GPU |
|---|---|---|---|---|
| 4 | 22.4 | 23.0 | 44.6 | 54.9 |
| 6 | 14.8 | 17.5 | 30.7 | 46.3 |
| 8 | 10.7 | 12.0 | 24.8 | 33.7 |
| 10 | 7.9 | 9.2 | 20.0 | 28.1 |

QDP-GPU consistently delivers the highest throughput at every qubit count.

3.3 Zero-Copy Advantage

The QDP-CPU vs QDP-GPU comparison isolates the effect of keeping encoded tensors on
GPU (zero copy) versus copying them to CPU after encoding:

| Qubits | QDP-CPU (s) | QDP-GPU (s) | Zero-Copy Speedup |
|---|---|---|---|
| 4 | 87.1 | 36.4 | 2.39x |
| 6 | 114.4 | 43.2 | 2.65x |
| 8 | 166.9 | 59.4 | 2.81x |
| 10 | 217.7 | 71.1 | 3.06x |

The zero-copy advantage grows with qubit count (2.39x to 3.06x), reflecting the
increasing benefit of GPU-native state vector operations as the Hilbert space
dimension grows.

3.4 Test Accuracy Consistency

Test accuracy per configuration at each qubit count (seed = 42):

| Qubits | PL-CPU | QDP-CPU | PL-GPU | QDP-GPU |
|---|---|---|---|---|
| 4 | 0.6417 | 0.6583 | 0.6583 | 0.6333 |
| 6 | 0.6667 | 0.6750 | 0.6750 | 0.6750 |
| 8 | 0.6750 | 0.6750 | 0.6750 | 0.6750 |
| 10 | 0.6750 | 0.6750 | 0.6750 | 0.6750 |

At 8 and 10 qubits, all configs converge to the same accuracy (0.6750). Minor variations at 4 and 6 qubits are due to different circuit structure (QDP uses StatePrep vs PennyLane's gate-by-gate IQP) and floating-point non-determinism on GPU.

4. Discussion

Why QDP is Faster

  1. One-shot encoding: QDP encodes all samples once (~10 ms) regardless of training
    iterations. PennyLane re-runs n + n(n-1)/2 parametric gates per sample per step.

  2. Simpler training circuit: QDP's circuit uses StatePrep (one operation) instead
    of the full H-D-H IQP gate sequence. This reduces per-step circuit complexity.

  3. Zero-copy GPU path: QDP-GPU keeps encoded tensors on GPU, avoiding GPU->CPU
    data transfer. Combined with lightning.gpu for training, the entire pipeline
    stays on GPU.
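The training circuit in point 2 can be reproduced in plain NumPy to make the comparison concrete: load a precomputed state, apply layers of per-qubit Rot gates (ZYZ Euler rotations, matching PennyLane's qml.Rot convention) plus a ring of CNOTs, and measure ⟨PauliZ(0)⟩. This is a minimal sketch of the equivalent math, not the benchmark's actual PennyLane/StatePrep code.

```python
import numpy as np

def rz(a):
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=np.complex128)

def apply_1q(gate, psi, q, n):
    # Contract a 2x2 gate into qubit axis q of the state tensor
    psi = psi.reshape([2] * n)
    psi = np.moveaxis(np.tensordot(gate, psi, axes=([1], [q])), 0, q)
    return psi.reshape(-1)

def apply_cnot(psi, ctrl, tgt, n):
    # Flip the target axis on the control=1 slice of the state tensor
    psi = psi.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[ctrl] = 1
    t = tgt if tgt < ctrl else tgt - 1  # target axis shifts after slicing
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=t)
    return psi.reshape(-1)

def variational_expval(state, weights):
    """Rot (ZYZ) on each qubit + ring CNOTs per layer, applied to a
    precomputed state; returns <PauliZ(0)>."""
    n = int(np.log2(len(state)))
    psi = np.asarray(state, dtype=np.complex128)
    for layer in weights:                     # weights: (layers, n, 3)
        for q in range(n):
            phi, theta, omega = layer[q]
            psi = apply_1q(rz(omega) @ ry(theta) @ rz(phi), psi, q, n)
        for q in range(n):                    # ring of CNOTs
            psi = apply_cnot(psi, q, (q + 1) % n, n)
    probs = (np.abs(psi) ** 2).reshape([2] * n)
    return float(probs[0].sum() - probs[1].sum())

# A random normalized state stands in for a QDP-encoded sample
rng = np.random.default_rng(42)
n = 4
state = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
state /= np.linalg.norm(state)
weights = rng.normal(size=(2, n, 3))
val = variational_expval(state, weights)
print(val)  # a value in [-1, 1]
```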

Where QDP Matters Most

  • High qubit counts: The IQP circuit has O(n^2) parametric gates. At 10 qubits,
    that's 55 gates removed from every forward pass.
  • Many training iterations: The savings compound — at 200 iterations with batch
    size 10, PennyLane executes the IQP circuit ~4000 times (200 iters x 2 for
    forward/backward x 10 samples). QDP does it zero times during training.
  • GPU-native workflows: The zero-copy path (QDP-GPU) provides the largest speedup
    (up to 3.56x vs PL-CPU).
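The counts quoted in the bullets above follow from two small formulas:

```python
# Parametric IQP gates per sample: n single-qubit + n(n-1)/2 pair phases
def iqp_gate_count(n):
    return n + n * (n - 1) // 2

print(iqp_gate_count(10))  # 55

# IQP circuit executions PennyLane performs over training:
# iterations x (forward + backward) x batch size
iters, batch_size = 200, 10
print(iters * 2 * batch_size)  # 4000
```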

Limitations

  • At very low qubit counts (4 qubits), the CPU speedup is minimal (1.03x) because
    the IQP circuit is small relative to variational layers.
  • The benchmark uses default.qubit (CPU) and lightning.gpu (GPU) state vector
    simulators. Results on real quantum hardware would differ.
  • Accuracy is limited by the small dataset (200 samples) and binary classification
    task, not by the encoding method.

5. Reproduction

Prerequisites

cd qdp/qdp-python
uv sync --group benchmark

Experiment 2: Accuracy Parity

# PL-CPU
uv run --group benchmark python benchmark/encoding_benchmarks/pennylane_baseline/svhn_iqp.py \
  --n-qubits 6 --n-samples 200 --iters 200 --batch-size 10 --layers 4 \
  --lr 0.01 --optimizer adam --trials 1 --early-stop 0 --seed 42 --backend cpu

# QDP-CPU
uv run --group benchmark python benchmark/encoding_benchmarks/qdp_pipeline/svhn_iqp.py \
  --n-qubits 6 --n-samples 200 --iters 200 --batch-size 10 --layers 4 \
  --lr 0.01 --optimizer adam --trials 1 --early-stop 0 --seed 42 --backend cpu

# PL-GPU
uv run --group benchmark python benchmark/encoding_benchmarks/pennylane_baseline/svhn_iqp.py \
  --n-qubits 6 --n-samples 200 --iters 200 --batch-size 10 --layers 4 \
  --lr 0.01 --optimizer adam --trials 1 --early-stop 0 --seed 42 --backend gpu

# QDP-GPU
uv run --group benchmark python benchmark/encoding_benchmarks/qdp_pipeline/svhn_iqp.py \
  --n-qubits 6 --n-samples 200 --iters 200 --batch-size 10 --layers 4 \
  --lr 0.01 --optimizer adam --trials 1 --early-stop 0 --seed 42 --backend gpu

Experiment 1: Qubit Scaling

# Run all configs for a given qubit count (e.g., 10 qubits):
for script in pennylane_baseline/svhn_iqp.py qdp_pipeline/svhn_iqp.py; do
  for backend in cpu gpu; do
    uv run --group benchmark python benchmark/encoding_benchmarks/$script \
      --n-qubits 10 --n-samples 200 --iters 200 --batch-size 10 --layers 4 \
      --lr 0.01 --optimizer adam --trials 3 --early-stop 0 --seed 42 --backend $backend
  done
done

@ryankert01 ryankert01 changed the title feat: add QDP pipeline for SVHN IQP variational classifier with one-time encoding [QDP] Add SVHN IQP encoding benchmark with PennyLane baseline and QDP pipeline Mar 15, 2026
@ryankert01 ryankert01 marked this pull request as draft March 20, 2026 12:29
@ryankert01
Member Author

ryankert01 commented Mar 20, 2026

It's not production code right now; I'll fix it when I have time.

edit: it's ready now.

@ryankert01 ryankert01 marked this pull request as ready for review March 29, 2026 07:58
@ryankert01 ryankert01 force-pushed the svhn-iqp branch 2 times, most recently from 3d00ce7 to edb9e21 Compare March 29, 2026 07:59
@ryankert01 ryankert01 requested a review from rich7420 March 29, 2026 09:08
@ryankert01
Member Author

Ready for review!

Contributor

Copilot AI left a comment


Pull request overview

Adds a new SVHN (digit 1 vs 7) IQP-encoding benchmark to compare PennyLane’s gate-by-gate embedding against a QDP one-shot encoding + StatePrep training pipeline, including CPU and GPU training backends.

Changes:

  • Add SVHN IQP benchmark scripts for (1) pure PennyLane baseline and (2) QDP encoding + StatePrep pipeline.
  • Update benchmark dependency group to include newer PennyLane/Lightning GPU (py>=3.11) and SciPy (for SVHN .mat loading).
  • Update benchmark README with SVHN IQP usage instructions and refresh lockfiles.

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
|---|---|
| uv.lock | Updates locked dependencies and benchmark requirement markers (incl. PennyLane/Lightning, SciPy). |
| qdp/qdp-python/uv.lock | Updates QDP Python workspace lock to include PennyLane 0.44.x + Lightning GPU deps. |
| qdp/qdp-python/pyproject.toml | Adds benchmark deps for PennyLane>=0.44 / Lightning GPU (py>=3.11) and SciPy. |
| qdp/qdp-python/benchmark/encoding_benchmarks/qdp_pipeline/svhn_iqp.py | New QDP pipeline benchmark: one-shot IQP encoding via QDP + StatePrep training (CPU/GPU). |
| qdp/qdp-python/benchmark/encoding_benchmarks/pennylane_baseline/svhn_iqp.py | New baseline benchmark: IQP circuit inside QNode (CPU/GPU). |
| qdp/qdp-python/benchmark/encoding_benchmarks/README.md | Documents how to run the new SVHN IQP benchmarks. |


@ryankert01
Member Author

fixed at f5b91a5

@ryankert01
Member Author

PTAL @guan404ming

- Changed revision from 3 to 2.
- Added conditional specifications for 'pennylane' and 'pennylane-lightning' for Python 3.11 and above.
- Included 'scipy' with a minimum version of 1.11.
