Improve time measurement in benchmark #822
Conversation
📝 Walkthrough

Replace per-iteration wall-clock timing with graph-capture-aware benchmark timing: add `Example.benchmark_time`, introduce a Warp kernel to apply random joint targets, capture/launch CUDA graphs where available, and use graph-captured timing for ASV benchmarks instead of direct `time.time()` measurements.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor ASV as ASV Benchmark
    participant Track as track_simulate()
    participant EX as Example
    participant GPU as Device / CUDA Graph
    ASV->>Track: run benchmark
    Track->>EX: reset()
    loop per-benchmark-step
        Track->>GPU: synchronize()
        Track->>EX: step()
        Note right of EX: sync -> start_time<br/>(capture & launch graph) or simulate<br/>end_time -> accumulate benchmark_time
    end
    Track->>GPU: synchronize()
    Track-->>ASV: ms_per_env_step = (EX.benchmark_time / steps) * 1000
```
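For orientation, here is a minimal sketch of the two pieces the diagram references: a Warp kernel that writes random joint targets (so the control update is a kernel launch that CUDA graph capture can record) and a graph-aware timed step. Names such as `apply_random_control`, `use_cuda_graph`, and `benchmark_time` mirror the PR; the `timed_step` helper is a simplified illustration under those assumptions, not the actual implementation.

```python
import time

import warp as wp


@wp.kernel
def apply_random_control(seed: int, joint_target: wp.array(dtype=float)):
    # One thread per joint DOF: write a random target in [-1, 1) so the
    # control update can be baked into the captured CUDA graph.
    tid = wp.tid()
    state = wp.rand_init(seed, tid)
    joint_target[tid] = wp.randf(state, -1.0, 1.0)


def timed_step(example):
    # Synchronize before and after the launch so the wall-clock interval
    # covers only this step's GPU work, whether replayed as a graph or run
    # eagerly, then accumulate into the public benchmark counter.
    wp.synchronize_device()
    start = time.perf_counter()
    if example.use_cuda_graph:
        wp.capture_launch(example.graph)
    else:
        example.simulate()
    wp.synchronize_device()
    example.benchmark_time += time.perf_counter() - start
```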
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
asv/benchmarks/simulation/bench_mujoco.py (5)
**76-96**: Bug: KPI returns only last sample's time but divides by samples

You must accumulate `benchmark_time` across samples before averaging.

Apply this diff:

```diff
     @skip_benchmark_if(wp.get_cuda_device_count() == 0)
     def track_simulate(self, num_envs):
-        for _iter in range(self.samples):
-            example = Example(
+        total = 0.0
+        for _iter in range(self.samples):
+            example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
             )
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total += example.benchmark_time
+        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
```
**148-170**: Bug: KPI Cartpole averages without summing samples

Accumulate time across samples before dividing.

Apply this diff:

```diff
     @skip_benchmark_if(wp.get_cuda_device_count() == 0)
     def track_simulate(self, num_envs):
-        for _iter in range(self.samples):
+        total = 0.0
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total += example.benchmark_time
+        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
```
**222-244**: Bug: KPI G1 averages without summing samples

Accumulate `benchmark_time` across samples.

Apply this diff:

```diff
     @skip_benchmark_if(wp.get_cuda_device_count() == 0)
     def track_simulate(self, num_envs):
-        for _iter in range(self.samples):
+        total = 0.0
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total += example.benchmark_time
+        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
```
**296-318**: Bug: KPI H1 averages without summing samples

Sum per-sample times before averaging.

Apply this diff:

```diff
     @skip_benchmark_if(wp.get_cuda_device_count() == 0)
     def track_simulate(self, num_envs):
-        for _iter in range(self.samples):
+        total = 0.0
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total += example.benchmark_time
+        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
```
**369-391**: Bug: KPI Humanoid averages without summing samples

Accumulate over samples to avoid underreporting.

Apply this diff:

```diff
     @skip_benchmark_if(wp.get_cuda_device_count() == 0)
     def track_simulate(self, num_envs):
-        for _iter in range(self.samples):
+        total = 0.0
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total += example.benchmark_time
+        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
```
🧹 Nitpick comments (1)
newton/examples/example_mujoco.py (1)
**284-294**: Use monotonic high-resolution timer for benchmarking

`time.perf_counter()` (or `perf_counter_ns`) is preferred over `time.time()` to avoid clock adjustments and improve resolution.

Apply this diff:

```diff
-        wp.synchronize_device()
-        start_time = time.time()
+        wp.synchronize_device()
+        start_time = time.perf_counter()
         if self.use_cuda_graph:
             wp.capture_launch(self.graph)
         else:
             self.simulate()
-        wp.synchronize_device()
-        end_time = time.time()
+        wp.synchronize_device()
+        end_time = time.perf_counter()
         self.benchmark_time += end_time - start_time
```
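As a side note, Warp also ships a `wp.ScopedTimer` that wraps the same synchronize-measure-synchronize pattern; a minimal sketch, assuming `example` holds a previously captured graph:

```python
import warp as wp

# With synchronize=True the timer syncs the device on enter and exit, so
# the reported elapsed time (in milliseconds) covers only the GPU work.
with wp.ScopedTimer("graph_step", synchronize=True, print=False) as timer:
    wp.capture_launch(example.graph)
print(f"graph_step: {timer.elapsed:.3f} ms")
```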
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- asv/benchmarks/simulation/bench_mujoco.py (10 hunks)
- newton/examples/example_mujoco.py (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-19T12:27:30.629Z
Learnt from: preist-nvidia
PR: newton-physics/newton#579
File: newton/examples/example_mujoco.py:350-354
Timestamp: 2025-08-19T12:27:30.629Z
Learning: In Newton examples, there's a distinction between solver parameters and Example class attributes. The Example class can have its own use_mujoco attribute for controlling example-level behavior (like CUDA graphs, rendering logic), while the solver uses use_mujoco_cpu for backend selection. These serve different purposes and should not be conflated during API renames.
Applied to files:
newton/examples/example_mujoco.py
📚 Learning: 2025-07-14T03:57:29.670Z
Learnt from: Kenny-Vilella
PR: newton-physics/newton#398
File: newton/examples/example_mujoco.py:352-352
Timestamp: 2025-07-14T03:57:29.670Z
Learning: The use_mujoco option in newton/examples/example_mujoco.py is currently unsupported and causes crashes. The code automatically disables this option with a warning message when users attempt to enable it. This is intentionally kept as a placeholder for future implementation.
Applied to files:
newton/examples/example_mujoco.py
🧬 Code graph analysis (2)

asv/benchmarks/simulation/bench_mujoco.py (1)
- newton/examples/example_mujoco.py (1): step (279-294)

newton/examples/example_mujoco.py (1)
- newton/examples/robot/example_robot_anymal_c_walk.py (1): simulate (188-198)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run GPU Unit Tests on AWS EC2 (Pull Request)
- GitHub Check: Run GPU Benchmarks (Pull Request)
🔇 Additional comments (1)

newton/examples/example_mujoco.py (1)

**216**: LGTM: Dedicated `benchmark_time` counter added. A public accumulator is appropriate for downstream benchmarks.
```python
    def track_simulate(self):
        for _ in range(self.num_frames):
            self.example.step()
        wp.synchronize_device()

        return self.example.benchmark_time * 1000 / (self.num_frames * self.example.sim_substeps * self.num_envs)

    track_simulate.unit = "ms/env-step"
```
This is what I started with, but I don't recommend it since `track` benchmarks only run once.
Suggestion is to leave the `Fast*` benchmarks reporting as `time_` benchmarks.
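For context on the trade-off being discussed: ASV times `time_*` methods itself, calling the body repeatedly, while `track_*` methods run once and report whatever number they return. A generic sketch of the two styles (the `do_work` placeholder is hypothetical and stands in for `example.step()`):

```python
import time


def do_work():
    # Hypothetical workload standing in for example.step().
    sum(range(10_000))


class BenchStyles:
    # time_* benchmark: ASV measures and repeats the call itself, so any
    # Python overhead inside the body is part of the reported number.
    def time_workload(self):
        do_work()

    # track_* benchmark: ASV runs this once and records the return value,
    # so the method must do its own sampling and averaging internally.
    def track_workload(self):
        n = 100
        start = time.perf_counter()
        for _ in range(n):
            do_work()
        return (time.perf_counter() - start) / n * 1000

    track_workload.unit = "ms/iteration"
```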
Yep, but that means we measure the time taken by everything, including all the Python code, applying the control, and so on.
A possible approach would be to reimplement the step function here; that way we may get the best of both worlds.
That sounds promising!
Updated the FastAnt benchmark.
If you are OK with it, I will update the other Fast benchmarks.
It is not perfect, but at least the random actuation is included in the CUDA graph, so we get a more robust time measurement.
Changed all the Fast benchmarks
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
asv/benchmarks/simulation/bench_mujoco.py (5)
**99-117**: Average across every sample, not just the final one

After looping over `self.samples`, we only read the last `Example`'s `benchmark_time` while still dividing by `self.samples`, so earlier samples are dropped from the numerator. The same pattern appears in the other KPI benchmarks.

```diff
-        for _iter in range(self.samples):
+        total_time = 0.0
+        sim_substeps = None
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
                 ls_iteration=10,
             )
             for _ in range(self.num_frames):
                 example.step()
-            wp.synchronize_device()
-
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total_time += example.benchmark_time
+            sim_substeps = example.sim_substeps
+            wp.synchronize_device()
+
+        return total_time * 1000 / (self.num_frames * sim_substeps * num_envs * self.samples)
```
**167-186**: Incorporate every sample when computing KPI Cartpole timing

Same issue here: only the last sample's runtime survives, yet the denominator still multiplies by `self.samples`. Please accumulate the per-sample `benchmark_time` before averaging.

```diff
-        for _iter in range(self.samples):
+        total_time = 0.0
+        sim_substeps = None
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
                 ls_iteration=3,
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total_time += example.benchmark_time
+            sim_substeps = example.sim_substeps
+
+        return total_time * 1000 / (self.num_frames * sim_substeps * num_envs * self.samples)
```
**238-257**: Sum benchmark time across KPI G1 samples

Here too, the numerator drops all but the last sample. Please accumulate before dividing by `self.samples`.

```diff
-        for _iter in range(self.samples):
+        total_time = 0.0
+        sim_substeps = None
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
                 ls_iteration=10,
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total_time += example.benchmark_time
+            sim_substeps = example.sim_substeps
+
+        return total_time * 1000 / (self.num_frames * sim_substeps * num_envs * self.samples)
```
**309-328**: Include every KPI H1 sample in the reported average

Identical averaging bug: we divide by `self.samples` but only keep the final iteration's time. Please accumulate `benchmark_time` across the loop.

```diff
-        for _iter in range(self.samples):
+        total_time = 0.0
+        sim_substeps = None
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
                 ls_iteration=10,
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total_time += example.benchmark_time
+            sim_substeps = example.sim_substeps
+
+        return total_time * 1000 / (self.num_frames * sim_substeps * num_envs * self.samples)
```
**379-398**: Aggregate KPI Humanoid runtimes across all samples

Once more, only the last sample contributes to the numerator. Please add each sample's `benchmark_time` before dividing.

```diff
-        for _iter in range(self.samples):
+        total_time = 0.0
+        sim_substeps = None
+        for _iter in range(self.samples):
             example = Example(
                 stage_path=None,
                 robot=self.robot,
                 randomize=True,
                 headless=True,
                 actuation="random",
                 num_envs=num_envs,
                 use_cuda_graph=True,
                 builder=self.builder[num_envs],
                 ls_iteration=15,
             )
             wp.synchronize_device()
             for _ in range(self.num_frames):
                 example.step()
             wp.synchronize_device()
-
-        return example.benchmark_time * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)
+            total_time += example.benchmark_time
+            sim_substeps = example.sim_substeps
+
+        return total_time * 1000 / (self.num_frames * sim_substeps * num_envs * self.samples)
```
🧹 Nitpick comments (1)
newton/examples/example_mujoco.py (1)
**306-315**: Prefer a monotonic clock for benchmarking accuracy

`time.time()` can jump when the system clock is adjusted, which skews the accumulated `benchmark_time`. Switching to `time.perf_counter()` keeps the measurement monotonic and higher resolution.

```diff
-        start_time = time.time()
+        start_time = time.perf_counter()
@@
-        end_time = time.time()
+        end_time = time.perf_counter()
```
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)

- asv/benchmarks/simulation/bench_mujoco.py (7 hunks)
- newton/examples/example_mujoco.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-14T03:57:29.670Z
Learnt from: Kenny-Vilella
PR: newton-physics/newton#398
File: newton/examples/example_mujoco.py:352-352
Timestamp: 2025-07-14T03:57:29.670Z
Learning: The use_mujoco option in newton/examples/example_mujoco.py is currently unsupported and causes crashes. The code automatically disables this option with a warning message when users attempt to enable it. This is intentionally kept as a placeholder for future implementation.
Applied to files:
newton/examples/example_mujoco.py
🧬 Code graph analysis (2)

newton/examples/example_mujoco.py (2)
- newton/examples/robot/example_robot_h1.py (1): simulate (97-108)
- newton/examples/selection/example_selection_materials.py (1): simulate (167-181)

asv/benchmarks/simulation/bench_mujoco.py (3)
- newton/examples/robot/example_robot_anymal_c_walk.py (2): capture (178-186), simulate (188-198)
- newton/_src/sim/state.py (1): joint_dof_count (108-112)
- newton/examples/example_mujoco.py (1): simulate (294-298)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run GPU Benchmarks (Pull Request)
- GitHub Check: Run GPU Unit Tests on AWS EC2 (Pull Request)
Actionable comments posted: 5
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- asv/benchmarks/simulation/bench_mujoco.py (7 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

asv/benchmarks/simulation/bench_mujoco.py (4)
- newton/examples/robot/example_robot_anymal_c_walk.py (2): capture (178-186), simulate (188-198)
- newton/examples/selection/example_selection_cartpole.py (2): capture (126-131), simulate (133-141)
- newton/_src/sim/state.py (1): joint_dof_count (108-112)
- newton/examples/example_mujoco.py (1): simulate (294-298)
⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run GPU Unit Tests on AWS EC2 (Pull Request)
- GitHub Check: Run GPU Benchmarks (Pull Request)
🔇 Additional comments (1)
asv/benchmarks/simulation/bench_mujoco.py (1)
**51-75**: Initialize `self.graph` when graph capture is unavailable.

If mempool capture is off, `time_simulate()` still calls `wp.capture_launch(self.graph)`, but `self.graph` is never assigned. That crashes the benchmark instead of falling back to the pre-recorded graph in the `Example`. Please seed `self.graph` (e.g., from `self.example.graph`) before the availability check and only overwrite it when the recapture succeeds.

```diff
-        cuda_graph_comp = wp.get_device().is_cuda and wp.is_mempool_enabled(wp.get_device())
-        if not cuda_graph_comp:
-            print("Cannot use graph capture. Graph capture is disabled.")
-        else:
+        cuda_graph_comp = wp.get_device().is_cuda and wp.is_mempool_enabled(wp.get_device())
+        self.graph = self.example.graph
+        if not cuda_graph_comp:
+            print("Cannot use graph capture. Graph capture is disabled.")
+        else:
             state = wp.rand_init(self.example.seed)
             with wp.ScopedCapture() as capture:
                 wp.launch(
                     apply_random_control,
                     dim=(self.example.model.joint_dof_count,),
                     inputs=[state],
                     outputs=[self.example.control.joint_target],
                 )
                 self.example.simulate()
-            self.graph = capture.graph
+            self.graph = capture.graph
+            self.example.graph = self.graph
```
```python
        # Recapture the graph with control application included
        cuda_graph_comp = wp.get_device().is_cuda and wp.is_mempool_enabled(wp.get_device())
        if not cuda_graph_comp:
            print("Cannot use graph capture. Graph capture is disabled.")
```
Suggest changing this to raising a `SkipNotImplemented` (`from asv_runner.benchmarks.mark import SkipNotImplemented`).
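A minimal sketch of that suggestion (a hypothetical fragment of the benchmark's `setup`; raising `SkipNotImplemented` during setup makes ASV report the benchmark as skipped rather than silently timing a fallback path):

```python
import warp as wp
from asv_runner.benchmarks.mark import SkipNotImplemented


class FastAnt:  # illustrative fragment, not the full benchmark class
    def setup(self):
        cuda_graph_comp = wp.get_device().is_cuda and wp.is_mempool_enabled(wp.get_device())
        if not cuda_graph_comp:
            # Skip instead of print-and-continue so results stay comparable.
            raise SkipNotImplemented("graph capture unavailable on this device")
```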
```python
        cuda_graph_comp = wp.get_device().is_cuda and wp.is_mempool_enabled(wp.get_device())
        if not cuda_graph_comp:
            print("Cannot use graph capture. Graph capture is disabled.")
        else:
            state = wp.rand_init(self.example.seed)
            with wp.ScopedCapture() as capture:
                wp.launch(
                    apply_random_control,
                    dim=(self.example.model.joint_dof_count,),
                    inputs=[state],
                    outputs=[self.example.control.joint_target],
                )
                self.example.simulate()
            self.graph = capture.graph
```
Might be a good time to define a base class or two for these benchmarks to inherit from, to reuse code like Gilles did for benchmarks such as https://github.com/NVIDIA/warp/blob/main/asv/benchmarks/fem/integrate.py
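A rough sketch of that refactor, under the assumption that the per-robot classes differ only in configuration and that `Example` is importable from newton/examples/example_mujoco.py (the base-class name, `ls_iteration` defaults, and `_make_example` helper below are hypothetical):

```python
import warp as wp


class KpiBenchmarkBase:
    """Shared KPI logic; subclasses pin the robot-specific configuration."""

    robot = None
    ls_iteration = 10
    num_frames = 50
    samples = 4

    def _make_example(self, num_envs):
        # Hypothetical helper collapsing the Example(...) construction
        # repeated across the per-robot benchmarks in this file.
        return Example(
            stage_path=None, robot=self.robot, randomize=True, headless=True,
            actuation="random", num_envs=num_envs, use_cuda_graph=True,
            builder=self.builder[num_envs], ls_iteration=self.ls_iteration,
        )

    def track_simulate(self, num_envs):
        total = 0.0
        for _ in range(self.samples):
            example = self._make_example(num_envs)
            for _ in range(self.num_frames):
                example.step()
            wp.synchronize_device()
            total += example.benchmark_time
        return total * 1000 / (self.num_frames * example.sim_substeps * num_envs * self.samples)

    track_simulate.unit = "ms/env-step"


class KpiHumanoid(KpiBenchmarkBase):
    robot = "humanoid"
    ls_iteration = 15
```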
Description
This will fix #745 and is also related to #652.
Newton Migration Guide

Please ensure the migration guide for warp.sim users is up-to-date with the changes made in this PR.

- `docs/migration.rst` is up to date

Before your PR is "Ready for review"

- Tests are added or updated where needed (`newton/tests/test_examples.py`)
- `pre-commit run -a` passes
Summary by CodeRabbit
New Features
Refactor
Chores