[Remyx Recommendation] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation by github-actions[bot] · Pull Request #73 · remyxai/VQASynth

github-actions · 2026-05-29T04:23:50Z

Drafted by an autonomous discovery loop — Remyx ranks recent arXiv papers against this team's research interest and shipping history; Claude Code selects the candidate most directly implementable against this repo from the lookback window and drafts it.

Recommended paper: SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Confidence: 🟢 high (Remyx relevance 0.96)
Research interest: VQASynth
Implementation by: Claude Code as autonomous agent

Why this paper for this team

VQASynth's 'Add optional pipeline for chain-of-thought reasoning' generates data for spatial understanding. SpaceDG introduces a benchmark for evaluating spatial intelligence under visual degradations, revealing a significant robustness gap in current MLLMs. This directly extends and complements VQASynth's CoT generation by providing insights into how robust the generated reasoning data is when visual inputs are imperfect. Understanding these degradation-induced failure modes can help VQASynth evolve its CoT generation to be more resilient, perhaps by explicitly incorporating reasoning about visual uncertainty or by generating data reflecting challenging conditions, thereby addressing the 'Generalization to Diverse Scenes' open problem.

Why this candidate (selected from the lookback pool)

SpaceDG is a spatial-reasoning evaluation benchmark over (degraded) static images, which drops directly into the repo's existing evaluation stage: a new dataset loader in benchmarks.py, degradation-aware scoring in evaluation.py, and execution through the existing inference.py VLM wrapper. It requires no new trainer, model, or 3D infrastructure — only code paths the repo already calls for its multi-benchmark eval stage.

Suggested experiment

Apply a subset of SpaceDG's synthetic degradations (e.g., motion blur, low light) to a small batch of VQASynth's input images. Generate CoT-based QA pairs for these degraded images using the pipeline added in 'Add optional pipeline for chain-of-thought reasoning'. Compare the quality, correctness, and logical consistency of the generated reasoning against the reasoning generated from the pristine images to identify which degradations the pipeline is sensitive to.

What this PR actually does

Call site: docker/eval_stage/process_eval.py run_eval() — the existing eval-stage CLI now constructs BenchmarkRunner(degradation=..., severity=...) and calls runner.degrade_items(items) before run_inference_on_benchmark; degradation also flows through BenchmarkRunner.get_benchmark_items()
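The call-site wiring described above can be sketched roughly as follows. This is a minimal illustration, not the code in process_eval.py: the flag names and operator names come from this description, while StubRunner, the default severity, and the dict-shaped items are hypothetical stand-ins for BenchmarkRunner and the real benchmark items.

```python
import argparse

# Operator names are taken from the PR description.
DEGRADATIONS = [
    "motion_blur", "defocus_blur", "low_light", "gaussian_noise",
    "jpeg_compression", "fog", "contrast_loss", "pixelate", "color_shift",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Eval stage (illustrative sketch)")
    parser.add_argument("--degradation", choices=DEGRADATIONS, default=None,
                        help="Operator to apply before inference; omit for a clean run.")
    # The default of 3 is an assumption, not the repo's documented default.
    parser.add_argument("--severity", type=int, choices=range(1, 6), default=3,
                        help="Severity on the 1-5 scale.")
    return parser


class StubRunner:
    """Hypothetical stand-in for BenchmarkRunner; it only tags items."""

    def __init__(self, degradation=None, severity=3):
        self.degradation = degradation
        self.severity = severity

    def degrade_items(self, items):
        if self.degradation is None:
            return items  # clean run: items pass through untouched
        return [dict(item, degradation=self.degradation, severity=self.severity)
                for item in items]


def run_eval(argv, items):
    args = build_parser().parse_args(argv)
    runner = StubRunner(degradation=args.degradation, severity=args.severity)
    return runner.degrade_items(items)  # inference would follow this step
```

Leaving --degradation unset keeps the clean path identical to the pre-existing behaviour in this sketch, which is what makes a clean-vs-degraded comparison meaningful.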

Implemented from the paper:

Nine image-space degradation operators on a 1-5 severity scale (motion_blur, defocus_blur, low_light, gaussian_noise, jpeg_compression, fog, contrast_loss, pixelate, color_shift) in new vqasynth/image_degradation.py, built on PIL+numpy with deterministic (seeded) noise
apply_degradation()/degrade_images() helpers, including heterogeneous-input coercion (PIL, file path, raw bytes, HF {'bytes':...} dicts) with pass-through for un-coercible entries
BenchmarkRunner extended with degradation/severity params and a degrade_items() method; get_benchmark_items() now degrades loaded images so the existing multi-benchmark eval can run on clean vs. degraded inputs
Wired into the existing eval CLI: docker/eval_stage/process_eval.py gains --degradation and --severity flags, passes them to BenchmarkRunner, and calls runner.degrade_items(items) before inference, enabling the clean-vs-degraded robustness-gap measurement
README section documenting the operators, the CLI usage, and the scope limits
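To illustrate what a severity-scaled, seeded operator might look like, here is a dependency-free sketch of additive Gaussian noise. It is not the code in vqasynth/image_degradation.py: the PR builds on PIL+numpy, whereas this version works on a flat list of 0-255 intensities, and the severity-to-sigma table is a hypothetical choice made only for the example.

```python
import random

# Hypothetical mapping from severity (1-5) to noise standard deviation,
# expressed in 0-255 intensity units.
SIGMA_BY_SEVERITY = {1: 4.0, 2: 8.0, 3: 16.0, 4: 32.0, 5: 64.0}


def gaussian_noise(pixels, severity, seed=0):
    """Add zero-mean Gaussian noise to 0-255 intensities, clamped to range."""
    if severity not in SIGMA_BY_SEVERITY:
        raise ValueError("severity must be an integer from 1 to 5")
    rng = random.Random(seed)  # private generator: same seed, same noise
    sigma = SIGMA_BY_SEVERITY[severity]
    noisy = []
    for value in pixels:
        perturbed = round(value + rng.gauss(0.0, sigma))
        noisy.append(min(255, max(0, perturbed)))  # clamp to valid intensities
    return noisy
```

Using a private random.Random instance rather than the module-level generator keeps the output reproducible regardless of what other code has drawn from the global random state, which matters when the same degraded inputs must be regenerated across eval runs.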

Stubbed / left out:

SpaceDG's core contribution — the physically grounded degradation synthesis engine embedding degradation into 3D Gaussian Splatting rendering — is not reproduced (substituted with ImageNet-C-style image-space approximations); requires neural 3DGS rendering infrastructure
The SpaceDG dataset (~1M QA pairs, ~1,000 indoor scenes) and the SpaceDG-Bench human-verified benchmark (1,102 questions, 11 reasoning categories, 9 degradation types) are not added as loaders (requires dataset release/hosting)
Degradation-aware finetuning shown by the paper to close the robustness gap is not implemented (requires training infra; out of scope for the eval stage)
No automated clean-vs-degraded gap reporting/aggregation: the gap must be derived by running the eval twice and comparing reports manually

I built a self-contained image-space corruption library (nine ImageNet-C-style operators) and wired it into the pre-existing eval stage: the process_eval.py CLI gains --degradation/--severity flags and degrades every benchmark image before inference, so the product genuinely exercises the new code (not an orphan). What is faithfully delivered is the robustness-evaluation methodology — re-running existing spatial benchmarks under degradation to expose the clean-vs-degraded gap. What is NOT present is the paper's actual primary contribution: the physically grounded 3DGS degradation-synthesis engine, the released SpaceDG dataset, and the human-verified SpaceDG-Bench. The degradations here are cheap perceptual approximations, not the paper's physically grounded renders, and there is no automated gap-reporting — the user must run and diff two reports themselves.
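Since the gap has to be derived by hand, a small helper along these lines could do the diff between two runs. It assumes each report can be reduced to a flat mapping from benchmark name to accuracy; that shape, the function name, and the benchmark names and numbers in the usage lines are all hypothetical placeholders, not the eval stage's actual report format or any measured result.

```python
def robustness_gap(clean_report, degraded_report):
    """Return the accuracy drop (clean minus degraded) per shared benchmark."""
    gaps = {}
    # Only benchmarks present in both runs can be compared.
    for name in sorted(clean_report.keys() & degraded_report.keys()):
        gaps[name] = clean_report[name] - degraded_report[name]
    return gaps


# Placeholder numbers for illustration only; these are not measured results.
clean = {"bench_a": 0.75, "bench_b": 0.5}
degraded = {"bench_a": 0.5, "bench_b": 0.5, "bench_c": 0.25}
print(robustness_gap(clean, degraded))
```

A positive value means the degraded run scored lower than the clean run on that benchmark; benchmarks that appear in only one report are skipped rather than treated as zero.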

Test results

✅ All tests passed.

Opened by the Remyx Recommendation orchestrator.

The action's candidate-selection + role-based-guardrails work (v1.0.6) is validated end-to-end on this repo (draft PR #73). Drop the rate-limit-days: '0' override that bypassed the per-run rate limit during validation, so the weekly cron falls back to the action default (7 days). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

[Remyx Recommendation] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

cdbc540

This was referenced May 29, 2026

[Remyx Recommendation] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation #72

Closed

Restore Remyx Recommendation weekly cadence after validation #74

Merged

smellslikeml mentioned this pull request May 31, 2026

Set rate-limit-days: '0' on VQASynth for daily-candidate cadence #76

Merged

3 tasks

github-actions[bot] wants to merge 1 commit into main from remyx-recommendation/2605.22536v1
