
[Remyx Recommendation] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation #73

Draft: github-actions[bot] wants to merge 1 commit into main from remyx-recommendation/2605.22536v1



@github-actions


Drafted by an autonomous discovery loop — Remyx ranks recent arXiv papers against this team's research interest and shipping history; Claude Code selects the candidate most directly implementable against this repo from the lookback window and drafts it.

Recommended paper: SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Confidence: 🟢 high (Remyx relevance 0.96)
Research interest: VQASynth
Implementation by: Claude Code as autonomous agent


Why this paper for this team

VQASynth's 'Add optional pipeline for chain-of-thought reasoning' work generates data for spatial understanding. SpaceDG introduces a benchmark for evaluating spatial intelligence under visual degradation and reveals a significant robustness gap in current MLLMs. It complements VQASynth's CoT generation by showing how robust the generated reasoning data is when visual inputs are imperfect. Understanding these degradation-induced failure modes could help VQASynth make its CoT generation more resilient, for example by explicitly incorporating reasoning about visual uncertainty or by generating data that reflects challenging conditions, which addresses the 'Generalization to Diverse Scenes' open problem.

Why this candidate (selected from the lookback pool)

SpaceDG is a spatial-reasoning evaluation benchmark over (degraded) static images, which drops directly into the repo's existing evaluation stage: a new dataset loader in benchmarks.py, degradation-aware scoring in evaluation.py, and execution through the existing inference.py VLM wrapper. It requires no new trainer, model, or 3D infrastructure — only code paths the repo already calls for its multi-benchmark eval stage.

Suggested experiment

Apply a subset of SpaceDG's synthetic degradations (e.g., motion blur, low light) to a small batch of VQASynth's input images. Generate CoT-based QA pairs for the degraded images using the pipeline from 'Add optional pipeline for chain-of-thought reasoning'. Compare the quality, correctness, and logical consistency of the generated reasoning against the reasoning generated from the pristine images to identify sensitivity to each degradation.


What this PR actually does

Call site: run_eval() in docker/eval_stage/process_eval.py. The existing eval-stage CLI now constructs BenchmarkRunner(degradation=..., severity=...) and calls runner.degrade_items(items) before run_inference_on_benchmark; degradation also flows through BenchmarkRunner.get_benchmark_items().
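To illustrate the CLI surface described above, here is a minimal sketch of how the two flags might be declared with argparse. The helper names (`build_parser`, `parse_degradation_args`), the default severity of 3, and the use of `choices` for validation are assumptions made for illustration; the PR's actual argument handling in process_eval.py is not shown on this page.

```python
import argparse
from typing import Optional, Sequence, Tuple

# Operator names as listed in this PR description.
DEGRADATIONS = [
    "motion_blur", "defocus_blur", "low_light", "gaussian_noise",
    "jpeg_compression", "fog", "contrast_loss", "pixelate", "color_shift",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the two degradation flags."""
    parser = argparse.ArgumentParser(description="Eval-stage degradation flags (sketch)")
    parser.add_argument(
        "--degradation",
        choices=DEGRADATIONS,
        default=None,  # None means: evaluate on clean images
        help="Image-space degradation to apply before inference.",
    )
    parser.add_argument(
        "--severity",
        type=int,
        choices=range(1, 6),  # the 1-5 severity scale
        default=3,            # assumed default, not taken from the PR
        help="Degradation severity from 1 (mild) to 5 (strong).",
    )
    return parser


def parse_degradation_args(argv: Sequence[str]) -> Tuple[Optional[str], int]:
    """Parse argv and return (degradation, severity)."""
    args = build_parser().parse_args(argv)
    return args.degradation, args.severity
```

With this shape, omitting `--degradation` leaves the clean path unchanged, which keeps the two runs needed for a clean-vs-degraded comparison on the same entry point.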

Implemented from the paper:

  • Nine image-space degradation operators on a 1-5 severity scale (motion_blur, defocus_blur, low_light, gaussian_noise, jpeg_compression, fog, contrast_loss, pixelate, color_shift) in new vqasynth/image_degradation.py, built on PIL+numpy with deterministic (seeded) noise
  • apply_degradation()/degrade_images() helpers, including heterogeneous-input coercion (PIL, file path, raw bytes, HF {'bytes':...} dicts) with pass-through for un-coercible entries
  • BenchmarkRunner extended with degradation/severity params and a degrade_items() method; get_benchmark_items() now degrades loaded images so the existing multi-benchmark eval can run on clean vs. degraded inputs
  • Wired into the existing eval CLI: docker/eval_stage/process_eval.py gains --degradation and --severity flags, passes them to BenchmarkRunner, and calls runner.degrade_items(items) before inference, enabling the clean-vs-degraded robustness-gap measurement
  • README section documenting the operators, the CLI usage, and the scope limits
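As a rough illustration of the operator design listed above, the sketch below implements two of the nine degradations (gaussian_noise and low_light) on a 1-5 severity scale with seeded noise. It is not the code in vqasynth/image_degradation.py: the severity constants, the `apply_degradation_sketch` name, and the choice to operate on uint8 NumPy arrays rather than PIL images are all assumptions for the sake of a short, self-contained example.

```python
import numpy as np

# Hypothetical severity tables; the PR's actual constants are not shown here.
NOISE_SIGMA = {1: 4.0, 2: 8.0, 3: 16.0, 4: 24.0, 5: 32.0}
LOW_LIGHT_GAIN = {1: 0.8, 2: 0.6, 3: 0.45, 4: 0.3, 5: 0.2}


def gaussian_noise(image: np.ndarray, severity: int, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise; a fixed seed makes the output repeatable."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, NOISE_SIGMA[severity], size=image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)


def low_light(image: np.ndarray, severity: int) -> np.ndarray:
    """Darken the image by scaling pixel intensities toward zero."""
    dark = image.astype(np.float64) * LOW_LIGHT_GAIN[severity]
    return np.clip(dark, 0, 255).astype(np.uint8)


OPERATORS = {"gaussian_noise": gaussian_noise, "low_light": low_light}


def apply_degradation_sketch(image: np.ndarray, name: str, severity: int) -> np.ndarray:
    """Dispatch to one operator after validating its name and severity."""
    if name not in OPERATORS:
        raise ValueError(f"unknown degradation: {name!r}")
    if severity not in range(1, 6):
        raise ValueError("severity must be an integer from 1 to 5")
    return OPERATORS[name](image, severity)
```

Seeding the generator inside the operator means a clean run and a degraded run can be repeated and compared without the noise pattern changing between invocations.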

Stubbed / left out:

  • SpaceDG's core contribution, the physically grounded degradation synthesis engine that embeds degradation into 3D Gaussian Splatting rendering, is not reproduced. It is substituted with ImageNet-C-style image-space approximations, because the original requires neural 3DGS rendering infrastructure
  • The SpaceDG dataset (~1M QA pairs, ~1,000 indoor scenes) and the SpaceDG-Bench human-verified benchmark (1,102 questions, 11 reasoning categories, 9 degradation types) are not added as loaders (requires dataset release/hosting)
  • Degradation-aware finetuning shown by the paper to close the robustness gap is not implemented (requires training infra; out of scope for the eval stage)
  • No automated clean-vs-degraded gap reporting/aggregation: the gap must be derived by running the eval twice and comparing reports manually

I built a self-contained image-space corruption library (nine ImageNet-C-style operators) and wired it into the pre-existing eval stage: the process_eval.py CLI gains --degradation/--severity flags and degrades every benchmark image before inference, so the new code is exercised by the existing eval path rather than left as an orphan. What is faithfully delivered is the robustness-evaluation methodology: re-running existing spatial benchmarks under degradation to expose the clean-vs-degraded gap. What is NOT present is the paper's primary contribution: the physically grounded 3DGS degradation-synthesis engine, the released SpaceDG dataset, and the human-verified SpaceDG-Bench. The degradations here are cheap perceptual approximations, not the paper's physically grounded renders, and there is no automated gap reporting; the user must run the eval twice and diff the two reports themselves.
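Since the gap has to be derived by hand, the comparison of the two runs can be sketched as follows. The report shape used here (a flat mapping from benchmark name to accuracy) is an assumption for illustration, as are the function names; the eval stage's actual report format is not shown in this PR description.

```python
from typing import Dict


def robustness_gap(clean: Dict[str, float], degraded: Dict[str, float]) -> Dict[str, float]:
    """Per-benchmark accuracy drop (clean minus degraded).

    Only benchmarks present in both reports are compared, so a benchmark
    that was skipped in one run does not produce a misleading gap.
    """
    shared = sorted(set(clean) & set(degraded))
    return {name: clean[name] - degraded[name] for name in shared}


def mean_gap(gaps: Dict[str, float]) -> float:
    """Unweighted mean of the per-benchmark gaps; 0.0 if there are none."""
    return sum(gaps.values()) / len(gaps) if gaps else 0.0
```

A positive gap means accuracy fell under degradation. The unweighted mean treats every benchmark equally regardless of its size, which may or may not be the right aggregate for a given comparison.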

Test results

✅ All tests passed.


Opened by the Remyx Recommendation orchestrator.

salma-remyx added a commit that referenced this pull request May 29, 2026
The action's candidate-selection + role-based-guardrails work (v1.0.6) is
validated end-to-end on this repo (draft PR #73). Drop the rate-limit-days:
'0' override that bypassed the per-run rate limit during validation, so the
weekly cron falls back to the action default (7 days).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>