Improve ROM scan speed with two-phase pipeline and IGDB optimizations #3165

Open
Praeses0 wants to merge 8 commits into rommapp:master from Praeses0:improve-scan-speed

@Praeses0
Summary

  • Split scan into discovery → enrichment phases: ROMs appear in the library within seconds (~14 roms/s), metadata fills in progressively afterward. Previously, each ROM blocked on IGDB API calls before appearing.
  • Fix IGDB rate limiter lock-during-sleep bug: The token bucket was sleeping while holding the asyncio lock, serializing all concurrent IGDB requests to ~1 req/s instead of the intended 4 req/s.
  • Reduce IGDB API calls per ROM: Merge the two-phase filtered/unfiltered search into a single API call with local game_type preference. Eliminates redundant expanded searches. Worst case drops from 6 to 3 calls per ROM.
  • Add per-scan search term dedup cache: Avoids duplicate IGDB API calls when multiple ROMs normalize to the same search term (e.g. regional variants).
  • Show scan phase in UI: Frontend now displays "Discovering" vs "Fetching metadata" phase indicator in both the scan page and admin task progress panel.
  • Increase default SCAN_WORKERS from 1 to 10: Better I/O overlap between hash computation, DB writes, and metadata API calls.

Benchmark Results

Tested with 1,469 ROMs (1,100 Game Boy + 368 Game Gear + 1 Switch) using IGDB metadata:

| Metric | Before | After | Improvement |
|---|---|---|---|
| ROMs visible in library | ~29 min | 36 seconds | ~50x |
| End-to-end scan time | ~29 min | ~9.5 min | ~3x |
| IGDB metadata throughput | 0.83 roms/s | 2.72 roms/s | 3.3x |

The hard ceiling is IGDB's 4 req/s rate limit — these optimizations maximize throughput within that constraint.

Test plan

  • All 714 tests pass (713 existing + 1 new, only pre-existing Docker root-user CHD test fails)
  • New regression test for HASHES scan type persisting ROM hash fields
  • New unit tests for IGDB search cache semantics (positive hit cached, failures not cached, cache clear)
  • Benchmarked on 51-ROM and 1,469-ROM sets
  • trunk fmt and trunk check pass
  • Verified scan UI shows phase indicator during both discovery and enrichment

AI Disclosure

This PR was authored with the assistance of Claude Code (AI).

- Increase default SCAN_WORKERS from 1 to 5, allowing multiple ROMs
  to be scanned concurrently via the existing asyncio semaphore
- Add IGDB API rate limiter (4 req/s token bucket) to prevent 429
  responses that cause expensive 2-second retry penalties
- Parallelize cover, manual, and screenshot downloads using
  asyncio.gather instead of sequential awaits

Benchmark results (51 ROMs, quick scan + IGDB):
  Before: 0.83 roms/s, 61.57s total
  After:  2.80 roms/s, 18.23s total (3.4x faster)

AI-assisted: Claude Code

Fix rate limiter lock-during-sleep bug that serialized concurrent IGDB
requests. The acquire() method was sleeping while holding the asyncio
lock, blocking all other coroutines from proceeding. Now sleeps outside
the lock, enabling true 4 req/s throughput with concurrent workers.
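
A minimal sketch of the sleep-outside-the-lock pattern described above. The class name `TokenBucket` and its internals are illustrative assumptions, not code from this PR's diff; the point is only that the `await asyncio.sleep(...)` happens after the `async with` block has exited:

```python
import asyncio


class TokenBucket:
    """Illustrative token bucket that never sleeps while holding its lock."""

    def __init__(self, rate: float = 4.0) -> None:
        self._rate = rate            # tokens added per second, also the burst size
        self._tokens = rate
        self._lock = asyncio.Lock()
        self._last_refill = None     # set on first acquire, inside a running loop

    async def acquire(self) -> None:
        while True:
            async with self._lock:
                now = asyncio.get_running_loop().time()
                if self._last_refill is None:
                    self._last_refill = now
                elapsed = now - self._last_refill
                self._tokens = min(self._rate, self._tokens + elapsed * self._rate)
                self._last_refill = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                # Time until one full token is available.
                wait = (1.0 - self._tokens) / self._rate
            # The lock is released here, so other coroutines can take or
            # refill tokens while this one sleeps.
            await asyncio.sleep(wait)
```

With the sleep inside the `async with` block instead, every waiting coroutine would queue behind the sleeper, which is the serialization this commit describes.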

Refactor _search_rom to eliminate redundant IGDB API calls. The old
pattern called _search_rom twice (with and without game_type filter),
each making up to 3 API calls including an identical expanded search
fallback. The new approach chains: search with filter -> search without
filter -> expanded search (once), reducing worst-case from 6 to 4 calls
per ROM.

Increase SCAN_WORKERS default from 5 to 10 for better I/O overlap
between hash computation, DB writes, and metadata API calls.

Benchmark results (51 ROMs, IGDB metadata):
- Before: 0.83 roms/s (61.57s total)
- After:  3.74 roms/s (13.63s total)
- Speedup: 4.5x

AI-assisted: Claude Code

Adds offset parameter to _request() and list_games() methods in
IGDBService, enabling paginated queries through the IGDB Apicalypse
API. This is standard IGDB API functionality that was missing from
the adapter.

AI-assisted: Claude Code

…ference

Previously _search_rom made two separate API calls: first with a
game_type filter (main games only), then without if no match was found.
This wastes an API call for every ROM where the filtered search fails.

Now a single unfiltered search is made, with results split into main
game types and other types locally. Main games are tried first, falling
back to DLC/bundle/mod types only if no main game matches. This
preserves the same matching priority while saving ~25% of API calls.
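
The local preference step could look roughly like the sketch below. The function names, the `matches` callback, and the default `main_types` set are hypothetical; the PR's actual set of "main" game types is not shown in this conversation:

```python
def split_by_game_type(results, main_types=frozenset({0})):
    """Partition search results into (main, other) by their game_type field.

    main_types is an assumed placeholder; substitute the ids the handler
    actually treats as main games.
    """
    main, other = [], []
    for game in results:
        bucket = main if game.get("game_type") in main_types else other
        bucket.append(game)
    return main, other


def pick_match(results, matches, main_types=frozenset({0})):
    """Return the first matching main game, else the first matching other type."""
    main, other = split_by_game_type(results, main_types)
    for group in (main, other):
        for game in group:
            if matches(game):
                return game
    return None
```

Because both groups come from the same response, the fallback to DLC, bundle, or mod entries costs no additional request.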

Also adds:
- Per-scan search term dedup cache to avoid duplicate API calls when
  multiple ROMs normalize to the same search term
- game_type field to GAMES_FIELDS for local type filtering

Benchmark (1469 ROMs, IGDB metadata):
- Before: 2.24 roms/s (prior commit)
- After:  2.72 roms/s
- Improvement: ~21%

Combined improvement over baseline (0.83 roms/s):
- 51 ROMs: ~4.5x faster
- 1469 ROMs: ~3.3x faster (539s vs ~1770s estimated baseline)
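
The estimated baseline and the combined speedup follow from the figures quoted above. A small worked check, with helper names chosen only for this example:

```python
def estimated_seconds(rom_count: int, roms_per_second: float) -> float:
    """Time a scan would take at a constant throughput."""
    return rom_count / roms_per_second


# Baseline throughput was measured on the 51-ROM set, so the 1,469-ROM
# baseline is an extrapolation rather than a measurement.
baseline_seconds = estimated_seconds(1469, 0.83)
speedup = baseline_seconds / 539  # 539 s is the measured time after the changes

print(f"baseline ~ {baseline_seconds:.0f} s, speedup ~ {speedup:.1f}x")
```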

AI-assisted: Claude Code

Refactor _identify_rom into two phases:
- Phase 1 (discovery): Create DB entries, hash files, save to DB. No
  metadata API calls. Runs at ~14 roms/s — all ROMs appear in the
  library within seconds.
- Phase 2 (enrichment): Fetch metadata from IGDB/other sources,
  download covers/screenshots. Rate-limited by IGDB API at ~2.3 roms/s.

This improves perceived scan speed significantly: users see their
entire ROM library immediately instead of waiting for metadata to
load one ROM at a time. Metadata fills in progressively afterward.
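
As a rough illustration of the two-phase shape, assuming both phases share one semaphore sized by the worker count. `two_phase_scan`, `discover`, `enrich`, and `on_phase` are hypothetical names for this sketch and do not correspond to functions in the diff:

```python
import asyncio


async def two_phase_scan(roms, discover, enrich, workers=10, on_phase=None):
    """Run discovery for every ROM, then enrichment, each bounded by a semaphore."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(step, rom):
        async with semaphore:
            return await step(rom)

    # Phase 1: fast, local work only (DB entries, hashes). No metadata calls.
    if on_phase:
        on_phase("discovering")
    discovered = await asyncio.gather(*(bounded(discover, rom) for rom in roms))

    # Phase 2: rate-limited metadata and asset downloads.
    if on_phase:
        on_phase("enriching")
    return await asyncio.gather(*(bounded(enrich, rom) for rom in discovered))
```

The first `gather` completes before any enrichment starts, which is what lets every ROM appear in the library before the slower metadata phase begins.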

AI-assisted: Claude Code

- Add scan_phase field ("discovering" / "enriching") to ScanStats so
  the frontend and benchmark tool can show which phase is active
- Improve phase transition log messages with platform name and ROM count
- Move IGDB search cache from class variable to instance variable
- Clear search cache once at the start of each scan (alongside gamelist
  cache), not per-platform — cache remains beneficial within a scan for
  deduplicating regional variant searches

AI-assisted: Claude Code

Add visual phase indicator to both the scan page and admin task
progress panel, showing "Discovering" (orange) during the fast
filesystem discovery phase and "Fetching metadata" (blue) during
the IGDB/metadata enrichment phase.

Changes:
- Extend ScanStats type in scanning store with scan_phase field
- Add phase chip with icon to Scan.vue sticky bottom stats bar
- Add phase label to ScanTaskProgress.vue admin panel
- Add i18n keys for phase labels (en_US, en_GB)

AI-assisted: Claude Code

High: Fix HASHES scan type not persisting ROM-level hash fields.
The two-phase split returned early before scan_rom() could write
crc_hash/md5_hash/sha1_hash/ra_hash/fs_size_bytes to the ROM record.
Now _discover_rom explicitly persists these fields via update_rom()
before the HASHES early return.

Medium: Fix search cache memoizing API failures as "no match".
Only positive matches are now cached. Transient IGDB errors (timeouts,
5xx responses) no longer suppress all subsequent ROMs with the same
search term for the rest of the scan.
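
The positive-only caching rule can be sketched as follows. `SearchCache` and its `search` callback are hypothetical, and the sketch assumes a failed or empty lookup surfaces as `None` (an exception would propagate and likewise leave the cache untouched):

```python
import asyncio


class SearchCache:
    """Per-scan cache that remembers only successful matches."""

    def __init__(self) -> None:
        self._hits = {}

    def clear(self) -> None:
        """Call once at the start of each scan."""
        self._hits.clear()

    async def lookup(self, term, search):
        if term in self._hits:
            return self._hits[term]
        result = await search(term)
        # Only positive matches are stored, so a transient failure is
        # retried the next time the same term comes up.
        if result is not None:
            self._hits[term] = result
        return result
```

The trade-off is that genuinely unmatched terms are searched again for each ROM that normalizes to them, in exchange for not poisoning the rest of the scan after one timeout.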

Low: Wire i18n into ScanTaskProgress.vue admin panel. The phase labels
were hardcoded in English; now they use t("scan.phase-discovering") and
t("scan.phase-enriching") like the main scan page.

Tests: Add regression test for HASHES scan persisting ROM hash fields,
and unit tests for IGDB search cache semantics (positive cache hit,
negative result not cached, cache clear).

AI-assisted: Claude Code

Praeses0 marked this pull request as ready for review on March 23, 2026, 20:02
gantoine requested review from Copilot and gantoine on March 23, 2026, 20:30

Copilot AI left a comment

Pull request overview

This PR improves perceived and end-to-end ROM scan performance by splitting scanning into a fast “discovery” phase (DB + hashes) followed by an async “enrichment” phase (metadata + assets), while also optimizing IGDB request throughput and de-duplicating IGDB searches. It also surfaces scan phase in the UI and increases concurrency defaults.

Changes:

  • Backend: introduce discovery→enrichment scan pipeline, add scan phase to stats, improve IGDB search efficiency + per-scan dedup cache, and fix IGDB rate limiting concurrency.
  • Frontend: display “Discovering” vs “Fetching metadata” phase indicator in scan UI and admin task progress.
  • Config/tests: bump default scan worker concurrency and add regression/unit tests for hashes persistence and IGDB search cache behavior.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| frontend/src/views/Scan.vue | Shows scan phase chip in the scan stats footer. |
| frontend/src/stores/scanning.ts | Extends scan stats type to include scan_phase from backend. |
| frontend/src/locales/en_US/scan.json | Adds English strings for the new scan phase labels. |
| frontend/src/locales/en_GB/scan.json | Adds British English strings for the new scan phase labels. |
| frontend/src/components/Settings/Administration/tasks/ScanTaskProgress.vue | Adds scan phase chip to admin task progress UI with i18n labels. |
| backend/tests/handler/metadata/test_igdb_handler.py | New unit tests for IGDB per-scan search cache semantics. |
| backend/tests/endpoints/sockets/test_scan.py | New regression test ensuring HASHES scan persists ROM-level hash fields. |
| backend/handler/metadata/igdb_handler.py | Adds per-scan search dedup cache and reduces IGDB calls by preferring game types locally. |
| backend/endpoints/sockets/scan.py | Implements two-phase scanning, adds scan_phase to stats, and clears IGDB search cache per scan. |
| backend/config/__init__.py | Increases default SCAN_WORKERS from 1 to 10. |
| backend/adapters/services/igdb.py | Adds a token-bucket rate limiter and supports offset in IGDB requests. |
Comments suppressed due to low confidence (1)

backend/endpoints/sockets/scan.py:299

  • In _discover_rom(), calculate_hashes is derived from SKIP_HASH_CALCULATION even for ScanType.HASHES (line 293). If a user sets SKIP_HASH_CALCULATION=true, a HASHES scan will still proceed but fs_rom_handler.get_rom_files(..., calculate_hashes=False) will not compute hashes, so the “recalculate hashes” scan type can’t fulfill its purpose. Consider forcing calculate_hashes=True when scan_type == ScanType.HASHES, or explicitly disabling HASHES scans when hashes are globally skipped.
    if should_update_files:
        calculate_hashes = not cm.get_config().SKIP_HASH_CALCULATION
        if calculate_hashes:
            log.debug(f"Calculating file hashes for {rom.fs_name}...")

        parsed_rom_files = await fs_rom_handler.get_rom_files(
            rom, calculate_hashes=calculate_hashes
        )
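
One way to express the first suggestion (forcing hash calculation for HASHES scans) is sketched below. The `ScanType` enum here is a stand-in with assumed member values, and `should_calculate_hashes` is a hypothetical helper, not a function in the codebase:

```python
from enum import Enum


class ScanType(Enum):
    """Stand-in for the project's enum; member values are assumptions."""

    QUICK = "quick"
    HASHES = "hashes"


def should_calculate_hashes(scan_type: ScanType, skip_hash_calculation: bool) -> bool:
    """A HASHES scan always hashes; other scan types honor the global skip flag."""
    return scan_type == ScanType.HASHES or not skip_hash_calculation
```

The result would replace the bare `not cm.get_config().SKIP_HASH_CALCULATION` expression in the snippet above.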


Comment on lines +45 to +46
"phase-discovering": "Discovering",
"phase-enriching": "Fetching metadata",
Copilot AI Mar 23, 2026

New i18n keys phase-discovering / phase-enriching were added only to en_US (and en_GB). The repo has frontend/src/locales/check_i18n_locales.py, which enforces that every locale’s scan.json includes the same keys as en_US; missing these keys will fail that check (and will show raw key strings at runtime for non-English locales). Please add these keys to all other locale scan.json files (even if the value is temporarily English) to keep locales in sync.
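
The repository's own `check_i18n_locales.py` is not shown in this conversation. The helper below is a hypothetical stand-in that illustrates the kind of key comparison described, assuming one directory per locale and flat JSON files:

```python
import json
from pathlib import Path


def find_missing_locale_keys(locales_dir, filename="scan.json", reference="en_US"):
    """Map each locale to the reference keys its file lacks (flat JSON assumed)."""
    root = Path(locales_dir)
    ref_path = root / reference / filename
    ref_keys = set(json.loads(ref_path.read_text(encoding="utf-8")))

    missing = {}
    for locale_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        path = locale_dir / filename
        if locale_dir.name == reference or not path.exists():
            continue
        keys = set(json.loads(path.read_text(encoding="utf-8")))
        absent = sorted(ref_keys - keys)
        if absent:
            missing[locale_dir.name] = absent
    return missing
```

Running something like this before pushing would list every locale that still needs the two new phase keys.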

Comment on lines +51 to +59
class IGDBRateLimiter:
    """Token bucket rate limiter for IGDB API (4 requests/second)."""

    def __init__(self, rate: float = 4.0) -> None:
        self._rate = rate
        self._lock = asyncio.Lock()
        self._tokens = rate
        self._last_refill = asyncio.get_event_loop().time()

Copilot AI Mar 23, 2026

The shared module-level _igdb_rate_limiter holds an asyncio.Lock() created on first use. If the process ever runs IGDB requests across different event loops (common in test suites with multiple loops, or in app reload scenarios), reusing an asyncio primitive across loops can raise “attached to a different loop” errors. Consider making the limiter loop-local (e.g., store one per running loop), or avoid asyncio.Lock in a global singleton by using an anyio limiter or a pure-time-based atomic approach. Also prefer asyncio.get_running_loop().time() over get_event_loop() in async code for 3.13+ compatibility.
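
A sketch of the loop-local option, assuming the standard asyncio event loop (whose loop objects can be weakly referenced; other loop implementations may differ). The registry `_locks_by_loop` and the helper `get_loop_local_lock` are hypothetical names for this example:

```python
import asyncio
import weakref

# One lock per event loop; an entry disappears when its loop is collected.
_locks_by_loop = weakref.WeakKeyDictionary()


def get_loop_local_lock() -> asyncio.Lock:
    """Return the lock belonging to the currently running event loop."""
    loop = asyncio.get_running_loop()  # raises RuntimeError outside a loop
    lock = _locks_by_loop.get(loop)
    if lock is None:
        lock = asyncio.Lock()
        _locks_by_loop[loop] = lock
    return lock
```

A limiter built this way would call `get_loop_local_lock()` inside `acquire()` instead of creating the lock in `__init__`, so a lock created under one loop is never reused under another.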

msedek pushed a commit to msedek/romm that referenced this pull request Mar 24, 2026
Merged PR rommapp#3165 (two-phase scan pipeline + IGDB optimizations):
- Discovery phase: ROMs appear in library in seconds
- Enrichment phase: metadata fills in progressively
- Fixed IGDB rate limiter lock-during-sleep bug
- Search term dedup cache reduces API calls

Additional improvement:
- Discovery semaphore 3x higher than enrichment (I/O-bound, no API limit)
- SCAN_WORKERS=20 in docker-compose for more parallelism

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gantoine added the on-hold label ("Pending further research or blocked by another issue") on Mar 24, 2026