Improve ROM scan speed with two-phase pipeline and IGDB optimizations #3165
Praeses0 wants to merge 8 commits into rommapp:master
- Increase default SCAN_WORKERS from 1 to 5, allowing multiple ROMs to be scanned concurrently via the existing asyncio semaphore
- Add IGDB API rate limiter (4 req/s token bucket) to prevent 429 responses that cause expensive 2-second retry penalties
- Parallelize cover, manual, and screenshot downloads using asyncio.gather instead of sequential awaits

Benchmark results (51 ROMs, quick scan + IGDB):
- Before: 0.83 roms/s, 61.57s total
- After: 2.80 roms/s, 18.23s total (3.4x faster)

AI-assisted: Claude Code
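As an illustration of the concurrency pattern this commit describes, here is a minimal sketch, not the PR's actual code. `fetch_asset`, `download_assets`, and `scan_all` are hypothetical names, and the sleep stands in for a real download.

```python
import asyncio


async def fetch_asset(kind: str, delay: float = 0.01) -> str:
    # Hypothetical stand-in for a cover, manual, or screenshot download.
    await asyncio.sleep(delay)
    return f"{kind}-done"


async def download_assets() -> list[str]:
    # The three downloads run concurrently; gather keeps argument order.
    return await asyncio.gather(
        fetch_asset("cover"),
        fetch_asset("manual"),
        fetch_asset("screenshots"),
    )


async def scan_all(rom_names: list[str], workers: int = 5) -> list[list[str]]:
    # The semaphore caps how many ROMs are processed at the same time,
    # which is the role SCAN_WORKERS plays in the description above.
    semaphore = asyncio.Semaphore(workers)

    async def scan_one(name: str) -> list[str]:
        async with semaphore:
            return await download_assets()

    return await asyncio.gather(*(scan_one(name) for name in rom_names))
```

Calling `asyncio.run(scan_all(["a", "b", "c"], workers=2))` processes at most two ROMs at once while each ROM's three downloads overlap.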
Fix rate limiter lock-during-sleep bug that serialized concurrent IGDB requests. The acquire() method was sleeping while holding the asyncio lock, blocking all other coroutines from proceeding. Now sleeps outside the lock, enabling true 4 req/s throughput with concurrent workers.

Refactor _search_rom to eliminate redundant IGDB API calls. The old pattern called _search_rom twice (with and without game_type filter), each making up to 3 API calls including an identical expanded search fallback. The new approach chains: search with filter -> search without filter -> expanded search (once), reducing worst-case from 6 to 4 calls per ROM.

Increase SCAN_WORKERS default from 5 to 10 for better I/O overlap between hash computation, DB writes, and metadata API calls.

Benchmark results (51 ROMs, IGDB metadata):
- Before: 0.83 roms/s (61.57s total)
- After: 3.74 roms/s (13.63s total)
- Speedup: 4.5x

AI-assisted: Claude Code
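The lock-during-sleep fix can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in backend/adapters/services/igdb.py: the class name `TokenBucket` and its internals are hypothetical. The point is that the token bookkeeping happens under the lock, while the wait happens after the lock is released.

```python
import asyncio
import time


class TokenBucket:
    """Hypothetical token bucket; capacity equals the refill rate."""

    def __init__(self, rate: float = 4.0) -> None:
        self._rate = rate
        self._tokens = rate
        self._last_refill = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        while True:
            async with self._lock:
                # Refill according to elapsed time, capped at the bucket size.
                now = time.monotonic()
                elapsed = now - self._last_refill
                self._tokens = min(self._rate, self._tokens + elapsed * self._rate)
                self._last_refill = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                # Time until one full token is available.
                wait = (1.0 - self._tokens) / self._rate
            # The lock is released here, so other coroutines can check the
            # bucket while this one sleeps.
            await asyncio.sleep(wait)
```

Sleeping inside the `async with` block would instead make every other caller queue behind the sleeping coroutine, which is the serialization the commit message describes.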
Adds offset parameter to _request() and list_games() methods in IGDBService, enabling paginated queries through the IGDB Apicalypse API. This is standard IGDB API functionality that was missing from the adapter.

AI-assisted: Claude Code
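For readers unfamiliar with Apicalypse pagination, a small sketch of how `limit` and `offset` clauses combine in a query body. `build_query` and `page_offsets` are hypothetical helpers for illustration; they are not the signatures of `_request()` or `list_games()`.

```python
from typing import Optional


def build_query(
    fields: list[str],
    where: Optional[str] = None,
    limit: int = 50,
    offset: int = 0,
) -> str:
    # Apicalypse clauses are semicolon-terminated statements.
    parts = [f"fields {','.join(fields)};"]
    if where:
        parts.append(f"where {where};")
    parts.append(f"limit {limit};")
    if offset:
        parts.append(f"offset {offset};")
    return " ".join(parts)


def page_offsets(total: int, limit: int) -> list[int]:
    # Offsets needed to walk `total` records in pages of `limit`.
    return list(range(0, total, limit))
```

For example, `build_query(["id", "name"], limit=2, offset=4)` produces a body that skips the first four records and returns the next two.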
…ference

Previously _search_rom made two separate API calls: first with a game_type filter (main games only), then without if no match was found. This wastes an API call for every ROM where the filtered search fails.

Now a single unfiltered search is made, with results split into main game types and other types locally. Main games are tried first, falling back to DLC/bundle/mod types only if no main game matches. This preserves the same matching priority while saving ~25% of API calls.

Also adds:
- Per-scan search term dedup cache to avoid duplicate API calls when multiple ROMs normalize to the same search term
- game_type field to GAMES_FIELDS for local type filtering

Benchmark (1469 ROMs, IGDB metadata):
- Before: 2.24 roms/s (prior commit)
- After: 2.72 roms/s
- Improvement: ~21%

Combined improvement over baseline (0.83 roms/s):
- 51 ROMs: ~4.5x faster
- 1469 ROMs: ~3.3x faster (539s vs ~1770s estimated baseline)

AI-assisted: Claude Code
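The local type preference described in this commit might look something like the sketch below. It is an assumption, not the handler's code: `pick_match`, the `matches` callback, and the numeric IDs in `MAIN_GAME_TYPES` are placeholders chosen for illustration.

```python
from typing import Callable, Optional

# Placeholder set of "main game" type IDs; the real values live in the handler.
MAIN_GAME_TYPES = {0}


def pick_match(
    results: list[dict],
    matches: Callable[[dict], bool],
) -> Optional[dict]:
    # Split one unfiltered result list locally instead of issuing two searches.
    main = [g for g in results if g.get("game_type") in MAIN_GAME_TYPES]
    other = [g for g in results if g.get("game_type") not in MAIN_GAME_TYPES]
    # Main games are tried first; other types are only a fallback.
    for group in (main, other):
        for game in group:
            if matches(game):
                return game
    return None
```

Because both groups come from the same response, the priority order is preserved without the second filtered request.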
Refactor _identify_rom into two phases:
- Phase 1 (discovery): Create DB entries, hash files, save to DB. No metadata API calls. Runs at ~14 roms/s, so all ROMs appear in the library within seconds.
- Phase 2 (enrichment): Fetch metadata from IGDB/other sources, download covers/screenshots. Rate-limited by IGDB API at ~2.3 roms/s.

This improves perceived scan speed significantly: users see their entire ROM library immediately instead of waiting for metadata to load one ROM at a time. Metadata fills in progressively afterward.

AI-assisted: Claude Code
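A toy model of the two-phase split, assuming an in-memory dict in place of the database and a CRC of the name in place of real file hashing. `discover`, `enrich`, and `scan` are hypothetical names and do not mirror the functions in backend/endpoints/sockets/scan.py.

```python
import asyncio
import zlib


async def discover(name: str, library: dict) -> None:
    # Phase 1: create the entry and record a hash; no metadata calls.
    await asyncio.sleep(0)
    library[name] = {"crc": zlib.crc32(name.encode()), "metadata": None}


async def enrich(name: str, library: dict, limit: asyncio.Semaphore) -> None:
    # Phase 2: fill in metadata; the semaphore stands in for API limits.
    async with limit:
        await asyncio.sleep(0)
        library[name]["metadata"] = {"title": name.title()}


async def scan(names: list[str]) -> tuple[list[str], int, dict]:
    library: dict = {}
    phases = ["discovering"]
    await asyncio.gather(*(discover(n, library) for n in names))
    # Every ROM is already visible before any metadata has been fetched.
    visible_before_enrichment = len(library)
    phases.append("enriching")
    limit = asyncio.Semaphore(2)
    await asyncio.gather(*(enrich(n, library, limit) for n in names))
    return phases, visible_before_enrichment, library
```

The value of `visible_before_enrichment` equals the number of ROMs, which is the "library appears within seconds" effect the commit describes.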
- Add scan_phase field ("discovering" / "enriching") to ScanStats so
the frontend and benchmark tool can show which phase is active
- Improve phase transition log messages with platform name and ROM count
- Move IGDB search cache from class variable to instance variable
- Clear search cache once at the start of each scan (alongside gamelist
cache), not per-platform — cache remains beneficial within a scan for
deduplicating regional variant searches
AI-assisted: Claude Code
Add visual phase indicator to both the scan page and admin task progress panel, showing "Discovering" (orange) during the fast filesystem discovery phase and "Fetching metadata" (blue) during the IGDB/metadata enrichment phase.

Changes:
- Extend ScanStats type in scanning store with scan_phase field
- Add phase chip with icon to Scan.vue sticky bottom stats bar
- Add phase label to ScanTaskProgress.vue admin panel
- Add i18n keys for phase labels (en_US, en_GB)

AI-assisted: Claude Code
High: Fix HASHES scan type not persisting ROM-level hash fields.
The two-phase split returned early before scan_rom() could write
crc_hash/md5_hash/sha1_hash/ra_hash/fs_size_bytes to the ROM record.
Now _discover_rom explicitly persists these fields via update_rom()
before the HASHES early return.
Medium: Fix search cache memoizing API failures as "no match".
Only positive matches are now cached. Transient IGDB errors (timeouts,
5xx responses) no longer suppress all subsequent ROMs with the same
search term for the rest of the scan.
Low: Wire i18n into ScanTaskProgress.vue admin panel. The phase labels
were hardcoded in English; now they use t("scan.phase-discovering") and
t("scan.phase-enriching") like the main scan page.
Tests: Add regression test for HASHES scan persisting ROM hash fields,
and unit tests for IGDB search cache semantics (positive cache hit,
negative result not cached, cache clear).
AI-assisted: Claude Code
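The cache semantics fixed in the commit above (cache positive matches only, never failures or misses) can be sketched as below. `SearchCache` and its `fetch` callback are hypothetical, and treating `TimeoutError` and `ConnectionError` as the transient failures is an assumption made for the example.

```python
from typing import Awaitable, Callable, Optional


class SearchCache:
    """Hypothetical per-scan cache that stores positive matches only."""

    def __init__(self) -> None:
        self._hits: dict[str, dict] = {}

    async def search(
        self,
        term: str,
        fetch: Callable[[str], Awaitable[Optional[dict]]],
    ) -> Optional[dict]:
        if term in self._hits:
            return self._hits[term]
        try:
            result = await fetch(term)
        except (TimeoutError, ConnectionError):
            # A transient failure is not remembered, so the next ROM with
            # the same search term triggers a fresh request.
            return None
        if result is not None:
            self._hits[term] = result
        return result

    def clear(self) -> None:
        # Intended to be called once at the start of each scan.
        self._hits.clear()
```

With this shape, a timeout on the first lookup of a term does not suppress later lookups of the same term within the scan.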
Pull request overview
This PR improves perceived and end-to-end ROM scan performance by splitting scanning into a fast “discovery” phase (DB + hashes) followed by an async “enrichment” phase (metadata + assets), while also optimizing IGDB request throughput and de-duplicating IGDB searches. It also surfaces scan phase in the UI and increases concurrency defaults.
Changes:
- Backend: introduce discovery→enrichment scan pipeline, add scan phase to stats, improve IGDB search efficiency + per-scan dedup cache, and fix IGDB rate limiting concurrency.
- Frontend: display “Discovering” vs “Fetching metadata” phase indicator in scan UI and admin task progress.
- Config/tests: bump default scan worker concurrency and add regression/unit tests for hashes persistence and IGDB search cache behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| frontend/src/views/Scan.vue | Shows scan phase chip in the scan stats footer. |
| frontend/src/stores/scanning.ts | Extends scan stats type to include scan_phase from backend. |
| frontend/src/locales/en_US/scan.json | Adds English strings for the new scan phase labels. |
| frontend/src/locales/en_GB/scan.json | Adds British English strings for the new scan phase labels. |
| frontend/src/components/Settings/Administration/tasks/ScanTaskProgress.vue | Adds scan phase chip to admin task progress UI with i18n labels. |
| backend/tests/handler/metadata/test_igdb_handler.py | New unit tests for IGDB per-scan search cache semantics. |
| backend/tests/endpoints/sockets/test_scan.py | New regression test ensuring HASHES scan persists ROM-level hash fields. |
| backend/handler/metadata/igdb_handler.py | Adds per-scan search dedup cache and reduces IGDB calls by preferring game types locally. |
| backend/endpoints/sockets/scan.py | Implements two-phase scanning, adds scan_phase to stats, and clears IGDB search cache per scan. |
| backend/config/__init__.py | Increases default SCAN_WORKERS from 1 to 10. |
| backend/adapters/services/igdb.py | Adds a token-bucket rate limiter and supports offset in IGDB requests. |
Comments suppressed due to low confidence (1)
backend/endpoints/sockets/scan.py:299
- In _discover_rom(), calculate_hashes is derived from SKIP_HASH_CALCULATION even for ScanType.HASHES (line 293). If a user sets SKIP_HASH_CALCULATION=true, a HASHES scan will still proceed but fs_rom_handler.get_rom_files(..., calculate_hashes=False) will not compute hashes, so the “recalculate hashes” scan type can’t fulfill its purpose. Consider forcing calculate_hashes=True when scan_type == ScanType.HASHES, or explicitly disabling HASHES scans when hashes are globally skipped.
    if should_update_files:
        calculate_hashes = not cm.get_config().SKIP_HASH_CALCULATION
        if calculate_hashes:
            log.debug(f"Calculating file hashes for {rom.fs_name}...")
        parsed_rom_files = await fs_rom_handler.get_rom_files(
            rom, calculate_hashes=calculate_hashes
        )
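One way to read the first suggestion in this comment is as a small decision function. The sketch below is illustrative only: `ScanType` here is a stand-in enum with assumed values, and `should_calculate_hashes` is a hypothetical helper, not a function in the codebase.

```python
from enum import Enum


class ScanType(Enum):
    # Stand-in for the project's ScanType; member values are assumptions.
    QUICK = "quick"
    HASHES = "hashes"


def should_calculate_hashes(scan_type: ScanType, skip_hash_calculation: bool) -> bool:
    # A HASHES scan always hashes; other scans honor the global skip flag.
    return scan_type is ScanType.HASHES or not skip_hash_calculation
```

The alternative mentioned in the comment, rejecting HASHES scans outright when hashing is globally skipped, would be a different policy and is not shown here.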
    "phase-discovering": "Discovering",
    "phase-enriching": "Fetching metadata",
New i18n keys phase-discovering / phase-enriching were added only to en_US (and en_GB). The repo has frontend/src/locales/check_i18n_locales.py, which enforces that every locale’s scan.json includes the same keys as en_US; missing these keys will fail that check (and will show raw key strings at runtime for non-English locales). Please add these keys to all other locale scan.json files (even if the value is temporarily English) to keep locales in sync.
    class IGDBRateLimiter:
        """Token bucket rate limiter for IGDB API (4 requests/second)."""

        def __init__(self, rate: float = 4.0) -> None:
            self._rate = rate
            self._lock = asyncio.Lock()
            self._tokens = rate
            self._last_refill = asyncio.get_event_loop().time()
The shared module-level _igdb_rate_limiter holds an asyncio.Lock() created on first use. If the process ever runs IGDB requests across different event loops (common in test suites with multiple loops, or in app reload scenarios), reusing an asyncio primitive across loops can raise “attached to a different loop” errors. Consider making the limiter loop-local (e.g., store one per running loop), or avoid asyncio.Lock in a global singleton by using an anyio limiter or a pure-time-based atomic approach. Also prefer asyncio.get_running_loop().time() over get_event_loop() in async code for 3.13+ compatibility.
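A rough sketch of the loop-local option mentioned in this comment, assuming the default asyncio event loop (whose loop objects can be weakly referenced; other loop implementations may differ). `Limiter`, `_limiters`, and `get_limiter` are hypothetical names.

```python
import asyncio
import weakref


class Limiter:
    """Hypothetical limiter state bound to the loop that created it."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        # get_running_loop() only works inside a coroutine or callback,
        # which guarantees the timestamp comes from the active loop.
        self.last_refill = asyncio.get_running_loop().time()


# One limiter per event loop; entries vanish when a loop is collected.
_limiters: "weakref.WeakKeyDictionary[asyncio.AbstractEventLoop, Limiter]" = (
    weakref.WeakKeyDictionary()
)


def get_limiter() -> Limiter:
    loop = asyncio.get_running_loop()
    limiter = _limiters.get(loop)
    if limiter is None:
        limiter = Limiter()
        _limiters[loop] = limiter
    return limiter
```

Within one loop every caller shares the same limiter, while a second loop (for instance a separate test run) gets its own lock instead of reusing one bound elsewhere.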
Merged PR rommapp#3165 (two-phase scan pipeline + IGDB optimizations):
- Discovery phase: ROMs appear in library in seconds
- Enrichment phase: metadata fills in progressively
- Fixed IGDB rate limiter lock-during-sleep bug
- Search term dedup cache reduces API calls

Additional improvement:
- Discovery semaphore 3x higher than enrichment (I/O-bound, no API limit)
- SCAN_WORKERS=20 in docker-compose for more parallelism

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Benchmark Results
Tested with 1,469 ROMs (1,100 Game Boy + 368 Game Gear + 1 Switch) using IGDB metadata:
The hard ceiling is IGDB's 4 req/s rate limit — these optimizations maximize throughput within that constraint.
Test plan
- trunk fmt and trunk check pass
AI Disclosure
This PR was authored with the assistance of Claude Code (AI).