Refact: Enhance DOCLING integration with lazy loading and macOS safeguards #2352

Merged
danielaskdd merged 2 commits into HKUDS:main from danielaskdd:docling-gunicorn-multi-worker
Nov 13, 2025

@danielaskdd
Collaborator

Refact: Enhance DOCLING integration with lazy loading and macOS safeguards

🎯 Summary

Improve DOCLING document loading engine integration with enhanced macOS compatibility and multi-worker Gunicorn support. This PR addresses PyTorch fork-safety issues on macOS and implements better configuration management for the DOCLING engine.

🔧 Changes

1. CLI Configuration Enhancement (lightrag/api/config.py)

  • Added --docling CLI flag to enable DOCLING document loading engine
  • Simplified configuration logic: flag takes precedence over environment variable
  • Improved user experience with explicit command-line controls
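
The precedence rule described above can be sketched as follows. This is a minimal illustration, not the code in lightrag/api/config.py: the function name `resolve_document_loading_engine` is hypothetical, and the assumption that the engine falls back to `DEFAULT` when neither the flag nor the variable is set is ours.

```python
import argparse
import os


def resolve_document_loading_engine(argv=None, environ=None):
    """Return the engine name, letting --docling win over the environment.

    Hypothetical helper: the name and the "DEFAULT" fallback are assumptions.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--docling",
        action="store_true",
        default=False,
        help="Enable the DOCLING document loading engine",
    )
    # parse_known_args ignores unrelated server options in this sketch.
    args, _unknown = parser.parse_known_args(argv)

    if args.docling:
        # The CLI flag takes precedence over DOCUMENT_LOADING_ENGINE.
        return "DOCLING"
    return environ.get("DOCUMENT_LOADING_ENGINE", "DEFAULT")
```

Passing `argv` and `environ` explicitly keeps the helper easy to exercise without touching the real process environment.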

2. Lazy Loading Implementation (lightrag/api/routers/document_routes.py)

  • Breaking Change Fix: Converted module-level DOCLING availability check to lazy-loaded @lru_cache function
  • Prevents import-time errors when DOCLING is not installed
  • Added warning logs when DOCLING is configured but unavailable, with graceful fallback to default parsers (pypdf, python-docx, python-pptx, openpyxl)
  • Improves compatibility with Gunicorn multi-worker mode
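
A lazy, cached availability check of this kind might look like the sketch below. The names `is_docling_available` and `choose_engine` are hypothetical and the warning text is illustrative; the actual implementation in document_routes.py may differ.

```python
import importlib.util
import logging
from functools import lru_cache

logger = logging.getLogger(__name__)


@lru_cache(maxsize=1)
def is_docling_available() -> bool:
    """Check on first call (not at import time) whether docling is installed."""
    # find_spec locates the package without importing it, so PyTorch and
    # other heavy dependencies are not loaded just to answer this question.
    return importlib.util.find_spec("docling") is not None


def choose_engine(configured_engine: str) -> str:
    """Return the engine to use, falling back when DOCLING is missing."""
    if configured_engine == "DOCLING" and not is_docling_available():
        logger.warning(
            "DOCLING is configured but not installed; "
            "falling back to the default parsers"
        )
        return "DEFAULT"
    return configured_engine
```

Because the result is cached per process, each Gunicorn worker performs the lookup at most once, and only when a request actually needs it.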

3. macOS Compatibility Guard (lightrag/api/run_with_gunicorn.py)

  • Added pre-flight validation check for incompatible configurations
  • Prevents server startup when DOCLING + multi-worker mode is used on macOS
  • Clear error message explaining PyTorch fork-safety issues
  • Provides actionable solutions:
    • Use single worker mode (--workers 1)
    • Switch to DEFAULT document engine
    • Deploy on Linux for full multi-worker support
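
The pre-flight guard can be pictured roughly as below. This is a sketch under stated assumptions: the function name `check_docling_multiworker`, its parameters, and the message wording are hypothetical and are not taken from run_with_gunicorn.py.

```python
import sys


def check_docling_multiworker(
    engine: str, workers: int, sys_platform: str = sys.platform
) -> None:
    """Abort startup for the DOCLING + multi-worker combination on macOS."""
    if sys_platform == "darwin" and engine == "DOCLING" and workers > 1:
        # SystemExit with a string prints the message and exits non-zero.
        raise SystemExit(
            "DOCLING cannot run with multiple Gunicorn workers on macOS "
            "(PyTorch is not fork-safe there). Options:\n"
            "  - use single worker mode (--workers 1)\n"
            "  - switch to the DEFAULT document engine\n"
            "  - deploy on Linux for full multi-worker support"
        )
```

Running the check before any workers are forked means the failure shows up as one clear startup error instead of crashes inside individual workers.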

4. Platform-Specific Dependencies (pyproject.toml)

  • Excluded the DOCLING dependency on macOS: docling>=2.0.0,<3.0.0; sys_platform != 'darwin'
  • Pinned numpy version: >=1.24.0,<2.0.0 for better stability
  • Prevents installation of incompatible dependencies on macOS
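
To make the marker's effect concrete: PEP 508 defines the `sys_platform` marker variable as the value of Python's `sys.platform`, so the condition `sys_platform != 'darwin'` selects every platform except macOS. The helper below is a hypothetical stand-in for what an installer evaluates, not part of the PR.

```python
import sys


def docling_marker_matches(sys_platform: str = sys.platform) -> bool:
    """Mirror the marker `sys_platform != 'darwin'` from pyproject.toml."""
    # True means an installer would include docling on this platform.
    return sys_platform != "darwin"


if __name__ == "__main__":
    for name in ("linux", "win32", "darwin"):
        print(name, docling_marker_matches(name))
```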

5. Dependency Lock Updates (uv.lock)

  • Updated dependency resolution with platform markers
  • Removed macOS-specific wheels for tree-sitter packages when not needed
  • Added voyageai and zstandard with proper platform markers

🐛 Problems Solved

  1. PyTorch Fork Safety: PyTorch (required by DOCLING) has known issues with fork-based multiprocessing on macOS, causing crashes in Gunicorn multi-worker mode
  2. Import-time Failures: Previous implementation checked DOCLING availability at module import time, causing issues in multi-worker scenarios
  3. Silent Fallbacks: Users were not informed when DOCLING was configured but unavailable
  4. Dependency Conflicts: macOS users could install DOCLING dependencies that wouldn't work properly

✅ Benefits

  • Safer Deployments: Prevents runtime crashes on macOS with clear pre-flight checks
  • Better User Experience: CLI flag makes configuration more discoverable
  • Improved Logging: Users can diagnose DOCLING availability issues easily
  • Cross-Platform Compatibility: Platform-specific dependencies prevent installation issues

🧪 Testing Recommendations

  • Test --docling flag on Linux with multi-worker Gunicorn
  • Verify error message on macOS with --docling --workers 2
  • Confirm graceful fallback when DOCLING unavailable
  • Test single-worker mode on macOS with DOCLING
  • Validate DEFAULT engine works on all platforms

📋 Migration Notes

For macOS users:

  • If using DOCLING: Must use --workers 1 (single worker mode)
  • Recommended: Use DOCUMENT_LOADING_ENGINE=DEFAULT on macOS
  • For production: Deploy on Linux for full multi-worker support

For all users:

  • New --docling CLI flag available as alternative to DOCUMENT_LOADING_ENGINE=DOCLING
  • Warning logs added - check logs if documents aren't processing as expected

- Add --docling CLI flag for easier setup
- Add numpy version constraints
- Exclude docling on macOS (fork-safety)
@danielaskdd
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. You're on a roll.


@danielaskdd danielaskdd merged commit 28fba19 into HKUDS:main Nov 13, 2025
1 check passed
@danielaskdd danielaskdd deleted the docling-gunicorn-multi-worker branch November 14, 2025 03:16