Skip to content

feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 99 commits into from
Aug 14, 2025

Conversation

andylizf
Copy link
Collaborator

@andylizf andylizf commented Aug 11, 2025

This pull request consolidates the robust embedding server lifecycle and adds a fast mode for DiskANN graph partitioning.


✅ What's Included

Embedding Server Hardening (HNSW & DiskANN)

  • A per-instance manager is now used, eliminating cross-process reuse and scanning.
  • Default shutdown is handled by atexit and weakref.finalize.
  • ZMQ options SNDTIMEO=1s and LINGER=0 have been set to prevent send blocking during shutdown.
  • The server now provides shape-correct responses and tolerates nested [[ids]].

DiskANN Fast Mode (Graph Partition)

  • Automatic partitioning now occurs during the build process when is_recompute=True.
  • The system will auto-detect partition files on load and set the partition_prefix.
  • Implemented safe cleanup of large _disk.index artifacts after partitioning is complete.

⚠️ Breaking Changes

  • Cross-process server reuse has been removed. Servers are now always started fresh for each process.

➡️ Migration Guide

  • It is now preferred to call cleanup() explicitly in long-lived applications. The atexit hook will handle default shutdown cases.

yichuan-w and others added 30 commits August 5, 2025 23:11
- Add GraphPartitioner class for advanced graph partitioning
- Add partition_graph_simple function for easy-to-use partitioning
- Add pybind11 dependency for C++ executable building
- Update __init__.py to export partition functions
- Include test scripts for partition functionality

The partition functionality allows optimizing disk-based indices
for better search performance and memory efficiency.
- Update DiskANN submodule to commit b2dc4ea
- Includes graph partition tools and CMake integration
- Enables graph partitioning functionality in DiskANN backend
- Pin ruff==0.12.7 in pyproject.toml dev dependencies
- Update CI to use exact ruff version instead of latest
- Add comments explaining version pinning rationale
- Ensures consistent formatting across local, CI, and pre-commit
- uv tool install is the correct way to install CLI tools like ruff
- uv pip install --system is for Python packages, not tools
- Add logging in DiskANN embedding server to show metadata_file_path
- Add debug logging in PassageManager to trace path resolution
- This will help identify why CI fails to find passage files
- Change from --find-links to direct wheel installation with --force-reinstall
- This ensures CI uses locally built packages with latest source code
- Prevents uv from using PyPI packages with same version number but old code
- Fixes CI test failures where old code (without metadata_file_path) was used

Root cause: CI was installing leann-backend-diskann v0.2.1 from PyPI
instead of the locally built wheel with same version number.
- Check wheel contents before and after auditwheel repair
- Verify _diskannpy module installation after pip install
- List installed package directory structure
- Add explicit platform tag for auditwheel repair

This helps diagnose why ImportError: cannot import name '_diskannpy' occurs
- Remove '--plat linux_x86_64' which is not a valid platform tag
- Let auditwheel automatically determine the correct platform
- Based on CI output, it will use manylinux_2_35_x86_64

This was causing auditwheel repair to fail, preventing proper wheel repair
- Use --find-links with --no-index to let uv select correct wheel
- Prevents installing wrong Python version wheel (e.g., cp310 for Python 3.11)
- Fixes ImportError: _diskannpy.cpython-310-x86_64-linux-gnu.so in Python 3.11

The issue was that *.whl glob matched all Python versions, causing
uv to potentially install a cp310 wheel in a Python 3.11 environment.
- Explicitly specify Python version when creating venv with uv
- Prevents mismatch between build Python (e.g., 3.10) and test Python
- Fixes: _diskannpy.cpython-310-x86_64-linux-gnu.so in Python 3.11 error

The issue: uv venv was defaulting to Python 3.11 regardless of matrix version
- Ubuntu: Install all packages from local builds with --no-index
- macOS: Install core packages from PyPI, backends from local builds
- Remove --no-index for macOS backend installation to allow dependency resolution
- Pin versions when installing from PyPI to ensure consistency

Fixes error: 'leann-core was not found in the provided package locations'
- Replace 'int | None' with 'Optional[int]' everywhere
- Replace 'subprocess.Popen | None' with 'Optional[subprocess.Popen]'
- Add Optional import to all affected files
- Update ruff target-version from py310 to py39
- The '|' syntax for Union types was introduced in Python 3.10 (PEP 604)

Fixes TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
- Build leann-core and leann on macOS too
- Install all packages via --find-links and --no-index across platforms
- Lower macOS MACOSX_DEPLOYMENT_TARGET to 12.0 for wider compatibility

This ensures consistency and avoids PyPI drift while improving macOS compatibility.
…heels for our packages

- Remove --no-index so numpy/scipy/etc can be resolved on Python 3.13
- Keep --find-links to force our packages from local dist

Fixes: dependency resolution failure on Ubuntu Python 3.13 (numpy missing)
- Fix build failure: 'sgesdd_' only available on macOS 13.3+
- Keep other CI improvements (local builds, find-links installs)
- validate_model_and_suggest: str | None -> Optional[str]
- OpenAIChat.__init__: api_key: str | None -> Optional[str]
- get_llm: dict[str, Any] | None -> Optional[dict[str, Any]]

Ensures Python 3.9 compatibility for CI macOS 3.9.
- Fix import ordering in embedding servers and graph_partition_simple
- Remove duplicate Optional import
- Complete Optional[...] replacements
andylizf and others added 5 commits August 13, 2025 12:28
Analysis of recent CI failures shows:
- Model download takes ~12 seconds
- Embedding server startup + first search takes additional ~78 seconds
- Total time needed: ~90-100 seconds

Updated timeouts:
- test_readme_basic_example: 90s -> 180s
- test_backend_options: 60s -> 150s
- test_llm_config_simulated: 75s -> 150s

Root cause: Initial model download from huggingface.co in CI environment
is slower than local development, causing legitimate timeouts rather than
actual hanging processes.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Previous fix revealed the real issue: embedding server fails to start within 120s,
not timeout issues. The error was hidden because both stdout and stderr were
redirected to DEVNULL in CI.

Changes:
- Keep stderr output in CI environment for debugging
- Only redirect stdout to DEVNULL to avoid buffer deadlock
- This will help us see why embedding server startup is failing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…d their own REP sockets and poll with timeouts; fix undefined socket causing startup crash and CI hangs on Ubuntu 22.04
…embedding requests to match client expectations; prevents missing ID lookup when wrapper nests the list
@andylizf andylizf force-pushed the debug/clean-state-investigation branch from 8acdb1c to f496621 Compare August 13, 2025 22:58
…e shutdown-capable server implementation to reduce surface and avoid hangs
…al and rely on simplified CI + robust servers
…rManager to ensure server stops on interpreter exit/GC
…rver() can terminate adopted processes; clear server_port on stop
…void slow/hanging process scans; always pick a fresh available port
… core dump in constrained runners; HNSW still validated
…ts from low-level dup2; no-op contextmanager on CI
@andylizf andylizf changed the title Debug: Add tmate debugging tools to clean state for CI hang investigation feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) Aug 14, 2025
@yichuan-w
Copy link
Owner

@andylizf , lets make the change back-compatible, thanks!

@andylizf andylizf merged commit fafdf8f into main Aug 14, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants