Skip to content

Refactor: Centralize keyword_extraction parameter handling in OpenAI LLM implementations#2401

Merged
danielaskdd merged 3 commits intoHKUDS:mainfrom
danielaskdd:fix-openai-keyword-extraction
Nov 21, 2025
Merged

Refactor: Centralize keyword_extraction parameter handling in OpenAI LLM implementations#2401
danielaskdd merged 3 commits intoHKUDS:mainfrom
danielaskdd:fix-openai-keyword-extraction

Conversation

@danielaskdd
Copy link
Collaborator

Refactor: Centralize keyword_extraction parameter handling in OpenAI LLM implementations

Summary

Refactored the keyword_extraction parameter handling in OpenAI and Azure OpenAI LLM implementations to follow the DRY (Don't Repeat Yourself) principle. All keyword extraction logic is now centralized in the base *_complete_if_cache functions, eliminating code duplication across wrapper functions.

Enhance keyword extraction compatibility to handle cases where the LLM cannot reliably generate JSON output.

Changes

lightrag/llm/openai.py

  • Enhanced openai_complete_if_cache: Added keyword extraction handling that sets response_format to GPTKeywordExtractionFormat when keyword_extraction=True
  • Simplified wrapper functions: Removed redundant keyword extraction logic from:
    • openai_complete() - removed inconsistent "json" format handling
    • gpt_4o_complete() - removed duplicate format setting
    • gpt_4o_mini_complete() - removed duplicate format setting

lightrag/llm/azure_openai.py

  • Added import: GPTKeywordExtractionFormat from lightrag.types
  • Enhanced azure_openai_complete_if_cache:
    • Added keyword_extraction: bool = False parameter
    • Implemented keyword extraction logic with GPTKeywordExtractionFormat
    • Removed redundant kwargs.pop("keyword_extraction", None)
  • Updated azure_openai_complete: Now properly passes keyword_extraction parameter to the base function

Benefits

Single source of truth: All keyword extraction logic centralized in base functions
Consistency: Both implementations use the same GPTKeywordExtractionFormat
Maintainability: Future changes only need to be made in one location per file
Code quality: Eliminates code duplication and improves readability

Testing

  • Python syntax validation passed
  • Backward compatibility maintained
  • All wrapper functions continue to work as expected

Breaking Changes

None. This is a pure refactoring that maintains full backward compatibility.

• Move response format to core function
• Remove duplicate format assignments
• Standardize keyword extraction flow
• Clean up redundant parameter handling
• Improve Azure OpenAI compatibility
@danielaskdd
Copy link
Collaborator Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +73 to +75
# Handle keyword extraction mode
if keyword_extraction:
kwargs["response_format"] = GPTKeywordExtractionFormat

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard Azure keyword extraction parse responses

When keyword_extraction is true this new block sets response_format, causing the request to go through beta.chat.completions.parse, but the non-streaming path still assumes response.choices[0].message.content is a string and immediately probes it with "\u" in content. Structured parse responses often set content=None and populate message.parsed instead, so keyword extraction calls will now raise a TypeError before returning any result. Consider handling message.parsed or null content before performing string operations.

Useful? React with 👍 / 👎.

Added support for structured output (JSON mode) from the OpenAI API in `openai.py` and `azure_openai.py`.

When `response_format` is used to request structured data, the new logic checks for the `message.parsed` attribute. If it exists, it's serialized into a JSON string as the final content. If not, the code falls back to the existing `message.content` handling, ensuring backward compatibility.
- Remove beta prefix from completions.parse
- Update OpenAI dependency to >=2.0.0
- Fix whitespace formatting
- Update all requirement files
- Clean up pyproject.toml dependencies
@danielaskdd
Copy link
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@danielaskdd danielaskdd merged commit 8777895 into HKUDS:main Nov 21, 2025
4 checks passed
@danielaskdd danielaskdd deleted the fix-openai-keyword-extraction branch November 21, 2025 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant