
Conversation

sumukshashidhar (Collaborator)

configuration generator for the cli

@sumukshashidhar sumukshashidhar requested a review from alozowski June 4, 2025 11:44

@alozowski alozowski left a comment


This is a good CLI enhancement, but there are a few important points I commented on, especially around API key handling. We should also update the tests to cover .env edge cases (the .env file is missing, unreadable, or unwritable).

        i = int(idx.strip()) - 1
        if 0 <= i < len(models):
            selected.append(models[i]["model_name"])
    except ValueError:
Collaborator


This is a silent failure, and debugging it will be hard. I suggest logging it, something like this:

for idx in indices.split(","):
    try:
        i = int(idx.strip()) - 1
        if 0 <= i < len(models):
            selected.append(models[i]["model_name"])
        else:
            logger.warning(f"Model index {idx} is out of range (1-{len(models)})")
    except ValueError:
        logger.warning(f"Invalid model index '{idx}' - expected a number")

Collaborator Author


Fixed! Added proper logging for both out-of-range indices and invalid format errors. Now logs warnings with clear messages about what went wrong.

# Write to file
output.parent.mkdir(parents=True, exist_ok=True)
with open(output, "w") as f:
    yaml.dump(config, f, default_flow_style=False, sort_keys=False, width=120)
Copy link
Collaborator


Writing here can fail for various reasons. Let's wrap it in a try/except:

try:
    ...  # the file operations here
except (OSError, PermissionError) as e:
    logger.error(f"Failed to write config: {e}")  # log the error
    raise typer.Exit(1)  # exit gracefully

This also helps when users try to write to read-only folders or lack the necessary permissions.

Collaborator Author


Wrapped the file operations in a try/except block with OSError and PermissionError handling. Shows user-friendly error messages and exits gracefully.
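As a rough stdlib-only sketch of that wrapping (json stands in for yaml and SystemExit for typer.Exit here, so the snippet runs without the project's dependencies; the function name is assumed):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def write_config(config: dict, output: Path) -> None:
    """Write the generated config, failing gracefully on I/O errors."""
    try:
        output.parent.mkdir(parents=True, exist_ok=True)
        with open(output, "w") as f:
            json.dump(config, f, indent=2)
    except (OSError, PermissionError) as e:
        # Log a clear message instead of letting a traceback reach the user
        logger.error(f"Failed to write config to {output}: {e}")
        raise SystemExit(1)
```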


elif choice == 2:  # OpenAI Compatible
    config["base_url"] = "https://api.openai.com/v1"
    config["api_key"] = Prompt.ask("API key (use $VAR for env)", default="$OPENAI_API_KEY")
Collaborator


Right now, if a user enters a real API key instead of a variable reference like $OPENAI_API_KEY (there is no input validation), it gets written to config.yaml in plain text. This is a security risk, especially if the file is committed to git (we never know).

I suggest writing to an .env file instead and adding input validation, for example, blocking input that doesn't start with $

Collaborator Author


Added validation to prevent plain text API keys. Now enforces $VAR format and automatically creates/updates .env file with placeholders. Shows clear error if users try to enter real keys.
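A minimal sketch of how that validation and .env handling could look (the function names and the exact regex are assumptions for illustration, not the PR's actual code):

```python
import re
from pathlib import Path

# Accept only $VAR-style references such as "$OPENAI_API_KEY"
ENV_REF = re.compile(r"^\$[A-Z_][A-Z0-9_]*$")

def validate_env_ref(value: str) -> bool:
    """Reject anything that looks like a literal key rather than a $VAR reference."""
    return bool(ENV_REF.match(value))

def ensure_env_placeholder(env_path: Path, var_name: str) -> None:
    """Append a 'VAR=' placeholder to .env if the variable is not already listed."""
    existing = env_path.read_text() if env_path.exists() else ""
    names = {line.split("=", 1)[0].strip() for line in existing.splitlines() if "=" in line}
    if var_name not in names:
        with env_path.open("a") as f:
            f.write(f"{var_name}=\n")
```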

if Confirm.ask("Use custom tokenizer?", default=False):
    config["encoding_name"] = Prompt.ask("Encoding name", default="cl100k_base")
else:
    config["max_concurrent_requests"] = 16 if choice == 1 else 8
Collaborator


Magic numbers make code hard to maintain.

I suggest placing such values as named constants at the top of the file:

DEFAULT_CONCURRENT_REQUESTS_HF = 16
DEFAULT_CONCURRENT_REQUESTS_API = 8
DEFAULT_CHUNK_TOKENS = 256
DEFAULT_MAX_TOKENS = 16384

Collaborator Author


Moved all magic numbers to named constants at the top of the file for better maintainability.
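For illustration, the replacement of the inline literals might look like this (the helper function is hypothetical; the PR's actual structure may differ):

```python
# Defaults hoisted to module level, as suggested above
DEFAULT_CONCURRENT_REQUESTS_HF = 16
DEFAULT_CONCURRENT_REQUESTS_API = 8

def default_concurrency(choice: int) -> int:
    # choice == 1 means the Hugging Face backend, anything else an external API
    return DEFAULT_CONCURRENT_REQUESTS_HF if choice == 1 else DEFAULT_CONCURRENT_REQUESTS_API
```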

return {
    "ingestion": [model_name],
    "summarization": [model_name],
    "chunking": ["intfloat/multilingual-e5-large-instruct"],
Collaborator


I noticed we're assigning intfloat/multilingual-e5-large-instruct to the chunking stage by default, but the chunking stage is purely token-based since semantic_chunking was deprecated

Also, this model requires torch and transformers, which aren’t installed in the project, so this could break things or confuse users. Let’s remove it from model_roles["chunking"]?

TODO: update the example/configs/advanced_example.yaml as well and remove semantic_chunking

Collaborator Author


Removed the embedding model from chunking stage since it's purely token-based now. Also cleaned up model_roles to exclude chunking. Will update advanced_example.yaml in a follow-up commit.
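The cleaned-up default might look roughly like this (function name assumed for illustration):

```python
def default_model_roles(model_name: str) -> dict[str, list[str]]:
    # "chunking" omitted: that stage is purely token-based now,
    # so no embedding model (or torch/transformers) is needed
    return {
        "ingestion": [model_name],
        "summarization": [model_name],
    }
```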


@alozowski alozowski left a comment


Cool!

@sumukshashidhar sumukshashidhar merged commit ea584bb into main Jun 26, 2025
6 checks passed