Add google embedder support #192

Open · wants to merge 8 commits into `main`
README.md: 97 additions & 3 deletions
@@ -26,6 +26,7 @@
- **Ask Feature**: Chat with your repository using RAG-powered AI to get accurate answers
- **DeepResearch**: Multi-turn research process that thoroughly investigates complex topics
- **Multiple Model Providers**: Support for Google Gemini, OpenAI, OpenRouter, and local Ollama models
- **Flexible Embeddings**: Choose between OpenAI, Google AI, or local Ollama embeddings for optimal performance

## 🚀 Quick Start (Super Easy!)

@@ -39,6 +40,8 @@ cd deepwiki-open
# Create a .env file with your API keys
echo "GOOGLE_API_KEY=your_google_api_key" > .env
echo "OPENAI_API_KEY=your_openai_api_key" >> .env
# Optional: Use Google AI embeddings instead of OpenAI (recommended if using Google models)
echo "DEEPWIKI_EMBEDDER_TYPE=google" >> .env
# Optional: Add OpenRouter API key if you want to use OpenRouter models
echo "OPENROUTER_API_KEY=your_openrouter_api_key" >> .env
# Optional: Add Ollama host if not local (defaults to http://localhost:11434)
@@ -67,6 +70,8 @@ Create a `.env` file in the project root with these keys:
```
GOOGLE_API_KEY=your_google_api_key
OPENAI_API_KEY=your_openai_api_key
# Optional: Use Google AI embeddings (recommended if using Google models)
DEEPWIKI_EMBEDDER_TYPE=google
# Optional: Add this if you want to use OpenRouter models
OPENROUTER_API_KEY=your_openrouter_api_key
# Optional: Add this if you want to use Azure OpenAI models
@@ -269,6 +274,89 @@ If you want to use embedding models compatible with the OpenAI API (such as Alib

This allows you to seamlessly switch to any OpenAI-compatible embedding service without code changes.
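As a sketch, pointing an OpenAI-compatible provider at DeepWiki usually comes down to two `.env` entries (the base-URL variable name is an assumption here — check the variable your DeepWiki version documents for custom endpoints):

```bash
# Illustrative .env sketch for an OpenAI-compatible embedding service.
# OPENAI_BASE_URL is assumed; verify the exact variable name in your setup.
OPENAI_API_KEY=your_provider_api_key
OPENAI_BASE_URL=https://your-openai-compatible-endpoint/v1
```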

## 🧠 Using Google AI Embeddings

DeepWiki now supports Google AI's latest embedding models as an alternative to OpenAI embeddings. This provides better integration when you're already using Google Gemini models for text generation.

### Features

- **Latest Model**: Uses Google's `text-embedding-004` model
- **Same API Key**: Uses your existing `GOOGLE_API_KEY` (no additional setup required)
- **Better Integration**: Optimized for use with Google Gemini text generation models
- **Task-Specific**: Supports semantic similarity, retrieval, and classification tasks
- **Batch Processing**: Efficient processing of multiple texts
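The batch-processing behavior can be sketched as follows — an illustrative helper, not DeepWiki's actual client code; the batch size of 100 matches the `embedder_google` configuration in this PR's `embedder.json`:

```python
# Minimal sketch of batched embedding requests (illustrative only --
# not DeepWiki's GoogleEmbedderClient). Texts are split into batches,
# and each batch would be sent as one API request, e.g. with model
# "text-embedding-004" and task_type "SEMANTIC_SIMILARITY".
def make_batches(texts, batch_size=100):
    """Split a list of texts into batches of at most batch_size."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

batches = make_batches([f"doc {i}" for i in range(250)])
print([len(b) for b in batches])  # [100, 100, 50]
```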

### How to Enable Google AI Embeddings

**Option 1: Environment Variable (Recommended)**

Set the embedder type in your `.env` file:

```bash
# Your existing Google API key
GOOGLE_API_KEY=your_google_api_key

# Enable Google AI embeddings
DEEPWIKI_EMBEDDER_TYPE=google
```

**Option 2: Docker Environment**

```bash
docker run -p 8001:8001 -p 3000:3000 \
-e GOOGLE_API_KEY=your_google_api_key \
-e DEEPWIKI_EMBEDDER_TYPE=google \
-v ~/.adalflow:/root/.adalflow \
ghcr.io/asyncfuncai/deepwiki-open:latest
```

**Option 3: Docker Compose**

Add to your `.env` file:

```bash
GOOGLE_API_KEY=your_google_api_key
DEEPWIKI_EMBEDDER_TYPE=google
```

Then run:

```bash
docker-compose up
```

### Available Embedder Types

| Type | Description | API Key Required | Notes |
|------|-------------|------------------|-------|
| `openai` | OpenAI embeddings (default) | `OPENAI_API_KEY` | Uses `text-embedding-3-small` model |
| `google` | Google AI embeddings | `GOOGLE_API_KEY` | Uses `text-embedding-004` model |
| `ollama` | Local Ollama embeddings | None | Requires local Ollama installation |

### Why Use Google AI Embeddings?

- **Consistency**: If you're using Google Gemini for text generation, using Google embeddings provides better semantic consistency
- **Performance**: Google's latest embedding model offers excellent performance for retrieval tasks
- **Cost**: Competitive pricing compared to OpenAI
- **No Additional Setup**: Uses the same API key as your text generation models

### Switching Between Embedders

You can easily switch between different embedding providers:

```bash
# Use OpenAI embeddings (default)
export DEEPWIKI_EMBEDDER_TYPE=openai

# Use Google AI embeddings
export DEEPWIKI_EMBEDDER_TYPE=google

# Use local Ollama embeddings
export DEEPWIKI_EMBEDDER_TYPE=ollama
```

**Note**: When switching embedders, you may need to regenerate your repository embeddings as different models produce different vector spaces.
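The mismatch is easy to see from the embedding dimensions of each default model (a hypothetical check, not part of DeepWiki's codebase; dimensions are the published values for these models):

```python
# Embedding dimensions of the default model for each embedder type
# (illustrative check -- not DeepWiki code).
EMBEDDING_DIMS = {
    "openai": 1536,  # text-embedding-3-small
    "google": 768,   # text-embedding-004
    "ollama": 768,   # nomic-embed-text
}

def same_dimension(old_type: str, new_type: str) -> bool:
    """A dimension mismatch alone breaks the stored index; note that even
    equal-sized vectors from different models are not comparable."""
    return EMBEDDING_DIMS[old_type] == EMBEDDING_DIMS[new_type]

print(same_dimension("openai", "google"))  # False -- regeneration required
```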

### Logging

DeepWiki uses Python's built-in `logging` module for diagnostic output. You can configure the verbosity and log file destination via environment variables:
@@ -311,19 +399,25 @@ docker-compose up

| Variable | Description | Required | Note |
|----------------------|--------------------------------------------------------------|----------|----------------------------------------------------------------------------------------------------------|
| `GOOGLE_API_KEY` | Google Gemini API key for AI generation | No | Required only if you want to use Google Gemini models |
| `OPENAI_API_KEY` | OpenAI API key for embeddings | Yes | Note: This is required even if you're not using OpenAI models, as it's used for embeddings. |
| `GOOGLE_API_KEY` | Google Gemini API key for AI generation and embeddings | No | Required for Google Gemini models and Google AI embeddings |
| `OPENAI_API_KEY` | OpenAI API key for embeddings and models | Conditional | Required if using OpenAI embeddings or models |
| `OPENROUTER_API_KEY` | OpenRouter API key for alternative models | No | Required only if you want to use OpenRouter models |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | No | Required only if you want to use Azure OpenAI models |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint | No | Required only if you want to use Azure OpenAI models |
| `AZURE_OPENAI_VERSION` | Azure OpenAI version | No | Required only if you want to use Azure OpenAI models |
| `OLLAMA_HOST` | Ollama Host (default: http://localhost:11434) | No | Required only if you want to use external Ollama server |
| `DEEPWIKI_EMBEDDER_TYPE` | Embedder type: `openai`, `google`, or `ollama` (default: `openai`) | No | Controls which embedding provider to use |
| `PORT` | Port for the API server (default: 8001) | No | If you host the API and frontend on the same machine, make sure to change the port in `SERVER_BASE_URL` accordingly |
| `SERVER_BASE_URL` | Base URL for the API server (default: http://localhost:8001) | No | |
| `DEEPWIKI_AUTH_MODE` | Set to `true` or `1` to enable authorization mode. | No | Defaults to `false`. If enabled, `DEEPWIKI_AUTH_CODE` is required. |
| `DEEPWIKI_AUTH_CODE` | The secret code required for wiki generation when `DEEPWIKI_AUTH_MODE` is enabled. | No | Only used if `DEEPWIKI_AUTH_MODE` is `true` or `1`. |

If you're not using ollama mode, you need to configure an OpenAI API key for embeddings. Other API keys are only required when configuring and using models from the corresponding providers.
**API Key Requirements:**
- If using `DEEPWIKI_EMBEDDER_TYPE=openai` (default): `OPENAI_API_KEY` is required
- If using `DEEPWIKI_EMBEDDER_TYPE=google`: `GOOGLE_API_KEY` is required
- If using `DEEPWIKI_EMBEDDER_TYPE=ollama`: No API key required (local processing)

Other API keys are only required when configuring and using models from the corresponding providers.

## Authorization Mode

api/api.py: 1 addition & 1 deletion
@@ -507,7 +507,7 @@ async def delete_wiki_cache(

if WIKI_AUTH_MODE:
logger.info("check the authorization code")
if WIKI_AUTH_CODE != authorization_code:
if not authorization_code or WIKI_AUTH_CODE != authorization_code:
raise HTTPException(status_code=401, detail="Authorization code is invalid")

logger.info(f"Attempting to delete wiki cache for {owner}/{repo} ({repo_type}), lang: {language}")
api/config.py: 49 additions & 4 deletions
@@ -10,6 +10,7 @@
from api.openai_client import OpenAIClient
from api.openrouter_client import OpenRouterClient
from api.bedrock_client import BedrockClient
from api.google_embedder_client import GoogleEmbedderClient
from api.azureai_client import AzureAIClient
from adalflow import GoogleGenAIClient, OllamaClient

@@ -43,12 +44,16 @@
WIKI_AUTH_MODE = raw_auth_mode.lower() in ['true', '1', 't']
WIKI_AUTH_CODE = os.environ.get('DEEPWIKI_AUTH_CODE', '')

# Embedder settings
EMBEDDER_TYPE = os.environ.get('DEEPWIKI_EMBEDDER_TYPE', 'openai').lower()

# Get configuration directory from environment variable, or use default if not set
CONFIG_DIR = os.environ.get('DEEPWIKI_CONFIG_DIR', None)

# Client class mapping
CLIENT_CLASSES = {
"GoogleGenAIClient": GoogleGenAIClient,
"GoogleEmbedderClient": GoogleEmbedderClient,
"OpenAIClient": OpenAIClient,
"OpenRouterClient": OpenRouterClient,
"OllamaClient": OllamaClient,
@@ -141,7 +146,7 @@ def load_embedder_config():
embedder_config = load_json_config("embedder.json")

# Process client classes
for key in ["embedder", "embedder_ollama"]:
for key in ["embedder", "embedder_ollama", "embedder_google"]:
if key in embedder_config and "client_class" in embedder_config[key]:
class_name = embedder_config[key]["client_class"]
if class_name in CLIENT_CLASSES:
@@ -151,12 +156,18 @@

def get_embedder_config():
"""
Get the current embedder configuration.
Get the current embedder configuration based on DEEPWIKI_EMBEDDER_TYPE.

Returns:
dict: The embedder configuration with model_client resolved
"""
return configs.get("embedder", {})
embedder_type = EMBEDDER_TYPE
if embedder_type == 'google' and 'embedder_google' in configs:
return configs.get("embedder_google", {})
elif embedder_type == 'ollama' and 'embedder_ollama' in configs:
return configs.get("embedder_ollama", {})
else:
return configs.get("embedder", {})

def is_ollama_embedder():
"""
@@ -178,6 +189,40 @@ def is_ollama_embedder():
client_class = embedder_config.get("client_class", "")
return client_class == "OllamaClient"

def is_google_embedder():
"""
Check if the current embedder configuration uses GoogleEmbedderClient.

Returns:
bool: True if using GoogleEmbedderClient, False otherwise
"""
embedder_config = get_embedder_config()
if not embedder_config:
return False

# Check if model_client is GoogleEmbedderClient
model_client = embedder_config.get("model_client")
if model_client:
return model_client.__name__ == "GoogleEmbedderClient"

# Fallback: check client_class string
client_class = embedder_config.get("client_class", "")
return client_class == "GoogleEmbedderClient"

def get_embedder_type():
"""
Get the current embedder type based on configuration.

Returns:
str: 'ollama', 'google', or 'openai' (default)
"""
if is_ollama_embedder():
return 'ollama'
elif is_google_embedder():
return 'google'
else:
return 'openai'

# Load repository and file filters configuration
def load_repo_config():
return load_json_config("repo.json")
@@ -265,7 +310,7 @@ def load_lang_config():

# Update embedder configuration
if embedder_config:
for key in ["embedder", "embedder_ollama", "retriever", "text_splitter"]:
for key in ["embedder", "embedder_ollama", "embedder_google", "retriever", "text_splitter"]:
if key in embedder_config:
configs[key] = embedder_config[key]

api/config/embedder.json: 14 additions & 0 deletions
@@ -8,6 +8,20 @@
"encoding_format": "float"
}
},
"embedder_ollama": {
"client_class": "OllamaClient",
"model_kwargs": {
"model": "nomic-embed-text"
}
},
"embedder_google": {
"client_class": "GoogleEmbedderClient",
"batch_size": 100,
"model_kwargs": {
"model": "text-embedding-004",
"task_type": "SEMANTIC_SIMILARITY"
}
},
"retriever": {
"top_k": 20
},