This project lets you ask questions about a PDF document using a hybrid retrieval-augmented generation (RAG) pipeline. It combines semantic search (FAISS) and keyword search (BM25) to retrieve relevant passages, then passes them to the Google Gemini LLM so that answers are grounded in the document's text.
- Upload and read PDF documents
- Chunk and preprocess text for better retrieval
- Semantic embeddings with `HuggingFaceBgeEmbeddings`
- Keyword search using BM25
- Hybrid retrieval with `EnsembleRetriever`
- Answer questions using Google Gemini LLM
- Interactive web interface via Gradio
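
To make the chunking feature concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is an illustration only, not the project's splitter (the install list suggests `langchain-text-splitters` is used for that); `chunk_text`, `chunk_size`, and `overlap` are hypothetical names and values chosen for this example.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at one
    boundary still appears intact in a neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Smaller chunks tend to make retrieval more precise but give the LLM less surrounding context per hit; the overlap is a simple hedge against splitting a relevant sentence in two.
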
Install required packages:
pip install -q langchain langchain-community langchain-google-genai langchain-text-splitters faiss-cpu pypdf2 sentence_transformers gradio rank_bm25
Set your Google API key:
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_GOOGLE_API_KEY"

- Upload a PDF
- Read the PDF text
- Split text into chunks
- Initialize embeddings
- Set up retrievers
- Initialize Google Gemini LLM
- Create prompt template
- Build RAG pipeline
- Ask questions
- Interactive web interface
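
The steps above can be sketched end to end roughly as follows. This is a hedged outline, not the project's actual notebook code: the function names (`format_docs`, `build_chain`, `launch_ui`), the chunk sizes, the retriever weights, the prompt wording, the embedding model `BAAI/bge-small-en-v1.5`, and the Gemini model name are all assumptions made for illustration. It also assumes a LangChain release in which `EnsembleRetriever` is importable from `langchain.retrievers`; newer releases may have moved it.

```python
def format_docs(docs) -> str:
    """Join retrieved chunks into a single context string for the prompt."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_chain(pdf_path: str, model_name: str = "gemini-1.5-flash"):
    """Build a hybrid RAG chain for one PDF (model_name is a placeholder)."""
    # Imports are local so the helper above works without the heavy packages.
    from PyPDF2 import PdfReader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceBgeEmbeddings
    from langchain_community.vectorstores import FAISS
    from langchain_community.retrievers import BM25Retriever
    from langchain.retrievers import EnsembleRetriever  # location is version-dependent
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnablePassthrough

    # Read the PDF text; pages without a text layer contribute an empty string.
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Split the text into overlapping chunks (sizes are illustrative).
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(text)

    # Semantic retriever: BGE embeddings stored in a FAISS index.
    embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-small-en-v1.5")
    faiss_retriever = FAISS.from_texts(chunks, embeddings).as_retriever(
        search_kwargs={"k": 4}
    )

    # Keyword retriever: BM25 over the same chunks.
    bm25_retriever = BM25Retriever.from_texts(chunks)
    bm25_retriever.k = 4

    # Hybrid retriever: equal weights are an assumption, tune as needed.
    retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
    )

    # Reads GOOGLE_API_KEY from the environment.
    llm = ChatGoogleGenerativeAI(model=model_name, temperature=0)

    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )

    # Retrieve -> format context -> fill prompt -> call the LLM -> plain string.
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )


def launch_ui(chain):
    """Wrap the chain in a minimal Gradio text-in, text-out interface."""
    import gradio as gr

    def answer(question: str) -> str:
        return chain.invoke(question)

    gr.Interface(fn=answer, inputs="text", outputs="text", title="PDF Q&A").launch()
```

With the packages installed and the API key set, something like `chain = build_chain("example.pdf")` followed by `chain.invoke("What is this document about?")` would answer a single question (the file name is a placeholder), and `launch_ui(chain)` would start the web interface.
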
Requirements:
- Python 3.9+
- Google API key with access to Gemini models
- Packages listed in Environment Setup
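
A quick pre-flight check for the first two requirements might look like the sketch below. `check_environment` is a hypothetical helper, not part of the project; it only confirms that the Python version is recent enough and that a key has been set, and it cannot confirm that the key actually has access to Gemini models.

```python
import os
import sys


def check_environment(env=None, version=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)

    problems = []
    if version < (3, 9):
        problems.append(f"Python 3.9+ is required, found {version[0]}.{version[1]}")

    key = env.get("GOOGLE_API_KEY", "")
    # Treat the README placeholder as "not set".
    if not key or key == "YOUR_GOOGLE_API_KEY":
        problems.append("GOOGLE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
```
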
Notes:
- Ensure your PDF contains text (not scanned images) for proper extraction.
- The ensemble retriever combines semantic and keyword-based search, so it can surface both passages that paraphrase the question and passages that contain its exact terms.
- Designed to run in Google Colab, but can be adapted to run locally.
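
To illustrate how two ranked result lists can be merged into one, here is a small pure-Python sketch of weighted Reciprocal Rank Fusion, the scheme LangChain documents for `EnsembleRetriever`. It illustrates the idea and is not the library's code; `weighted_rrf`, the document IDs, and the constant `c=60` are assumptions made for this example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (rank + c) from every list it appears in
    (rank starts at 1), and documents are returned by descending total score.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["a", "b", "c"]  # e.g. order returned by the FAISS retriever
keyword_hits = ["b", "d", "a"]   # e.g. order returned by the BM25 retriever
print(weighted_rrf([semantic_hits, keyword_hits], weights=[0.5, 0.5]))
```

Documents that rank well in both lists (here `b` and `a`) rise above documents found by only one retriever, which is the practical benefit of the hybrid setup.
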