Goutam16-Withcode/ArXivLens-AI

📚 ArXivLens AI - Advanced Analytics Suite

A comprehensive data science project for analyzing ArXiv AI/ML research papers (2025-2026) with an interactive, animated, glassmorphic Streamlit dashboard.


🎯 Project Overview

This project combines multiple machine learning and LLM/RAG techniques to analyze research papers:

  • LDA Topic Modeling: Discover 10 major research topics dynamically
  • Semantic Search: Vector similarity searches across abstracts using sentence embeddings
  • Clustering: K-Means clustering (8 clusters) projected in interactive 3D PCA space
  • Network Analysis: Co-authorship collaboration and PageRank analysis
  • Trend Forecasting: Track and forecast research popularity using regression models
  • Advanced LLM/RAG: Question-answering, document summarization, and automated literature reviews
  • Interactive Dashboard: Modern responsive UI with animated particle canvas backgrounds

📦 Project Structure

Project-1/
├── model.ipynb                 # Main Jupyter notebook with all analysis
├── streamlit_app.py           # Streamlit web application dashboard
├── requirements.txt           # Python dependencies
├── README.md                  # This file
├── DEPLOYMENT_GUIDE.md        # Deployment instructions and guidelines
├── run_api.py                 # REST API entry point
├── run_streamlit.bat          # Streamlit launcher batch script (Windows)
├── run_streamlit.ps1          # Streamlit launcher PowerShell script
├── setup.bat                  # Developer environment setup script
├── .env                       # Local environment API keys and configs
├── src/                       # Modular computational engines
│   ├── api.py                 # REST API endpoints & payload models
│   ├── db.py                  # SQLite database connection management
│   ├── citation_analysis.py   # PageRank and citation leaderboard computation
│   ├── predictive_analytics.py # Topic forecasting and saturation regression models
│   ├── nlp_applications.py    # Summarization, QA, and Literature Review engines
│   ├── dl_models.py           # Neural network classifiers and scientific tone engines
│   ├── knowledge_graph.py     # NetworkX graphs & semantic relationships 
│   ├── report_generator.py    # ReportLab PDF executive brief builder
│   └── run_pipeline.py        # Database initializer and model trainer
└── models/                    # Saved models & outputs (auto-generated)
    ├── lda_model.model       # LDA topic model
    ├── kmeans_model.pkl      # K-Means clustering model
    ├── pca_model.pkl         # PCA mapping model
    ├── dictionary.dict       # Gensim topic dictionary
    ├── corpus.pkl            # Preprocessed gensim corpus
    ├── vectorizer.pkl        # TF-IDF vectorizer
    ├── network_graph.pkl     # Co-authorship NetworkX graph
    ├── embeddings.npy        # High-dimensional sentence embeddings
    ├── embeddings_2d.npy     # 2D PCA projected embeddings
    ├── metadata.json         # Run metadata
    ├── processed_data.csv    # Flattened paper catalog dataset
    ├── Research_Intelligence_Brief.pdf # Generated PDF brief
    ├── citation_predictor.pkl # Citations regression model
    ├── pytorch_classifier.pkl # PyTorch model checkpoint
    └── research_suite.db      # Primary SQLite database data store

🚀 Quick Start

1. Install Dependencies

cd e:\Project-1
pip install -r requirements.txt

2. Initialize the Database and Train ML Models

# This script creates the SQLite database (listed as models/research_suite.db under Project Structure) and trains the LDA, PCA, K-Means, and network models
python -m src.run_pipeline

3. Launch Streamlit Dashboard

streamlit run streamlit_app.py --server.port 8501

The dashboard will open automatically in your browser at http://localhost:8501.


🤖 LLM & RAG Integration

The suite supports generative AI analysis using OpenAI or DeepSeek chat models.

🔑 Sidebar API Key Manager

Directly inside the sidebar, you can expand 🔑 LLM API Key Configuration to enter your personal DeepSeek or OpenAI API keys. Keys entered here are cached in st.session_state and prioritized for all queries.
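
The priority order described above can be sketched as a small resolver. This is a minimal illustration, not the application's actual code: the function name `resolve_api_key`, the session key names, and the environment variable names are all assumptions, and a plain dict stands in for `st.session_state`.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    provider: str,
    session_state: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the sidebar key if one was entered, else the environment key, else None."""
    env = os.environ if env is None else env
    # Hypothetical naming scheme: "deepseek" -> "deepseek_api_key" / "DEEPSEEK_API_KEY".
    sidebar_key = (session_state.get(f"{provider}_api_key") or "").strip()
    if sidebar_key:
        return sidebar_key  # keys typed into the sidebar take priority
    env_key = (env.get(f"{provider.upper()}_API_KEY") or "").strip()
    return env_key or None
```

Returning `None` when no key is found lets the caller decide whether to switch to the offline fallbacks.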

⚠️ Transparent Billing Error Handling

If a configured API key runs out of funds, the application detects the 402 Insufficient Balance API response, issues a styled warning notice with platform top-up links, and falls back gracefully to local resources.

🔄 Offline Fallbacks

If all LLM APIs are offline or lack balance:

  • Summarization: Falls back to a local TF-IDF TextRank extractive summary.
  • Literature Reviews: Falls back to an extractive metadata synthesis template.
  • Document QA: Synthesizes structured insights directly from retrieved document abstracts.
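
The fallback behavior for summarization can be sketched roughly as follows. The names `InsufficientBalanceError`, `summarize`, and `extractive_summary` are hypothetical, and the local summarizer here is a simple word-frequency scorer standing in for the TF-IDF TextRank approach the project describes.

```python
import re
from collections import Counter
from typing import Callable, Optional, Tuple


class InsufficientBalanceError(Exception):
    """Hypothetical error raised when the provider answers with HTTP 402."""


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences with the highest average word frequency, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


def summarize(text: str, llm_call: Optional[Callable[[str], str]] = None) -> Tuple[str, str]:
    """Try the LLM first; on a billing or connection failure, use the local summary.

    Returns (summary, source), where source is "llm" or "local".
    """
    if llm_call is not None:
        try:
            return llm_call(text), "llm"
        except (InsufficientBalanceError, ConnectionError):
            pass  # fall through to the offline path
    return extractive_summary(text), "local"
```

Returning the source label alongside the summary makes it easy for the UI to show a notice when the local path was used.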

📊 Dashboard Sections

🏠 Executive Dashboard

  • High-level KPIs (Total papers, clusters, citation edges)
  • Glassmorphic card metrics with hover animation effects
  • Interactive Plotly distribution chart

🔍 Semantic Search & QA

  • Vector Search: Semantic query matching on sentence embeddings (all-MiniLM-L6-v2, a BERT-style model)
  • Interactive QA: Ask questions and get answers synthesized from papers. Toggle context between the Local Database or Global Live ArXiv.
  • Automated Literature Review: Generate cohesive paragraphs mapping out papers on a topic using local or global live search scopes.
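
The vector search step amounts to ranking abstracts by similarity to the query embedding. The sketch below assumes cosine similarity as the metric and uses toy 3-dimensional vectors in place of real sentence embeddings; `cosine_similarity` and `top_k` are illustrative names, not functions from this repository.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for abstract embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
```

With real embeddings, the same ranking is usually done with a vectorized matrix product rather than a Python loop.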

📊 Topics & Trend Forecasting

  • LDA Topic list with TF-IDF keyword frequencies
  • 12-Month topic popularity projections
  • Growth velocity and saturation analytics
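
A 12-month projection can be illustrated with an ordinary least-squares line fitted to monthly paper counts. This is a sketch under the assumption of a linear trend; the project's regression models may differ, and `fit_line`, `forecast`, and the sample counts are made up for the example.

```python
from typing import List, Sequence, Tuple


def fit_line(y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * t + intercept for t = 0, 1, ..., n - 1."""
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def forecast(y: Sequence[float], horizon: int = 12) -> List[float]:
    """Extend the fitted line `horizon` steps past the data, clipping at zero."""
    slope, intercept = fit_line(y)
    n = len(y)
    return [max(0.0, slope * (n + h) + intercept) for h in range(horizon)]


# Illustrative monthly paper counts for one topic.
monthly_counts = [10, 12, 14, 16, 18, 20]
projection = forecast(monthly_counts)
```

The fitted slope doubles as a simple growth-velocity measure: papers gained per month.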

🔗 Citation Network & PageRank

  • View collaboration networks, Node Degree distributions, and PageRank rankings
  • Export network insights
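
To make the PageRank ranking concrete, here is a dependency-free power-iteration sketch over an undirected co-authorship graph. The helper names and the author lists are invented for illustration; with NetworkX, `networkx.pagerank` performs the equivalent computation.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set


def build_coauthor_graph(papers: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Link every pair of authors who share a paper."""
    graph: Dict[str, Set[str]] = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(
    graph: Dict[str, Set[str]], damping: float = 0.85, iterations: int = 100
) -> Dict[str, float]:
    """Power iteration; authors with no co-authors spread their rank uniformly."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for node in graph:
            # In an undirected graph, a node's neighbors are exactly the nodes linking to it.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Invented author lists: one author collaborates with everyone else.
papers = [["Ada", "Bo"], ["Ada", "Cy"], ["Ada", "Di"]]
ranks = pagerank(build_coauthor_graph(papers))
```

In this toy graph the hub author receives the highest score, which is the pattern a PageRank leaderboard surfaces.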

👥 Researcher Analytics & Predictive Success

  • Metric calculations (Citations, h-index) and citation velocity trajectory projections
  • Search Scope Toggle: Toggle between Local Database Catalog and Global AI Search (LLM-driven) for instant global academic profiling.
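
For reference, the h-index is the largest h such that a researcher has at least h papers with at least h citations each. A minimal sketch of the standard definition follows; `h_index` is an illustrative name, not necessarily the function used in this repository.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h
```

For example, citation counts of 10, 8, 5, 4, and 3 give an h-index of 4, because four papers have at least four citations each.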

📁 PCA Paper Clustering & Summaries

  • 3D PCA Interactive Cluster Map: Project and rotate high-dimensional embeddings in a 3D Plotly canvas.
  • Document Summaries: Paste custom abstracts, select local papers, or use the Live ArXiv Search API to fetch and summarize any paper dynamically.

📈 Live ArXiv Monitor

  • Connects directly to the live ArXiv API feed to fetch the newest preprints and run ML classifiers and citation predictions on them in real time.
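
As a rough sketch of the fetch step, the public arXiv API returns an Atom feed that the standard library can parse. The function names, the default category, and the returned field names are assumptions for illustration; the network call itself is left out so the example runs offline.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from typing import Dict, List

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category: str = "cs.AI", max_results: int = 5) -> str:
    """URL for the newest submissions in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text: str) -> List[Dict[str, object]]:
    """Extract id, title, summary, and author names from an Atom feed string."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        papers.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM).strip(),
            "title": " ".join(title.split()),      # collapse line breaks in titles
            "summary": " ".join(summary.split()),
            "authors": [
                a.findtext("atom:name", default="", namespaces=ATOM)
                for a in entry.findall("atom:author", ATOM)
            ],
        })
    return papers
```

The feed text can be obtained with `urllib.request.urlopen(build_query_url()).read().decode("utf-8")`, after which each parsed entry can be passed to the classifiers and citation predictor.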

⚙️ System & Reports

  • Build a styled PDF research brief with ReportLab
  • Export processed paper datasets to CSV
  • View system path configurations

🛠️ Technical Stack

| Component | Technology |
|---|---|
| Data Processing | Pandas, NumPy, SQLite |
| ML & Clustering | Scikit-learn, Sentence-Transformers (all-MiniLM-L6-v2) |
| NLP & Topic Modeling | NLTK, Gensim, TF-IDF |
| Generative LLM / RAG | OpenAI API, DeepSeek API, Live ArXiv Client |
| Graph Analytics | NetworkX |
| Visualizations | Plotly Express, Plotly Graph Objects, HTML Canvas |
| PDF Generation | ReportLab PDF Library |
| Frontend UI | Streamlit, Glassmorphism, CSS Micro-animations |

About

ArXivLens AI is a platform that collects ArXiv papers and makes predictions based on that dataset.
