A comprehensive data science project for analyzing ArXiv AI/ML research papers (2025-2026) with an interactive, animated, glassmorphic Streamlit dashboard.
This project combines multiple machine learning and LLM/RAG techniques to analyze research papers:
- LDA Topic Modeling: Discover 10 major research topics dynamically
- Semantic Search: Vector similarity searches across abstracts using sentence embeddings
- Clustering: K-Means clustering (8 clusters) projected in interactive 3D PCA space
- Network Analysis: Co-authorship collaboration and PageRank analysis
- Trend Forecasting: Track and forecast research popularity using regression models
- Advanced LLM/RAG: Question-answering, document summarization, and automated literature reviews
- Interactive Dashboard: Modern responsive UI with animated particle canvas backgrounds
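The semantic search feature listed above ranks abstracts by vector similarity to a query. A minimal sketch of that ranking step follows, assuming cosine similarity as the metric (the list above does not name one). The function names `cosine_similarity` and `semantic_search` and the toy 3-dimensional vectors are illustrative assumptions, not the project's code. Real embeddings would come from the sentence encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, abstract_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar abstracts."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(abstract_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (hypothetical values, for illustration only).
abstract_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
results = semantic_search([1.0, 0.0, 0.0], abstract_vecs, top_k=2)
print([index for index, _ in results])
```

With a few thousand abstracts a linear scan like this is usually fast enough. A vectorized NumPy version or an approximate nearest-neighbor index would be the natural next step for larger corpora.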
Project-1/
├── model.ipynb # Main Jupyter notebook with all analysis
├── streamlit_app.py # Streamlit web application dashboard
├── requirements.txt # Python dependencies
├── README.md # This file
├── DEPLOYMENT_GUIDE.md # Deployment instructions and guidelines
├── run_api.py # REST API entry point
├── run_streamlit.bat # Streamlit launcher batch script (Windows)
├── run_streamlit.ps1 # Streamlit launcher PowerShell script
├── setup.bat # Developer environment setup script
├── .env # Local environment API keys and configs
├── src/ # Modular computational engines
│ ├── api.py # REST API endpoints & payload models
│ ├── db.py # SQLite database connection management
│ ├── citation_analysis.py # PageRank and citation leaderboard computation
│ ├── predictive_analytics.py # Topic forecasting and saturation regression models
│ ├── nlp_applications.py # Summarization, QA, and Literature Review engines
│ ├── dl_models.py # Neural network classifiers and scientific tone engines
│ ├── knowledge_graph.py # NetworkX graphs & semantic relationships
│ ├── report_generator.py # ReportLab PDF executive brief builder
│ └── run_pipeline.py # Database initializer and model trainer
└── models/ # Saved models & outputs (auto-generated)
  ├── lda_model.model # LDA topic model
  ├── kmeans_model.pkl # K-Means clustering model
  ├── pca_model.pkl # PCA mapping model
  ├── dictionary.dict # Gensim topic dictionary
  ├── corpus.pkl # Preprocessed gensim corpus
  ├── vectorizer.pkl # TF-IDF vectorizer
  ├── network_graph.pkl # Co-authorship NetworkX graph
  ├── embeddings.npy # High-dimensional sentence embeddings
  ├── embeddings_2d.npy # 2D PCA projected embeddings
  ├── metadata.json # Run metadata
  ├── processed_data.csv # Flattened paper catalog dataset
  ├── Research_Intelligence_Brief.pdf # Generated PDF brief
  ├── citation_predictor.pkl # Citations regression model
  ├── pytorch_classifier.pkl # PyTorch model checkpoint
  └── research_suite.db # Primary SQLite database data store
cd e:\Project-1
pip install -r requirements.txt
# This script creates sqlite_database.db and trains LDA, PCA, K-Means, and network models
python -m src.run_pipeline
streamlit run streamlit_app.py --server.port 8501
The dashboard will open automatically in your browser at http://localhost:8501.
The suite supports generative AI analysis using OpenAI or DeepSeek chat models.
Directly inside the sidebar, you can expand 🔑 LLM API Key Configuration to enter your personal DeepSeek or OpenAI API keys. Keys entered here are cached in st.session_state and prioritized for all queries.
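The priority rule described above (a key typed into the sidebar wins over any other configured key) can be sketched as a small lookup. This is only an illustration: the helper `resolve_api_key`, the session key names such as `deepseek_api_key`, and the environment variable names such as `DEEPSEEK_API_KEY` are assumptions, and the app's real names may differ.

```python
def resolve_api_key(session_state, env, provider):
    """Pick the key for `provider`, preferring the sidebar entry over the environment.

    `session_state` and `env` are any mappings. The key names used here are
    hypothetical placeholders.
    """
    session_key = (session_state.get(f"{provider}_api_key") or "").strip()
    if session_key:
        return session_key  # the user-entered key takes priority
    env_key = (env.get(f"{provider.upper()}_API_KEY") or "").strip()
    return env_key or None  # None signals that no key is configured


# Plain dicts stand in for the session store and the environment.
session = {"deepseek_api_key": "sk-from-sidebar"}
environment = {"DEEPSEEK_API_KEY": "sk-from-env", "OPENAI_API_KEY": "sk-openai-env"}

print(resolve_api_key(session, environment, "deepseek"))
print(resolve_api_key(session, environment, "openai"))
```

In a Streamlit app the two mappings would presumably be `st.session_state` and `os.environ`. Keeping the function free of Streamlit imports makes the rule easy to unit-test.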
If a configured API key runs out of funds, the application detects the 402 Insufficient Balance API response, issues a styled warning notice with platform top-up links, and falls back gracefully to local resources.
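One way to structure this behavior is a loop that tries each configured provider, records a warning when one fails, and hands the request to a local routine when none succeed. The sketch below is a hedged illustration, not the application's code: `InsufficientBalanceError`, `answer_with_fallback`, and the stub providers are hypothetical names, and a real client would raise its own exception type for a 402 response.

```python
class InsufficientBalanceError(Exception):
    """Hypothetical stand-in for a provider response with HTTP status 402."""


def answer_with_fallback(prompt, providers, local_fallback):
    """Try each (name, callable) in order; use local_fallback if all fail.

    Returns a tuple (answer, source_name, warnings).
    """
    warnings = []
    for name, call in providers:
        try:
            return call(prompt), name, warnings
        except InsufficientBalanceError:
            warnings.append(f"{name}: insufficient balance (HTTP 402), top up to re-enable")
        except Exception as exc:  # timeouts, network errors, rejected keys
            warnings.append(f"{name}: unavailable ({exc})")
    return local_fallback(prompt), "local", warnings


def broke_provider(prompt):
    # Simulates a provider whose account has run out of funds.
    raise InsufficientBalanceError()


def local_summary(prompt):
    # Placeholder for a local extractive method: keep the first sentence.
    return prompt.split(".")[0] + "."


answer, source, warnings = answer_with_fallback(
    "Transformers dominate NLP. They scale well.",
    [("deepseek", broke_provider)],
    local_summary,
)
print(source, answer, warnings)
```

Returning the warnings alongside the answer lets the UI layer decide how to display them, for example as a styled notice with top-up links.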
If all LLM APIs are offline or lack balance:
- Summarization: Falls back to a local TF-IDF TextRank extractive summary.
- Literature Reviews: Falls back to an extractive metadata synthesis template.
- Document QA: Synthesizes structured insights directly from retrieved document abstracts.
- High-level KPIs (Total papers, clusters, citation edges)
- Glassmorphic card metrics with hover animation effects
- Interactive Plotly distribution chart
- Vector Search: Semantic query matching on BERT embeddings
- Interactive QA: Ask questions and get answers synthesized from papers. Toggle the context between the Local Database and Global Live ArXiv.
- Automated Literature Review: Generate cohesive paragraphs mapping out papers on a topic using local or global live search scopes.
- LDA Topic list with TF-IDF keyword frequencies
- 12-Month topic popularity projections
- Growth velocity and saturation analytics
- View collaboration networks, Node Degree distributions, and PageRank rankings
- Export network insights
- Metric calculations (Citations, h-index) and citation velocity trajectory projections
- Search Scope Toggle: Toggle between Local Database Catalog and Global AI Search (LLM-driven) for instant global academic profiling.
- 3D PCA Interactive Cluster Map: Project and rotate high-dimensional embeddings in a 3D Plotly canvas.
- Document Summaries: Paste custom abstracts, select local papers, or use the Live ArXiv Search API to fetch and summarize any paper dynamically.
- Connects directly to the live ArXiv API feed to fetch the newest preprints and run the ML classifiers and citation predictions on them in real time.
- Build a styled PDF research brief with ReportLab
- Export processed paper datasets to CSV
- View system path configurations
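The PageRank rankings mentioned in the collaboration-network features above can be illustrated with a short power-iteration sketch over a co-authorship graph. This is a pure-Python approximation under stated assumptions: an undirected, unweighted graph, a damping factor of 0.85, and toy author names. The project lists NetworkX for graph analytics, where `networkx.pagerank` would normally do this work. The helpers `coauthorship_graph` and `pagerank` here are hypothetical.

```python
from itertools import combinations


def coauthorship_graph(author_lists):
    """Build {author: set(coauthors)} from one author list per paper."""
    graph = {}
    for authors in author_lists:
        for author in authors:
            graph.setdefault(author, set())
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on an undirected graph {node: set(neighbors)}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by isolated authors is redistributed evenly to everyone.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {}
        for v in nodes:
            incoming = sum(rank[u] / len(graph[u]) for u in graph[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy data: "Ada" co-authors with three otherwise unconnected researchers.
papers = [["Ada", "Bo"], ["Ada", "Cy"], ["Ada", "Di"]]
ranks = pagerank(coauthorship_graph(papers))
top_author = max(ranks, key=ranks.get)
print(top_author)
```

The scores sum to 1, so they can be read as each author's share of overall influence in the network and sorted directly into a leaderboard.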
| Component | Technology |
|---|---|
| Data Processing | Pandas, NumPy, SQLite |
| ML & Clustering | Scikit-learn, Sentence-Transformers (all-MiniLM-L6-v2) |
| NLP & Topic Modeling | NLTK, Gensim, TF-IDF |
| Generative LLM / RAG | OpenAI API, DeepSeek API, Live ArXiv Client |
| Graph Analytics | NetworkX |
| Visualizations | Plotly Express, Plotly Graph Objects, HTML Canvas |
| PDF Generation | ReportLab PDF Library |
| Frontend UI | Streamlit, Glassmorphism, CSS Micro-animations |