A machine learning-powered web application that classifies SMS messages as spam or not using NLP techniques and the Multinomial Naive Bayes algorithm. This project includes full model training, evaluation, and a user-friendly Streamlit interface.
- Source: Kaggle - UCI SMS Spam Collection Dataset
- Description: A set of SMS messages, each labeled as spam or not spam.
- Data cleaning and preprocessing
- Exploratory Data Analysis (EDA)
- Text tokenization using NLTK
- Vectorization using TF-IDF
- Model comparison using multiple classifiers
- Final model: Multinomial Naive Bayes
- Evaluation metrics: Accuracy, Precision, Confusion Matrix
- Streamlit web app for user interaction
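The core of the workflow listed above (TF-IDF vectorization followed by Multinomial Naive Bayes) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the six toy messages, the 1 = spam / 0 = not spam label encoding, and the `classify` helper are assumptions made for the example, and scikit-learn's built-in tokenizer stands in for the NLTK preprocessing step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the SMS Spam Collection data (assumed encoding: 1 = spam, 0 = not spam).
messages = [
    "Win a free prize now",
    "Free cash, claim your prize",
    "Urgent: claim free reward now",
    "Are we meeting for lunch today",
    "See you at home tonight",
    "Can you call me later",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each message into a weighted term vector;
# Multinomial Naive Bayes is then fitted on those vectors.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(messages, labels)


def classify(text):
    """Return a human-readable label for a single message (hypothetical helper)."""
    return "spam" if pipeline.predict([text])[0] == 1 else "not spam"


print(classify("claim your free prize"))
print(classify("call me for lunch tonight"))
```

Wrapping both steps in one pipeline keeps the vectorizer and classifier in sync, so the same transformation is applied at training and prediction time.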
- Clone the repo
  git clone https://github.com/Mozeel-V/spam-detection.git
  cd spam-detection
- Create a Conda environment (optional)
  conda create -n spamguard
  conda activate spamguard
- Install dependencies
  pip install -r requirements.txt
- Run the app
  streamlit run app.py
A preview of the app can be accessed from here.
📦 spam-detection/
├── app.py # Streamlit app
├── model.pkl # Trained Naive Bayes model
├── vectorizer.pkl # TF-IDF vectorizer
├── spam.csv # Original dataset
├── spam_utf8.csv # UTF-8 converted dataset
├── spam-detection.ipynb # Training and EDA notebook
├── requirements.txt # Python dependencies
├── LICENSE # MIT open-source license
└── README.md # Contains basic info about the project
- The dataset was vectorized using TF-IDF to capture term importance.
- Multiple classifiers were tested (e.g. Logistic Regression, SVM).
- Multinomial Naive Bayes gave the best results on precision and accuracy.
- The model was saved as `model.pkl` and used directly in the app.
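The comparison and save steps described above might look something like the sketch below. It is an illustration under stated assumptions rather than the notebook's code: the toy train/test messages, the 1 = spam label encoding, the classifier settings, and the temporary output directory are all made up for the example. Only the file names `model.pkl` and `vectorizer.pkl` come from the project layout.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy data standing in for the real dataset (assumed encoding: 1 = spam, 0 = not spam).
train_texts = [
    "Win a free prize now", "Free cash, claim your prize",
    "Urgent: claim free reward now", "Are we meeting for lunch today",
    "See you at home tonight", "Can you call me later",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim your free cash prize", "call me when you are home"]
test_labels = [1, 0]

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Candidate classifiers; hyperparameters here are illustrative defaults.
candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "SVC": SVC(kernel="linear"),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, train_labels)
    pred = clf.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "precision": precision_score(test_labels, pred, zero_division=0),
        "confusion_matrix": confusion_matrix(test_labels, pred, labels=[0, 1]),
    }

# Persist the chosen model and the fitted vectorizer together.
# A temporary directory is used here so the sketch does not overwrite real files.
out_dir = Path(tempfile.mkdtemp())
model = candidates["MultinomialNB"]
with open(out_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open(out_dir / "model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload both artifacts, as an app would before serving predictions.
with open(out_dir / "vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(out_dir / "model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
restored_pred = loaded_model.predict(loaded_vectorizer.transform(test_texts))
```

Saving the vectorizer alongside the model matters because the classifier only understands feature vectors built from the vocabulary it was trained on.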
- Python, Pandas, Scikit-learn, NLTK
- TF-IDF Vectorizer
- Streamlit (for frontend)
This project is licensed under the MIT License.
Feel free to fork, raise issues, or submit PRs to improve this project!
Mozeel Vanwani | IIT Kharagpur CSE
Email: [[email protected]]