This application leverages machine learning to detect spam SMS messages.
This repository contains a web application for detecting spam SMS messages. The application uses machine learning models (Extra Trees and Bernoulli Naive Bayes) to classify messages as spam or not spam. The app also allows users to provide feedback on the classification results, which can be used to retrain the models periodically.
- Prediction: Classify SMS messages as spam or not spam using Extra Trees or Bernoulli Naive Bayes models.
- Feedback: Users can provide feedback on the predictions to improve model performance.
- Continuous Training: The application supports periodic retraining of models using the feedback data.
/sms-spam-detection
│
├── /model
│   ├── BernoulliNB.pkl
│   └── Extra_Tree.pkl
│
├── /static
│   └── /images
│
├── app.py
├── streamlit_app.py
├── docker_app.py
├── Dockerfile
├── Dockerfile.fastapi
├── docker-compose.yml
└── requirements.txt
- `app.py`: Defines the FastAPI application.
- `streamlit_app.py`: Defines the Streamlit web app.
- `docker_app.py`: Streamlit web app used inside Docker.
- `Dockerfile`: Dockerfile for building the Docker image.
- `docker-compose.yml`: Docker Compose file for orchestrating the services.
- `requirements.txt`: List of dependencies.
- `model/`: Directory containing the pre-trained machine learning models.
- `static/`: Directory containing static files such as images used in the interface.
- Clone the repository:
  `git clone https://github.com/Sibikrish3000/sms-spam-detection.git`
  `cd sms-spam-detection`
- Install the required packages:
  `pip install -r requirements.txt`
- Download NLTK data:
  `python -m nltk.downloader punkt`
  `python -m nltk.downloader stopwords`
- Start the FastAPI server:
  `uvicorn app:app --host 0.0.0.0 --port 8000 --reload`
- Run the Streamlit application:
  `streamlit run streamlit_app.py`
- Build and start the containers:
  `docker network create AIservice`
  `docker-compose up --build`
- Access the Streamlit web app at http://localhost:8501.
- Enter SMS Message: Input the SMS message you want to classify.
- Select Model: Choose between Extra Trees and Bernoulli Naive Bayes models.
- Predict: Click the "Predict" button to see the classification result.
- Feedback: Provide feedback on the prediction by marking the message as spam or not spam and submitting it.
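Besides the Streamlit interface, predictions can also be requested from the FastAPI service directly. The snippet below is only a sketch: the `/predict` route, the payload fields, and the response shape are assumptions for illustration; check `app.py` for the actual endpoint and request schema.

```python
# Hypothetical request to the FastAPI service; the route, fields, and response
# format are assumptions -- see app.py for the real API.
import requests

payload = {"message": "Congratulations! You won a free ticket.", "model": "Extra Trees"}
response = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
response.raise_for_status()
print(response.json())
```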
Continuous Training (CT) ensures that the machine learning models stay up-to-date with new data and feedback. Here are some suggestions for implementing CT for this application:
Online learning is suitable for scenarios where data arrives continuously, and the model needs to update frequently.
- Implementation: Update models incrementally as new labeled data arrives. Use techniques like stochastic gradient descent or mini-batch learning to update models in real time based on user feedback, e.g., via the `partial_fit()` method available in some scikit-learn models (such as `SGDClassifier` and `BernoulliNB`); see the sketch after this list.
- Benefits: The model updates with each new piece of feedback, allowing it to adapt quickly to new patterns.
- Challenges: May require more careful tuning and monitoring to ensure model stability.
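As an illustration of the online approach, the sketch below updates a `BernoulliNB` model one feedback item at a time. It assumes feedback arrives as (message, label) pairs with 1 = spam and 0 = not spam, and uses a stateless `HashingVectorizer` so no vectorizer refit is needed; the retraining script further below uses TF-IDF instead, so treat this purely as a pattern.

```python
# Minimal online-learning sketch (names and label encoding are illustrative assumptions).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# HashingVectorizer is stateless, so single messages can be transformed without refitting.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, binary=True)
model = BernoulliNB()

def update_with_feedback(message: str, label: int) -> None:
    """Incrementally update the model with one labeled message (1 = spam, 0 = not spam)."""
    X = vectorizer.transform([message])
    if not hasattr(model, "classes_"):
        # The full set of classes must be declared on the first partial_fit call.
        model.partial_fit(X, [label], classes=[0, 1])
    else:
        model.partial_fit(X, [label])

update_with_feedback("WIN a FREE prize! Reply now!!!", 1)
update_with_feedback("Are we still meeting for lunch?", 0)
```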
Offline learning involves retraining the model periodically with the accumulated feedback data.
- Implementation: Retrain the model every fixed interval (e.g., daily, weekly) using the feedback data stored in the CSV file.
- Benefits: Simpler to implement and manage, as retraining can be scheduled during off-peak times.
- Challenges: Model updates less frequently compared to online learning, which may delay the incorporation of new patterns.
Partial fit combines aspects of both online and offline learning.
- Implementation: Use models that support the `partial_fit()` method. Collect feedback data over a period and then update the model in smaller batches (see the sketch after this list).
- Benefits: Provides a balance between frequent updates and stability.
- Challenges: Requires careful management of the batch size and frequency of updates.
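A rough sketch of the batched variant, assuming feedback is buffered in memory and flushed every `BATCH_SIZE` items; the classifier, vectorizer, and batch size here are illustrative choices, not the app's actual components.

```python
# Mini-batch partial_fit sketch: buffer feedback, then update the model in batches.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier()

BATCH_SIZE = 100  # assumed batch size; tune to the volume of incoming feedback
buffer_messages, buffer_labels = [], []

def add_feedback(message: str, label: int) -> None:
    """Queue one feedback item and update the model once a full batch has accumulated."""
    buffer_messages.append(message)
    buffer_labels.append(label)
    if len(buffer_messages) >= BATCH_SIZE:
        X = vectorizer.transform(buffer_messages)
        model.partial_fit(X, buffer_labels, classes=[0, 1])
        buffer_messages.clear()
        buffer_labels.clear()
```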
- Collect Feedback: Save feedback data into a CSV file.
- Scheduled Retraining: Set up a cron job or similar scheduling tool to retrain the model every 10 days.
- Model Update: Load the feedback data, preprocess it, and retrain the model.
- Save Model: Save the retrained model to a file and replace the old model.
# Open the crontab editor
crontab -e
# Add the following line to schedule retraining every 10 days
0 0 */10 * * /usr/bin/python3 /path/to/your/retrain_script.py
import pandas as pd
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import ExtraTreesClassifier
# Load feedback data
df = pd.read_csv('feedback.csv')
# Preprocess the messages
# Include your preprocessing function here
# Vectorize the messages
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']
# Retrain the model
model = ExtraTreesClassifier()
model.fit(X, y)
# Save the retrained model; the fitted vectorizer must be persisted too,
# since it is needed to transform messages at prediction time
joblib.dump(model, 'Extra_Tree.pkl')
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')  # illustrative filename for the saved vectorizer
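For completeness, loading the retrained artifacts at prediction time might look like the following; the paths mirror the script above and are illustrative, and the printed value is whatever label format is stored in `feedback.csv`.

```python
# Illustrative inference with the retrained artifacts saved by the script above.
import joblib

model = joblib.load('Extra_Tree.pkl')
vectorizer = joblib.load('tfidf_vectorizer.pkl')  # the fitted TF-IDF vectorizer saved alongside the model

message = "Congratulations! You have been selected for a free cruise."
X = vectorizer.transform([message])
print(model.predict(X)[0])  # prints the predicted label as stored in feedback.csv
```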
This project is licensed under the MIT License. See the LICENSE file for details.