In today’s digital economy, financial fraud is a growing threat, costing businesses and consumers billions annually. Proactively detecting and preventing fraudulent transactions is essential for customer trust and financial stability.
The Goal: Build and deploy a scalable, end-to-end fraud detection pipeline on AWS that:
- Ingests raw transaction data
- Processes it at scale using Apache Spark
- Predicts fraud with a machine learning model
- Generates actionable reports with Gemini Gen AI
This repository is structured as a real-world case study, simulating the role of a Data Engineer tasked with delivering a production-ready fraud detection system.
To handle the high volume and velocity of financial data, the pipeline is fully cloud-native and relies on AWS services for scalability, automation, and efficiency.
- Data Lake & Storage (AWS S3): Centralized data lake with folders for raw data (`input/`), processed data (`processed/`), reports (`output/`), models, scripts, and logs.
- Large-Scale Processing (AWS EMR + Spark): Distributed ETL via `preprocessing_pipeline.py` for cleaning and feature engineering.
- Fraud Prediction (ML Inference): `fraud_simulation.py` loads a Random Forest Classifier (`model.pkl`) to detect suspicious transactions.
- Automated Reporting (Gemini Gen AI): Converts prediction outputs into human-readable fraud reports for risk teams.
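To make the reporting stage more concrete, here is a minimal sketch of how prediction rows might be turned into a plain-text prompt for a Gen AI model. The record fields (`transaction_id`, `amount`, `category`, `fraud_probability`), the 0.5 threshold, and the prompt wording are all assumptions for illustration and may not match what `fraud_simulation.py` does. The call to Gemini itself is deliberately left out.

```python
from typing import Iterable, Mapping


def build_report_prompt(predictions: Iterable[Mapping], threshold: float = 0.5) -> str:
    """Summarize flagged transactions as a prompt for a text-generation model."""
    rows = list(predictions)
    # Keep only rows whose (hypothetical) fraud score reaches the threshold
    flagged = [r for r in rows if r["fraud_probability"] >= threshold]
    # Most suspicious transactions first
    flagged.sort(key=lambda r: r["fraud_probability"], reverse=True)

    lines = [
        "You are a fraud analyst. Write a short report for the risk team.",
        f"Transactions scored: {len(rows)}",
        f"Transactions flagged (score >= {threshold}): {len(flagged)}",
        "Flagged transactions:",
    ]
    for r in flagged:
        lines.append(
            f"- id={r['transaction_id']} amount={r['amount']:.2f} "
            f"category={r['category']} score={r['fraud_probability']:.2f}"
        )
    return "\n".join(lines)


# Made-up sample rows, for illustration only
sample = [
    {"transaction_id": "t1", "amount": 12.5, "category": "shopping", "fraud_probability": 0.03},
    {"transaction_id": "t2", "amount": 980.0, "category": "travel", "fraud_probability": 0.91},
]
prompt = build_report_prompt(sample)
print(prompt)
```

Keeping the prompt construction separate from the model call makes it easy to inspect exactly what is sent to the model before a report reaches the risk team.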
The Random Forest Classifier was chosen for its robustness and its ability to handle imbalanced data.
Key Metrics:
- 🎯 Precision: Ensures flagged fraud cases are highly reliable
- 🔍 Recall: Captures the majority of fraudulent transactions
- ⚖️ Balance: Minimized false positives while maximizing detection
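For reference, precision and recall for the fraud class can be computed from true and predicted labels as sketched below. The labels are toy values made up for the example (1 = fraud, 0 = legitimate); they are not results from this project's model.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud, missed
    # Guard against division by zero when nothing is flagged or no fraud exists
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```

Precision answers "of the transactions flagged, how many were really fraud?", while recall answers "of the real fraud cases, how many were flagged?". Raising the decision threshold typically trades recall for precision.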
Top Predictive Features:
- Transaction Amount
- Transaction Category (e.g., shopping, travel)
- Time of Day
- Customer Age
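Two of these features, time of day and customer age, are usually derived rather than read directly from the raw data. The sketch below shows one way to do that for a single record in plain Python. The raw field names (`timestamp`, `date_of_birth`, `amount`, `category`) are assumptions, and the actual `preprocessing_pipeline.py` runs on Spark and may derive these features differently.

```python
from datetime import date, datetime


def derive_features(txn: dict) -> dict:
    """Derive model features from one raw transaction record."""
    ts = datetime.fromisoformat(txn["timestamp"])
    dob = date.fromisoformat(txn["date_of_birth"])
    # Age in whole years on the transaction date
    had_birthday = (ts.month, ts.day) >= (dob.month, dob.day)
    age = ts.year - dob.year - (0 if had_birthday else 1)
    return {
        "amount": float(txn["amount"]),
        "category": txn["category"].strip().lower(),  # normalize labels
        "hour_of_day": ts.hour,
        "customer_age": age,
    }


# Made-up record, for illustration only
features = derive_features({
    "timestamp": "2023-06-15T23:41:07",
    "date_of_birth": "1990-09-01",
    "amount": "249.99",
    "category": " Travel ",
})
print(features)
```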
# Create bucket
aws s3 mb s3://fraudetection
# Folder structure
aws s3api put-object --bucket fraudetection --key input/
aws s3api put-object --bucket fraudetection --key processed/
aws s3api put-object --bucket fraudetection --key output/
aws s3api put-object --bucket fraudetection --key model/
aws s3api put-object --bucket fraudetection --key scripts/
aws s3api put-object --bucket fraudetection --key logs/
# Upload files
aws s3 cp model/model.pkl s3://fraudetection/model/
aws s3 cp scripts/ s3://fraudetection/scripts/ --recursive
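The same folder layout could also be created from Python. The sketch below accepts any object exposing `put_object(Bucket=..., Key=...)`, which is the signature of the boto3 S3 client method. To stay runnable without AWS credentials, it uses a small recording stand-in (`RecordingClient`, a hypothetical helper) instead of a real client.

```python
PREFIXES = ["input/", "processed/", "output/", "model/", "scripts/", "logs/"]


def create_layout(s3_client, bucket: str, prefixes=PREFIXES) -> list:
    """Create zero-byte "folder" marker objects, one per prefix."""
    created = []
    for prefix in prefixes:
        s3_client.put_object(Bucket=bucket, Key=prefix)
        created.append(f"s3://{bucket}/{prefix}")
    return created


class RecordingClient:
    """Stand-in for an S3 client that only records calls (no network)."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
uris = create_layout(client, "fraudetection")
print(uris)
```

With boto3 installed and credentials configured, passing `boto3.client("s3")` in place of `RecordingClient()` should issue the same `put-object` requests as the CLI commands.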
aws emr create-cluster \
--name "FraudDetectionCluster" \
--release-label emr-6.9.0 \
--applications Name=Spark Name=Hadoop \
--ec2-attributes KeyName=your-key \
--instance-type m5.xlarge --instance-count 3 \
--use-default-roles \
--log-uri s3://fraudetection/logs/
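For teams that prefer to launch the cluster from code, the following sketch builds a request dictionary that roughly mirrors the CLI call, using the parameter names of boto3's EMR `run_job_flow`. The helper name `emr_cluster_request` is hypothetical, and the role names are the AWS defaults that `--use-default-roles` refers to. Nothing is sent to AWS here.

```python
def emr_cluster_request(key_name: str, log_uri: str) -> dict:
    """Build keyword arguments for an EMR run_job_flow call (not executed)."""
    return {
        "Name": "FraudDetectionCluster",
        "ReleaseLabel": "emr-6.9.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,  # 1 primary + 2 core nodes
            "Ec2KeyName": key_name,
            # Keep the cluster running after steps finish, like the CLI default
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": log_uri,
    }


request = emr_cluster_request("your-key", "s3://fraudetection/logs/")
print(request["Name"], request["Instances"]["InstanceCount"])
```

With boto3 available, the request could then be submitted as `boto3.client("emr").run_job_flow(**request)`, which returns the new cluster's `JobFlowId`.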
# 1. Preprocessing (Spark ETL)
spark-submit --deploy-mode cluster \
s3://fraudetection/scripts/preprocessing_pipeline.py \
--input s3://fraudetection/input/ \
--output s3://fraudetection/processed/
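# 2. Fraud Detection + Reporting
# python3 cannot run a script directly from an S3 URI, so download it first
aws s3 cp s3://fraudetection/scripts/fraud_simulation.py .
python3 fraud_simulation.py \
--model s3://fraudetection/model/model.pkl \
--input s3://fraudetection/processed/ \
--output s3://fraudetection/output/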
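The `spark-submit` command above assumes a shell on the cluster's primary node. As an alternative, the preprocessing job could be submitted as an EMR step. The sketch below builds such a step definition using EMR's `command-runner.jar`. The helper name `spark_step` and the `CONTINUE` failure policy are assumptions for illustration.

```python
def spark_step(script_uri: str, input_uri: str, output_uri: str) -> dict:
    """Build an EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": "Preprocessing (Spark ETL)",
        "ActionOnFailure": "CONTINUE",  # keep the cluster up if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_uri,
                "--input", input_uri,
                "--output", output_uri,
            ],
        },
    }


step = spark_step(
    "s3://fraudetection/scripts/preprocessing_pipeline.py",
    "s3://fraudetection/input/",
    "s3://fraudetection/processed/",
)
print(" ".join(step["HadoopJarStep"]["Args"]))
```

With boto3 available, the step could be submitted with `boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])`, where the cluster ID is the one returned when the cluster was created.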
# Terminate cluster
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXX
# Remove S3 bucket
aws s3 rm s3://fraudetection --recursive
aws s3 rb s3://fraudetection
Contributions are welcome! Feel free to open issues, suggest improvements, or submit pull requests.
This project is licensed under the MIT License.