🛡️ End-to-End Fraud Detection System on AWS

📖 1. Business Context & Project Goal

In today’s digital economy, financial fraud is a growing threat, costing businesses and consumers billions annually. Proactively detecting and preventing fraudulent transactions is essential for customer trust and financial stability.

The Goal: Build and deploy a scalable, end-to-end fraud detection pipeline on AWS that:

Ingests raw transaction data
Processes it at scale using Apache Spark
Predicts fraud with a machine learning model
Generates actionable reports with Gemini Gen AI

This repository is structured as a real-world case study, simulating the role of a Data Engineer tasked with delivering a production-ready fraud detection system.

🏛️ 2. Technical Solution: A Cloud-Native ML Pipeline

To handle the high volume and velocity of financial data, the pipeline is fully cloud-native and relies on AWS services for scalability, automation, and efficiency.

⚙️ Architecture Overview

Data Lake & Storage (AWS S3):
Centralized data lake with folders for raw data (input/), processed data (processed/), reports (output/), the trained model (model/), pipeline scripts (scripts/), and cluster logs (logs/).
Large-Scale Processing (AWS EMR + Spark):
Distributed ETL via preprocessing_pipeline.py for cleaning & feature engineering.
Fraud Prediction (ML Inference):
fraud_simulation.py loads a Random Forest Classifier (model.pkl) to detect suspicious transactions.
Automated Reporting (Gemini Gen AI):
Converts prediction outputs into human-readable fraud reports for risk teams.
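
The reporting stage can be pictured as a prompt-building step followed by a call to Gemini. The sketch below covers only the first half and uses only the standard library. The function name `build_report_prompt`, the record fields (`transaction_id`, `amount`, `category`, `fraud_probability`), the 0.5 threshold, and the prompt wording are all illustrative assumptions, not taken from fraud_simulation.py. The Gemini API call itself is left out.

```python
from typing import Iterable, Mapping


def build_report_prompt(predictions: Iterable[Mapping], threshold: float = 0.5) -> str:
    """Summarize flagged transactions as a text prompt for a Gen AI model.

    Field names and the threshold are hypothetical; adapt them to the
    real prediction output.
    """
    flagged = [p for p in predictions if p["fraud_probability"] >= threshold]
    # Highest-risk transactions first so the report leads with them.
    flagged.sort(key=lambda p: p["fraud_probability"], reverse=True)

    lines = [
        "You are assisting a fraud risk team.",
        f"{len(flagged)} transaction(s) were flagged as suspicious.",
        "Write a short, plain-language report covering each one:",
    ]
    for p in flagged:
        lines.append(
            f"- id={p['transaction_id']} amount={p['amount']:.2f} "
            f"category={p['category']} score={p['fraud_probability']:.2f}"
        )
    return "\n".join(lines)
```

Keeping prompt construction separate from the API call makes the report input easy to inspect and test without network access.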

🧠 3. Model Insights & Key Findings

The Random Forest Classifier was chosen for its robustness and its ability to handle imbalanced data.

Key Metrics:

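🎯 Precision: The share of flagged transactions that are actually fraudulent; high precision keeps flagged cases reliable
🔍 Recall: The share of fraudulent transactions that the model catches; high recall captures the majority of fraud
⚖️ Balance: Minimizes false positives while maximizing detection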
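
For reference, precision and recall both come from the confusion counts of the positive (fraud) class. The snippet below is a minimal standard-library sketch of that arithmetic. The helper name `precision_recall` is hypothetical, and the toy labels are made up for illustration; they are not results from this project.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (precision, recall) for the positive class (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only; not results from this project.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.75
```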

Top Predictive Features:

Transaction Amount
Transaction Category (e.g., shopping, travel)
Time of Day
Customer Age
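
The features listed above could be derived from a raw transaction record roughly as follows. This is a sketch under stated assumptions: the input field names (`amount`, `category`, `timestamp`, `date_of_birth`), the ISO date formats, and the function name `extract_features` are hypothetical and may not match the columns used in preprocessing_pipeline.py.

```python
from datetime import date, datetime
from typing import Mapping


def extract_features(txn: Mapping[str, str]) -> dict:
    """Map one raw transaction record to the four features named above."""
    ts = datetime.fromisoformat(txn["timestamp"])
    dob = date.fromisoformat(txn["date_of_birth"])
    # Age in whole years at the time of the transaction.
    had_birthday = (ts.month, ts.day) >= (dob.month, dob.day)
    age = ts.year - dob.year - (0 if had_birthday else 1)
    return {
        "amount": float(txn["amount"]),
        "category": txn["category"].strip().lower(),
        "hour_of_day": ts.hour,
        "customer_age": age,
    }


# Hypothetical record for illustration.
print(extract_features({
    "amount": "249.99",
    "category": " Travel ",
    "timestamp": "2024-03-15T23:41:00",
    "date_of_birth": "1990-06-01",
}))
```

In the Spark job the same logic would be expressed as column transformations rather than a per-record function; the plain-Python form is used here only to keep the example self-contained.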

☁️ 4. AWS Deployment & Execution Guide

✅ Step 1: IAM & S3 Setup

# Create bucket
aws s3 mb s3://fraudetection

# Folder structure
aws s3api put-object --bucket fraudetection --key input/
aws s3api put-object --bucket fraudetection --key processed/
aws s3api put-object --bucket fraudetection --key output/
aws s3api put-object --bucket fraudetection --key model/
aws s3api put-object --bucket fraudetection --key scripts/
aws s3api put-object --bucket fraudetection --key logs/

# Upload files
aws s3 cp model/model.pkl s3://fraudetection/model/
aws s3 cp scripts/ s3://fraudetection/scripts/ --recursive
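
Scripts that read from or write to this layout can build their paths from one place instead of hard-coding them. The helper below is a small sketch of that idea; the names `PREFIXES` and `s3_uri` are hypothetical, and the prefixes simply mirror the folders created above.

```python
# Folder prefixes created in Step 1.
PREFIXES = ("input", "processed", "output", "model", "scripts", "logs")


def s3_uri(bucket: str, prefix: str, name: str = "") -> str:
    """Build an S3 URI inside one of the known top-level folders."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return f"s3://{bucket}/{prefix}/{name}"


print(s3_uri("fraudetection", "model", "model.pkl"))  # s3://fraudetection/model/model.pkl
print(s3_uri("fraudetection", "input"))               # s3://fraudetection/input/
```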

📊 Step 2: Launch EMR Cluster

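# --use-default-roles needs the default EMR roles; create them once if they do not exist yet
aws emr create-default-roles

# InstanceProfile is omitted from --ec2-attributes because it conflicts with --use-default-roles
aws emr create-cluster \
  --name "FraudDetectionCluster" \
  --release-label emr-6.9.0 \
  --applications Name=Spark Name=Hadoop \
  --ec2-attributes KeyName=your-key \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --log-uri s3://fraudetection/logs/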

🔄 Step 3: Run Pipeline Jobs

# 1. Preprocessing (Spark ETL); run from the cluster's primary node
spark-submit --deploy-mode cluster \
  s3://fraudetection/scripts/preprocessing_pipeline.py \
  --input s3://fraudetection/input/ \
  --output s3://fraudetection/processed/

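# 2. Fraud Detection + Reporting
# python3 cannot execute a script directly from an S3 URI, so download it first
aws s3 cp s3://fraudetection/scripts/fraud_simulation.py .
python3 fraud_simulation.py \
  --model s3://fraudetection/model/model.pkl \
  --input s3://fraudetection/processed/ \
  --output s3://fraudetection/output/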
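
Both jobs receive their S3 locations as command-line flags. The following is a minimal sketch of how a script such as fraud_simulation.py might parse and validate them with the standard library. It is not the author's code: the helper `split_s3_uri`, the function `parse_args`, and the up-front validation are assumptions made for illustration.

```python
import argparse
from typing import List, Optional, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the --model/--input/--output flags used in Step 3."""
    parser = argparse.ArgumentParser(description="Fraud detection job (sketch)")
    parser.add_argument("--model", required=True, help="S3 URI of the pickled model")
    parser.add_argument("--input", required=True, help="S3 prefix with processed data")
    parser.add_argument("--output", required=True, help="S3 prefix for reports")
    args = parser.parse_args(argv)
    # Fail early on malformed locations instead of midway through the job.
    for value in (args.model, args.input, args.output):
        split_s3_uri(value)
    return args


args = parse_args([
    "--model", "s3://fraudetection/model/model.pkl",
    "--input", "s3://fraudetection/processed/",
    "--output", "s3://fraudetection/output/",
])
print(split_s3_uri(args.model))  # ('fraudetection', 'model/model.pkl')
```

Splitting the URI into bucket and key up front matches what most S3 client calls expect as separate arguments.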

🧹 Step 4: Cleanup

# Terminate cluster
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXX

# Remove S3 bucket
aws s3 rm s3://fraudetection --recursive
aws s3 rb s3://fraudetection

🤝 Contributing

Contributions are welcome! Feel free to open issues, suggest improvements, or submit pull requests.

📜 License

This project is licensed under the MIT License.

yashvisharma1204/financial_fraud_detection

Folders and files

Latest commit

History

Repository files navigation

🛡️ End-to-End Fraud Detection System on AWS

📖 1. Business Context & Project Goal

🏛️ 2. Technical Solution: A Cloud-Native ML Pipeline

⚙️ Architecture Overview

🧠 3. Model Insights & Key Findings

☁️ 4. AWS Deployment & Execution Guide

✅ Step 1: IAM & S3 Setup

📊 Step 2: Launch EMR Cluster

🔄 Step 3: Run Pipeline Jobs

🧹 Step 4: Cleanup

🤝 Contributing

📜 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages