The Tagging & Vibe Classification Engine is a computer vision system that detects fashion items in videos and matches them against a product database. The system uses YOLO for object detection and CLIP for semantic matching.
- Video Processing: Extract fashion items from video frames
- Duplicate Detection: Remove similar items using color histogram analysis
- Product Matching: Match detected items with product database using CLIP embeddings
- Batch Processing: Handle multiple videos and large product catalogs
tag_engine/
├── model.py # Main video processing script
├── matching.py # Product matching with CLIP
├── bad_url_check.py # URL validation utility
├── src/
│ └── similar.py # Image similarity detection
├── data/
│ ├── instagram_reels/ # Input videos
│ ├── shopify_data/ # Product database
│ └── datasets/ # Training datasets
├── weights/
│ ├── epoch_3/ # Trained YOLO model
│ └── epoch_10/ # Alternative model
├── video_crops/ # Extracted fashion items
├── results/ # Matching results
└── notebooks/ # Jupyter notebooks
- Python 3.8 or higher
- CUDA-compatible GPU (recommended)
- 8GB+ RAM
- Clone the repository:
  git clone https://github.com/yourusername/flickd.git
  cd flickd
- Install dependencies:
  pip install -r requirements.txt
- Download model weights:
  # Ensure weights/epoch_3/best.pt exists
Process videos to extract fashion items:
python model.py
Configuration (a minimal sketch of the extraction loop follows the list):
- Model: `weights/epoch_3/best.pt`
- Confidence threshold: 0.6
- Frame skip: 5 frames
- Output: timestamped directory in `video_crops/`
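The sketch below is a rough illustration of that loop, assuming the configuration above; the input file name and crop directory are placeholders, and the real `model.py` may organize crops and metadata differently.

```python
# Minimal sketch of the frame-skip extraction loop (illustrative; model.py may differ).
import cv2
from pathlib import Path
from ultralytics import YOLO

model = YOLO("weights/epoch_3/best.pt")       # trained fashion detector
conf_threshold = 0.6                          # detection confidence threshold
frame_skip = 5                                # process every 5th frame
out_dir = Path("video_crops/crops_example")   # placeholder; real runs use a timestamped name
out_dir.mkdir(parents=True, exist_ok=True)

cap = cv2.VideoCapture("data/instagram_reels/sample.mp4")  # placeholder input file
frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % frame_skip == 0:
        results = model(frame, conf=conf_threshold, verbose=False)
        # Save one crop per detected box on this frame
        for i, box in enumerate(results[0].boxes.xyxy.cpu().numpy().astype(int)):
            x1, y1, x2, y2 = box
            cv2.imwrite(str(out_dir / f"frame{frame_idx}_det{i}.jpg"), frame[y1:y2, x1:x2])
    frame_idx += 1
cap.release()
```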
Match extracted items against the product database:
python matching.py
Features (a minimal matching sketch follows the list):
- Uses the CLIP model for semantic matching
- Cosine similarity threshold: 0.85
- Outputs a CSV with matches and scores
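Conceptually, the matcher encodes crops and product images with CLIP and keeps the best pairs above the threshold. The sketch below assumes product images are already on disk in a hypothetical `local_image_path` column; the real `matching.py` may fetch `image_url` directly and name its output with a timestamp.

```python
# Minimal sketch of CLIP-based product matching (illustrative; matching.py may differ).
from pathlib import Path
import pandas as pd
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")
threshold = 0.85

crop_paths = sorted(Path("video_crops/crops_example").glob("*.jpg"))  # placeholder crop dir
crop_embs = model.encode([Image.open(p) for p in crop_paths], convert_to_tensor=True)

products = pd.read_csv("data/shopify_data/products.csv")              # placeholder file name
# Assumes images were already downloaded to a hypothetical local_image_path column.
product_embs = model.encode(
    [Image.open(p) for p in products["local_image_path"]], convert_to_tensor=True
)

sims = util.cos_sim(crop_embs, product_embs)  # crops x products similarity matrix
rows = []
for i, crop in enumerate(crop_paths):
    best = int(sims[i].argmax())
    score = float(sims[i][best])
    if score >= threshold:
        rows.append({"crop": crop.name, "product_id": products.iloc[best]["id"], "score": score})

pd.DataFrame(rows).to_csv("results/matched_results_example.csv", index=False)
```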
Check product image URLs:
python bad_url_check.py
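A minimal version of such a check can be written with requests; the CSV file name below is an assumption, and failing rows are written to `bad_urls.csv` (see the output files listed further down).

```python
# Minimal sketch of a product image URL checker (illustrative; bad_url_check.py may differ).
import pandas as pd
import requests

products = pd.read_csv("data/shopify_data/products.csv")  # placeholder file name
bad_rows = []

for _, row in products.iterrows():
    url = row["image_url"]
    try:
        resp = requests.head(url, timeout=5, allow_redirects=True)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        bad_rows.append({"id": row["id"], "image_url": url})

pd.DataFrame(bad_rows).to_csv("bad_urls.csv", index=False)
```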
The training dataset is organized as follows:
data/datasets/
├── train/
│ ├── images/ # Training images
│ └── labels/ # YOLO format labels
├── valid/
│ ├── images/ # Validation images
│ └── labels/ # YOLO format labels
├── test/
│ ├── images/ # Test images
│ └── labels/ # YOLO format labels
└── data.yaml # Dataset configuration
The `data.yaml` file contains:
train: ../train/images
val: ../valid/images
test: ../test/images
nc: 12 # Number of classes
names: ['Casual_Jeans', 'Casual_Sneakers', 'Casual_Top', 'Corporate_Gown',
'Corporate_Shoe', 'Corporate_Skirt', 'Corporate_Top', 'Corporate_Trouser',
'Native', 'Shorts', 'Suits', 'Tie']
# Train YOLO model on custom dataset
yolo train model=yolov8n.pt data=data/datasets/data.yaml epochs=100 imgsz=640
# Train with custom parameters
yolo train \
model=yolov8m.pt \
data=data/datasets/data.yaml \
epochs=100 \
imgsz=640 \
batch=16 \
device=0 \
patience=50 \
save_period=10
| Parameter | Description | Default | Recommended |
|---|---|---|---|
| `model` | Base model architecture | yolov8n.pt | yolov8m.pt |
| `data` | Dataset configuration | - | data/datasets/data.yaml |
| `epochs` | Number of training epochs | 100 | 100-300 |
| `imgsz` | Input image size | 640 | 640 |
| `batch` | Batch size | 16 | 16-32 |
| `device` | Training device | auto | 0 (GPU) |
| `patience` | Early stopping patience | 50 | 50 |
| `save_period` | Save checkpoint every N epochs | -1 | 10 |
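If you prefer scripting over the CLI, the same parameters can be passed through the ultralytics Python API; a brief equivalent sketch:

```python
# Training via the ultralytics Python API (equivalent to the CLI call above).
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # base model architecture
model.train(
    data="data/datasets/data.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,
    patience=50,
    save_period=10,
)
```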
After training, you'll find:
runs/detect/train/
├── weights/
│ ├── best.pt # Best model (highest mAP)
│ └── last.pt # Last checkpoint
├── results.png # Training metrics
├── confusion_matrix.png # Confusion matrix
└── args.yaml # Training configuration
- `best.pt`: use for inference (highest validation mAP)
- `last.pt`: use for resuming training or fine-tuning
# Continue training from existing weights
yolo train \
model=weights/epoch_3/best.pt \
data=data/datasets/data.yaml \
epochs=50 \
imgsz=640
# Validate trained model
yolo val model=weights/epoch_3/best.pt data=data/datasets/data.yaml
During training, monitor:
- mAP@0.5: Mean Average Precision at IoU=0.5
- mAP@0.5:0.95: Mean Average Precision averaged across IoU thresholds from 0.5 to 0.95
- Precision: Precision on validation set
- Recall: Recall on validation set
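These numbers can also be read programmatically after a validation run; the short sketch below uses the ultralytics Python API (attribute names follow recent ultralytics releases and may change):

```python
# Sketch: read validation metrics programmatically via the ultralytics Python API.
from ultralytics import YOLO

model = YOLO("weights/epoch_3/best.pt")
metrics = model.val(data="data/datasets/data.yaml")

print(f"mAP@0.5:      {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95: {metrics.box.map:.3f}")
print(f"precision:    {metrics.box.mp:.3f}")
print(f"recall:       {metrics.box.mr:.3f}")
```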
Video processing (model.py):
- YOLO Detection: Fashion item detection using the trained model
- Frame Sampling: Process every 5th frame for efficiency
- Duplicate Removal: Color histogram-based similarity detection
- Crop Extraction: Save unique fashion items with metadata
Product matching (matching.py):
- CLIP Embeddings: Generate semantic representations
- Cosine Similarity: Compare extracted items with the product database
- Batch Processing: Handle large product catalogs efficiently
- Result Export: Save matches to CSV with confidence scores
Similarity detection (src/similar.py):
- Color Histogram: Compare image color distributions
- Normalization: Standardized comparison across images
- Configurable Threshold: Adjustable similarity sensitivity (a sketch follows below)
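The duplicate check can be approximated with OpenCV histogram comparison; the sketch below is illustrative only, and the actual `src/similar.py` may use a different color space, bin counts, or threshold.

```python
# Sketch: color-histogram similarity between two crops (illustrative of src/similar.py).
import cv2

def histogram_similarity(path_a: str, path_b: str) -> float:
    """Return the correlation between normalized HSV color histograms of two images."""
    hists = []
    for path in (path_a, path_b):
        img = cv2.imread(path)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)  # standardize so image size does not matter
        hists.append(hist)
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

# Treat crops as duplicates when similarity exceeds a configurable threshold.
SIMILARITY_THRESHOLD = 0.9  # assumed value; tune as needed
if histogram_similarity("crop_a.jpg", "crop_b.jpg") > SIMILARITY_THRESHOLD:
    print("likely duplicate")
```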
# model.py
model = YOLO("weights/epoch_3/best.pt")       # trained fashion detector
conf_threshold = 0.6                          # minimum detection confidence
frame_skip = 5                                # process every 5th frame
# matching.py
threshold = 0.85                              # cosine similarity cutoff for a match
model = SentenceTransformer("clip-ViT-B-32")  # CLIP encoder from sentence-transformers
ultralytics>=8.0.0
opencv-python>=4.8.0
Pillow>=9.0.0
torch>=2.0.0
torchvision>=0.15.0
pandas>=1.5.0
numpy>=1.21.0
sentence-transformers>=2.0.0
scikit-learn>=1.0.0
tqdm>=4.60.0
requests>=2.25.0
- Input videos: `data/instagram_reels/` (MP4, AVI, MOV), analyzed frame by frame
- Product database: `data/shopify_data/`, a CSV with `id` and `image_url` columns, matched via CLIP embeddings
- Crops: `video_crops/crops_YYYY-MM-DD_HH-MM-SS/`
- Matches: `results/matched_results_YYYYMMDD_HHMMSS.csv`
- Logs: `failed_urls.txt`, `bad_urls.csv`
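Because matching depends on the product CSV exposing the right columns, it can help to validate that contract up front; a brief sketch (the CSV file name is an assumption):

```python
# Sketch: validate the product CSV against the expected columns.
import pandas as pd

products = pd.read_csv("data/shopify_data/products.csv")  # placeholder file name
missing = {"id", "image_url"} - set(products.columns)
if missing:
    raise ValueError(f"product CSV is missing required columns: {missing}")
print(f"{len(products)} products loaded")
```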
- Video Processing: ~76ms per frame (640x384)
- Frame Skip: 5 frames for efficiency
- Duplicate Detection: Color histogram comparison
- CLIP Model: clip-ViT-B-32 for semantic matching
- Similarity Threshold: 0.85 for confident matches
- Batch Processing: Efficient handling of large datasets
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add comprehensive docstrings
- Include error handling
- Update documentation as needed
This project is licensed under the MIT License.
For questions or issues:
- Issues: GitHub Issues
- Email: [email protected]
Tagging & Vibe Classification Engine - Fashion Detection & Matching