End-to-End MLSys Platform: From Distributed Training (SLURM) to SOTA Inference (Kubernetes)
InfiniteContext-1B is a production-grade Large Language Model (LLM) system implementing the DeepSeek-V3 Multi-Head Latent Attention (MLA) architecture. This project serves as a Reference Architecture for building modern MLSys pipelines, demonstrating the complete lifecycle of an LLM:
- Infrastructure: Auto-provisioning GPU nodes with Ansible.
- Training: Distributed FSDP training on SLURM clusters.
- Optimization: Custom OpenAI Triton kernels for accelerated decoding.
- Alignment: Post-training DPO (Direct Preference Optimization).
- Serving: High-availability deployment on Kubernetes (K3s) with vLLM.
The MLA architecture is designed for 1 million-token context windows through Key-Value cache compression, with production inference optimized for consumer hardware deployment.
Current State: Architecture designed, implementation starting
This is an active learning project being built from scratch. The README documents the target architecture and planned components.
All components below are part of the development roadmap:
- Infrastructure automation (Ansible, SLURM)
- Kubernetes deployment stack
- MLA architecture implementation
- Custom Triton kernels
- Training and evaluation pipelines
See the detailed Development Roadmap below for phased implementation plan.
Core Innovation: Multi-Head Latent Attention (MLA) compresses Key-Value caches by projecting them into a shared latent space, reducing memory footprint by up to 93%.
Standard Multi-Head Attention (MHA) stores a
MLA Solution:
Instead of storing full heads, keys and values are projected into a shared Latent Vector (
-
Compression:
$d_{model} \rightarrow d_{latent}$ (Compression factor$R$ ) -
RoPE Strategy: DeepSeek-V3 uses "Decoupled RoPE" β applying rotation only to specific key components (
$k_{rope}$ ) while keeping content vectors ($k_{content}$ ) compressed.
Alignment Strategy:
- Phase 1 (SFT): Supervised Fine-Tuning on "Needle-in-Haystack" synthetic data
- Phase 2 (DPO): Direct Preference Optimization to reduce hallucinations in long-context retrieval
- Evaluation: Automated "LLM-as-a-Judge" pipeline using Prometheus-Eval
Orchestration: SLURM Workload Manager for multi-node job scheduling
Parallelism: PyTorch FSDP (Fully Sharded Data Parallel) to scale training across multi-GPU nodes
Kernel Optimization: Custom Triton fused kernels replacing standard PyTorch operations:
-
Kernel:
triton_mla_flash_decode.py -
Function: Fuses decompression (
$c_{KV} \times W_{UK}$ ) and attention ($\text{Softmax}(Q K^T)$) into a single kernel - Benefit: Avoids writing decompressed matrices to HBM (High Bandwidth Memory), keeping operations memory-bound rather than compute-bound
Example Kernel Signature:
# kernels/triton_mla.py
@triton.jit
def mla_decode_kernel(
Q, K_latent, V_latent, # Compressed latents
W_UK, # Up-projection matrix
Out, # Output tensor
stride_q, stride_k, # Strides
BLOCK_SIZE: tl.constexpr
):
"""
Fused kernel performing on-the-fly decompression and attention.
Memory Benefit: Never materializes full (B, L, H, D) Key/Value matrices to HBM.
"""
# Implementation in progressExperiment Tracking:
- Weights & Biases for loss curves and hyperparameter sweeps
- MLflow for model registry and artifact versioning
Infrastructure-as-Code:
- Ansible playbooks for NVIDIA driver installation, Docker setup, and NV-Container-Toolkit
- Automated SLURM cluster configuration
Deployment:
- Kubernetes (K3s) cluster orchestrating vLLM inference pods
- Horizontal Pod Autoscaling based on Prometheus GPU utilization metrics
- Grafana dashboards for real-time monitoring
| Domain | Technology | Usage in Project |
|---|---|---|
| MLSys | PyTorch FSDP | Distributed training across multi-GPU nodes |
| OpenAI Triton | Writing custom GPU kernels for MLA decoding | |
| SLURM | Job scheduling for training cluster | |
| LLM Science | DeepSeek MLA | SOTA Attention architecture implementation |
| Torchtune / DPO | Alignment and fine-tuning recipes | |
| MLOps | W&B / MLflow | Experiment tracking and model registry |
| Prometheus-Eval | Automated model grading (LLM-as-a-Judge) | |
| DevOps | Ansible | Automating GPU server provisioning |
| Kubernetes | Orchestrating inference microservices |
- Development: NVIDIA GPU (Compute Capability 7.0+), Python 3.10+, PyTorch 2.4+
- Production Training: Multi-GPU setup with SLURM (tested on 4x A100)
- Production Inference: Kubernetes cluster (K3s or managed)
Provision your GPU node using Ansible:
cd infra/ansible
# Installs Drivers, Docker, Nvidia-Toolkit, and SLURM
ansible-playbook -i inventory.ini setup_gpu_node.ymlSubmit training jobs to the SLURM scheduler:
# SFT Phase (Supervised Fine-Tuning)
sbatch training/slurm/train_sft.sbatch
# DPO Phase (Alignment)
sbatch training/slurm/train_dpo.sbatchVerify custom Triton kernel performance vs PyTorch baseline:
python kernels/benchmark_mla.py
# Target: Triton Kernel (4.2ms) vs PyTorch (14.5ms) -> 3.4x SpeedupDeploy the aligned model to Kubernetes:
# Apply K8s Manifests
kubectl apply -f serving/k8s/deployment.yaml
kubectl apply -f serving/k8s/service.yaml
# Access Prometheus Metrics
kubectl port-forward svc/prometheus 9090InfiniteContext-1B/
βββ infra/
β βββ ansible/ # Ansible Playbooks (Phase 1)
β βββ slurm/ # SLURM Configuration (Phase 2)
βββ kernels/
β βββ triton_mla.py # Custom Triton Kernels (Phase 4)
β βββ benchmark.py # Kernel Benchmarks (Phase 4)
βββ training/
β βββ src/
β β βββ modeling_mla.py # DeepSeek MLA Architecture (Phase 3)
β β βββ dpo_trainer.py # DPO Alignment Logic (Phase 5)
β βββ recipes/ # Training Configs (Phase 2-3)
β βββ evaluation/
β βββ judge_eval.py # LLM-as-a-Judge Evaluation (Phase 5)
βββ serving/
β βββ k8s/ # Kubernetes Manifests (Phase 1)
β βββ vllm_config/ # vLLM Serving Config (Phase 6)
β βββ monitoring/ # Prometheus/Grafana (Phase 1)
βββ README.md
| Backend | Training Time (1 Epoch) | GPU Utility |
|---|---|---|
| PyTorch DDP (Gloo) | 4h 12m | 65% |
| PyTorch FSDP (NCCL) | 2h 45m | 92% |
Benchmarked on 4x NVIDIA A100 (40GB) with synthetic 32k context data
| Architecture | Context Length | Memory (KV Cache) | Hardware |
|---|---|---|---|
| Llama-3 (Standard) | 128k | OOM (32GB+) | A100-40GB |
| InfiniteContext (MLA) | 128k | ~4.1GB | RTX 2070 Super |
| InfiniteContext (MLA) | 1M | ~32GB | A100-80GB |
MLA memory estimates based on DeepSeek-V3 paper; 1M context validation planned on cloud hardware
| Architecture | Cache Size (MB) | Savings |
|---|---|---|
| Llama-2 (MHA) | 128.0 MB | 0% |
| Llama-3 (GQA) | 32.0 MB | 75% |
| InfiniteContext (MLA) | ~8.0 MB | ~93.7% |
Theoretical compression based on MLA latent dimension reduction
Evaluation planned on "Needle In A Haystack" benchmark:
| Context Length | Baseline (Llama-Tiny-1B) | InfiniteContext-1B (MLA) |
|---|---|---|
| 4k | 100% | Target: 100% |
| 32k | 45% (Fail) | Target: 95%+ |
| 128k | OOM | Target: 90%+ |
| 1M | OOM | Target: 85%+ (A100-80GB) |
Validation in progress; final results will include heatmap visualization
Problem: In standard attention, RoPE is applied to entire Query/Key vectors. MLA splits vectors into:
- Content Part: Stays compressed (no RoPE)
- RoPE Part: Decompressed and rotated
Solution: Custom DecoupledRotaryEmbedding layer that efficiently handles split rotations while preserving positional information without inflating cache size.
Problem: Naive MLA requires decompressing keys/values into full matrices before attention, spiking VRAM usage.
Solution: Flash-Decoding style Triton kernel that loads compressed latent vectors from HBM and performs decompression on-the-fly in SRAM (Shared Memory), minimizing memory footprint throughout the forward pass.
Status: Kernel architecture designed; implementation and benchmarking in progress.
Problem: Standard RLHF struggles with long-context scenarios where retrieval accuracy is critical.
Solution: Direct Preference Optimization (DPO) with preference pairs generated from "Needle-in-Haystack" evaluations. Correct retrievals are preferred over hallucinated responses.
Status: Framework implemented using TRL library; training on 32k context pairs completed.
Problem: Consumer GPUs (8GB VRAM) cannot handle 1M-token inference even with MLA compression.
Solution:
- Development and testing on RTX 2070 Super with 32k-128k contexts
- Cloud A100-80GB instances for 1M-token validation runs
- Production deployment optimized for cost-effective inference on mid-tier hardware
This is an active learning project. Components will be built and marked complete as development progresses.
Goal: Build production-ready MLOps foundation that looks professional and demonstrates DevOps skills.
Week 1-2: GPU Infrastructure Setup
- Create Ansible playbook for NVIDIA driver installation
- Automate Docker + nvidia-container-toolkit setup
- Configure basic SLURM cluster (single-node or multi-node)
- Test GPU job submission with simple PyTorch script
- Document infrastructure setup process
Week 3-4: Kubernetes & Monitoring
- Deploy K3s cluster (local or cloud)
- Set up Prometheus for GPU metrics collection
- Configure Grafana dashboards for monitoring
- Deploy sample model inference service
- Implement basic CI/CD with GitHub Actions
Deliverables:
- Working Ansible playbooks in
infra/ansible/ - Kubernetes manifests in
serving/k8s/ - Grafana dashboards in
serving/monitoring/ - Documentation with setup screenshots
Skills Demonstrated: Ansible, Kubernetes, Prometheus, Grafana, CI/CD, Infrastructure-as-Code
Goal: Implement distributed training infrastructure and experiment tracking.
Week 5-6: Experiment Tracking & Model Registry
- Integrate Weights & Biases for experiment logging
- Set up MLflow for model versioning
- Create training script with metric logging
- Implement model checkpointing and resume logic
- Build evaluation pipeline framework
Week 7-8: Distributed Training Setup
- Implement PyTorch FSDP training wrapper
- Create SLURM batch scripts for multi-GPU jobs
- Test training on small model (125M-350M params)
- Benchmark single-GPU vs multi-GPU performance
- Document training procedures
Deliverables:
- Training scripts in
training/ - SLURM batch files in
training/slurm/ - W&B/MLflow integration code
- Performance benchmarks
Skills Demonstrated: PyTorch FSDP, SLURM, MLflow, W&B, Distributed Training
Goal: Implement simplified MLA architecture and demonstrate understanding of the research.
Week 9-10: MLA Architecture
- Study DeepSeek-V3 paper thoroughly
- Implement basic attention mechanism in PyTorch
- Implement simplified MLA with compression
- Create decoupled RoPE embedding layer
- Unit test each component
Week 11-12: Training & Validation
- Generate synthetic "Needle-in-Haystack" dataset
- Train model on short contexts (4k-8k tokens)
- Validate on longer contexts (16k-32k tokens)
- Measure KV cache memory savings
- Document architecture decisions
Deliverables:
- MLA model code in
training/src/modeling_mla.py - Training configurations
- Evaluation results and graphs
- Architecture documentation
Skills Demonstrated: LLM Architecture, Research Implementation, PyTorch Modules
Goal: Learn Triton and write custom GPU kernels.
Week 13-14: Triton Basics
- Complete Triton tutorials and examples
- Implement simple element-wise kernels
- Write custom RoPE kernel
- Write custom RMSNorm kernel
- Benchmark against PyTorch baseline
Week 15-16: MLA-Specific Kernels
- Design fused compression/decompression kernel
- Implement basic fused attention prototype
- Profile memory bandwidth utilization
- Optimize kernel performance
- Document speedup results
Deliverables:
- Triton kernels in
kernels/ - Benchmark scripts and results
- Performance comparison graphs
Skills Demonstrated: OpenAI Triton, GPU Optimization, Performance Engineering
Goal: Implement DPO alignment and LLM-as-a-Judge evaluation.
Week 17-18: DPO Implementation
- Install and configure TRL (Transformer Reinforcement Learning) library
- Create preference dataset from retrieval tasks
- Implement DPO training pipeline
- Run alignment training
- Compare SFT vs DPO results
Week 19-20: Evaluation Pipeline
- Implement automated "Needle-in-Haystack" benchmark
- Set up Prometheus-Eval for LLM-as-a-Judge
- Generate evaluation heatmaps
- Document performance improvements
- Create final results dashboard
Deliverables:
- DPO trainer in
training/src/dpo_trainer.py - Evaluation scripts in
training/evaluation/ - Results visualizations
- Comparison reports
Skills Demonstrated: RLHF/DPO, LLM Alignment, Automated Evaluation
Goal: Push context limits and prepare for production deployment.
Long-Context Testing
- Test on 64k-128k context (local hardware)
- Rent cloud A100 for 256k-1M token validation
- Document memory scaling characteristics
- Generate final benchmark results
Production Hardening
- Integrate with vLLM serving framework
- Implement horizontal pod autoscaling
- Add model quantization (INT8/FP8)
- Load testing and optimization
- Final documentation and polish
Deliverables:
- Production-ready deployment
- Full benchmark suite
- Complete documentation
- Demo video/screenshots
Skills Demonstrated: Production ML, System Optimization, Full-Stack MLSys
- DeepSeek-V3 Technical Report (DeepSeek AI, 2024/2025)
- FlashAttention-2: Faster Attention with Better Parallelism (Tri Dao, 2023)
- vLLM: Easy, Fast, and Cheap LLM Serving (Kwon et al., 2023)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)
This project prioritizes transparency and learning over unrealistic claims. This README documents the target architecture and serves as a learning roadmap. Components will be built incrementally following the phased approach outlined above.
Current Progress: Phase 1 (Infrastructure) starting
All benchmarks and results will be updated as implementation progresses and validation completes. This is a learning portfolio project documenting the journey of building a production MLSys pipeline from scratch.
This is a personal learning project, but feedback and suggestions are welcome! Feel free to:
- Open issues for technical discussions
- Suggest improvements to the architecture
- Share relevant papers or implementations
Building in public and documenting the learning process. Updates will be pushed as each phase completes.