GitHub - codingwithshawnyt/FluxInfer: FluxInfer is an architectural prototype and simulation engine. It is designed to help systems engineers and researchers model the behavior of large-scale inference clusters. It demonstrates how optimization techniques (like PagedAttention and Speculative Decoding) impact latency and throughput using a Rust-based discrete event simulation.

███████╗██╗     ██╗   ██╗██╗  ██╗██╗███╗   ██╗███████╗███████╗██████╗ 
██╔════╝██║     ██║   ██║╚██╗██╔╝██║████╗  ██║██╔════╝██╔════╝██╔══██╗
█████╗  ██║     ██║   ██║ ╚███╔╝ ██║██╔██╗ ██║█████╗  █████╗  ██████╔╝
██╔══╝  ██║     ██║   ██║ ██╔██╗ ██║██║╚██╗██║██╔══╝  ██╔══╝  ██╔══██╗
██║     ███████╗╚██████╔╝██╔╝ ██╗██║██║ ╚████║██║     ███████╗██║  ██║
╚═╝     ╚══════╝ ╚═════╝ ╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝╚═╝     ╚══════╝╚═╝  ╚═╝

A High-Fidelity Simulator for Multimodal LLM Infrastructure

Documentation | Benchmarks | Paper | Discord

⚠️ Project Scope & Transparency

FluxInfer is an architectural prototype and simulation engine.

It is designed to help systems engineers and researchers model the behavior of large-scale inference clusters. It demonstrates how optimization techniques (like PagedAttention and Speculative Decoding) impact latency and throughput using a Rust-based discrete event simulation.

What this project IS:

✅ A "Digital Twin" for modeling AI infrastructure costs and performance.
✅ A reference architecture for binding Python agents to high-performance Rust backends.
✅ A demonstration of modern LLM optimization concepts (MoE Routing, Quantization effects).

What this project is NOT:

❌ A functioning CUDA/GPU inference kernel (it does not execute .safetensors models).
❌ A replacement for vLLM or TGI in production environments.

🌌 The Genesis Mission: Capacity Planning for AI

As AI models grow (70B+ parameters), predicting infrastructure costs is a $71B problem. Companies blindly deploy H100 clusters without understanding how different optimization compositions affect their specific workloads.

FluxInfer solves this by providing a rigorous simulation environment. It allows you to:

Construct Optimization Graphs: Mix and match techniques (e.g., "What if I use Int4 Quantization with FlashAttention-v3?").
Simulate Workloads: Run Monte Carlo simulations of agent swarms to predict fragmentation and latency.
Follow an Architectural Blueprint: Use the codebase as a clean, idiomatic reference for building Rust/Python AI tools.
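To make the "Simulate Workloads" idea concrete, here is a minimal Monte Carlo sketch in plain Python. It is not FluxInfer's implementation: the function names, the exponential output-length distribution, and the default `ttft_ms` / `tpot_ms` values are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_request_latency_ms(rng, ttft_ms=8.5, tpot_ms=1.5, mean_output_tokens=200):
    """Sample one request: TTFT plus per-token decode time for a random output length."""
    # Assumption: output lengths follow an exponential distribution (not measured data).
    output_tokens = max(1, int(rng.expovariate(1.0 / mean_output_tokens)))
    return ttft_ms + tpot_ms * output_tokens

def monte_carlo_latency(num_trials=10_000, seed=0):
    """Run many independent trials and summarize the latency distribution."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    samples = sorted(simulate_request_latency_ms(rng) for _ in range(num_trials))
    return {
        "mean_ms": statistics.fmean(samples),
        "p50_ms": samples[int(0.50 * (num_trials - 1))],
        "p95_ms": samples[int(0.95 * (num_trials - 1))],
    }

if __name__ == "__main__":
    print(monte_carlo_latency())
```

Reporting percentiles rather than only the mean is the point of sampling: tail latency is usually what drives capacity decisions.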

🚀 Key Features (Simulated)

1. Composable Optimization Modeling

The engine models the theoretical speedups of combining various techniques, helping developers understand the "Optimization Frontier."

Graph Compiler: Verifies compatibility between techniques (e.g., ensuring Int4 AWQ is compatible with the selected attention kernel).
Performance Projection: Uses mathematical models to estimate TTFT (Time To First Token) and TPOT (Time Per Output Token).
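The two ideas above can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not the engine's actual model: the compatibility table, the `Hardware` class, and the round-number hardware figures are hypothetical. The estimates assume prefill is compute-bound (about 2 FLOPs per parameter per prompt token) and batch-size-1 decode is bandwidth-bound (every weight read once per output token).

```python
from dataclasses import dataclass

# Hypothetical compatibility table: which quantization modes each attention
# kernel model accepts. Illustrative entries only, not FluxInfer's rules.
SUPPORTED_QUANT = {
    "flash_attention": {"fp16", "int8", "int4_awq"},
    "reference_attention": {"fp16"},
}

def check_compatible(attention_kernel: str, quant_mode: str) -> bool:
    """Return True if the (kernel, quantization) pair is listed as supported."""
    return quant_mode in SUPPORTED_QUANT.get(attention_kernel, set())

@dataclass(frozen=True)
class Hardware:
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s

def estimate_ttft_ms(params: float, prompt_tokens: int, hw: Hardware) -> float:
    # Prefill: ~2 FLOPs per parameter per token, assumed compute-bound.
    return 2.0 * params * prompt_tokens / hw.peak_flops * 1e3

def estimate_tpot_ms(params: float, bytes_per_param: float, hw: Hardware) -> float:
    # Decode at batch size 1: all weights streamed once per token, assumed bandwidth-bound.
    return params * bytes_per_param / hw.mem_bandwidth * 1e3

if __name__ == "__main__":
    # Hypothetical accelerator: 1 PFLOP/s compute, 3.5 TB/s memory bandwidth.
    hw = Hardware(peak_flops=1e15, mem_bandwidth=3.5e12)
    if check_compatible("flash_attention", "int4_awq"):
        print(f"TTFT ~ {estimate_ttft_ms(70e9, 1000, hw):.0f} ms")
        print(f"TPOT ~ {estimate_tpot_ms(70e9, 0.5, hw):.0f} ms")
```

Under these assumptions, quantization shortens TPOT by shrinking the bytes read per token, while TTFT depends mainly on compute.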

2. Memory Fragmentation Simulator

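Traditional KV-cache allocators, which reserve one contiguous region per request, can waste 60-80% of the memory set aside for the cache through fragmentation and over-reservation. FluxInfer implements a logical model of PagedAttention to demonstrate how non-contiguous, block-based allocation reduces fragmentation rates in a simulated heap.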
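A small Python sketch shows why block-based allocation helps. It is a simplified accounting model rather than FluxInfer's Rust heap simulation, and the function names, sequence lengths, and block size are assumptions: the contiguous scheme reserves `max_seq_len` slots for every request up front, while the paged scheme hands out fixed-size blocks on demand.

```python
import math

def contiguous_waste(seq_lens, max_seq_len):
    """Fraction of reserved KV slots left unused when each request reserves max_seq_len."""
    reserved = len(seq_lens) * max_seq_len
    return 1.0 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size):
    """Fraction unused when slots are allocated in fixed-size blocks as sequences grow."""
    # Only the tail of each request's last block can be wasted.
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1.0 - sum(seq_lens) / reserved

if __name__ == "__main__":
    seq_lens = [100, 500, 30, 2048]  # hypothetical final sequence lengths
    print(f"contiguous: {contiguous_waste(seq_lens, max_seq_len=2048):.1%}")
    print(f"paged:      {paged_waste(seq_lens, block_size=16):.1%}")
```

For this made-up workload the contiguous scheme leaves about 67% of its reservation unused, while the paged scheme leaves about 1%, because waste is bounded by one partially filled block per request.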

3. Adaptive MoE Routing Logic

Includes the routing logic of a Complexity-Aware Gating Network. It does not run a neural network, but the routing algorithms (hash-based and load-balanced) are implemented to show how requests would be distributed across experts.
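The two routing strategies named above can be sketched as follows. This is an illustrative Python version, not the project's code: `hash_route`, `LoadBalancedRouter`, and the use of a per-request `cost` (for example, a complexity score) are assumptions.

```python
import hashlib

def hash_route(request_id: str, num_experts: int) -> int:
    """Hash-based routing: the same request ID always maps to the same expert."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_experts

class LoadBalancedRouter:
    """Load-balanced routing: send each request to the currently least-loaded expert."""

    def __init__(self, num_experts: int):
        self.load = [0.0] * num_experts

    def route(self, cost: float = 1.0) -> int:
        # Ties resolve to the lowest expert index.
        expert = min(range(len(self.load)), key=self.load.__getitem__)
        self.load[expert] += cost  # cost could be a complexity score
        return expert

if __name__ == "__main__":
    print(hash_route("req-42", num_experts=8))
    router = LoadBalancedRouter(num_experts=4)
    print([router.route(cost=c) for c in (0.9, 0.1, 0.1, 0.5, 0.2)])
```

Hash routing is stateless and keeps a given request ID on the same expert; the load-balanced variant trades that stickiness for a more even distribution of work.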

📊 Simulation Results

Based on theoretical throughput modeling of Llama-3-70B on H100 hardware.

| Metric | Baseline (Modeled) | FluxInfer (Simulated O3) | Projected Gain |
| --- | --- | --- | --- |
| Time To First Token (TTFT) | 45.0 ms | 8.5 ms | ⚡ 5.2x |
| Generation Throughput | 85 tok/s | 650 tok/s | 🚀 7.6x |
| VRAM Footprint | 140 GB | 38 GB | 📉 3.6x |
| Cost per 1M Tokens | $2.50 | $0.35 | 💰 7.1x |
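The Projected Gain column is the ratio of the two modeled columns. The short Python check below recomputes it from the table values; the dictionary keys and helper names are made up for this example. Rounded to one decimal, the ratios land within 0.1x of the figures in the table. It also shows that the 140 GB baseline matches 70B parameters stored at 2 bytes each, while 4-bit weights alone would take 35 GB (the source does not break down the remaining 3 GB).

```python
baseline = {"ttft_ms": 45.0, "throughput_tok_s": 85.0, "vram_gb": 140.0, "cost_per_1m_usd": 2.50}
simulated = {"ttft_ms": 8.5, "throughput_tok_s": 650.0, "vram_gb": 38.0, "cost_per_1m_usd": 0.35}

def projected_gain(metric: str) -> float:
    b, s = baseline[metric], simulated[metric]
    # Throughput improves when it rises; the other metrics improve when they fall.
    return s / b if metric == "throughput_tok_s" else b / s

def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes."""
    return params * bytes_per_param / 1e9

if __name__ == "__main__":
    for metric in baseline:
        print(f"{metric}: {projected_gain(metric):.1f}x")
    print(f"fp16 weights: {weight_footprint_gb(70e9, 2.0):.0f} GB")
    print(f"int4 weights: {weight_footprint_gb(70e9, 0.5):.0f} GB")
```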

🛠️ Architecture

FluxInfer employs a hybrid Rust/Python architecture. The control plane (Python) handles high-level routing, while the simulation core (Rust) performs the discrete event modeling.

graph TD
    subgraph "Application Layer (Python)"
        User[Capacity Planner] -->|Config| API[FluxInfer API]
        API -->|Workload Def| Router[MoE Router Logic]
    end

    subgraph "Simulation Core (Rust)"
        Router -->|Events| Sim[Discrete Event Simulator]
        Sim -->|Model| Mem[PagedAttention Model]
        
        subgraph "Virtual Optimization Graph"
            Op1[FlashAttn Model]
            Op2[Quantization Model]
            Op3[Speculative Decoding Model]
        end
        
        Mem --> Op1
        Op1 --> Op2
        Op2 --> Op3
    end
    
    Op3 -->|Telemetry| Dashboard[Results Dashboard]
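
For readers unfamiliar with discrete event simulation, the core loop is a priority queue of timestamped events. The sketch below is written in Python for brevity (the project's simulator is in Rust) and models a single FIFO server with a fixed service time. The `Event` class, `run_simulation`, and the queueing model are assumptions for illustration, not the project's design.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    time_ms: float
    seq: int                               # tie-breaker for events at the same time
    kind: str = field(compare=False)       # "arrive" or "finish"
    request_id: int = field(compare=False)

def run_simulation(arrivals_ms, service_ms):
    """Single-server FIFO queue; returns {request_id: latency_ms}."""
    queue = []
    for i, t in enumerate(arrivals_ms):
        heapq.heappush(queue, Event(t, i, "arrive", i))
    seq = len(arrivals_ms)
    server_free_at = 0.0
    latency = {}
    while queue:
        ev = heapq.heappop(queue)  # always the earliest pending event
        if ev.kind == "arrive":
            # Service starts when both the request and the server are ready.
            start = max(ev.time_ms, server_free_at)
            server_free_at = start + service_ms
            heapq.heappush(queue, Event(server_free_at, seq, "finish", ev.request_id))
            seq += 1
        else:
            latency[ev.request_id] = ev.time_ms - arrivals_ms[ev.request_id]
    return latency

if __name__ == "__main__":
    print(run_simulation(arrivals_ms=[0.0, 1.0, 2.0], service_ms=10.0))
```

Because the simulator jumps from event to event instead of ticking a clock, long idle periods cost nothing to simulate.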

💻 Quick Start

Installation

# Clone the repository
git clone https://github.com/codingwithshawnyt/FluxInfer.git
cd FluxInfer

# Install dependencies and build the Rust simulator
pip install -r requirements.txt
maturin develop

Running a Capacity Simulation

from flux_infer import FluxPipeline, InferenceConfig, OptimizationLevel, QuantizationMode

# 1. Define your target architecture
config = InferenceConfig(
    batch_size=64,
    optimization_level=OptimizationLevel.O3,  # Simulate aggressive optimization
    quantization_mode=QuantizationMode.Int4,  # Simulate 4-bit precision
    use_flash_attention=True
)

# 2. Initialize the Simulator
pipeline = FluxPipeline("Llama-3-70b-Sim", config)
pipeline.compile()

# 3. Run Workload Simulation
# "generate" here calculates the *projected* metrics for this prompt
response = pipeline.generate(
    prompt="Design a microservice architecture for a fintech app.",
    complexity_score=0.9
)

print(f"Projected Latency: {response['metrics']['latency_ms']} ms")
print(f"Projected Throughput: {response['metrics']['throughput_tokens_per_sec']} tok/s")

🤝 Contributing

We welcome contributions! Since this is a simulation framework, we are especially interested in:

Better mathematical models for GPU performance (e.g., Roofline models).
More accurate memory fragmentation logic in Rust.
Support for modeling new hardware (e.g., Blackwell, TPU v5).

See CONTRIBUTING.md for details.

📄 License

Licensed under the Apache 2.0 License. See LICENSE.

Built with ❤️ and 🦀 Rust. A conceptual prototype for the future of AI Infrastructure.

