
Conversation

@BrewTestBot (Contributor)

Created with `brew bump-formula-pr`.

Release notes:

# 🚀 Shimmy v1.7.0: The MoE Revolution is Here!

💥 BREAKTHROUGH: Run 42B+ Models on Consumer Hardware

Shimmy v1.7.0 unleashes the MoE (Mixture of Experts) CPU Offloading Revolution, enabling massive expert models to run on everyday GPUs with up to 99.9% VRAM reduction.


🔥 What's New & Game-Changing

⚡ MoE CPU Offloading Technology

Transform impossible into possible:

  • --cpu-moe: Automatically offload MoE layers to CPU
  • --n-cpu-moe N: Fine-tune performance with precise layer control (flag semantics sketched below)
  • Massive Memory Savings: 15GB models → 4GB VRAM usage
  • Enterprise Ready: Deploy 42B parameter models on 8GB consumer cards
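
To make the flag semantics concrete, here is a minimal sketch of the placement decision the two flags imply. This is an illustration only, not Shimmy's actual implementation; the `MoeOffloadPolicy` type and the layer numbering are hypothetical:

```rust
/// Hypothetical illustration of the CLI flag semantics (not Shimmy's real code):
/// `--cpu-moe` keeps every MoE layer's expert tensors in host RAM, while
/// `--n-cpu-moe N` offloads only the first N MoE layers and leaves the rest on GPU.
#[derive(Debug, Clone, Copy)]
enum MoeOffloadPolicy {
    AllOnCpu,      // --cpu-moe
    FirstN(usize), // --n-cpu-moe N
}

/// Returns true if the expert tensors of `layer` should live in host memory.
fn offload_to_cpu(policy: MoeOffloadPolicy, layer: usize) -> bool {
    match policy {
        MoeOffloadPolicy::AllOnCpu => true,
        MoeOffloadPolicy::FirstN(n) => layer < n,
    }
}

fn main() {
    // e.g. a 32-layer model served with `--n-cpu-moe 8`:
    let policy = MoeOffloadPolicy::FirstN(8);
    for layer in [0, 7, 8, 31] {
        println!("layer {layer}: experts on CPU = {}", offload_to_cpu(policy, layer));
    }
}
```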

📊 Real Performance Gains (Validated)

  • GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB actual measurement)
  • Phi-3.5-MoE 42B: Runs on consumer hardware for the first time
  • DeepSeek 16B: Intelligent CPU-GPU hybrid execution
  • Smart Tradeoffs: Accept 2-7x slower inference for 10-100x memory savings
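
For reference, the GPT-OSS figure follows directly from the quoted measurements: 1 − 4.3/15 ≈ 0.713, i.e. roughly a 71% reduction, in line with the reported 71.5% against the unrounded baseline.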

🛠️ Technical Excellence

  • First-Class Rust: Enhanced llama.cpp bindings with MoE support
  • Cross-Platform: Windows MSVC CUDA, macOS ARM64 Metal, Linux x86_64/ARM64
  • Production Tested: 295/295 tests passing, comprehensive validation pipeline
  • Still Tiny: Sub-5MB binary maintains legendary efficiency

🎯 Use Cases Unlocked

🏢 Enterprise Deployment

  • Cost Revolution: Run large models without GPU farm investments
  • Scalable AI: Deploy expert models on existing infrastructure
  • Flexible Performance: Balance speed vs. memory for any workload
  • On-Premises Ready: Keep sensitive data in-house with minimal hardware

🔬 Research & Development

  • Democratized Access: Test large models on developer laptops
  • Rapid Iteration: Prototype MoE architectures efficiently
  • Educational Power: Advanced AI models accessible to everyone
  • Hybrid Intelligence: Combine CPU and GPU resources intelligently

🚀 Quick Start Your MoE Journey

Installation Options

# Install from crates.io (LIVE NOW!)
cargo install shimmy

# Or grab platform binaries below ⬇️

🤖 Ready-to-Use MoE Models

Curated collection on HuggingFace, optimized for CPU offloading:

🥇 Recommended Starting Points

# Download and run Phi-3.5-MoE 42B (Q4_K_M) - best balance of quality and performance
# (--local-dir . places the GGUF in the current directory, where the serve command expects it)
huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf --local-dir .
./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf

# Or DeepSeek-MoE 16B (Q4_K_M) - a faster alternative
huggingface-cli download MikeKuykendall/deepseek-moe-16b-q4-k-m-cpu-offload-gguf --local-dir .
./shimmy serve --cpu-moe --model-path deepseek-moe-16b-q4-k-m.gguf

📊 Complete Model Collection

| Model | Size | Quantization | VRAM | Use Case | Download |
| --- | --- | --- | --- | --- | --- |
| Phi-3.5-MoE | 42B | Q8_0 | ~4GB | 🏆 Maximum Quality | phi-3.5-moe-q8-0-cpu-offload-gguf |
| Phi-3.5-MoE | 42B | Q4_K_M | ~2.5GB | Recommended | phi-3.5-moe-q4-k-m-cpu-offload-gguf |
| Phi-3.5-MoE | 42B | Q2_K | ~1.5GB | 🚀 Ultra Fast | phi-3.5-moe-q2-k-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q8_0 | ~2GB | 🎯 High Precision | deepseek-moe-16b-q8-0-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q4_K_M | ~1.2GB | Budget Pick | deepseek-moe-16b-q4-k-m-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q2_K | ~800MB | 💨 Lightning Fast | deepseek-moe-16b-q2-k-cpu-offload-gguf |
| GPT-OSS | 20B | Various | ~3GB | 🔬 Research/Testing | gpt-oss-20b-moe-cpu-offload-gguf |

🎯 Model Selection Guide

  • 🥇 First Time? → Phi-3.5-MoE Q4_K_M (best balance)
  • 💪 High-End GPU (8GB+)? → Phi-3.5-MoE Q8_0 (maximum quality)
  • 💻 Limited VRAM (4GB)? → DeepSeek-MoE Q4_K_M (budget friendly)
  • ⚡ Speed Critical? → DeepSeek-MoE Q2_K (blazing fast)
  • 🔬 Research/Validation? → GPT-OSS 20B (proven baseline; this guide is restated in the sketch below)
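
The same guide, restated as a tiny helper. This is illustrative only; the `recommend` function and its thresholds are hypothetical, taken from the VRAM column of the table above:

```rust
/// Hypothetical helper encoding the selection guide above; adjust for your hardware.
fn recommend(vram_gb: f32, speed_critical: bool) -> &'static str {
    if speed_critical {
        "DeepSeek-MoE 16B Q2_K (~800MB)" // 💨 lightning fast
    } else if vram_gb >= 8.0 {
        "Phi-3.5-MoE 42B Q8_0 (~4GB)" // 🏆 maximum quality
    } else if vram_gb >= 4.0 {
        "Phi-3.5-MoE 42B Q4_K_M (~2.5GB)" // recommended default
    } else {
        "DeepSeek-MoE 16B Q4_K_M (~1.2GB)" // budget pick
    }
}

fn main() {
    // A 6GB card, throughput not critical → the recommended default.
    println!("{}", recommend(6.0, false));
}
```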

⚡ Launch Commands

# Enable MoE CPU offloading magic
./shimmy serve --cpu-moe --port 11435 --model-path your-model.gguf

# Fine-tune performance for your hardware
./shimmy serve --n-cpu-moe 8 --port 11435 --model-path your-model.gguf

# Standard OpenAI-compatible API - zero changes to your code!
curl -X POST http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "prompt": "Explain quantum computing in simple terms"}'

📦 Cross-Platform Binaries

Choose your platform and start the revolution:

| Platform | Binary | Features |
| --- | --- | --- |
| 🐧 Linux x86_64 | shimmy-linux-x86_64 | SafeTensors + llama.cpp + MoE |
| 🦾 Linux ARM64 | shimmy-linux-arm64 | Native ARM64 + full MoE support |
| 🪟 Windows x86_64 | shimmy-windows-x86_64.exe | CUDA GPU + MoE offloading |
| 🍎 macOS Intel | shimmy-macos-intel | SafeTensors + Apple MLX |
| 🚀 macOS Apple Silicon | shimmy-macos-arm64 | Metal GPU + MLX + MoE power |

All binaries ship with zero Python dependencies and native SafeTensors support.


🌟 Why This Changes Everything

Before Shimmy v1.7.0: "I need a $10,000 GPU to run expert models"

After Shimmy v1.7.0: "I'm running 42B models on my gaming laptop"

This isn't just an update; it's sustainable AI democratization. Organizations can now:

  • ✅ Deploy cutting-edge models without infrastructure overhaul
  • ✅ Experiment with state-of-the-art architectures on existing hardware
  • ✅ Scale AI capabilities based on actual needs, not hardware limits
  • ✅ Maintain complete data sovereignty with on-premises deployment

📈 Validated & Transparent

  • Multi-Model Testing: 3 models validated across all platforms
  • Real Baselines: Controlled A/B testing with actual measurements
  • Production Quality: Comprehensive release gate system
  • Open Development: Technical validation report available

🤝 Join the Revolution


Ready to revolutionize your AI deployment? The future of efficient model serving is here. Download Shimmy v1.7.0 and experience the MoE revolution! 🚀

View the full release notes at https://github.com/Michael-A-Kuykendall/shimmy/releases/tag/v1.7.0.


github-actions bot added the rust and bump-formula-pr labels on Oct 9, 2025.

github-actions bot commented on Oct 9, 2025:

🤖 An automated task has requested bottles to be published to this PR.

Caution

Please do not push to this PR branch before the bottle commits have been pushed, as this results in a state that is difficult to recover from. If you need to resolve a merge conflict, please use a merge commit. Do not force-push to this PR branch.

github-actions bot added the CI-published-bottle-commits label (the commits for the built bottles have been pushed to the PR branch) on Oct 9, 2025.
BrewTestBot enabled auto-merge on October 9, 2025 09:52 and added this pull request to the merge queue.
Merged via the queue into main with commit 5f0cebc on Oct 9, 2025; 22 checks passed.
BrewTestBot deleted the bump-shimmy-1.7.0 branch on October 9, 2025 10:01.