[Roadmap] vLLM Roadmap Q3 2024

Update:
* Please see #6801 for major items in the performance sprint.
* Please see #8779 for major items in a new architecture aimed at simplicity and performance.
* We are in the feedback-gathering phase for the Q4 roadmap!

---
This document includes the features in vLLM's roadmap for Q3 2024. Please feel free to discuss and contribute, as this roadmap is shaped by the vLLM community.

### Themes

As before, we categorized our roadmap into 6 broad themes:

* **Broad model support**: vLLM should support a wide range of transformer-based models and be kept as up to date as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
* **Excellent hardware coverage**: vLLM should run on a wide range of accelerators for production AI workloads. This includes GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM gets the best performance out of each chip.
* **Performance optimization**: vLLM should be kept up to date with the latest performance optimization techniques. Users of vLLM can trust its performance to be competitive and strong.
* **Production level engine**: vLLM should be the go-to choice for a production-level serving engine, with a suite of features bridging the gap from a single forward pass to 24/7 service.
* **Strong OSS product**: vLLM is and will remain a true community project. We want it to be a healthy project with a regular release cadence, good documentation, and a growing pool of reviewers for the codebase.
* **Extensible architectures**: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.


### Broad Model Support
- [x] Support Large Models (Arctic, Nemotron4, Llama3 400B+ when released)
  - [x] Via Pipeline Parallelism #4412 
  - [x] Via FP8
- [x] New Attention Mechanism (Jamba, Phi3-Small, etc)
- [x] Encoder Decoder (#4837, #4888, #4942)
- [x] Multi-Modal #4194

Help wanted:
- [ ] Whisper and the audio API
- [ ] Arbitrary HF model
- [x] Chameleon (#5770)
- [ ] Multi token prediction
- [ ] Reward model API
- [ ] Embedding Model Expansion (Bert, XLMRoberta) (#5447)

### Hardware Support
- [ ] A feature matrix for all the hardware that vLLM supports, with the maturity level of each
- [x] Enhanced performance benchmarks across hardware
- [ ] Expanded feature support across hardware
  - [x] PagedAttention and Chunked Prefill on Inferentia
  - [ ] Chunked Prefill on Intel CPU/GPU
  - [ ] PagedAttention on Intel Gaudi
  - [x] TP and INT8 on TPU
  - [x] Bug fixes and GEMM tuning on AMD GPUs


### Performance Optimizations
- [x] Spec Decode Optimization ([tracker](https://docs.google.com/document/d/1ZwLLOhDsGq1IaLzqI2h-h2ZpKpF6P64p1CaOczJTidg/edit#heading=h.5tcmnevdkelo))
- [x] APC Optimizations
- [ ] Guided Decode Optimizations
- [x] API server performance
- [x] Quantization
  - [x] FP8/INT8 quantization improvements
  - [x] Quantized MoEs
  - [x] AWQ Performance
  - [ ] Fused GEMM/all-reduce
- [x] Scheduler overhead removal
- [x] Optimize prepare input, sampling, process output

### Production Features
- [x] Chunked Prefill on by default
- [ ] APC on by default
- [ ] N-gram prompt lookup spec decode on by default
- [x] Tool use
- [x] Request prioritization framework

Help wanted:
- [ ] Support multiple models in the same server
- [ ] [Feedback wanted] Disaggregated prefill: please tell us about your use case and the scenarios in which you would prefer it over chunked prefill.

### OSS Community
- [x] Reproducible performance benchmark on realistic workload
- [x] CI enhancements
- [x] Release process: minimize breaking changes and include deprecations

Help wanted:
- [ ] Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc.)

### Extensible Architecture
- [ ] KV cache transfer #5557 
- [x] Distributed execution #5775
- [x] Improvements to scheduler and memory manager supporting new attention mechanisms
- [ ] Performance enhancement for multi-modal processing

-----
If an item you want is not on the roadmap, your suggestions and contributions are still welcome! Please feel free to comment in this thread, open a feature request, or create an RFC.
