Releases: vllm-project/vllm-spyre
v0.9.4
This release:
- Fixes a bug where the incorrect attention algorithm was used for static batching with fp8-quantized models
- Fixes a bug where invalid --num-gpu-blocks-override values could crash the server
- Supports specific model revisions in the unit test suite
What's Changed
- fix: block_size to be multiple of max_batch_size by @wallashss in #454
- fix: static batching with FP8 by @wallashss in #457
- ⚗️ Support model revision in tests by @joerunde in #456
Full Changelog: v0.9.3...v0.9.4
v0.9.3
This release fixes a bug where a unit test failed on Spyre hardware due to a misconfiguration.
What's Changed
- 🎨 make available_blocks as 18 in scheduler test by @prashantgupta24 in #453
- [cb] scheduler heuristic 2: unblock long prompts by @yannicks1 in #440
Full Changelog: v0.9.2...v0.9.3
v0.9.2
This release:
- Updates tests to check token probabilities against transformers (instead of logprobs) for better human interpretability
- Adds the VLLM_SPYRE_GLOO_TIMEOUT_MINUTES config to work around long compilation timeouts with staggered compilation
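The switch from comparing raw logprobs to comparing token probabilities makes test tolerances easier for humans to read: probabilities live in [0, 1], so an absolute tolerance has an obvious meaning, while logprob differences blow up for unlikely tokens. A hypothetical sketch of such a comparison (the real test utilities in #447 differ):

```python
import math


def probs_close(logprob_a: float, logprob_b: float, abs_tolerance: float = 0.05) -> bool:
    """Compare two tokens' probabilities (not logprobs) within an absolute
    tolerance. Hypothetical helper illustrating the idea from the release note,
    not vllm-spyre's actual test code."""
    return abs(math.exp(logprob_a) - math.exp(logprob_b)) <= abs_tolerance
```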
What's Changed
- CI: Show isort diff by @ckadner in #449
- 🔥 Remove vllm 0.9.2 support by @joerunde in #448
- chore: Use Python 3.11 by default for type-check by @ckadner in #450
- [cb] scheduler heuristic balancing prefill/decode prioritization by @yannicks1 in #433
- ♻️ Temporary fix for VLLM_SPYRE_GLOO_TIMEOUT_MINUTES by @prashantgupta24 in #452
- ♻️ use token probability with abs_tolerance instead of comparing logprobs by @prashantgupta24 in #447
Full Changelog: v0.9.1...v0.9.2
v0.9.1
This release:
- Updates the default vllm install to 0.10.1.1
- Fixes a bug where vllm 0.10.1.1 did not work
- Fixes a bug where FP8 did not work with continuous batching
What's Changed
- 🐛 fix image by @joerunde in #442
- [CB] optimization: cache volumetric constraint in scheduler by @yannicks1 in #418
- feat: FP8 initial support on continuous batching by @wallashss in #402
- 🐛 Fix SB max-model-len override by @prashantgupta24 in #436
- ⬆️ bump default vllm version to 0.10.1.1 by @joerunde in #446
Full Changelog: v0.9.0...v0.9.1
v0.9.0
This release:
- Adds support for reranker models
- Adds support for vllm 0.10.1
- Adds extra debug options for tensor parallel operation
- Fixes a bug where VLLM_SPYRE_MAX_LOAD_PROCESSES did not work properly
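The VLLM_SPYRE_MAX_LOAD_PROCESSES fix (#444) changed the env var to be parsed as an int instead of a bool; parsing it as a bool would turn any non-empty string, including "4", into True and lose the actual process limit. A hypothetical sketch of the corrected handling (the default value and function name are placeholders, not the project's exact code):

```python
import os


def max_load_processes(default: int = 0) -> int:
    """Read VLLM_SPYRE_MAX_LOAD_PROCESSES as an integer process limit.

    Hypothetical sketch: the earlier bug treated the variable as a bool,
    so "4" became True instead of the limit 4. Parsing as int keeps the count.
    """
    raw = os.environ.get("VLLM_SPYRE_MAX_LOAD_PROCESSES")
    if not raw:
        return default  # placeholder default; the real project default may differ
    return int(raw)
```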
What's Changed
- [GHA] 🐛 fix: Save HF models cache for all jobs by @yannicks1 in #400
- [GHA] 🎨 refactor test yaml by @yannicks1 in #401
- 🔥 remove FLEX_OVERWRITE_NMB_FRAME by @prashantgupta24 in #408
- [test] 🎨 fix test description string by @yannicks1 in #416
- [cb][test] fix scheduler constraint and add tests for batch x tkv limit by @yannicks1 in #417
- ⚡ Cache LLMs during tests by @joerunde in #396
- [CB][Tests] Reduce number of steps in scheduler steps tests by @sducouedic in #409
- 🎨 reword logs for loading model weights by @prashantgupta24 in #397
- 🔥 trim local envs not required anymore by @prashantgupta24 in #399
- 🎨 make hf_cache.json prettier by @joerunde in #422
- ⬆️ bump base image by @joerunde in #427
- ♻️ [tests] Full model testing by @prashantgupta24 in #428
- 🎨 add info about DT_DEEPRT_VERBOSE by @prashantgupta24 in #430
- 🐛 fixup compilation wrapper by @joerunde in #431
- 🔨 Add debug log redirection option by @joerunde in #429
- [doc] 👨‍🎨 Adding drawings explaining optimizations by @yannicks1 in #426
- [cb][test] add tests for volumetric constraint with prefill optimization by @yannicks1 in #425
- 🐛 solve undetected merge conflict with main by @yannicks1 in #432
- Add reranker support by @maxdebayser in #403
- 🎨 print relative tolerance diff in tests by @prashantgupta24 in #438
- Bump vllm to v0.10.1 and add compatibility code by @maxdebayser in #443
- fix VLLM_SPYRE_MAX_LOAD_PROCESSES to int instead of bool by @jberkhahn in #444
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's Changed
- 🎨 improve log statement by @prashantgupta24 in #395
- [GHA] Triggering test against vLLM:main for ready labels added by bot by @yannicks1 in #398
- [tests] load only the needed models from hf cache by @yannicks1 in #393
- Add VLLM_SPYRE_MAX_LOAD_PROCESSES to limit number of processes that … by @jberkhahn in #357
- 🐛 COMPILATION_MODE conditionally by @prashantgupta24 in #404
New Contributors
- @jberkhahn made their first contribution in #357
Full Changelog: v0.7.3...v0.8.0
v0.7.3
What's Changed
- 🎨 make max_num_seqs 4 for online test by @prashantgupta24 in #394
Full Changelog: v0.7.2...v0.7.3
v0.7.2
Mostly testing changes, plus the ability to skip tests that the compiler does not support.
What's Changed
- [Docs] Add q3 roadmap by @rafvasq in #382
- Remove block_size from arguments on LLM constructor. by @yannicks1 in #383
- [CB] fix scheduler assert prints by @yannicks1 in #387
- 🐛 fix a bug in tests, add DISABLE_ASSERTS by @prashantgupta24 in #375
- [Tests] Limit long-context test to 16k by @rafvasq in #389
- [GHA] skip tests against vLLM:main if ready label not assigned yet by @yannicks1 in #388
- 🔥 remove long context test with bad config by @joerunde in #391
- feat: removed 32k from test_swap_decode_programs_for_cb by @wallashss in #390
- ✅ Compiler unsupported flag for tests by @prashantgupta24 in #392
Full Changelog: v0.7.1...v0.7.2
v0.7.1
This release:
- 🐛 Fixes support for TP 4 for the full ibm-granite/granite-3.3-8b-instruct-cb model
- 🎉 Allows sequences to join a batch at any time
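With that fix, serving the model across four cards follows the standard vLLM CLI shape. A rough illustration (Spyre deployments may need additional environment variables or flags beyond this):

```shell
# Serve the granite model with tensor parallelism across 4 devices.
# Standard vLLM CLI form; Spyre-specific setup is not shown here.
vllm serve ibm-granite/granite-3.3-8b-instruct-cb --tensor-parallel-size 4
```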
What's Changed
- 🎨 fix warning logs for _cast_bf16_to_f16 by @prashantgupta24 in #372
- [CB] Optimization: allowing sequences to join a batch anytime by @yannicks1 in #340
- Add optional option --max_tokens by @kkvtran in #377
- Add compat code for changing Pooler function signature by @maxdebayser in #374
- 🎨 fully parametrize the online script by @yannicks1 in #378
- [CB] Support batch size 1 for decode, simplify warmup by @yannicks1 in #312
- feat: tests to check swapping of decode program by @wallashss in #370
- [Tests] Add long context batch tests by @rafvasq in #365
- Document support for POWER architecture in README by @RajalakshmiSR in #366
- [Compat]: Fix renamed NewRequestData argument by @maxdebayser in #380
- ⚡ cache hf results in tests by @joerunde in #373
- [high prio][CB] 🐛 fix warmup by @yannicks1 in #384
- [CB] add warning when exceeding 32K context length by @yannicks1 in #385
- 🎨 fix spacing in log msg by @prashantgupta24 in #386
New Contributors
- @kkvtran made their first contribution in #377
- @RajalakshmiSR made their first contribution in #366
Full Changelog: v0.7.0...v0.7.1
v0.7.0
This release:
- 🎉 Supports FP8 quantized models on cpu!
- 🚧 Adds scheduler constraints and config for future long-context support with continuous batching
- 📌 Sets an upper bound on the vllm dependency, so users no longer install future vllm versions that were untested at release time
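The upper-bound pin amounts to a dependency constraint of roughly this shape in the package metadata (the version numbers below are illustrative placeholders, not the actual bounds vllm-spyre uses):

```toml
# pyproject.toml (illustrative fragment)
[project]
dependencies = [
    # Upper bound keeps untested future vllm releases from being installed.
    "vllm>=0.9.2,<0.10.0",  # placeholder bounds, not the real pin
]
```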
What's Changed
- ♻️ fix vllm:main - replace EngineCoreRequest with Request by @prashantgupta24 in #354
- [Tests][FP8]: Add fp8 test by @rafvasq in #350
- feat: removed triton from dependencies by @wallashss in #353
- [docs] remove pooling models from supported features by @yannicks1 in #358
- [docs][CB] remove warning that no output correctness is asserted for scheduler step tests by @yannicks1 in #360
- [embedding] support newest vllm main branch by @yannicks1 in #361
- [CB] hard code number of spyre blocks to 2080 by @yannicks1 in #362
- ♻️ use fp8 model for testing SB + CB by @prashantgupta24 in #359
- [cb][tests] 🐛 fix bug in test utils, please merge ASAP by @yannicks1 in #367
- 📌 pin vllm upper bound by @prashantgupta24 in #369
- [CB] set and respect compiler constraint VLLM_DT_MAX_BATCH_TKV_LIMIT by @yannicks1 in #363
Full Changelog: v0.6.0...v0.7.0