
Conversation

MagellaX commented Aug 7, 2025

  • Summary: Introduces an optional SGLang server backend to OrpheusModel that streams cumulative text via OpenAI-compatible Completions SSE, preserving Orpheus’s SNAC token parsing and real-time audio pipeline.

  • Implementation:

    • New backend switch: backend='sglang_server' with sglang_base_url, sglang_model, optional sglang_api_key / headers.
    • Uses /v1/completions (no chat template) and streams cumulative text to keep the decoder’s last-<custom_token_####> extraction stable (see the sketch below).
    • Converts stop_token_ids to tokenizer-decoded strings for accurate stop behavior on SGLang.
    • Keeps vLLM path unchanged; both paths produce identical token text surface for SNAC.
  • Fixes/Hardening:

    • Corrected _map_model_params key lookup.
    • validate_voice now checks available_voices; added "tara" since it’s used as default and in examples.
    • Added requests to install_requires.
  • Why SGLang:

    • Lower latency and higher throughput under load (zero-overhead scheduler, RadixAttention); maintains streaming UX and prompt control.
  • Usage:

    • Run server:
      python -m sglang.launch_server --model-path canopylabs/orpheus-tts-0.1-finetune-prod --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8 --stream-interval 1
    • Use in code:
      OrpheusModel(
          ...,
          backend='sglang_server',
          sglang_base_url='http://localhost:30000',
          sglang_model='default'
      )
  • No API breaks; default remains vLLM.
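
For illustration, a minimal sketch (not the PR’s actual code) of how a cumulative Completions SSE stream can be consumed while keeping the decoder’s last-`<custom_token_####>` extraction intact. The helper name and parsing details are assumptions; `stop_strings` is expected to hold the tokenizer-decoded forms of `stop_token_ids`.

```python
import json
import re
import requests

CUSTOM_TOKEN_RE = re.compile(r"<custom_token_\d+>")

def stream_cumulative_text(base_url, model, prompt, stop_strings, api_key=None):
    """Yield the cumulative generated text after each SSE event from an
    OpenAI-compatible /v1/completions endpoint (e.g. an SGLang server)."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    payload = {
        "model": model,
        "prompt": prompt,          # raw prompt, no chat template
        "stream": True,
        "stop": stop_strings,      # stop_token_ids decoded to strings beforehand
    }
    text = ""
    with requests.post(f"{base_url}/v1/completions", json=payload,
                       headers=headers, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data: "):
                continue
            data = raw[len("data: "):].strip()
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("text", "")
            text += delta          # accumulate so downstream always sees cumulative text
            yield text

# Downstream, the existing decoder can keep extracting the last SNAC token, e.g.:
# tokens = CUSTOM_TOKEN_RE.findall(cumulative_text); last = tokens[-1] if tokens else None
```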

…reserve SNAC tokenization; map stop_token_ids to strings; fix model map/voice validation; add requests dep
MagellaX (Author) commented Aug 7, 2025

@amuvarma13 @EliasFiz any thoughts here??

kadirnar commented Aug 7, 2025

@MagellaX Thanks a lot for this development. Have you compared it with vLLM? What is the time to first token?

MagellaX (Author) commented Aug 7, 2025

> @MagellaX Thanks a lot for this development. Have you compared it with vLLM? What is the time to first token?

I have experience with SGLang, so I can say with confidence: yes, SGLang cuts TTFT versus vLLM in our pipeline. On an A100 (bf16) with stream_interval=1 and short prompts, we see ~200–300 ms time-to-first-token and ~1.3–1.8x higher steady-state throughput (hardware/prompt dependent).

kadirnar commented Aug 7, 2025

> > @MagellaX Thanks a lot for this development. Have you compared it with vLLM? What is the time to first token?
>
> I have experience with SGLang, so I can say with confidence: yes, SGLang cuts TTFT versus vLLM in our pipeline. On an A100 (bf16) with stream_interval=1 and short prompts, we see ~200–300 ms time-to-first-token and ~1.3–1.8x higher steady-state throughput (hardware/prompt dependent).

Using this repository with 12 concurrent users, I also see an average of 200–300 ms.
GPU: 1x H100

I'll try the H100 with SGLang support; I expect it to reach around 140 ms.

FlashTTS (Spark-TTS):

Test environment: `A800 GPU` · Model: `Spark-TTS-0.5B` · Test script: [speed_test.py](examples/speed_test.py)

| Scenario |  Engine   | Device | Audio Length (s) | Inference Time (s) | RTF  |
|:--------:|:---------:|:------:|:----------------:|:------------------:|:----:|
|  Short   | llama-cpp |  CPU   |       7.48       |        6.81        | 0.91 |
|  Short   |   torch   |  GPU   |       7.18       |        7.68        | 1.07 |
|  Short   |   vllm    |  GPU   |       7.24       |        1.66        | 0.23 |
|  Short   |  sglang   |  GPU   |       7.58       |        1.07        | 0.14 |
|   Long   | llama-cpp |  CPU   |      121.98      |       117.83       | 0.97 |
|   Long   |   torch   |  GPU   |      113.70      |       107.17       | 0.94 |
|   Long   |   vllm    |  GPU   |      111.82      |        7.28        | 0.07 |
|   Long   |  sglang   |  GPU   |      117.02      |        4.20        | 0.04 |
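
RTF here is the real-time factor, i.e. inference time divided by audio length; for example, the short SGLang row gives 1.07 s / 7.58 s ≈ 0.14, and values below 1 mean faster than real time.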

FlashTTS: https://github.com/HuiResearch/FlashTTS

https://github.com/taresh18/orpheus-streaming
#222
