# Flexible Inference Benchmarker
A modular, extensible LLM inference benchmarking framework that supports multiple serving backends and benchmarking paradigms.

This benchmarker operates entirely externally to any serving framework and is designed to be fully featured, providing a variety of statistics and profiling modes while remaining easy to extend and modify.

## Installation
```
cd flexible-inference-benchmark
pip install .
```

## Usage
After installing with the above instructions, the benchmarker can be invoked with `inference-benchmark <args>`.
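
For example, a minimal run against a locally hosted OpenAI-compatible server might look like the sketch below. The URL, endpoint, model, and request count are placeholders, and depending on your setup additional flags from the table below may be required:
```
# placeholder values; adjust for your own server, model, and workload
inference-benchmark \
  --backend openai \
  --base-url http://localhost:8000 \
  --endpoint /v1/completions \
  --model gpt2 \
  --num-of-req 20 \
  --output-file results.json
```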

After you get your output (written with `--output-file`), you can run one of the data postprocessors in `data_postprocessors/`.

### Parameters
| argument | description |
| --- | --- |
| `--seed` | Seed for reproducibility. |
| `--backend` | Backend options: `tgi`, `vllm`, `cserve`, `cserve-debug`, `lmdeploy`, `deepspeed-mii`, `openai`, `openai-chat`, `tensorrt-llm`. <br> **For tensorrt-llm, temperature is set to 0.01, since NGC containers >= 24.06 do not support 0.0.** |
| `--base-url` | Server or API base URL, without the endpoint. |
| `--endpoint` | API endpoint. |
| one of <br> `--num-of-req` **or** <br> `--max-time-for-reqs` | <br> Total number of requests to send. <br> Time window for sending requests **(in seconds)**. |
| `--request-distribution` | Distribution of request arrivals, e.g. `exponential 5` (requests follow an exponential distribution with an average of **5 seconds** between requests; see the example below the table). <br> Options: <br> `poisson rate` <br> `uniform min_val max_val` <br> `normal mean std` |
| `--input-token-distribution` | Distribution of prompt lengths, e.g. <br> `uniform min_val max_val` <br> `normal mean std` |
| `--output-token-distribution` | Distribution of output token lengths, e.g. <br> `uniform min_val max_val` <br> `normal mean std` |
| one of: <br> `--prefix-text` <br> `--prefix-len` <br> `--no-prefix` | <br> Text to use as a prefix for all requests. <br> Length of the prefix to use for all requests. <br> Do not prefix requests. |
| `--dataset-name` | Name of the dataset to benchmark on: {`sharegpt`, `other`, `random`}. |
| `--dataset-path` | Path to the dataset. |
| `--model` | Name of the model. |
| `--tokenizer` | Name or path of the tokenizer, if not using the default tokenizer. |
| `--disable-tqdm` | Disable the tqdm progress bar. |
| `--best-of` | Number of best completions to return. |
| `--use-beam-search` | Use beam search for completions. |
| `--output-file` | Output JSON file to save the results. |
| `--debug` | Log debug messages. |
| `--disable-ignore-eos` | Disable the default ignore-EOS behavior, so generation stops at the end-of-sequence token. <br> **Note:** not a valid argument for TensorRT-LLM. |
| `--disable-stream` | Send requests with `stream: false` (for APIs without a streaming option). |
| `--cookies` | Include cookies in the request. |
| `--config-file` | Path to a configuration file. |
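
To illustrate the distribution flags, a synthetic run on random data might look like the sketch below. All values are placeholders, and the exact syntax for passing distribution parameters should be checked against `inference-benchmark --help`:
```
# placeholder values; a sketch only, not a verified invocation
inference-benchmark \
  --backend vllm \
  --base-url http://localhost:8000 \
  --endpoint /v1/completions \
  --model gpt2 \
  --dataset-name random \
  --num-of-req 100 \
  --request-distribution poisson 2 \
  --input-token-distribution uniform 64 256 \
  --output-token-distribution normal 128 32 \
  --no-prefix \
  --output-file random-run.json
```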

**For ease of use, we recommend passing a configuration file with all the required parameters for your use case. Examples are provided in `examples/`.**

### Output
The output JSON file is an array of objects that contain the following fields:
* `backend`: Backend used
* `time`: Total benchmark time
* `outputs`:
  * `text`: Generated text
  * `success`: Whether the request was successful
  * `latency`: End-to-end time for the request
  * `ttft`: Time to first token
  * `itl`: Inter-token latency
  * `prompt_len`: Length of the prompt
  * `error`: Error message, if any
* `inputs`: List of `[prompt string, input tokens, expected output tokens]`
* `tokenizer`: Tokenizer name
* `stream`: Whether the stream argument was used
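
For reference, a single trimmed record might look roughly like the sketch below. All values are made up, and the exact shape of some fields (for example, whether `itl` is a list of per-token latencies and whether times are in seconds) may differ in your version:
```
[
  {
    "backend": "vllm",
    "time": 19.39,
    "outputs": [
      {
        "text": "Once upon a time ...",
        "success": true,
        "latency": 1.72,
        "ttft": 0.024,
        "itl": [0.007, 0.008, 0.007],
        "prompt_len": 21,
        "error": ""
      }
    ],
    "inputs": [["Tell me a story", 21, 256]],
    "tokenizer": "gpt2",
    "stream": true
  }
]
```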

### Data Postprocessors
Below is a description of the available data postprocessors.

#### `performance.py`
Prints the following summary for a given run, in the same format as vLLM's benchmark output.

```
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 19.39
Total input tokens: 407
Total generated tokens: 5112
Request throughput (req/s): 1.03
Input token throughput (tok/s): 20.99
Output token throughput (tok/s): 263.66
---------------Time to First Token----------------
Mean TTFT (ms): 24.66
Median TTFT (ms): 24.64
P99 TTFT (ms): 34.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2295.86
Median TPOT (ms): 2362.54
P99 TPOT (ms): 2750.76
==================================================
```

Supports the following args:

| argument | description |
| --- | --- |
| `--datapath` | Path to the output JSON file produced. |
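
For example, if your benchmark results were written to `results.json` (a placeholder path) and you run the script from the `data_postprocessors` folder:
```
# placeholder path; point --datapath at your own benchmark output
python performance.py --datapath results.json
```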

#### `itl.py`

Produces a plot of inter-token latencies for a specific request. Takes the following args:

| argument | description |
| --- | --- |
| `--datapath` | Path to the output JSON file produced. |
| `--output` | Path to save the figure (any format supported by matplotlib). |
| `--request-num` | Which request to produce the ITL plot for. |
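
For example, to plot the ITL of a single request (paths are placeholders, and it is assumed here that `--request-num` is a zero-based index into the run's requests):
```
# placeholder paths; --request-num assumed to be a zero-based request index
python itl.py --datapath results.json --output itl.png --request-num 0
```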

#### `ttft.py`

Generates a simple CDF plot of **time to first token** across requests. You can pass a single file or a list of benchmark output files to compare runs.

| argument | description |
| --- | --- |
| `--files` | File(s) to generate the plot from. |
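
For example, to compare two runs (filenames are placeholders; this assumes `--files` accepts multiple space-separated paths, as the description above suggests):
```
# placeholder filenames from two separate benchmark runs
python ttft.py --files vllm-benchmark.json tgi-benchmark.json
```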

## Example

Let's use vLLM as the backend for our benchmark.
You can install vLLM with: <br>
`pip install vllm`

We will use gpt2 as the model: <br>
`python -m vllm.entrypoints.openai.api_server --model gpt2`

Once the backend is up and running, go to the `examples` folder and run the benchmark using the `vllm_args.json` file: <br>
`cd examples` <br>
`inference-benchmark --config-file vllm_args.json --output-file vllm-benchmark.json`

Then go to the `data_postprocessors` folder and check the performance with `performance.py`: <br>
`cd ../data_postprocessors` <br>
`python performance.py --datapath ../examples/vllm-benchmark.json`

```
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 4.15
Total input tokens: 3836
Total generated tokens: 4000
Request throughput (req/s): 4.82
Input token throughput (tok/s): 925.20
Output token throughput (tok/s): 964.76
---------------Time to First Token----------------
Mean TTFT (ms): 19.91
Median TTFT (ms): 22.11
P99 TTFT (ms): 28.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 6.73
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.41
---------------Inter-token Latency----------------
Mean ITL (ms): 6.73
Median ITL (ms): 7.40
P99 ITL (ms): 20.70
==================================================
```