# Flexible Inference Benchmarker
A modular, extensible LLM inference benchmarking framework that supports multiple serving backends and benchmarking paradigms.

This benchmarker operates entirely externally to any serving framework and is designed to be fully featured, providing a variety of statistics and profiling modes while remaining easy to extend and modify.

## Installation
```
cd flexible-inference-benchmark
pip install .
```

## Usage
After installing with the above instructions, the benchmarker can be invoked with `inference-benchmark <args>`.
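
For example, a minimal run against a locally hosted OpenAI-compatible server might look like the sketch below. The URL, endpoint, model, and request count are placeholders, and depending on your setup additional flags from the table below may be required:
```
# placeholder values; adjust for your own server, model, and workload
inference-benchmark \
  --backend openai \
  --base-url http://localhost:8000 \
  --endpoint /v1/completions \
  --model gpt2 \
  --num-of-req 20 \
  --output-file results.json
```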

After you get your output (written with `--output-file`), you can run one of the data postprocessors in `data_postprocessors/`.

### Parameters
| argument | description |
| --- | --- |
| `--seed` | Seed for reproducibility. |
| `--backend` | Backend options: `tgi`, `vllm`, `cserve`, `cserve-debug`, `lmdeploy`, `deepspeed-mii`, `openai`, `openai-chat`, `tensorrt-llm`. <br> **For tensorrt-llm, temperature is set to 0.01, since NGC containers >= 24.06 do not support 0.0.** |
| `--base-url` | Server or API base URL, without the endpoint. |
| `--endpoint` | API endpoint. |
| one of <br> `--num-of-req` **or** <br> `--max-time-for-reqs` | <br> Total number of requests to send. <br> Time window for sending requests **(in seconds)**. |
| `--request-distribution` | Distribution of request arrivals, e.g. `exponential 5` (requests follow an exponential distribution with an average of **5 seconds** between requests; see the example below the table). <br> Options: <br> `poisson rate` <br> `uniform min_val max_val` <br> `normal mean std` |
| `--input-token-distribution` | Distribution of prompt lengths, e.g. <br> `uniform min_val max_val` <br> `normal mean std` |
| `--output-token-distribution` | Distribution of output token lengths, e.g. <br> `uniform min_val max_val` <br> `normal mean std` |
| one of: <br> `--prefix-text` <br> `--prefix-len` <br> `--no-prefix` | <br> Text to use as a prefix for all requests. <br> Length of the prefix to use for all requests. <br> Do not prefix requests. |
| `--dataset-name` | Name of the dataset to benchmark on: {`sharegpt`, `other`, `random`}. |
| `--dataset-path` | Path to the dataset. |
| `--model` | Name of the model. |
| `--tokenizer` | Name or path of the tokenizer, if not using the default tokenizer. |
| `--disable-tqdm` | Disable the tqdm progress bar. |
| `--best-of` | Number of best completions to return. |
| `--use-beam-search` | Use beam search for completions. |
| `--output-file` | Output JSON file to save the results. |
| `--debug` | Log debug messages. |
| `--disable-ignore-eos` | Disable the default ignore-EOS behavior, so generation stops at the end-of-sequence token. <br> **Note:** not a valid argument for TensorRT-LLM. |
| `--disable-stream` | Send requests with `stream: false` (for APIs without a streaming option). |
| `--cookies` | Include cookies in the request. |
| `--config-file` | Path to a configuration file. |
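
To illustrate the distribution flags, a synthetic run on random data might look like the sketch below. All values are placeholders, and the exact syntax for passing distribution parameters should be checked against `inference-benchmark --help`:
```
# placeholder values; a sketch only, not a verified invocation
inference-benchmark \
  --backend vllm \
  --base-url http://localhost:8000 \
  --endpoint /v1/completions \
  --model gpt2 \
  --dataset-name random \
  --num-of-req 100 \
  --request-distribution poisson 2 \
  --input-token-distribution uniform 64 256 \
  --output-token-distribution normal 128 32 \
  --no-prefix \
  --output-file random-run.json
```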

**For ease of use, we recommend passing a configuration file with all the required parameters for your use case. Examples are provided in `examples/`.**

### Output
The output JSON file is an array of objects that contain the following fields:
* `backend`: Backend used
* `time`: Total benchmark time
* `outputs`:
  * `text`: Generated text
  * `success`: Whether the request was successful
  * `latency`: End-to-end time for the request
  * `ttft`: Time to first token
  * `itl`: Inter-token latency
  * `prompt_len`: Length of the prompt
  * `error`: Error message, if any
* `inputs`: List of `[prompt string, input tokens, expected output tokens]`
* `tokenizer`: Tokenizer name
* `stream`: Whether the stream argument was used
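
For reference, a single trimmed record might look roughly like the sketch below. All values are made up, and the exact shape of some fields (for example, whether `itl` is a list of per-token latencies and whether times are in seconds) may differ in your version:
```
[
  {
    "backend": "vllm",
    "time": 19.39,
    "outputs": [
      {
        "text": "Once upon a time ...",
        "success": true,
        "latency": 1.72,
        "ttft": 0.024,
        "itl": [0.007, 0.008, 0.007],
        "prompt_len": 21,
        "error": ""
      }
    ],
    "inputs": [["Tell me a story", 21, 256]],
    "tokenizer": "gpt2",
    "stream": true
  }
]
```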

### Data Postprocessors
Below is a description of the available data postprocessors.

#### `performance.py`
Prints the following summary for a given run, in the same format as vLLM's benchmark output.

```
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 19.39
Total input tokens: 407
Total generated tokens: 5112
Request throughput (req/s): 1.03
Input token throughput (tok/s): 20.99
Output token throughput (tok/s): 263.66
---------------Time to First Token----------------
Mean TTFT (ms): 24.66
Median TTFT (ms): 24.64
P99 TTFT (ms): 34.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2295.86
Median TPOT (ms): 2362.54
P99 TPOT (ms): 2750.76
==================================================
```

Supports the following args:

| argument | description |
| --- | --- |
| `--datapath` | Path to the output JSON file produced. |
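
For example, if your benchmark results were written to `results.json` (a placeholder path) and you run the script from the `data_postprocessors` folder:
```
# placeholder path; point --datapath at your own benchmark output
python performance.py --datapath results.json
```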

#### `itl.py`

Produces a plot of inter-token latencies for a specific request. Takes the following args:

| argument | description |
| --- | --- |
| `--datapath` | Path to the output JSON file produced. |
| `--output` | Path to save the figure (any format supported by matplotlib). |
| `--request-num` | Which request to produce the ITL plot for. |
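
For example, to plot the ITL of a single request (paths are placeholders, and it is assumed here that `--request-num` is a zero-based index into the run's requests):
```
# placeholder paths; --request-num assumed to be a zero-based request index
python itl.py --datapath results.json --output itl.png --request-num 0
```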

#### `ttft.py`

Generates a simple CDF plot of **time to first token** across requests. You can pass a single file or a list of benchmark output files to compare runs.

| argument | description |
| --- | --- |
| `--files` | File(s) to generate the plot from. |
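
For example, to compare two runs (filenames are placeholders; this assumes `--files` accepts multiple space-separated paths, as the description above suggests):
```
# placeholder filenames from two separate benchmark runs
python ttft.py --files vllm-benchmark.json tgi-benchmark.json
```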

## Example

Let's use vLLM as the backend for our benchmark.
You can install vLLM with: <br>
`pip install vllm`

We will use gpt2 as the model: <br>
`python -m vllm.entrypoints.openai.api_server --model gpt2`

Once the backend is up and running, go to the `examples` folder and run the benchmark using the `vllm_args.json` file: <br>
`cd examples` <br>
`inference-benchmark --config-file vllm_args.json --output-file vllm-benchmark.json`

Then go to the `data_postprocessors` folder and check the performance with `performance.py`: <br>
`cd ../data_postprocessors` <br>
`python performance.py --datapath ../examples/vllm-benchmark.json`

```
============ Serving Benchmark Result ============
Successful requests: 20
Benchmark duration (s): 4.15
Total input tokens: 3836
Total generated tokens: 4000
Request throughput (req/s): 4.82
Input token throughput (tok/s): 925.20
Output token throughput (tok/s): 964.76
---------------Time to First Token----------------
Mean TTFT (ms): 19.91
Median TTFT (ms): 22.11
P99 TTFT (ms): 28.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 6.73
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.41
---------------Inter-token Latency----------------
Mean ITL (ms): 6.73
Median ITL (ms): 7.40
P99 ITL (ms): 20.70
==================================================
```