
Commit 3d31322

Merge pull request #42 from georgian-io/inference
Inference
2 parents: 4a8ed11 + 861d3e7


37 files changed: +2667 −282 lines


README.md

Lines changed: 29 additions & 0 deletions

@@ -21,6 +21,7 @@

 [Getting Started](#getting-started)
 [LLM Roadmap](#llm-roadmap)
 [Benchmarks](#benchmarks)
+[Cost estimation and load testing](#cost-estimation-and-load-testing)
 [Contributing](#contributing)

 </div>
@@ -169,7 +170,35 @@

|ROUGE-1 (in %) |47.23 |49.21 |52.18 |47.75 |49.96 |51.71 |52.97 |
|ROUGE-2 (in %) |21.01 |23.39 |27.84 |23.53 |25.94 |26.86 |28.32 |

## Cost estimation and load testing
We deployed the models mentioned above behind two servers: a custom application built with FastAPI and the HuggingFace Text Generation Inference (TGI) server. The goal was to compare cost and latency between our custom FastAPI server and the inference server (TGI), which comes with many built-in optimizations.

All servers ran and received inference requests on an AWS g5.4xlarge instance with an NVIDIA A10 GPU. For load testing we used Vegeta to see how each system copes with a high volume of requests. Our objective was to identify the maximum RPS each model could sustain, along with throughput, latency, and cost per 1,000 tokens. We created a set of sample sentences, each roughly 100 tokens long, and a random sentence was chosen for every request, which keeps results comparable across models and servers. This method allowed us to identify the typical RPS range each model and server could handle for the various tasks.
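For illustration, a single Vegeta run against the FastAPI ```/predict``` endpoint can look like the sketch below. This is a reconstruction of the setup rather than the exact harness behind the tables; ```body.json``` stands in for one of the ~100-token sample prompts, and the rate and duration values are placeholders.

```
# Attack the endpoint at a fixed request rate for 30 seconds, then print a latency/throughput report
echo "POST http://localhost:8080/predict" | \
  vegeta attack -rate=100 -duration=30s \
    -header="Content-Type: application/json" \
    -body=body.json | \
  vegeta report
```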
Below, two tables summarize our observations for all the models, tasks, and most-used deployment options explored in this repository (we also tried Llama on an NVIDIA A100 using the Ray server; more details can be found [here](https://github.com/georgian-io/LLM-Finetuning-Hub/blob/main/llama2/README.md)). Generally, the TGI server is more cost-effective than the custom server and simpler to set up; it delivered higher RPS, higher throughput, and lower latency. A different inference server, [vLLM](https://vllm.readthedocs.io), can offer an even higher maximum RPS than TGI (you can find more details about our load-testing experiments with it for Llama-2 [here](https://github.com/georgian-io/LLM-Finetuning-Hub/blob/main/llama2/README.md)). Finally, models fine-tuned for classification are slower than those for summarization, and model size (number of parameters) doesn't significantly impact serving performance.

### Text Generation Inference

|                          | Classification |         |         |         |             |          |Summarization|        |        |        |         |        |
|--------------------------|----------------|---------|---------|---------|-------------|----------|-------------|--------|--------|--------|---------|--------|
| Model | Flan-T5 Large | Falcon-7B | RP-3B | RP-7B | LLama2-7B |LLama2-13B | Flan-T5 Large | Falcon-7B |RP-3B |RP-7B |LLama2-7B | LLama2-13B |
| Inference cost (per 1K tokens) | $0.00001 | $0.00005 | $0.00003 | $0.00003 | $0.00003 |$0.00003 | $0.00001 | $0.00004|$0.00001|$0.00002| $0.00002 | $0.00002 |
| RPS | 145 | 125 | 135 | 125 | 125 |125 | 120 | 145 | 195 |145 | 135 | 125 |
| Throughput | 78.5 | 30.3 | 57.3 | 26.13 | 19.81 | 9.60 | 45.5 | 53.8 | 96.06 |41.5 | 36.10 | 22.16 |
| Latency 90% (seconds) | 1.5 | 2.7 | 1.44 | 3.98 | 4.8 | 12.04 | 2.03 | 1.82 | 0.7139 |2.5 | 2.6 | 5.15 |

### FastAPI

|                          | Classification |         |         |         |           |          |Summarization|        |        |        |         |           |
|--------------------------|----------------|---------|---------|---------|-----------|----------|-------------|--------|--------|--------|---------|-----------|
| Model | Flan-T5 Large | Falcon-7B | RP-3B | RP-7B | LLama2-7B |LLama2-13B | Flan-T5 Large | Falcon-7B |RP-3B |RP-7B |LLama2-7B | LLama2-13B |
| Inference cost (per 1K tokens) | $0.00001 | - | $0.001 | $0.001 | $0.001 | $0.001 | $0.00007 | - |$0.00002|$0.00002| $0.00003 | $0.0003 |
| RPS | 180 | - | 4 | 4 | 4 | 4 | 30 | - | 160 |160 | 100 | 10 |
| Throughput | 5.84 | - | 0.15 | 0.14 | 0.11 | 0.14 | 1.5 | - | 5.46 |5.27 | 3.43 | 1.73 |
| Latency 90% (seconds) | 28.01 | - | 26.4 | 28.1 | 27.3 | 27.9 | 18.27 | - | 28.4 |29.527 | 28.1 | 5.1 |

In conclusion, the TGI server offers a more cost-efficient and streamlined approach than a custom server, delivering better performance across the board. While classification models tend to be slower, model size (in parameters) doesn't notably affect serving efficiency. Choosing the right server and model type is crucial for optimizing cost and latency.

## Contributing
inference/README.md

Lines changed: 115 additions & 0 deletions

@@ -0,0 +1,115 @@

# Deployment

In this section you can find instructions on how to deploy your model using FastAPI and Text Generation Inference.

To follow these instructions you need:

- Docker installed
- The path to the folder with your model weights
- A HuggingFace account

Note: To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

## FastAPI

To build and run the FastAPI application, do the following:

1. Copy the folder with model weights to the ```fastapi_naive``` directory

```
cp -r model_weights llm-tuning-hub/inference/fastapi_naive
```

2. Navigate to the inference folder

```
cd ./inference
```

3. Build the Docker image

```
docker build -t fastapi_ml_app:latest ./fastapi_naive/
```

4. Run the Docker image, specifying the parameters:

- <code>APP_MODEL_PATH</code>: path to your model weights (the folder from step 1)
- <code>APP_TASK</code>: summarization or classification, depending on the task your model was trained for
- <code>APP_MAX_TARGET_LENGTH</code>: the maximum number of tokens to generate, ignoring the number of tokens in the prompt
- <code>APP_MODEL_TYPE</code>: the model type matching the model you want to deploy, according to this table

| Model      | Type    |
|------------|---------|
| Flan-T5    | seq2seq |
| Falcon-7B  | causal  |
| RedPajama  | causal  |
| LLama-2    | causal  |
<p></p>

```
docker run --gpus all -it --rm -p 8080:8080 --name app-web-test-run-ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -e APP_MODEL_PATH="weights/checkpoints-for-summarization/assets" -e APP_MODEL_TYPE="causal" -e APP_TASK="summarization" -e APP_MAX_TARGET_LENGTH=100 fastapi_ml_app:latest
```
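For a classification model, the same command applies with the task-specific values swapped in; a sketch assuming a seq2seq (Flan-T5-style) classification checkpoint, where the path and target length are illustrative:

```
docker run --gpus all -it --rm -p 8080:8080 --name app-web-test-run-ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -e APP_MODEL_PATH="weights/checkpoints-for-classification/assets" -e APP_MODEL_TYPE="seq2seq" -e APP_TASK="classification" -e APP_MAX_TARGET_LENGTH=20 fastapi_ml_app:latest
```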
5. Test the application

```
python client.py --url http://localhost:8080/predict --prompt "Your custom prompt here"
```
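You can also hit the endpoint directly with curl; a minimal sketch that follows the request/response schema used by ```client.py``` and ```api.py``` (a JSON body with a ```prompt``` field, answered with a ```prediction``` field):

```
curl http://localhost:8080/predict \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Your custom prompt here"}'
```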
## [Text Generation Inference](https://github.com/huggingface/text-generation-inference)

1. Install the HuggingFace Hub library:

```
pip install huggingface_hub
```

2. Log in to your HuggingFace account:

```
huggingface-cli login
```

Note: you will need a read/write token, which you can create under Settings in your HF account.

3. Create a [New model](https://huggingface.co/new) repository on HuggingFace
4. Text Generation Inference requires a standalone (merged) model, which you can produce with the merge script:

```
python merge_script.py --model_path /my/path --model_type causal --repo_id johndoe/new_model
```

5. Serve the model:

```
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>
```

```
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
```
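Once the container is up, you can send a test request to TGI's ```/generate``` endpoint; a minimal sketch (container port 80 is mapped to 8080 on the host in the command above, and the prompt and token count are placeholders):

```
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Your custom prompt here", "parameters": {"max_new_tokens": 100}}'
```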
## [vLLM](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)

1. Install the package:

```
pip install vllm
```

2. Start the server:

Use the model name from the HuggingFace repository for the ```--model``` argument

```
python -m vllm.entrypoints.openai.api_server --model username/model
```

3. Make a request:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```
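Since the server exposes an OpenAI-compatible API, you can also list the models it is serving to confirm the exact ```model``` value to use in the request above; a quick check, assuming the default port:

```
curl http://localhost:8000/v1/models
```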

inference/fastapi_naive/Dockerfile

Lines changed: 7 additions & 1 deletion

@@ -9,5 +9,11 @@ RUN pip install -r requirements.txt

 RUN ln -s /usr/bin/python3 /usr/bin/python

 ENV PYTHONPATH /app
+ENV APP_MODEL_PATH="weights/checkpoints/assets"
+ENV APP_MODEL_TYPE="causal"
+ENV APP_TASK="summarization"
+ENV APP_MAX_TARGET_LENGTH=50
+ENV APP_TEMPERATURE="0.01"
+
 COPY . /app
-CMD uvicorn --host 0.0.0.0 --port 8080 --workers 2 fastapi_naive.api:app
+CMD uvicorn --host 0.0.0.0 --port 8080 --workers 2 api:app
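These ENV defaults map onto the pydantic ```Settings``` fields in ```api.py``` (shown below) via the ```APP_``` prefix, so any of them, including ```APP_TEMPERATURE``` (which the FastAPI instructions above don't list), can be overridden when the container starts; a minimal sketch with illustrative values:

```
docker run --gpus all -p 8080:8080 \
  -e APP_TEMPERATURE="0.7" \
  -e APP_MAX_TARGET_LENGTH=200 \
  fastapi_ml_app:latest
```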

inference/fastapi_naive/api.py

Lines changed: 11 additions & 12 deletions

@@ -3,20 +3,17 @@

 from pydantic import BaseModel
 from pydantic_settings import BaseSettings

-from fastapi_naive.predictor import Predictor
+from predictor import Predictor

-# Settings for summarization
 class Settings(BaseSettings):
-    model_path: str = "weights/checkpoints-for-summarization/assets"
+    model_path: str = "weights/checkpoints/assets"
+    model_type: str= "causal"
     task: str = "summarization"
     max_target_length: int = 50
-
-
-# Settings for classification
-# class Settings(BaseSettings):
-#     model_path: str = 'weights/checkpoints-for-classification/assets'
-#     task: str = 'classification'
-#     max_target_length: int = 20
+    temperature: float = 0.01
+
+    class Config:
+        env_prefix = 'APP_'


 class Payload(BaseModel):

@@ -29,12 +26,14 @@ class Prediction(BaseModel):

 app = FastAPI()
 settings = Settings()
-predictor = Predictor(model_load_path=settings.model_path, task=settings.task)
+predictor = Predictor(model_load_path=settings.model_path, model_type=settings.model_type,
+                      task=settings.task)


 @app.post("/predict", response_model=Prediction)
 def predict(paylod: Payload) -> Prediction:
-    prediction = predictor.predict(prompt=paylod.prompt, max_target_length=settings.max_target_length)
+    prediction = predictor.predict(prompt=paylod.prompt, max_target_length=settings.max_target_length,
+                                   temperature=settings.temperature)
     return Prediction(prediction=prediction)
inference/fastapi_naive/client.py

Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@

import argparse
import requests

def send_request(url, prompt):
    response = requests.post(url, json={"prompt": prompt})

    if response.status_code == 200:
        return response.json()["prediction"]
    else:
        response.raise_for_status()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Send a POST request to a FastAPI endpoint with a given prompt.")

    parser.add_argument("--url", type=str, default="http://0.0.0.0:8080/predict", help="Endpoint URL to make the POST request.")
    parser.add_argument("--prompt", type=str)

    args = parser.parse_args()

    prediction = send_request(args.url, args.prompt)
    print(f"Prediction: {prediction}")
inference/fastapi_naive/predictor.py

Lines changed: 18 additions & 12 deletions

@@ -1,27 +1,33 @@

-from peft import PeftModel, PeftConfig
-from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+from peft import PeftConfig, AutoPeftModelForCausalLM
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 import torch

-
 class Predictor:
-    def __init__(self, model_load_path: str, task: str = "summarization", load_in_8bit: bool = False):
-        config = PeftConfig.from_pretrained(model_load_path)
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=load_in_8bit)
-        self.tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
-
+    def __init__(self, model_load_path: str, model_type: str, task: str = "summarization",
+                 load_in_8bit: bool = False):
+        if model_type == "seq2seq":
+            config = PeftConfig.from_pretrained(model_load_path)
+            self.model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=load_in_8bit)
+            self.tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+        else:
+            self.model = AutoPeftModelForCausalLM.from_pretrained(model_load_path,
+                                                                  low_cpu_mem_usage=True,
+                                                                  torch_dtype=torch.float16,
+                                                                  load_in_4bit=True,)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_load_path)
+
         self.task = task
-        self.model = PeftModel.from_pretrained(self.model, model_load_path)

         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
         self.model.to(device)
         self.model.eval()

     def get_input_ids(self, prompt: str):
         if self.task == "summarization":
-            input_ids = self.tokenizer("summarize: " + prompt, return_tensors="pt", truncation=True).input_ids.cuda()
+            input_ids = self.tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
         else:
             input_ids = self.tokenizer(
-                "Classify the following sentence into a category: " + prompt.replace("\n", " ") + " The answer is: ",
+                prompt.replace("\n", " "),
                 return_tensors="pt",
                 truncation=True,
             ).input_ids.cuda()

@@ -39,4 +45,4 @@ def predict(self, prompt: str, max_target_length: int = 512, temperature: float
             )
         prediction = self.tokenizer.batch_decode(outputs.cpu().numpy(), skip_special_tokens=True)[0]

-        return prediction
+        return prediction

inference/fastapi_naive/predictor_causal_llm.py

Lines changed: 0 additions & 43 deletions
This file was deleted.

inference/fastapi_naive/test_predictor.py

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@

-from fastapi_naive.predictor import Predictor
+from predictor import Predictor

 if __name__ == "__main__":
     from datasets import load_dataset

inference/load_testing/vegeta/fastapi/classification/llama_7B/results1.txt

Lines changed: 0 additions & 4 deletions

@@ -70,10 +70,6 @@ Bytes Out [total, mean] 1359, 453.00

 Success [ratio] 100.00%
 Status Codes [code:count] 200:3
 Error Set:
-Requests [total, rate, throughput] 4, 5.33, 0.00
-Duration [total, attack, wait] 30.75s, 750.001ms, 30s
-Latencies [min, mean, 50, 90, 95, 99, max] 30s, 30s, 30s, 30s, 30s, 30s, 30s
-Bytes In [total, mean] 0, 0.00
 Requests [total, rate, throughput] 3, 4.50, 0.11
 Duration [total, attack, wait] 27.602s, 666.664ms, 26.935s
 Latencies [min, mean, 50, 90, 95, 99, max] 26.935s, 27.14s, 27.137s, 27.347s, 27.347s, 27.347s, 27.347s
