
Commit 3d31322

Merge pull request #42 from georgian-io/inference
Inference
2 parents: 4a8ed11 + 861d3e7


37 files changed: +2667 −282 lines


README.md

Lines changed: 29 additions & 0 deletions

@@ -21,6 +21,7 @@

 [Getting Started](#getting-started)
 [LLM Roadmap](#llm-roadmap)
 [Benchmarks](#benchmarks)
+[Cost estimation and load testing](#cost-estimation-and-load-testing)
 [Contributing](#contributing)

 </div>
@@ -169,7 +170,35 @@

|ROUGE-1 (in %) |47.23 |49.21 |52.18 |47.75 |49.96 |51.71 |52.97 |
|ROUGE-2 (in %) |21.01 |23.39 |27.84 |23.53 |25.94 |26.86 |28.32 |

## Cost estimation and load testing
We deployed the models mentioned above behind two servers: a custom application built with FastAPI and the HuggingFace Text Generation Inference (TGI) server. The goal was to compare cost and latency between our custom FastAPI server and the inference server (TGI), which comes with many built-in optimizations.

All servers ran and received inference requests on an AWS g5.4xlarge instance with an NVIDIA A10 GPU. For load testing we used Vegeta to see how each system copes with a high volume of requests. Our objective was to identify the maximum RPS each model could sustain, along with throughput, latency, and cost per 1,000 tokens. We created a set of sample sentences, each roughly 100 tokens long, and a random sentence was chosen for every request, which keeps results comparable across models and servers. This method allowed us to identify the typical RPS range each model and server could handle for the various tasks.
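For illustration, a single Vegeta run against the FastAPI ```/predict``` endpoint can look like the sketch below. This is a reconstruction of the setup rather than the exact harness behind the tables; ```body.json``` stands in for one of the ~100-token sample prompts, and the rate and duration values are placeholders.

```
# Attack the endpoint at a fixed request rate for 30 seconds, then print a latency/throughput report
echo "POST http://localhost:8080/predict" | \
  vegeta attack -rate=100 -duration=30s \
    -header="Content-Type: application/json" \
    -body=body.json | \
  vegeta report
```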
Below, two tables summarize our observations for all the models, tasks, and most-used deployment options explored in this repository (we also tried Llama on an NVIDIA A100 using the Ray server; more details can be found [here](https://github.com/georgian-io/LLM-Finetuning-Hub/blob/main/llama2/README.md)). Generally, the TGI server is more cost-effective than the custom server and simpler to set up; it delivered higher RPS, higher throughput, and lower latency. A different inference server, [vLLM](https://vllm.readthedocs.io), can offer an even higher maximum RPS than TGI (you can find more details about our load-testing experiments with it for Llama-2 [here](https://github.com/georgian-io/LLM-Finetuning-Hub/blob/main/llama2/README.md)). Finally, models fine-tuned for classification are slower than those for summarization, and model size (number of parameters) doesn't significantly impact serving performance.

### Text Generation Inference

|                          | Classification |         |         |         |             |          |Summarization|        |        |        |         |        |
|--------------------------|----------------|---------|---------|---------|-------------|----------|-------------|--------|--------|--------|---------|--------|
| Model | Flan-T5 Large | Falcon-7B | RP-3B | RP-7B | LLama2-7B |LLama2-13B | Flan-T5 Large | Falcon-7B |RP-3B |RP-7B |LLama2-7B | LLama2-13B |
| Inference cost (per 1K tokens) | $0.00001 | $0.00005 | $0.00003 | $0.00003 | $0.00003 |$0.00003 | $0.00001 | $0.00004|$0.00001|$0.00002| $0.00002 | $0.00002 |
| RPS | 145 | 125 | 135 | 125 | 125 |125 | 120 | 145 | 195 |145 | 135 | 125 |
| Throughput | 78.5 | 30.3 | 57.3 | 26.13 | 19.81 | 9.60 | 45.5 | 53.8 | 96.06 |41.5 | 36.10 | 22.16 |
| Latency 90% (seconds) | 1.5 | 2.7 | 1.44 | 3.98 | 4.8 | 12.04 | 2.03 | 1.82 | 0.7139 |2.5 | 2.6 | 5.15 |

### FastAPI

|                          | Classification |         |         |         |           |          |Summarization|        |        |        |         |           |
|--------------------------|----------------|---------|---------|---------|-----------|----------|-------------|--------|--------|--------|---------|-----------|
| Model | Flan-T5 Large | Falcon-7B | RP-3B | RP-7B | LLama2-7B |LLama2-13B | Flan-T5 Large | Falcon-7B |RP-3B |RP-7B |LLama2-7B | LLama2-13B |
| Inference cost (per 1K tokens) | $0.00001 | - | $0.001 | $0.001 | $0.001 | $0.001 | $0.00007 | - |$0.00002|$0.00002| $0.00003 | $0.0003 |
| RPS | 180 | - | 4 | 4 | 4 | 4 | 30 | - | 160 |160 | 100 | 10 |
| Throughput | 5.84 | - | 0.15 | 0.14 | 0.11 | 0.14 | 1.5 | - | 5.46 |5.27 | 3.43 | 1.73 |
| Latency 90% (seconds) | 28.01 | - | 26.4 | 28.1 | 27.3 | 27.9 | 18.27 | - | 28.4 |29.527 | 28.1 | 5.1 |

In conclusion, the TGI server offers a more cost-efficient and streamlined approach than a custom server, delivering better performance across the board. While classification models tend to be slower, model size (in parameters) doesn't notably affect serving efficiency. Choosing the right server and model type is crucial for optimizing cost and latency.

## Contributing
inference/README.md

Lines changed: 115 additions & 0 deletions

@@ -0,0 +1,115 @@

# Deployment

In this section you can find instructions on how to deploy your model using FastAPI and Text Generation Inference.

To follow these instructions you need:

- Docker installed
- The path to the folder with your model weights
- A HuggingFace account

Note: To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).

## FastAPI

To build and run the FastAPI application, do the following:

1. Copy the folder with model weights to the ```fastapi_naive``` directory

```
cp -r model_weights llm-tuning-hub/inference/fastapi_naive
```

2. Navigate to the inference folder

```
cd ./inference
```

3. Build the Docker image

```
docker build -t fastapi_ml_app:latest ./fastapi_naive/
```

4. Run the Docker image, specifying the parameters:

- <code>APP_MODEL_PATH</code>: path to your model weights (the folder from step 1)
- <code>APP_TASK</code>: summarization or classification, depending on the task your model was trained for
- <code>APP_MAX_TARGET_LENGTH</code>: the maximum number of tokens to generate, ignoring the number of tokens in the prompt
- <code>APP_MODEL_TYPE</code>: the model type matching the model you want to deploy, according to this table

| Model      | Type    |
|------------|---------|
| Flan-T5    | seq2seq |
| Falcon-7B  | causal  |
| RedPajama  | causal  |
| LLama-2    | causal  |
<p></p>

```
docker run --gpus all -it --rm -p 8080:8080 --name app-web-test-run-ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -e APP_MODEL_PATH="weights/checkpoints-for-summarization/assets" -e APP_MODEL_TYPE="causal" -e APP_TASK="summarization" -e APP_MAX_TARGET_LENGTH=100 fastapi_ml_app:latest
```
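For a classification model, the same command applies with the task-specific values swapped in; a sketch assuming a seq2seq (Flan-T5-style) classification checkpoint, where the path and target length are illustrative:

```
docker run --gpus all -it --rm -p 8080:8080 --name app-web-test-run-ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all -e APP_MODEL_PATH="weights/checkpoints-for-classification/assets" -e APP_MODEL_TYPE="seq2seq" -e APP_TASK="classification" -e APP_MAX_TARGET_LENGTH=20 fastapi_ml_app:latest
```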
5. Test the application

```
python client.py --url http://localhost:8080/predict --prompt "Your custom prompt here"
```
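You can also hit the endpoint directly with curl; a minimal sketch that follows the request/response schema used by ```client.py``` and ```api.py``` (a JSON body with a ```prompt``` field, answered with a ```prediction``` field):

```
curl http://localhost:8080/predict \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Your custom prompt here"}'
```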
## [Text Generation Inference](https://github.com/huggingface/text-generation-inference)

1. Install the HuggingFace Hub library:

```
pip install huggingface_hub
```

2. Log in to your HuggingFace account:

```
huggingface-cli login
```

Note: you will need a read/write token, which you can create under Settings in your HF account.

3. Create a [New model](https://huggingface.co/new) repository on HuggingFace
4. Text Generation Inference requires a standalone (merged) model, which you can produce with the merge script:

```
python merge_script.py --model_path /my/path --model_type causal --repo_id johndoe/new_model
```

5. Serve the model:

```
model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>
```

```
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
```
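Once the container is up, you can send a test request to TGI's ```/generate``` endpoint; a minimal sketch (container port 80 is mapped to 8080 on the host in the command above, and the prompt and token count are placeholders):

```
curl http://localhost:8080/generate \
    -H "Content-Type: application/json" \
    -d '{"inputs": "Your custom prompt here", "parameters": {"max_new_tokens": 100}}'
```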
## [vLLM](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)

1. Install the package:

```
pip install vllm
```

2. Start the server:

Use the model name from the HuggingFace repository for the ```--model``` argument

```
python -m vllm.entrypoints.openai.api_server --model username/model
```

3. Make a request:

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```
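Since the server exposes an OpenAI-compatible API, you can also list the models it is serving to confirm the exact ```model``` value to use in the request above; a quick check, assuming the default port:

```
curl http://localhost:8000/v1/models
```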

inference/fastapi_naive/Dockerfile

Lines changed: 7 additions & 1 deletion

@@ -9,5 +9,11 @@ RUN pip install -r requirements.txt

 RUN ln -s /usr/bin/python3 /usr/bin/python

 ENV PYTHONPATH /app
+ENV APP_MODEL_PATH="weights/checkpoints/assets"
+ENV APP_MODEL_TYPE="causal"
+ENV APP_TASK="summarization"
+ENV APP_MAX_TARGET_LENGTH=50
+ENV APP_TEMPERATURE="0.01"
+
 COPY . /app
-CMD uvicorn --host 0.0.0.0 --port 8080 --workers 2 fastapi_naive.api:app
+CMD uvicorn --host 0.0.0.0 --port 8080 --workers 2 api:app
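These ENV defaults map onto the pydantic ```Settings``` fields in ```api.py``` (shown below) via the ```APP_``` prefix, so any of them, including ```APP_TEMPERATURE``` (which the FastAPI instructions above don't list), can be overridden when the container starts; a minimal sketch with illustrative values:

```
docker run --gpus all -p 8080:8080 \
  -e APP_TEMPERATURE="0.7" \
  -e APP_MAX_TARGET_LENGTH=200 \
  fastapi_ml_app:latest
```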

inference/fastapi_naive/api.py

Lines changed: 11 additions & 12 deletions

@@ -3,20 +3,17 @@

 from pydantic import BaseModel
 from pydantic_settings import BaseSettings

-from fastapi_naive.predictor import Predictor
+from predictor import Predictor

-# Settings for summarization
 class Settings(BaseSettings):
-    model_path: str = "weights/checkpoints-for-summarization/assets"
+    model_path: str = "weights/checkpoints/assets"
+    model_type: str= "causal"
     task: str = "summarization"
     max_target_length: int = 50
-
-
-# Settings for classification
-# class Settings(BaseSettings):
-#     model_path: str = 'weights/checkpoints-for-classification/assets'
-#     task: str = 'classification'
-#     max_target_length: int = 20
+    temperature: float = 0.01
+
+    class Config:
+        env_prefix = 'APP_'


 class Payload(BaseModel):

@@ -29,12 +26,14 @@ class Prediction(BaseModel):

 app = FastAPI()
 settings = Settings()
-predictor = Predictor(model_load_path=settings.model_path, task=settings.task)
+predictor = Predictor(model_load_path=settings.model_path, model_type=settings.model_type,
+                      task=settings.task)


 @app.post("/predict", response_model=Prediction)
 def predict(paylod: Payload) -> Prediction:
-    prediction = predictor.predict(prompt=paylod.prompt, max_target_length=settings.max_target_length)
+    prediction = predictor.predict(prompt=paylod.prompt, max_target_length=settings.max_target_length,
+                                   temperature=settings.temperature)
     return Prediction(prediction=prediction)
inference/fastapi_naive/client.py

Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@

import argparse
import requests

def send_request(url, prompt):
    response = requests.post(url, json={"prompt": prompt})

    if response.status_code == 200:
        return response.json()["prediction"]
    else:
        response.raise_for_status()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Send a POST request to a FastAPI endpoint with a given prompt.")

    parser.add_argument("--url", type=str, default="http://0.0.0.0:8080/predict", help="Endpoint URL to make the POST request.")
    parser.add_argument("--prompt", type=str)

    args = parser.parse_args()

    prediction = send_request(args.url, args.prompt)
    print(f"Prediction: {prediction}")
inference/fastapi_naive/predictor.py

Lines changed: 18 additions & 12 deletions

@@ -1,27 +1,33 @@

-from peft import PeftModel, PeftConfig
-from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+from peft import PeftConfig, AutoPeftModelForCausalLM
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
 import torch

-
 class Predictor:
-    def __init__(self, model_load_path: str, task: str = "summarization", load_in_8bit: bool = False):
-        config = PeftConfig.from_pretrained(model_load_path)
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=load_in_8bit)
-        self.tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
-
+    def __init__(self, model_load_path: str, model_type: str, task: str = "summarization",
+                 load_in_8bit: bool = False):
+        if model_type == "seq2seq":
+            config = PeftConfig.from_pretrained(model_load_path)
+            self.model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=load_in_8bit)
+            self.tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+        else:
+            self.model = AutoPeftModelForCausalLM.from_pretrained(model_load_path,
+                                                                  low_cpu_mem_usage=True,
+                                                                  torch_dtype=torch.float16,
+                                                                  load_in_4bit=True,)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_load_path)
+
         self.task = task
-        self.model = PeftModel.from_pretrained(self.model, model_load_path)

         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
         self.model.to(device)
         self.model.eval()

     def get_input_ids(self, prompt: str):
         if self.task == "summarization":
-            input_ids = self.tokenizer("summarize: " + prompt, return_tensors="pt", truncation=True).input_ids.cuda()
+            input_ids = self.tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
         else:
             input_ids = self.tokenizer(
-                "Classify the following sentence into a category: " + prompt.replace("\n", " ") + " The answer is: ",
+                prompt.replace("\n", " "),
                 return_tensors="pt",
                 truncation=True,
             ).input_ids.cuda()

@@ -39,4 +45,4 @@ def predict(self, prompt: str, max_target_length: int = 512, temperature: float
             )
         prediction = self.tokenizer.batch_decode(outputs.cpu().numpy(), skip_special_tokens=True)[0]

-        return prediction
+        return prediction

inference/fastapi_naive/predictor_causal_llm.py

Lines changed: 0 additions & 43 deletions
This file was deleted.

inference/fastapi_naive/test_predictor.py

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@

-from fastapi_naive.predictor import Predictor
+from predictor import Predictor

 if __name__ == "__main__":
     from datasets import load_dataset

inference/load_testing/vegeta/fastapi/classification/llama_7B/results1.txt

Lines changed: 0 additions & 4 deletions

@@ -70,10 +70,6 @@ Bytes Out [total, mean] 1359, 453.00

 Success [ratio] 100.00%
 Status Codes [code:count] 200:3
 Error Set:
-Requests [total, rate, throughput] 4, 5.33, 0.00
-Duration [total, attack, wait] 30.75s, 750.001ms, 30s
-Latencies [min, mean, 50, 90, 95, 99, max] 30s, 30s, 30s, 30s, 30s, 30s, 30s
-Bytes In [total, mean] 0, 0.00
 Requests [total, rate, throughput] 3, 4.50, 0.11
 Duration [total, attack, wait] 27.602s, 666.664ms, 26.935s
 Latencies [min, mean, 50, 90, 95, 99, max] 26.935s, 27.14s, 27.137s, 27.347s, 27.347s, 27.347s, 27.347s
