-
Hi, LMDeploy does support Llama 2 in both the original checkpoint (ckpt) and Hugging Face Transformers model formats. You can convert it to the format LMDeploy requires with the following command: `python lmdeploy/serve/turbomind/deploy.py llama2 /the/path/of/original/llama2/model --model-format llama --tokenizer-path /the/path/of/tokenizer/model/of/original/llama2`
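For readability, the same conversion command as a shell snippet. The paths are the placeholders from the comment above, not real locations; substitute the actual paths to your downloaded Meta Llama 2 weights and tokenizer:

```shell
# Convert an original Meta Llama 2 checkpoint to the TurboMind format.
# Placeholder paths below come from the comment above; replace them with
# the real locations of your model directory and tokenizer file.
python lmdeploy/serve/turbomind/deploy.py llama2 \
    /the/path/of/original/llama2/model \
    --model-format llama \
    --tokenizer-path /the/path/of/tokenizer/model/of/original/llama2
```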
-
Hi, @realhaik
-
I am reading this article, and unfortunately it uses the Hugging Face Llama 2 model. Does this mean that the original Meta Llama 2 model is not supported?
https://openmmlab.medium.com/deploy-llama-2-models-easily-with-lmdeploy-1cb001d70290
This is a real deal breaker, because the HF model is defective: the results it produces are completely broken, while the original Meta model works quite well. I feel sorry for anyone wasting their time with the HF model.