
TensorRT-LLM

https://github.com/NVIDIA/TensorRT-LLM

TensorRT-LLM gives users an easy-to-use Python API for defining large language models (LLMs) and building TensorRT engines that contain state-of-the-art optimizations, so that inference runs efficiently on NVIDIA GPUs. TensorRT-LLM also includes components for creating the Python and C++ runtimes that execute these TensorRT engines.
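Newer TensorRT-LLM releases also ship a high-level LLM API on top of this. A minimal sketch of that API follows; it is not the build.py-based workflow used in this article, the model name is only an example, and the exact interface depends on the installed TensorRT-LLM version.

from tensorrt_llm import LLM, SamplingParams

# Build (or load a cached) engine for the model, then run a single prompt.
llm = LLM(model="Qwen/Qwen2-1.5B-Instruct")  # example model name, not from this article
outputs = llm.generate(["Tell me a short story"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)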

https://github.com/NVIDIA/TensorRT-LLM/tree/f430a4b447ef4cba22698902d43eae0debf08594/tensorrt_llm/models/qwen

https://github.com/NVIDIA/TensorRT-LLM/tree/f430a4b447ef4cba22698902d43eae0debf08594/examples/qwen

 

[LLMOps] Deploying Qwen with Triton + TensorRT-LLM

https://www.cnblogs.com/zhouwenyang/p/18023854

Background

TensorRT-LLM is NVIDIA's official inference-acceleration framework for large models; at the moment it only provides tailored acceleration for certain GPU models. The recently released Chat with RTX also runs its local inference on top of TensorRT-LLM.

TensorRT-LLM supports PagedAttention, FlashAttention, SafeTensors and other techniques, and some community benchmarks claim its throughput exceeds vLLM's.

Preparation

  • GPU: A800
  • Qwen-7B pretrained model

For the image, it is best to build the latest one yourself. I tried the image provided by NVIDIA and found it lagged behind in version; it also produced various incompatibilities that are easy to mistake for mistakes of your own.

Getting started

Converting the weights

First, the Qwen model weights need to be converted into the .engine format that TensorRT supports.

Building the environment

Download the official TensorRT-LLM code: https://github.com/NVIDIA/TensorRT-LLM.git

Then edit TensorRT-LLM/docker/Dockerfile.multi; the only change needed is adding a pip index mirror around line 59.

cd TensorRT-LLM/docker
make build

Run the commands above to build the image. In my case, the resulting image is named tensorrt-llm:v3.

Starting the container

docker run -it --gpus '"device=1"' --name trt-llm -v /home:/home tensorrt-llm:v3 bash
docker exec -it trt-llm bash

Converting the weights

Inside the container:

cd examples/qwen
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt

A tensorrt version conflict will be reported along the way; it can be ignored.

Run the conversion:

python3 build.py --hf_model_dir /home/Qwen-7b/ --dtype bfloat16 --paged_kv_cache --use_gpt_attention_plugin bfloat16 --enable_context_fmha --use_gemm_plugin bfloat16 --use_inflight_batching --remove_input_padding --output /home/trt_engines_qwen7b_bf16

Test:

python3 ../run.py --input_text "请你讲述一个故事" --max_output_len=64 --tokenizer_dir /home/Qwen-7b/ --engine_dir=/home/trt_engines_qwen7b_bf16

The test output is as follows:


/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py:881: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
 torch.nested.nested_tensor(split_ids_list,
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "Hello! How can I help you today? Is there something you would like to talk about or ask me a question? I'm here to assist you with any information or advice you might need."


Inference

Building the image

Download the Triton backend code: https://github.com/triton-inference-server/tensorrtllm_backend

There are some pitfalls here, but I forgot to record them while building, so that part is skipped. The final image: triton-trt-llm:v3.0

Starting the service

Go into the directory and do the following.

Copy tensorrtllm_backend/all_models/inflight_batcher_llm to /home/tensorrtllm_backend/model_repository, then run:

python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tensorrtllm_backend/model_repository/tensorrt_llm/1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

In the same config, also set:

  • batch_scheduler_policy set to guaranteed_no_evict
  • enable_trt_overlap set to False
  • max_num_sequences set to the same value as the batch size
  • normalize_log_probs set to False
  • gpt_model_type set to v1

On top of that, in both the preprocess and postprocess model.py files (around line 81), add self.tokenizer.eos_token = "<|endoftext|>" (a short illustration follows).
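For context, here is a standalone sketch of what that added line does, using the Hugging Face tokenizer directly (the /home/Qwen-7b path is the one used earlier; trust_remote_code is required because Qwen ships a custom tiktoken-based tokenizer):

from transformers import AutoTokenizer

# Load Qwen's custom tokenizer; it does not define an eos_token by default.
tokenizer = AutoTokenizer.from_pretrained("/home/Qwen-7b", trust_remote_code=True)
# Same assignment as the edit in the pre/postprocessing model.py:
tokenizer.eos_token = "<|endoftext|>"
print(tokenizer.eos_token)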

Start the trition-trt-llm container:

docker run --rm -it --gpus '"device=1"' --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 18000:8000 -v /home/triton-trtllm/:/tensorrtllm_backend trition-trt-llm:v3.0 bash

Start the service:

pip install tiktoken
cd /tensorrtllm_backend/tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/model_repository
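Before sending requests, it can help to confirm the server is up. A small readiness check, assuming the requests package and the 18000→8000 port mapping from the docker run above:

import requests

# Triton exposes KServe-style health endpoints; HTTP 200 means the server is ready.
resp = requests.get("http://localhost:18000/v2/health/ready", timeout=5)
print("ready" if resp.status_code == 200 else f"not ready (HTTP {resp.status_code})")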

Calling the endpoint


curl --location 'http://localhost:18000/v2/models/ensemble/generate' \
--header 'Content-Type: application/json' \
--data '{
   "text_input": "What is machine learning?",
   "max_tokens": 64,
   "bad_words": "",
   "stop_words": ""
}'
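The same request can be sent from Python; a sketch assuming the requests package (the text_output field name matches the server responses shown later):

import requests

# Mirror of the curl call above, against the ensemble model's generate endpoint.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post("http://localhost:18000/v2/models/ensemble/generate",
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])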

 

Performance

In actual tests on an A800, throughput is only about half of vLLM's, and response time does not drop noticeably either. Perhaps the A800 really is quite different from the A100.

Deploying TensorRT-LLM with Triton 24.02 and serving HTTP queries

Choosing the right environment

 

  1. Pick a version. NVIDIA's official documentation shows that the latest container at the time of writing is 24.02.
  • The NVIDIA Driver row recommends driver version 545 or above; for data-center cards this can be relaxed somewhat. If you have a consumer (gaming) card without a 545 driver and don't want to upgrade, at least don't fall too far behind; 535, for example, also works.
  • The Triton Inference Server row shows that the bundled triton server version is 2.43 and the required TensorRT-LLM version is 0.8.0.
  2. Pull the image. Go to the NVIDIA container registry, find the tritonserver image, pick the variant related to TensorRT-LLM (trtllm for short), copy the image address, and pull it with docker pull.
docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  3. Pull the TensorRT-LLM project.
  • You can use the official repository, but make sure it is v0.8.0:
git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.8.0
  • Or use my project; the main branch is currently 0.8.0 and may later be tagged, so check the repository for a 0.8.0 tag before relying on it:
git clone https://github.com/Tlntin/Qwen-TensorRT-LLM
  • The walkthrough below uses my project and deploys Qwen-1.8B-Chat on triton_server (it is a conveniently small model).
  4. Pull tensorrtllm_backend. This orchestrates the tensorrt-llm service and must match the TensorRT-LLM version, so again choose 0.8.0:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.8.0
  5. Start the tritonserver container:
docker run -d \
    --name triton \
    --net host \
    --shm-size=2g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    -v ${PWD}/Qwen-TensorRT-LLM/examples:/root/examples \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 sleep 864000
  6. Check the setup.
  • Enter the container:
docker exec -it triton /bin/bash
  • Check the NVIDIA driver:
nvidia-smi
  • Check the tritonserver version; it should match the 2.43 mentioned above:
cat /opt/tritonserver/TRITON_VERSION
  • Check the tensorrtllm_backend version; the value must match the tools/version.txt file in the official repository's v0.8.0 release:
cat /tensorrtllm_backend/tools/version.txt
  7. Install TensorRT-LLM directly via pip (skip this step if you built the container yourself; a quick verification sketch follows this list):
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
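A quick check that the expected release is importable (assuming the package exposes __version__, as it normally does):

# verify the TensorRT-LLM installation inside the container
import tensorrt_llm

print(tensorrt_llm.__version__)  # expect: 0.8.0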

Building the engine

 

  1. Enter the container:
docker exec -it triton /bin/bash
  2. Repeat the earlier steps to install Qwen's dependencies.
  • Go into the qwen2 directory:
cd /root/examples/qwen2
  • Install the dependencies:
pip install -r requirements.txt
  3. Build. Compared with the build commands in the original README, paged_kv_cache must be enabled so that inflight batching can be deployed.
  • For example, where fp16 was previously built with python3 build.py, it now becomes:
python3 build.py --paged_kv_cache --remove_input_padding
  • For example, for int8 smooth quant:
    • the previous build commands were

      # convert the weights
      python3 hf_qwen_convert.py --smoothquant=0.5
      # build
      python3 build.py --use_smooth_quant --per_token --per_channel
    • and they now become

      # convert the weights
      python3 hf_qwen_convert.py --smoothquant=0.5
      # build
      python3 build.py --use_smooth_quant --per_token --per_channel --paged_kv_cache --remove_input_padding
  4. Run a quick test:
python3 run.py

Temporary Triton deployment

 

  1. (Optional) Reuse this project's ready-made configuration (batch_size=2, input=6144, output=2048) directly, which lets you skip steps 6 and 7 below; run this outside the container. Convenient, but tuning it yourself is still recommended.
cp -r Qwen-TensorRT-LLM/triton_model_repo tensorrtllm_backend/
  2. Enter the container:
docker exec -it triton /bin/bash
  3. Set up the model repository directory:
cd /tensorrtllm_backend
cp all_models/inflight_batcher_llm/ -r triton_model_repo
  4. Copy the engine files built in the previous section:
cd /root/examples/qwen2/trt_engines/fp16/1-gpu/
cp -r ./* /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/
  5. Copy the tokenizer files:
cd /root/examples/qwen2
mkdir /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat
cp qwen1.5_7b_chat/*.json /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat/
# Optional: only needed for qwen1, whose tokenizer is in tiktoken format
cp qwen_7b_chat/*.tiktoken /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat/
cp qwen_7b_chat/*.py /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat/
  6. (Optional) Fill in Triton's preprocessing and postprocessing configuration (see the reference documentation):
cd /tensorrtllm_backend
export HF_QWEN_MODEL="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat"
export ENGINE_DIR="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
# set your batch_size
export MAX_BATCH_SIZE=1
export TOKENIZE_TYPE=auto
# depends on the number of CPU threads; usually 2x batch_size, or half the CPU threads
export INSTANCE_COUNT=2
# I only have one card; list the GPUs you want to use, separated by commas
export GPU_DEVICE_IDS=0


python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${HF_QWEN_MODEL},tokenizer_type:${TOKENIZE_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}

python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${HF_QWEN_MODEL},tokenizer_type:${TOKENIZE_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:${INSTANCE_COUNT},accumulate_tokens:True

python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_DIR},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600,gpu_device_ids:${GPU_DEVICE_IDS}
  7. (Optional) Lightly modify the initialize function in the preprocess/postprocess model.py files; the sample code there targets llama, and we need to adapt it to Qwen's tokenizer configuration.
  • Before the change (preprocessing contains all three statements, postprocessing only the first one):
self.tokenizer.pad_token = self.tokenizer.eos_token
self.tokenizer_end_id = self.tokenizer.encode(
    self.tokenizer.eos_token, add_special_tokens=False)[0]
self.tokenizer_pad_id = self.tokenizer.encode(
    self.tokenizer.pad_token, add_special_tokens=False)[0]
  • After the change:
import json
import os


# read pad/end token ids from the model's generation_config.json
gen_config_path = os.path.join(tokenizer_dir, 'generation_config.json')
with open(gen_config_path, 'r') as f:
    gen_config = json.load(f)
if isinstance(gen_config["eos_token_id"], list):
    # chat models list several eos ids; use the first one for both pad and end
    pad_id = end_id = gen_config["eos_token_id"][0]
else:
    # if the model type is base, run this branch
    pad_id = gen_config["bos_token_id"]
    end_id = gen_config["eos_token_id"]
self.tokenizer_pad_id = pad_id
self.tokenizer_end_id = end_id
eos_token = self.tokenizer.decode(end_id)
self.tokenizer.eos_token = self.tokenizer.pad_token = eos_token
  8. Start the service (single-GPU launch):
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
  9. Open another terminal and test the HTTP endpoint.
  • Request:
curl -X POST localhost:8000/v2/models/ensemble/generate \
-d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645]}'
  • Output:
{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"你好,我是来自阿里云的大规模语言模型,我叫通义千问。"}%    

Calling the service

 

Python client requests

 

  1. Install the Python dependencies (optional):
pip install tritonclient transformers gevent geventhttpclient tiktoken grpcio
  2. Run the qwen/triton_client/inflight_batcher_llm_client.py file to start chatting:
cd /root/examples/triton_client
python3 inflight_batcher_llm_client.py --tokenizer_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat
  3. Test result:
====================
Human: 你好
Output: 你好!有什么我可以帮助你的吗?
Human: 你叫什么?
Output: 我是来自阿里云的大规模语言模型,我叫通义千问。
HTTP streaming calls

 

  1. Prerequisites:
  • the engine was built with paged_kv_cache enabled
  • when deploying triton, the value of gpt_model_type in tensorrt_llm/config.pbtxt is inflight_batching
  2. Run the command:
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
-d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645], "stream": true}'
  3. The output is shown below, followed by a Python sketch for consuming the stream:
data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"你好"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"我是"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"来自"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"阿里"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"云"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"的大"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"规模"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"语言"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"模型"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"我"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"叫"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"通"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"义"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"千"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"问"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"。"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

Shutting down the Triton service

 

pkill tritonserver

Permanent deployment

 

  1. When we deployed the container earlier, the launch command was python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo. By editing tensorrtllm_backend/scripts/launch_triton_server.py and adding a print("cmd", cmd) on the second-to-last line, we can print the command it actually runs, which turns out to be:
["mpirun", "--allow-run-as-root", "-n", "1", "/opt/tritonserver/bin/tritonserver", "--model-repository=/tensorrtllm_backend/triton_model_repo", "--grpc-port=8001", "--http-port=8000", "--metrics-port=8002", "--disable-auto-complete-config", "--backend-config=python,shm-region-prefix-name=prefix0_", ":"]
  2. Write a Dockerfile that runs this command directly, replacing the container's default command.
FROM nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
USER root
# Optional: bake the backend into the image instead of mounting it at runtime
# COPY tensorrtllm_backend /tensorrtllm_backend
WORKDIR /tensorrtllm_backend

CMD ["mpirun", "--allow-run-as-root", "-n", "1", "/opt/tritonserver/bin/tritonserver", "--model-repository=/tensorrtllm_backend/triton_model_repo", "--grpc-port=8001", "--http-port=8000", "--metrics-port=8002", "--disable-auto-complete-config", "--backend-config=python,shm-region-prefix-name=prefix0_", ":"]
  3. Build the new image and name it tritonserver:24.02:
docker build . -t tritonserver:24.02
  4. Test that it works:
docker run -it \
    --name triton_server \
    --net host \
    --shm-size=2g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    --gpus all \
    tritonserver:24.02
  • Send a test request; if everything is fine, exit and then remove the container:
docker rm -f triton_server
  5. Run the container permanently: start it in the background with automatic restart:
docker run -d \
    --name triton_server \
    --net host \
    --shm-size=2g \
    --restart always \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    --gpus all \
    tritonserver:24.02
  6. Check the container's logs to confirm it is running normally:
docker logs triton_server

Qwen-TensorRT-LLM

 

Article link