LMDeploy 量化部署 LLM & VLM 实战作业（含进阶）（lesson 5）|电子爱好者

admin管理员组
文章数量:1622629

LMDeploy 量化部署 LLM & VLM 实战作业（含进阶）（lesson 5）

- 一、LMDeploy环境部署
- - 1.1 打开InternStudio平台，创建开发机。
  - 1.2 创建conda环境
  - 1.3 安装LMDeploy
- 二、LMDeploy模型对话(chat)
- - 2.1 Huggingface与TurboMind
  - 2.2 下载模型
  - 2.3 使用Transformer库运行模型
  - 2.4 使用LMDeploy直接与模型进行对话
- 三.LMDeploy模型量化(lite)（进阶）
- - 3.1 设置最大KV Cache缓存大小
  - - 3.1.1首先保持不加该参数（默认0.8），运行1.8B模型。
    - 3.1.2 改变--cache-max-entry-count参数，设为0.5。
    - 3.1.3 把--cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。
  - 3.2 使用W4A16量化
  - - 3.2.1 使用Chat功能运行W4A16量化后的模型。
- 四、LMDeploy服务(serve)
- - 4.1 启动API服务器
  - 4.2 命令行客户端连接API服务器
  - 4.3 网页客户端连接API服务器
- 五、Python代码集成
- - 5.1 Python代码集成运行1.8B模型
  - 5.2 向TurboMind后端传递参数
- 六、拓展部分
- - 6.1 使用LMDeploy运行视觉多模态大模型llava
  - 6.2 使用LMDeploy运行第三方大模型
  - 6.3 定量比较LMDeploy与Transformer库的推理速度差异

一、LMDeploy环境部署

1.1 打开InternStudio平台，创建开发机。

填写开发机名称；选择镜像Cuda12.2-conda；选择10% A100*1GPU；点击“立即创建”。注意请不要选择Cuda11.7-conda的镜像，新版本的lmdeploy会出现兼容性问题。

1.2 创建conda环境

InternStudio开发机创建conda环境
由于环境依赖项存在torch，下载过程比较缓慢。
InternStudio上提供了快速创建conda环境的方法。
打开命令行终端，创建一个名为lmdeploy的环境：

studio-conda -t lmdeploy -o pytorch-2.1.2

本地环境创建conda环境
详细信息打开命令行终端，让我们来创建一个名为lmdeploy的conda环境，python版本为3.10。

conda create -n lmdeploy -y python=3.10

1.3 安装LMDeploy

激活刚刚创建的虚拟环境。

conda activate lmdeploy

安装0.3.0版本的lmdeploy。

pip install lmdeploy[all]==0.3.0

安装结束

二、LMDeploy模型对话(chat)

2.1 Huggingface与TurboMind

HuggingFace是一个高速发展的社区，包括Meta、Google、Microsoft、Amazon在内的超过5000家组织机构在为HuggingFace开源社区贡献代码、数据集和模型。可以认为是一个针对深度学习模型和数据集的在线托管社区，如果你有数据集或者模型想对外分享，网盘又不太方便，就不妨托管在HuggingFace。

托管在HuggingFace社区的模型通常采用HuggingFace格式存储，简写为HF格式。

但是HuggingFace社区的服务器在国外，国内访问不太方便。国内可以使用阿里巴巴的MindScope社区，或者上海AI Lab搭建的OpenXLab社区，上面托管的模型也通常采用HF格式。

TurboMind是LMDeploy团队开发的一款关于LLM推理的高效推理引擎，它的主要功能包括：LLaMa 结构模型的支持，continuous batch 推理模式和可扩展的 KV 缓存管理器。

TurboMind推理引擎仅支持推理TurboMind格式的模型。因此，TurboMind在推理HF格式的模型时，会首先自动将HF格式模型转换为TurboMind格式的模型。该过程在新版本的LMDeploy中是自动进行的，无需用户操作。

几个容易迷惑的点：

TurboMind与LMDeploy的关系：LMDeploy是涵盖了LLM 任务全套轻量化、部署和服务解决方案的集成功能包，TurboMind是LMDeploy的一个推理引擎，是一个子模块。LMDeploy也可以使用pytorch作为推理引擎。
TurboMind与TurboMind模型的关系：TurboMind是推理引擎的名字，TurboMind模型是一种模型存储格式，TurboMind引擎只能推理TurboMind格式的模型。

2.2 下载模型

本次实战营已经在开发机的共享目录中准备好了常用的预训练模型，可以运行如下命令查看：

ls /root/share/new_models/Shanghai_AI_Laboratory/

显示如下，每一个文件夹都对应一个预训练模型。

以InternLM2-Chat-1.8B模型为例，从官方仓库下载模型。

InternStudio开发机上下载模型（推荐）
如果你是在InternStudio开发机上，可以按照如下步骤快速下载模型。

首先进入一个你想要存放模型的目录，本教程统一放置在Home目录。执行如下指令：

cd ~

然后执行如下指令由开发机的共享目录软链接或拷贝模型：

ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
# cp -r /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/

执行完如上指令后，可以运行“ls”命令。可以看到，当前目录下已经多了一个internlm2-chat-1_8b文件夹，即下载好的预训练模型。

ls

由OpenXLab平台下载模型，在上一步已经从InternStudio开发机上下载了模型，这一步就没必要执行了。

2.3 使用Transformer库运行模型

Transformer库是Huggingface社区推出的用于运行HF模型的官方库。

在2.2中，我们已经下载好了InternLM2-Chat-1.8B的HF模型。下面我们先用Transformer来直接运行InternLM2-Chat-1.8B模型，后面对比一下LMDeploy的使用感受。

现在，让我们点击左上角的图标，打开VSCode。

在左边栏空白区域单击鼠标右键，点击Open in Intergrated Terminal。等待片刻，打开终端。

在终端中输入如下指令，新建pipeline_transformer.py。

touch /root/pipeline_transformer.py

回车执行指令，可以看到侧边栏多出了pipeline_transformer.py文件，点击打开。后文中如果要创建其他新文件，也是采取类似的操作。

将以下内容复制粘贴进入pipeline_transformer.py。

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)
inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)

按Ctrl+S键保存（Mac用户按Command+S）。
回到终端，激活conda环境。

conda activate lmdeploy

运行python代码：

python /root/pipeline_transformer.py

得到输出：

2.4 使用LMDeploy直接与模型进行对话

首先激活创建好的conda环境：

conda activate lmdeploy

使用LMDeploy与模型进行对话的通用命令格式为：

lmdeploy chat [HF格式模型路径/TurboMind格式模型路径]

执行如下命令运行下载的1.8B模型：

lmdeploy chat /root/internlm2-chat-1_8b

与InternLM2-Chat-1.8B大模型对话。输入“请给我讲一个小故事吧”，然后按两下回车键。

速度比原生Transformer快不少，输入“exit”并按两下回车，可以退出对话。

拓展内容：有关LMDeploy的chat功能的更多参数可通过-h命令查看。

lmdeploy chat -h

三.LMDeploy模型量化(lite)（进阶）

模型量化主要包括 KV8量化和W4A16量化，量化是一种以参数或计算中间结果精度下降换空间节省（以及同时带来的性能提升）的策略。

两个概念：
计算密集（compute-bound）: 指推理过程中，绝大部分时间消耗在数值计算上；针对计算密集型场景，可以通过使用更快的硬件计算单元来提升计算速度。
访存密集（memory-bound）: 指推理过程中，绝大部分时间消耗在数据读取上；针对访存密集型场景，一般通过减少访存次数、提高计算访存比或降低访存量来优化。常见的 LLM
模型由于 Decoder Only 架构的特性，实际推理时大多数的时间都消耗在了逐 Token 生成阶段（Decoding
阶段），是典型的访存密集型场景。

可以使用KV8量化和W4A16量化，来优化 LLM 模型推理中的访存密集问题。

KV8量化是指将逐 Token（Decoding）生成过程中的上下文 K 和 V 中间结果进行 INT8
量化（计算时再反量化），以降低生成过程中的显存占用。 W4A16 量化，将 FP16 的模型权重量化为 INT4，Kernel
计算时，访存量直接降为 FP16 模型的 1/4，大幅降低了访存成本。Weight Only 是指仅量化权重，数值计算依然采用
FP16（需要将 INT4 权重反量化）。

3.1 设置最大KV Cache缓存大小

KV Cache是一种缓存技术，通过存储键值对的形式来复用计算结果，以达到提高性能和降低内存消耗的目的。在大规模训练和推理中，KV
Cache可以显著减少重复计算量，从而提升模型的推理速度。理想情况下，KV
Cache全部存储于显存，以加快访存速度。当显存空间不足时，也可以将KV
Cache放在内存，通过缓存管理器控制将当前需要使用的数据放入显存。

模型在运行时，占用的显存可大致分为三部分：

模型参数本身占用的显存
KV Cache占用的显存
中间运算结果占用的显存。
LMDeploy的KV Cache管理器可以通过设置–cache-max-entry-count参数，控制KV缓存占用剩余显存的最大比例。默认的比例为0.8。

下面通过几个例子，来看一下调整–cache-max-entry-count参数的效果。

3.1.1首先保持不加该参数（默认0.8），运行1.8B模型。

lmdeploy chat /root/internlm2-chat-1_8b

与模型对话，查看右上角资源监视器中的显存占用情况。

此时显存占用为20944MB。

3.1.2 改变–cache-max-entry-count参数，设为0.5。

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.5

与模型对话，再次查看右上角资源监视器中的显存占用情况。

看到显存占用明显降低，变为14832MB。

3.1.3 把–cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.01

与模型对话，可以看到，此时显存占用仅为4720MB，代价是降低模型推理速度。

3.2 使用W4A16量化

LMDeploy使用AWQ算法，实现模型4bit权重量化。推理引擎TurboMind提供了非常高效的4bit推理cuda kernel，性能是FP16的2.4倍以上。

它支持以下NVIDIA显卡：图灵架构（sm75）：20系列、T4
安培架构（sm80,sm86）：30系列、A10、A16、A30、A100 Ada Lovelace架构（sm90）：40 系列
运行前，首先安装一个依赖库。

pip install einops==0.7.0

仅需执行一条命令，就可以完成模型量化工作。

lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit

运行时间较长，量化工作结束后，新的HF模型被保存到internlm2-chat-1_8b-4bit目录。

3.2.1 使用Chat功能运行W4A16量化后的模型。

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq

显存占用20520mb,为了更加明显体会到W4A16的作用，我将KV Cache比例再次调为0.01，查看显存占用情况。

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01

竟然报错了，再来一次。
可以看到，显存占用变为2632MB，明显降低。

拓展内容：有关LMDeploy的lite功能的更多参数可通过-h命令查看。

lmdeploy lite -h

四、LMDeploy服务(serve)

之前都是在本地直接推理大模型，这种方式成为本地部署。在生产环境下，需要将大模型封装为API接口服务，供客户端访问。

先看下面一张架构图：

我们把从架构上把整个服务流程分成下面几个模块。

模型推理/服务：主要提供模型本身的推理，一般来说可以和具体业务解耦，专注模型推理本身性能的优化。可以以模块、API等多种方式提供。
API Server：中间协议层，把后端推理/服务通过HTTP，gRPC或其他形式的接口，供前端调用。
Client：可以理解为前端，与用户交互的地方。通过通过网页端/命令行去调用API接口，获取模型推理/服务。

值得说明的是，以上的划分是一个相对完整的模型，但在实际中这并不是绝对的。比如可以把“模型推理”和“API> Server”合并，有的甚至是三个流程打包在一起提供服务。

4.1 启动API服务器

通过以下命令启动API服务器，推理internlm2-chat-1_8b模型：

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

其中，model-format、quant-policy这些参数是与之前的量化推理模型一致的；server-name和server-port表示API服务器的服务IP与服务端口；tp参数表示并行数量（GPU数量）。通过运行以上指令，成功启动了API服务器，不能关闭该窗口，后面要新建客户端连接该服务。
运行有一行警告信息，实际业务窗口没能打开。
重新来一次。
这一步由于Server在远程服务器上，所以本地需要做一下ssh转发才能直接访问。在本地打开一个cmd窗口，输入命令如下：

C:\Users\Administrator\Desktop\xll\xy\data>ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 36317
The authenticity of host '[ssh.intern-ai]:36317 ([8.130.47.207]:36317)' can't be established.
ED25519 key fingerprint is SHA256:FHKSn+aBDe/ZqW/92VSMgbyffG0Pp9ApyCiwCidliSI.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[ssh.intern-ai]:36317' (ED25519) to the list of known hosts.
root@ssh.intern-ai.org.cn's password:

然后打开浏览器，访问http://127.0.0.1:23333。查看接口的具体使用说明，如下图所示。

可以通过运行一下指令，查看更多参数及使用方法：

lmdeploy serve api_server -h

4.2 命令行客户端连接API服务器

在“4.1”中，我们在终端里新开了一个API服务器。本节中，要新建一个命令行客户端去连接API服务器。
首先通过VS Code新建一个终端：

激活conda环境。

conda activate lmdeploy

运行命令行客户端：

lmdeploy serve api_client http://localhost:23333

运行后，可以通过命令行窗口直接与模型对话：

现在使用的架构是这样的：

4.3 网页客户端连接API服务器

关闭刚刚的VSCode终端，但服务器端的终端不关闭。新建一个VSCode终端，激活conda环境。

conda activate lmdeploy

使用Gradio作为前端，启动网页客户端。

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

运行命令后，网页客户端启动。在电脑本地新建一个cmd终端，新开一个转发端口：

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p
36317


C:\Users\Administrator\Desktop\xll\xy\data>ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p
36317
root@ssh.intern-ai.org.cn's password:

打开浏览器，访问地址http://127.0.0.1:6006

然后就可以与模型进行对话了！

现在使用的架构是这样的：

五、Python代码集成

在开发项目时，有时我们需要将大模型推理集成到Python代码里面。

5.1 Python代码集成运行1.8B模型

首先激活conda环境。

conda activate lmdeploy

新建Python源代码文件pipeline.py。

touch /root/pipeline.py

打开pipeline.py，填入以下内容。

from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

代码解读：

第1行，引入lmdeploy的pipeline模块 \ 第3行，从目录“./internlm2-chat-1_8b”加载HF模型
第4行，运行pipeline，这里采用了批处理的方式，用一个列表包含两个输入，lmdeploy同时推理两个输入，产生两个输出结果，结果返回给response
\ 第5行，输出response

保存后运行代码文件：

python /root/pipeline.py

(lmdeploy) root@intern-studio-50092202:~# touch /root/pipeline.py
(lmdeploy) root@intern-studio-50092202:~# python /root/pipeline.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
[Response(text="Hello! I'm InternLM, a conversational language model developed by Shanghai AI Laboratory. I'm here to help you with any questions or tasks you may have. My goal is to provide accurate and helpful responses while being honest and harmless. I'm here to assist you in a friendly and respectful manner. Let's get started!", generate_token_len=67, input_token_len=108, session_id=0, finish_reason='stop'), Response(text='上海是中国最大的城市之一，也是中国经济最发达的城市之一。它位于长江三角洲，与江苏、浙江、安徽、福建等省份接壤。上海拥有丰富的历史、文化和现代经济，是全球著名的国际大都市之一。', generate_token_len=50, input_token_len=104, session_id=1, finish_reason='stop')]
(lmdeploy) root@intern-studio-50092202:~#

5.2 向TurboMind后端传递参数

在第3章，我们通过向lmdeploy传递附加参数，实现模型的量化推理，及设置KV Cache最大占用比例。在Python代码中，可以通过创建TurbomindEngineConfig，向lmdeploy传递参数。

以设置KV Cache占用比例为例，新建python文件pipeline_kv.py。

touch /root/pipeline_kv.py

打开pipeline_kv.py，填入如下内容：

from lmdeploy import pipeline, TurbomindEngineConfig

# 调低 k/v cache内存占比调整为总显存的 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('/root/internlm2-chat-1_8b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)

保存后运行python代码：

python /root/pipeline_kv.py

得到输出结果：

(lmdeploy) root@intern-studio-50092202:~# touch /root/pipeline_kv.py
(lmdeploy) root@intern-studio-50092202:~# python /root/pipeline_kv.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
[Response(text="Hello! I'm InternLM, a conversational language model developed by Shanghai AI Laboratory. I'm here to help and provide assistance in English and Chinese. My goal is to be helpful, honest, and harmless in all my interactions. Let's get started!", generate_token_len=53, input_token_len=108, session_id=0, finish_reason='stop'), Response(text='上海是中国最大的城市之一，位于中国的东部沿海地区。它有着丰富的历史和文化遗产，也是全球著名的经济、科技、文化中心之一。上海是中国最现代化的城市之一，拥有众多现代化的建筑、购物中心、娱乐设施等。此外，上海还是中国的一个重要港口城市，是中国对外贸易的重要门户之一。', generate_token_len=65, input_token_len=104, session_id=1, finish_reason='stop')]

六、拓展部分

6.1 使用LMDeploy运行视觉多模态大模型llava

最新版本的LMDeploy支持了llava多模态模型，下面演示使用pipeline推理llava-v1.6-7b。

注意，运行本pipeline最低需要30%的InternStudio开发机。

首先激活conda环境。

conda activate lmdeploy

安装llava依赖库。

pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874

新建一个python文件，比如pipeline_llava.py。

touch /root/pipeline_llava.py

打开pipeline_llava.py，填入内容如下：

from lmdeploy.vl import load_image
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

image = load_image('https://raw.githubusercontent/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

代码解读： \

第1行引入用于载入图片的load_image函数，第2行引入了lmdeploy的pipeline模块， \ 第5行创建了pipeline实例
\ 第7行从github下载了一张关于老虎的图片，如下：
\ 第8行运行pipeline，输入提示词“describe this
image”，和图片，结果返回至response \ 第9行输出response

保存后运行pipeline。

python /root/pipeline_llava.py

(lmdeploy) root@intern-studio-50092202:~# touch /root/pipeline_llava.py
(lmdeploy) root@intern-studio-50092202:~# python /root/pipeline_llava.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████| 316/316 [00:00<00:00, 3.06MB/s]
config.json: 4.76kB [00:00, 34.5MB/s]                                                                                            
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████| 1.71G/1.71G [00:52<00:00, 32.8MB/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  9.83it/s]
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.

得到输出结果：

Response(text="This is a color photograph of a tiger lying on the grass. The tiger is facing towards the camera with its head slightly turned to the side, showing a clear view of its head and part of its body. The tiger has distinctive stripes with a pattern that is characteristic of the species. Its fur is a mix of dark and light orange with black stripes. The tiger's eyes are open and appear alert, and its mouth is closed. The background is a blurred green, indicating a natural outdoor setting with grass. There are no visible texts or other objects in the image. The lighting suggests it could be either early morning or late afternoon, as the shadows are soft and the light is not harsh.", generate_token_len=153, input_token_len=1023, session_id=0, finish_reason='stop')

百度翻译：这是一张老虎躺在草地上的彩色照片。老虎面朝镜头，头微微转向侧面，头部和部分身体清晰可见。老虎身上有独特的条纹，图案是该物种的特征。它的皮毛是深橙色和浅橙色的混合体，带有黑色条纹。老虎的眼睛睁开，看起来很警觉，嘴巴闭着。背景是模糊的绿色，表示有草地的自然户外环境。图像中没有可见的文本或其他对象。灯光显示可能是清晨或傍晚，因为阴影柔和，光线也不刺眼。照片的风格是自然的野生动物拍摄，旨在捕捉环境中的动物。

由于官方的Llava模型对中文支持性不好，因此如果使用中文提示词，可能会得到出乎意料的结果，比如将提示词改为“请描述一下这张图片”，得到类似《印度鳄鱼》的回复。

通过Gradio来运行llava模型。
新建python文件gradio_llava.py。

touch /root/gradio_llava.py

打开文件，填入以下内容：

import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192) # 图片分辨率较高时请调高session_len
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config) 非开发机运行此命令
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()

运行python程序。

python /root/gradio_llava.py

(lmdeploy) root@intern-studio-50092202:~# touch /root/pipeline_llava.py
(lmdeploy) root@intern-studio-50092202:~# python /root/pipeline_llava.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████| 316/316 [00:00<00:00, 3.06MB/s]
config.json: 4.76kB [00:00, 34.5MB/s]                                                                                            
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████| 1.71G/1.71G [00:52<00:00, 32.8MB/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  9.83it/s]
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Response(text="This is a color photograph of a tiger lying on the grass. The tiger is facing towards the camera with its head slightly turned to the side, showing a clear view of its head and part of its body. The tiger has distinctive stripes with a pattern that is characteristic of the species. Its fur is a mix of dark and light orange with black stripes. The tiger's eyes are open and appear alert, and its mouth is closed. The background is a blurred green, indicating a natural outdoor setting with grass. There are no visible texts or other objects in the image. The lighting suggests it could be either early morning or late afternoon, as the shadows are soft and the light is not harsh.", generate_token_len=153, input_token_len=1023, session_id=0, finish_reason='stop')
(lmdeploy) root@intern-studio-50092202:~# touch /root/gradio_llava.py
(lmdeploy) root@intern-studio-50092202:~# python /root/gradio_llava.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.77it/s]
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 4.16.0, however version 4.29.0 is available, please upgrade.
--------

通过ssh转发一下7860端口。


C:\Users\Administrator\Desktop\xll\xy\data>ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p
36317
root@ssh.intern-ai.org.cn's password:

通过浏览器访问http://127.0.0.1:7860。

测试一下：

6.2 使用LMDeploy运行第三方大模型

LMDeploy不仅支持运行InternLM系列大模型，还支持其他第三方大模型。支持的模型列表如下：


Model	Size
Llama	7B - 65B
Llama2	7B - 70B
InternLM	7B - 20B
InternLM2	7B - 20B
InternLM-XComposer	7B
QWen	7B - 72B
QWen-VL	7B
QWen1.5	0.5B - 72B
QWen1.5-MoE	A2.7B
Baichuan	7B - 13B
Baichuan2	7B - 13B
Code Llama	7B - 34B
ChatGLM2	6B
Falcon	7B - 180B
YI	6B - 34B
Mistral	7B
DeepSeek-MoE	16B
DeepSeek-VL	7B
Mixtral	8x7B
Gemma	2B-7B
Dbrx	132B

可以从Modelscope，OpenXLab下载相应的HF模型，下载好HF模型，步骤和使用LMDeploy运行InternLM2一样。

6.3 定量比较LMDeploy与Transformer库的推理速度差异

为了直观感受LMDeploy与Transformer库推理速度的差异，让我们来编写一个速度测试脚本。测试环境是30%的InternStudio开发机。

先来测试一波Transformer库推理Internlm2-chat-1.8b的速度，新建python文件，命名为benchmark_transformer.py，填入以下内容：

import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

运行python脚本：

python benchmark_transformer.py

(base) root@intern-studio-50092202:~# touch /root/benchmark_transformer.py
(base) root@intern-studio-50092202:~# conda activate lmdeploy
(lmdeploy) root@intern-studio-50092202:~# python benchmark_transformer.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.74s/it]
Traceback (most recent call last):
  File "/root/benchmark_transformer.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2567, in cuda
    return super().cuda(*args, **kwargs)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 918, in <lambda>
    return self._apply(lambda t: t.cuda(device))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 4.00 MiB is free. Process 2781910 has 23.21 GiB memory in use. Process 3249933 has 792.00 MiB memory in use. Of the allocated memory 378.00 MiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(lmdeploy) root@intern-studio-50092202:~#

显存爆掉了，重启开发机，不再激活虚拟环境，直接运行，得到运行结果：

(base) root@intern-studio-50092202:~# python benchmark_transformer.py
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.33s/it]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.702 words/s

可以看到，Transformer库的推理速度约为53.702 words/s，注意单位是words/s，不是token/s，word和token在数量上可以近似认为成线性关系。

下面来测试一下LMDeploy的推理速度，新建python文件benchmark_lmdeploy.py，填入以下内容：

import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

运行脚本：

python benchmark_lmdeploy.py
得到运行结果：

(base) root@intern-studio-50092202:~# touch /root/benchmark_lmdeploy.py
(base) root@intern-studio-50092202:~# python benchmark_lmdeploy.py
[WARNING] gemm_config.in is not found; using default GEMM algo                                                                   
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 470.884 words/s

可以看到，LMDeploy的推理速度约为470.884 words/s，是Transformer库的8倍多。

至此，本课全部作业，包括进阶，全部完成

本文标签：进阶作业实战 LLM LMDeploy

版权声明：本文标题：LMDeploy 量化部署 LLM & VLM 实战作业（含进阶）（lesson 5）内容由热心网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：https://m.elefans.com/dongtai/1728871382a1177314.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

电子爱好者 - 最新技术资讯及电子产品介绍！

LMDeploy 量化部署 LLM &amp; VLM 实战作业（含进阶）（lesson 5）

LMDeploy 量化部署 LLM & VLM 实战作业（含进阶）（lesson 5）

一、LMDeploy环境部署

1.1 打开InternStudio平台，创建开发机。

1.2 创建conda环境

1.3 安装LMDeploy

二、LMDeploy模型对话(chat)

2.1 Huggingface与TurboMind

2.2 下载模型

2.3 使用Transformer库运行模型

2.4 使用LMDeploy直接与模型进行对话

三.LMDeploy模型量化(lite)（进阶）

3.1 设置最大KV Cache缓存大小

3.1.1首先保持不加该参数（默认0.8），运行1.8B模型。

3.1.2 改变–cache-max-entry-count参数，设为0.5。

3.1.3 把–cache-max-entry-count参数设置为0.01，约等于禁止KV Cache占用显存。

3.2 使用W4A16量化

3.2.1 使用Chat功能运行W4A16量化后的模型。

四、LMDeploy服务(serve)

4.1 启动API服务器

4.2 命令行客户端连接API服务器

4.3 网页客户端连接API服务器

五、Python代码集成

5.1 Python代码集成运行1.8B模型

5.2 向TurboMind后端传递参数

六、拓展部分

6.1 使用LMDeploy运行视觉多模态大模型llava

6.2 使用LMDeploy运行第三方大模型

6.3 定量比较LMDeploy与Transformer库的推理速度差异

更多相关文章

【LLM评估】GLUE基准数据集介绍

【免费领源码】javaMysql数据库+基于安卓平台的汉语言学习应用系统的设计与实现80400，计算机毕业设计项目推荐上万套实战教程JAVA、PHP，node.js，C++、python、大屏可视化

支付宝钱包手势密码破解实战（root过的手机可直接绕过手势密码）

前端开发Vue项目实战：电商后台管理系统（二）-- 登录退出功能 --主页界面

Jetson Nano 从入门到实战（案例：Opencv配置、人脸检测、二维码检测）

2022 最新 Android 基础教程，从开发入门到项目实战【b站动脑学院】学习笔记——第四章：活动Activity

web前端大作业 (仿英雄联盟网站制作HTML+CSS+JavaScript) 学生dreamweaver网页设计作业

【Python爬虫系列教程 18-100】Python网络爬虫实战：小姐姐手把手教你爬取并下载英雄联盟所有英雄皮肤高清大图

python爬虫--实战英雄联盟LOL壁纸下载

无线路由攻击和WiFi密码破解实战[渗透技术]

零基础STM32单片机编程入门(二十二) ESP8266 WIFI模块实战含源码

LINUX系统无线网频繁断开_实战：无线网络接入密码(WPA2)的破解过程

电子证据 利用Kali进行wifi钓鱼实战详细教程

个人作业4——alpha阶段个人总结

vbvb.net开发实战之经验分享（1）

数据结构项目实战——通讯录

刷机维修进阶教程-----高通9008线刷固件 救砖不开机 非硬件故障修复解决

Python3《机器学习实战》学习笔记（三）：决策树实战篇之为自己配个隐形眼镜

软件工程作业-输入法评价

Llama3-Tutorial之LMDeploy高效部署Llama3实践

发表评论

推荐文章

100天精通Python丨黑科技篇 —— 24、英雄属性面板分析 ①掌握爬虫技术；②Python数据可视化

如何看自己在英雄联盟中的定位？也就是自己的本命英雄？

破解虚拟机ESXi服务器密码,esxi虚机Windows server 2012忘记密码解决办法

TDLS 协议学习

VMware16Pro虚拟机安装教程(超详细)

热门文章

nc: getnameinfo: Temporary failure in name resolution

APK安装失败：[INSTALL_FAILED_VERIFICATION_FAILURE]

Failure recovering jobs: Lock wait timeout exceeded； try restarting transaction

VCS仿真遇到【CNST-CIF】constraints inconsistency failure如何解决

浏览器控制台的快捷键

创新 innovation

直接在Google Chrome浏览器中查看文档和PDF

Win10系统误删IE浏览器修复方法

teb_localplanner源码学习

计算机型号或配置,查看电脑各项配置参数的方法

最新文章

联想电脑，笔记本电脑，亮度无法显示

联想笔记本更新驱动

联想微型计算机怎么光盘启动,联想笔记本电脑win10怎么设置光盘启动

联想笔记本计算机在哪里找不到,联想笔记本电脑找不到WLAN怎么解决

如何查看笔记本电脑的型号？

如何关闭联想笔记本电脑上意外启动的小键盘

解决联想拯救者系列笔记本电脑无线网高频断联问题~

如何让iPhone投屏到联想小新笔记本电脑（WindowsLinux系统）？

计算机的正确配置文件,显示器颜色配置文件在win10电脑中设置正确配置的方法...

不带网口的笔记本电脑使用海康GigE工业相机

LMDeploy 量化部署 LLM & VLM 实战作业（含进阶）（lesson 5）

电子证据利用Kali进行wifi钓鱼实战详细教程

刷机维修进阶教程-----高通9008线刷固件救砖不开机非硬件故障修复解决

联想笔记本键盘排线,联想笔记本原装键盘价格表联想笔记本键盘如何更换

【鬼泣5（Devil May Cry V）v1.0十四项修改】鬼泣5（Devil May Cry V）v1.0十四项修改官方免费下载