Large Model Deployment, API Call Monitoring, and Performance Testing in Practice - Huawei JDC

Deployment runs on the NPU cards of an Ascend server, using a MindIE container image plus the model weights.

Service ports: "port" : 1025, "managementPort" : 1026, "metricsPort" : 1027
Model files are stored under: /modelscope/QWQ-32B

[The mindie20t17 container image needs 4.51.0_transformers.tar installed] (the image has since been updated to T18, where this install is no longer needed)
Inside the container:

    tar xf 4.51.0_transformers.tar
    pip install --no-index --find-links=./transformers_offline_packages transformers==4.51.0

To use a self-built image that runs as an ordinary user, and to limit container privilege risks, pass the devices explicitly with a command like the following:

    docker run -it -d --net=host --shm-size=500g \
        --name QWQ-32B-test \
        --device=/dev/davinci_manager \
        --device=/dev/hisi_hdc \
        --device=/dev/devmm_svm \
        --device=/dev/davinci0 \
        --device=/dev/davinci1 \
        --device=/dev/davinci2 \
        --device=/dev/davinci3 \
        --device=/dev/davinci4 \
        --device=/dev/davinci5 \
        --device=/dev/davinci6 \
        --device=/dev/davinci7 \
        --privileged=true \
        -v /usr/local/Ascend/driver:/usr/local/Ascend/driver:ro \
        -v /usr/local/sbin:/usr/local/sbin:ro \
        -v /modelscope/QWQ-32B:/modelscope/QWQ-32B \
        mindie20t18:latest bash

Note: --privileged=true gives the container access to every host device, which works against the goal of limiting privileges. If the explicit --device list is enough in your environment, drop it.

Enter the container

    docker exec -it ${CONTAINER_NAME} bash
    # Set the CANN environment variables
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    # Disable virtual memory (expandable segments)
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False

Service-based inference

Open the configuration file:

    vim /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json

Edited configuration file:

    [root@ip241 /]# cat /usr/local/Ascend/mindie/latest/mindie-service/conf/config.json
    {
        "Version" : "1.0.0",
        "LogConfig" : {
            "logLevel" : "Info",
            "logFileSize" : 20,
            "logFileNum" : 20,
            "logPath" : "logs/mindie-server.log"
        },
        "ServerConfig" : {
            "ipAddress" : "10.32.xx.111",
            "managementIpAddress" : "10.32.xx.111",
            "port" : 1025,
            "managementPort" : 1026,
            "metricsPort" : 1027,
            "allowAllZeroIpListening" : false,
            "maxLinkNum" : 1000,
            "httpsEnabled" : false,
            "fullTextEnabled" : false,
            "tlsCaPath" : "security/ca/",
            "tlsCaFile" : ["ca.pem"],
            "tlsCert" : "security/certs/server.pem",
            "tlsPk" : "security/keys/server.key.pem",
            "tlsPkPwd" : "security/pass/key_pwd.txt",
            "tlsCrlPath" : "security/certs/",
            "tlsCrlFiles" : ["server_crl.pem"],
            "managementTlsCaFile" : ["management_ca.pem"],
            "managementTlsCert" : "security/certs/management/server.pem",
            "managementTlsPk" : "security/keys/management/server.key.pem",
            "managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
            "managementTlsCrlPath" : "security/management/certs/",
            "managementTlsCrlFiles" : ["server_crl.pem"],
            "kmcKsfMaster" : "tools/pmt/master/ksfa",
            "kmcKsfStandby" : "tools/pmt/standby/ksfb",
            "inferMode" : "standard",
            "interCommTLSEnabled" : true,
            "interCommPort" : 1121,
            "interCommTlsCaPath" : "security/grpc/ca/",
            "interCommTlsCaFiles" : ["ca.pem"],
            "interCommTlsCert" : "security/grpc/certs/server.pem",
            "interCommPk" : "security/grpc/keys/server.key.pem",
            "interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
            "interCommTlsCrlPath" : "security/grpc/certs/",
            "interCommTlsCrlFiles" : ["server_crl.pem"],
            "openAiSupport" : "vllm"
        },
        "BackendConfig" : {
            "backendName" : "mindieservice_llm_engine",
            "modelInstanceNumber" : 1,
            "npuDeviceIds" : [[0,1,2,3]],
            "tokenizerProcessNumber" : 8,
            "multiNodesInferEnabled" : false,
            "multiNodesInferPort" : 1120,
            "interNodeTLSEnabled" : true,
            "interNodeTlsCaPath" : "security/grpc/ca/",
            "interNodeTlsCaFiles" : ["ca.pem"],
            "interNodeTlsCert" : "security/grpc/certs/server.pem",
            "interNodeTlsPk" : "security/grpc/keys/server.key.pem",
            "interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
            "interNodeTlsCrlPath" : "security/grpc/certs/",
            "interNodeTlsCrlFiles" : ["server_crl.pem"],
            "interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
            "interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
            "ModelDeployConfig" : {
                "maxSeqLen" : 102400,
                "maxInputTokenLen" : 71680,
                "truncation" : false,
                "ModelConfig" : [
                    {
                        "modelInstanceType" : "Standard",
                        "modelName" : "QwQ-32B",
                        "modelWeightPath" : "/modelscope/QWQ-32B",
                        "worldSize" : 4,
                        "cpuMemSize" : 5,
                        "npuMemSize" : -1,
                        "backendType" : "atb",
                        "trustRemoteCode" : false
                    }
                ]
            },
            "ScheduleConfig" : {
                "templateType" : "Standard",
                "templateName" : "Standard_LLM",
                "cacheBlockSize" : 128,
                "maxPrefillBatchSize" : 50,
                "maxPrefillTokens" : 86016,
                "prefillTimeMsPerReq" : 150,
                "prefillPolicyType" : 0,
                "decodeTimeMsPerReq" : 50,
                "decodePolicyType" : 0,
                "maxBatchSize" : 200,
                "maxIterTimes" : 30720,
                "maxPreemptCount" : 0,
                "supportSelectBatch" : false,
                "maxQueueDelayMicroseconds" : 5000
            }
        }
    }

Notes on the fields that were changed (keep these out of the file itself, since JSON does not allow comments):
· httpsEnabled: false turns HTTPS off.
· npuDeviceIds: adjust to the NPU cards actually available.
· worldSize: follows the number of NPU cards listed in npuDeviceIds (4 here).
· maxSeqLen: maximum total sequence length (input + output).
· maxInputTokenLen: maximum number of input tokens.
· maxPrefillTokens: maximum number of input tokens allowed in prefill. It caps the total input tokens of one prefill batch, so keep it at or above maxInputTokenLen.
· maxIterTimes: sized from the model's maximum possible output length. The value here, 30720, equals maxSeqLen - maxInputTokenLen (102400 - 71680).

Import the environment variables inside the container:

    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    source /usr/local/Ascend/mindie/set_env.sh
    source /usr/local/Ascend/nnal/atb/set_env.sh
    source /usr/local/Ascend/atb-models/set_env.sh

Set permissions on the model path (run as root here; 640 applied recursively also strips the execute bit from directories, so a non-root user would need it restored on the directories, for example with 750):

    chmod -R 640 /modelscope/QWQ-32B/

Start the service:

    cd /usr/local/Ascend/mindie/latest/mindie-service/bin
    nohup ./mindieservice_daemon &

Stop the service:

    pkill -9 'mindie|python'

Test from a new window (vLLM interface):

    curl http://10.32.xx.111:1025/generate -d '{"prompt": "你是谁？","max_tokens": 100,"stream": false,"do_sample":true,"repetition_penalty": 1.00,"temperature": 0.01,"top_p": 0.001,"top_k": 1,"model": "llama"}'

Common problems

ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'.
Downgrading transformers resolves it:

    pip install transformers==4.46.3 --force-reinstall
    pip install numpy==1.26.4 --force-reinstall

============================================================================
Monitoring API calls

In the window that starts the service, export this environment variable before launching mindieservice_daemon:

    export MIES_SERVICE_MONITOR_MODE=1

Then query the metrics port:

    curl -H "Accept: application/json" -H "Content-type: application/json" -X GET http://10.32.xx.111:1027/metrics

============ Performance testing ====== run inside the container ============

    export MINDIE_LOG_TO_STDOUT="benchmark:1; client:1"    # print the results to the screen when the test finishes

Benchmark configuration files:

    [root@ip241 config]# pwd
    /usr/local/lib/python3.11/site-packages/mindiebenchmark/config
    [root@ip241 config]# cat config.json
    {
        "LogConfig": {
            "LOG_PATH": "~/mindie/log",
            "LOG_TO_FILE": "1",
            "LOG_TO_STDOUT": "benchmark:1; client:1",
            "LOG_LEVEL": "INFO",
            "LOG_VERBOSE": "true",
            "LOG_ROTATE": "-fs 20 -r 10"
        },
        "CertConfig": {
            "CA_CERT": "/path/to/cacert.pem",
            "KEY_FILE": "/path/to/client.pem.key",
            "CERT_FILE": "/path/to/client.pem",
            "CRL_FILE": "/path/to/crl.pem"
        },
        "OutputConfig": {
            "INSTANCE_PATH": "./instance"
        },
        "ServerConfig": {
            "ENABLE_MANAGEMENT": false,
            "MAX_LINK_NUM": 1000
        }
    }
    [root@ip241 config]# cat synthetic_config.json
    {
        "Input": {
            "Method": "uniform",
            "Params": {"MinValue": 1, "MaxValue": 200}
        },
        "Output": {
            "Method": "gaussian",
            "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100}
        },
        "RequestCount": 100
    }
    [root@ip241 config]# cat /usr/local/lib/python3.11/site-packages/mindieclient/python/config/config.json
    {
        "LogConfig": {
            "LOG_PATH": "~/mindie/log",
            "LOG_TO_FILE": "1",
            "LOG_TO_STDOUT": "benchmark:1; client:1",
            "LOG_LEVEL": "INFO",
            "LOG_VERBOSE": "true",
            "LOG_ROTATE": "-fs 20 -r 10"
        }
    }

Set 640 permissions on the three files:

    chmod 640 /usr/local/lib/python3.11/site-packages/mindiebenchmark/config/synthetic_config.json
    chmod 640 /usr/local/lib/python3.11/site-packages/mindiebenchmark/config/config.json
    chmod 640 /usr/local/lib/python3.11/site-packages/mindieclient/python/config/config.json

Run the performance test inside the QWQ-32B container. --ModelPath has to be the model directory as seen inside the container; the path below is the one used in this run and differs from the /modelscope/QWQ-32B mount shown earlier.

    benchmark --DatasetType "synthetic" \
        --ModelName QwQ-32B \
        --ModelPath "/iflytek/modelscope/QWQ32B" \
        --TestType vllm_client \
        --Http http://10.32.xx.111:1025 \
        --ManagementHttp http://10.32.xx.111:1026 \
        --Concurrency 128 \
        --MaxOutputLen 512 \
        --TaskKind stream \
        --Tokenizer True \
        --SyntheticConfigPath /usr/local/lib/python3.11/site-packages/mindiebenchmark/config/synthetic_config.json

Results:

Common Metric | Value | Description
CurrentTime | 2025-03-31 12:15:17 | Current time, i.e. when the performance test finished.
TimeElapsed | 5.5949 s | Total duration of the test, in seconds.
DataSource | None | Data source. None means no dataset file was used (the requests are synthetic).
Failed | 0( 0.0% ) | Number and share of failed requests. No request failed in this run.
Returned | 100( 100.0% ) | Number and share of requests that returned successfully. All 100 did.
Total | 100[ 100.0% ] | Total number of requests sent. 100 were sent and all succeeded.
Concurrency | 128 | Number of concurrent requests, i.e. how many are sent at the same time.
ModelName | QwQ-32B | Name of the model under test.
lpct | 7.9003 ms | Average prefill latency per input token (first-token latency divided by input length), in milliseconds. It is not the end-to-end latency of a request.
Throughput | 17.8733 req/s | Requests handled per second.
GenerateSpeed | 1687.6008 token/s | Tokens generated by the model per second.
GenerateSpeedPerClient | 13.1844 token/s | Tokens generated per second per client.
accuracy | / | Accuracy. No accuracy data is produced in this run, so it shows /.

Detailed explanation of /usr/local/lib/python3.11/site-packages/mindiebenchmark/config/synthetic_config.json

1. Input
· Method: how the input lengths are generated. Here "uniform", a uniform distribution.
· Params: parameters of the uniform distribution.
o MinValue: lower bound of the distribution, here 1.
o MaxValue: upper bound of the distribution, here 200.
2. Output
· Method: how the output lengths are generated. Here "gaussian", a Gaussian (normal) distribution.
· Params: parameters of the Gaussian distribution.
o Mean: mean of the distribution, here 100.
o Var: variance of the distribution, here 200.
o MinValue: lower truncation bound, here 1.
o MaxValue: upper truncation bound, here 100.
3. RequestCount
· RequestCount: number of requests, i.e. how many data points to generate, here 100.
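The service config.json ties several values together: worldSize follows the NPU card count, and maxSeqLen covers input plus output. A small pre-flight check can catch a mismatch before the daemon is started. The sketch below is only an illustration in Python: `load_config`, `check_mindie_config` and `POST_VALUES` are hypothetical names, not part of MindIE, and the maxPrefillTokens rule is an assumption rather than something the post states.

```python
import json


def load_config(path: str) -> dict:
    """Read a MindIE service config.json (plain JSON, inline comments removed)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def check_mindie_config(cfg: dict) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    backend = cfg["BackendConfig"]
    deploy = backend["ModelDeployConfig"]
    sched = backend["ScheduleConfig"]

    # From the post: worldSize follows the number of NPU cards in npuDeviceIds.
    for device_group in backend["npuDeviceIds"]:
        for model in deploy["ModelConfig"]:
            if model["worldSize"] != len(device_group):
                problems.append(
                    f"worldSize {model['worldSize']} != "
                    f"{len(device_group)} NPU cards in {device_group}"
                )

    # From the post: maxSeqLen is the cap on input + output together.
    if deploy["maxInputTokenLen"] >= deploy["maxSeqLen"]:
        problems.append("maxInputTokenLen leaves no room for output in maxSeqLen")

    # Assumption (not stated in the post): a prefill batch must be able to
    # hold at least one maximum-length input.
    if sched["maxPrefillTokens"] < deploy["maxInputTokenLen"]:
        problems.append("maxPrefillTokens is smaller than maxInputTokenLen")

    return problems


# Only the fields the checks read, with the values used in this deployment.
POST_VALUES = {
    "BackendConfig": {
        "npuDeviceIds": [[0, 1, 2, 3]],
        "ModelDeployConfig": {
            "maxSeqLen": 102400,
            "maxInputTokenLen": 71680,
            "ModelConfig": [{"modelName": "QwQ-32B", "worldSize": 4}],
        },
        "ScheduleConfig": {"maxPrefillTokens": 86016},
    }
}

print(check_mindie_config(POST_VALUES))  # prints []
```

Against the real file, the call would be `check_mindie_config(load_config("/usr/local/Ascend/mindie/latest/mindie-service/conf/config.json"))`.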
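The figures in the benchmark result can be cross-checked against each other. The sketch below recomputes two of them from the others, assuming that Throughput is Returned divided by TimeElapsed and that GenerateSpeedPerClient is GenerateSpeed divided by Concurrency. Those formulas are inferred from the numbers, not taken from the tool, and `derived_metrics` is a hypothetical helper.

```python
def derived_metrics(returned: int, elapsed_s: float,
                    generate_speed: float, concurrency: int) -> dict:
    """Recompute summary figures from the raw values in the result table."""
    return {
        # Assumed definition: completed requests per second of wall time.
        "throughput_req_s": returned / elapsed_s,
        # Assumed definition: aggregate token rate split across clients.
        "speed_per_client": generate_speed / concurrency,
        # If GenerateSpeed is total tokens over elapsed time, this is the
        # total number of tokens generated during the run.
        "total_tokens": generate_speed * elapsed_s,
    }


m = derived_metrics(returned=100, elapsed_s=5.5949,
                    generate_speed=1687.6008, concurrency=128)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Both recomputed values agree with the reported 17.8733 req/s and 13.1844 token/s to within the rounding of TimeElapsed. Under the same assumption the run produced roughly 9,442 tokens, about 94 per request, which fits an output length distribution with mean 100 capped at 100.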
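To get a feel for what synthetic_config.json asks for, the sketch below draws input and output lengths the way the explanation describes: uniform between MinValue and MaxValue for the input, Gaussian with the given mean and variance for the output. How the benchmark tool itself handles out-of-range samples is not stated in the post; clipping to the bounds is an assumption made here, and `sample_lengths` is a hypothetical helper.

```python
import math
import random


def sample_lengths(spec: dict, count: int, rng: random.Random) -> list:
    """Draw `count` token lengths for one side (Input or Output) of the config."""
    method, params = spec["Method"], spec["Params"]
    lengths = []
    for _ in range(count):
        if method == "uniform":
            value = rng.randint(params["MinValue"], params["MaxValue"])
        elif method == "gaussian":
            # Var is a variance, so the standard deviation is its square root.
            value = round(rng.gauss(params["Mean"], math.sqrt(params["Var"])))
        else:
            raise ValueError(f"unsupported method: {method}")
        # Assumption: out-of-range samples are clipped to the bounds.
        lengths.append(min(max(value, params["MinValue"]), params["MaxValue"]))
    return lengths


synthetic = {
    "Input": {"Method": "uniform",
              "Params": {"MinValue": 1, "MaxValue": 200}},
    "Output": {"Method": "gaussian",
               "Params": {"Mean": 100, "Var": 200, "MinValue": 1, "MaxValue": 100}},
    "RequestCount": 100,
}

rng = random.Random(0)  # fixed seed so repeated runs draw the same lengths
inputs = sample_lengths(synthetic["Input"], synthetic["RequestCount"], rng)
outputs = sample_lengths(synthetic["Output"], synthetic["RequestCount"], rng)
print(min(inputs), max(inputs), min(outputs), max(outputs))
```

Because MaxValue equals Mean for the output, about half of the Gaussian samples land above the cap. Under the clipping assumption many requests therefore ask for exactly 100 output tokens.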