
Llama.cpp

You can use llama.cpp to run a wide range of large language models (LLMs) and vision language models (VLMs) on Dragonwing development boards. Models running under llama.cpp execute on the GPU, not the NPU. A subset of models can run on the NPU via GENIE.

Building llama.cpp

You'll need to build a few dependencies for llama.cpp. Open a terminal on your development board, or start an SSH session, and run the following:

1️⃣ Install the build dependencies:

sudo apt update
sudo apt install -y cmake ninja-build curl libcurl4-openssl-dev

2️⃣ Install the OpenCL headers and the ICD loader library:

mkdir -p ~/dev/llm

# Symlink the OpenCL shared library
sudo rm -f /usr/lib/libOpenCL.so
sudo ln -s /lib/aarch64-linux-gnu/libOpenCL.so.1.0.0 /usr/lib/libOpenCL.so

# OpenCL headers
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers
git checkout 5d52989617e7ca7b8bb83d7306525dc9f58cdd46
mkdir -p build && cd build
cmake .. -G Ninja \
-DBUILD_TESTING=OFF \
-DOPENCL_HEADERS_BUILD_TESTING=OFF \
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# ICD Loader
cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
git checkout 02134b05bdff750217bf0c4c11a9b13b63957b04
mkdir -p build && cd build
cmake .. -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
cmake --build . --target install

# Symlink OpenCL headers
sudo rm -f /usr/include/CL
sudo ln -s ~/dev/llm/opencl/include/CL/ /usr/include/CL

3️⃣ Build llama.cpp with the OpenCL backend:

cd ~/dev/llm

# Clone repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# We've tested this commit explicitly; try master if you want the bleeding edge
git checkout f6da8cb86a28f0319b40d9d2a957a26a7d875f8c

# Build
mkdir -p build
cd build
cmake .. -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENCL=ON
ninja -j`nproc`

4️⃣ Add the llama.cpp binaries to your PATH:

cd ~/dev/llm/llama.cpp/build/bin

echo "" >> ~/.bash_profile
echo "# Begin llama.cpp" >> ~/.bash_profile
echo "export PATH=\$PATH:$PWD" >> ~/.bash_profile
echo "# End llama.cpp" >> ~/.bash_profile
echo "" >> ~/.bash_profile

# To use the llama.cpp files in your current session
source ~/.bash_profile

5️⃣ llama.cpp is now deployed:

llama-cli --version
# ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
# ggml_opencl: device: 'QUALCOMM Adreno(TM) 635 (OpenCL 3.0 Adreno(TM) 635)'
# ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0808.0.7 Compiler E031.49.02.00
# ggml_opencl: vector subgroup broadcast support: true

Downloading and quantizing a model

To run models with GPU acceleration, you need a model with pure 4-bit quantization (Q4_0) in GGUF format (llama.cpp's own format; see the conversion guide). You can either pick a model that has already been quantized, or quantize one yourself with llama-quantize. For example, for Qwen2-1.5B-Instruct:

1️⃣ Fetch Qwen2-1.5B-Instruct in fp16 format from HuggingFace and quantize it with llama-quantize:

# Download fp16 model
wget https://huggingface.co/Qwen/Qwen2-1.5B-Instruct-GGUF/resolve/main/qwen2-1_5b-instruct-fp16.gguf

# Quantize (pure Q4_0)
llama-quantize --pure qwen2-1_5b-instruct-fp16.gguf qwen2-1_5b-instruct-q4_0-pure.gguf Q4_0

2️⃣ You can now run this model with the llama.cpp binaries you built above.
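
If the model you want is not published as GGUF at all, the conversion script that ships in the llama.cpp repository can produce the fp16 GGUF first. This is a minimal sketch, not part of the steps above; the checkpoint path is a placeholder, and the script's Python dependencies come from the repo's requirements.txt:

cd ~/dev/llm/llama.cpp

# Install the conversion script's Python dependencies in a separate venv
python3 -m venv .venv-convert
source .venv-convert/bin/activate
pip3 install -r requirements.txt

# Convert a Hugging Face checkpoint (safetensors) to fp16 GGUF, then quantize as above
python3 convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf
llama-quantize --pure model-f16.gguf model-q4_0-pure.gguf Q4_0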

Running your first LLM with llama-cli

You can now run an LLM through llama-cli. It automatically offloads the layers to the GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# ... You'll see:
# load_tensors: offloaded 29/29 layers to GPU
# ...
# Knock knock, 11:59 pm ... rest of the story

You now have an LLM running on your device's GPU.
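
The -no-cnv flag above asks for a single one-shot completion. If you would rather chat with the model interactively, dropping -no-cnv and the prompt should put llama-cli into its conversation mode, assuming the model ships a chat template (Qwen2-1.5B-Instruct does). A minimal sketch:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -c 2048 -fa off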

Deploying an LLM with llama-server

Next, you can use llama-server to start a web server with a chat interface, which also exposes an OpenAI-compatible chat completions API.

1️⃣ First, find the IP address of your development board:

ifconfig | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | grep -Eo '([0-9]*\.){3}[0-9]*' | grep -v '127.0.0.1'

# ... Example:
# 192.168.1.253

2️⃣ Then, start the server:

llama-server -m ./qwen2-1_5b-instruct-q4_0-pure.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876

3️⃣ On your computer, open a web browser and navigate to http://192.168.1.253:9876 (replace the IP address with the one you found in step 1️⃣).

4️⃣ You can also access this server programmatically through the OpenAI chat completions API, for example from Python (a curl alternative is sketched after these steps):
a. Create a new virtual environment (venv) and install requests:

python3 -m venv .venv-chat
source .venv-chat/bin/activate
pip3 install requests

b. Create a new file chat.py:

import requests

# if running from your own computer, replace localhost with the IP address of your development board
url = "http://localhost:9876/v1/chat/completions"

payload = {
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain Qualcomm in one sentence."}
],
"temperature": 0.7,
"max_tokens": 200
}

response = requests.post(url, headers={ "Content-Type": "application/json" }, json=payload)
print(response.json())

c. Run chat.py:

python3 chat.py

# ...
# {'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': 'Qualcomm is a leading global technology company that designs, develops, licenses, and markets semiconductor-based products and mobile platform technologies to major telecommunications and consumer electronics manufacturers worldwide.'}}], 'created': 1757073340, 'model': 'gpt-3.5-turbo', 'system_fingerprint': 'b6362-f6da8cb8', 'object': 'chat.completion', 'usage': {'completion_tokens': 34, 'prompt_tokens': 26, 'total_tokens': 60}, 'id': 'chatcmpl-3O7l005WG1DzN191FTNomJNweHMoH8Is', 'timings': {'prompt_n': 12, 'prompt_ms': 303.581, 'prompt_per_token_ms': 25.298416666666668, 'prompt_per_second': 39.52816546490064, 'predicted_n': 34, 'predicted_ms': 4052.23, 'predicted_per_token_ms': 119.18323529411765, 'predicted_per_second': 8.390441806116632}}
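
If you only want a quick smoke test from the shell instead of Python, the same endpoint can be called with curl (a sketch; run it on the board, or replace localhost with the board's IP address):

curl http://localhost:9876/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Explain Qualcomm in one sentence."}
        ],
        "temperature": 0.7,
        "max_tokens": 200
      }'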

Deploying multimodal LLMs

You can also deploy multimodal LLMs, for example SmolVLM-500M-Instruct-GGUF. Download the Q4_0 quantized weights (or quantize them yourself), plus the CLIP encoder (the mmproj-*.gguf file). For example:

# Download weights
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/SmolVLM-500M-Instruct-f16.gguf
wget https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF/resolve/main/mmproj-SmolVLM-500M-Instruct-f16.gguf

# Quantize model (mmproj- models are not quantizable via llama-quantize, see below)
llama-quantize --pure SmolVLM-500M-Instruct-f16.gguf SmolVLM-500M-Instruct-q4_0-pure.gguf Q4_0

# Serve the model
llama-server -m ./SmolVLM-500M-Instruct-q4_0-pure.gguf --mmproj ./mmproj-SmolVLM-500M-Instruct-f16.gguf --no-warmup -b 128 -c 2048 -s 11 -n 128 --host 0.0.0.0 --port 9876

Info

The CLIP model is still fp16: the mmproj-*.gguf model remains in fp16, so processing images will be slow. Older versions of llama.cpp had code to quantize the CLIP encoder.
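
To exercise the vision side, recent llama-server builds started with --mmproj accept OpenAI-style image_url content parts on the same chat completions endpoint. A hedged sketch (test.jpg is a placeholder image on the board; replace localhost with the board's IP address if calling remotely):

# Base64-encode the image and embed it as a data URI in the request
IMG_B64=$(base64 -w 0 test.jpg)

curl http://localhost:9876/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG_B64"'"}}
          ]}
        ],
        "max_tokens": 100
      }'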

Tips and tricks

Comparing against CPU performance

Add -ngl 0 to any llama-* command to skip offloading layers to the GPU. The model will then run on the CPU, so you can compare that against the GPU performance.

For example, Qwen2-1.5B-Instruct Q4_0 on a RUBIK Pi 3:

GPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off

# llama_perf_sampler_print: sampling time = 225.78 ms / 133 runs ( 1.70 ms per token, 589.06 tokens per second)
# llama_perf_context_print: load time = 5338.13 ms
# llama_perf_context_print: prompt eval time = 201.32 ms / 5 tokens ( 40.26 ms per token, 24.84 tokens per second)
# llama_perf_context_print: eval time = 13214.35 ms / 127 runs ( 104.05 ms per token, 9.61 tokens per second)
# llama_perf_context_print: total time = 18958.06 ms / 132 tokens
# llama_perf_context_print: graphs reused = 122

CPU:

llama-cli -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -no-cnv --no-warmup -b 128 -c 2048 -s 11 -n 128 -p "Knock knock, " -fa off -ngl 0

# llama_perf_sampler_print: sampling time = 23.47 ms / 133 runs ( 0.18 ms per token, 5666.08 tokens per second)
# llama_perf_context_print: load time = 677.25 ms
# llama_perf_context_print: prompt eval time = 253.39 ms / 5 tokens ( 50.68 ms per token, 19.73 tokens per second)
# llama_perf_context_print: eval time = 17751.29 ms / 127 runs ( 139.77 ms per token, 7.15 tokens per second)
# llama_perf_context_print: total time = 18487.26 ms / 132 tokens
# llama_perf_context_print: graphs reused = 122

Here token generation on the GPU is roughly a third faster than on the CPU (9.61 vs. 7.15 tokens per second).
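
For a more systematic comparison, the llama-bench tool built alongside llama-cli can sweep several -ngl values in a single run (a sketch; the prompt and generation lengths are illustrative):

# Benchmark prompt processing (-p) and token generation (-n) with no layers (CPU)
# and all layers (GPU) offloaded
llama-bench -m ./qwen2-1_5b-instruct-q4_0-pure.gguf -ngl 0,99 -p 128 -n 128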