llava-v1.5-13B-AWQ: an open-source multimodal chatbot with image-based conversation

Llava V1.5 13B AWQ

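Published by TheBloke (AWQ quantization)

LLaVA is an open-source multimodal chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.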

Image-to-text

Transformers

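#Multimodal dialogue #Instruction following #Academic VQA

Downloads: 141

Released: 10/15/2023

Model overview

LLaVA is an autoregressive language model based on the transformer architecture. It can understand images and generate text about them.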

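Model features

Multimodal understanding

Processes image and text inputs together and reasons about the relationship between them

Instruction following

Follows complex multimodal instructions to carry out tasks

Open source

The model is openly released for research and commercial use, subject to the terms of the llama2 license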

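Model capabilities

Visual question answering

Image caption generation

Multimodal dialogue

Instruction following

Use cases

Research

Multimodal model research

Studying the behavior and capabilities of large multimodal models

Education

Visually assisted learning

Helping students understand complex concepts through images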

🚀 Llava v1.5 13B - AWQ

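Llava v1.5 13B - AWQ is an AWQ-quantized version of the Llava v1.5 13B model. AWQ is an efficient, accurate quantization method with fast inference, and it supports high-throughput concurrent inference in multi-user server scenarios. The model is suited to image recognition, multimodal dialogue, and related research and application development.

🚀 Quick start

Deploying this model with vLLM

Instructions for installing and using vLLM are in the vLLM documentation.

Note: at the time of writing, vLLM had not yet shipped a release with AWQ support.

If the vLLM examples below fail with an error saying quantization is not recognized, or with other AWQ-related problems, install vLLM from the GitHub source.

When running vLLM as a server, pass the --quantization awq parameter, for example: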

python3 -m vllm.entrypoints.api_server --model TheBloke/llava-v1.5-13B-AWQ --quantization awq --dtype half
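
Once the server is running, it can be queried over HTTP. The following is a minimal sketch, not an official client. It assumes the demo api_server listens on localhost port 8000 and exposes a /generate endpoint that accepts a JSON body with a prompt field plus sampling parameters and returns the completions under a text key. The helper names build_generate_request and query_server are hypothetical, and the endpoint details should be checked against the installed vLLM version.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=128, temperature=0.8, top_p=0.95):
    """Build the JSON payload for the (assumed) /generate endpoint."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def query_server(prompt, url="http://localhost:8000/generate"):
    """POST a prompt to a running vLLM api_server and return its completions.

    The URL and the "text" response key are assumptions about the demo server.
    """
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["text"]


# Example call (requires a running server):
# print(query_server("The capital of France is"))
```

Only the standard library is used here, so the sketch runs without extra packages on the client side.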

When using vLLM from Python code, pass the quantization="awq" argument, for example:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/llava-v1.5-13B-AWQ", quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Deploying this model with Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/llava-v1.5-13B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
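
The two length limits above interact: the prompt may use at most --max-input-length tokens, and prompt plus generated tokens together may not exceed --max-total-tokens. The sketch below works out how many new tokens can still be requested for a given prompt length. The function name is hypothetical, and the defaults simply mirror the example parameters.

```python
def max_new_tokens_budget(prompt_tokens, max_input_length=3696, max_total_tokens=4096):
    """Return how many tokens may still be generated for a prompt of this length."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    # Whatever the prompt does not use of the total budget is left for generation.
    return max_total_tokens - prompt_tokens


if __name__ == "__main__":
    # A prompt that fills the whole input window leaves 4096 - 3696 tokens.
    print(max_new_tokens_budget(3696))
```

With these settings, a request whose max_new_tokens exceeds the remaining budget would be expected to be rejected by the server, so it is worth checking on the client side first.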

Example Python code (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}

'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print(f"Model output: {response}")

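Using this AWQ model from Python code

Install the required packages

Requires: AutoAWQ 0.1.1 or later

pip3 install autoawq

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead: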

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
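
Before running the examples, it can help to confirm that the installed packages meet the minimum versions stated in this document (AutoAWQ 0.1.1, huggingface-hub 0.17.0). The following is a small sketch using only the standard library. The helper names are hypothetical, and the version parsing is deliberately simple: it compares the leading numeric components and ignores local suffixes such as +cu118.

```python
import re
from importlib import metadata


def parse_version(version):
    """Turn '0.1.8+cu118' into (0, 1, 8); non-numeric suffixes are ignored."""
    public = version.split("+")[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Return a one-line status string for an installed (or missing) package."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "OK" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    return f"{name} {installed}: {status}"


if __name__ == "__main__":
    print(check_package("autoawq", "0.1.1"))
    print(check_package("huggingface-hub", "0.17.0"))
```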

Example code

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/llava-v1.5-13B-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}

'''

print("\n\n*** Generate:")

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))

"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
"""

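✨ Key features

Efficient quantization: uses the AWQ method with 4-bit quantization, which offers faster Transformers-based inference than GPTQ.
Multi-platform support: works with vLLM and Hugging Face Text Generation Inference (TGI), both of which can serve high-throughput concurrent inference.
Multiple versions available: AWQ and GPTQ quantized versions are provided, along with the original unquantized fp16 model.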
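
To give a feel for what 4-bit quantization saves, the sketch below estimates weight storage for fp16 versus 4-bit. It is a back-of-the-envelope calculation under stated assumptions: the parameter count is taken as a nominal 13 billion from the model name, and the estimate ignores quantization scales and zero-points, any components kept in higher precision, activations, and the KV cache. Real memory use will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 13e9  # assumption: "13B" taken at face value

if __name__ == "__main__":
    fp16 = weight_memory_gb(NOMINAL_PARAMS, 16)
    awq4 = weight_memory_gb(NOMINAL_PARAMS, 4)
    print(f"fp16 weights:  ~{fp16:.1f} GB")
    print(f"4-bit weights: ~{awq4:.1f} GB")
    print(f"reduction:     ~{fp16 / awq4:.0f}x")
```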

📦 Installation guide

Install vLLM

See the vLLM documentation.

Install AutoAWQ

pip3 install autoawq

If the installation fails, install from source:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Install huggingface-hub (for the TGI example)

pip3 install huggingface-hub

📚 Detailed documentation

Model information

Model creator: Haotian Liu
Original model: Llava v1.5 13B
Model type: llama
License: llama2

Available repositories

Prompt template

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <image>{prompt}
ASSISTANT:
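
The template above can be filled in programmatically. The sketch below is one way to do it. The function name build_prompt and the with_image flag are hypothetical conveniences, not part of the model's API. The flag simply omits the <image> placeholder for text-only turns, which is an assumption about how the template would be used without an image.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message, with_image=True):
    """Fill the prompt template with a user message.

    with_image is a hypothetical flag: when False, the <image> placeholder
    is left out of the USER turn.
    """
    image_token = "<image>" if with_image else ""
    return f"{SYSTEM_PROMPT}\nUSER: {image_token}{user_message}\nASSISTANT:"


if __name__ == "__main__":
    print(build_prompt("What is shown in this picture?"))
```

The returned string could then be passed as prompt_template in the examples earlier in this document in place of the bare prompt.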