Open-source multimodal chatbot llava-v1.5-13B-AWQ - supports image-based conversation

Llava V1.5 13B AWQ

Quantized and published by TheBloke

LLaVA is an open-source multimodal chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

Image-text-to-text

Transformers

#Multimodal dialogue #Instruction following #Academic VQA

Downloads: 141

Released: 10/15/2023

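Model overview

LLaVA is an auto-regressive language model based on the transformer architecture. It can understand images and generate text about them.

Model features

Multimodal understanding

Processes image and text inputs together and reasons about the relationship between them

Instruction following

Carries out tasks described by complex multimodal instructions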

Open source

The model is openly released for research and commercial use, subject to the terms of its llama2 license

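Model capabilities

Visual question answering

Image caption generation

Multimodal dialogue

Instruction following

Use cases

Research

Multimodal model research

Studying the behavior and capabilities of large multimodal models

Education

Visually assisted learning

Helping students understand complex concepts through images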

🚀 Llava v1.5 13B - AWQ

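Llava v1.5 13B - AWQ is an AWQ-quantized version of the Llava v1.5 13B model. AWQ is an efficient and accurate quantization method with fast inference, and it supports high-throughput concurrent inference in multi-user server scenarios. The model is suited to image recognition, multimodal dialogue, and similar tasks, and is intended to support research and application development in those areas.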

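🚀 Quick start

Serving this model from vLLM

Instructions for installing and using vLLM are available in the vLLM project documentation.

Note: at the time of writing, vLLM had not yet published a release with AWQ support.

If the vLLM examples below fail with an error saying that quantization is not recognized, or with other AWQ-related problems, install vLLM from the GitHub source.

When using vLLM as a server, pass the --quantization awq parameter, for example: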

python3 -m vllm.entrypoints.api_server --model TheBloke/llava-v1.5-13B-AWQ --quantization awq --dtype half
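Once the server is running it can be queried over HTTP. The sketch below is a minimal client, assuming the demo `api_server` listens on `localhost:8000` and exposes a `POST /generate` endpoint that takes a JSON body with a `prompt` field plus sampling parameters and answers with a JSON object containing a `text` list. Those details describe vLLM's demo server as commonly deployed and may differ in your installed version; the helper names `build_generate_request` and `generate` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumed default address of the vLLM demo server; adjust to your deployment.
DEFAULT_URL = "http://localhost:8000/generate"


def build_generate_request(prompt, url=DEFAULT_URL, max_tokens=128,
                           temperature=0.8, top_p=0.95):
    """Build (but do not send) a POST request for the assumed /generate endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the list of generated texts."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the server answers with {"text": [...]}.
    return body["text"]
```

With the server started as shown above, `generate("The capital of France is")` would return the completions as a list of strings.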

When using vLLM from Python code, pass the quantization="awq" parameter, for example:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TheBloke/llava-v1.5-13B-AWQ", quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Serving this model from Text Generation Inference (TGI)

Use TGI version 1.1.0 or later. The official Docker container is: ghcr.io/huggingface/text-generation-inference:1.1.0

Example Docker parameters:

--model-id TheBloke/llava-v1.5-13B-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
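The two length limits above interact: `--max-input-length` caps the prompt, while `--max-total-tokens` caps the prompt plus the generated tokens. The short sketch below works through that arithmetic for the example values; the function names are illustrative and not part of TGI.

```python
# Values copied from the example Docker parameters above.
MAX_INPUT_LENGTH = 3696
MAX_TOTAL_TOKENS = 4096


def generation_budget(max_total_tokens=MAX_TOTAL_TOKENS,
                      max_input_length=MAX_INPUT_LENGTH):
    """Tokens left for generation when the prompt is at its maximum length."""
    return max_total_tokens - max_input_length


def request_fits(prompt_tokens, max_new_tokens,
                 max_total_tokens=MAX_TOTAL_TOKENS,
                 max_input_length=MAX_INPUT_LENGTH):
    """Return True if a request respects both limits."""
    return (prompt_tokens <= max_input_length
            and prompt_tokens + max_new_tokens <= max_total_tokens)


print(generation_budget())       # 400
print(request_fits(3696, 128))   # True
print(request_fits(3696, 512))   # False
print(request_fits(1000, 512))   # True
```

With these settings a maximum-length prompt still leaves 400 tokens for the reply, so a request for 128 new tokens always fits while one for 512 fits only with a shorter prompt.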

Example Python code (requires huggingface-hub 0.17.0 or later):

pip3 install huggingface-hub

from huggingface_hub import InferenceClient

endpoint_url = "https://your-endpoint-url-here"

prompt = "Tell me about AI"
prompt_template=f'''{prompt}

'''

client = InferenceClient(endpoint_url)
response = client.text_generation(prompt_template,
                                  max_new_tokens=128,
                                  do_sample=True,
                                  temperature=0.7,
                                  top_p=0.95,
                                  top_k=40,
                                  repetition_penalty=1.1)

print(f"Model output: {response}")

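Using this AWQ model from Python code

Install the required packages

Requires: AutoAWQ 0.1.1 or later

pip3 install autoawq

If you have problems installing AutoAWQ from the pre-built wheels, install it from source instead: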

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
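Before running the examples it can help to confirm that the installed packages meet the minimum versions stated in this document (AutoAWQ 0.1.1, huggingface-hub 0.17.0). The sketch below uses only the standard library; the version parser is deliberately simplified (it compares leading integers and ignores pre-release tags), and the helper names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.17.0' or '0.1.1+cu118' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings using the simplified parser above."""
    return parse_version(installed) >= parse_version(minimum)


def check_package(name, minimum):
    """Return True if the package is installed and at least `minimum`."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return False
    return meets_minimum(installed, minimum)


for package, minimum in [("autoawq", "0.1.1"), ("huggingface-hub", "0.17.0")]:
    status = "ok" if check_package(package, minimum) else "missing or too old"
    print(f"{package} >= {minimum}: {status}")
```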

Example code

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/llava-v1.5-13B-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template=f'''{prompt}

'''

print("\n\n*** Generate:")

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))

"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
"""

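✨ Key features

Efficient quantization: uses the AWQ method with 4-bit quantization and offers faster Transformers-based inference than GPTQ.
Multi-platform support: works with vLLM, Hugging Face Text Generation Inference (TGI), and others for high-throughput concurrent inference.
Multiple versions available: AWQ and GPTQ quantized versions are provided, along with the original unquantized fp16 model.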
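To give a rough sense of what 4-bit quantization saves compared with the fp16 model, the sketch below estimates weight storage from the bit width alone. It assumes a nominal 13 billion parameters and ignores quantization scales and zero points, the vision encoder, activations, and the KV cache, so real memory use will be somewhat higher.

```python
def weight_memory_gb(num_parameters, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count implied by the "13B" model name.
NOMINAL_PARAMETERS = 13e9

fp16_gb = weight_memory_gb(NOMINAL_PARAMETERS, 16)
awq_gb = weight_memory_gb(NOMINAL_PARAMETERS, 4)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{awq_gb:.1f} GB")
print(f"reduction:     ~{fp16_gb / awq_gb:.0f}x")
```

Under these assumptions the weights shrink from roughly 26 GB to roughly 6.5 GB, a fourfold reduction.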

📦 Installation guide

Install vLLM

See the vLLM project documentation for installation instructions.

Install AutoAWQ

pip3 install autoawq

If the installation fails, install from source:

pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .

Install huggingface-hub (for the TGI example)

pip3 install huggingface-hub

📚 Detailed documentation

Model information

Model creator: Haotian Liu
Original model: Llava v1.5 13B
Model type: llama
License: llama2

Available repositories

Prompt template

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <image>{prompt}
ASSISTANT:
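The template above can be filled in programmatically. The sketch below is one way to do it; `build_prompt` is an illustrative helper, and the resulting string is what would be passed as the prompt in the earlier examples. How the `<image>` placeholder is bound to actual image data depends on the inference stack and is not covered here.

```python
SYSTEM_MESSAGE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(user_message):
    """Fill the model card's prompt template with a single user message."""
    return f"{SYSTEM_MESSAGE}\nUSER: <image>{user_message}\nASSISTANT:"


print(build_prompt("What is shown in this picture?"))
```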