Llama JoyCaption Alpha Two: FP8-Compressed Open-Source Model, Free to Deploy and Compatible with vllm

Llama Joycaption Alpha Two Hf Llava FP8 Dynamic

Developed by JKCHSTR

This is an FP8-compressed version of the Llama JoyCaption Alpha Two model by fancyfeast, produced with the llm-compressor tool and compatible with the vllm framework.

Image-to-Text

Safetensors

Language: English | License: MIT | #FP8 dynamic quantization #image captioning #Llava architecture

Downloads: 248

Released: 2/18/2025

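Model Overview

This model is an FP8-compressed version of Llama JoyCaption Alpha Two. It is intended mainly for image captioning and can generate detailed descriptions of images.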

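Model Features

FP8 dynamic quantization

An FP8-compressed version produced with the llm-compressor tool. It reduces model size while maintaining performance: each quantized weight is stored in one byte instead of the two bytes used by bfloat16.

vllm compatible

The compressed model is compatible with the vllm framework, which makes it easier to deploy and use.

Image caption generation

Generates detailed, professional descriptions of input images.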
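
The FP8 dynamic quantization feature listed above can be illustrated numerically. The sketch below is a pure-Python illustration only. It assumes the FP8 E4M3 format (4 exponent bits, 3 mantissa bits, largest finite value 448) and a scale computed on the fly from each row of values. The helper names are hypothetical, and this is not necessarily the exact scheme llm-compressor applied to this checkpoint.

```python
import math

FP8_E4M3_MAX = 448.0        # largest finite value in the E4M3 format
MIN_NORMAL = 2.0 ** -6      # smallest normal E4M3 magnitude
SUBNORMAL_STEP = 2.0 ** -9  # spacing between E4M3 subnormal values


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest value representable in FP8 E4M3."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    mag = min(abs(value), FP8_E4M3_MAX)  # saturate instead of overflowing
    if mag < MIN_NORMAL:
        # Subnormal range: values sit on a fixed grid
        return sign * round(mag / SUBNORMAL_STEP) * SUBNORMAL_STEP
    # mag = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(mag)
    # 1 implicit + 3 explicit mantissa bits = 4 significant bits,
    # so the mantissa snaps to a multiple of 1/16
    return sign * math.ldexp(round(mantissa * 16) / 16, exponent)


def quantize_dynamic(row):
    """Quantize one row using a scale derived from the row itself."""
    amax = max(abs(x) for x in row)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    codes = [round_to_e4m3(x / scale) for x in row]
    return codes, scale


def dequantize(codes, scale):
    """Map FP8 values back to the original range."""
    return [c * scale for c in codes]


row = [0.02, -1.5, 3.0, 0.75]  # made-up activation values
codes, scale = quantize_dynamic(row)
restored = dequantize(codes, scale)
for original, approx in zip(row, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

"Dynamic" here means the scale is recomputed from the data at run time rather than fixed ahead of time, so the largest value in each row always lands on the top of the FP8 range.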

Model Capabilities

Image analysis

Text generation

Multimodal processing

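Use Cases

Content generation

Image caption generation

Generates formal, detailed descriptions of images

Produces caption text that meets the stated requirements

Assistive tools

Visual assistance

Provides descriptions of image content for visually impaired users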

🚀 Llama Joycaption Alpha Two hf Llava FP8 Dynamic

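This is an FP8-compressed version of the Llama JoyCaption Alpha Two model made by fancyfeast, produced with the llm-compressor tool and compatible with vllm.

I have tested this model myself. The testing may not have followed a rigorous methodology, but the model performs well in my use case.

All credit goes to fancyfeast; see the official model page for more details.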

🚀 Quick Start

Usage is the same as for the original Llama JoyCaption Alpha Two model.

You need the compressed-tensors library to run the code below in FP8 mode.
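
Before running the example, it can help to confirm that the required modules are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module list is an assumption based on the imports in the usage example plus the compressed-tensors requirement (imported as `compressed_tensors`).

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names, not pip names: the compressed-tensors package
# is imported as compressed_tensors, and Pillow as PIL.
REQUIRED = ["torch", "transformers", "PIL", "compressed_tensors"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If `compressed_tensors` is reported missing, installing it with `pip install compressed-tensors` is the usual fix.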

💻 Usage Example

Basic usage

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"


# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load image
    image = Image.open(IMAGE_PATH)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of
    # processor.apply_chat_template() and processor() works, but if you use other combinations, always
    # inspect the final input_ids to ensure they are correct. If you are not careful you will often end up
    # with multiple <bos> tokens, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the captions
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.6,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
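
The warning in the example about duplicated `<bos>` tokens can be checked with a small helper. This is a minimal sketch: the function name `count_leading_tokens` is hypothetical, and the token ids below are made up for illustration rather than taken from the real tokenizer.

```python
def count_leading_tokens(token_ids, token_id):
    """Count how many times token_id repeats at the very start of token_ids."""
    count = 0
    for tid in token_ids:
        if tid != token_id:
            break
        count += 1
    return count


# Made-up ids for illustration only; use the tokenizer's real BOS id in practice.
BOS_ID = 1
good = [1, 42, 43, 44]      # exactly one BOS at the start
bad = [1, 1, 42, 43, 44]    # BOS was added twice

for name, ids in (("good", good), ("bad", bad)):
    n = count_leading_tokens(ids, BOS_ID)
    status = "ok" if n == 1 else "check your prompt formatting"
    print(f"{name}: {n} leading BOS token(s), {status}")
```

With the example above, the same check could be applied as `count_leading_tokens(inputs['input_ids'][0].tolist(), processor.tokenizer.bos_token_id)`, and a result other than 1 suggests the prompt was formatted incorrectly.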