Lava_phi Open-Source Vision-Language Model - Free, CLIP-Based Image Processing

Lava Phi

Developed by sagar007

A vision-language model based on Microsoft's Phi-1.5 architecture that uses CLIP for image processing

Image-to-Text

Transformers

Supports multiple languages | License: MIT | #MultimodalQA #InstructionTuning #SmallAndEfficient

Downloads: 17

Release date: 1/2/2025

Model Overview

This is a multimodal model that accepts both image and text inputs and generates related text output.

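Model Features

Multimodal capability

Combines text and image processing, so it can understand images and generate text descriptions of them

Efficient training

Trained with QLoRA (Quantized Low-Rank Adaptation); 4-bit quantization improves memory efficiency

Mixed-precision training

Uses bfloat16 mixed-precision training to improve training efficiency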
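
To make the efficiency claim concrete, the following is a rough back-of-the-envelope sketch of how much memory the base weights alone would occupy at different precisions. The parameter count is an assumption (roughly 1.3 billion, the size Microsoft reports for Phi-1.5), and the estimate ignores activations, optimizer state, adapter weights, and the small overhead of quantization constants.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Assumption: about 1.3 billion parameters for the Phi-1.5 base model.
PHI_1_5_PARAMS = 1_300_000_000

# Compare full precision, bfloat16, and 4-bit storage of the same weights.
for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(PHI_1_5_PARAMS, bits):.2f} GiB")
```

Under these assumptions, 4-bit storage needs a quarter of the memory of bfloat16 and an eighth of float32, which is why quantizing the frozen base model leaves more room for training on a single GPU.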

Model Capabilities

Image understanding

Image captioning

Visual question answering

Multimodal dialogue

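Use Cases

Image understanding

Image captioning

Generates a detailed text description of an input image

Visual question answering

Image-based question answering

Answers natural-language questions about the content of an image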

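🚀 LLaVA-Phi Model

LLaVA-Phi is a vision-language model based on Microsoft's Phi-1.5 architecture. It integrates CLIP for image processing and is intended for image-to-text tasks.

🚀 Quick Start

The model can be used for image-to-text tasks. The following code example shows how to use it: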

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image

# Load the model, the tokenizer, and the CLIP image processor
model = AutoModelForCausalLM.from_pretrained("sagar007/Lava_phi")
tokenizer = AutoTokenizer.from_pretrained("sagar007/Lava_phi")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text-only generation
def generate_text(prompt):
    inputs = tokenizer(f"human: {prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Generation from an image plus a text prompt
def process_image_and_prompt(image_path, prompt):
    image = Image.open(image_path)
    image_tensor = processor(images=image, return_tensors="pt").pixel_values
    
    inputs = tokenizer(f"human: <image>\n{prompt}\ngpt:", return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        images=image_tensor,
        max_new_tokens=128
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
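
Both functions above use the same `human: ... gpt:` prompt template, and a decoder-only model typically returns the prompt followed by its continuation. The sketch below factors the template into one place and trims the decoded text down to the reply. The helper names `build_prompt` and `extract_reply` are hypothetical and not part of the model's API; cutting the reply at a following `human:` marker is also an assumption about how the model might continue the dialogue.

```python
IMAGE_TOKEN = "<image>"  # placeholder used in the prompt template above


def build_prompt(question: str, with_image: bool = False) -> str:
    """Format a question using the 'human: ... gpt:' template."""
    prefix = f"{IMAGE_TOKEN}\n" if with_image else ""
    return f"human: {prefix}{question}\ngpt:"


def extract_reply(decoded: str) -> str:
    """Return only the model's reply from the decoded text.

    Takes the text after the first 'gpt:' marker and stops at a
    following 'human:' marker if the model began another turn.
    """
    _, marker, rest = decoded.partition("gpt:")
    if not marker:
        # No template marker found: return the text unchanged.
        return decoded.strip()
    reply, _, _ = rest.partition("human:")
    return reply.strip()
```

For example, `extract_reply(generate_text("What is a transformer?"))` would return only the generated answer, without the echoed prompt.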

✨ Key Features

Base model: Microsoft's Phi-1.5.
Vision encoder: CLIP ViT-B/32, used for image feature extraction.
Training approach: fine-tuned with QLoRA.
Dataset: trained on the Instruct 150K dataset.
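
To illustrate what the vision encoder produces, here is a small sketch that counts the tokens a ViT-style encoder emits for one image. It assumes the published configuration of `openai/clip-vit-base-patch32` (224x224 input, 32x32 patches, one class token). Whether Lava_phi passes all of these tokens or only a pooled embedding to Phi-1.5 is not stated here, so treat this as background on ViT-B/32 and not as a description of the model's internals.

```python
def vit_token_count(image_size: int = 224, patch_size: int = 32,
                    class_token: bool = True) -> int:
    """Number of tokens a ViT encoder produces for one square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + (1 if class_token else 0)


# 224 / 32 = 7 patches per side, so 7 * 7 = 49 patches plus 1 class token.
print(vit_token_count())  # 50
```

A patch size of 32 keeps the visual sequence short, which suits a small language model with a limited context budget.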

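🔧 Technical Details

Training method: QLoRA (Quantized Low-Rank Adaptation).
Quantization: 4-bit quantization for efficiency.
Gradient checkpointing: enabled to reduce memory use.
Mixed-precision training: bfloat16.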
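
The low-rank part of QLoRA can be illustrated with a quick parameter count. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two small matrices of shapes `d_out x r` and `r x d_in` instead. The rank and the adapted layers used for this model are not given here, so the values below are assumptions for illustration: a 2048x2048 projection (matching Phi-1.5's published hidden size) and a hypothetical rank of 16.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA adapter: A is r x d_in, B is d_out x r."""
    return rank * (d_in + d_out)


# Assumptions for illustration only: hidden size 2048, rank 16.
HIDDEN_SIZE = 2048
RANK = 16

dense = full_params(HIDDEN_SIZE, HIDDEN_SIZE)
adapter = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
print(f"dense: {dense:,}  adapter: {adapter:,}  ratio: {adapter / dense:.2%}")
```

With these numbers the adapter trains under 2% of the parameters of the layer it adapts, while the frozen base weights can stay in 4-bit form.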

📄 License

This project is released under the MIT License.

📚 Documentation

Citation

If you use this model, please cite it as follows:

@software{llava_phi_2024,
  author = {sagar007},
  title = {LLaVA-Phi: Vision-Language Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/sagar007/Lava_phi}
}

Model Information

Attribute	Details
Model type	Vision-language model
Base model	Microsoft Phi-1.5
Vision encoder	CLIP ViT-B/32
Training data	Instruct 150K
Training method	QLoRA fine-tuning
License	MIT License
Tags	vision-language, phi, llava, clip, qlora, multimodal
Dataset	laion/instructional-image-caption-data
Library	transformers
Task type	Image-to-Text