Mplug-owl-llama-7b Open-Source Multimodal Large Model - Supports Image Understanding and Text Generation Tasks

Mplug Owl Llama 7b

Developed by MAGAer13

mPLUG-Owl is a multimodal large language model based on the LLaMA-7B architecture, supporting image understanding and text generation tasks.

Image-to-Text

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Multimodal Dialogue #Image Understanding #Meme Analysis

Downloads 327

Release Time: 5/8/2023

Model Overview

This model combines visual and language processing capabilities, enabling it to understand image content and generate relevant textual descriptions or answer questions, suitable for multimodal interaction scenarios.

Model Features

Multimodal Understanding

Processes both image and text inputs simultaneously to achieve cross-modal content understanding

Conversational Interaction

Supports multi-turn dialogue templates for natural language interaction

Open-domain Applications

Suitable for open-domain visual question answering and image caption generation

Model Capabilities

Image Content Understanding

Visual Question Answering

Meme Analysis

Multi-turn Dialogue Generation

Cross-modal Reasoning

Use Cases

Social Media Analysis

Meme Interpretation

Analyzes the humorous elements and cultural context of internet memes

Generates humorous explanations that align with human cognition

Creative Assistance

Image Caption Generation

Automatically generates descriptive text for visual content

Produces accurate and contextually appropriate textual descriptions

🚀 MplugOwl Image-to-Text Model

MplugOwl is an image-to-text model that can generate text descriptions based on input images. It offers a seamless way to integrate image understanding into text generation tasks.

🚀 Quick Start

Get the latest codebase from GitHub:

git clone https://github.com/X-PLUG/mPLUG-Owl.git

✨ Features

Image-to-Text Conversion: Generate text descriptions for given images.
Multi-turn Conversation Support: Organize context as multi-turn conversations for more interactive responses.

📦 Installation

The installation mainly involves cloning the repository from GitHub. The codebase can be obtained using the following command:

git clone https://github.com/X-PLUG/mPLUG-Owl.git

💻 Usage Examples

Basic Usage

Model initialization

import torch  # needed for torch.bfloat16 below and torch.no_grad() during inference

from mplug_owl.modeling_mplug_owl import MplugOwlForConditionalGeneration
from mplug_owl.tokenization_mplug_owl import MplugOwlTokenizer
from mplug_owl.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-owl-llama-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = MplugOwlTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

Model inference

Prepare model inputs.

# We use a human/AI template to organize the context as a multi-turn conversation.
# <image> denotes an image placeholder.
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: Explain why this meme is funny.
AI: ''']

# Place the image paths in image_list, in the same order as the <image> placeholders in the prompts.
# URLs, local file paths, and base64 strings are supported. You can customize image pre-processing by modifying mplug_owl.modeling_mplug_owl.ImageProcessor.
image_list = ['https://xxx.com/image.jpg']
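The human/AI template above can also be assembled programmatically. The following is a minimal sketch, not part of the mPLUG-Owl codebase: `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and the way earlier AI answers are written back into the context (one `AI: ...` line per completed turn) is an assumption extrapolated from the single-turn example.

```python
# Hypothetical helper; the system prompt text is copied from the example above.
SYSTEM_PROMPT = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(turns, num_images=1, system_prompt=SYSTEM_PROMPT):
    """Build a prompt string from a list of (human_text, ai_text) pairs.

    Pass ai_text=None for the final turn so the prompt ends with an open
    "AI: " line for the model to complete.
    """
    lines = [system_prompt]
    # One "Human: <image>" line per image, in the same order as image_list.
    lines.extend("Human: <image>" for _ in range(num_images))
    for human_text, ai_text in turns:
        lines.append(f"Human: {human_text}")
        # Assumption: completed turns are replayed as "AI: <answer>".
        lines.append("AI: " if ai_text is None else f"AI: {ai_text}")
    return "\n".join(lines)


prompts = [build_prompt([("Explain why this meme is funny.", None)])]
```

With a single open turn, this reproduces the prompt string shown above character for character. For a follow-up question, append the model's previous answer as the second element of the earlier pair and add a new pair ending in `None`.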

Get response.

# Generation kwargs (the same as in transformers) can be passed to model.generate().
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
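from PIL import Image

# Image.open accepts local file paths or file objects, so remote images must be downloaded first.
images = [Image.open(path) for path in image_list]
inputs = processor(text=prompts, images=images, return_tensors='pt')
# Cast float32 tensors to bfloat16 to match the model weights, then move all tensors to the model's device.
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
# Decode the first generated sequence back into text.
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)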
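The comments above mention URLs, local file paths, and base64 strings as image sources. One way to handle all three before handing the data to PIL is sketched below. `read_image_bytes` is a hypothetical helper written for illustration with the standard library only; it is not a function from the mPLUG-Owl repository, and the order in which it checks the source type is an assumption.

```python
import base64
import binascii
import os
from urllib.request import urlopen


def read_image_bytes(source, timeout=10):
    """Return raw image bytes from a URL, a local file path, or a base64 string."""
    # Remote image: download the whole body.
    if source.startswith(("http://", "https://")):
        with urlopen(source, timeout=timeout) as response:
            return response.read()
    # Local image: read the file as binary.
    if os.path.isfile(source):
        with open(source, "rb") as f:
            return f.read()
    # Otherwise assume base64; validate=True rejects non-base64 characters.
    try:
        return base64.b64decode(source, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(
            "source is not a URL, an existing file, or valid base64"
        ) from exc
```

The returned bytes can then be opened with `Image.open(io.BytesIO(read_image_bytes(src)))` for each `src` in `image_list`, which replaces the direct `Image.open` call when the list contains URLs or base64 strings.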

📄 License

This project is licensed under the Apache-2.0 license.
