Shusheng·PUYU 2 Open-Source Vision-Language Large Model - Free Deployment for Image-Text Comprehension and Creation

InternLM-XComposer2 7B 4-bit

Developed by internlm

InternLM-XComposer2 is a vision-language large model (VLLM) based on InternLM2, featuring advanced image-text understanding and creation capabilities.

Image-to-Text

Transformers

Open Source License: Other

#Interleaved Image-Text Creation #Multimodal Understanding #4-bit Quantization

Downloads 74

Release Time: 2/6/2024

Model Overview

InternLM-XComposer2 is a vision-language large model focused on image-text understanding and creation, supporting free-form interleaved image-text creation tasks.

Model Features

Advanced Image-Text Understanding

Performs strongly on multiple multimodal benchmarks, reflecting robust image-text comprehension.

Free-form Interleaved Creation

Fine-tuned for free-form interleaved image-text creation tasks, supporting complex multimodal interactions.

4-bit Quantized Version

Offers a 4-bit quantized version that lowers hardware requirements, chiefly GPU memory, while aiming to preserve most of the model's performance.
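
As a rough illustration of why 4-bit weights lower hardware requirements, the sketch below estimates weight storage for a model of about 7 billion parameters at 16-bit and 4-bit precision. The parameter count is a round-number assumption taken from the "7b" in the model name. The estimate ignores quantization metadata (scales and zero points), modules that are left unquantized, activations, and the KV cache, so real memory use will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7b" taken as roughly 7 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four; the figures are back-of-the-envelope estimates, not measurements of this model.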

Model Capabilities

Image-text understanding

Image-text creation

Multimodal interaction

Free-form interleaved creation

Use Cases

Content Creation

Image-based Article Writing

Generate coherent articles based on provided images.

Produces image-aligned articles like 'My Favorite Animal: The Giant Panda'.

Education

Teaching Assistance

Generate explanatory text or Q&A based on educational images.

🚀 InternLM-XComposer2

InternLM-XComposer2 is a vision-language large model (VLLM) based on InternLM2, designed for advanced text-image comprehension and composition.

[💻 GitHub Repo](https://github.com/InternLM/InternLM-XComposer) | [Paper](https://arxiv.org/abs/2401.16420)

We release the InternLM-XComposer2 series in two versions:

InternLM-XComposer2-VL: The pretrained VLLM, with InternLM2 as the initialization of the LLM, achieving strong performance on various multimodal benchmarks.
InternLM-XComposer2: The fine-tuned VLLM for Free-form Interleaved Text-Image Composition.

This is the 4-bit version of InternLM-XComposer2. Install the latest version of auto_gptq before using it.
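
Before loading the model, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the package list is taken from the imports in the usage example below; it checks presence only, not whether the installed versions are recent enough.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by the usage example (PIL is provided by Pillow).
required = ["torch", "auto_gptq", "PIL", "transformers"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```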

💻 Usage Examples

Basic Usage

import torch, auto_gptq
from PIL import Image
from transformers import AutoModel, AutoTokenizer 
from auto_gptq.modeling import BaseGPTQForCausalLM

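# Register "internlm" as a supported model type so auto_gptq accepts this model
auto_gptq.modeling._base.SUPPORTED_MODELS = ["internlm"]
# Inference only: disable gradient tracking
torch.set_grad_enabled(False)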

class InternLMXComposer2QForCausalLM(BaseGPTQForCausalLM):
    layers_block_name = "model.layers"
    outside_layer_modules = [
        'vit', 'vision_proj', 'model.tok_embeddings', 'model.norm', 'output', 
    ]
    inside_layer_modules = [
        ["attention.wqkv.linear"],
        ["attention.wo.linear"],
        ["feed_forward.w1.linear", "feed_forward.w3.linear"],
        ["feed_forward.w2.linear"],
    ]
 
# init model and tokenizer
model = InternLMXComposer2QForCausalLM.from_quantized(
  'internlm/internlm-xcomposer2-7b-4bit', trust_remote_code=True, device="cuda:0").eval()
tokenizer = AutoTokenizer.from_pretrained(
  'internlm/internlm-xcomposer2-7b-4bit', trust_remote_code=True)

img_path_list = [
    'panda.jpg',
    'bamboo.jpeg',
]
images = []
for img_path in img_path_list:
    image = Image.open(img_path).convert("RGB")
    image = model.vis_processor(image)
    images.append(image)
image = torch.stack(images)
query = '<ImageHere> <ImageHere>please write an article based on the images. Title: my favorite animal.'
with torch.cuda.amp.autocast():
    response, history = model.chat(tokenizer, query=query, image=image, history=[], do_sample=False)
print(response)

# Example output:
#My Favorite Animal: The Panda
#The panda, also known as the giant panda, is one of the most beloved animals in the world. These adorable creatures are native to China and can be found in the wild in a few select locations, but they are more commonly seen in captivity at zoos or wildlife reserves.
#Pandas have a distinct black-and-white coloration that makes them instantly recognizable. They are known for their love of bamboo, which they eat almost exclusively. In fact, pandas spend up to 14 hours a day eating, with the majority of their diet consisting of bamboo. Despite this seemingly unbalanced diet, pandas are actually quite healthy and have a low body fat percentage, thanks to their ability to digest bamboo efficiently.
#In addition to their unique eating habits, pandas are also known for their playful personalities. They are intelligent and curious creatures, often engaging in activities like playing with toys or climbing trees. However, they do not typically exhibit these behaviors in the wild, where they are solitary creatures who prefer to spend their time alone.
#One of the biggest threats to the panda's survival is habitat loss due to deforestation. As a result, many pandas now live in captivity, where they are cared for by dedicated staff and provided with enrichment opportunities to keep them engaged and stimulated. While it is important to protect these animals from extinction, it is also crucial to remember that they are still wild creatures and should be treated with respect and care.
#Overall, the panda is an amazing animal that has captured the hearts of people around the world. Whether you see them in the wild or in captivity, there is no denying the charm and allure of these gentle giants.
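
In the example above, the query carries one `<ImageHere>` placeholder per image passed to `model.chat`. A small helper can keep the two in step. This is a sketch only: `build_query` is a hypothetical name, not part of the model's API, and the idea that the placeholder count should match the number of images is an assumption drawn from the example.

```python
IMAGE_TOKEN = "<ImageHere>"  # placeholder string used in the example query


def build_query(prompt: str, num_images: int) -> str:
    """Prefix the prompt with one placeholder per image, as in the example."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = " ".join([IMAGE_TOKEN] * num_images)
    return placeholders + prompt


query = build_query(
    "please write an article based on the images. Title: my favorite animal.",
    num_images=2,
)
print(query)
```

With `num_images=2` this reproduces the query string used in the example, so it can be passed to `model.chat` unchanged alongside the stacked image tensor.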

📄 License

The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English) or the application form (Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.
