InternLM-XComposer2_Enhanced: An Open-Source Large Vision-Language Model - Freely Achieve Image-Text Understanding and Creation

Internlm XComposer2 Enhanced

Developed by Coobiw

A vision-language large model developed based on InternLM2 with exceptional text-image understanding and creation capabilities

Text-to-Image

PyTorch

Open Source License: Other #Interleaved text-image creation #Multimodal understanding #Vision-language model

Downloads 14

Release Time: 2/13/2025

Model Overview

InternLM-XComposer2 is a vision-language large model (VLLM) developed based on InternLM2, featuring exceptional text-image understanding and creation capabilities. It includes two versions: InternLM-XComposer2-VL (a multimodal pre-trained model) and InternLM-XComposer2 (a vision-language model fine-tuned specifically for free-form interleaved text-image creation tasks).

Model Features

Multimodal understanding and creation

Features exceptional text-image understanding and creation capabilities, supporting free-form interleaved text-image creation

Dual-version models

Provides both VL pre-trained model and fine-tuned model optimized for text-image creation

Efficient inference

Supports batch training and flash-attn acceleration

Model Capabilities

Image understanding

Text generation

Interleaved text-image creation

Visual question answering

Use Cases

Content creation

Text-image blog creation

Automatically generates detailed descriptions and accompanying text content based on images

Generates natural language descriptions that match the image content

Intelligent Q&A

Visual question answering

Answers various questions about image content

Accurately understands image content and provides relevant answers

🚀 InternLM-XComposer2

InternLM-XComposer2 is a vision-language large model (VLLM) based on InternLM2. It is designed for advanced text-image comprehension and composition, offering strong performance in multimodal scenarios.


[💻Github Repo](https://github.com/InternLM/InternLM-XComposer) [Paper](https://arxiv.org/abs/2401.16420)

🚀 Quick Start

This repo is based on the official InternLM-XComposer2 release. It supports batchified training and flash-attn for acceleration. Feel free to try it out and share feedback!
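
Since flash-attn is a separate package, it can be useful to confirm that it is importable before counting on the acceleration. The snippet below is a minimal sketch and not part of this repo: the helper names `has_module` and `has_flash_attn` are hypothetical, and the only assumption is that the package exposes the top-level module `flash_attn`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module `name` can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def has_flash_attn() -> bool:
    # Assumption: the flash-attn package is importable as `flash_attn`.
    return has_module("flash_attn")


if __name__ == "__main__":
    status = "available" if has_flash_attn() else "not installed"
    print(f"flash-attn: {status}")
```

The check only looks the module up; it does not verify that the installed build matches the local CUDA and PyTorch versions.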

We provide a simple example to show how to use InternLM-XComposer with 🤗 Transformers.

import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True)

# '<ImageHere>' marks where the image is placed in the prompt
query = '<ImageHere>Please describe this image in detail.'
image = './image1.webp'
# torch.autocast replaces the deprecated torch.cuda.amp.autocast
with torch.autocast(device_type='cuda'):
    response, _ = model.chat(tokenizer, query=query, image=image, history=[], do_sample=False)
print(response)
# Example output:
# The image features a quote by Oscar Wilde, "Live life with no excuses, travel with no regret,"
# set against a backdrop of a breathtaking sunset. The sky is painted in hues of pink and orange,
# creating a serene atmosphere. Two silhouetted figures stand on a cliff, overlooking the horizon.
# They appear to be hiking or exploring, embodying the essence of the quote.
# The overall scene conveys a sense of adventure and freedom, encouraging viewers to embrace life without hesitation or regrets.
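
The query in the example above starts with the `<ImageHere>` placeholder. When queries are assembled from user input, a small helper can keep that placeholder in place exactly once. This is a sketch under stated assumptions: `IMAGE_TOKEN` and `build_query` are hypothetical names introduced for illustration, not part of the model's API.

```python
IMAGE_TOKEN = "<ImageHere>"  # placeholder taken from the quick-start query


def build_query(question: str) -> str:
    """Return `question` prefixed with the image placeholder exactly once."""
    # Drop any placeholder the caller already typed so it never appears twice
    text = question.replace(IMAGE_TOKEN, "").strip()
    if not text:
        raise ValueError("question must not be empty")
    return f"{IMAGE_TOKEN}{text}"


if __name__ == "__main__":
    print(build_query("Please describe this image in detail."))
```

The returned string can be passed as `query` to `model.chat` in the same way as the literal query in the quick-start example.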

✨ Features

The InternLM-XComposer2 series is released in two versions:

InternLM-XComposer2-VL: The pretrained VLLM, with InternLM2 as the initialization of the LLM, achieving strong performance on various multimodal benchmarks.
InternLM-XComposer2: The finetuned VLLM for Free-form Interleaved Text-Image Composition.

💻 Usage Examples

Basic Usage

To load the InternLM-XComposer2-VL-7B model using Transformers, use the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
ckpt_path = "internlm/internlm-xcomposer2-vl-7b"
# The tokenizer has no .cuda() method and stays on the CPU; only the model is moved to the GPU.
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
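
The comment in the loading code warns that float32 weights may cause an out-of-memory error. A rough back-of-the-envelope estimate shows why. This is only a sketch: the parameter count of about 7e9 is an assumption read off the "7b" in the checkpoint name, the helper `weight_memory_gib` is hypothetical, and the estimate covers the weights alone (no activations or cache).

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    # Assumption: roughly 7e9 parameters, inferred from the "7b" suffix.
    num_params = 7e9
    for dtype in ("float32", "float16"):
        print(f"{dtype}: ~{weight_memory_gib(num_params, dtype):.1f} GiB")
```

Under this assumption the weights take about 26 GiB in float32 and about 13 GiB in float16, so halving the precision can decide whether the model fits on a single GPU.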

📄 License

The code is licensed under Apache-2.0, while the model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.
