360VL-8B: an open-source multimodal model for image understanding and bilingual dialogue

360VL-8B

Developed by qihoo360

360VL is a multimodal model built on the Llama 3 language model, with image understanding and bilingual dialogue capabilities.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multimodal Dialogue #High-Resolution Image Understanding #Bilingual Support (Chinese-English)

Downloads 22

Release Time: 5/16/2024

Model Overview

360VL is an open-source large multimodal model built on the Llama 3 language model. It uses a globally aware multi-branch projector architecture and supports bilingual (Chinese-English) dialogue and image understanding.

Model Features

Multi-turn Image-Text Dialogue

Accepts text and image inputs together and produces text output, supporting multi-turn visual question answering on a single image.

Bilingual Text Support

Supports bilingual dialogue (Chinese-English), including text recognition in images.

Powerful Image Understanding

Excels at analyzing visual content, efficiently completing tasks such as image information extraction, organization, and summarization.

Fine Image Resolution

Supports higher-resolution image understanding at 672×672.
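
To make the 672×672 figure concrete, here is a minimal sketch of how an arbitrary image could be fitted into a 672×672 square while preserving its aspect ratio. It is only an illustration: the helper `letterbox_geometry` is hypothetical, and the model's own image processor performs the real preprocessing, which may resize, pad, or crop differently.

```python
def letterbox_geometry(width, height, target=672):
    """Return (new_w, new_h, pad_left, pad_top) for fitting an image of
    size width x height into a target x target square without distortion.

    Hypothetical helper for illustration; not part of the 360VL code base.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    # Scale so that the longest side becomes exactly `target` pixels.
    new_w = max(1, round(width * target / longest))
    new_h = max(1, round(height * target / longest))
    # Center the resized image inside the square canvas.
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


# A 1920x1080 screenshot shrinks to 672x378 and is centered vertically.
print(letterbox_geometry(1920, 1080))  # (672, 378, 0, 147)
```

Padding rather than stretching keeps text and objects in the image at their original proportions, which matters for tasks such as reading text in images.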

Model Capabilities

Multimodal Dialogue

Image Understanding

Visual Question Answering

Bilingual Text Processing

Use Cases

Intelligent Customer Service

Product Inquiry

User uploads a product image and asks for product information.

The model can accurately identify the product and provide relevant information.

Education

Image-Based Learning Assistance

Students upload images of study materials and ask related questions.

The model can understand the image content and provide answers.

🚀 360VL

360VL is built on the Llama 3 language model. It is the industry's first open-source large multimodal model based on Llama 3 70B, and it uses a globally aware multi-branch projector architecture for better image understanding.

🚀 Quick Start

The following example shows how to run the 360VL model on a single image:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "qihoo360/360VL-8B"

# Load the model and tokenizer. trust_remote_code is needed because the
# checkpoint ships its own modeling code.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and move it to the GPU (this example assumes CUDA).
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert("RGB")
query = "Who is this cartoon character?"

# Stop generation at the end-of-turn token.
terminators = [tokenizer.convert_tokens_to_ids("<|eot_id|>")]

inputs = model.build_conversation_input_ids(
    tokenizer, query=query, image=image, image_processor=image_processor
)
input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device="cuda", non_blocking=True)

# Greedy decoding: no sampling, a single beam.
output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True,
)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(
    output_ids[:, input_token_len:], skip_special_tokens=True
)[0].strip()
print(outputs)
```
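
The last step of the example slices off the prompt before decoding, because `generate` returns the prompt tokens followed by the newly generated ones. The sketch below shows the same idea on plain Python lists. The function name `extract_new_tokens` and the token ids are made up for illustration, and the explicit terminator check stands in for what `eos_token_id` and `skip_special_tokens=True` handle in the real code.

```python
def extract_new_tokens(output_ids, prompt_len, terminators):
    """Return the generated tokens that follow the prompt, cut at the first terminator.

    Hypothetical helper working on a plain list of token ids.
    """
    new_tokens = output_ids[prompt_len:]  # drop the echoed prompt
    for i, token in enumerate(new_tokens):
        if token in terminators:
            return new_tokens[:i]  # stop before the end-of-turn token
    return new_tokens


EOT = 9  # made-up id standing in for "<|eot_id|>"
sequence = [1, 2, 3, 40, 41, EOT, 0]  # 3 prompt tokens, then the reply
print(extract_new_tokens(sequence, prompt_len=3, terminators={EOT}))  # [40, 41]
```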

✨ Features

360VL offers the following features:

Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.
Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672×672.
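
Since 360VL-8B is fine-tuned from Meta-Llama-3-8B-Instruct, a multi-round conversation about one image can be pictured in the Llama 3 Instruct chat layout. The sketch below is an illustration only: `build_llama3_prompt` is a hypothetical helper, the `<image>` placeholder is an assumption borrowed from LLaVA-style models, and the prompt that 360VL's own `build_conversation_input_ids` produces may differ.

```python
def build_llama3_prompt(turns, image_token="<image>"):
    """Format (role, text) turns in the Llama 3 Instruct chat layout.

    The image placeholder is attached to the first user turn only, matching
    the "several questions about one image" usage. Hypothetical helper; the
    placeholder string is an assumption, not 360VL's documented format.
    """
    parts = ["<|begin_of_text|>"]
    image_added = False
    for role, text in turns:
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unknown role: {role}")
        if role == "user" and not image_added:
            text = f"{image_token}\n{text}"
            image_added = True
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"
        )
    # Leave an open assistant header so the model writes the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


history = [
    ("user", "Who is this cartoon character?"),
    ("assistant", "(model reply to the first question)"),
    ("user", "What color is the character's hat?"),
]
prompt = build_llama3_prompt(history)
print(prompt)
```

Each finished turn ends with `<|eot_id|>`, which is the same token the quick start example passes as the generation terminator.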

📦 Model Zoo

The following 360VL versions have been released:

| Model | Download |
| --- | --- |
| 360VL-8B | 🤗 Hugging Face |
| 360VL-70B | 🤗 Hugging Face |

📊 Performance

| Model | Checkpoints | MMB_T | MMB_D | MMB-CN_T | MMB-CN_D | MMMU_V | MMMU_T | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QWen-VL-Chat | 🤗LINK | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | 🤖LINK | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | 🤗LINK | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | 🤗LINK | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | LINK | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | 🤗LINK | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| SVIT-v1.5-13B | 🤗LINK | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | 🤗LINK | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | 🤗LINK | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Honeybee | LINK | 73.6 | 74.3 | - | - | 36.2 | - | 1976.5 |
| YI-VL-34B | 🤗LINK | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| 360VL-8B | 🤗LINK | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1944.6 |
| 360VL-70B | 🤗LINK | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 2012.3 |
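
To compare models on a single benchmark, the table can be loaded into a small structure and sorted. The sketch below copies only the MMB_T and MME columns from the table above and treats "-" as a missing value. The names `SCORES`, `COLUMNS`, and `rank` are illustrative choices, not part of any 360VL tooling.

```python
# (MMB_T, MME) scores copied from the table above; None marks a "-" entry.
SCORES = {
    "QWen-VL-Chat": (61.8, 1860),
    "mPLUG-Owl2": (66.0, 1786.4),
    "CogVLM": (65.8, 1736.6),
    "Monkey-Chat": (72.4, 1887.4),
    "MM1-7B-Chat": (None, 1858.2),
    "IDEFICS2-8B": (75.7, 1847.6),
    "SVIT-v1.5-13B": (69.1, 1889),
    "LLaVA-v1.5-13B": (69.2, 1826.7),
    "LLaVA-v1.6-13B": (70, 1901),
    "Honeybee": (73.6, 1976.5),
    "YI-VL-34B": (72.4, 2050.2),
    "360VL-8B": (75.3, 1944.6),
    "360VL-70B": (78.1, 2012.3),
}
COLUMNS = {"MMB_T": 0, "MME": 1}


def rank(benchmark, top=3):
    """Return the `top` highest-scoring (model, score) pairs for a benchmark,
    ignoring models with no reported score."""
    idx = COLUMNS[benchmark]
    scored = [
        (name, row[idx]) for name, row in SCORES.items() if row[idx] is not None
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


print(rank("MMB_T"))  # [('360VL-70B', 78.1), ('IDEFICS2-8B', 75.7), ('360VL-8B', 75.3)]
print(rank("MME"))    # [('YI-VL-34B', 2050.2), ('360VL-70B', 2012.3), ('Honeybee', 1976.5)]
```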

📚 Documentation

Model type

360VL-8B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
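
"Auto-regressive" means the model produces one token at a time, each conditioned on everything generated so far. The toy sketch below illustrates that loop with greedy selection, the strategy the quick start example requests with `do_sample=False` and `num_beams=1`. The lookup table `TOY_MODEL` and both function names are invented for illustration; a real model scores the next token with a transformer forward pass instead.

```python
# Toy "model": next-token scores depend only on the last token.
TOY_MODEL = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"c": 0.8, "<eos>": 0.2},
    "c": {"<eos>": 1.0},
}


def toy_scores(tokens):
    """Stand-in for a forward pass: return scores for the next token."""
    return TOY_MODEL[tokens[-1]]


def greedy_decode(next_token_scores, prompt, eos, max_new_tokens=16):
    """Generate tokens one at a time, always picking the highest-scoring one."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        tokens.append(best)  # the new token becomes part of the context
        if best == eos:
            break
    return tokens[len(prompt):]


print(greedy_decode(toy_scores, ["a"], eos="<eos>"))  # ['b', 'c', '<eos>']
```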

Model date

360VL-8B was trained in April 2024.

📄 License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Where to send questions or comments about the model: https://github.com/360CVGroup/360VL

🔗 Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!

[Meta Llama 3](https://github.com/meta-llama/llama3)
[LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
Honeybee: Locality-enhanced Projector for Multimodal LLM
