360VL-70B Open-source Multimodal Model - Free to Use, Enabling Image Understanding and Bilingual Text Processing

360VL 70B

Developed by qihoo360

360VL is an open-source large multimodal model built on the Llama 3 language model, featuring strong image understanding and bilingual (Chinese-English) text support.

Image-Text-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Multimodal Dialogue #High-Resolution Image Understanding #Bilingual Support (Chinese-English)

Downloads 103

Release Time: 5/16/2024

Model Overview

360VL is the industry's first open-source large multimodal model based on Llama3-70B. It features a globally aware multi-branch projector architecture and supports multi-round image-text dialogue and fine-grained image parsing.

Model Features

Multi-Round Image-Text Dialogue

Supports text and images as input and generates text output, enabling multi-round visual Q&A with a single image.

Bilingual Text Support

Supports Chinese and English dialogues, including text recognition in images.

Powerful Image Understanding

Excels at analyzing visual content, efficiently completing tasks such as image information extraction, organization, and summarization.

Fine-Grained Image Parsing

Supports higher-resolution image understanding at 672×672.

Model Capabilities

Visual Question Answering

Image Content Analysis

Chinese-English Text Generation

Image Information Extraction

Multi-Round Dialogue

Use Cases

Visual Question Answering

Image Content Q&A

Users upload an image and ask questions about it; the model answers based on the image content.

Accurately identifies objects, scenes, and text information in images.

Image Analysis

Image Information Extraction

Extracts key information from images and summarizes it.

Efficiently completes the extraction and organization of image information.

🚀 360VL

360VL is developed based on the Llama 3 language model and is the industry's first open-source large multimodal model based on Llama3-70B [[🤗Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)]. It uses a globally aware multi-branch projector architecture, which gives the model stronger image understanding capabilities.

🚀 Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-70B"

# Load the model in half precision; device_map='auto' spreads the weights across the available devices.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision encoder and take its image preprocessor.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor

# Reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"

# Stop generation at the end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Build the prompt token IDs and the preprocessed image tensor.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

# Greedy decoding: no sampling, single beam.
output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens, skipping the prompt.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
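
The Quick Start loads the checkpoint in `torch.float16` with `device_map='auto'`. A rough back-of-the-envelope estimate shows why the weights usually need to be spread across several devices. The sketch below assumes a nominal 70e9 parameters for the language model (an approximation, not a figure from the model card) and counts the weights only; `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: roughly 70e9 parameters for the language model (nominal "70B").
# The vision tower and projector add more on top of this.
LLM_PARAMS = 70e9

# torch.float16 stores each parameter in 2 bytes.
fp16_gib = weight_memory_gib(LLM_PARAMS, bytes_per_param=2)
print(f"float16 weights: ~{fp16_gib:.0f} GiB")
```

Under these assumptions the weights alone come to roughly 130 GiB, before activations and the key-value cache are counted, which is more than a single 80 GB accelerator can hold.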

✨ Features

360VL offers the following features:

Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.
Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672×672.
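
To get a feel for what a 672×672 input means for an arbitrary photo, the sketch below computes how an image would be scaled and centered on a 672×672 canvas while keeping its aspect ratio. This is an illustration only: the model's own `image_processor` performs the real preprocessing, and its exact resizing strategy is not described here. `fit_within` is a hypothetical helper, not part of 360VL.

```python
def fit_within(width: int, height: int, target: int = 672):
    """Scale (width, height) to fit a target x target square, keeping aspect ratio.

    Returns ((new_w, new_h), (pad_left, pad_top)), where the padding offsets
    center the scaled image on the square canvas.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # The longer side is mapped onto the target edge length.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return (new_w, new_h), (pad_left, pad_top)


size, offset = fit_within(1920, 1080)
print(size, offset)  # (672, 378) (0, 147)
```

Note that this helper also upscales images smaller than the target; a real pipeline may choose differently.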

📦 Model Zoo

360VL has released the following versions.

| Model | Download |
| --- | --- |
| 360VL-8B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-8B) |
| 360VL-70B | [🤗 Hugging Face](https://huggingface.co/qihoo360/360VL-70B) |

📈 Performance

| Model | Checkpoints | MMB_T | MMB_D | MMB-CN_T | MMB-CN_D | MMMU_V | MMMU_T | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QWen-VL-Chat | [🤗LINK](https://huggingface.co/Qwen/Qwen-VL-Chat) | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | [🤖LINK](https://www.modelscope.cn/models/iic/mPLUG-Owl2/summary) | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | [🤗LINK](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf) | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | [🤗LINK](https://huggingface.co/echo840/Monkey-Chat) | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | LINK | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | [🤗LINK](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| SVIT-v1.5-13B | [🤗LINK](https://huggingface.co/Isaachhe/svit-v1.5-13b-full) | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.5-13b) | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | [🤗LINK](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b) | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Honeybee | LINK | 73.6 | 74.3 | - | - | 36.2 | - | 1976.5 |
| YI-VL-34B | [🤗LINK](https://huggingface.co/01-ai/Yi-VL-34B) | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| 360VL-8B | [🤗LINK](https://huggingface.co/qihoo360/360VL-8B) | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1944.6 |
| 360VL-70B | [🤗LINK](https://huggingface.co/qihoo360/360VL-70B) | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 2012.3 |

MMB stands for MMBench; the suffixes T, D, and V denote the test, dev, and validation splits. A dash marks a score that was not reported.
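
As a quick way to read the comparison above, the following sketch finds the top-scoring model on each benchmark. The scores are copied from the table, with `None` standing in for entries shown as "-"; `best_per_benchmark` is a hypothetical helper written for this illustration.

```python
BENCHMARKS = ["MMB_T", "MMB_D", "MMB-CN_T", "MMB-CN_D", "MMMU_V", "MMMU_T", "MME"]

# One row per model, in the same column order as BENCHMARKS.
SCORES = {
    "QWen-VL-Chat":   [61.8, 60.6, 56.3, 56.7, 37, 32.9, 1860],
    "mPLUG-Owl2":     [66.0, 66.5, 60.3, 59.5, 34.7, 32.1, 1786.4],
    "CogVLM":         [65.8, 63.7, 55.9, 53.8, 37.3, 30.1, 1736.6],
    "Monkey-Chat":    [72.4, 71, 67.5, 65.8, 40.7, None, 1887.4],
    "MM1-7B-Chat":    [None, 72.3, None, None, 37.0, 35.6, 1858.2],
    "IDEFICS2-8B":    [75.7, 75.3, 68.6, 67.3, 43.0, 37.7, 1847.6],
    "SVIT-v1.5-13B":  [69.1, None, 63.1, None, 38.0, 33.3, 1889],
    "LLaVA-v1.5-13B": [69.2, 69.2, 65, 63.6, 36.4, 33.6, 1826.7],
    "LLaVA-v1.6-13B": [70, 70.7, 68.5, 64.3, 36.2, None, 1901],
    "Honeybee":       [73.6, 74.3, None, None, 36.2, None, 1976.5],
    "YI-VL-34B":      [72.4, 71.1, 70.7, 71.4, 45.1, 41.6, 2050.2],
    "360VL-8B":       [75.3, 73.7, 71.1, 68.6, 39.7, 37.1, 1944.6],
    "360VL-70B":      [78.1, 80.4, 76.9, 77.7, 50.8, 44.3, 2012.3],
}


def best_per_benchmark(scores, benchmarks):
    """Map each benchmark name to the model with the highest reported score."""
    best = {}
    for col, name in enumerate(benchmarks):
        # Ignore models that did not report a score for this benchmark.
        reported = {m: row[col] for m, row in scores.items() if row[col] is not None}
        best[name] = max(reported, key=reported.get)
    return best


best = best_per_benchmark(SCORES, BENCHMARKS)
for name in BENCHMARKS:
    print(f"{name}: {best[name]}")
```

On these numbers, 360VL-70B has the highest score in every column except MME, where YI-VL-34B (2050.2) is ahead of 360VL-70B (2012.3).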

📚 Documentation

Model type:

360VL-70B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct)

Model date:

360VL-70B was trained in May 2024.

📄 License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Where to send questions or comments about the model: https://github.com/360CVGroup/360VL

🔗 Related Projects

This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!

[Meta Llama 3](https://github.com/meta-llama/llama3)
[LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
Honeybee: Locality-enhanced Projector for Multimodal LLM
