🚀 360VL
360VL is built on the Llama3 language model. It is the industry's first open-source large multimodal model based on Llama3-70B, and it uses a globally aware multi-branch projector architecture for better image understanding.
🚀 Quick Start
The following example runs single-image inference with the 360VL-8B checkpoint:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/360VL-8B"

# Load the model and tokenizer; trust_remote_code lets transformers run the
# custom modeling code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and move it to the GPU in half precision.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

image = Image.open("docs/008.jpg").convert('RGB')
query = "Who is this cartoon character?"

# Stop generation at the end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Build the text and image inputs for one conversation turn.
inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)
input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

# Greedy decoding (no sampling, a single beam).
output_ids = model.generate(
    input_ids,
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Drop the prompt tokens and decode only the generated reply.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```
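The last steps of the quick start slice off the first `input_token_len` positions before decoding, which suggests that `output_ids` begins with the prompt tokens. The toy sketch below illustrates that idea with plain Python lists. The token IDs and the `strip_prompt` helper are made up for illustration and are not part of the 360VL API; unlike the quick start, which relies on `skip_special_tokens=True`, the sketch cuts the reply explicitly at the first terminator.

```python
# Toy illustration with made-up token IDs; no model or tokenizer is loaded.
EOT_ID = 9  # hypothetical stand-in for the ID of "<|eot_id|>"


def strip_prompt(output_ids, prompt_len, terminator_ids):
    """Return only the newly generated tokens, cut at the first terminator."""
    generated = output_ids[prompt_len:]  # same idea as output_ids[:, input_token_len:]
    answer = []
    for token_id in generated:
        if token_id in terminator_ids:
            break  # the terminator marks the end of the reply
        answer.append(token_id)
    return answer


prompt = [1, 4, 7, 2]                     # hypothetical prompt token IDs
full_output = prompt + [5, 6, 3, EOT_ID]  # assumed layout: prompt followed by reply
reply = strip_prompt(full_output, len(prompt), {EOT_ID})
print(reply)  # [5, 6, 3]
```

Without the slice, decoding would repeat the user's question in front of the answer.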
✨ Features
360VL offers the following features:
- Multi-round text-image conversations: 360VL can take both text and images as inputs and produce text outputs. Currently, it supports multi-round visual question answering with one image.
- Bilingual text support: 360VL supports conversations in both English and Chinese, including text recognition in images.
- Strong image comprehension: 360VL is adept at analyzing visuals, making it an efficient tool for tasks like extracting, organizing, and summarizing information from images.
- Fine-grained image resolution: 360VL supports image understanding at a higher resolution of 672×672.
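To give a feel for what a fixed 672×672 input means for images of other shapes, here is a minimal sketch of aspect-preserving resizing with padding ("letterboxing"). This is an assumption for illustration only: the source does not say how 360VL's image processor maps an arbitrary image onto this resolution, and in the quick start the model's own `image_processor` handles all preprocessing. The `letterbox_geometry` helper is hypothetical.

```python
TARGET = 672  # side length named in the feature list


def letterbox_geometry(width, height, target=TARGET):
    """Scale (width, height) to fit a target square, preserving aspect ratio.

    Returns (new_width, new_height, pad_left, pad_top).
    This is NOT taken from 360VL; it is one common preprocessing scheme.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)   # the longer side maps to `target`
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2      # center the image horizontally
    pad_top = (target - new_h) // 2       # center the image vertically
    return new_w, new_h, pad_left, pad_top


print(letterbox_geometry(1344, 672))  # (672, 336, 0, 168)
```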
📦 Model Zoo
360VL has released the following versions: 360VL-8B and 360VL-70B.
📊 Performance
| Model | Checkpoints | MMB-T | MMB-D | MMB-CN-T | MMB-CN-D | MMMU-V | MMMU-T | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QWen-VL-Chat | 🤗LINK | 61.8 | 60.6 | 56.3 | 56.7 | 37 | 32.9 | 1860 |
| mPLUG-Owl2 | 🤖LINK | 66.0 | 66.5 | 60.3 | 59.5 | 34.7 | 32.1 | 1786.4 |
| CogVLM | 🤗LINK | 65.8 | 63.7 | 55.9 | 53.8 | 37.3 | 30.1 | 1736.6 |
| Monkey-Chat | 🤗LINK | 72.4 | 71 | 67.5 | 65.8 | 40.7 | - | 1887.4 |
| MM1-7B-Chat | LINK | - | 72.3 | - | - | 37.0 | 35.6 | 1858.2 |
| IDEFICS2-8B | 🤗LINK | 75.7 | 75.3 | 68.6 | 67.3 | 43.0 | 37.7 | 1847.6 |
| SVIT-v1.5-13B | 🤗LINK | 69.1 | - | 63.1 | - | 38.0 | 33.3 | 1889 |
| LLaVA-v1.5-13B | 🤗LINK | 69.2 | 69.2 | 65 | 63.6 | 36.4 | 33.6 | 1826.7 |
| LLaVA-v1.6-13B | 🤗LINK | 70 | 70.7 | 68.5 | 64.3 | 36.2 | - | 1901 |
| Honeybee | LINK | 73.6 | 74.3 | - | - | 36.2 | - | 1976.5 |
| YI-VL-34B | 🤗LINK | 72.4 | 71.1 | 70.7 | 71.4 | 45.1 | 41.6 | 2050.2 |
| 360VL-8B | 🤗LINK | 75.3 | 73.7 | 71.1 | 68.6 | 39.7 | 37.1 | 1944.6 |
| 360VL-70B | 🤗LINK | 78.1 | 80.4 | 76.9 | 77.7 | 50.8 | 44.3 | 2012.3 |
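As a quick way to read the comparison, the sketch below ranks the models on two of the columns: the first MMBench score column and MME. The numbers are copied from the table above; the `rank` helper and the choice of columns are illustrative assumptions, not part of the original evaluation.

```python
# Rows copied from the table above: (model, first MMBench column, MME).
# None marks a cell the table leaves empty ("-").
ROWS = [
    ("QWen-VL-Chat", 61.8, 1860.0),
    ("mPLUG-Owl2", 66.0, 1786.4),
    ("CogVLM", 65.8, 1736.6),
    ("Monkey-Chat", 72.4, 1887.4),
    ("MM1-7B-Chat", None, 1858.2),
    ("IDEFICS2-8B", 75.7, 1847.6),
    ("SVIT-v1.5-13B", 69.1, 1889.0),
    ("LLaVA-v1.5-13B", 69.2, 1826.7),
    ("LLaVA-v1.6-13B", 70.0, 1901.0),
    ("Honeybee", 73.6, 1976.5),
    ("YI-VL-34B", 72.4, 2050.2),
    ("360VL-8B", 75.3, 1944.6),
    ("360VL-70B", 78.1, 2012.3),
]


def rank(rows, column):
    """Sort models by one score column, highest first, skipping empty cells."""
    scored = [(row[0], row[column]) for row in rows if row[column] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


mmb_ranking = rank(ROWS, 1)
mme_ranking = rank(ROWS, 2)
print(mmb_ranking[:3])
print(mme_ranking[:3])
```

On these numbers, 360VL-70B leads the first MMBench column, while YI-VL-34B has the highest MME score and 360VL-70B comes second.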
📚 Documentation
Model type
360VL-8B is an open-source chatbot trained by fine-tuning an LLM on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. Base LLM: [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
Model date
360VL-8B was trained in April 2024.
📄 License
This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.
Where to send questions or comments about the model:
https://github.com/360CVGroup/360VL
🔗 Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!