DolphinVision 72b: An Open-Source Multimodal Model That Reasons About Images Other Popular Models Refuse

Dolphin Vision 72b

Developed by dphn

DolphinVision 72b is a multimodal model that can perform reasoning and commentary on images that other popular models would refuse to process.

Image-to-Text

Transformers

Open Source License: Other  #Multimodal reasoning #Uncensored image analysis #High-precision OCR

Downloads 296

Release Time: 6/28/2024

Model Overview

DolphinVision 72b is an uncensored multimodal model carefully curated and trained by Quan Nguyen, Eric Hartford, and Cognitive Computations, capable of handling multimodal tasks involving images and text.

Model Features

Multimodal capabilities

Can process images and text simultaneously for complex multimodal reasoning.

Uncensored

Can perform reasoning and commentary on images that other popular models would refuse to process.

High performance

Scores close to GPT-4V on VQA v2 and RealWorldQA and exceeds it on MMBench (81.2 vs. 78.1), while trailing it on MMMU (Eval); see the Benchmarks table.

Model Capabilities

Image description

Text generation

OCR recognition

Multimodal reasoning

Use Cases

Image analysis

OCR text recognition

Extracts and recognizes text content from images, including the satirical text in the image shown under Example Output.

Content generation

Image description generation

Generates accurate, detailed descriptive text from images.

🚀 DolphinVision 72b 🐬

DolphinVision 72b is a multimodal model. It is uncensored and capable of reasoning and commenting on images that other popular models would object to. This model is curated and trained by Quan Nguyen (qnguyen3/stablequan), Eric Hartford, and Cognitive Computations.

🔗 Links

Discord: https://discord.gg/h3K4XGj2RH

💖 Sponsors

We thank the sponsors of DolphinVision:

Crusoe Cloud - provided the excellent on-demand 8xH100 node used for training
TensorWave - provided the 8x MI300X node used for evaluations and inference

📦 Model Information

Property	Details
License	other (tongyi-qianwen)
Base Model	Qwen/Qwen2-72B
Tags	generated_from_trainer, axolotl
Datasets	cognitivecomputations/Dolphin-2.9, teknium/OpenHermes-2.5, m-a-p/CodeFeedback-Filtered-Instruction, cognitivecomputations/dolphin-coder, cognitivecomputations/samantha-data, microsoft/orca-math-word-problems-200k, Locutusque/function-calling-chatml, internlm/Agent-FLAN
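
Since the base model is Qwen/Qwen2-72B and the usage example loads weights in float16, a back-of-the-envelope memory estimate can help when sizing hardware. The sketch below is not from the model card: it assumes a nominal 72 billion parameters (taken from the model name, not an exact count) and a hypothetical 80 GB accelerator, and it counts weights only, ignoring activations, the KV cache, and the vision components.

```python
import math

def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of equally sized GPUs whose total memory covers the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)

# Assumption: nominal parameter count implied by the "72b" in the model name.
NOMINAL_PARAMS = 72_000_000_000
# float16 stores each parameter in 2 bytes.
FP16_BYTES = 2
# Assumption: a hypothetical accelerator with 80 GB of memory.
GPU_MEMORY_GB = 80

weights_gb = weight_memory_gb(NOMINAL_PARAMS, FP16_BYTES)
print(f"weights: {weights_gb} GB")
print(f"GPUs needed for weights alone: {min_gpus_for_weights(weights_gb, GPU_MEMORY_GB)}")
```

Under these assumptions the weights alone come to about 144 GB, so a single 80 GB device is not enough; this is consistent with the usage example passing device_map='auto' so that the model can be spread across the available devices.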

💻 Usage Examples

Basic Usage

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cuda')  # or 'cpu'

model_name = 'cognitivecomputations/dolphin-vision-72b'

# create model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

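# tokenize the text on each side of the <image> placeholder separately
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
# -200 is a placeholder id marking the position where the image is inserted
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)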

# image, sample images can be found in images folder
image = Image.open('/path/to/image.png')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
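
The step in the usage example that splits the prompt at `<image>` and joins the pieces with `-200` can be hard to read at a glance. The following is a minimal, torch-free sketch of the same idea. The `toy_tokenize` function is a hypothetical stand-in (one id per character) and not the model's tokenizer, and `build_input_ids` is an illustrative helper name, not part of the model's code.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the usage example above

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id per character (its code point)."""
    return [ord(ch) for ch in text]

def build_input_ids(prompt_text, tokenize=toy_tokenize, image_token="<image>"):
    """Tokenize the text around each image placeholder and splice in IMAGE_TOKEN_INDEX."""
    chunks = [tokenize(chunk) for chunk in prompt_text.split(image_token)]
    ids = list(chunks[0])
    for chunk in chunks[1:]:
        ids.append(IMAGE_TOKEN_INDEX)  # marks where an image belongs
        ids.extend(chunk)
    return ids

ids = build_input_ids("ab<image>\ncd")
print(ids)  # [97, 98, -200, 10, 99, 100]
```

The usage example assumes exactly one `<image>` placeholder (it indexes `text_chunks[0]` and `text_chunks[1]`); the loop here also handles zero or several placeholders, which is a generalization for illustration only and says nothing about whether the model supports multiple images.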

Example Output

<|im_start|>user
: do OCR on all the text
<|im_start|>assistant
: The text in the image reads as follows:

- STOP USING DOCKER
- Applications were not meant to be trapped in virtual jars like digital fireflies
- Years of software development, yet no one questioned why we were putting programs in imaginary boats
- Let's containerize our monolith. I'll just quickly spin up 17 microservices. Statements dreamt up by the utterly deranged
- Observe the whale-sized deception they fooled us with all this time
- These are real images used to explain Docker
- "Yay Matey, I'd like to run my website in a virtual sea"
- They have played us for absolute fools

This text is a satirical commentary on the use of Docker in software development, using humor and exaggeration to question the practice.

📊 Benchmarks

Model	VQA v2	MMBench	MMMU (Eval)	MathVista	RealWorldQA
Dolphin-Vision-72B	83.6	81.2	45.7	47.25	66.4
GPT-4V	84.4	78.1	52.4	-	67.9
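
To make the comparison in the table explicit, the per-benchmark gap can be computed directly. The scores below are copied from the table; the names `SCORES` and `score_deltas` are illustrative only, and the GPT-4V MathVista entry, which the table leaves blank, is stored as None and skipped.

```python
# Scores copied from the benchmark table; None marks a value the table does not report.
SCORES = {
    "VQA v2":      {"Dolphin-Vision-72B": 83.6,  "GPT-4V": 84.4},
    "MMBench":     {"Dolphin-Vision-72B": 81.2,  "GPT-4V": 78.1},
    "MMMU (Eval)": {"Dolphin-Vision-72B": 45.7,  "GPT-4V": 52.4},
    "MathVista":   {"Dolphin-Vision-72B": 47.25, "GPT-4V": None},
    "RealWorldQA": {"Dolphin-Vision-72B": 66.4,  "GPT-4V": 67.9},
}

def score_deltas(scores, model="Dolphin-Vision-72B", baseline="GPT-4V"):
    """Return model minus baseline per benchmark, skipping benchmarks with a missing score."""
    deltas = {}
    for benchmark, row in scores.items():
        if row[model] is None or row[baseline] is None:
            continue
        deltas[benchmark] = round(row[model] - row[baseline], 2)
    return deltas

for benchmark, delta in score_deltas(SCORES).items():
    print(f"{benchmark}: {delta:+.2f}")
```

A positive value means Dolphin-Vision-72B scores higher than GPT-4V on that benchmark. With the table's numbers, only MMBench is positive (+3.10); the largest gap in the other direction is MMMU (Eval) at -6.70.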

📄 License

This model is released under the tongyi-qianwen license, which is listed under the "other" license category.
