VARCO-VISION-14B-HF: Open-Source Vision-Language Model with Text and Image Input, Grounding, Referring, and OCR

VARCO VISION 14B HF

Developed by NCSOFT

VARCO-VISION-14B is an English-Korean vision-language model that takes an image and text as input and generates text output. It supports grounding (localization), referring, and OCR.

Image-to-Text

Transformers

Tags: Multilingual, Multimodal Dialogue, Korean-English Visual Language, OCR Localization

Downloads 449

Release date: 11/27/2024

Model Overview

VARCO-VISION-14B is a multimodal vision-language model for English and Korean. It takes an image and text as input and generates text, and it supports grounding (localization), referring, and optical character recognition (OCR).

Model Features

Multimodal Support

Accepts an image and text as input and generates text output.

Localization Function

Identifies the image regions relevant to an answer and reports them as bounding boxes. The model card calls this task grounding.

Referencing Function

Answers questions about an object whose location the user marks with a bounding box in the prompt. The model card calls this task referring.

OCR Function

Recognizes text within images and returns each recognized phrase together with its bounding box.

Model Capabilities

Image Description

Localization

Referencing

Optical Character Recognition (OCR)

Multimodal Dialogue

Use Cases

Image Understanding

Image Description

Given an image, the model generates a detailed description of the objects and scenes it contains.

Localization

Given an image and a question, the model identifies the relevant locations in the image and includes their bounding boxes in its answer.

Text Recognition

OCR

Given an image that contains text, the model extracts the text and reports the location of each recognized phrase.

🚀 VARCO-VISION-14B-HF

VARCO-VISION-14B is a powerful English-Korean Vision-Language Model (VLM). It accepts a single image and text as input and generates text output. It supports grounding, referring, and OCR (Optical Character Recognition).

Model Information

Property	Details
Developed by	NC Research, Multimodal Generation Team
Technical Report	VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
Blog(Korean)	VARCO-VISION Technical Report Summary
Demo Page	The demo page is no longer available.
Languages	Korean, English
License	CC BY-NC 4.0
Architecture	Follows the architecture of LLaVA-OneVision
Base Model - Language Model	Qwen/Qwen2.5-14B-Instruct
Base Model - Vision Encoder	google/siglip-so400m-patch14-384
LLaVA-NeXT Codebase Model	NCSOFT/VARCO-VISION-14B
Korean VLM Benchmarks	Datasets available in the LLMs-Eval toolkit: NCSOFT/K-MMBench, NCSOFT/K-SEED, NCSOFT/K-MMStar, NCSOFT/K-DTCBench, NCSOFT/K-LLaVA-W. The model can also be evaluated with the VLMEval kit.

⚠️ Important Note

This model is for research purposes only. Commercial use is prohibited.

🚀 Quick Start

To use this model, ensure you have transformers >= 4.45.0 installed.
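One way to check this requirement before loading the model is to compare version strings. The sketch below is illustrative only: `parse_version` and `meets_minimum` are hypothetical helpers, they look only at the leading numeric parts of a version, and in practice the installed version string could be read with `importlib.metadata.version("transformers")`.

```python
def parse_version(version: str) -> tuple:
    """Turn '4.45.0' into (4, 45, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad short versions such as '4.46' so that tuples compare as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.45.0") -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("4.44.2"))  # False
print(meets_minimum("4.45.0"))  # True
```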

import torch
import requests
from PIL import Image
from transformers import LlavaOnevisionForConditionalGeneration, AutoProcessor

model_name = "NCSOFT/VARCO-VISION-14B-HF"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    # Requires the flash-attn package; drop this argument to use the default attention.
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name)
device = model.device

# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

EOS_TOKEN = "<|im_end|>"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(device, torch.float16)

output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
output = processor.decode(output[0][inputs.input_ids.shape[1]:])
if output.endswith(EOS_TOKEN):
    output = output[: -len(EOS_TOKEN)]

output = output.strip()
print(output)
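The last lines of the script do two things: they drop the prompt tokens from the generated sequence (the slice exists because the returned sequence starts with the prompt tokens) and they strip a trailing end-of-turn token. A minimal sketch of the same logic on plain Python values follows; the token ids are made up and `strip_eos` is a hypothetical helper name.

```python
def strip_eos(text: str, eos_token: str = "<|im_end|>") -> str:
    """Remove one trailing EOS marker, then surrounding whitespace."""
    if text.endswith(eos_token):
        text = text[: -len(eos_token)]
    return text.strip()


# Made-up ids: the first three stand for the prompt, the rest for the answer.
prompt_ids = [101, 102, 103]
generated = prompt_ids + [7, 8, 9]
new_ids = generated[len(prompt_ids):]  # same idea as output[0][input_len:]

print(new_ids)                           # [7, 8, 9]
print(strip_eos("Two cats.<|im_end|>"))  # Two cats.
```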

✨ Features

Specialized Features

If a question is based on bounding boxes or requires bounding boxes as an output, please include the special tokens in the input text.

The following special tokens are used to define specific tasks, inputs, and outputs for the model:

<gro>: Indicates that the model's response should include bounding box information.
<ocr>: Specifies OCR tasks for recognizing text within an image.
<char> and </char>: Used to mark a text phrase.
<obj> and </obj>: Used to indicate an object.
<bbox> and </bbox>: Used to represent a bounding box.
<delim>: Separates multiple bounding boxes inside one <bbox>...</bbox> span when a single object or text phrase has several locations.
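To make the roles of the task tokens concrete, the sketch below builds the user text for the grounding and OCR tasks as plain strings and wraps it in the single-image chat structure from the Quick Start. The helper names are hypothetical and not part of the model's API; the string formats follow the examples in this document.

```python
def grounding_prompt(question: str) -> str:
    """Prepend <gro> so the answer is expected to include bounding boxes."""
    return f"<gro>\n{question}"


def ocr_prompt() -> str:
    """The OCR task is requested with the <ocr> token alone."""
    return "<ocr>"


def user_turn(text: str) -> list:
    """Wrap text in a one-message conversation with a single image slot."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image"},
            ],
        }
    ]


conversation = user_turn(grounding_prompt("Describe the image in detail."))
print(conversation[0]["content"][0]["text"])
```

The resulting list can be passed to `processor.apply_chat_template` in the same way as the hand-written conversations.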

Grounding

Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token <gro> to the question.

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
            {"type": "image"},
        ],
    },
]

Expected Output Example:

The image shows <obj>two cats</obj><bbox>0.014, 0.106, 0.51, 0.996<delim>0.51, 0.054, 0.996, 0.787</bbox> lying on <obj>a pink blanket</obj><bbox>0.003, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket, while the cat on the right is lying on its stomach with its head also resting on the blanket. Both cats appear to be relaxed and comfortable. There are <obj>two remote controls</obj><bbox>0.037, 0.141, 0.283, 0.253<delim>0.506, 0.171, 0.581, 0.295</bbox> placed near the cats, one on the left side and one on the right side of the image.
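A grounded answer like the one above can be turned into structured data with a regular expression. The following is a minimal sketch, assuming the output always pairs each `<obj>` phrase with an immediately following `<bbox>` span as in the example; `parse_boxes` and `parse_grounding` are hypothetical helper names.

```python
import re

# One <obj>...</obj> phrase followed directly by its <bbox>...</bbox> span.
OBJ_BBOX = re.compile(r"<obj>(.*?)</obj><bbox>(.*?)</bbox>", re.DOTALL)


def parse_boxes(span: str) -> list:
    """Split a bbox span on <delim> and convert each box to four floats."""
    boxes = []
    for chunk in span.split("<delim>"):
        boxes.append(tuple(float(v) for v in chunk.split(",")))
    return boxes


def parse_grounding(text: str) -> list:
    """Return (phrase, [boxes]) pairs in the order they appear in the text."""
    return [(m.group(1), parse_boxes(m.group(2))) for m in OBJ_BBOX.finditer(text)]


sample = (
    "The image shows <obj>two cats</obj>"
    "<bbox>0.014, 0.106, 0.51, 0.996<delim>0.51, 0.054, 0.996, 0.787</bbox>"
    " lying on <obj>a pink blanket</obj><bbox>0.003, 0.231, 0.999, 0.999</bbox>."
)
for phrase, boxes in parse_grounding(sample):
    print(phrase, boxes)
```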

Referring

VARCO-VISION-14B can answer location-specific questions that use bounding boxes. To perform a referring task, wrap the object of interest in <obj> and </obj> tags and give its location right after it in <bbox> and </bbox> tags. This lets the model understand the context and focus on the object at the specified location. A bounding box is written as (x1, y1, x2, y2): the first two values are the top-left corner and the last two are the bottom-right corner.

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                # Korean prompt, roughly: "How do you use this object?"
                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
            },
            {"type": "image"},
        ],
    },
]

Expected Output Example:

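**이 물건**은 리모컨으로, 주로 텔레비전이나 다른 전자 기기를 원격으로 조작하는 데 사용됩니다. 리모컨에는 다양한 버튼이 있으며, 각 버튼은 채널 변경, 볼륨 조절, 전원 켜기/끄기 등의 기능을 수행합니다. 사용자는 리모컨을 손에 들고 버튼을 누르면, 해당 기기에 신호를 보내 원하는 조작을 할 수 있습니다. 리모컨은 일반적으로 가정이나 사무실에서 편리하게 전자 기기를 조작할 수 있도록 사용됩니다.

(English translation: **This object** is a remote control, mainly used to operate a television or other electronic devices remotely. A remote control has various buttons, and each button performs a function such as changing the channel, adjusting the volume, or turning the power on or off. The user holds the remote control and presses a button, which sends a signal to the device to perform the desired operation. Remote controls are generally used in homes and offices to operate electronic devices conveniently.)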
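The referring format described above can also be produced programmatically from a pixel-space box. This sketch assumes that bounding-box values are fractions of the image width and height, which is how the values in the examples appear (all lie between 0 and 1) but is not stated explicitly here. `normalize_box` and `tag_object` are hypothetical helpers, and the 1000x1000 image size is made up for illustration.

```python
def normalize_box(box_px, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to fractions of the image size (assumed convention)."""
    x1, y1, x2, y2 = box_px
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def tag_object(phrase, box):
    """Format '<obj>phrase</obj><bbox>x1, y1, x2, y2</bbox>' with three decimal places at most."""
    coords = ", ".join(str(round(v, 3)) for v in box)
    return f"<obj>{phrase}</obj><bbox>{coords}</bbox>"


# Top-left corner (39, 138) and bottom-right corner (283, 257) in a made-up 1000x1000 image.
box = normalize_box((39, 138, 283, 257), width=1000, height=1000)
text = tag_object("this object", box) + " How is it used?"
print(text)
```

The resulting string goes into the "text" field of the user message, next to the `{"type": "image"}` entry.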

OCR

To perform Optical Character Recognition (OCR), use the <ocr> token.

image_file = "./assets/ocr_1.png"
raw_image = Image.open(image_file)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "<ocr>"},
            {"type": "image"},
        ],
    },
]

Expected Output Example:

<char>백범로</char><bbox>0.172, 0.266, 0.328, 0.341</bbox>
<char>124번길</char><bbox>0.347, 0.266, 0.512, 0.341</bbox>
<char>Baekbeom-ro</char><bbox>0.171, 0.337, 0.433, 0.392</bbox>
<char>124</char><bbox>0.444, 0.341, 0.508, 0.392</bbox>
<char>만수주공아파트</char><bbox>0.109, 0.531, 0.335, 0.601</bbox>
<char>시흥</char><bbox>0.443, 0.518, 0.522, 0.581</bbox>
<char>시청</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
<char>Mansu</char><bbox>0.102, 0.601, 0.181, 0.648</bbox>
<char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
<char>Apt</char><bbox>0.28, 0.601, 0.327, 0.651</bbox>
<char>42</char><bbox>0.377, 0.601, 0.416, 0.648</bbox>
<char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.625</bbox>
<char>인천대공원</char><bbox>0.43, 0.621, 0.609, 0.684</bbox>
<char>모래내시장역</char><bbox>0.651, 0.59, 0.873, 0.665</bbox>
<char>IncheonGrand</char><bbox>0.432, 0.681, 0.561, 0.723</bbox>
<char>Park</char><bbox>0.564, 0.681, 0.611, 0.723</bbox>
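OCR output in this format can be parsed into phrases and boxes, and the boxes can be scaled back to pixels for drawing or cropping. The sketch below is one possible approach; `parse_ocr` and `to_pixels` are hypothetical helpers, and the pixel conversion assumes the coordinates are fractions of the image width and height, as the values between 0 and 1 suggest.

```python
import re

# One recognized phrase and its bounding-box span per match.
CHAR_BBOX = re.compile(r"<char>(.*?)</char><bbox>(.*?)</bbox>")


def parse_ocr(text):
    """Return a list of (phrase, [boxes]); a span may hold several boxes split by <delim>."""
    results = []
    for phrase, span in CHAR_BBOX.findall(text):
        boxes = [
            tuple(float(v) for v in chunk.split(","))
            for chunk in span.split("<delim>")
        ]
        results.append((phrase, boxes))
    return results


def to_pixels(box, width, height):
    """Scale a normalized box to integer pixel coordinates (assumed convention)."""
    x1, y1, x2, y2 = box
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


sample = (
    "<char>Mansu</char><bbox>0.102, 0.601, 0.181, 0.648</bbox>\n"
    "<char>Apt</char><bbox>0.28, 0.601, 0.327, 0.651</bbox>"
)
for phrase, boxes in parse_ocr(sample):
    # The 1000x1000 image size is made up for illustration.
    print(phrase, [to_pixels(b, 1000, 1000) for b in boxes])
```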

📄 License

This model is licensed under CC BY-NC 4.0.

📚 Documentation

Citing the Model

If you use VARCO-VISION-14B in your research, please cite the following:

@misc{ju2024varcovisionexpandingfrontierskorean,
    title={VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models}, 
    author={Jeongho Ju and Daeyoung Kim and SunYoung Park and Youngjune Kim},
    year={2024},
    eprint={2411.19103},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.19103},
}
