Typhoon2-Vision
Typhoon2-qwen2vl-7b-vision-instruct is a Thai 🇹🇭 vision-language model based on Qwen2-VL-7B-Instruct. It accepts both image and video inputs; however, while the underlying Qwen2-VL handles both image and video tasks, Typhoon2-VL is specifically optimized for image-based applications.
For the technical report, please see our arXiv paper: https://arxiv.org/abs/2412.13702
Quick Start
Here is a code snippet showing how to use the model with the transformers library.
Before running it, install the following dependencies:
pip install torch transformers accelerate pillow
How to Get Started with the Model
Use the code below to get started with the model. The example downloads a photo of Bangkok and asks the model, in Thai, to name the place and country shown; the expected exchange looks like this:
Question: ระบุชื่อสถานที่และประเทศของภาพที่ให้เป็นภาษาไทย ("Identify the name of the place and the country in the given image, in Thai.")
Answer: พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย (The Grand Palace, Bangkok, Thailand)
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"

# Load the model and its processor; device_map="auto" places the weights on the GPU when one is available.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Download the example image of Bangkok.
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style conversation: one image placeholder followed by the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # "Identify the name of the place and the country in the given image, in Thai."
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพที่ให้เป็นภาษาไทย"},
        ],
    }
]

# Render the conversation into the model's prompt format and tokenize it together with the image.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")  # move the inputs to the GPU

# Generate, then strip the prompt tokens so only the newly generated answer is decoded.
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
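The call to model.generate above relies on the model's default generation settings. If you prefer sampled, more varied outputs, the standard transformers generation arguments can be passed as well; the values below are illustrative choices, not recommendations from the Typhoon2 report:

# Optional: sampled decoding instead of the defaults (illustrative values).
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of the default decoding strategy
    temperature=0.7,  # softens or sharpens the output distribution
    top_p=0.9,        # nucleus-sampling cutoff
)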
Processing Multiple Images
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            # "Identify 3 things that the two images have in common."
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]
# Two example images of Bangkok, one per "image" placeholder above.
urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")
# Generate, then strip the prompt tokens so only the newly generated answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
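A general Qwen2-VL processing convention worth keeping in mind (offered here as guidance, not a statement from the model card): the images passed to the processor are matched, in order, to the {"type": "image"} placeholders in the conversation, so the two counts must agree. A small sanity check against the snippet above:

# Count the image placeholders in the conversation and compare with the images list.
num_placeholders = sum(
    1
    for message in conversation
    for part in message["content"]
    if isinstance(part, dict) and part.get("type") == "image"
)
assert num_placeholders == len(images), "pass exactly one PIL image per image placeholder"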
Tips
To balance model quality against compute cost, you can set a minimum and maximum number of image pixels by passing arguments to the processor:
# Each visual token covers a 28x28-pixel patch, so these bounds correspond to
# roughly 128-2560 visual tokens per image.
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
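As a rough, hedged illustration (not from the model card): the processor rescales each image to fit within this pixel budget, so a lower max_pixels yields fewer visual tokens and a shorter prompt. The sketch below reuses model_name and the image from the Quick Start snippet; probe_conversation and the two budget values are illustrative:

# Hypothetical comparison of two pixel budgets (values are illustrative).
small_processor = AutoProcessor.from_pretrained(model_name, max_pixels=512 * 28 * 28)
large_processor = AutoProcessor.from_pretrained(model_name, max_pixels=2560 * 28 * 28)

probe_conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]
prompt = small_processor.apply_chat_template(probe_conversation, add_generation_prompt=True)

small_inputs = small_processor(text=[prompt], images=[image], return_tensors="pt")
large_inputs = large_processor(text=[prompt], images=[image], return_tensors="pt")

# The smaller budget should produce a shorter input sequence (fewer visual tokens).
print(small_inputs.input_ids.shape, large_inputs.input_ids.shape)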
Evaluation (Image)
| Benchmark | Llama-3.2-11B-Vision-Instruct | Qwen2-VL-7B-Instruct | Pathumma-llm-vision-1.0.0 | Typhoon2-qwen2vl-7b-vision-instruct |
|---|---|---|---|---|
| OCRBench (Liu et al., 2024c) | 72.84 / 51.10 | 72.31 / 57.90 | 32.74 / 25.87 | 64.38 / 49.60 |
| MMBench (Dev) (Liu et al., 2024b) | 76.54 / - | 84.10 / - | 19.51 / - | 83.66 / - |
| ChartQA (Masry et al., 2022) | 13.41 / x | 47.45 / 45.00 | 64.20 / 57.83 | 75.71 / 72.56 |
| TextVQA (Singh et al., 2019) | 32.82 / x | 91.40 / 88.70 | 32.54 / 28.84 | 91.45 / 88.97 |
| OCR (TH) (OpenThaiGPT, 2024) | 64.41 / 35.58 | 56.47 / 55.34 | 6.38 / 2.88 | 64.24 / 63.11 |
| M3Exam Images (TH) (Zhang et al., 2023c) | 25.46 / - | 32.17 / - | 29.01 / - | 33.67 / - |
| GQA (TH) (Hudson et al., 2019) | 31.33 / - | 34.55 / - | 10.20 / - | 50.25 / - |
| MTVQ (TH) (Tang et al., 2024b) | 11.21 / 4.31 | 23.39 / 13.79 | 7.63 / 1.72 | 30.59 / 21.55 |
| Average | 37.67 / x | 54.26 / 53.85 | 25.61 / 23.67 | 62.77 / 59.02 |
Note: The first value in each cell is Rouge-L. The second value (after the /) is Accuracy, normalized such that Rouge-L = 100%.
Features
- Model type: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
- Requirement: transformers 4.38.0 or newer (a quick version check is sketched after this list).
- Primary Language(s): Thai 🇹🇭 and English 🇬🇧
- Demo: https://vision.opentyphoon.ai/
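If you are unsure which transformers version is installed, the snippet below is a convenience sketch (not part of the official instructions) for checking it against the requirement above:

import transformers
from packaging import version  # packaging ships as a transformers dependency

# Typhoon2-Vision requires transformers 4.38.0 or newer.
if version.parse(transformers.__version__) < version.parse("4.38.0"):
    raise RuntimeError("please upgrade: pip install -U transformers")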
License
This model is released under the Apache-2.0 license.
Documentation
Intended Uses & Limitations
This model is an instruction-tuned model; however, it is still under development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.
Follow us
https://twitter.com/opentyphoon
Support
https://discord.gg/us5gAYmrxw
Citation
- If you find Typhoon2 useful for your work, please cite it using:
@misc{typhoon2,
  title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
  author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
  year={2024},
  eprint={2412.13702},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13702},
}