Typhoon2-Vision
Typhoon2-qwen2vl-7b-vision-instruct is a Thai 🇹🇭 vision-language model based on Qwen2-VL-7B-Instruct. It accepts both image and video inputs; however, while the underlying Qwen2-VL handles both image and video tasks, Typhoon2-VL is specifically optimized for image-based applications.
For the technical report, please see our arXiv paper: https://arxiv.org/abs/2412.13702
Quick Start
Here is a code snippet showing how to use the model with the transformers library.
Before running it, install the following dependencies:
pip install torch transformers accelerate pillow
How to Get Started with the Model
Use the code below to get started with the model. The example downloads a photo of Bangkok and asks the model, in Thai, to name the place and country shown; the expected exchange looks like this:
Question: ระบุชื่อสถานที่และประเทศของภาพที่ให้เป็นภาษาไทย ("Identify the name of the place and the country in the given image, in Thai.")
Answer: พระบรมมหาราชวัง, กรุงเทพฯ, ประเทศไทย (The Grand Palace, Bangkok, Thailand)
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"

# Load the model and its processor; device_map="auto" places the weights on the GPU when one is available.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Download the example image of Bangkok.
url = "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style conversation: one image placeholder followed by the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # "Identify the name of the place and the country in the given image, in Thai."
            {"type": "text", "text": "ระบุชื่อสถานที่และประเทศของภาพที่ให้เป็นภาษาไทย"},
        ],
    }
]

# Render the conversation into the model's prompt format and tokenize it together with the image.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")  # move the inputs to the GPU

# Generate, then strip the prompt tokens so only the newly generated answer is decoded.
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
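The call to model.generate above relies on the model's default generation settings. If you prefer sampled, more varied outputs, the standard transformers generation arguments can be passed as well; the values below are illustrative choices, not recommendations from the Typhoon2 report:

# Optional: sampled decoding instead of the defaults (illustrative values).
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sample instead of the default decoding strategy
    temperature=0.7,  # softens or sharpens the output distribution
    top_p=0.9,        # nucleus-sampling cutoff
)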
Processing Multiple Images
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
model_name = "scb10x/typhoon2-qwen2vl-7b-vision-instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            # "Identify 3 things that the two images have in common."
            {"type": "text", "text": "ระบุ 3 สิ่งที่คล้ายกันในสองภาพนี้"},
        ],
    }
]
# Two example images of Bangkok, one per "image" placeholder above.
urls = [
    "https://cdn.pixabay.com/photo/2023/05/16/09/15/bangkok-7997046_1280.jpg",
    "https://cdn.pixabay.com/photo/2020/08/10/10/09/bangkok-5477405_1280.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=images, padding=True, return_tensors="pt")
inputs = inputs.to("cuda")
# Generate, then strip the prompt tokens so only the newly generated answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
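A general Qwen2-VL processing convention worth keeping in mind (offered here as guidance, not a statement from the model card): the images passed to the processor are matched, in order, to the {"type": "image"} placeholders in the conversation, so the two counts must agree. A small sanity check against the snippet above:

# Count the image placeholders in the conversation and compare with the images list.
num_placeholders = sum(
    1
    for message in conversation
    for part in message["content"]
    if isinstance(part, dict) and part.get("type") == "image"
)
assert num_placeholders == len(images), "pass exactly one PIL image per image placeholder"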
Tips
To balance model quality against compute cost, you can set a minimum and maximum number of image pixels by passing arguments to the processor:
# Each visual token covers a 28x28-pixel patch, so these bounds correspond to
# roughly 128-2560 visual tokens per image.
min_pixels = 128 * 28 * 28
max_pixels = 2560 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, min_pixels=min_pixels, max_pixels=max_pixels
)
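As a rough, hedged illustration (not from the model card): the processor rescales each image to fit within this pixel budget, so a lower max_pixels yields fewer visual tokens and a shorter prompt. The sketch below reuses model_name and the image from the Quick Start snippet; probe_conversation and the two budget values are illustrative:

# Hypothetical comparison of two pixel budgets (values are illustrative).
small_processor = AutoProcessor.from_pretrained(model_name, max_pixels=512 * 28 * 28)
large_processor = AutoProcessor.from_pretrained(model_name, max_pixels=2560 * 28 * 28)

probe_conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}
]
prompt = small_processor.apply_chat_template(probe_conversation, add_generation_prompt=True)

small_inputs = small_processor(text=[prompt], images=[image], return_tensors="pt")
large_inputs = large_processor(text=[prompt], images=[image], return_tensors="pt")

# The smaller budget should produce a shorter input sequence (fewer visual tokens).
print(small_inputs.input_ids.shape, large_inputs.input_ids.shape)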
Evaluation (Image)
| Benchmark | Llama-3.2-11B-Vision-Instruct | Qwen2-VL-7B-Instruct | Pathumma-llm-vision-1.0.0 | Typhoon2-qwen2vl-7b-vision-instruct |
|---|---|---|---|---|
| OCRBench (Liu et al., 2024c) | 72.84 / 51.10 | 72.31 / 57.90 | 32.74 / 25.87 | 64.38 / 49.60 |
| MMBench (Dev) (Liu et al., 2024b) | 76.54 / - | 84.10 / - | 19.51 / - | 83.66 / - |
| ChartQA (Masry et al., 2022) | 13.41 / x | 47.45 / 45.00 | 64.20 / 57.83 | 75.71 / 72.56 |
| TextVQA (Singh et al., 2019) | 32.82 / x | 91.40 / 88.70 | 32.54 / 28.84 | 91.45 / 88.97 |
| OCR (TH) (OpenThaiGPT, 2024) | 64.41 / 35.58 | 56.47 / 55.34 | 6.38 / 2.88 | 64.24 / 63.11 |
| M3Exam Images (TH) (Zhang et al., 2023c) | 25.46 / - | 32.17 / - | 29.01 / - | 33.67 / - |
| GQA (TH) (Hudson et al., 2019) | 31.33 / - | 34.55 / - | 10.20 / - | 50.25 / - |
| MTVQ (TH) (Tang et al., 2024b) | 11.21 / 4.31 | 23.39 / 13.79 | 7.63 / 1.72 | 30.59 / 21.55 |
| Average | 37.67 / x | 54.26 / 53.85 | 25.61 / 23.67 | 62.77 / 59.02 |
Note: The first value in each cell is Rouge-L. The second value (after the /) is Accuracy, normalized such that Rouge-L = 100%.
Features
- Model type: A 7B instruct decoder-only model with a vision encoder, based on the Qwen2 architecture.
- Requirement: transformers 4.38.0 or newer (a quick version check is sketched after this list).
- Primary Language(s): Thai 🇹🇭 and English 🇬🇧
- Demo: https://vision.opentyphoon.ai/
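If you are unsure which transformers version is installed, the snippet below is a convenience sketch (not part of the official instructions) for checking it against the requirement above:

import transformers
from packaging import version  # packaging ships as a transformers dependency

# Typhoon2-Vision requires transformers 4.38.0 or newer.
if version.parse(transformers.__version__) < version.parse("4.38.0"):
    raise RuntimeError("please upgrade: pip install -U transformers")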
License
This model is released under the Apache-2.0 license.
Documentation
Intended Uses & Limitations
This model is an instruction-tuned model; however, it is still under development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.
Follow us
https://twitter.com/opentyphoon
Support
https://discord.gg/us5gAYmrxw
Citation
- If you find Typhoon2 useful for your work, please cite it using:
@misc{typhoon2,
  title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
  author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
  year={2024},
  eprint={2412.13702},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13702},
}