TaiVisionLM: The First of Its Kind!
This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input!
Developed to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and use for lightning-fast inference without needing any external libraries!
Ready to experience the Traditional Chinese visual language model? Let's go!
Dataset and Model Information
| Property | Details |
| --- | --- |
| Datasets | benchang1110/TaiVision-pretrain-1M-v2.0 |
| Language | Traditional Chinese |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
| Base Model | benchang1110/TaiVisionLM-base-v1 |
Model Details
Model Description
This model is a multimodal large language model that combines SigLIP as its vision encoder with TinyLlama as its language model. A vision projector connects the two modalities.
Its architecture closely resembles PaliGemma.
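To make the architecture concrete, here is a minimal sketch of how the three pieces could be wired together. The class, dimensions, and attribute names below are illustrative assumptions for exposition, not the actual code in `modeling_taivisionlm.py`:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projector: maps SigLIP patch embeddings into the LM's embedding space."""
    def __init__(self, vision_dim: int = 768, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(image_features)

def build_multimodal_inputs(image_features, input_ids, projector, language_model):
    """PaliGemma-style prefixing: projected image tokens are prepended to the text embeddings."""
    image_embeds = projector(image_features)                        # (B, P, text_dim)
    text_embeds = language_model.get_input_embeddings()(input_ids)  # (B, T, text_dim)
    return torch.cat([image_embeds, text_embeds], dim=1)
```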
Here's the summary of the development process:
- Unimodal pretraining
- In this stage, instead of pretraining both modalities from scratch, we leverage the image encoder from google/siglip-base-patch16-224-multilingual and a language model we trained ourselves (https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
- Feature Alignment
- We trained the vision projector and the language model (with LoRA) on 1M image-text pairs to align visual and textual features; see the sketch after this list.
This model is the fine-tuned version of benchang1110/TaiVisionLM-base-v1. We fine-tuned it on 1M image-text pairs, so the fine-tuned model generates longer and more detailed descriptions of the image.
- Task Specific Training
- The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering.
We will carry out this stage once the dataset is ready!
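As a rough illustration of the feature-alignment stage (the projector trained in full, the language model adapted with LoRA), here is a sketch using the `peft` library. The rank, alpha, and target modules are assumptions and may differ from the actual training setup:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the TinyLlama backbone that TaiVisionLM builds on.
language_model = AutoModelForCausalLM.from_pretrained("benchang1110/Taiwan-tinyllama-v1.0-chat")

# Assumed LoRA settings; the attention projections are standard targets for Llama-style blocks.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

language_model = get_peft_model(language_model, lora_config)
language_model.print_trainable_parameters()  # only the LoRA adapters (plus the projector, trained separately) update
```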
Quick Start
How to Get Started with the Model
In Transformers, you can load the model and do inference as follows:
⚠️ Important Note
The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set `trust_remote_code=True` when loading it. This will download the `configuration_taivisionlm.py`, `modeling_taivisionlm.py`, and `processing_taivisionlm.py` files from the repo. You can review the content of these files under the Files and Versions tab and pin specific revisions if you have any concerns about malicious code.
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# Load the config, processor, and model; trust_remote_code is required (see the note above).
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
model.eval()

# Download an example image and ask the model to describe it in Traditional Chinese.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to("cuda")
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
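If you want to guard against the remote code changing between runs, you can pin the repository to a specific commit with the `revision` argument. The hash below is a placeholder, not a real commit:

```python
# Pin the remote code to a reviewed commit (placeholder revision shown).
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    revision="<commit-sha>",  # replace with the commit hash you inspected
    torch_dtype=torch.float16,
).to("cuda")
```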
- Example 1

- TaiVisionLM-base-v1:
Cartoon illustration depicting a sign hanging on the door, presented in a cartoon illustration style at the top and bottom.
- TaiVisionLM-base-v2:
This picture presents a thought-provoking contrast. In the center of the picture, the word "Smile" in white text is written in bold black letters. The text is slightly tilted, adding a sense of movement to the otherwise static picture. The background is a bright white, highlighting the text and establishing its importance in the picture.
The background is not just white; it forms a sharp contrast with the black text, creating a striking visual effect. The text, background, and shape work in harmony, with each element complementing the others to form a harmonious composition.
There is a subtle ridged edge in the bottom right corner of the picture. The ridged outline adds a sense of depth to the picture, attracting the viewer's attention and inviting them to explore the overall context of the picture. The ridged edge complements the overall design of the picture, enhancing the rhythm and energy of the atmosphere.
Overall, this picture is a harmonious combination of colors and shapes, with each element carefully placed to create a visually pleasing composition. The use of black, bold letters and the subtle ridged edge adds an air of mystery, extending its impression to a deeper level, both engaging and thought-provoking.
- Example 2

- TaiVisionLM-base-v1:
This is a photographic work showing the Eiffel Tower in Paris surrounded by a water feature.
- TaiVisionLM-base-v2:
This picture captures an amazing view of the Eiffel Tower, a famous French landmark. The sky is a bright blue, interwoven with the surrounding greenery, forming an amazing composition. The Eiffel Tower, with its towering arch painted in a light brown color, stands proudly on the right side of the picture. Next to it is a river, its calm waters reflecting the blue sky above.
In the distance, the silhouettes of other famous landmarks are visible, including an iconic bridge and a castle-like skyscraper, adding depth and scale to the scene. The trees in the foreground add a touch of green, providing a refreshing contrast to the light brown of the Eiffel Tower and the blue of the sky.
This picture is taken from a perspective looking at the Eiffel Tower from the water, providing a bird's-eye view of the entire landscape. This perspective allows for a comprehensive observation of the Eiffel Tower and its surrounding environment, showcasing its grandeur and the life within it. There is no fictional content in this picture, and all descriptions are based on the elements visible in the picture.
Technical Details
Training Procedure
| Data size | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| --- | --- | --- | --- | --- | --- |
| 1.35M | 4 | 5e-3 | 1 | 1024 | 0 |
We use full-parameter fine-tuning for the projector and apply LoRA to the language model.
We will update the training procedure once we have more resources to train the model on the whole dataset.
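For reference, here is a minimal sketch of how this setup could look in code: the vision encoder frozen, the projector trained in full, and the hyperparameters taken from the table above. The `vision_tower` and `multi_modal_projector` attribute names are assumptions borrowed from PaliGemma-style models, not necessarily the names used in this repo:

```python
from transformers import TrainingArguments

# Freeze the SigLIP encoder; train the projector in full. The language model is handled
# separately via LoRA (see the peft sketch above), so only its adapters remain trainable.
for p in model.vision_tower.parameters():            # assumed attribute name
    p.requires_grad = False
for p in model.multi_modal_projector.parameters():   # assumed attribute name
    p.requires_grad = True

# Hyperparameters from the table above (the max length of 1024 is applied on the processor side).
training_args = TrainingArguments(
    output_dir="taivisionlm-feature-alignment",
    per_device_train_batch_size=4,  # global batch size 4 on a single GPU
    learning_rate=5e-3,
    num_train_epochs=1,
    weight_decay=0.0,
    fp16=True,
)
```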

Compute Infrastructure
- Feature Alignment: 1× V100 (32 GB), approximately 45 GPU hours.