TaiVisionLM: The First of Its Kind!
This is a small (only 1.2B parameters) visual language model on Hugging Face that responds to Traditional Chinese instructions given an image input!
Developed to be compatible with the Transformers library, TaiVisionLM is quick to load, fine-tune, and use for lightning-fast inference without needing any external libraries!
Ready to experience the Traditional Chinese visual language model? Let's go!
Dataset and Model Information
| Property | Details |
| --- | --- |
| Datasets | benchang1110/TaiVision-pretrain-1M-v2.0 |
| Language | Traditional Chinese |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
| Base Model | benchang1110/TaiVisionLM-base-v1 |
Model Details
Model Description
This model is a multimodal large language model that combines SigLIP as its vision encoder with TinyLlama as its language model. A vision projector connects the two modalities.
Its architecture closely resembles PaliGemma.
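To make the architecture concrete, here is a minimal sketch of how the three pieces could be wired together. The class, dimensions, and attribute names below are illustrative assumptions for exposition, not the actual code in `modeling_taivisionlm.py`:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projector: maps SigLIP patch embeddings into the LM's embedding space."""
    def __init__(self, vision_dim: int = 768, text_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the SigLIP encoder
        return self.proj(image_features)

def build_multimodal_inputs(image_features, input_ids, projector, language_model):
    """PaliGemma-style prefixing: projected image tokens are prepended to the text embeddings."""
    image_embeds = projector(image_features)                        # (B, P, text_dim)
    text_embeds = language_model.get_input_embeddings()(input_ids)  # (B, T, text_dim)
    return torch.cat([image_embeds, text_embeds], dim=1)
```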
Here's the summary of the development process:
- Unimodal pretraining
- In this stage, instead of pretraining both modalities from scratch, we leverage the image encoder from google/siglip-base-patch16-224-multilingual and a language model we trained ourselves (https://huggingface.co/benchang1110/Taiwan-tinyllama-v1.0-chat).
- Feature Alignment
- We trained the vision projector and the language model (with LoRA) on 1M image-text pairs to align visual and textual features; see the sketch after this list.
This model is the fine-tuned version of benchang1110/TaiVisionLM-base-v1. We fine-tuned it on 1M image-text pairs, so the fine-tuned model generates longer and more detailed descriptions of the image.
- Task Specific Training
- The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering.
We will carry out this stage once the dataset is ready!
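As a rough illustration of the feature-alignment stage (the projector trained in full, the language model adapted with LoRA), here is a sketch using the `peft` library. The rank, alpha, and target modules are assumptions and may differ from the actual training setup:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the TinyLlama backbone that TaiVisionLM builds on.
language_model = AutoModelForCausalLM.from_pretrained("benchang1110/Taiwan-tinyllama-v1.0-chat")

# Assumed LoRA settings; the attention projections are standard targets for Llama-style blocks.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

language_model = get_peft_model(language_model, lora_config)
language_model.print_trainable_parameters()  # only the LoRA adapters (plus the projector, trained separately) update
```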
Quick Start
How to Get Started with the Model
In Transformers, you can load the model and do inference as follows:
⚠️ Important Note
The TaiVisionLM model is not yet natively integrated into the Transformers library, so you need to set `trust_remote_code=True` when loading it. This will download the `configuration_taivisionlm.py`, `modeling_taivisionlm.py`, and `processing_taivisionlm.py` files from the repo. You can review the content of these files under the Files and Versions tab and pin specific revisions if you have any concerns about malicious code.
```python
from transformers import AutoProcessor, AutoModelForCausalLM, AutoConfig
from PIL import Image
import requests
import torch

# Load the config, processor, and model; trust_remote_code is required (see the note above).
config = AutoConfig.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("benchang1110/TaiVisionLM-base-v2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda")
model.eval()

# Download an example image and ask the model to describe it in Traditional Chinese.
url = "https://media.wired.com/photos/598e35fb99d76447c4eb1f28/master/pass/phonepicutres-TA.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
text = "描述圖片"  # "Describe the image"
inputs = processor(text=text, images=image, return_tensors="pt", padding=False).to("cuda")
outputs = processor.tokenizer.decode(model.generate(**inputs, max_length=512)[0])
print(outputs)
```
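If you want to guard against the remote code changing between runs, you can pin the repository to a specific commit with the `revision` argument. The hash below is a placeholder, not a real commit:

```python
# Pin the remote code to a reviewed commit (placeholder revision shown).
model = AutoModelForCausalLM.from_pretrained(
    "benchang1110/TaiVisionLM-base-v2",
    trust_remote_code=True,
    revision="<commit-sha>",  # replace with the commit hash you inspected
    torch_dtype=torch.float16,
).to("cuda")
```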
- Example 1

- TaiVisionLM-base-v1:
Cartoon illustration depicting a sign hanging on the door, presented in a cartoon illustration style at the top and bottom.
- TaiVisionLM-base-v2:
This picture presents a thought-provoking contrast. In the center of the picture, the word "Smile" in white text is written in bold black letters. The text is slightly tilted, adding a sense of movement to the otherwise static picture. The background is a bright white, highlighting the text and establishing its importance in the picture.
The background is not just white; it forms a sharp contrast with the black text, creating a striking visual effect. The text, background, and shape work in harmony, with each element complementing the others to form a harmonious composition.
There is a subtle ridged edge in the bottom right corner of the picture. The ridged outline adds a sense of depth to the picture, attracting the viewer's attention and inviting them to explore the overall context of the picture. The ridged edge complements the overall design of the picture, enhancing the rhythm and energy of the atmosphere.
Overall, this picture is a harmonious combination of colors and shapes, with each element carefully placed to create a visually pleasing composition. The use of black, bold letters and the subtle ridged edge adds an air of mystery, extending its impression to a deeper level, both engaging and thought-provoking.
- Example 2

- TaiVisionLM-base-v1:
This is a photographic work showing the Eiffel Tower in Paris surrounded by a water feature.
- TaiVisionLM-base-v2:
This picture captures an amazing view of the Eiffel Tower, a famous French landmark. The sky is a bright blue, interwoven with the surrounding greenery, forming an amazing composition. The Eiffel Tower, with its towering arch painted in a light brown color, stands proudly on the right side of the picture. Next to it is a river, its calm waters reflecting the blue sky above.
In the distance, the silhouettes of other famous landmarks are visible, including an iconic bridge and a castle-like skyscraper, adding depth and scale to the scene. The trees in the foreground add a touch of green, providing a refreshing contrast to the light brown of the Eiffel Tower and the blue of the sky.
This picture is taken from a perspective looking at the Eiffel Tower from the water, providing a bird's-eye view of the entire landscape. This perspective allows for a comprehensive observation of the Eiffel Tower and its surrounding environment, showcasing its grandeur and the life within it. There is no fictional content in this picture, and all descriptions are based on the elements visible in the picture.
Technical Details
Training Procedure
| Data size | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
| --- | --- | --- | --- | --- | --- |
| 1.35M | 4 | 5e-3 | 1 | 1024 | 0 |
We use full-parameter fine-tuning for the projector and apply LoRA to the language model.
We will update the training procedure once we have more resources to train the model on the whole dataset.
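For reference, here is a minimal sketch of how this setup could look in code: the vision encoder frozen, the projector trained in full, and the hyperparameters taken from the table above. The `vision_tower` and `multi_modal_projector` attribute names are assumptions borrowed from PaliGemma-style models, not necessarily the names used in this repo:

```python
from transformers import TrainingArguments

# Freeze the SigLIP encoder; train the projector in full. The language model is handled
# separately via LoRA (see the peft sketch above), so only its adapters remain trainable.
for p in model.vision_tower.parameters():            # assumed attribute name
    p.requires_grad = False
for p in model.multi_modal_projector.parameters():   # assumed attribute name
    p.requires_grad = True

# Hyperparameters from the table above (the max length of 1024 is applied on the processor side).
training_args = TrainingArguments(
    output_dir="taivisionlm-feature-alignment",
    per_device_train_batch_size=4,  # global batch size 4 on a single GPU
    learning_rate=5e-3,
    num_train_epochs=1,
    weight_decay=0.0,
    fp16=True,
)
```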

Compute Infrastructure
- Feature Alignment: 1× V100 (32 GB), approximately 45 GPU hours.