otpensource-vision
otpensource-vision is a Vision-Language model fine-tuned from Bllossom/llama-3.2-Korean-Bllossom-AICA-5B, designed to perform a variety of tasks that combine text and images in Korean and English.
🚀 Quick Start
otpensource-vision combines the strengths of a language model and a vision-language model: it can generate text descriptions for images, and it can also handle standard natural language processing tasks from text-only input.
✨ Features
- Built on Bllossom: Fine-tuned from llama-3.2-Korean-Bllossom-AICA-5B, it retains the strengths of both a language model and a vision-language model.
- Supports Vision-Language tasks: Can generate text information from images or perform natural language processing tasks from text-only input.
- Trained on fashion data: Using the Korean fashion dataset (otpensource_data), it has been trained to extract information such as clothing category, color, season, and features.
- Commercially usable: Released under the CC-BY-4.0 license, which permits commercial use.
📚 Documentation
Model Details
Training Data
The dataset used for model training:
- otpensource_dataset:
  - Approximately 9,000 fashion records.
  - Optimized for Vision-Language training; each record includes clothing category, color, season, features, and an image URL (see the inspection sketch below).
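As a quick sanity check, the dataset can be inspected with the `datasets` library. This is a minimal sketch: the Hub id `otpensource/otpensource_dataset` and the exact record layout are assumptions based on the description above, not confirmed by this card.

```python
# Minimal sketch for inspecting the training data. The Hub id below is an
# assumption based on the dataset name in this card, not a confirmed location.
from datasets import load_dataset

ds = load_dataset("otpensource/otpensource_dataset", split="train")
print(len(ds))   # expected: roughly 9,000 records
print(ds[0])     # expected fields: clothing category, color, season, features, image URL
```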
Training Method
- Base Model: Bllossom/llama-3.2-Korean-Bllossom-AICA-5B
- GPU Requirement: An A100 40GB or better is recommended.
- Optimization: Jointly trained on Vision-Language tasks and Korean text tasks (a training sketch follows below).
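Since the card notes below that this model was trained with Unsloth and TRL, the fine-tuning loop plausibly looked like the following sketch. Everything here is an assumption: the hyperparameters, the prepared `train_dataset`, and the 4-bit loading are illustrative placeholders, not the actual training recipe.

```python
# Fine-tuning sketch assuming Unsloth's FastVisionModel and TRL's SFTTrainer.
# Hyperparameters and the prepared train_dataset are illustrative placeholders.
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit so it fits comfortably on a single A100 40GB.
model, tokenizer = FastVisionModel.from_pretrained(
    "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language layers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
FastVisionModel.for_training(model)

train_dataset = ...  # ~9,000 fashion records converted to chat format (placeholder)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        output_dir="outputs",
        remove_unused_columns=False,    # keep image columns for the collator
        dataset_text_field="",          # formatting is handled by the collator
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()
```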
Key Use Cases
Vision-Language Tasks
- Image Analysis
- Extracts information about clothing categories, colors, seasons, and features from the input image and returns it in JSON format.
- Example:
```json
{
  "category": "Trench coat",
  "gender": "Female",
  "season": "SS",
  "color": "Navy",
  "material": "",
  "feature": "Trench coat"
}
```
- Language Model Tasks
  - With text-only input, it performs natural language processing tasks such as question answering, text summarization, and sentiment analysis (see the text-only sketch below).
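A minimal text-only sketch, reusing `model` and `processor` as loaded in the Usage Examples section below; the prompt itself is illustrative.

```python
# Text-only inference: no image is passed, so the model acts as a plain
# Korean/English language model. Assumes `model` and `processor` are loaded
# as in the usage example below.
messages = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Summarize the key characteristics of a trench coat in two sentences.'}
    ]}
]
input_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```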
Training and Performance
LogicKor Benchmark Performance (Bllossom-based model)
| Category | Single Turn | Multi Turn |
| --- | --- | --- |
| Reasoning | 6.57 | 5.29 |
| Math | 6.43 | 6.29 |
| Writing | 9.14 | 8.71 |
| Coding | 8.00 | 9.14 |
| Understanding | 8.14 | 9.29 |
| Grammar | 6.71 | 4.86 |
Training Configuration
- Model Size: 5B parameters
- Training Data Size: Approximately 9,000 vision-language samples
- Evaluation Results: High accuracy and efficiency on fashion-related tasks.
💻 Usage Examples
Basic Usage
```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor
import torch
from PIL import Image
import requests

# Load the model in bfloat16 and place it automatically across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    'otpensource-vision',
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
processor = MllamaProcessor.from_pretrained('otpensource-vision')

# Fetch a sample product image.
url = "https://image.msscdn.net/thumbnails/images/prd_img/20240710/4242307/detail_4242307_17205916382801_big.jpg?w=1200"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that pairs the image with an instruction.
messages = [
    {'role': 'user', 'content': [
        {'type': 'image'},
        {'type': 'text', 'text': 'Please provide information about this clothing in JSON format.'}
    ]}
]
input_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    images=image,   # the processor expects `images`, not `image`
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Sample with a low temperature for stable, well-formed JSON output.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
print(processor.decode(output[0]))
```
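To isolate the model's reply from the echoed prompt and parse it, something like the following sketch works; it assumes the model answers with a bare JSON object, as in the example output above.

```python
import json

# Decode only the newly generated tokens, skipping the echoed prompt.
generated = output[0][inputs["input_ids"].shape[-1]:]
answer = processor.decode(generated, skip_special_tokens=True)

# Assumes the reply is a bare JSON object as in the example output above;
# add error handling if the model wraps it in extra text.
info = json.loads(answer)
print(info["category"], info["color"])
```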
📄 License
The model is released under the CC-BY-4.0 license, which permits commercial use.
Uploaded fine-tuned model
- Developed by: hateslopacademy
- License: apache-2.0
- Finetuned from model: Bllossom/llama-3.2-Korean-Bllossom-AICA-5B
This mllama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
