otpensource-vision
otpensource-vision is a Vision-Language model fine-tuned from Bllossom/llama-3.2-Korean-Bllossom-AICA-5B, designed to perform a variety of tasks that combine text and images in Korean and English.
🚀 Quick Start
otpensource-vision combines the strengths of a language model and a vision-language model: it can generate text descriptions for images, and it can also handle standard natural language processing tasks from text-only input.
✨ Features
- Built on Bllossom: Fine-tuned from llama-3.2-Korean-Bllossom-AICA-5B, it retains the strengths of both a language model and a vision-language model.
- Supports Vision-Language tasks: Can generate text information from images or perform natural language processing tasks from text-only input.
- Trained on fashion data: Using the Korean fashion dataset (otpensource_data), it has been trained to extract information such as clothing category, color, season, and features.
- Commercially usable: Released under the CC-BY-4.0 license, which permits commercial use.
📚 Documentation
Model Details
Training Data
The dataset used for model training:
- otpensource_dataset:
  - Approximately 9,000 fashion records.
  - Optimized for Vision-Language training; each record includes clothing category, color, season, features, and an image URL (see the inspection sketch below).
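As a quick sanity check, the dataset can be inspected with the `datasets` library. This is a minimal sketch: the Hub id `otpensource/otpensource_dataset` and the exact record layout are assumptions based on the description above, not confirmed by this card.

```python
# Minimal sketch for inspecting the training data. The Hub id below is an
# assumption based on the dataset name in this card, not a confirmed location.
from datasets import load_dataset

ds = load_dataset("otpensource/otpensource_dataset", split="train")
print(len(ds))   # expected: roughly 9,000 records
print(ds[0])     # expected fields: clothing category, color, season, features, image URL
```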
Training Method
- Base Model: Bllossom/llama-3.2-Korean-Bllossom-AICA-5B
- GPU Requirement: An A100 40GB or better is recommended.
- Optimization: Jointly trained on Vision-Language tasks and Korean text tasks (a training sketch follows below).
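Since the card notes below that this model was trained with Unsloth and TRL, the fine-tuning loop plausibly looked like the following sketch. Everything here is an assumption: the hyperparameters, the prepared `train_dataset`, and the 4-bit loading are illustrative placeholders, not the actual training recipe.

```python
# Fine-tuning sketch assuming Unsloth's FastVisionModel and TRL's SFTTrainer.
# Hyperparameters and the prepared train_dataset are illustrative placeholders.
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit so it fits comfortably on a single A100 40GB.
model, tokenizer = FastVisionModel.from_pretrained(
    "Bllossom/llama-3.2-Korean-Bllossom-AICA-5B",
    load_in_4bit=True,
)

# Attach LoRA adapters to both the vision and language layers.
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)
FastVisionModel.for_training(model)

train_dataset = ...  # ~9,000 fashion records converted to chat format (placeholder)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        output_dir="outputs",
        remove_unused_columns=False,    # keep image columns for the collator
        dataset_text_field="",          # formatting is handled by the collator
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()
```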
Key Use Cases
Vision-Language Tasks
- Image Analysis
- Extracts information about clothing categories, colors, seasons, and features from the input image and returns it in JSON format.
- Example:
```json
{
  "category": "Trench coat",
  "gender": "Female",
  "season": "SS",
  "color": "Navy",
  "material": "",
  "feature": "Trench coat"
}
```
- Language Model Tasks
  - With text-only input, it performs natural language processing tasks such as question answering, text summarization, and sentiment analysis (see the text-only sketch below).
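A minimal text-only sketch, reusing `model` and `processor` as loaded in the Usage Examples section below; the prompt itself is illustrative.

```python
# Text-only inference: no image is passed, so the model acts as a plain
# Korean/English language model. Assumes `model` and `processor` are loaded
# as in the usage example below.
messages = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Summarize the key characteristics of a trench coat in two sentences.'}
    ]}
]
input_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=input_text, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```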
Training and Performance
LogicKor Benchmark Performance (Bllossom-based model)
| Category | Single Turn | Multi Turn |
| --- | --- | --- |
| Reasoning | 6.57 | 5.29 |
| Math | 6.43 | 6.29 |
| Writing | 9.14 | 8.71 |
| Coding | 8.00 | 9.14 |
| Understanding | 8.14 | 9.29 |
| Grammar | 6.71 | 4.86 |
Training Configuration
- Model Size: 5B parameters
- Training Data Size: Approximately 9,000 vision-language samples
- Evaluation Results: High accuracy and efficiency on fashion-related tasks.
💻 Usage Examples
Basic Usage
```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor
import torch
from PIL import Image
import requests

# Load the model in bfloat16 and place it automatically across available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    'otpensource-vision',
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
processor = MllamaProcessor.from_pretrained('otpensource-vision')

# Fetch a sample product image.
url = "https://image.msscdn.net/thumbnails/images/prd_img/20240710/4242307/detail_4242307_17205916382801_big.jpg?w=1200"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt that pairs the image with an instruction.
messages = [
    {'role': 'user', 'content': [
        {'type': 'image'},
        {'type': 'text', 'text': 'Please provide information about this clothing in JSON format.'}
    ]}
]
input_text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    images=image,   # the processor expects `images`, not `image`
    text=input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to(model.device)

# Sample with a low temperature for stable, well-formed JSON output.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
print(processor.decode(output[0]))
```
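To isolate the model's reply from the echoed prompt and parse it, something like the following sketch works; it assumes the model answers with a bare JSON object, as in the example output above.

```python
import json

# Decode only the newly generated tokens, skipping the echoed prompt.
generated = output[0][inputs["input_ids"].shape[-1]:]
answer = processor.decode(generated, skip_special_tokens=True)

# Assumes the reply is a bare JSON object as in the example output above;
# add error handling if the model wraps it in extra text.
info = json.loads(answer)
print(info["category"], info["color"])
```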
📄 License
The model is released under the CC-BY-4.0 license, which permits commercial use.
Uploaded fine-tuned model
- Developed by: hateslopacademy
- License: apache-2.0
- Finetuned from model: Bllossom/llama-3.2-Korean-Bllossom-AICA-5B
This mllama model was trained 2x faster with Unsloth and Hugging Face's TRL library.
