# unsloth/Llama-3.2-11B-Vision-Instruct (Fine-Tuned)

This model, fine-tuned from the base unsloth/Llama-3.2-11B-Vision-Instruct, is designed for vision-language tasks. It has enhanced instruction-following capabilities, making it suitable for various multimodal applications.
## Quick Start

This fine-tuned model is ready to use for vision-language tasks. You can start using it right away with the inference example provided below.
## Features

- 2x Faster Training: Utilizes the Unsloth framework to speed up fine-tuning, achieving 2x faster training.
- Multimodal Capabilities: Enhanced to handle complex vision-language interactions effectively.
- Instruction Optimization: Tailored to better understand and execute instructions, improving overall performance on instruction-following tasks.
## Installation

No specific installation steps are provided for this model. The usage example below relies on the standard Hugging Face stack: `transformers`, `torch`, `accelerate`, and `pillow` (for example, `pip install transformers torch accelerate pillow`).
## Usage Examples

### Basic Usage
```python
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Load the processor and model; the processor handles both image and text inputs
model_id = "Daemontatox/finetuned-llama-3.2-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Pair a local image with a text instruction using the model's chat template
image = Image.open("sunset_over_mountains.jpg")  # replace with your own image path
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(image, prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
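
### Advanced Usage

If GPU memory is tight, the same model can be loaded in 4-bit precision. This is a minimal sketch, not part of the original card: it assumes the `bitsandbytes` and `accelerate` packages are installed and uses the standard `BitsAndBytesConfig` quantization path in `transformers`.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "Daemontatox/finetuned-llama-3.2-vision-instruct"

# NF4 4-bit quantization with bfloat16 compute shrinks the 11B model's memory footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Generation then works exactly as in the basic example above; only the loading step changes.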
## Documentation

### Model Overview

This model is fine-tuned from the unsloth/Llama-3.2-11B-Vision-Instruct base. It is optimized for vision-language tasks and has improved instruction-following capabilities. The fine-tuning was completed 2x faster using the Unsloth framework in combination with Hugging Face's TRL library, ensuring efficient training while maintaining high performance.
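
For reference, the sketch below shows how a Llama 3.2 Vision model is typically LoRA fine-tuned with Unsloth and TRL. It is an illustrative assumption rather than the author's actual recipe: the training dataset, LoRA settings, and hyperparameters are placeholders, and exact argument names can vary between unsloth/trl versions.

```python
from unsloth import FastVisionModel
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit to fit training on a single GPU (assumed setup)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach LoRA adapters; tuning only the language layers here is a placeholder choice
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,  # placeholder: chat-formatted image/text samples
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        max_steps=100,
        output_dir="outputs",
        # settings needed when feeding vision data through the Unsloth collator
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()
```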
### Key Information

| Property | Details |
|----------|---------|
| Developed by | Daemontatox |
| Base Model | unsloth/Llama-3.2-11B-Vision-Instruct |
| License | Apache-2.0 |
| Language | English (en) |
| Frameworks Used | Hugging Face Transformers, Unsloth, and TRL |
### Performance and Use Cases

This model is suitable for applications such as:

- Vision-based text generation and description tasks
- Instruction-following in multimodal contexts
- General-purpose text generation with enhanced reasoning
### Evaluation Results

#### Open LLM Leaderboard Evaluation Results
Detailed results can be found here!
Summarized results can be found here!
| Metric | Value (%) |
|--------|-----------|
| Average | 24.21 |
| IFEval (0-Shot) | 50.64 |
| BBH (3-Shot) | 29.79 |
| MATH Lvl 5 (4-Shot) | 16.24 |
| GPQA (0-Shot) | 8.84 |
| MuSR (0-Shot) | 8.60 |
| MMLU-PRO (5-Shot) | 31.14 |
## License

This model is released under the Apache-2.0 license.