LlamaV-o1 Open-source Multimodal Large Language Model - Free Deployment to Assist Complex Visual Reasoning Tasks

Llamav O1

Developed by omkarthawakar

LlamaV-o1 is an advanced multimodal large language model specifically designed for complex visual reasoning tasks, optimized through curriculum learning techniques, demonstrating outstanding performance across diverse benchmarks.

Text-to-Image

Safetensors

EnglishOpen Source License:Apache-2.0 #Multimodal Reasoning #Chain-of-Thought Optimization #Visual Question Answering

Downloads 1,406

Release Time : 12/18/2024

Model Overview

LlamaV-o1 is a multimodal large language model based on the Llama architecture, fine-tuned for step-by-step reasoning, capable of handling tasks in visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

Model Features

Multimodal Reasoning Capability

Capable of handling multimodal tasks such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

Structured Reasoning Approach

Employs a structured reasoning approach, providing coherent and accurate explanations for its decisions.

High-Performance Benchmarking

Excels in benchmarks like VRC-Bench, surpassing multiple open-source and proprietary models.

Model Capabilities

Visual Reasoning

Mathematical Reasoning

Document Understanding

Medical Imaging Analysis

Multimodal Question Answering

Use Cases

Education

Educational Tools

Used to develop intelligent educational tools that help students understand complex concepts.

Content Creation

Content Generation

Used to generate high-quality multimodal content, such as tutorials or reports combining text and images.

Conversational Agents

Intelligent Dialogue Systems

Used to develop intelligent conversational agents capable of understanding both visual and textual inputs.

🚀 LlamaV-o1

LlamaV-o1 is an advanced multimodal large language model (LLM) crafted for complex visual reasoning tasks. It addresses challenges in various domains, offering high - precision results and interpretable decision - making, which is invaluable for both research and practical applications.

🚀 Quick Start

from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "omkarthawakar/LlamaV-o1"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

Please refer to llamav - o1.py for inference.

✨ Features

Model Size: 11 billion parameters.
Architecture: Based on the Llama (Large Language Model Architecture) family.
Fine - Tuning: Enhanced for instruction - following, chain - of - thought reasoning, and robust generalization across tasks.
Applications: Ideal for use cases such as conversational agents, educational tools, content creation, and more.

📚 Documentation

Overview

LlamaV-o1 is an advanced multimodal large language model (LLM) designed for complex visual reasoning tasks. Built on a foundation of cutting - edge curriculum learning and optimized with techniques like Beam Search, LlamaV-o1 demonstrates exceptional performance across diverse benchmarks. It is fine - tuned for step - by - step reasoning, enabling it to tackle tasks in domains such as visual perception, mathematical reasoning, social and cultural contexts, medical imaging, and document understanding.

The model is designed with a focus on interpretability and precision. By leveraging a structured reasoning approach, LlamaV-o1 provides coherent and accurate explanations for its decisions, making it an excellent tool for research and applications requiring high levels of reasoning. With over 4,000 manually verified reasoning steps in its benchmark evaluations, LlamaV-o1 sets a new standard for multimodal reasoning, delivering consistent and reliable results across challenging scenarios.

Model Details

Property	Details
Developed By	MBZUAI
Model Version	v0.1
Release Date	13th January 2025
Training Dataset	Diverse multilingual corpus, including high - quality sources for instruction tuning, chain - of - thought datasets, and general - purpose corpora.
Framework	Pytorch

Intended Use

LlamaV-o1 is designed for a wide range of NLP tasks, including but not limited to:

Text Generation
Sentiment Analysis
Text Summarization
Question Answering
Chain - of - Thought Reasoning

Out - of - Scope Use

The model should not be used in applications requiring high - stakes decision - making, such as healthcare diagnosis, financial predictions, or any scenarios involving potential harm.

Training Procedure

Fine - Tuning: The model was fine - tuned on a dataset optimized for reasoning, coherence, and diversity, leveraging instruction - tuning techniques to enhance usability in downstream applications.
Optimizations: Includes inference scaling optimizations to balance performance and computational efficiency.

Evaluation

Benchmarks

LlamaV-o1 has been evaluated on a suite of benchmark tasks:

Reasoning: [VRC - Bench](https://huggingface.co/datasets/omkarthawakar/VRC - Bench)

Limitations

While the model performs well on a broad range of tasks, it may struggle with:

Highly technical, domain - specific knowledge outside the training corpus.
Generating accurate outputs for ambiguous or adversarial prompts.

Results

Table 1: Comparison of models based on Final Answer accuracy and Reasoning Steps performance on the proposed VRC - Bench. The best results in each case (closed - source and open - source) are in bold. Our LlamaV-o1 achieves superior performance compared to its open - source counterpart (Llava - CoT) while also being competitive against the closed - source models.

Model	GPT - 4o	Claude - 3.5	Gemini - 2.0	Gemini - 1.5 Pro	Gemini - 1.5 Flash	GPT - 4o Mini	Llama - 3.2 Vision	Mulberry	Llava - CoT	LlamaV - o1 (Ours)
Final Answer	59.28	61.35	61.16	61.35	54.99	56.39	48.40	51.90	54.09	56.49
Reasoning Steps	76.68	72.12	74.08	72.12	71.86	74.05	58.37	63.86	66.21	68.93

Training Data

LlamaV-o1 is trained on the [LLaVA - CoT - 100k dataset](https://huggingface.co/datasets/Xkev/LLaVA - CoT - 100k). We have formatted training sample for multi - step reasoning.

Training Procedure

LlamaV-o1 model is finetuned on [llama - recipes](https://github.com/Meta - Llama/llama - recipes). Detailed Training procedure will be coming soon!

Citation

If you find this paper useful, please consider staring 🌟 our Github repo and citing 📑 our paper:

@misc{thawakar2025llamavo1,
      title={LlamaV - o1: Rethinking Step - by - step Visual Reasoning in LLMs}, 
      author={Omkar Thawakar and Dinura Dissanayake and Ketan More and Ritesh Thawkar and Ahmed Heakl and Noor Ahsan and Yuhao Li and Mohammed Zumri and Jean Lahoud and Rao Muhammad Anwer and Hisham Cholakkal and Ivan Laptev and Mubarak Shah and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2501.06186},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06186}, 
}

📄 License

The model is released under the Apache - 2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご