ViCA-7B Open-Source Vision-Language Model - Supports Indoor Video Visual Spatial Reasoning and Complex Tasks

Developed by nkkbr

ViCA-7B is a vision-language model fine-tuned specifically for visual-spatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture and trained using the ViCA-322K dataset, it emphasizes structured spatial annotation and instruction-based complex reasoning tasks.

Video-to-Text

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Indoor Video Understanding #Visual-Spatial Reasoning #Multimodal Question Answering

Release Time: 4/21/2025

Model Overview

ViCA-7B focuses on visual-spatial reasoning in indoor video environments. It handles tasks such as object counting, absolute distance, object size, room dimensions, relative distance, relative direction, path planning, and sequence of appearance.

Model Features

Visual-Spatial Reasoning

Specializes in visual-spatial reasoning tasks in indoor video environments, such as object counting, distance and size estimation.

Multimodal Alignment

Achieves effective fusion of video content and text prompts through a lightweight projector.

Efficient Training

Utilizes DeepSpeed ZeRO-3 Offload and mixed-precision computing for efficient distributed training.

Fixed-Length Visual Tokenization

Each video is uniformly sampled into 64 frames, with each frame encoded into 210 visual tokens, ensuring consistent memory usage and stable optimization across batches.

Model Capabilities

Visual Question Answering

Video Understanding

Spatial Reasoning

Visual-Spatial Cognition

Multimodal Processing

Use Cases

Indoor Navigation Assistant

Indoor Navigation

Assists users in navigating and planning paths within indoor environments.

Robot Planning and Spatial Queries

Robot Path Planning

Provides robots with spatial understanding and path planning capabilities.

Smart Room Arrangement and AR Layout Analysis

Room Arrangement Analysis

Analyzes room layouts and object placements to offer optimization suggestions.

Scene Understanding for Embodied AI Agents

Scene Understanding

Helps AI agents understand spatial relationships in complex indoor scenes.

🚀 ViCA-7B: Visuospatial Cognitive Assistant

ViCA-7B is a vision-language model fine-tuned for visuospatial reasoning in indoor video environments, offering high-performance solutions for visual question-answering tasks.

You may also be interested in our other project, ViCA2.

🚀 Quick Start

This README provides a comprehensive introduction to the ViCA-7B model, including its architecture, training, evaluation, and more. For detailed usage, please refer to the relevant sections below.

✨ Features

Multimodal Capability: Specialized for visuospatial reasoning in indoor video environments, integrating video and text information.
State-of-the-Art Performance: Achieves excellent results on the VSI-Bench benchmark, outperforming many proprietary and open-source models.
Interpretable Reasoning: Supports the generation of step-by-step reasoning traces, enhancing the interpretability of responses.

📚 Documentation

Overview

ViCA-7B is a vision-language model specifically fine-tuned for visuospatial reasoning in indoor video environments. Built upon the LLaVA-Video-7B-Qwen2 architecture, it is trained using our newly proposed ViCA-322K dataset, which emphasizes both structured spatial annotations and complex instruction-based reasoning tasks.

ViCA-7B achieves state-of-the-art performance on [VSI-Bench](https://github.com/vision-x-nyu/thinking-in-space), outperforming proprietary models such as GPT-4o and Gemini-1.5 Pro as well as larger open-source baselines.

ViCA-7B sets a new standard for open-source multimodal spatial reasoning on indoor videos, making it a strong candidate for embodied AI and robotics use cases.

Figure 1: Performance comparison of ViCA-7B and other models on VSI-Bench.

Model Architecture and Training Strategy

ViCA-7B is built upon the [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) framework, using Qwen2-7B as the language backbone and SigLIP as the visual encoder.

Key Training Features

Fixed-Length Visual Tokenization: Each video is uniformly sampled into 64 frames, and each frame is encoded into 210 visual tokens, for a total of 13,440 visual tokens per example. This fixed-length design ensures consistent memory usage and stable optimization across batches.
Multimodal Alignment via Lightweight Projector: A simple MLP-based projector maps visual embeddings into the language embedding space, enabling effective fusion between video content and textual prompts during both training and inference.
Efficient Distributed Training with DeepSpeed: Training is conducted using DeepSpeed ZeRO-3 Offload on 8× NVIDIA H100 80GB GPUs, with full parameter and optimizer state partitioning across devices. This setup supports large batch sizes and minimizes GPU memory overhead.
Mixed-Precision Computation (fp16): We adopt mixed-precision training (fp16) to accelerate computation and reduce memory usage without compromising accuracy. This is combined with ZeRO-3 partitioning to further enhance training scalability.

Training took 55 hours and covered both the base and complex spatial reasoning subsets.
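The fixed-length tokenization described above can be illustrated with a minimal sketch. The function name `uniform_frame_indices` and the example video length of 1,800 frames are assumptions made for illustration; only the 64 frames per video and 210 tokens per frame come from the text, and the actual sampling code used for ViCA-7B may round or space indices differently.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 64) -> list[int]:
    """Return num_samples frame indices spread evenly over [0, total_frames - 1].

    Hypothetical helper: illustrates uniform sampling, not the project's own code.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples == 1:
        return [0]
    # Evenly spaced positions, rounded to the nearest valid frame index.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


FRAMES_PER_VIDEO = 64    # stated in the text
TOKENS_PER_FRAME = 210   # stated in the text

# Example: a video with 1,800 frames (an arbitrary length chosen for illustration).
indices = uniform_frame_indices(total_frames=1800, num_samples=FRAMES_PER_VIDEO)
visual_tokens = FRAMES_PER_VIDEO * TOKENS_PER_FRAME

print(len(indices), indices[0], indices[-1])  # 64 0 1799
print(visual_tokens)                          # 13440
```

Because the number of frames and the tokens per frame are both fixed, every example contributes the same 13,440 visual tokens regardless of the video's length. In this sketch, a video shorter than 64 frames would simply yield repeated indices.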

Training Dynamics

Figure 2: Training loss, learning rate schedule, and gradient norm curves during ViCA-7B fine-tuning. These curves illustrate a stable optimization process and smooth convergence under the DeepSpeed ZeRO-3 setup.
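For readers who want to approximate the training setup, the following is a minimal sketch of a DeepSpeed configuration that enables ZeRO-3 with CPU offload and fp16, the two choices named under Key Training Features. It is not the authors' configuration file: the `pin_memory` settings and the output file name `ds_zero3_offload.json` are assumptions, and batch sizes, optimizer, and scheduler settings are omitted because the text does not state them.

```python
import json

# Hypothetical config: only "ZeRO-3 Offload" and "fp16" are taken from the text.
ds_config = {
    "fp16": {"enabled": True},  # mixed-precision training in fp16
    "zero_optimization": {
        "stage": 3,  # partition parameters, gradients, and optimizer states
        # Offloading to CPU memory is assumed here as the meaning of "Offload".
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize so the config can be passed to a launcher via a JSON file.
config_text = json.dumps(ds_config, indent=2)
with open("ds_zero3_offload.json", "w", encoding="utf-8") as f:
    f.write(config_text)

print(config_text)
```

A complete DeepSpeed config also needs batch-size settings, which would have to be chosen for the hardware at hand.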

Dataset

ViCA-7B is fine-tuned on two complementary datasets:

[ViCA-322K](https://huggingface.co/datasets/nkkbr/ViCA-322K): A large-scale dataset covering both base spatial reasoning tasks (e.g., object distance, size, count, appearance order) and complex spatial reasoning tasks involving natural language questions and scene understanding. This dataset forms the core of the model's spatial reasoning capabilities.
[ViCA-thinking-2.68k](https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k): A focused dataset used for instruction tuning to enhance the model's ability to generate step-by-step reasoning traces before outputting final answers. This supports more interpretable and cognitively aligned response generation.

For details, please refer to the individual dataset pages linked above.

Evaluation: VSI-Bench Benchmark

Figure 3: Quantitative comparison of ViCA-7B and baseline models on VSI-Bench. ViCA-7B achieves the best overall performance across both numerical and multiple-choice tasks.
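The numerical tasks are scored with MRA (Mean Relative Accuracy), the metric listed for them in the results table, while multiple-choice tasks use plain accuracy. The sketch below assumes the definition given in the VSI-Bench paper, where a prediction counts as correct at confidence threshold θ if its relative error is below 1 − θ, averaged over θ ∈ {0.50, 0.55, ..., 0.95}. The function name is hypothetical, and the official evaluation script may differ in details such as boundary handling.

```python
def mean_relative_accuracy(pred: float, target: float) -> float:
    """Per-sample MRA under the assumed VSI-Bench definition.

    Hypothetical helper, not the official evaluation code.
    """
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    # Confidence thresholds 0.50, 0.55, ..., 0.95 (ten values).
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    rel_err = abs(pred - target) / abs(target)
    # Count the thresholds at which the prediction is accurate enough.
    hits = sum(1 for theta in thresholds if rel_err < 1 - theta)
    return hits / len(thresholds)


# A 12% relative error passes the eight loosest thresholds and fails the two strictest.
print(mean_relative_accuracy(8.8, 10.0))   # 0.8
print(mean_relative_accuracy(10.0, 10.0))  # 1.0
```

A benchmark-level MRA would then be the mean of these per-sample values, scaled to a percentage.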

Effect of CSR Data

| Configuration | Avg Score |
| --- | --- |
| Base-only (281K) | 55.35 |
| Full with CSR (322K) | 60.56 |

CSR (Complex Spatial Reasoning) boosts generalization and accelerates learning, with notable performance jumps at intermediate checkpoints (e.g., +2.02 at 50–55%).

Data Scale vs. Performance

Performance improves significantly as data usage grows from 5% to 60%. Beyond 80%, improvements plateau, suggesting that the dataset is well matched to the model's capacity.

Figure 4: Performance of ViCA-7B under varying training data sizes (from 5% to 100%). The full dataset (including Complex Spatial Reasoning, CSR) consistently outperforms the base-only configuration. Notably, the CSR-enhanced model shows a +2.02 score jump between 50% and 55%, and a final performance gain of +4.75 at full scale. Performance plateaus beyond 80%, indicating the dataset is well aligned with the model's capacity.

Intermediate Checkpoints and Evaluation Outputs

To support detailed analysis and reproducibility, we provide two sets of intermediate checkpoints saved at every 5% increment of the training data. These models are trained for a single epoch and are useful for understanding how performance evolves as training progresses.

We also release the corresponding raw evaluation outputs (e.g., .json prediction files) for each checkpoint. The evaluation script used to produce these outputs is available in our GitHub repository.

Full Dataset (ViCA-322K: Base + CSR)

This series corresponds to the full training set, including both base spatial reasoning and complex spatial reasoning (CSR):

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| --- | --- | --- | --- |
| 5% | [`nkkbr/ViCA-5p`](https://huggingface.co/nkkbr/ViCA-5p) | 55% | [`nkkbr/ViCA-55p`](https://huggingface.co/nkkbr/ViCA-55p) |
| 10% | [`nkkbr/ViCA-10p`](https://huggingface.co/nkkbr/ViCA-10p) | 60% | [`nkkbr/ViCA-60p`](https://huggingface.co/nkkbr/ViCA-60p) |
| 15% | [`nkkbr/ViCA-15p`](https://huggingface.co/nkkbr/ViCA-15p) | 65% | [`nkkbr/ViCA-65p`](https://huggingface.co/nkkbr/ViCA-65p) |
| 20% | [`nkkbr/ViCA-20p`](https://huggingface.co/nkkbr/ViCA-20p) | 70% | [`nkkbr/ViCA-70p`](https://huggingface.co/nkkbr/ViCA-70p) |
| 25% | [`nkkbr/ViCA-25p`](https://huggingface.co/nkkbr/ViCA-25p) | 75% | [`nkkbr/ViCA-75p`](https://huggingface.co/nkkbr/ViCA-75p) |
| 30% | [`nkkbr/ViCA-30p`](https://huggingface.co/nkkbr/ViCA-30p) | 80% | [`nkkbr/ViCA-80p`](https://huggingface.co/nkkbr/ViCA-80p) |
| 35% | [`nkkbr/ViCA-35p`](https://huggingface.co/nkkbr/ViCA-35p) | 85% | [`nkkbr/ViCA-85p`](https://huggingface.co/nkkbr/ViCA-85p) |
| 40% | [`nkkbr/ViCA-40p`](https://huggingface.co/nkkbr/ViCA-40p) | 90% | [`nkkbr/ViCA-90p`](https://huggingface.co/nkkbr/ViCA-90p) |
| 45% | [`nkkbr/ViCA-45p`](https://huggingface.co/nkkbr/ViCA-45p) | 95% | [`nkkbr/ViCA-95p`](https://huggingface.co/nkkbr/ViCA-95p) |
| 50% | [`nkkbr/ViCA-50p`](https://huggingface.co/nkkbr/ViCA-50p) | 100% (This repo) | `nkkbr/ViCA` |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_all_data/).

Base-only Subset (ViCA-322K: Base)

This series is trained only on the base spatial reasoning subset of ViCA-322K, without any CSR examples:

| Data Usage | Checkpoint | Data Usage | Checkpoint |
| --- | --- | --- | --- |
| 5% | [`nkkbr/ViCA-base-5p`](https://huggingface.co/nkkbr/ViCA-base-5p) | 55% | [`nkkbr/ViCA-base-55p`](https://huggingface.co/nkkbr/ViCA-base-55p) |
| 10% | [`nkkbr/ViCA-base-10p`](https://huggingface.co/nkkbr/ViCA-base-10p) | 60% | [`nkkbr/ViCA-base-60p`](https://huggingface.co/nkkbr/ViCA-base-60p) |
| 15% | [`nkkbr/ViCA-base-15p`](https://huggingface.co/nkkbr/ViCA-base-15p) | 65% | [`nkkbr/ViCA-base-65p`](https://huggingface.co/nkkbr/ViCA-base-65p) |
| 20% | [`nkkbr/ViCA-base-20p`](https://huggingface.co/nkkbr/ViCA-base-20p) | 70% | [`nkkbr/ViCA-base-70p`](https://huggingface.co/nkkbr/ViCA-base-70p) |
| 25% | [`nkkbr/ViCA-base-25p`](https://huggingface.co/nkkbr/ViCA-base-25p) | 75% | [`nkkbr/ViCA-base-75p`](https://huggingface.co/nkkbr/ViCA-base-75p) |
| 30% | [`nkkbr/ViCA-base-30p`](https://huggingface.co/nkkbr/ViCA-base-30p) | 80% | [`nkkbr/ViCA-base-80p`](https://huggingface.co/nkkbr/ViCA-base-80p) |
| 35% | [`nkkbr/ViCA-base-35p`](https://huggingface.co/nkkbr/ViCA-base-35p) | 85% | [`nkkbr/ViCA-base-85p`](https://huggingface.co/nkkbr/ViCA-base-85p) |
| 40% | [`nkkbr/ViCA-base-40p`](https://huggingface.co/nkkbr/ViCA-base-40p) | 90% | [`nkkbr/ViCA-base-90p`](https://huggingface.co/nkkbr/ViCA-base-90p) |
| 45% | [`nkkbr/ViCA-base-45p`](https://huggingface.co/nkkbr/ViCA-base-45p) | 95% | [`nkkbr/ViCA-base-95p`](https://huggingface.co/nkkbr/ViCA-base-95p) |
| 50% | [`nkkbr/ViCA-base-50p`](https://huggingface.co/nkkbr/ViCA-base-50p) | 100% | [`nkkbr/ViCA-base`](https://huggingface.co/nkkbr/ViCA-base) |

Raw evaluation outputs are available [here](https://huggingface.co/nkkbr/ViCA/tree/main/raw_evaluation_outputs/vsi-bench_only_base/).
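The checkpoint names in the two tables above follow a regular pattern, so the full list can be generated rather than typed out. The helper below is a hypothetical convenience function, not part of the released code; it only reproduces the repository IDs shown in the tables.

```python
def checkpoint_ids(base_only: bool = False) -> dict[int, str]:
    """Map data-usage percentage to the Hugging Face repo ID from the tables above.

    Hypothetical helper: it encodes the naming pattern visible in the tables.
    """
    stem = "nkkbr/ViCA-base" if base_only else "nkkbr/ViCA"
    # Intermediate checkpoints at 5%, 10%, ..., 95% carry a "-<pct>p" suffix.
    ids = {pct: f"{stem}-{pct}p" for pct in range(5, 100, 5)}
    # The 100% checkpoint has no suffix.
    ids[100] = stem
    return ids


full_series = checkpoint_ids()
base_series = checkpoint_ids(base_only=True)

print(full_series[55])   # nkkbr/ViCA-55p
print(base_series[100])  # nkkbr/ViCA-base
```

Each ID can then be passed to whichever loading or download routine is in use, for example to evaluate every checkpoint in a loop and reproduce the data-scale curve.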

Source-wise Checkpoints

While the full ViCA-322K dataset was curated by us, the underlying videos and associated metadata are sourced from three distinct indoor video datasets:

ARKitScenes
[ScanNet](http://www.scan-net.org)
ScanNet++

To better understand how each source contributes to model performance, we fine-tuned ViCA-7B on subsets of ViCA-322K that exclusively use data from each source. For each subset, we provide checkpoints trained with 10% increments of the available data, from 10% to 100%.

Corresponding raw evaluation outputs (e.g., .json predictions) are also provided for all checkpoints.

ARKitScenes-Only Checkpoints

🔧 Technical Details

Model Index

| Property | Details |
| --- | --- |
| Model Type | Vision-language model |
| Base Model | lmms-lab/LLaVA-Video-7B-Qwen2 |
| Training Data | nkkbr/ViCA-322K, nkkbr/ViCA-thinking-2.68k |

Results on VSI-Bench

| task | dataset | metrics | value | name | verified |
| --- | --- | --- | --- | --- | --- |
| visual-question-answering | VSI-Bench | score | 60.56 | Average | false |
| visual-question-answering | VSI-Bench | MRA | 68.81 | Object Count | - |
| visual-question-answering | VSI-Bench | MRA | 57.01 | Absolute Distance | - |
| visual-question-answering | VSI-Bench | MRA | 79.17 | Object Size | - |
| visual-question-answering | VSI-Bench | MRA | 75.14 | Room Size | - |
| visual-question-answering | VSI-Bench | accuracy | 58.45 | Relative Distance | - |
| visual-question-answering | VSI-Bench | accuracy | 42.56 | Relative Direction | - |
| visual-question-answering | VSI-Bench | accuracy | 34.54 | Route Plan | - |
| visual-question-answering | VSI-Bench | accuracy | 68.77 | Appearance Order | - |
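As a quick consistency check on the table above, the reported Average of 60.56 matches the unweighted mean of the eight per-task scores. The snippet below is only an observation about the published numbers; it does not claim to reproduce how the authors computed the average.

```python
# Per-task scores copied from the results table above.
task_scores = {
    "Object Count": 68.81,
    "Absolute Distance": 57.01,
    "Object Size": 79.17,
    "Room Size": 75.14,
    "Relative Distance": 58.45,
    "Relative Direction": 42.56,
    "Route Plan": 34.54,
    "Appearance Order": 68.77,
}

# Unweighted mean across the eight tasks.
average = sum(task_scores.values()) / len(task_scores)

print(round(average, 2))  # 60.56
```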

📄 License

This project is licensed under the Apache-2.0 license.
