ViCA

Developed by nkkbr
ViCA-7B is a vision-language model fine-tuned specifically for visual-spatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture and trained using the ViCA-322K dataset, it emphasizes structured spatial annotation and instruction-based complex reasoning tasks.
Downloads 41
Release Time: 4/21/2025

Model Overview

ViCA-7B focuses on visual-spatial reasoning in indoor video environments. It handles tasks such as object counting, absolute distance estimation, object size estimation, room dimension estimation, relative distance, relative direction, path planning, and sequence of appearance.

Model Features

Visual-Spatial Reasoning
Specializes in visual-spatial reasoning tasks in indoor video environments, such as object counting and distance and size estimation.
Multimodal Alignment
Achieves effective fusion of video content and text prompts through a lightweight projector.
Efficient Training
Utilizes DeepSpeed ZeRO-3 Offload and mixed-precision computing for efficient distributed training.
Fixed-Length Visual Tokenization
Each video is uniformly sampled into 64 frames, and each frame is encoded into 210 visual tokens. Every video therefore yields the same number of visual tokens, which keeps memory usage consistent across batches and improves stability.
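
The fixed-length tokenization above can be illustrated with a small sketch. Only the two constants (64 frames, 210 tokens per frame) come from the description. The function names and the exact rounding rule for choosing frame indices are assumptions made for illustration and may differ from the model's actual preprocessing.

```python
from typing import List

# Constants taken from the model description.
NUM_FRAMES = 64          # frames sampled per video
TOKENS_PER_FRAME = 210   # visual tokens produced per frame


def uniform_frame_indices(total_frames: int, num_samples: int = NUM_FRAMES) -> List[int]:
    """Return num_samples evenly spaced frame indices (hypothetical helper).

    Indices run from the first to the last frame, inclusive. For videos
    shorter than num_samples frames, some indices repeat, so the output
    length stays fixed.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def visual_token_budget(num_frames: int = NUM_FRAMES,
                        tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Total visual tokens per video under fixed-length tokenization."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    indices = uniform_frame_indices(total_frames=1920)
    print(len(indices), indices[0], indices[-1])
    print(visual_token_budget())
```

Because the frame count and per-frame token count are both fixed, the visual token count per video is 64 × 210 = 13,440 regardless of video length, which is what makes per-batch memory predictable.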

Model Capabilities

Visual Question Answering
Video Understanding
Spatial Reasoning
Visual-Spatial Cognition
Multimodal Processing

Use Cases

Indoor Navigation Assistant
Indoor Navigation
Assists users in navigating and planning paths within indoor environments.
Robot Planning and Spatial Queries
Robot Path Planning
Provides robots with spatial understanding and path planning capabilities.
Smart Room Arrangement and AR Layout Analysis
Room Arrangement Analysis
Analyzes room layouts and object placements to offer optimization suggestions.
Scene Understanding for Embodied AI Agents
Scene Understanding
Helps AI agents understand spatial relationships in complex indoor scenes.