ViCA-7B is a vision-language model fine-tuned specifically for visual-spatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture, it is fine-tuned on the ViCA-322K dataset, which emphasizes structured spatial annotations and instruction-based complex reasoning tasks.
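The sketch below shows one way such a model might be queried for spatial reasoning over a video clip. It is a minimal, hedged example, not the release's confirmed usage: the repo id `nkkbr/ViCA` and compatibility with Transformers' LLaVA-OneVision classes (the Qwen2-based LLaVA video family in Transformers) are assumptions; the question text and frame count are illustrative.

```python
# Minimal inference sketch (assumptions: repo id and Transformers-format
# compatibility are NOT confirmed by this model card).
import numpy as np
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "nkkbr/ViCA"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Stand-in clip: 8 RGB frames; in practice, sample frames from an indoor video.
video = np.random.randint(0, 255, (8, 384, 384, 3), dtype=np.uint8)

# A spatial-reasoning question in the chat format the processor expects.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "How far apart are the sofa and the television?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=[video], return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

If the checkpoint instead requires the original LLaVA-NeXT codebase rather than a Transformers-converted format, the same conversation structure applies but loading goes through that repository's own builder functions.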