Spatial-LLaVA-7B-gguf Open-Source Multimodal Model - Strengthening Spatial Reasoning for Research and Chatbot Development

Spatial LLaVA 7B Gguf

Developed by rogerxi

Spatial-LLaVA-7B is a multimodal model fine-tuned based on the LLaVA model, focusing on improving the ability of spatial relationship reasoning and suitable for multimodal research and chatbot development.

Text-to-Image

Safetensors

Open Source License:Apache-2.0 #Spatial relationship reasoning #Multimodal dialogue #Visual question answering enhancement

Downloads 252

Release Time : 5/10/2025

Model Overview

This model enhances the ability of large multimodal models in spatial relationship reasoning through fine-tuning the LLaVA model and can be used for research and development of multimodal interaction systems.

Model Features

Enhanced spatial relationship reasoning

Through training on a specialized dataset, the model's ability to understand spatial relationships between objects is significantly improved.

Multimodal capabilities

It can process visual and language information simultaneously to achieve cross-modal understanding and reasoning.

Open-source availability

Both the model and training data are open source, facilitating research and secondary development.

Model Capabilities

Visual question answering

Spatial relationship reasoning

Multimodal dialogue

Image understanding

Text generation

Use Cases

Research

Multimodal model research

Used to study the spatial reasoning ability of large multimodal models

It performs better than the basic LLaVA model in the Spatial-Relation-Eval benchmark test

Application development

Intelligent chatbot

Develop a dialogue system that can understand the spatial relationships in images

🚀 Spatial-LLaVA-7B Model Card

This is a fine-tuned LLaVA model aiming to enhance the spatial relation reasoning ability of large multi-modal models.

Github Repo

Huggingface Space Demo

📚 Documentation

✨ Features

This finetuned LLaVA model is trained from liuhaotian/llava-pretrain-vicuna-7b-v1.3 for improving spatial relation reasoning of large multi-modal model.

LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.

🎯 Intended Use

Primary intended uses: The primary use of LLaVA is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

📦 Training Dataset

Instruction following training: rogerxi/LLaVA-Spatial-Instruct-850K

🔧 Evaluation

A collection of 10 benchmarks:

Model	VQAv2	GQA	VizWiz	SQA	TextVQA	POPE	MME	MM-Bench	MM-Bench-cn	MM-Vet
LLaVA-1.5-7b	78.5	62.0	50.0	66.8	58.2	85.9	1510.7	64.3	58.3	31.1
Spatial-LLaVA-7b	79.7	62.7	48.7	68.7	58.5	87.2	1472.7	67.8	60.7	31.6

Spatial-Relation-Eval (built based on SpatialRGPT-Bench):

Qualitative Spatial Relations

Model	Below/Above	Left/Right	Big/Small	Tall/Short	Wide/Thin	Behind/Front	Avg
LLaVA-1.5-7b	53.91	53.49	45.36	40.00	50.00	51.04	48.97
LLaVA-1.5-13b	54.28	52.32	45.36	48.57	49.02	47.92	49.67
Spatial-LLaVA-7b	56.32	66.28	60.82	48.57	49.02	52.08	55.12

Quantitative Spatial Relations

Model	Direct Dist (m / ratio)	Horizontal Dist (m / ratio)	Vertical Dist (m / ratio)	Width (m / ratio)	Height (m / ratio)	Direction (¬∞ / ratio)
LLaVA-1.5-7b	12.90 / 1.06	10.68 / 2.03	20.79 / 0.94	24.19 / 0.50	14.29 / 5.27	10.23 / 58.33
LLaVA-1.5-13b	13.71 / 0.93	10.68 / 3.56	16.83 / 0.85	15.32 / 0.57	17.67 / 5.8	14.77 / 54.29
Spatial-LLaVA-7b	24.19 / 0.57	14.56 / 0.62	41.58 / 0.42	22.58 / 1.12	18.25 / 2.92	20.45 / 56.47

🙏 Acknowledgements

We thank Liu Haotian et al. for the LLaVA pretrained script, weights and LLaVA-v1.5 mixture dataset; the teams behind CLEVR, TextCaps, VisualMRC and VQAv2 (via “HuggingFaceM4/the_cauldron”); remyxai for OpenSpaces; Anjie Cheng et al. for Spatial-Bench and data pipeline; Google for OpenImages; and Hugging Face for their datasets infrastructure.

📄 License

This project is under the Apache-2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご