# 🚀 SpaceOm-GGUF

SpaceOm-GGUF is the GGUF build of SpaceOm for use with llama.cpp, a model designed for visual question answering with a focus on spatial reasoning. Targeted fine-tuning and additional datasets improve its performance, making it useful for multimodal and robotics applications.
## 🚀 Quick Start

This card does not ship step-by-step quick-start instructions. To run SpaceOm-GGUF, refer to the official documentation of the llama.cpp library.
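As a rough starting point, here is a minimal sketch using the llama-cpp-python bindings. The GGUF and mmproj file names, the quantization suffix, and the image URL are placeholders, and the `Llava15ChatHandler` shown here may not be the handler that best matches a Qwen2.5-VL-derived model; check the llama.cpp and llama-cpp-python documentation for current multimodal support.

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# File names are placeholders: point them at the SpaceOm GGUF weights and the
# matching multimodal projector (mmproj) file from this repository.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The chat handler wires the vision projector into llama.cpp; the class that
# best matches this Qwen2.5-VL-based model may differ in your version.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-spaceom-f16.gguf")

llm = Llama(
    model_path="SpaceOm-Q4_K_M.gguf",  # placeholder quantization name
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for long "thinking" traces
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/warehouse.jpg"}},
                {"type": "text", "text": "How far apart are the pallet and the forklift?"},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```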
## ✨ Features
- Enhanced Fine-Tuning: Adds the target module `o_proj` to LoRA fine-tuning, a choice inspired by research on reasoning models that highlights the importance of this module.
- Rich Datasets: Uses the [SpaceOm dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces and the [Robo2VLM-Reasoning dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for more robotics-domain and MCVQA examples (see the loading snippet after this list).
- Improved Reasoning Ability: Includes longer reasoning traces in the training data to help the model use more tokens when reasoning.
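Both fine-tuning datasets are hosted on the Hugging Face Hub. The snippet below is a minimal sketch of pulling them with the `datasets` library; the `train` split name is an assumption and the column layout depends on each dataset's schema.

```python
# Minimal sketch: load the two fine-tuning datasets from the Hugging Face Hub.
# The split name "train" is an assumption; check the dataset cards for the
# actual configurations and column names.
from datasets import load_dataset

spaceom = load_dataset("salma-remyx/SpaceOm", split="train")
robo2vlm = load_dataset("salma-remyx/Robo2VLM-Reasoning", split="train")

print(spaceom)       # overview of the columns and number of rows
print(robo2vlm[0])   # a single example (column names depend on the schema)
```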
## 📚 Documentation

### Model Overview
SpaceOm improves over SpaceThinker by:
- Adding the target module `o_proj` in LoRA fine-tuning. Including `o_proj` is inspired by research emphasizing the importance of this module in reasoning models (a minimal configuration sketch follows this list).
- Using the [SpaceOm dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces. The reasoning traces in the SpaceThinker dataset average roughly 200 "thinking" tokens, so longer traces were added to the training data to help the model use more tokens when reasoning.
- Employing the [Robo2VLM-Reasoning dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for more robotics-domain and MCVQA examples. To improve alignment for robotics applications, the model is trained with synthetic reasoning traces derived from the [Robo2VLM-1 dataset](https://huggingface.co/datasets/keplerccc/Robo2VLM-1).
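For illustration only, here is a minimal sketch of a PEFT LoRA configuration that includes `o_proj` among the target modules. The rank, alpha, dropout, base checkpoint, and the other target modules are assumptions; the card does not publish the hyperparameters actually used to train SpaceOm.

```python
# Hypothetical LoRA setup illustrating the inclusion of o_proj among the target
# modules. Rank/alpha/dropout values and the other module names are placeholders,
# not the hyperparameters used to train SpaceOm.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Assumed base checkpoint for the sketch (SpaceOm derives from Qwen2.5-VL 3B).
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections commonly targeted by LoRA, plus o_proj,
    # the output projection highlighted for reasoning models.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```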
### Model Evaluation

| Property | Details |
|----------|---------|
| Model Type | SpaceOm-GGUF |
| Base Model | remyxai/SpaceOm |
| Pipeline Tag | image-text-to-text |
| Library Name | llama.cpp |
| License | apache-2.0 |
| Datasets | remyxai/SpaceThinker, [SpaceOm](https://huggingface.co/datasets/salma-remyx/SpaceOm), [Robo2VLM-Reasoning](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) |

#### SpatialScore: 3B and 4B models
| Model | Overall | Count. | Obj.-Loc. | Pos.-Rel. | Dist. | Obj.-Prop. | Cam.&IT. | Tracking | Others |
|-------|---------|--------|-----------|-----------|-------|------------|----------|----------|--------|
| SpaceQwen2.5-VL-3B | 42.31 | 45.01 | 49.78 | 57.88 | 27.36 | 34.11 | 26.34 | 26.44 | 43.58 |
| SpatialBot-Phi2-3B | 41.65 | 53.23 | 54.32 | 55.40 | 27.12 | 26.10 | 24.21 | 27.57 | 41.66 |
| Kimi-VL-3B | 51.48 | 49.22 | 61.99 | 61.34 | 38.27 | 46.74 | 33.75 | 56.28 | 47.23 |
| Kimi-VL-3B-Thinking | 52.60 | 52.66 | 58.93 | 63.28 | 39.38 | 42.57 | 32.00 | 46.97 | 42.73 |
| Qwen2.5-VL-3B | 47.90 | 46.62 | 55.55 | 62.23 | 32.39 | 32.97 | 30.66 | 36.90 | 42.19 |
| InternVL2.5-4B | 49.82 | 53.32 | 62.02 | 62.02 | 32.80 | 27.00 | 32.49 | 37.02 | 48.95 |
| SpaceOm (3B) | 49.00 | 56.00 | 54.00 | 65.00 | 41.00 | 50.00 | 36.00 | 42.00 | 47.00 |
See the [full results](https://huggingface.co/datasets/salma-remyx/SpaceOm_SpatialScore) of evaluating SpaceOm on the SpatialScore benchmark. SpaceOm outperforms SpaceQwen in every category and also improves over its predecessor, SpaceThinker.
#### SpaCE-10 Benchmark Comparison

[Open in Colab](https://colab.research.google.com/drive/1YpIOjJFZ-Zaomg77ImeQHSqYBLB8T1Ce?usp=sharing)

This table compares SpaceOm, evaluated using GPT scoring, against several top models from the SpaCE-10 benchmark leaderboard. Top scores in each category are bolded.
| Model | EQ | SQ | SA | OO | OS | EP | FR | SP | Source |
|-------|----|----|----|----|----|----|----|----|--------|
| SpaceOm | 32.47 | 24.81 | **47.63** | **50.00** | **32.52** | 9.12 | 37.04 | 25.00 | GPT Eval |
| Qwen2.5-VL-7B-Instruct | 32.70 | 31.00 | 41.30 | 32.10 | 27.60 | 15.40 | 26.30 | 27.50 | Table |
| LLaVA-OneVision-7B | **37.40** | 36.20 | 42.90 | 44.20 | 27.10 | 11.20 | **45.60** | 27.20 | Table |
| VILA1.5-7B | 30.20 | **38.60** | 39.90 | 44.10 | 16.50 | **35.10** | 30.10 | **37.60** | Table |
| InternVL2.5-4B | 34.30 | 34.40 | 43.60 | 44.60 | 16.10 | 30.10 | 33.70 | 36.70 | Table |
Legend:
- EQ: Entity Quantification
- SQ: Scene Quantification
- SA: Size Assessment
- OO: Object-Object spatial relations
- OS: Object-Scene spatial relations
- EP: Entity Presence
- FR: Functional Reasoning
- SP: Spatial Planning
### ⚠️ Important Note

Scores for SpaceOm are generated via `gpt_eval_score` on single-choice (`*-single`) versions of the SpaCE-10 benchmark tasks. Other entries reflect leaderboard accuracy scores from the official SpaCE-10 evaluation table.
Read more about the [SpaCE-10 benchmark](https://arxiv.org/abs/2506.07966).
### Limitations

- Performance may degrade in cluttered environments or with unusual camera perspectives.
- The model was fine-tuned using synthetic reasoning traces over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.
### 💡 Usage Tip

Users are encouraged to evaluate outputs critically and to consider fine-tuning for domain-specific safety and performance. Distances estimated by autoregressive transformers can support higher-order reasoning for planning and behavior, but they are not a substitute for measurements from high-precision sensors, calibrated stereo vision systems, or specialized monocular depth estimation models capable of more accurate, pixel-wise predictions and real-time performance.
## Citation
@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168}
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  author = {Qwen Team},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author = {Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}}
}

@article{wu2025spatialscore,
  title = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
  author = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal = {arXiv preprint arXiv:2505.17012},
  year = {2025}
}

@article{gong2025space10,
  title = {SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence},
  author = {Ziyang Gong and Wenhao Li and Oliver Ma and Songyuan Li and Jiayi Ji and Xue Yang and Gen Luo and Junchi Yan and Rongrong Ji},
  journal = {arXiv preprint arXiv:2506.07966},
  year = {2025},
  url = {https://arxiv.org/abs/2506.07966}
}