# 🚀 SpaceOm-GGUF

SpaceOm-GGUF is the GGUF build of SpaceOm for use with llama.cpp, a model designed for visual question answering with a focus on spatial reasoning. Targeted fine-tuning and additional datasets improve its performance, making it useful for multimodal and robotics applications.
## 🚀 Quick Start

This card does not ship step-by-step quick-start instructions. To run SpaceOm-GGUF, refer to the official documentation of the llama.cpp library.
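As a rough starting point, here is a minimal sketch using the llama-cpp-python bindings. The GGUF and mmproj file names, the quantization suffix, and the image URL are placeholders, and the `Llava15ChatHandler` shown here may not be the handler that best matches a Qwen2.5-VL-derived model; check the llama.cpp and llama-cpp-python documentation for current multimodal support.

```python
# Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
# File names are placeholders: point them at the SpaceOm GGUF weights and the
# matching multimodal projector (mmproj) file from this repository.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The chat handler wires the vision projector into llama.cpp; the class that
# best matches this Qwen2.5-VL-based model may differ in your version.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-spaceom-f16.gguf")

llm = Llama(
    model_path="SpaceOm-Q4_K_M.gguf",  # placeholder quantization name
    chat_handler=chat_handler,
    n_ctx=4096,  # leave room for long "thinking" traces
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/warehouse.jpg"}},
                {"type": "text", "text": "How far apart are the pallet and the forklift?"},
            ],
        }
    ],
)
print(response["choices"][0]["message"]["content"])
```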
## ✨ Features
- Enhanced Fine-Tuning: Adds the target module `o_proj` to LoRA fine-tuning, a choice inspired by research on reasoning models that highlights the importance of this module.
- Rich Datasets: Uses the [SpaceOm dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces and the [Robo2VLM-Reasoning dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for more robotics-domain and MCVQA examples (see the loading snippet after this list).
- Improved Reasoning Ability: Includes longer reasoning traces in the training data to help the model use more tokens when reasoning.
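Both fine-tuning datasets are hosted on the Hugging Face Hub. The snippet below is a minimal sketch of pulling them with the `datasets` library; the `train` split name is an assumption and the column layout depends on each dataset's schema.

```python
# Minimal sketch: load the two fine-tuning datasets from the Hugging Face Hub.
# The split name "train" is an assumption; check the dataset cards for the
# actual configurations and column names.
from datasets import load_dataset

spaceom = load_dataset("salma-remyx/SpaceOm", split="train")
robo2vlm = load_dataset("salma-remyx/Robo2VLM-Reasoning", split="train")

print(spaceom)       # overview of the columns and number of rows
print(robo2vlm[0])   # a single example (column names depend on the schema)
```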
## 📚 Documentation

### Model Overview
SpaceOm improves over SpaceThinker by:
- Adding the target module `o_proj` in LoRA fine-tuning. Including `o_proj` is inspired by research emphasizing the importance of this module in reasoning models (a minimal configuration sketch follows this list).
- Using the [SpaceOm dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces. The reasoning traces in the SpaceThinker dataset average roughly 200 "thinking" tokens, so longer traces were added to the training data to help the model use more tokens when reasoning.
- Employing the [Robo2VLM-Reasoning dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for more robotics-domain and MCVQA examples. To improve alignment for robotics applications, the model is trained with synthetic reasoning traces derived from the [Robo2VLM-1 dataset](https://huggingface.co/datasets/keplerccc/Robo2VLM-1).
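For illustration only, here is a minimal sketch of a PEFT LoRA configuration that includes `o_proj` among the target modules. The rank, alpha, dropout, base checkpoint, and the other target modules are assumptions; the card does not publish the hyperparameters actually used to train SpaceOm.

```python
# Hypothetical LoRA setup illustrating the inclusion of o_proj among the target
# modules. Rank/alpha/dropout values and the other module names are placeholders,
# not the hyperparameters used to train SpaceOm.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Assumed base checkpoint for the sketch (SpaceOm derives from Qwen2.5-VL 3B).
base = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections commonly targeted by LoRA, plus o_proj,
    # the output projection highlighted for reasoning models.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```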
### Model Evaluation

| Property | Details |
|----------|---------|
| Model Type | SpaceOm-GGUF |
| Base Model | remyxai/SpaceOm |
| Pipeline Tag | image-text-to-text |
| Library Name | llama.cpp |
| License | apache-2.0 |
| Datasets | remyxai/SpaceThinker, [SpaceOm](https://huggingface.co/datasets/salma-remyx/SpaceOm), [Robo2VLM-Reasoning](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) |

#### SpatialScore: 3B and 4B models
| Model | Overall | Count. | Obj.-Loc. | Pos.-Rel. | Dist. | Obj.-Prop. | Cam.&IT. | Tracking | Others |
|-------|---------|--------|-----------|-----------|-------|------------|----------|----------|--------|
| SpaceQwen2.5-VL-3B | 42.31 | 45.01 | 49.78 | 57.88 | 27.36 | 34.11 | 26.34 | 26.44 | 43.58 |
| SpatialBot-Phi2-3B | 41.65 | 53.23 | 54.32 | 55.40 | 27.12 | 26.10 | 24.21 | 27.57 | 41.66 |
| Kimi-VL-3B | 51.48 | 49.22 | 61.99 | 61.34 | 38.27 | 46.74 | 33.75 | 56.28 | 47.23 |
| Kimi-VL-3B-Thinking | 52.60 | 52.66 | 58.93 | 63.28 | 39.38 | 42.57 | 32.00 | 46.97 | 42.73 |
| Qwen2.5-VL-3B | 47.90 | 46.62 | 55.55 | 62.23 | 32.39 | 32.97 | 30.66 | 36.90 | 42.19 |
| InternVL2.5-4B | 49.82 | 53.32 | 62.02 | 62.02 | 32.80 | 27.00 | 32.49 | 37.02 | 48.95 |
| SpaceOm (3B) | 49.00 | 56.00 | 54.00 | 65.00 | 41.00 | 50.00 | 36.00 | 42.00 | 47.00 |
See the [full results](https://huggingface.co/datasets/salma-remyx/SpaceOm_SpatialScore) of evaluating SpaceOm on the SpatialScore benchmark. SpaceOm outperforms SpaceQwen in every category and also improves over its predecessor, SpaceThinker.
#### SpaCE-10 Benchmark Comparison

[Open in Colab](https://colab.research.google.com/drive/1YpIOjJFZ-Zaomg77ImeQHSqYBLB8T1Ce?usp=sharing)

This table compares SpaceOm, evaluated using GPT scoring, against several top models from the SpaCE-10 benchmark leaderboard. Top scores in each category are bolded.
| Model | EQ | SQ | SA | OO | OS | EP | FR | SP | Source |
|-------|----|----|----|----|----|----|----|----|--------|
| SpaceOm | 32.47 | 24.81 | **47.63** | **50.00** | **32.52** | 9.12 | 37.04 | 25.00 | GPT Eval |
| Qwen2.5-VL-7B-Instruct | 32.70 | 31.00 | 41.30 | 32.10 | 27.60 | 15.40 | 26.30 | 27.50 | Table |
| LLaVA-OneVision-7B | **37.40** | 36.20 | 42.90 | 44.20 | 27.10 | 11.20 | **45.60** | 27.20 | Table |
| VILA1.5-7B | 30.20 | **38.60** | 39.90 | 44.10 | 16.50 | **35.10** | 30.10 | **37.60** | Table |
| InternVL2.5-4B | 34.30 | 34.40 | 43.60 | 44.60 | 16.10 | 30.10 | 33.70 | 36.70 | Table |
Legend:
- EQ: Entity Quantification
- SQ: Scene Quantification
- SA: Size Assessment
- OO: Object-Object spatial relations
- OS: Object-Scene spatial relations
- EP: Entity Presence
- FR: Functional Reasoning
- SP: Spatial Planning
### ⚠️ Important Note

Scores for SpaceOm are generated via `gpt_eval_score` on single-choice (`*-single`) versions of the SpaCE-10 benchmark tasks. Other entries reflect leaderboard accuracy scores from the official SpaCE-10 evaluation table.
Read more about the [SpaCE-10 benchmark](https://arxiv.org/abs/2506.07966).
### Limitations

- Performance may degrade in cluttered environments or with unusual camera perspectives.
- The model was fine-tuned using synthetic reasoning traces over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.
### 💡 Usage Tip

Users are encouraged to evaluate outputs critically and to consider fine-tuning for domain-specific safety and performance. Distances estimated by autoregressive transformers can support higher-order reasoning for planning and behavior, but they are not a substitute for measurements from high-precision sensors, calibrated stereo vision systems, or specialized monocular depth estimation models capable of more accurate, pixel-wise predictions and real-time performance.
## Citation
@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168}
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  author = {Qwen Team},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author = {Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}}
}

@article{wu2025spatialscore,
  title = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
  author = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
  journal = {arXiv preprint arXiv:2505.17012},
  year = {2025}
}

@article{gong2025space10,
  title = {SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence},
  author = {Ziyang Gong and Wenhao Li and Oliver Ma and Songyuan Li and Jiayi Ji and Xue Yang and Gen Luo and Junchi Yan and Rongrong Ji},
  journal = {arXiv preprint arXiv:2506.07966},
  year = {2025},
  url = {https://arxiv.org/abs/2506.07966}
}