# 🚀 SpatialBot-3B
SpatialBot is a Visual Language Model (VLM) equipped with spatial understanding and reasoning capabilities. It precisely interprets depth maps to perform high-level tasks.
## 🚀 Quick Start

### ✨ Features
This Hugging Face repo offers the merged SpatialBot-3B, built on Phi-2 and SigLIP. It performs well on general VLM tasks and on spatial-understanding benchmarks such as SpatialBench.
### 📦 Installation

Install the dependencies first:

```bash
pip install torch transformers accelerate pillow numpy
```
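The example below loads the model in `float16` and sets `device = 'cuda'`, so a CUDA-capable GPU is expected. A quick sanity check, assuming PyTorch was installed with CUDA support:

```python
import torch

# The usage example loads SpatialBot-3B in float16 on a GPU,
# so confirm CUDA is visible before proceeding.
print(torch.__version__, torch.cuda.is_available())
```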
### 💻 Usage Examples

#### Basic Usage
```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import numpy as np

# Silence logging, progress bars, and warnings for a cleaner demo.
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

device = 'cuda'  # the model is loaded in float16, so a CUDA device is expected
model_name = 'RussRobin/SpatialBot-3B'
offset_bos = 0

# Load the merged model and its tokenizer; trust_remote_code is required
# because SpatialBot ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# The query point <0.5,0.2> uses (x, y) coordinates normalized to [0, 1].
prompt = 'What is the depth value of point <0.5,0.2>? Answer directly from depth map.'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"

# Split the prompt around the image placeholders and splice in the special
# image token ids (-201 for image 1, -202 for image 2).
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n<image 2>\n')]
input_ids = torch.tensor(text_chunks[0] + [-201] + [-202] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

image1 = Image.open('rgb.jpg')    # RGB image
image2 = Image.open('depth.png')  # depth map

# If the depth map is single-channel (e.g. a 16-bit depth image),
# pack it into three 8-bit channels as SpatialBot expects.
channels = len(image2.getbands())
if channels == 1:
    img = np.array(image2)
    height, width = img.shape
    three_channel_array = np.zeros((height, width, 3), dtype=np.uint8)
    three_channel_array[:, :, 0] = (img // 1024) * 4
    three_channel_array[:, :, 1] = (img // 32) * 8
    three_channel_array[:, :, 2] = (img % 32) * 8
    image2 = Image.fromarray(three_channel_array, 'RGB')

image_tensor = model.process_images([image1, image2], model.config).to(dtype=model.dtype, device=device)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0  # increase this to avoid the chatbot repeating itself
)[0]

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
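The single-channel branch above packs a (typically 16-bit) depth value `d` into three 8-bit channels: `R = (d // 1024) * 4`, `G = ((d // 32) * 8) mod 256` (the uint8 cast wraps automatically), and `B = (d % 32) * 8`. This mapping is lossless, so the original depth can be recovered. Below is a minimal, illustrative decoder; `decode_depth` is a hypothetical helper for inspection, not part of the SpatialBot API:

```python
import numpy as np

def decode_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the 3-channel depth packing (illustrative helper).

    R = (d // 1024) * 4, G = ((d // 32) * 8) mod 256, B = (d % 32) * 8,
    hence d = (R // 4) * 1024 + (G // 8) * 32 + B // 8.
    """
    r = rgb[:, :, 0].astype(np.uint32)
    g = rgb[:, :, 1].astype(np.uint32)
    b = rgb[:, :, 2].astype(np.uint32)
    return (r // 4) * 1024 + (g // 8) * 32 + b // 8

# Round trip on a sample value: d = 1234 packs to R=4, G=48, B=144.
d = np.array([[1234]], dtype=np.uint16)
packed = np.stack([(d // 1024) * 4, ((d // 32) * 8) % 256, (d % 32) * 8], axis=-1).astype(np.uint8)
assert decode_depth(packed)[0, 0] == 1234
```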
## 📚 Documentation
## 📄 License
This project is licensed under the CC-BY-4.0 license.
| Property | Details |
|----------|---------|
| Model Type | VLM with spatial understanding and reasoning abilities |
| Training Data | RussRobin/SpatialQA |
| Tags | Embodied AI, MLLM, VLM, Spatial Understanding, Phi-2 |
| Pipeline Tag | visual-question-answering |