# 🚀 SpatialBot-3B
SpatialBot is a Visual Language Model (VLM) equipped with spatial understanding and reasoning capabilities. It precisely interprets depth maps to perform high-level tasks.
## 🚀 Quick Start

### ✨ Features
This Hugging Face repo offers the merged SpatialBot-3B, built on Phi-2 and SigLIP. It performs well on general VLM tasks and on spatial-understanding benchmarks such as SpatialBench.
### 📦 Installation

Install the dependencies first:

```bash
pip install torch transformers accelerate pillow numpy
```
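The example below loads the model in `float16` and sets `device = 'cuda'`, so a CUDA-capable GPU is expected. A quick sanity check, assuming PyTorch was installed with CUDA support:

```python
import torch

# The usage example loads SpatialBot-3B in float16 on a GPU,
# so confirm CUDA is visible before proceeding.
print(torch.__version__, torch.cuda.is_available())
```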
### 💻 Usage Examples

#### Basic Usage
```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import numpy as np

# Silence logging, progress bars, and warnings for a cleaner demo.
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

device = 'cuda'  # the model is loaded in float16, so a CUDA device is expected
model_name = 'RussRobin/SpatialBot-3B'
offset_bos = 0

# Load the merged model and its tokenizer; trust_remote_code is required
# because SpatialBot ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True)

# The query point <0.5,0.2> uses (x, y) coordinates normalized to [0, 1].
prompt = 'What is the depth value of point <0.5,0.2>? Answer directly from depth map.'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image 1>\n<image 2>\n{prompt} ASSISTANT:"

# Split the prompt around the image placeholders and splice in the special
# image token ids (-201 for image 1, -202 for image 2).
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image 1>\n<image 2>\n')]
input_ids = torch.tensor(text_chunks[0] + [-201] + [-202] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)

image1 = Image.open('rgb.jpg')    # RGB image
image2 = Image.open('depth.png')  # depth map

# If the depth map is single-channel (e.g. a 16-bit depth image),
# pack it into three 8-bit channels as SpatialBot expects.
channels = len(image2.getbands())
if channels == 1:
    img = np.array(image2)
    height, width = img.shape
    three_channel_array = np.zeros((height, width, 3), dtype=np.uint8)
    three_channel_array[:, :, 0] = (img // 1024) * 4
    three_channel_array[:, :, 1] = (img // 32) * 8
    three_channel_array[:, :, 2] = (img % 32) * 8
    image2 = Image.fromarray(three_channel_array, 'RGB')

image_tensor = model.process_images([image1, image2], model.config).to(dtype=model.dtype, device=device)

output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True,
    repetition_penalty=1.0  # increase this to avoid the chatbot repeating itself
)[0]

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
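The single-channel branch above packs a (typically 16-bit) depth value `d` into three 8-bit channels: `R = (d // 1024) * 4`, `G = ((d // 32) * 8) mod 256` (the uint8 cast wraps automatically), and `B = (d % 32) * 8`. This mapping is lossless, so the original depth can be recovered. Below is a minimal, illustrative decoder; `decode_depth` is a hypothetical helper for inspection, not part of the SpatialBot API:

```python
import numpy as np

def decode_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the 3-channel depth packing (illustrative helper).

    R = (d // 1024) * 4, G = ((d // 32) * 8) mod 256, B = (d % 32) * 8,
    hence d = (R // 4) * 1024 + (G // 8) * 32 + B // 8.
    """
    r = rgb[:, :, 0].astype(np.uint32)
    g = rgb[:, :, 1].astype(np.uint32)
    b = rgb[:, :, 2].astype(np.uint32)
    return (r // 4) * 1024 + (g // 8) * 32 + b // 8

# Round trip on a sample value: d = 1234 packs to R=4, G=48, B=144.
d = np.array([[1234]], dtype=np.uint16)
packed = np.stack([(d // 1024) * 4, ((d // 32) * 8) % 256, (d % 32) * 8], axis=-1).astype(np.uint8)
assert decode_depth(packed)[0, 0] == 1234
```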
## 📚 Documentation
## 📄 License
This project is licensed under the CC-BY-4.0 license.
| Property | Details |
|----------|---------|
| Model Type | VLM with spatial understanding and reasoning abilities |
| Training Data | RussRobin/SpatialQA |
| Tags | Embodied AI, MLLM, VLM, Spatial Understanding, Phi-2 |
| Pipeline Tag | visual-question-answering |