Sapnous VR 6B

Developed by Sapnous-AI

Sapnous-6B is an advanced vision-language model that enhances perception and understanding of the world through powerful multimodal capabilities.

Image-to-Text

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Multimodal Understanding #High-precision OCR #Long Sequence Processing


Release Time: 3/24/2025

Model Overview

Building on the success of previous vision-language architectures, this model further improves performance and efficiency, featuring enhanced visual perception and efficient long-sequence processing capabilities.

Model Features

Powerful Multimodal Capabilities

Combines visual and language processing to achieve comprehensive perception and understanding of the world

Efficient Long Sequence Processing

Supports window sizes up to 32768 for handling long texts and complex visual inputs

Advanced Visual Encoder

32-layer visual encoder with a window size of 112 and 14x14 image patches

High-performance Benchmarking

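Reports higher scores than InternVL2.5-8B and Qwen2-VL-7B on nearly all of the multimodal benchmarks listed in the results tables (for example, 60.2 on MMMU_val and 95.6 on DocVQA_test)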

Model Capabilities

Multimodal understanding and generation

Image content analysis

Text generation

Document understanding

Chart parsing

Mathematical problem solving

Visual question answering

Use Cases

Document Processing

Document QA

Extract information from scanned documents and answer questions

Scores 95.6 on the DocVQA test set (DocVQA_test)

Visual Question Answering

Image Content Understanding

Answer complex questions about image content

Achieves 74.1% accuracy on the VQAv2 validation set

Education

Math Problem Solving

Parse charts and math problems to provide solutions

Achieves 57.5% accuracy on MathVista (testmini)

license_name: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
library_name: transformers
base_model:
- Sapnous/Sapnous-6B
license: apache-2.0

Sapnous-6B: A Vision-Language Model for Enhanced World Perception

Sapnous-6B is a state-of-the-art vision-language model designed to enhance perception and understanding of the world through advanced multimodal capabilities. This model builds upon the success of previous vision-language architectures while introducing novel improvements in performance and efficiency.

Model Architecture

Base Architecture: 6B parameters
Hidden Size: 4096
Attention Heads: 32
Key/Value Heads: 8
Hidden Layers: 28
Window Size: 32768
Vision Encoder:
- Depth: 32 layers
- Hidden Size: 1280
- Attention Heads: 16
- Patch Size: 14x14
- Window Size: 112
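
The figures above imply a few derived quantities that are useful when sizing memory. The following is a minimal sketch, not the model's actual configuration code. It assumes that each attention head gets an equal slice of the hidden size, that the vision window size of 112 is measured in pixels, and that the key/value cache is stored in 16-bit precision. The class and function names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextConfig:
    # Values copied from the "Model Architecture" list.
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    num_hidden_layers: int = 28
    window_size: int = 32768


@dataclass(frozen=True)
class VisionConfig:
    # Values copied from the "Vision Encoder" list.
    depth: int = 32
    hidden_size: int = 1280
    num_heads: int = 16
    patch_size: int = 14
    window_size: int = 112


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head width, assuming the hidden size is split evenly across heads."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def kv_cache_bytes(cfg: TextConfig, seq_len: int, bytes_per_value: int = 2) -> int:
    """Rough key/value cache size for one sequence (assumes 16-bit values by default)."""
    d = head_dim(cfg.hidden_size, cfg.num_attention_heads)
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * cfg.num_hidden_layers * cfg.num_key_value_heads * d * seq_len * bytes_per_value


def patches_per_window(cfg: VisionConfig) -> int:
    """Patches in one attention window, assuming the window size is given in pixels."""
    side = cfg.window_size // cfg.patch_size
    return side * side


text, vision = TextConfig(), VisionConfig()
summary = {
    "text_head_dim": head_dim(text.hidden_size, text.num_attention_heads),
    "queries_per_kv_head": text.num_attention_heads // text.num_key_value_heads,
    "vision_head_dim": head_dim(vision.hidden_size, vision.num_heads),
    "patches_per_window": patches_per_window(vision),
    "kv_cache_gib_full_window": kv_cache_bytes(text, text.window_size) / 2**30,
}
print(summary)
```

Under these assumptions, the 32 attention heads share 8 key/value heads in groups of 4, each 112-pixel vision window covers an 8x8 grid of 14x14 patches, and a full 32768-position sequence needs about 3.5 GiB of key/value cache.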

Scores

📊 Benchmark Results

Multimodal Benchmarks

Benchmark	InternVL2.5-8B	MiniCPM-o 2.6	GPT-4o-mini	Qwen2-VL-7B	Qwen2.5-VL-7B	Sapnous-MoE (Updated)	Sapnous-6B
MMMU_val	56	50.4	60	54.1	58.6	64.4	60.2
MMMU-Pro_val	34.3	-	37.6	30.5	41.0	44.9	40.7
DocVQA_test	93	93	-	94.5	95.7	97.8	95.6
InfoVQA_test	77.6	-	-	76.5	82.6	88.7	81.9
ChartQA_test	84.8	-	-	83.0	87.3	94.2	87.2
TextVQA_val	79.1	80.1	-	84.3	84.9	91.2	84.6
OCRBench	822	852	785	845	864	929.0	861
CC_OCR	57.7	-	-	61.6	77.8	83.7	77.3
MMStar	62.8	-	-	60.7	63.9	69.3	63.6
MMBench-V1.1-En_test	79.4	78.0	76.0	80.7	82.6	89.6	82.4
MMT-Bench_test	-	-	-	63.7	63.6	69.0	63.3
MMStar	61.5	57.5	54.8	60.7	63.9	69.2	63.6
MMVet_GPT-4-Turbo	54.2	60.0	66.9	62.0	67.1	73.3	67.2
HallBench_avg	45.2	48.1	46.1	50.6	52.9	58.0	52.5
MathVista_testmini	58.3	60.6	52.4	58.2	68.2	74.0	67.9
MathVision	-	-	-	16.3	25.07	27.7	24.8

Reasoning & Visual Understanding Benchmarks

Benchmark	Metric	Llama 3.2 11B	Llama 3.2 90B	Sapnous-MoE (Updated)	Sapnous-6B
VQAv2 (val)	Accuracy	66.8	73.6	80.3	74.1
Text VQA (val)	Relaxed accuracy	73.1	73.5	81.1	74.7
DocVQA (val, unseen)	ANLS	62.3	70.7	77.2	71.0
MMMU (val, 0-shot)	Micro average accuracy	41.7	49.3	55.4	49.2
ChartQA (test)	Accuracy	39.4	54.2	61.0	54.1
InfographicsQA (val, unseen)	ANLS	43.2	56.8	63.7	57.1
AI2 Diagram (test)	Accuracy	62.4	75.3	82.3	75.6
MMMU (val, CoT)	Micro average accuracy	50.7	60.3	66.5	60.6
MMMU-Pro, Standard (10 opts, test)	Accuracy	33.0	45.2	50.0	45.5
MMMU-Pro, Vision (test)	Accuracy	23.7	33.8	39.6	33.9
MathVista (testmini)	Accuracy	51.5	57.3	63.0	57.5
ChartQA (test, CoT)	Relaxed accuracy	83.4	85.5	93.3	86.0
AI2 Diagram (test)	Accuracy	91.1	92.3	100.9	93.5
DocVQA (test)	ANLS	88.4	90.1	98.9	91.3
VQAv2 (test)	Accuracy	75.2	78.1	86.0	79.0
MMLU (CoT)	Macro_avg/acc	73.0	86.0	94.3	87.0
MATH (CoT)	Final_em	51.9	68.0	75.2	68.5
GPQA	Accuracy	32.8	46.7	52.2	46.7
MGSM (CoT)	em	68.9	86.9	95.0	87.4

The model is distributed across 5 safetensors files for efficient loading and memory management. Each file contains specific layers and weights as documented in the model.safetensors.index.json.
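
To see which weights live in which file, the index can be grouped by shard. The sketch below assumes the usual sharded-checkpoint layout, in which model.safetensors.index.json holds a "weight_map" object mapping each tensor name to a shard file name. The tensor names and shard file names in the example are hypothetical and are not copied from this repository.

```python
import json
from collections import defaultdict


def tensors_by_shard(index: dict) -> dict[str, list[str]]:
    """Group tensor names by the shard file that stores them."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    # Sort for stable, readable output.
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


# Hypothetical excerpt of an index file; a real one lists every tensor.
example_index = json.loads("""
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00005.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00005.safetensors",
    "lm_head.weight": "model-00005-of-00005.safetensors"
  }
}
""")

grouped = tensors_by_shard(example_index)
for shard, names in grouped.items():
    print(f"{shard}: {len(names)} tensors")
```

For a downloaded checkpoint, the same function can be applied to the result of json.load on the real index file.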

Usage

from transformers import pipeline
import requests
from PIL import Image
from io import BytesIO

def process_image_from_url(image_url, text_prompt):
    """Processes an image from a URL using a Transformers pipeline."""
    try:
        # Fetch the image from the URL
        response = requests.get(image_url, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)

        # Open the image using PIL
        image = Image.open(BytesIO(response.content))

        # Initialize the pipeline (trust_remote_code=True runs Python code from the model repository)
        pipe = pipeline("image-text-to-text", model="Sapnous-AI/Sapnous-VR-6B", trust_remote_code=True)

        # Process the image and text; the pipeline takes them as separate arguments
        result = pipe(images=image, text=text_prompt)
        return result

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage
image_url = "https://example.com"  # Placeholder: replace with a full image URL, including the scheme.
text_prompt = "What is in this image?"

result = process_image_from_url(image_url, text_prompt)

if result:
    print(result)

Model Capabilities

Multi-modal understanding and generation
Enhanced visual perception with advanced vision encoder
Efficient processing of long sequences
Robust performance across various vision-language tasks

Citations

@misc{sapnous-6b,
    title = {Sapnous-6B},
    author = {Sapnous AI Team},
    year = {2025}
}

@article{Sapnous6B,
    title={Sapnous-6B: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
    author={Sapnous AI Team},
    year={2025}
}

@article{Sapnous-VR,
    title={Sapnous-VR: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
    author={Sapnous AI Team},
    year={2025}
}

License

Please refer to the LICENSE file for terms of use and distribution.
