MLCD-ViT-bigG Open-source Model - Free Deployment to Support Document Understanding and Visual Question Answering Tasks

mlcd-vit-bigG-patch14-448

Developed by DeepGlint-AI

MLCD-ViT-bigG is an advanced Vision Transformer model enhanced with 2D Rotary Position Encoding (RoPE2D), excelling in document understanding and visual question answering tasks.

Text Recognition

Safetensors

Open Source License: MIT #Document Visual Question Answering #2D Rotary Position Encoding #High-precision Visual Understanding

Downloads 1,517

Release Time: 2/12/2025

Model Overview

Developed by DeepGlint AI, this model employs a Vision Transformer architecture enhanced with 2D Rotary Position Encoding (RoPE2D), specifically designed for complex vision-language interaction tasks, demonstrating outstanding performance in document understanding and visual question answering.
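To make the idea of 2D rotary position encoding more concrete, here is a minimal, generic sketch in plain Python. It is not taken from the MLCD code. The function names `rope_1d` and `rope_2d`, the frequency base of 10000, and the choice to rotate the first half of a vector by the row index and the second half by the column index are all illustrative assumptions. The sketch shows the property that makes rotary encodings attractive: the dot product between two rotated vectors depends only on the relative row and column offset between their positions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed layout: first half encodes the row, second half the column."""
    if len(vec) % 4 != 0:
        raise ValueError("vector length must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors for two patches.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]

# Both pairs of positions differ by (2 rows, 3 columns).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 10, 13), rope_2d(k, 8, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair of values is only rotated, the length of the vector is unchanged; only the relative angle between query and key carries the position information.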

Model Features

2D Rotary Position Encoding (RoPE2D)

Uses 2D rotary position encoding, which improves the model's handling of spatial position information

Exceptional Document Understanding

Scores higher than the CLIP, SigLIP, and DFN5B vision towers on ChartQA, DocVQA, InfoVQA, and OCRBench in the performance comparison on this page

High-Resolution Processing

Supports 448px high-resolution image input, capturing finer visual features
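As a rough illustration of what the higher input resolution means for the transformer, the following sketch computes the patch grid for a square image, using the 14-pixel patch size from the model name. The helper `patch_grid` is a hypothetical name introduced here, and the count covers patch tokens only; any extra tokens the model may add (such as a class token) are not included.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side


for size in (336, 448):
    side, tokens = patch_grid(size, 14)
    print(f"{size}px -> {side}x{side} grid, {tokens} patch tokens")
# 336px -> 24x24 grid, 576 patch tokens
# 448px -> 32x32 grid, 1024 patch tokens
```

Moving from 336px to 448px therefore raises the number of patch tokens from 576 to 1024, which is where the finer spatial detail comes from.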

Model Capabilities

Image Feature Extraction

Document Understanding

Visual Question Answering

Chart Analysis

OCR Enhancement

Use Cases

Document Processing

Document Question Answering

Extract information from complex documents and answer questions

Achieves 83.34% accuracy on the DocVQA dataset

Table Understanding

Parse and understand tabular data in documents

Visual Question Answering

Chart Analysis

Understand and answer questions about charts

Achieves 73.80% accuracy on the ChartQA dataset

Information Extraction

Extract structured information from images

Achieves 46.59% accuracy on the InfoVQA dataset

🚀 MLCD-ViT-bigG Model Card

MLCD-ViT-bigG is a cutting-edge vision transformer model that leverages 2D Rotary Position Embedding (RoPE2D) to excel in document understanding and visual question-answering tasks.

⚠️ Important Note

LLaVA-NeXT and transformers now support MLCD-ViT-bigG-14-448px.

💡 Usage Tip

To evaluate the foundational visual models, we used the official LLaVA-NeXT and its official training dataset, LLaVA-NeXT-Data, with Qwen2.5-7B as the language model.

MLCD-ViT-bigG is a state-of-the-art vision transformer model enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.

Property	Details
Pipeline Tag	image-feature-extraction
License	MIT

Performance Comparison

Vision Tower	RoPE2D	ChartQA	DocVQA	InfoVQA	OCRBench	MMMU
CLIP (ViT-L-14-336px)	×	66.52	75.21	38.88	525.00	44.20
SigLIP (ViT-SO400M-384px)	×	69.28	76.71	41.38	554.00	46.78
DFN5B (ViT-H-14-378px)	×	64.36	70.87	38.59	473.00	48.00
MLCD (ViT-L-14-336px)	×	67.84	76.46	43.48	531.00	44.30
MLCD (ViT-bigG-14-336px)	√	71.07	79.63	44.38	572.00	46.78
MLCD (ViT-bigG-14-448px)	√	73.80	83.34	46.59	582.00	46.00
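The comparison above can also be read programmatically. The sketch below copies the scores from the table into a dictionary and reports the top vision tower per benchmark, plus the gain from moving MLCD ViT-bigG from 336px to 448px. The names `SCORES`, `best_per_benchmark`, and `gain` are hypothetical helpers introduced for this illustration.

```python
BENCHMARKS = ("ChartQA", "DocVQA", "InfoVQA", "OCRBench", "MMMU")

# Scores copied from the performance comparison table, in BENCHMARKS order.
SCORES = {
    "CLIP (ViT-L-14-336px)": (66.52, 75.21, 38.88, 525.00, 44.20),
    "SigLIP (ViT-SO400M-384px)": (69.28, 76.71, 41.38, 554.00, 46.78),
    "DFN5B (ViT-H-14-378px)": (64.36, 70.87, 38.59, 473.00, 48.00),
    "MLCD (ViT-L-14-336px)": (67.84, 76.46, 43.48, 531.00, 44.30),
    "MLCD (ViT-bigG-14-336px)": (71.07, 79.63, 44.38, 572.00, 46.78),
    "MLCD (ViT-bigG-14-448px)": (73.80, 83.34, 46.59, 582.00, 46.00),
}


def best_per_benchmark(scores):
    """Map each benchmark to the vision tower with the highest score."""
    return {
        bench: max(scores, key=lambda tower: scores[tower][idx])
        for idx, bench in enumerate(BENCHMARKS)
    }


def gain(scores, new, old):
    """Per-benchmark score difference between two vision towers."""
    return {
        bench: round(scores[new][idx] - scores[old][idx], 2)
        for idx, bench in enumerate(BENCHMARKS)
    }


best = best_per_benchmark(SCORES)
delta = gain(SCORES, "MLCD (ViT-bigG-14-448px)", "MLCD (ViT-bigG-14-336px)")
for bench in BENCHMARKS:
    print(f"{bench}: best = {best[bench]}, 448px vs 336px = {delta[bench]:+.2f}")
```

Read this way, the 448px model leads on ChartQA, DocVQA, InfoVQA, and OCRBench, while the highest MMMU score in the table belongs to DFN5B (ViT-H-14-378px).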

📦 Installation

pip install torch transformers
git clone https://github.com/deepglint/unicom
cd unicom/mlcd

💻 Usage Examples

Basic Usage

from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Get visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")

📄 License

This project is licensed under the MIT License.

📚 Citation

@inproceedings{anxiang_2024_mlcd,
  title={Multi-label Cluster Discrimination for Visual Representation Learning},
  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
  booktitle={ECCV},
  year={2024}
}
