coreOCR-7B-050325-preview Open-Source Vision-Language Model - Efficient Document OCR and Precise Image-Text Conversion


Developed by prithivMLmods

coreOCR-7B-050325-preview is a vision-language model fine-tuned from Qwen/Qwen2-VL-7B. It focuses on document-level OCR, long-context vision-language understanding, and accurate image-to-text conversion, including LaTeX output for mathematical content.

Image-to-Text

Transformers

English

Open Source License: Apache-2.0

#Document-level OCR #Long-context visual understanding #Mathematical LaTeX conversion

Downloads 1,532

Release Time: 5/3/2025

Model Overview

This model is optimized for document parsing, structured data extraction, and complex visual reasoning, supporting high-fidelity visual text understanding. It is suitable for tasks such as document analysis, mathematical problem solving, and multilingual OCR.

Model Features

Advanced document-level OCR

Capable of accurately processing and extracting structured text from complex multi-page documents such as invoices, tables, and research papers.

Enhanced long-context vision-language understanding

Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, charts, and mathematical content.

Optimal understanding across image resolutions

Achieves state-of-the-art results on visual benchmarks such as MathVista, DocVQA, RealWorldQA, and MTVQA.

Understanding of videos longer than 20 minutes

Capable of high-quality video-based question answering, dialogue generation, and content summarization of long video sequences.

Device control via visual commands

Combines complex reasoning and perception, and can be integrated with devices such as mobile phones or robots for vision-based automation.

Model Capabilities

Document parsing

Structured data extraction

Complex visual reasoning

Mathematical LaTeX text generation

Multilingual OCR

Long-video content understanding

Visual device control

Use Cases

Document analysis

Invoice processing

Extract structured data from scanned invoice images

High-precision text extraction and field recognition

Research paper parsing

Extract key information and references from multi-page research papers

Supports recognition of complex layouts and mathematical formulas

Education

Mathematical problem solving

Generate LaTeX text from handwritten or printed mathematical content

Accurate recognition and conversion of mathematical symbols

Chart understanding

Interpret charts and data visualizations in educational materials

Comprehensive understanding combining visual and text information

Business automation

Multilingual document digitization

Perform multilingual OCR on global business documents

Supports multiple languages and scripts

Visual robot control

Achieve automated device interaction through visual context

Complex visual reasoning and instruction execution

🚀 coreOCR-7B-050325-preview

The coreOCR-7B-050325-preview model is a fine-tuned version of Qwen/Qwen2-VL-7B, optimized for Document-Level Optical Character Recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting.

✨ Features

Advanced Document-Level OCR: Accurately processes and extracts structured text from complex, multi-page documents including invoices, forms, and research papers.
Enhanced Long-Context Vision-Language Understanding: Supports long-text retrieval and reasoning from documents and multimedia inputs, including dense text blocks, diagrams, and math content.
SoTA Understanding Across Image Resolutions: Achieves state-of-the-art results on visual benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA.
Video Comprehension up to 20+ minutes: Capable of high-quality video-based question answering, dialogue generation, and content summarization from long video sequences.
Device Control via Visual Commands: With complex reasoning and perception capabilities, it can be integrated with devices like mobile phones or robots for visually grounded automation.
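As a rough illustration of the video feature, the sketch below builds a chat message that pairs one video with a question, using the message layout that qwen_vl_utils reads for Qwen2-VL models. The helper name `build_video_messages`, the file name, and the default `fps` and `max_pixels` values are assumptions chosen for illustration; how well this fine-tune handles long videos is not verified here.

```python
from pathlib import Path


def build_video_messages(video_path, question, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn message pairing one local video with a question.

    Hypothetical helper: fps and max_pixels are illustrative defaults that
    limit how many frames and how many pixels per frame are sampled.
    """
    # Turn the local path into a file:// URI.
    video_uri = Path(video_path).resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_uri,
                    "fps": fps,
                    "max_pixels": max_pixels,
                },
                {"type": "text", "text": question},
            ],
        }
    ]


# "lecture.mp4" is a placeholder file name.
messages = build_video_messages("lecture.mp4", "Summarize this lecture.")
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the image example in the Quick Start. Lower `fps` or `max_pixels` values reduce the number of visual tokens, which matters for long recordings.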

🚀 Quick Start

Quick Start with Transformers

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

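# Load the fine-tuned weights; dtype and device placement are chosen automatically.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/coreOCR-7B-050325-preview", torch_dtype="auto", device_map="auto"
)

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained("prithivMLmods/coreOCR-7B-050325-preview")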

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to a prompt string, then collect the visual inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device where device_map="auto" placed the model.
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
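For document OCR, the same pipeline can be fed local page images instead of a demo URL. The following is a minimal sketch, assuming the pages are already saved as image files; the helper `build_ocr_messages`, the file names, and the prompt wording are hypothetical, since the model card does not prescribe a prompt.

```python
from pathlib import Path

# Hypothetical prompt wording; adjust it to the task at hand.
OCR_PROMPT = (
    "Extract all text from these pages in reading order. "
    "Write mathematical expressions in LaTeX."
)


def build_ocr_messages(page_paths, prompt=OCR_PROMPT):
    """Build one user turn containing every page image followed by the prompt."""
    content = [
        # Local files are referenced as file:// URIs.
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in page_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder file names for a two-page scan.
messages = build_ocr_messages(["page_1.png", "page_2.png"])
```

The returned `messages` replaces the list in the example above. A full page of text will usually need far more than 128 new tokens, so raising `max_new_tokens` in `model.generate` is likely necessary for complete transcriptions.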

🔧 Technical Details

Training Details

Property	Details
Dataset Size	274,209 samples (Modular Combination of Datasets)
Model Architecture	`Qwen2VLForConditionalGeneration`
Hardware	2 × NVIDIA A100 SXM (with 32 vCPUs)
Total Disk	160,000 MB
Training Time	10,390 seconds (~2.89 hours)
Learning Rate	1e-5
Scheduler	Linear Decay
Warmup Steps	700
Precision	bfloat16
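To make the learning-rate settings concrete, here is a small sketch of a linear-decay schedule with 700 warmup steps and a peak rate of 1e-5. The table does not state the total number of optimizer steps or the shape of the warmup, so `total_steps=10_000` and the linear warmup are assumptions for illustration only.

```python
def linear_warmup_decay(step, total_steps, base_lr=1e-5, warmup_steps=700):
    """Learning rate at a given step: linear warmup, then linear decay to zero.

    total_steps is an assumed value; the training table does not report it.
    """
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # assumed, not taken from the model card
for step in (0, 350, 700, 5_350, TOTAL_STEPS):
    print(step, linear_warmup_decay(step, TOTAL_STEPS))
```

Under these assumptions the rate reaches its peak of 1e-5 at step 700 and falls to half of that midway between the end of warmup and the final step.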

⚠️ Important Note

The open image-text response dataset will be updated soon.

📚 Documentation

Intended Use

This model is intended for:

Document analysis and OCR from scanned images, PDFs, and camera input.
Image-based question answering (e.g., educational content, diagrams, receipts).
Math problem solving and LaTeX text generation from handwritten or printed math content.
Long-context vision-text applications such as multi-slide document retrieval and dense information extraction.
Multilingual OCR workflows for cross-lingual business documents and global data digitization.
AI agents for mobile/robotic interaction through visual context.

Limitations

Performance may degrade on extremely noisy or low-resolution images.
Not suitable for real-time inference on edge devices due to model size and memory demands.
While multilingual, performance on low-resource or rare scripts may vary.
Not optimized for high-speed processing of video streams in constrained environments.
Contextual understanding depends on visual tokenization parameters; improper configuration may affect output quality.
Outputs may occasionally include hallucinations or incomplete answers in long-context queries.
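The visual tokenization parameters mentioned above are, for Qwen2-VL processors, the `min_pixels` and `max_pixels` bounds that control how many visual tokens each image produces (one token covers roughly a 28 × 28 pixel area). The sketch below converts a token budget into those bounds; the helper names and the 256 to 1280 token range are illustrative assumptions, not settings documented for this model.

```python
PIXELS_PER_VISUAL_TOKEN = 28 * 28  # each visual token covers about 28x28 pixels


def pixel_bounds(min_tokens=256, max_tokens=1280):
    """Convert a visual-token budget into min_pixels/max_pixels bounds."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PIXELS_PER_VISUAL_TOKEN,
        "max_pixels": max_tokens * PIXELS_PER_VISUAL_TOKEN,
    }


def load_processor(model_id="prithivMLmods/coreOCR-7B-050325-preview", **bounds):
    """Load the processor with explicit pixel bounds (requires transformers)."""
    from transformers import AutoProcessor  # imported lazily; heavy dependency

    return AutoProcessor.from_pretrained(model_id, **bounds)


bounds = pixel_bounds()
print(bounds)
# processor = load_processor(**bounds)  # uncomment to download and configure
```

A larger `max_pixels` keeps more detail in dense scans at the cost of memory and latency, while a smaller value speeds up inference but can blur small text.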

References

DocVLM: Make Your VLM an Efficient Reader https://arxiv.org/pdf/2412.08746v1
YaRN: Efficient Context Window Extension of Large Language Models https://arxiv.org/pdf/2309.00071
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution https://arxiv.org/pdf/2409.12191
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond https://arxiv.org/pdf/2308.12966
A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy https://arxiv.org/pdf/2412.02210

📄 License

This project is licensed under the Apache-2.0 license.
