olmOCR-7B-0225-preview
This is a preview release of the olmOCR model, fine-tuned from Qwen2-VL-7B-Instruct using the olmOCR-mix-0225 dataset, aiming to provide efficient document processing capabilities.
The best way to use this model is via the olmOCR toolkit. The toolkit comes with an efficient inference setup via sglang that can handle millions of documents at scale.
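For large-scale runs, the toolkit is typically driven from the command line. A sketch of an invocation follows; the workspace and PDF paths are placeholders, and flags may vary by toolkit version, so check the olmOCR repository for current usage:

python -m olmocr.pipeline ./localworkspace --pdfs path/to/document.pdf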
🚀 Quick Start
This model expects as input a single document image, rendered such that its longest dimension is 1024 pixels. The prompt must also contain additional metadata extracted from the document, and the easiest way to generate this is to use the helpers provided by the olmOCR toolkit.
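Concretely, that means one call to render the page and one to build the prompt. Here is a minimal sketch using the toolkit helpers (the same calls appear in the full example under Usage Examples below; ./paper.pdf stands in for your own file):

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Render page 1 of a local PDF so its longest dimension is 1024 pixels
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Pull metadata (anchor text) from the same page and wrap it in the model's prompt
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)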
✨ Features
- Fine-tuned Model: Fine-tuned from Qwen2-VL-7B-Instruct using the olmOCR-mix-0225 dataset.
- Efficient Toolkit: The olmOCR toolkit provides an efficient inference setup via sglang for large-scale document processing.
📦 Installation
If you want to use the olmOCR toolkit for manual prompting, you need to install it first:
pip install olmocr
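The toolkit renders PDFs via poppler, so the poppler utilities must also be present on your system; on Debian/Ubuntu, for example:

sudo apt-get install poppler-utils

See the olmOCR repository for the complete list of system dependencies.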
💻 Usage Examples
Basic Usage
If you want to prompt this model manually instead of using the olmOCR toolkit, you can use the following code:
import torch
import base64
import urllib.request

from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")

# Render page 1 to an image whose longest dimension is 1024 pixels
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Build the prompt, using document metadata (anchor text) extracted from the PDF
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full chat message, combining the prompt text and the page image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and run the processor to get model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)

print(text_output)
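Note that max_new_tokens=50 keeps this demo fast but truncates the output after a few lines; a full page of dense text typically needs a budget of several thousand new tokens, so raise this value when transcribing real documents.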
📄 License
olmOCR is licensed under the Apache 2.0 license. olmOCR is intended for research and educational use. For more information, please see our Responsible Use Guidelines.