olmOCR-7B-0225-preview
This is a preview release of the olmOCR model, fine-tuned from Qwen2-VL-7B-Instruct using the olmOCR-mix-0225 dataset, aiming to provide efficient document processing capabilities.
The best way to use this model is via the olmOCR toolkit. The toolkit comes with an efficient inference setup via sglang that can handle millions of documents at scale.
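For large-scale runs, the toolkit is typically driven from the command line. A sketch of an invocation follows; the workspace and PDF paths are placeholders, and flags may vary by toolkit version, so check the olmOCR repository for current usage:

python -m olmocr.pipeline ./localworkspace --pdfs path/to/document.pdf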
🚀 Quick Start
This model expects as input a single document image, rendered such that its longest dimension is 1024 pixels. The prompt must also contain additional metadata extracted from the document, and the easiest way to generate this is to use the helpers provided by the olmOCR toolkit.
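Concretely, that means one call to render the page and one to build the prompt. Here is a minimal sketch using the toolkit helpers (the same calls appear in the full example under Usage Examples below; ./paper.pdf stands in for your own file):

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Render page 1 of a local PDF so its longest dimension is 1024 pixels
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Pull metadata (anchor text) from the same page and wrap it in the model's prompt
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)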
✨ Features
- Fine-tuned Model: Fine-tuned from Qwen2-VL-7B-Instruct using the olmOCR-mix-0225 dataset.
- Efficient Toolkit: The olmOCR toolkit provides an efficient inference setup via sglang for large-scale document processing.
📦 Installation
If you want to use the olmOCR toolkit for manual prompting, you need to install it first:
pip install olmocr
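The toolkit renders PDFs via poppler, so the poppler utilities must also be present on your system; on Debian/Ubuntu, for example:

sudo apt-get install poppler-utils

See the olmOCR repository for the complete list of system dependencies.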
💻 Usage Examples
Basic Usage
If you want to prompt this model manually instead of using the olmOCR toolkit, you can use the following code:
import torch
import base64
import urllib.request

from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.prompts import build_finetuning_prompt
from olmocr.prompts.anchor import get_anchor_text

# Initialize the model
model = Qwen2VLForConditionalGeneration.from_pretrained("allenai/olmOCR-7B-0225-preview", torch_dtype=torch.bfloat16).eval()
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Grab a sample PDF
urllib.request.urlretrieve("https://molmo.allenai.org/paper.pdf", "./paper.pdf")

# Render page 1 to an image whose longest dimension is 1024 pixels
image_base64 = render_pdf_to_base64png("./paper.pdf", 1, target_longest_image_dim=1024)

# Build the prompt, using document metadata (anchor text) extracted from the PDF
anchor_text = get_anchor_text("./paper.pdf", 1, pdf_engine="pdfreport", target_length=4000)
prompt = build_finetuning_prompt(anchor_text)

# Build the full chat message, combining the prompt text and the page image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
        ],
    }
]

# Apply the chat template and run the processor to get model inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output
output = model.generate(
    **inputs,
    temperature=0.8,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)

print(text_output)
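Note that max_new_tokens=50 keeps this demo fast but truncates the output after a few lines; a full page of dense text typically needs a budget of several thousand new tokens, so raise this value when transcribing real documents.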
📄 License
olmOCR is licensed under the Apache 2.0 license. olmOCR is intended for research and educational use. For more information, please see our Responsible Use Guidelines.