Dolphin Open-Source Multimodal Document Image Parsing Model - Analyze First, Then Parse to Handle Complex Document Elements

Dolphin

Developed by ByteDance

Dolphin is an innovative multimodal document image parsing model that adopts an 'analyze first, parse later' paradigm to handle complex document elements.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Two-stage document parsing #Heterogeneous anchor prompts #Multi-element parallel processing

Downloads 1,620

Release Time: 5/19/2025

Model Overview

Dolphin is a multimodal model for document image parsing, capable of processing complex interwoven document elements such as text paragraphs, charts, formulas, and tables. It achieves comprehensive page-level layout analysis and efficient element-level parsing through a two-stage approach.

Model Features

Two-stage parsing method

Performs page-level layout analysis first, followed by element-level parsing, effectively handling complex document structures

Heterogeneous anchor prompts

Uses natural language prompts to control parsing tasks, improving parsing efficiency and accuracy

Parallel parsing mechanism

Lightweight architecture supports parallel parsing of multiple document elements, enhancing processing efficiency

Multimodal capability

Simultaneously processes visual and textual information, suitable for complex document understanding tasks

Model Capabilities

Document image parsing

Layout analysis

Table extraction

Optical character recognition

Formula recognition

Chart understanding

Multimodal processing

Use Cases

Document digitization

Scanned document parsing

Convert scanned PDFs or images into structured digital documents

Preserves the original document's layout and content structure

Information extraction

Table data extraction

Extract table data from document images and convert it into structured format

High-precision table structure recognition and data extraction

Formula recognition

Identify mathematical formulas in documents and convert them into editable format

Supports recognition of complex mathematical symbols and structures

🚀 Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin is a novel multimodal document image parsing model. It follows an analyze-then-parse paradigm, effectively addressing the challenges in complex document understanding.

🚀 Quick Start

Our demo will be released soon. Please stay tuned!

For detailed usage, please refer to our GitHub repository:

Page-wise parsing: for an entire document image
Element-wise parsing: for an element (paragraph, table, formula) image

✨ Features

Two-stage approach: Dolphin addresses the challenges of complex document understanding through a two-stage approach. In the first stage, it conducts comprehensive page-level layout analysis by generating an element sequence in natural reading order. In the second stage, it performs efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts.
Promising performance and efficiency: It achieves promising performance across diverse page-level and element-level parsing tasks. Its lightweight architecture and parallel parsing mechanism ensure superior efficiency.

📚 Documentation

Model Description

Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting) is a novel multimodal document image parsing model that follows an analyze-then-parse paradigm. It addresses the challenges of complex document understanding through a two-stage approach designed to handle intertwined elements such as text paragraphs, figures, formulas, and tables.

Overview

Document image parsing is challenging because elements such as text paragraphs, figures, formulas, and tables are intricately intertwined on the page. Dolphin addresses these challenges through a two-stage approach:

Stage 1: Comprehensive page-level layout analysis by generating an element sequence in natural reading order
Stage 2: Efficient parallel parsing of document elements using heterogeneous anchors and task-specific prompts

Dolphin achieves promising performance across diverse page-level and element-level parsing tasks while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism.
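The two stages above can be sketched as a small control flow. This is a minimal illustration, not the Dolphin implementation: the `Element` type, the label names, the prompt strings, and the `analyze_layout` / `parse_batch` callables are all hypothetical stand-ins, and batching is used here as a simple way to show elements being parsed together.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Element:
    label: str  # hypothetical type tag, e.g. "para", "tab", "formula"
    bbox: BBox  # the visual anchor: where the element sits on the page


# Illustrative prompts only (assumed); the real strings live in the Dolphin repository.
PROMPTS = {
    "tab": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}
DEFAULT_PROMPT = "Read the text in the image."

Job = Tuple[BBox, str]  # (anchor region, task-specific prompt)


def build_jobs(elements: Sequence[Element]) -> List[Job]:
    """Pair each element's region with the prompt for its type."""
    return [(el.bbox, PROMPTS.get(el.label, DEFAULT_PROMPT)) for el in elements]


def parse_page(
    page: str,
    analyze_layout: Callable[[str], List[Element]],
    parse_batch: Callable[[str, List[Job]], List[str]],
    batch_size: int = 4,
) -> List[Tuple[Element, str]]:
    """Stage 1: layout in reading order. Stage 2: parse elements batch by batch."""
    elements = analyze_layout(page)  # stage 1, already in reading order
    jobs = build_jobs(elements)
    outputs: List[str] = []
    for start in range(0, len(jobs), batch_size):  # stage 2
        outputs.extend(parse_batch(page, jobs[start:start + batch_size]))
    return list(zip(elements, outputs))  # reading order is preserved


# Stand-ins for the model calls, so the sketch runs without any model.
def fake_layout(page: str) -> List[Element]:
    return [
        Element("para", (0, 0, 100, 20)),
        Element("tab", (0, 30, 100, 80)),
        Element("formula", (0, 90, 100, 110)),
    ]


def fake_parse_batch(page: str, jobs: List[Job]) -> List[str]:
    return [f"[{prompt}] crop={bbox}" for bbox, prompt in jobs]
```

Calling `parse_page("page.png", fake_layout, fake_parse_batch, batch_size=2)` returns one `(element, text)` pair per element, in the order stage 1 produced them. In a real setup, the two stand-in functions would be replaced by calls into the model.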

Model Architecture

Dolphin is built on a vision-encoder-decoder architecture using Transformers:

Vision Encoder: Based on Swin Transformer for extracting visual features from document images
Text Decoder: Based on MBart for decoding text from visual features
Prompt-based interface: Uses natural language prompts to control parsing tasks

The model is implemented as a Hugging Face VisionEncoderDecoderModel for easy integration with the Transformers ecosystem.
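Because the model is a `VisionEncoderDecoderModel`, loading it should look like loading any other checkpoint of that class. The sketch below is a hedged example rather than the official loading code: `load_dolphin` is a hypothetical helper, the caller supplies the checkpoint ID or local path, and it assumes the checkpoint ships a processor configuration that `AutoProcessor` can resolve. See the GitHub repository for the supported entry points.

```python
def load_dolphin(model_id: str, device: str = "cpu"):
    """Load a processor and model from a Hub ID or local path (supplied by the caller).

    Hypothetical helper for illustration; not part of the Dolphin codebase.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoProcessor, VisionEncoderDecoderModel

    # Assumption: the checkpoint includes image-processor and tokenizer configs.
    processor = AutoProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.to(device)
    model.eval()  # inference only: disables dropout
    return processor, model
```

The returned processor prepares the page or element image and the prompt text, and the model's `generate` method produces the output token sequence. The exact prompt format is defined by the checkpoint and is not shown here.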

📄 License

This model is released under the MIT License.

📚 Citation

@inproceedings{dolphin2025,
  title={Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting},
  author={Feng, Hao and Wei, Shu and Fei, Xiang and Shi, Wei and Han, Yingdong and Liao, Lei and Lu, Jinghui and Wu, Binghong and Liu, Qi and Lin, Chunhui and Tang, Jingqun and Liu, Hao and Huang, Can},
  year={2025},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}
}

🔗 Acknowledgements

This model builds on several open-source projects including:

Hugging Face Transformers
Donut
Nougat
[Swin Transformer](https://github.com/microsoft/Swin-Transformer)
