Kosmos-2.5 Open-Source Multimodal Reading and Writing Model - Free for Image and Text Recognition and Structured Output

Kosmos-2.5

Developed by Microsoft

Kosmos-2.5 is a multimodal reading and writing model designed for machine reading of text-dense images, capable of text recognition and structured output from images.

Image-to-Text

Transformers

English

Open Source License: MIT

#Multimodal Reading and Writing #Text-Dense Image Parsing #Markdown Generation

Downloads 5,531

Release Time: 5/13/2024

Model Overview

Kosmos-2.5 is a multimodal reading and writing model focused on machine reading tasks for text-dense images. It can generate spatially aware text blocks and output structured text, suitable for tasks such as document-level text recognition and image-to-Markdown text generation.

Model Features

Multimodal Reading and Writing Capability

Combines visual and language processing capabilities to achieve text recognition and structured output from images.

Spatial-Aware Text Blocks

Can annotate the coordinate positions of each text block in the image, providing spatial information.
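
As an illustration of how such coordinates might be consumed downstream, the sketch below parses a tagged string into boxes and text. It assumes each block is emitted as `<bbox><x_N><y_N><x_N><y_N></bbox>` followed by the block's text. That tag layout, the `TextBlock` type, and the sample string are assumptions made for illustration; this page does not specify the output format.

```python
import re
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """One recognized text block with its bounding box (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int
    text: str


# Assumed tag layout: <bbox><x_a><y_b><x_c><y_d></bbox> followed by the text.
BBOX_RE = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def parse_text_blocks(raw: str) -> List[TextBlock]:
    """Split a tagged string into (box, text) records."""
    matches = list(BBOX_RE.finditer(raw))
    blocks = []
    for i, m in enumerate(matches):
        # A block's text runs until the next <bbox> tag, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        text = raw[m.end():end].strip()
        x0, y0, x1, y1 = (int(g) for g in m.groups())
        blocks.append(TextBlock(x0, y0, x1, y1, text))
    return blocks


# Made-up sample, not real model output.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>Invoice\n"
    "<bbox><x_10><y_50><x_90><y_70></bbox>Total: 12.00\n"
)
for block in parse_text_blocks(sample):
    print(block)
```

Keeping the box and the text together in one record makes it straightforward to sort blocks by position or to draw them back onto the image.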

Structured Output

Converts the styles and structure of text in the image into Markdown, so the output is easy to process and reuse.

Task Adaptability

Through supervised fine-tuning with different prompts, it can quickly adapt to various text-dense image understanding tasks.

Model Capabilities

Text recognition

Image-to-Markdown

Document understanding

Spatial text annotation

Use Cases

Document Processing

End-to-End Document-Level Text Recognition

Extracts text content from complex document images while preserving structural information

High-precision text recognition and structure retention

Image-to-Markdown

Converts text-containing images into structured Markdown format

Markdown output that preserves original styles and structures

Rich Text Image Processing

Real-World Rich Text Image Understanding

Processes real-world images with complex text layouts

Generalized text-dense image understanding capability

🚀 Kosmos-2.5

Kosmos-2.5 is a multimodal literate model designed for machine reading of text-intensive images. It offers unified capabilities in generating spatially-aware text blocks and structured Markdown-formatted text, making it a general-purpose tool for real-world applications involving text-rich images.

Microsoft Document AI | GitHub

🚀 Quick Start

The model is pre-trained on large-scale text-intensive images. It can be applied to end-to-end document-level text recognition and image-to-Markdown text generation tasks. It can also be adapted to other text-intensive image understanding tasks via supervised fine-tuning.

✨ Features

Multimodal Literacy: Kosmos-2.5 excels in two key transcription tasks: generating spatially-aware text blocks with their spatial coordinates in the image, and producing structured Markdown-formatted text.
Unified Architecture: Achieves unified multimodal literate capabilities through a shared decoder-only auto-regressive Transformer architecture, task-specific prompts, and flexible text representations.
General-Purpose Applicability: Can be easily adapted for various text-intensive image understanding tasks with different prompts through supervised fine-tuning.
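
The task-specific prompts mentioned above can be pictured as a small lookup from task to prompt string. The sketch below is only illustrative: the prompt strings `<ocr>` and `<md>`, the task names, and both helper functions are assumptions, and the actual prompts should be checked against ocr.py and md.py.

```python
# Hypothetical mapping from task name to prompt string. "<ocr>" and "<md>"
# are assumed values; verify them against ocr.py and md.py before relying on them.
TASK_PROMPTS = {"ocr": "<ocr>", "markdown": "<md>"}


def prompt_for(task: str) -> str:
    """Return the prompt string for a task, or raise on an unknown task."""
    try:
        return TASK_PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}"
        ) from None


def strip_prompt(decoded: str, task: str) -> str:
    """Drop a leading prompt string if the decoded output echoes it back."""
    prompt = prompt_for(task)
    return decoded[len(prompt):] if decoded.startswith(prompt) else decoded


print(prompt_for("markdown"))
print(strip_prompt("<md># Receipt", "markdown"))
```

Because a single shared decoder serves both tasks, switching tasks in this picture only means changing the prompt string; the rest of the calling code stays the same.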

Kosmos-2.5: A Multimodal Literate Model

💡 Usage Tip

⚠️ Important Note

Because this is a generative model, it may hallucinate during generation, and it cannot guarantee the accuracy of all OCR or Markdown results for an image.
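
One inexpensive safeguard, sketched below, is to discard predicted boxes that are geometrically impossible before using them. This is a minimal sketch under assumptions: the `(x0, y0, x1, y1)` box layout and the `plausible_boxes` helper are hypothetical, and the check only catches inconsistent geometry. It cannot detect text that was recognized incorrectly.

```python
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1) in pixels


def plausible_boxes(boxes: Iterable[Box], width: int, height: int) -> List[Box]:
    """Keep only boxes that have positive area and lie inside the image."""
    kept = []
    for x0, y0, x1, y1 in boxes:
        has_area = x0 < x1 and y0 < y1
        in_bounds = 0 <= x0 and 0 <= y0 and x1 <= width and y1 <= height
        if has_area and in_bounds:
            kept.append((x0, y0, x1, y1))
    return kept


candidates = [(0, 0, 10, 10), (5, 5, 5, 9), (0, 0, 200, 10)]
print(plausible_boxes(candidates, width=100, height=100))
```

Results that matter, such as amounts on a receipt, would still need to be checked against the source image.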

💻 Usage Examples

Markdown Task

For usage instructions, please refer to md.py.

OCR Task

For usage instructions, please refer to ocr.py.
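
If preprocessing resizes the image before the model sees it, box coordinates in the OCR output refer to the resized image and have to be mapped back to the original. The sketch below shows that arithmetic only. Whether and how resizing happens is an assumption here, and `rescale_box` is a hypothetical helper rather than part of ocr.py.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (x0, y0, x1, y1) in pixels
Size = Tuple[int, int]           # (width, height)


def rescale_box(box: Box, raw_size: Size, model_size: Size) -> Box:
    """Map a box from resized-image coordinates back to the original image."""
    raw_w, raw_h = raw_size
    model_w, model_h = model_size
    scale_x = raw_w / model_w
    scale_y = raw_h / model_h
    x0, y0, x1, y1 = box
    # Truncate to integer pixels after scaling each axis independently.
    return (int(x0 * scale_x), int(y0 * scale_y), int(x1 * scale_x), int(y1 * scale_y))


# A 200x100 original that was resized to 100x100 for the model.
print(rescale_box((10, 20, 30, 40), raw_size=(200, 100), model_size=(100, 100)))
```

Scaling each axis separately matters when the resize does not preserve the aspect ratio.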

📚 Documentation

Kosmos-2.5 is evaluated on end-to-end document-level text recognition and image-to-Markdown text generation. This work also paves the way for the future scaling of multimodal large language models.

📄 License

The content of this project itself is licensed under the MIT License.

Microsoft Open Source Code of Conduct

📚 Citation

If you find Kosmos-2.5 useful in your research, please cite the following paper:

@article{lv2023kosmos,
  title={Kosmos-2.5: A multimodal literate model},
  author={Lv, Tengchao and Huang, Yupan and Chen, Jingye and Cui, Lei and Ma, Shuming and Chang, Yaoyao and Huang, Shaohan and Wang, Wenhui and Dong, Li and Luo, Weiyao and others},
  journal={arXiv preprint arXiv:2309.11419},
  year={2023}
}
