Nanonets-OCR-s-GGUF: an open-source OCR model that converts documents into Markdown for free


Nanonets OCR S GGUF

Published by unsloth

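Nanonets-OCR-s is an advanced image-to-Markdown optical character recognition (OCR) model that converts documents into structured Markdown, with intelligent content recognition and semantic tagging.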

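Task: Image-to-Text

Library: Transformers

Language: English
Tags: #structured-Markdown-conversion #LaTeX-formula-recognition #intelligent-document-parsing

Downloads: 2,300

Released: 6/16/2025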

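Model Overview

Nanonets-OCR-s is a powerful OCR model designed to convert documents into structured Markdown. Beyond extracting text, it recognizes and tags complex content such as LaTeX equations, images, signatures, and watermarks, which makes its output well suited for downstream processing by large language models (LLMs).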


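Model Capabilities

Document OCR

LaTeX formula recognition

Image description

Signature detection

Watermark extraction

Table extraction

Checkbox handling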

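Use Cases

Document processing

Academic paper processing

Converts academic papers containing mathematical formulas and tables into structured Markdown.

Preserves the structure and semantics of the original document for later analysis and processing.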
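The model returns tables as HTML (the Quick Start prompt asks it to "Return the tables in html format"), so getting tables out of a converted paper mostly means parsing that HTML. Below is a minimal sketch using only Python's standard html.parser module. TableExtractor and extract_tables are hypothetical helper names, and the sketch assumes flat tables with explicitly closed tr/td/th tags: colspan and rowspan are ignored and nested tables are not supported.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a text as a list of rows of cell strings (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[list[list[str]]] = []
        self._row: list[str] | None = None   # row currently being read
        self._cell: list[str] | None = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside table cells (the surrounding Markdown) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(markdown_text: str) -> list[list[list[str]]]:
    """Return all HTML tables found in the model output, one list of rows per table."""
    parser = TableExtractor()
    parser.feed(markdown_text)
    parser.close()
    return parser.tables
```

For example, extract_tables(result) on the output of the Quick Start code yields one list of rows per table, which can be handed to csv.writer or a DataFrame constructor.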

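Business contract processing

Extracts text, signatures, and watermark information from contracts.

Automates the processing of legal documents to improve efficiency.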
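Contract workflows typically need signatures and watermarks separated from the body text. Assuming the output uses the plain <tag>...</tag> form requested in the Quick Start prompt (no attributes on the tags), a regular expression is enough. This is a minimal sketch: extract_tagged_content and remove_tagged are hypothetical helpers, and the tag list simply mirrors the tags named on this page.

```python
from __future__ import annotations

import re

# Tag names taken from this page and from the Quick Start prompt.
SEMANTIC_TAGS = ("img", "signature", "watermark", "page_number")


def _tag_pattern(tag):
    # Non-greedy so consecutive tags are captured separately;
    # DOTALL lets a tag's content span several lines.
    return re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)


def extract_tagged_content(markdown_text: str, tags=SEMANTIC_TAGS) -> dict[str, list[str]]:
    """Map each tag name to the stripped contents of every occurrence of that tag."""
    return {tag: [m.strip() for m in _tag_pattern(tag).findall(markdown_text)] for tag in tags}


def remove_tagged(markdown_text: str, tags=("watermark", "page_number")) -> str:
    """Delete the given tags and their contents, e.g. to keep only the body text."""
    for tag in tags:
        markdown_text = _tag_pattern(tag).sub("", markdown_text)
    return markdown_text.strip()
```

Keeping extraction and removal separate lets a pipeline archive the signatures and watermarks while passing a cleaner body to an LLM.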

Form processing

Recognizes and converts checkboxes and radio buttons in forms.

Standardizes form data for later analysis.
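Because checkbox states come back as Unicode symbols (☐, ☑, ☒), form fields can be recovered by scanning for those symbols. This sketch assumes each label follows its box on the same line, as in "☑ Yes ☐ No"; that label heuristic and the parse_checkboxes name are assumptions, and ☒ is reported as "crossed" rather than guessed to mean checked or rejected.

```python
from __future__ import annotations

import re

# Symbols listed on this page; the state names are our own labels.
CHECKBOX_STATES = {"\u2610": "unchecked", "\u2611": "checked", "\u2612": "crossed"}

# A box symbol, optional spaces/tabs, then the label up to the next box or end of line.
_CHECKBOX_RE = re.compile(r"([\u2610-\u2612])[ \t]*([^\u2610-\u2612\n]*)")


def parse_checkboxes(markdown_text: str) -> list[tuple[str, str]]:
    """Return (label, state) pairs in reading order."""
    return [
        (label.strip(), CHECKBOX_STATES[symbol])
        for symbol, label in _CHECKBOX_RE.findall(markdown_text)
    ]
```

For "Married: ☑ Yes ☐ No" this yields [("Yes", "checked"), ("No", "unchecked")]; the field name before the first box is not captured and would need form-specific logic.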

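🚀 Nanonets-OCR-s Image-to-Text Model

Nanonets-OCR-s is a powerful, state-of-the-art image-to-Markdown OCR model that goes well beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes its output ideal for downstream processing by large language models (LLMs).

According to Unsloth, its Dynamic 2.0 quantization achieves higher accuracy than other leading quantization methods.

✨ Key Features

Nanonets-OCR-s offers a set of carefully designed features for handling complex documents: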

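LaTeX equation recognition: Automatically converts mathematical equations and formulas into correctly formatted LaTeX, distinguishing inline math ($...$) from display math ($$...$$).
Intelligent image description: Describes images in the document using structured <img> tags, making them easy for LLMs to process. It can describe many kinds of images, including logos and charts, detailing their content, style, and context.
Signature detection and isolation: Identifies and isolates signatures in the document and outputs them inside <signature> tags, which is essential for processing legal and business documents.
Watermark extraction: Detects and extracts watermark text from the document and places it inside <watermark> tags.
Smart checkbox handling: Converts checkboxes and radio buttons in forms into standardized Unicode symbols (☐, ☑, ☒) for consistent, reliable processing.
Complex table extraction: Accurately extracts complex tables from documents and converts them into Markdown and HTML table formats.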
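Since inline and display math use different delimiters ($...$ versus $$...$$), they can be separated with two regular expressions, matching display math first so that a $$ pair is not read as two inline spans. A minimal sketch: extract_math is a hypothetical name, it will misfire on prose that contains unescaped dollar signs (such as prices), and it ignores \( \) and \[ \] delimiters, which this page does not mention.

```python
from __future__ import annotations

import re

# Display math may span several lines, so DOTALL is enabled for it.
_DISPLAY_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math stays on one line; an escaped \$ is not treated as a delimiter.
_INLINE_RE = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")


def extract_math(markdown_text: str) -> dict[str, list[str]]:
    """Split LaTeX in the model output into inline and display expressions."""
    display = [expr.strip() for expr in _DISPLAY_RE.findall(markdown_text)]
    # Blank out display math so its $$ delimiters are not re-read as inline math.
    remaining = _DISPLAY_RE.sub(" ", markdown_text)
    inline = [expr.strip() for expr in _INLINE_RE.findall(remaining)]
    return {"inline": inline, "display": display}
```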

Read the full announcement | Hugging Face Space demo

🚀 Quick Start

Usage with transformers

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "nanonets/Nanonets-OCR-s"

# flash_attention_2 requires the flash-attn package and a supported GPU;
# use attn_implementation="sdpa" (or omit the argument) if it is unavailable.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.eval()

# The processor handles both the chat template and image preprocessing.
processor = AutoProcessor.from_pretrained(model_path)

PROMPT = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    image = Image.open(image_path).convert("RGB")
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": PROMPT},
        ]},
    ]
    # Render the conversation as a prompt string; the processor inserts the image tokens.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding keeps the transcription deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]


image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)

Usage with vLLM

Start the vLLM server:

vllm serve nanonets/Nanonets-OCR-s
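Loading the model can take a while after vllm serve starts, so scripts may want to wait before sending requests. The sketch below polls the OpenAI-compatible /v1/models endpoint until it answers. It assumes the default address http://localhost:8000 and a server started without --api-key (otherwise the unauthenticated request keeps failing); wait_for_vllm is a hypothetical helper that uses only the standard library.

```python
import http.client
import time
import urllib.request


def wait_for_vllm(base_url="http://localhost:8000/v1", timeout=300.0,
                  poll_interval=2.0, request_timeout=5.0):
    """Return True once GET {base_url}/models succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=request_timeout) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # URLError (connection refused, HTTP errors) and socket timeouts are
            # OSError subclasses: treat them all as "server not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A client script might start with: if not wait_for_vllm(): raise RuntimeError("vLLM server did not come up").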

Run predictions with the model:

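from openai import OpenAI
import base64

# The OpenAI client requires a non-empty key; vLLM only checks it if the
# server was started with --api-key.
client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")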

model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
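vLLM batches concurrent requests on the server side, so a multi-page document is often processed faster by sending its pages in parallel rather than one at a time. This is a sketch using a thread pool; ocr_pages_concurrently is a hypothetical helper, and it assumes the function you pass in can safely be called from several threads at once.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def ocr_pages_concurrently(
    image_paths: Iterable[str],
    ocr_fn: Callable[[str], str],
    max_workers: int = 4,
) -> list[str]:
    """Run ocr_fn on every page in worker threads and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # even if the requests finish in a different order.
        return list(pool.map(ocr_fn, image_paths))
```

With the functions defined above, a call could look like ocr_pages_concurrently(page_paths, lambda p: ocr_page_with_nanonets_s(encode_image(p))), where page_paths is your list of page images. Keeping max_workers modest avoids flooding the server's queue.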

Usage with docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

See GitHub for more details.

📚 Documentation

BibTeX citation

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}