Nanonets-OCR-s-GGUF open-source OCR model: convert document images to Markdown for free, with intelligent recognition and tagging

Nanonets OCR S GGUF

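Developed by Mungert

Nanonets-OCR-s is a powerful image-to-Markdown OCR model that converts documents into structured Markdown with intelligent content recognition and semantic tagging.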

Image-to-text

Transformers

English #structured document conversion #intelligent semantic tagging #LaTeX equation recognition

Downloads: 1,044

Released: 6/14/2025

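Model overview

Nanonets-OCR-s is an advanced OCR model designed to convert documents into structured Markdown. Beyond extracting text, it recognizes and tags complex content such as tables, equations, images, signatures, and watermarks, which makes its output well suited to downstream processing by large language models (LLMs).

Model features

LaTeX equation recognition

Automatically converts mathematical equations into properly formatted LaTeX syntax, distinguishing inline equations from display equations.

Intelligent image description

Describes images in the document using structured <img> tags, making them easy for LLMs to process.

Signature detection and isolation

Identifies signatures and isolates them from other text, outputting them inside <signature> tags. Useful for legal and business documents.

Watermark extraction

Detects watermark text in the document and places it inside <watermark> tags.

Smart checkbox handling

Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.

Complex table extraction

Accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.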

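Model capabilities

Document conversion

Text extraction

Table recognition

Equation recognition

Image description

Signature detection

Watermark extraction

Checkbox handling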

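Use cases

Document processing

PDF to Markdown

Converts PDF documents into structured Markdown, preserving the layout and content of the original.

Produces Markdown documents that are easy to process and edit.

Table extraction

Extracts complex tables from documents and converts them into HTML or Markdown.

Preserves table structure and content for later processing.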
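
Because the recommended prompt asks the model to return tables as HTML, the extracted tables can be read back into plain Python lists for later processing. The following is a minimal sketch using only the standard library; the names `TableParser` and `extract_tables` are hypothetical helpers, not part of the model or any Nanonets tooling, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> in the input as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # table currently being read, or None
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def _close_cell(self):
        # Store the current cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(model_output):
    """Return all HTML tables found in the OCR output as nested lists."""
    parser = TableParser()
    parser.feed(model_output)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, and each row as a list of cell strings, so the result can be passed directly to `csv.writer` or a DataFrame constructor.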

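Academic research

Equation recognition

Recognizes mathematical equations in documents and converts them into LaTeX syntax.

Simplifies editing and typesetting of academic papers.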
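
Since the model marks inline equations with `$...$` and display equations with `$$...$$`, the two kinds can be separated after OCR with a pair of regular expressions. This is a rough sketch, not part of the model's tooling: `split_equations` is a hypothetical helper, and the inline pattern can be fooled by literal dollar signs such as prices on the same line.

```python
import re

# Display math: anything between a pair of $$ delimiters, possibly multi-line.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: a single $ on each side, on one line, not part of a $$ pair.
INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def split_equations(markdown_text):
    """Return the display and inline LaTeX snippets found in the text."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown_text)]
    # Remove display blocks first so their delimiters are not read as inline.
    remainder = DISPLAY_MATH.sub(" ", markdown_text)
    inline = [m.strip() for m in INLINE_MATH.findall(remainder)]
    return {"display": display, "inline": inline}
```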

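Business documents

Signature detection

Identifies and isolates signatures in documents.

Simplifies processing of legal and business documents.

Watermark extraction

Detects and extracts watermark text from documents.

Supports copyright management and verification of documents.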

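🚀 Nanonets-OCR-s GGUF model

The Nanonets-OCR-s GGUF model is a powerful image-to-Markdown OCR model from Nanonets. It goes beyond traditional text extraction: it converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes the output well suited to downstream processing by large language models (LLMs).

🚀 Quick start

Using transformers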

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

model = AutoModelForImageTextToText.from_pretrained(
    model_path, 
    torch_dtype="auto", 
    device_map="auto", 
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
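
The function above handles one page image at a time. For a multi-page document that has already been rendered to one image per page, a small wrapper can run the OCR function over a folder and join the results. This is only a sketch: `ocr_directory`, the suffix list, and the `---` page separator are assumptions made for illustration, and the OCR call is passed in as a plain callable so the wrapper does not depend on how the model was loaded.

```python
from pathlib import Path

# Image types to pick up; adjust to match how the pages were exported.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp"}


def ocr_directory(image_dir, ocr_fn, separator="\n\n---\n\n"):
    """Run ocr_fn on every image in image_dir, in name order, and join the text.

    ocr_fn takes an image path (str) and returns the page's Markdown (str).
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return separator.join(ocr_fn(str(p)) for p in pages)
```

With the objects defined earlier, a call could look like `ocr_directory("/path/to/pages", lambda p: ocr_page_with_nanonets_s(p, model, processor, max_new_tokens=15000))`. Pages are ordered by file name, so zero-padded names such as `page_001.png` keep them in reading order.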

Using vLLM

Start the vLLM server.

vllm serve nanonets/Nanonets-OCR-s

Run predictions with the model.

from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
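
The example above labels every image as `image/png` in the data URL, even though the sample path ends in `.jpg`. Whether the server is strict about this is not stated in the source; the sketch below simply derives the MIME type from the file extension so that the label matches the file. `image_to_data_url` is a hypothetical helper, not part of vLLM or the OpenAI client.

```python
import base64
import mimetypes


def image_to_data_url(image_path):
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `url` value of the `image_url` entry in the request.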

Using docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

See GitHub for more details.

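✨ Key features

Nanonets-OCR-s, developed by Nanonets, is a powerful, advanced image-to-Markdown OCR model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes the output well suited to downstream processing by large language models (LLMs).

Nanonets-OCR-s includes a set of features designed to handle complex documents with ease:

LaTeX equation recognition: Automatically converts mathematical equations into properly formatted LaTeX syntax. It distinguishes inline equations ($...$) from display equations ($$...$$).
Intelligent image description: Describes images in the document using structured <img> tags, making them easy for LLMs to process. It can describe many image types, including logos, charts, and graphs, detailing their content, style, and context.
Signature detection and isolation: Identifies signatures and isolates them from other text, outputting them inside <signature> tags. This is essential for processing legal and business documents.
Watermark extraction: Detects watermark text in the document and places it inside <watermark> tags.
Smart checkbox handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
Complex table extraction: Accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.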
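
The semantic tags and checkbox symbols listed above make the output straightforward to post-process. The sketch below pulls out tagged spans and checkbox lines with regular expressions. It is an illustration only: the helper names are hypothetical, the state names in `CHECKBOX_STATES` (in particular "crossed" for ☒) are an assumed reading of the symbols rather than something the model card defines, and `<page_number>` is included because the recommended prompt requests it.

```python
import re

# Tags that the model card and the recommended prompt mention.
TAG_PATTERN = re.compile(
    r"<(watermark|signature|page_number|img)>(.*?)</\1>", re.DOTALL
)
# Assumed meaning of each checkbox symbol (not defined by the source).
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def extract_tagged_content(text):
    """Group the text inside each known tag by tag name."""
    found = {"watermark": [], "signature": [], "page_number": [], "img": []}
    for tag, body in TAG_PATTERN.findall(text):
        found[tag].append(body.strip())
    return found


def extract_checkboxes(text):
    """Return (state, label) for every line that starts with a checkbox symbol."""
    items = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped[0] in CHECKBOX_STATES:
            items.append((CHECKBOX_STATES[stripped[0]], stripped[1:].strip()))
    return items
```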

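📢 Read the full announcement | 🤗 Hugging Face Space demo

📚 Detailed documentation

Model generation details

This model was generated using llama.cpp at commit bf9087f5.

Quantization beyond the IMatrix

I've been experimenting with a new quantization approach that selectively raises the precision of key layers beyond what the default IMatrix configuration provides.

In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I use the --tensor-type option in llama.cpp to manually bump important layers to higher precision. You can see the implementation here:
👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level.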
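
The size cost of raising the precision of selected layers can be estimated as a parameter-weighted average of bits per weight. The sketch below is a back-of-the-envelope illustration only: `average_bits_per_weight` is a hypothetical helper, the numbers in the usage note are made up, and the calculation ignores file metadata and any per-block overhead of real quantization formats.

```python
def average_bits_per_weight(groups):
    """Parameter-weighted mean bit width.

    groups is a sequence of (parameter_count, bits_per_weight) pairs,
    one pair per group of tensors quantized at the same precision.
    """
    total_params = sum(count for count, _ in groups)
    if total_params == 0:
        raise ValueError("no parameters given")
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / total_params
```

As a made-up example, keeping 90% of the weights at 4 bits and raising the remaining 10% to 8 bits gives `average_bits_per_weight([(900, 4.0), (100, 8.0)])`, which is 4.4 bits per weight, or about 10% more than a uniform 4-bit file.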

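Choosing the right GGUF model format

Click here for information on choosing the right GGUF model format

Testing the AI-powered Quantum Network Monitor Assistant

If you find these models useful, help me test my AI-powered Quantum Network Monitor Assistant with quantum-ready security checks:
👉 Quantum Network Monitor

The full open-source code for the Quantum Network Monitor Service is available in my GitHub repositories (those with NetworkMonitor in the name): Quantum Network Monitor source code. If you want to quantize models yourself, you can also find the code I use there: GGUFModelBuilder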

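How to test

Choose an AI assistant type:

TurboLLM (GPT-4.1-mini)
HugLLM (Hugging Face open-source models)
TestLLM (experimental CPU-only model)

What I'm testing

I'm pushing the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services
How small a model can go while still handling:
- Automated Nmap security scans
- Quantum-readiness checks
- Network monitoring tasks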

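TestLLM experimental model

The current experimental model (llama.cpp running on 2 CPU threads in a Hugging Face Docker Space):

✅ Zero-configuration setup
⏳ 30-second load time (inference is slow, but there are no API costs). Because costs are low, there is no token limit.
🔧 Help wanted! If you're interested in edge-device AI, let's collaborate!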

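Other assistants

🟢 TurboLLM – uses gpt-4.1-mini:
- Performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom command processors that run .NET code on the Quantum Network Monitor Agent
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)
🔵 HugLLM – latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs quite well using the latest models hosted on Novita.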

Example test commands

"Give me info on my website's SSL certificate"
"Check if my server is using quantum safe encryption for communication"
"Run a comprehensive security audit on my server"
"Create a cmd processor to .. (what ever you want)" Note that you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use it with caution!

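Final notes

I pay out of my own pocket for the servers used to create these model files, for running the Quantum Network Monitor service, and for inference from Novita and OpenAI. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.

If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and lets me raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊

📄 License

The documentation does not mention any license information.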

📚 BibTeX citation

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}