Nanonets-OCR-s-GGUF open-source OCR model - free conversion of document images to Markdown with intelligent recognition and tagging

Nanonets OCR S GGUF

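Developed by Mungert

Nanonets-OCR-s is a powerful image-to-Markdown OCR model that converts documents into structured Markdown with intelligent content recognition and semantic tagging.

Image-to-Text

Transformers

English #Structured document conversion #Intelligent semantic tagging #LaTeX equation recognition

Downloads: 1,044

Released: 6/14/2025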

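Model overview

Nanonets-OCR-s is an OCR model designed to convert documents into structured Markdown. Beyond extracting text, it recognizes and tags complex content such as tables, equations, images, signatures, and watermarks, which makes its output well suited to downstream processing by large language models (LLMs).

Model features

LaTeX equation recognition

Automatically converts mathematical equations into properly formatted LaTeX syntax, distinguishing inline equations from display equations.

Intelligent image description

Describes images in the document using structured <img> tags so that LLMs can process them easily.

Signature detection and isolation

Identifies signatures and isolates them from other text, outputting them inside <signature> tags. Useful for legal and business documents.

Watermark extraction

Detects watermark text in the document and places it inside <watermark> tags.

Smart checkbox handling

Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.

Complex table extraction

Accurately extracts complex tables from documents and converts them into Markdown and HTML table formats.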

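Model capabilities

Document conversion

Text extraction

Table recognition

Equation recognition

Image description

Signature detection

Watermark extraction

Checkbox handling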

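Use cases

Document processing

PDF to Markdown

Converts PDF documents into structured Markdown, preserving the layout and content of the original document.

Produces Markdown documents that are easy to process and edit.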
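
A minimal sketch of how per-page output might be assembled into a single Markdown document. It assumes the PDF has already been rendered to one image per page by a separate tool. `pages_to_markdown`, the `ocr_page` callable, and the page separator are hypothetical names and choices for illustration; in practice `ocr_page` would wrap an OCR call such as the `ocr_page_with_nanonets_s` function from the quick start.

```python
from pathlib import Path
from typing import Callable, Iterable


def pages_to_markdown(
    page_images: Iterable[Path],
    ocr_page: Callable[[Path], str],
    separator: str = "\n\n---\n\n",
) -> str:
    """Run OCR on each page image in order and join the results."""
    page_texts = []
    for image_path in page_images:
        # Strip surrounding whitespace so the separator spacing stays uniform.
        page_texts.append(ocr_page(image_path).strip())
    return separator.join(page_texts)


# Stand-in OCR function so the sketch runs without loading a model.
def fake_ocr(image_path: Path) -> str:
    return f"Text of {image_path.name}\n"


# Hypothetical page image names; replace with the real rendered pages.
pages = [Path("page_001.png"), Path("page_002.png")]
print(pages_to_markdown(pages, fake_ocr))
```

Passing the OCR function in as a parameter keeps the assembly logic independent of whether the pages are processed with transformers or through a vLLM server.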

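Table extraction

Extracts complex tables from documents and converts them to HTML or Markdown.

Preserves each table's structure and content for downstream processing.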
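
When tables are returned as HTML, a downstream step usually needs them as rows and cells. The sketch below uses only Python's standard `html.parser` module; `TableExtractor` and `extract_tables` are hypothetical helper names, the sample text is made up, and nested tables and `rowspan`/`colspan` attributes are deliberately not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # rows of the table currently being read
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None  # discard a cell left open by malformed markup
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None
            self._row = None


def extract_tables(ocr_output: str):
    """Return all tables found in the OCR output; text outside tables is ignored."""
    parser = TableExtractor()
    parser.feed(ocr_output)
    parser.close()
    return parser.tables


sample = (
    "Invoice summary\n"
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>2</td></tr></table>"
)
print(extract_tables(sample))
```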

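Academic research

Equation recognition

Recognizes mathematical equations in documents and converts them to LaTeX syntax.

Simplifies the editing and typesetting of academic papers.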
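
As an illustration of working with the recognized equations, the sketch below pulls them out of the Markdown text. It assumes inline equations are delimited by `$...$` and display equations by `$$...$$`. `extract_equations` is a hypothetical helper, and the regular expressions are a simplification: escaped dollar signs and currency amounts such as "$5 and $10" would be misread.

```python
import re

# Display math: $$ ... $$, possibly spanning several lines.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: a single $ on each side, on one line, not part of a $$ pair.
INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def extract_equations(markdown: str) -> dict:
    """Split the LaTeX found in OCR output into display and inline equations."""
    display = [m.strip() for m in DISPLAY_MATH.findall(markdown)]
    # Remove display blocks first so their contents are not matched again as inline math.
    remainder = DISPLAY_MATH.sub(" ", markdown)
    inline = [m.strip() for m in INLINE_MATH.findall(remainder)]
    return {"display": display, "inline": inline}


# Made-up sample text for illustration.
sample = (
    r"Energy is $E = mc^2$ and"
    + "\n"
    + r"$$\int_0^1 x^2 \, dx = \frac{1}{3}$$"
    + "\nholds."
)
equations = extract_equations(sample)
print(equations["inline"])
print(equations["display"])
```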

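Business documents

Signature detection

Identifies and isolates the signature portions of a document.

Simplifies the processing of legal and business documents.

Watermark extraction

Detects and extracts watermark text from documents.

Supports copyright management and verification of documents.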
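
Because signatures, watermarks, page numbers, and image descriptions are wrapped in tags, they can be separated from the body text with a few lines of code. The following is a sketch under stated assumptions: the tag names are the ones mentioned on this page, `extract_tagged_content` is a hypothetical helper, and the sample output is made up rather than produced by the model.

```python
import re

# Tag names mentioned in the feature list and the quick-start prompt.
TAGS = ("signature", "watermark", "page_number", "img")


def extract_tagged_content(ocr_output: str) -> dict:
    """Return the text found inside each semantic tag, keyed by tag name."""
    found = {}
    for tag in TAGS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [match.strip() for match in pattern.findall(ocr_output)]
    return found


# Made-up example of tagged output, for illustration only.
sample = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "Agreed and accepted.\n"
    "<signature>Jane Doe</signature>\n"
    "<page_number>9/22</page_number>"
)
tagged = extract_tagged_content(sample)
print(tagged["watermark"])
print(tagged["signature"])
```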

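🚀 Nanonets-OCR-s GGUF model

The Nanonets-OCR-s GGUF model is a powerful image-to-Markdown OCR model released by Nanonets. It goes beyond traditional text extraction: it converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes its output well suited to downstream processing by large language models (LLMs).

🚀 Quick start

Using transformers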

from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

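# flash_attention_2 requires the optional flash-attn package and a supported GPU;
# remove the attn_implementation argument to fall back to the default attention.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)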
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)
    
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)

Using vLLM

Start the vLLM server.

vllm serve nanonets/Nanonets-OCR-s

Run predictions with the model

from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")

model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
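
The snippet above labels every image as `image/png` in the data URL, while the example path ends in `.jpg`. Whether the server relies on that label is not stated here, so the following is only a sketch of one way to keep the two consistent by deriving the MIME type from the file name. `build_data_url` and `image_to_data_url` are hypothetical helpers, not part of the OpenAI client or vLLM.

```python
import base64
import mimetypes


def build_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode image bytes as a data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_to_data_url(image_path: str) -> str:
    """Read an image file from disk and return it as a data URL."""
    with open(image_path, "rb") as image_file:
        return build_data_url(image_file.read(), image_path)


# Tiny stand-in payload; real use would pass the bytes of an actual image.
print(build_data_url(b"abc", "scan.jpg"))
```

The returned string can be used as the `url` value of the `image_url` entry in the request.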

Using docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

See GitHub for more details.

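✨ Key features

Nanonets-OCR-s, developed by Nanonets, is a powerful, state-of-the-art image-to-Markdown OCR model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes its output well suited to downstream processing by large language models (LLMs).

Nanonets-OCR-s includes a range of features designed to handle complex documents with ease:

LaTeX equation recognition: automatically converts mathematical equations into properly formatted LaTeX syntax. It distinguishes inline equations ($...$) from display equations ($$...$$).
Intelligent image description: describes images in the document using structured <img> tags so that LLMs can process them easily. It can describe many kinds of images, including logos, charts, and graphs, detailing their content, style, and context.
Signature detection and isolation: identifies signatures and isolates them from other text, outputting them inside <signature> tags. This is crucial for processing legal and business documents.
Watermark extraction: detects watermark text in the document and places it inside <watermark> tags.
Smart checkbox handling: converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
Complex table extraction: accurately extracts complex tables from documents and converts them into Markdown and HTML table formats.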
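
To illustrate why standardized checkbox symbols help downstream processing, here is a small sketch that turns them into structured form fields. It assumes each checkbox appears at the start of a line followed by its label, which is a layout assumption rather than documented model behavior. `parse_checkboxes` and the state names in `CHECKBOX_STATES` are hypothetical choices.

```python
# Mapping from symbol to a state name; the state names are an illustrative choice.
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def parse_checkboxes(ocr_output: str) -> list:
    """Return (label, state) pairs for lines that start with a checkbox symbol."""
    fields = []
    for line in ocr_output.splitlines():
        stripped = line.strip()
        if stripped and stripped[0] in CHECKBOX_STATES:
            label = stripped[1:].strip()
            fields.append((label, CHECKBOX_STATES[stripped[0]]))
    return fields


# Made-up form text for illustration.
sample = "☑ I agree to the terms\n☐ Subscribe to newsletter\n☒ Not applicable"
for label, state in parse_checkboxes(sample):
    print(f"{label}: {state}")
```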

📢 Read the full announcement | 🤗 Hugging Face Space demo

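📚 Documentation

Model generation details

This model was generated using llama.cpp at commit bf9087f5.

Quantization beyond IMatrix

I've been experimenting with a new quantization approach that selectively raises the precision of key layers beyond what the default IMatrix configuration provides.

In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I use the --tensor-type option in llama.cpp to manually bump important layers to higher precision. You can see the implementation here:
👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level.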
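
The idea of per-tensor overrides can be sketched as follows. This is not the author's script: it only assembles a command line, assuming the `llama-quantize` tool from llama.cpp accepts `--tensor-type NAME=TYPE` followed by the input file, output file, and quantization type (check `llama-quantize --help` for your build). The file names, tensor names, and chosen types are hypothetical, and `build_quantize_command` is an illustrative helper.

```python
import shlex


def build_quantize_command(src_gguf, dst_gguf, quant_type,
                           tensor_overrides=None, binary="llama-quantize"):
    """Assemble (but do not run) a quantization command with per-tensor overrides."""
    cmd = [binary]
    for tensor_name, tensor_type in (tensor_overrides or {}).items():
        # Each override asks for one group of tensors to be kept at a higher precision.
        cmd += ["--tensor-type", f"{tensor_name}={tensor_type}"]
    cmd += [src_gguf, dst_gguf, quant_type]
    return cmd


# Hypothetical file names and overrides, chosen only to show the shape of the command.
cmd = build_quantize_command(
    "model-f16.gguf",
    "model-q4_k_m.gguf",
    "Q4_K_M",
    tensor_overrides={"attn_v": "q8_0", "ffn_down": "q6_k"},
)
print(shlex.join(cmd))
```

Building the argument list separately makes it easy to review the overrides before passing the list to `subprocess.run`.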

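Choosing the right GGUF model format

Click here for information on choosing the right GGUF model format

Testing the AI-powered Quantum Network Monitor assistant

If you find these models useful, help me test my AI-powered Quantum Network Monitor assistant with quantum-ready security checks:
👉 Quantum Network Monitor

The full open-source code for the Quantum Network Monitor service is available in my GitHub repositories (those with NetworkMonitor in the name): Quantum Network Monitor source code. If you want to quantize models yourself, the code I use is also available in GGUFModelBuilder.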

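How to test

Choose an AI assistant type:

TurboLLM (GPT-4.1-mini)
HugLLM (Hugging Face open-source models)
TestLLM (experimental, CPU-only)

What I'm testing

I'm pushing the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services
How small a model can go while still handling:
- Automated Nmap security scans
- Quantum-readiness checks
- Network monitoring tasks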

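TestLLM, the experimental model

The current experimental model (llama.cpp running on 2 CPU threads in a Hugging Face Docker Space):

✅ Zero-configuration setup
⏳ 30-second load time (inference is slow, but there are no API costs). Because costs are low, there is no token limit.
🔧 Help wanted! If you're interested in edge-device AI, let's collaborate!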

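Other assistants

🟢 TurboLLM - uses gpt-4.1-mini:
- Performs very well, but OpenAI charges per token, so token usage is limited.
- Create custom command processors to run .NET code on Quantum Network Monitor agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)
🔵 HugLLM - latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs quite well using the latest models hosted on Novita.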

Example test commands

"Give me info on my website's SSL certificate"
"Check if my server is using quantum-safe encryption for communication"
"Run a comprehensive security audit on my server"
"Create a cmd processor to .. (whatever you want)" Note that you need to install a Quantum Network Monitor agent to run the .NET code. This is a very flexible and powerful feature; use it with caution!

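Final notes

I pay out of my own pocket for the servers used to create these model files and run the Quantum Network Monitor service, and for inference from Novita and OpenAI. All the code behind model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.

If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and lets me raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊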

📄 License

The documentation does not mention any license information.

📚 BibTeX citation

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}