Nanonets-OCR-s-GGUF Open-Source OCR Model - Smartly Convert Documents to Markdown for Free


Nanonets OCR S GGUF

Published by unsloth

Nanonets-OCR-s is an advanced optical character recognition (OCR) model that converts document images into structured Markdown, with intelligent content recognition and semantic tagging.

Image-Text-to-Text

Transformers

English #Structured Markdown conversion #LaTeX equation recognition #Smart document parsing

Downloads: 2,300

Release date: 6/16/2025

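Model Overview

Nanonets-OCR-s is a powerful OCR model designed to convert documents into structured Markdown. Beyond extracting text, it recognizes and tags complex content such as LaTeX equations, images, signatures, and watermarks, which makes its output well suited to downstream processing by large language models (LLMs).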

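Model Features

LaTeX equation recognition

Automatically converts mathematical equations and formulas into correctly formatted LaTeX syntax, distinguishing inline equations from display equations.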
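Because inline math is delimited by `$...$` and display math by `$$...$$`, downstream code can separate the two with regular expressions. The sketch below is a minimal illustration, not part of the model's tooling; the function name `split_math` is hypothetical, and it assumes the output contains no literal dollar signs outside math (currency amounts such as "$5 and $10" would be misread as inline math).

```python
import re

# Display math: $$...$$, possibly spanning several lines.
DISPLAY_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math: a single $...$ on one line, not adjacent to another $.
INLINE_RE = re.compile(r"(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)")


def split_math(markdown: str) -> dict:
    """Return the display and inline LaTeX snippets found in OCR output (hypothetical helper)."""
    display = [m.strip() for m in DISPLAY_RE.findall(markdown)]
    # Remove display blocks first so their $$ delimiters are not re-read as inline math.
    without_display = DISPLAY_RE.sub(" ", markdown)
    inline = [m.strip() for m in INLINE_RE.findall(without_display)]
    return {"display": display, "inline": inline}
```

For example, `split_math("Energy $E=mc^2$ ...")` would list `E=mc^2` under `"inline"`, while any `$$...$$` block lands under `"display"`.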

Intelligent image description

Describes images in the document using structured <img> tags, making them easier for LLMs to process.
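If a pipeline needs plain Markdown without HTML-like tags, the `<img>...</img>` descriptions can be collected and swapped for a text placeholder. This is a hedged sketch: the helper name `replace_image_tags` and the `[Image: ...]` placeholder format are illustrative choices, and it assumes the tags are not nested.

```python
import re

# Matches one <img>...</img> tag; DOTALL lets a description span several lines.
IMG_RE = re.compile(r"<img>(.*?)</img>", re.DOTALL)


def replace_image_tags(markdown: str, template: str = "[Image: {}]") -> tuple[str, list[str]]:
    """Swap each <img>...</img> tag for a plain-text placeholder (hypothetical helper).

    Returns the rewritten text and the image descriptions in document order.
    """
    descriptions = [d.strip() for d in IMG_RE.findall(markdown)]
    rewritten = IMG_RE.sub(lambda m: template.format(m.group(1).strip()), markdown)
    return rewritten, descriptions
```

Keeping the descriptions as a separate list lets them be indexed or captioned independently of the page text.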

Signature detection and isolation

Recognizes signatures in the document, isolates them, and outputs them inside <signature> tags.
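Since signatures arrive wrapped in `<signature>` tags, a consumer can split them away from the body text, for example to check that a contract page is signed. A minimal sketch, assuming non-nested tags; the name `separate_signatures` is hypothetical, and what the model actually writes inside the tag is not specified here.

```python
import re

SIGNATURE_RE = re.compile(r"<signature>(.*?)</signature>", re.DOTALL)


def separate_signatures(markdown: str) -> tuple[str, list[str]]:
    """Split OCR output into body text (signature tags removed) and signature contents."""
    signatures = [s.strip() for s in SIGNATURE_RE.findall(markdown)]
    body = SIGNATURE_RE.sub("", markdown)
    # Collapse the extra blank lines left where signature blocks were removed.
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    return body, signatures
```

An empty signature list is a cheap signal that a page may still need to be signed.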

Watermark extraction

Detects and extracts watermark text from the document and places it inside <watermark> tags.
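The usage prompt also asks for page numbers in `<page_number>` tags, so watermarks and page numbers can be pulled out together as page metadata. A minimal sketch under the assumption that these tags are well-formed and non-nested; `extract_page_metadata` is a hypothetical name.

```python
import re


def extract_page_metadata(markdown: str) -> dict:
    """Collect <watermark> and <page_number> contents and return them with the cleaned text."""
    meta = {}
    for tag in ("watermark", "page_number"):
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        meta[tag] = [v.strip() for v in pattern.findall(markdown)]
        # Remove the tag so it does not pollute the body text.
        markdown = pattern.sub("", markdown)
    meta["text"] = markdown.strip()
    return meta
```

Stripping watermarks such as "OFFICIAL COPY" before chunking keeps them from being repeated in every text chunk.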

Smart checkbox handling

Converts checkboxes and radio buttons in forms into standardized Unicode symbols.
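Because checkbox states become fixed Unicode symbols (☐, ☑, ☒), form fields can be read back with a simple pattern. This is a sketch only: it assumes each symbol is followed by its label on the same line, and the state names and the `parse_checkboxes` helper are hypothetical.

```python
import re

# Symbol-to-state mapping; the three symbols are those listed in the model card.
CHECKBOX_STATES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}
# A box symbol, optional spaces or tabs, then the label up to the next box symbol or end of line.
BOX_RE = re.compile(r"([☐☑☒])[ \t]*([^☐☑☒\n]*)")


def parse_checkboxes(markdown: str) -> list[tuple[str, str]]:
    """Return (label, state) pairs for each checkbox symbol in the OCR output."""
    return [(label.strip(), CHECKBOX_STATES[symbol]) for symbol, label in BOX_RE.findall(markdown)]
```

For a line such as `Payment: ☑ Cash ☐ Card`, this yields `("Cash", "checked")` and `("Card", "unchecked")`.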

Complex table extraction

Accurately extracts complex tables from documents and converts them into Markdown and HTML table formats.
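When tables come back as HTML, Python's standard `html.parser` module is enough to turn them into rows of cell text. The sketch below is illustrative (`TableParser` and `parse_html_tables` are hypothetical names): it ignores `colspan`/`rowspan` and nested tables, which truly complex tables may use.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <table> in an HTML fragment as lists of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row is a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self.tables:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; text elsewhere on the page is ignored.
        if self._cell is not None:
            self._cell.append(data)


def parse_html_tables(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

The resulting row lists can be passed directly to `csv.writer` or a DataFrame constructor.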

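Model Capabilities

Document OCR

LaTeX equation recognition

Image content description

Signature detection

Watermark extraction

Table extraction

Checkbox handling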

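Use Cases

Document processing

Academic paper processing

Converts academic papers containing mathematical formulas and tables into structured Markdown.

Preserves the structure and semantics of the original document, simplifying subsequent analysis and processing.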

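Business contract processing

Extracts text, signatures, and watermark information from contracts.

Automates the processing of legal documents, improving efficiency.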

Form processing

Recognizes and converts checkboxes and radio buttons in forms.

Standardizes form data to simplify subsequent analysis.

🚀 Nanonets-OCR-s

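Nanonets-OCR-s is a powerful, state-of-the-art image-to-Markdown OCR model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, making its output ideal for downstream processing by large language models (LLMs).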

According to Unsloth, its Dynamic 2.0 quantization achieves higher accuracy than other leading quantization methods.

Attribute	Details
Model type	image-text-to-text
Base model	nanonets/Nanonets-OCR-s
Tags	OCR, unsloth, pdf2markdown
Library name	transformers

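✨ Key Features

Nanonets-OCR-s is packed with features designed to handle complex documents with ease.

LaTeX Equation Recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. It distinguishes between inline ($...$) and display ($$...$$) equations.
Intelligent Image Description: Describes images within documents using structured <img> tags, making them suitable for LLM processing. It can describe various image types, including logos, charts, and graphs, detailing their content, style, and context.
Signature Detection and Isolation: Identifies and isolates signatures from other text, outputting them within a <signature> tag. This is crucial for processing legal and business documents.
Watermark Extraction: Detects and extracts watermark text from documents, placing it within a <watermark> tag.
Smart Checkbox Handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
Complex Table Extraction: Accurately extracts complex tables from documents and converts them into both Markdown and HTML table formats.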

Read the full announcement | Hugging Face Space demo

💻 Usage Examples

Basic Usage

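from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" spreads weights
# across the available GPUs/CPU. attn_implementation="flash_attention_2" requires the
# flash-attn package; drop the argument or use "sdpa" if it is not installed.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)  # not used below; the processor wraps its own tokenizer
processor = AutoProcessor.from_pretrained(model_path)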


def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    # Prompt from the model card: HTML tables, LaTeX equations, and tagged images, watermarks, and page numbers.
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Render the chat template as text; the image pixels are passed to the processor separately.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding (do_sample=False) for deterministic output.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated tokens are decoded.
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)

Advanced Usage

Using vLLM

Start the vLLM server:

vllm serve nanonets/Nanonets-OCR-s

Run predictions with the model:

from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")  # placeholder key; vLLM only checks it when started with --api-key

model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
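The client above always labels the payload `data:image/png`, even though the example file is a `.jpg`. If you want the label to match the file, a small helper can derive the MIME type from the extension with the standard `mimetypes` module. This is a hedged sketch; `to_data_url` is a hypothetical name, not part of vLLM or the OpenAI client.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {image_path!r}")
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

To use it, pass `{"url": to_data_url(test_img_path)}` as the `image_url` value instead of the hard-coded PNG f-string.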

Using docext

pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s

See GitHub for details.

📚 Documentation

BibTeX

@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}