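🚀 Nanonets-OCR-s: Image-to-Text Model
Nanonets-OCR-s is a powerful, state-of-the-art image-to-Markdown optical character recognition (OCR) model that goes far beyond traditional text extraction. It converts documents into structured Markdown with intelligent content recognition and semantic tagging, which makes the output well suited to downstream processing by large language models (LLMs).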
✨ Key Features
Nanonets-OCR-s includes a set of features designed to handle complex documents:
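- LaTeX equation recognition: automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing inline ($...$) from display ($$...$$) equations.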
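- Intelligent image description: describes images within the document using structured <img> tags so that LLMs can process them. It covers many image types, including logos, charts, and graphs, and details their content, style, and context.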
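- Signature detection and isolation: identifies signatures in the document, separates them from other text, and outputs them inside a <signature> tag. This is essential for processing legal and business documents.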
- Watermark extraction: detects watermark text in the document and places it inside a <watermark> tag.
- Smart checkbox handling: converts form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒) for consistent and reliable processing.
- Complex table extraction: accurately extracts complex tables from documents and converts them into Markdown and HTML table formats.
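Because the semantic tags listed above are plain text in the output, they can be pulled out with ordinary string tools. The following is a minimal sketch, not part of any Nanonets API: the helper names `extract_tags` and `count_checkboxes`, the checkbox labels, and the hand-written `sample` string are all assumptions made for illustration, and real model output may contain nesting or formatting that a simple regex does not handle.

```python
import re

# Tag names come from the feature list and the model prompt.
# Everything else here (helper names, labels, sample text) is hypothetical.
TAG_NAMES = ("img", "signature", "watermark", "page_number")
CHECKBOX_SYMBOLS = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def extract_tags(markdown_text):
    """Return a dict mapping each tag name to the list of its inner texts."""
    found = {}
    for name in TAG_NAMES:
        # Non-greedy match so adjacent tags of the same kind stay separate.
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
        found[name] = [inner.strip() for inner in pattern.findall(markdown_text)]
    return found


def count_checkboxes(markdown_text):
    """Count how often each checkbox symbol appears in the text."""
    return {
        label: markdown_text.count(symbol)
        for symbol, label in CHECKBOX_SYMBOLS.items()
    }


# Hand-written sample, not actual model output.
sample = (
    "<watermark>OFFICIAL COPY</watermark>\n"
    "☑ I agree to the terms\n"
    "☐ Subscribe to updates\n"
    "<signature>J. Doe</signature>\n"
    "<page_number>9/22</page_number>\n"
)

tags = extract_tags(sample)
boxes = count_checkboxes(sample)
print(tags["watermark"], tags["page_number"], boxes)
```

Keeping tag extraction separate from the OCR call means the same post-processing can be reused whichever inference backend produced the text.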
Read the full announcement | Hugging Face Space demo
🚀 Quick Start
Using the transformers library
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR-s"

# Load the model. attn_implementation="flash_attention_2" requires the flash-attn package.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)
def ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=4096):
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    image = Image.open(image_path)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Render the chat template to a prompt string, then tokenize it together with the image.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding (do_sample=False) keeps the OCR output deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so that only the newly generated tokens are decoded.
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]

image_path = "/path/to/your/document.jpg"
result = ocr_page_with_nanonets_s(image_path, model, processor, max_new_tokens=15000)
print(result)
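The loading code above hard-codes `attn_implementation="flash_attention_2"`, which fails when the flash-attn package is not installed. One possible workaround, sketched here as an assumption rather than the authors' recommendation, is to check for the package and fall back to `"sdpa"` (PyTorch scaled dot-product attention). The helper name `pick_attn_implementation` is hypothetical, and an importable package does not by itself guarantee that the GPU supports FlashAttention.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if the flash_attn package is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
# Pass the result on: from_pretrained(..., attn_implementation=attn_impl)
```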
Using vLLM
- Start the vLLM server.
vllm serve nanonets/Nanonets-OCR-s
- Query the model for predictions.
from openai import OpenAI
import base64
client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")
model = "nanonets/Nanonets-OCR-s"

def encode_image(image_path):
    # Read the image file and return its contents as a base64 string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def ocr_page_with_nanonets_s(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # The data URL declares image/png; adjust the MIME type if your file uses another format.
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes.",
                    },
                ],
            }
        ],
        temperature=0.0,
        max_tokens=15000,
    )
    return response.choices[0].message.content

test_img_path = "/path/to/your/document.jpg"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_nanonets_s(img_base64))
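The vLLM example above labels every image as `image/png` in the data URL, even though the sample path ends in `.jpg`. A minimal sketch of one way to keep the label consistent with the file is shown below. The helper names `guess_image_mime` and `build_data_url` are hypothetical, and the fallback to `image/png` is an assumption chosen to match the original snippet.

```python
import base64
import mimetypes


def guess_image_mime(image_path, default="image/png"):
    """Guess an image MIME type from the file extension, falling back to a default."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        return default
    return mime


def build_data_url(image_bytes, mime):
    """Encode raw image bytes as a base64 data URL with the given MIME type."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


print(guess_image_mime("/path/to/your/document.jpg"))
```

To use it, read the file in binary mode, call `build_data_url(image_bytes, guess_image_mime(path))`, and pass the result as the `url` field of the `image_url` entry in the request.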
Using docext
pip install docext
python -m docext.app.app --model_name hosted_vllm/nanonets/Nanonets-OCR-s
See GitHub for more details.
📚 Documentation
BibTeX Citation
@misc{Nanonets-OCR-S,
  title={Nanonets-OCR-S: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging},
  author={Souvik Mandal and Ashish Talewar and Paras Ahuja and Prathamesh Juvatkar},
  year={2025},
}