Open-Source Chonky Model: Free to Deploy, Splits Text into Semantic Chunks for RAG Systems

Chonky Modernbert Large 1

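Developed by mirth

Chonky is a Transformer model that intelligently splits text into meaningful semantic chunks for use in RAG systems.

Token classification

Transformers

English | License: MIT | #SemanticChunking #RAGOptimization #LongTextProcessing

Downloads: 54

Released: April 26, 2025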

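Model overview

The model processes text and divides it into semantically coherent segments. These chunks can be fed into an embedding-based retrieval system or a language model as part of a RAG pipeline.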

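Model features

Intelligent semantic chunking

Splits text into meaningful semantic chunks while keeping the content coherent.

RAG system optimization

Designed for retrieval-augmented generation (RAG) systems, with the aim of producing better-quality chunks.

Long-sequence support

Fine-tuned on sequences of 1024 tokens (the base model supports sequences of up to 8192 tokens).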

Model capabilities

Semantic text chunking

Paragraph splitting

RAG system preprocessing

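Use cases

Information retrieval

RAG system preprocessing

Prepares semantically coherent text chunks for retrieval-augmented generation systems

Improves the accuracy and relevance of the retrieval system

Text processing

Document splitting

Splits long documents into meaningful paragraphs

Makes downstream processing and analysis easier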

🚀 Chonky ModernBERT Large v1

Chonky is a Transformer model that intelligently splits text into meaningful semantic chunks. It can be used in retrieval-augmented generation (RAG) systems.

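🚀 Quick start

Chonky processes text and splits it into semantically coherent segments. These segments can then be fed into an embedding-based retrieval system or a language model as part of a RAG pipeline.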
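
To make the retrieval step concrete, here is a minimal sketch of how chunks might be ranked against a query. It is an illustration only: the word-count "embedding" is a toy stand-in for a real dense embedding model, and `bag_of_words`, `cosine`, and `retrieve` are hypothetical helper names that are not part of chonky or Transformers.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stand-in for a dense model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


# Chunks as a semantic splitter might produce them (shortened from the example text).
chunks = [
    "My stories were awful. They had hardly any plot.",
    "The first programs I tried writing were on the IBM 1401.",
]
best = retrieve("Which computer ran the first programs?", chunks)
print(best[0])
```

In a real pipeline the chunks would come from the splitter and the similarity from an embedding model; only the overall shape (split, embed, rank) carries over.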

⚠️ Important note

The model was fine-tuned on sequences of 1024 tokens (ModernBERT itself supports sequences of up to 8192 tokens by default).
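
One common way to stay within a fine-tuning length like this is to run the model over overlapping token windows. The sketch below only computes the window boundaries. The 1024 limit comes from the note above; the overlap of 128 tokens is an arbitrary assumption, `window_spans` is a hypothetical helper, and nothing here describes how the chonky library itself handles long inputs.

```python
def window_spans(n_tokens: int, max_len: int = 1024, overlap: int = 128) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs that cover range(n_tokens).

    Each window holds at most max_len tokens, and consecutive windows
    share `overlap` tokens so a boundary is never seen without context.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        # Step back by `overlap` tokens before opening the next window.
        start = end - overlap
    return spans


print(window_spans(2500))  # [(0, 1024), (896, 1920), (1792, 2500)]
```

Predictions in the overlapping regions would still need to be reconciled (for example by keeping the higher score), which this sketch leaves out.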

✨ Key features

Intelligently splits text into meaningful semantic chunks.
Can be used in retrieval-augmented generation (RAG) systems.

📦 Installation

You can use the model through chonky, a small Python library written by the model's author.

💻 Usage examples

Basic usage

from chonky import ParagraphSplitter

# On the first run, this downloads the Transformer model
splitter = ParagraphSplitter(
  model_id="mirth/chonky_modernbert_large_1",
  device="cpu"
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
  print(chunk)
  print("--")

Basic usage example output

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
 My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
 This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
 It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
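
The chunks above keep their leading whitespace and vary in length. If a downstream retriever works better with chunks above some minimum size, a post-processing step along these lines could help. This is a sketch only: `merge_short_chunks` and the `min_chars` threshold are hypothetical and not part of chonky.

```python
def merge_short_chunks(chunks, min_chars=200):
    """Strip whitespace and merge neighbouring chunks until each has at least min_chars characters."""
    merged, parts, size = [], [], 0
    for chunk in chunks:
        piece = chunk.strip()
        if not piece:
            continue  # skip empty or whitespace-only chunks
        parts.append(piece)
        size += len(piece)
        if size >= min_chars:
            merged.append(" ".join(parts))
            parts, size = [], 0
    if parts:
        # Leftover text that never reached min_chars joins the previous chunk.
        tail = " ".join(parts)
        if merged:
            merged[-1] = merged[-1] + " " + tail
        else:
            merged.append(tail)
    return merged


print(merge_short_chunks(["aaaa", " bb", " cccc", " d"], min_chars=5))  # ['aaaa bb', 'cccc d']
```

Merging trades some semantic precision for more uniform chunk sizes, so the threshold is worth tuning against the retriever actually in use.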

Advanced usage

You can also use the model through the standard named entity recognition (NER) pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_large_1"

tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Two token labels: "O" for ordinary tokens, "separator" for tokens that close a chunk
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)


# aggregation_strategy="simple" groups adjacent tokens with the same label into one span
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# Returns one dict per predicted separator; wrap in print() when running as a script
pipe(text)

Advanced usage example output

[
  {'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
  {'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
  {'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
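
The pipeline returns separator spans rather than chunks. Judging from the output above, each `end` offset appears to mark where a chunk closes: the first separator ends at character 218, right after "short stories.", which is where the first chunk ends in the basic usage output. Under that assumption, the chunks can be rebuilt by cutting the text at those offsets. `split_at_separators` and its `min_score` parameter are hypothetical, and the chonky library may do this differently.

```python
def split_at_separators(text, entities, min_score=0.0):
    """Cut text after each predicted separator span whose score is at least min_score."""
    cut_points = sorted(
        e["end"]
        for e in entities
        if e["entity_group"] == "separator" and e["score"] >= min_score
    )
    chunks, start = [], 0
    for end in cut_points:
        if end > start:
            chunks.append(text[start:end])
            start = end
    if start < len(text):
        chunks.append(text[start:])  # text after the last separator
    return chunks


# Hand-written stand-in for pipeline output, in the same dict format as above.
text = "First topic. More on it. Second topic."
entities = [
    {"entity_group": "separator", "score": 0.9, "word": " it.", "start": 20, "end": 24},
]
for chunk in split_at_separators(text, entities):
    print(repr(chunk))
```

Raising `min_score` drops low-confidence separators such as the 0.62 one in the output above, which yields fewer, longer chunks.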