🚀 Chonky ModernBERT Base v1
Chonky is a Transformer model that intelligently segments text into meaningful semantic chunks. The model can be used in retrieval-augmented generation (RAG) systems.
🚀 Quick Start
The "Chonky ModernBERT Base v1" model intelligently segments text into meaningful semantic chunks for use in RAG systems. The sections below show how to use it.
✨ Key Features
- The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
- The model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths of up to 8192).
📦 Installation
You can use the model through chonky, a small companion Python library developed by the model's author.
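Assuming the library is published on PyPI under the name chonky, installation would typically be:

```shell
pip install chonky
```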
💻 Usage Examples
Basic Usage
```python
from chonky import ParagraphSplitter

# Load the splitter; use device="cuda" if a GPU is available
splitter = ParagraphSplitter(
    model_id="mirth/chonky_modernbert_base_1",
    device="cpu",
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# Iterate over the semantic chunks produced by the model
for chunk in splitter(text):
    print(chunk)
    print("--")
```
Output:

```
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
```
Advanced Usage
You can also use the model through a standard named-entity-recognition (NER) pipeline:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_base_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Token labels: "O" for ordinary tokens, "separator" for chunk boundaries
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
Output:

```
[
    {'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
    {'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
    {'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
```
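The pipeline output above marks chunk boundaries via character offsets rather than returning the chunks themselves. A minimal sketch of turning those separator spans into text chunks might look like the following; `chunks_from_separators` is a hypothetical helper (plain Python, independent of the model), and the sample offsets are illustrative:

```python
def chunks_from_separators(text, separator_ends):
    """Split text into chunks at the given character offsets.

    separator_ends: the 'end' offsets of the 'separator' spans
    predicted by the token-classification pipeline.
    """
    chunks = []
    start = 0
    for end in sorted(separator_ends):
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks

text = "First sentence ends here. Second sentence follows. A final tail."
# Hypothetical separator 'end' offsets, as a pipeline might predict them
print(chunks_from_separators(text, [25, 50]))
# → ['First sentence ends here.', 'Second sentence follows.', 'A final tail.']
```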
📚 Documentation
Training Data
The model was trained on a paragraph-splitting task over the BookCorpus dataset.
Evaluation Metrics
Token-based evaluation metrics:

| Metric    | Value |
| --------- | ----- |
| F1        | 0.79  |
| Precision | 0.83  |
| Recall    | 0.75  |
| Accuracy  | 0.99  |
Hardware
The model was fine-tuned for several hours on a single H100 GPU.
📄 License
This project is released under the MIT License.
⚠️ Important Note
This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths of up to 8192), so longer inputs should be pre-split before chunking.