🚀 Chonky ModernBERT Base v1
Chonky is a Transformer model that intelligently segments text into meaningful semantic chunks. The model can be used in retrieval-augmented generation (RAG) systems.
🚀 Quick Start
The "Chonky ModernBERT Base v1" model intelligently segments text into meaningful semantic chunks for use in RAG systems. The sections below show how to use it.
✨ Key Features
- The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
- The model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths of up to 8192).
📦 Installation
You can use the model through chonky, a small companion Python library developed by the model's author.
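Assuming the library is published on PyPI under the name chonky, installation would typically be:

```shell
pip install chonky
```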
💻 Usage Examples
Basic Usage
```python
from chonky import ParagraphSplitter

# Load the splitter; use device="cuda" if a GPU is available
splitter = ParagraphSplitter(
    model_id="mirth/chonky_modernbert_base_1",
    device="cpu",
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# Iterate over the semantic chunks produced by the model
for chunk in splitter(text):
    print(chunk)
    print("--")
```
Output:

```
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
```
Advanced Usage
You can also use the model through a standard named-entity-recognition (NER) pipeline:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_base_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# Token labels: "O" for ordinary tokens, "separator" for chunk boundaries
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
Output:

```
[
    {'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
    {'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
    {'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
```
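The pipeline output above marks chunk boundaries via character offsets rather than returning the chunks themselves. A minimal sketch of turning those separator spans into text chunks might look like the following; `chunks_from_separators` is a hypothetical helper (plain Python, independent of the model), and the sample offsets are illustrative:

```python
def chunks_from_separators(text, separator_ends):
    """Split text into chunks at the given character offsets.

    separator_ends: the 'end' offsets of the 'separator' spans
    predicted by the token-classification pipeline.
    """
    chunks = []
    start = 0
    for end in sorted(separator_ends):
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        start = end
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks

text = "First sentence ends here. Second sentence follows. A final tail."
# Hypothetical separator 'end' offsets, as a pipeline might predict them
print(chunks_from_separators(text, [25, 50]))
# → ['First sentence ends here.', 'Second sentence follows.', 'A final tail.']
```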
📚 Documentation
Training Data
The model was trained on a paragraph-splitting task over the BookCorpus dataset.
Evaluation Metrics
Token-based evaluation metrics:

| Metric    | Value |
| --------- | ----- |
| F1        | 0.79  |
| Precision | 0.83  |
| Recall    | 0.75  |
| Accuracy  | 0.99  |
Hardware
The model was fine-tuned for several hours on a single H100 GPU.
📄 License
This project is released under the MIT License.
⚠️ Important Note
This model was fine-tuned on sequences of length 1024 (by default, ModernBERT supports sequence lengths of up to 8192), so longer inputs should be pre-split before chunking.