🚀 Chonky DistilBERT Base Uncased v1
Chonky DistilBERT Base Uncased is a transformer model that intelligently segments text into meaningful semantic chunks. It can be used in retrieval-augmented generation (RAG) systems.
🚀 Quick Start
The model processes text and splits it into semantically coherent chunks. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
✨ Key Features
- Intelligently segments text into meaningful semantic chunks.
- Can be used in RAG systems.
📦 Installation
You can use this model through chonky, a small Python library developed by the model's author.
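A minimal install, assuming the package is published on PyPI under the name chonky:

pip install chonky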
💻 Usage Examples
Basic usage
from chonky import ParagraphSplitter

# Load the splitter; a different device string (e.g. "cuda") should select a GPU.
splitter = ParagraphSplitter(device="cpu")
text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
# The splitter yields semantically coherent chunks one at a time.
for chunk in splitter(text):
    print(chunk)
    print("--")
Advanced usage
You can also run the model through a standard named-entity-recognition (NER) pipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_distilbert_uncased_1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The model is a binary token classifier: "separator" marks a chunk boundary.
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# aggregation_strategy="simple" merges sub-word tokens into whole-word spans.
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
pipe(text)
[
{'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
{'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
]
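Each returned entry marks a predicted chunk boundary via its character offsets into the input text. The pipeline itself does not produce the chunks, so here is a minimal sketch (not part of the chonky API) that cuts the text at the reported `end` offsets:

spans = pipe(text)

chunks, start = [], 0
for span in spans:
    # Cut the text just after each predicted separator.
    chunks.append(text[start:span["end"]].strip())
    start = span["end"]
chunks.append(text[start:].strip())  # trailing text after the last separator

for chunk in chunks:
    print(chunk)
    print("--")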
📚 Documentation
Training Data
The model was trained on paragraphs from the bookcorpus dataset.
Evaluation Metrics

| Metric | Value |
|--------|-------|
| F1 | 0.7 |
| Precision | 0.79 |
| Recall | 0.63 |
| Accuracy | 0.99 |
Hardware
The model was fine-tuned on two 1080 Ti GPUs.
📄 License
This project is licensed under the MIT License.
Model Information

| Property | Details |
|----------|---------|
| Model type | Chonky DistilBERT Base Uncased v1 |
| Training data | bookcorpus dataset |