🚀 Chonky DistilBERT Base (uncased) v1

Chonky DistilBERT Base (uncased) v1 is a Transformer model that intelligently segments text into meaningful semantic chunks. The model can be used in retrieval-augmented generation (RAG) systems.
🚀 Quick Start

This model processes text and splits it into semantically coherent segments, which can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
✨ Key Features

- Intelligently splits text into meaningful semantic chunks.
- Can be used in RAG systems.
📦 Installation

You can use this model through chonky, a small Python library developed by the author (distributed on PyPI at the time of writing).
💻 Usage Examples

Basic usage
```python
from chonky import ParagraphSplitter

splitter = ParagraphSplitter(device="cpu")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")
```
Advanced usage

You can also use this model through a standard named-entity recognition (NER) pipeline:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_distilbert_uncased_1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The model labels each token either "O" or "separator" (a chunk boundary).
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
Output:

```python
[
    {'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
    {'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
]
```
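The pipeline returns separator spans rather than finished chunks. A minimal sketch of turning those predictions into chunks is to cut the text at each separator's `end` offset. The `chunks_from_separators` helper and the sample offsets below are illustrative, not part of the chonky library:

```python
def chunks_from_separators(text, predictions):
    """Split `text` at the `end` offset of each predicted separator."""
    chunks = []
    start = 0
    for p in predictions:
        if p["entity_group"] == "separator":
            chunks.append(text[start:p["end"]].strip())
            start = p["end"]
    tail = text[start:].strip()
    if tail:  # keep any text after the last separator
        chunks.append(tail)
    return chunks

# Hypothetical example: one separator predicted at the end of the first sentence.
sample = "First thought ends here. Second thought continues after."
preds = [{"entity_group": "separator", "score": 0.9, "word": ".", "start": 23, "end": 24}]
print(chunks_from_separators(sample, preds))
# → ['First thought ends here.', 'Second thought continues after.']
```

The same helper applies unchanged to the real `pipe(text)` output above, since the pipeline's character offsets index directly into the input string.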
📚 Documentation

Training data

The model was trained on paragraphs from the bookcorpus dataset.

Evaluation metrics

| Metric    | Value |
|-----------|-------|
| F1        | 0.7   |
| Precision | 0.79  |
| Recall    | 0.63  |
| Accuracy  | 0.99  |
Hardware

The model was fine-tuned on two 1080 Ti GPUs.
📄 License

This project is released under the MIT License.
Model Information

| Attribute     | Details |
|---------------|---------|
| Model type    | Chonky DistilBERT Base (uncased) v1 |
| Training data | bookcorpus dataset |