# Chonky ModernBERT Large v1
Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. It can be used in Retrieval-Augmented Generation (RAG) systems to improve the efficiency and accuracy of retrieval and generation.
## Quick Start
Chonky is a transformer model designed to process text and divide it into semantically coherent segments. These segments can be used in embedding-based retrieval systems or passed to language models as part of a RAG pipeline.
## Important Note
This model was fine-tuned with a sequence length of 1024 tokens (ModernBERT itself supports sequence lengths of up to 8192). Inputs longer than 1024 tokens should be pre-split before chunking; a sketch follows.
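As a rough guard against silent truncation, longer documents can be pre-split into windows of at most 1024 tokens before they are handed to the splitter. The helper below is a minimal sketch, not part of the chonky API; it greedily packs words into token-budgeted windows using the model's own tokenizer.

```python
# Minimal sketch (not part of the chonky API): pre-split long documents into
# windows of at most 1024 tokens so nothing is silently truncated.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_modernbert_large_1")

def window_by_tokens(text: str, max_tokens: int = 1024) -> list[str]:
    """Greedily pack whitespace-delimited words into windows whose token
    count stays within max_tokens. Window boundaries are crude; the point
    is only to keep each window within the fine-tuned sequence length."""
    windows, current, count = [], [], 0
    for word in text.split():
        n = len(tokenizer.tokenize(word))
        if current and count + n > max_tokens:
            windows.append(" ".join(current))
            current, count = [], 0
        current.append(word)
        count += n
    if current:
        windows.append(" ".join(current))
    return windows
```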
## Features
- Intelligently segments text into semantically meaningful chunks.
- Suitable for use in RAG systems (see the retrieval sketch after the Basic Usage example below).
## Installation
A small companion Python library, chonky, is available for this model.
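Assuming the package is published on PyPI under the same name (worth verifying against the project's own documentation), installation is a one-liner:

```bash
pip install chonky
```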
## Usage Examples
### Basic Usage
```python
from chonky import ParagraphSplitter

# Load the splitter with this model's weights.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_modernbert_large_1",
    device="cpu",
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# The splitter yields semantically coherent chunks.
for chunk in splitter(text):
    print(chunk)
    print("--")
```
### Sample Output

```text
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
```
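Because each chunk is a plain string, wiring the splitter into a retrieval step is straightforward. The sketch below is illustrative rather than prescribed by chonky: it assumes the sentence-transformers package is installed, uses all-MiniLM-L6-v2 as a stand-in embedding model, and reuses `splitter` and `text` from the Basic Usage example.

```python
# Illustrative RAG indexing sketch. Assumptions: sentence-transformers is
# installed, and all-MiniLM-L6-v2 is an arbitrary stand-in embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = list(splitter(text))  # splitter and text from the Basic Usage example
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query. With normalized
    embeddings, the dot product equals cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("What machines were in the school basement?"))
```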
### Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_large_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# The model is a binary token classifier: "separator" marks tokens that end a chunk.
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
### Sample Output

```text
[
    {'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
    {'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
    {'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
```
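The pipeline only reports where separators occur; to recover the chunks themselves, you can cut the input at each predicted end offset. The helper below is a hypothetical sketch, not part of transformers or chonky:

```python
def chunks_from_separators(text: str, entities: list[dict]) -> list[str]:
    """Hypothetical helper: split text after each predicted separator,
    using the character offsets returned by the pipeline."""
    chunks, prev = [], 0
    for ent in entities:
        chunks.append(text[prev:ent["end"]].strip())
        prev = ent["end"]
    tail = text[prev:].strip()
    if tail:
        chunks.append(tail)
    return chunks

for chunk in chunks_from_separators(text, pipe(text)):
    print(chunk)
    print("--")
```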
## Documentation
### Training Data
The model was trained to predict paragraph boundaries in text from the minipile and bookcorpus datasets.
### Metrics
#### Minipile

| Metric | Value |
|--------|-------|
| F1 | 0.85 |
| Precision | 0.87 |
| Recall | 0.82 |
| Accuracy | 0.99 |
#### Bookcorpus

| Metric | Value |
|--------|-------|
| F1 | 0.79 |
| Precision | 0.85 |
| Recall | 0.74 |
| Accuracy | 0.99 |
### Hardware
The model was fine-tuned on a single H100 GPU for several hours.
## License
This project is licensed under the MIT license.