# Chonky ModernBERT Large v1
Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. It can be used in Retrieval-Augmented Generation (RAG) systems to improve the efficiency and accuracy of retrieval and generation.
## Quick Start
Chonky is a transformer model designed to process text and divide it into semantically coherent segments. These segments can be used in embedding-based retrieval systems or passed to language models as part of a RAG pipeline.
## Important Note
This model was fine-tuned with a sequence length of 1024 tokens (ModernBERT itself supports sequence lengths of up to 8192). Inputs longer than 1024 tokens should be pre-split before chunking; a sketch follows.
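As a rough guard against silent truncation, longer documents can be pre-split into windows of at most 1024 tokens before they are handed to the splitter. The helper below is a minimal sketch, not part of the chonky API; it greedily packs words into token-budgeted windows using the model's own tokenizer.

```python
# Minimal sketch (not part of the chonky API): pre-split long documents into
# windows of at most 1024 tokens so nothing is silently truncated.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_modernbert_large_1")

def window_by_tokens(text: str, max_tokens: int = 1024) -> list[str]:
    """Greedily pack whitespace-delimited words into windows whose token
    count stays within max_tokens. Window boundaries are crude; the point
    is only to keep each window within the fine-tuned sequence length."""
    windows, current, count = [], [], 0
    for word in text.split():
        n = len(tokenizer.tokenize(word))
        if current and count + n > max_tokens:
            windows.append(" ".join(current))
            current, count = [], 0
        current.append(word)
        count += n
    if current:
        windows.append(" ".join(current))
    return windows
```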
## Features
- Intelligently segments text into semantically meaningful chunks.
- Suitable for use in RAG systems (see the retrieval sketch after the Basic Usage example below).
## Installation
A small companion Python library, chonky, is available for this model.
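Assuming the package is published on PyPI under the same name (worth verifying against the project's own documentation), installation is a one-liner:

```bash
pip install chonky
```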
## Usage Examples
### Basic Usage
```python
from chonky import ParagraphSplitter

# Load the splitter with this model's weights.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_modernbert_large_1",
    device="cpu",
)

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

# The splitter yields semantically coherent chunks.
for chunk in splitter(text):
    print(chunk)
    print("--")
```
### Sample Output

```text
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories.
--
My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing."
--
This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it.
--
It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
```
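Because each chunk is a plain string, wiring the splitter into a retrieval step is straightforward. The sketch below is illustrative rather than prescribed by chonky: it assumes the sentence-transformers package is installed, uses all-MiniLM-L6-v2 as a stand-in embedding model, and reuses `splitter` and `text` from the Basic Usage example.

```python
# Illustrative RAG indexing sketch. Assumptions: sentence-transformers is
# installed, and all-MiniLM-L6-v2 is an arbitrary stand-in embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = list(splitter(text))  # splitter and text from the Basic Usage example
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query. With normalized
    embeddings, the dot product equals cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(retrieve("What machines were in the school basement?"))
```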
### Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_modernbert_large_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

# The model is a binary token classifier: "separator" marks tokens that end a chunk.
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)
```
### Sample Output

```text
[
    {'entity_group': 'separator', 'score': np.float32(0.91590524), 'word': ' stories.', 'start': 209, 'end': 218},
    {'entity_group': 'separator', 'score': np.float32(0.6210419), 'word': ' processing."', 'start': 455, 'end': 468},
    {'entity_group': 'separator', 'score': np.float32(0.7071036), 'word': '.', 'start': 652, 'end': 653}
]
```
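The pipeline only reports where separators occur; to recover the chunks themselves, you can cut the input at each predicted end offset. The helper below is a hypothetical sketch, not part of transformers or chonky:

```python
def chunks_from_separators(text: str, entities: list[dict]) -> list[str]:
    """Hypothetical helper: split text after each predicted separator,
    using the character offsets returned by the pipeline."""
    chunks, prev = [], 0
    for ent in entities:
        chunks.append(text[prev:ent["end"]].strip())
        prev = ent["end"]
    tail = text[prev:].strip()
    if tail:
        chunks.append(tail)
    return chunks

for chunk in chunks_from_separators(text, pipe(text)):
    print(chunk)
    print("--")
```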
## Documentation
### Training Data
The model was trained to predict paragraph boundaries in text from the minipile and bookcorpus datasets.
### Metrics
#### Minipile

| Metric | Value |
|--------|-------|
| F1 | 0.85 |
| Precision | 0.87 |
| Recall | 0.82 |
| Accuracy | 0.99 |
#### Bookcorpus

| Metric | Value |
|--------|-------|
| F1 | 0.79 |
| Precision | 0.85 |
| Recall | 0.74 |
| Accuracy | 0.99 |
### Hardware
The model was fine-tuned on a single H100 GPU for several hours.
## License
This project is licensed under the MIT license.