Chonky: Open-Source Transformer Model for Segmenting Text into Semantic Chunks for RAG Systems

Chonky Distilbert Base Uncased 1

Developed by mirth

Chonky is a Transformer model that intelligently segments text into meaningful semantic chunks, suitable for RAG systems.

Sequence Labeling

Transformers

English

Open Source License: MIT

#Semantic Chunking #RAG Optimization #Text Segmentation

Downloads 1,486

Release Time: 4/10/2025

Model Overview

This model processes text and divides it into semantically coherent segments, which can be input into embedding-based retrieval systems or language models as part of the RAG pipeline.

Model Features

Intelligent Semantic Chunking

Capable of intelligently segmenting text into meaningful semantic chunks, improving the efficiency of RAG systems.

Based on DistilBERT

Uses the lightweight DistilBERT-base-uncased model, balancing performance and efficiency.

Easy Integration

Can be used through the dedicated chonky Python library or through the standard Transformers NER (token classification) pipeline.

Model Capabilities

Text Segmentation

Semantic Analysis

RAG System Support

Use Cases

Information Retrieval

RAG System Preprocessing

Prepares semantically coherent text chunks for embedding-based retrieval systems

Improves retrieval relevance and efficiency

Text Processing

Document Segmentation

Splits long documents into meaningful paragraphs

Facilitates subsequent analysis and processing

🚀 Chonky distilbert base (uncased) v1

Chonky is a transformer model that can intelligently segment text into meaningful semantic chunks. It can be used in RAG systems, offering a practical solution for text processing and retrieval.

✨ Features

The model processes text and divides it into semantically coherent segments. These chunks can be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

📦 Installation

The README does not give installation steps. The usage examples below import from the chonky and transformers Python packages, so both must be installed first.

💻 Usage Examples

Basic Usage

You can use the chonky library for easy text splitting:

from chonky import ParagraphSplitter

# on the first run it will download the transformer model
splitter = ParagraphSplitter(device="cpu")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
  print(chunk)
  print("--")
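Before the chunks go into an embedding-based retrieval system, it usually helps to give each one a stable identifier. The following is a minimal sketch of that step, not part of the chonky library: the function name chunks_to_records and the record fields (id, doc_id, text) are hypothetical, and the input is assumed to be any iterable of chunk strings, such as the output of the splitter above.

```python
def chunks_to_records(chunks, doc_id):
    """Turn an iterable of chunk strings into simple records for indexing.

    Hypothetical helper: the field names are illustrative, not a chonky API.
    """
    records = []
    for chunk in chunks:
        text = str(chunk).strip()
        if not text:
            # Skip empty or whitespace-only chunks.
            continue
        records.append({
            "id": f"{doc_id}-{len(records)}",  # position among kept chunks
            "doc_id": doc_id,
            "text": text,
        })
    return records


# Stand-in chunks so the sketch runs without downloading the model.
example_chunks = ["First paragraph.", "  ", "Second paragraph."]
records = chunks_to_records(example_chunks, doc_id="essay")
```

With the real splitter, the call would be chunks_to_records(splitter(text), doc_id="essay"), assuming the splitter yields strings as in the example above.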

Advanced Usage

You can also use the model with the standard NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_distilbert_uncased_1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)

# Output

[
  {'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
  {'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
]
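The pipeline returns separator positions rather than chunks, so the text still has to be cut at those offsets. Below is a minimal sketch of one way to do that, assuming each separator's end offset marks the end of a chunk. The helper name split_at_separators and its min_score parameter are hypothetical, and this is not necessarily how the chonky library does the split internally.

```python
def split_at_separators(text, separators, min_score=0.0):
    """Cut `text` after each predicted separator span.

    `separators` is a list of dicts shaped like the pipeline output above
    (keys 'score' and 'end' are used). Spans scoring below `min_score`
    are ignored. Hypothetical helper, not part of any library.
    """
    chunks = []
    start = 0
    for sep in sorted(separators, key=lambda s: s["end"]):
        if sep["score"] < min_score:
            continue
        chunk = text[start:sep["end"]].strip()
        if chunk:
            chunks.append(chunk)
        start = sep["end"]
    # Whatever follows the last separator forms the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


# Toy input with a hand-written span, so no model download is needed.
sample_text = "First topic. More on it. Second topic here."
sample_separators = [
    {"entity_group": "separator", "score": 0.9, "word": "it.", "start": 21, "end": 24},
]
sample_chunks = split_at_separators(sample_text, sample_separators)
```

On real output the call would be split_at_separators(text, pipe(text)). Raising min_score trades recall for precision: in the output above, a threshold of 0.7 would keep the first separator (score 0.895) and drop the second (score 0.612).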

📚 Documentation

Training Data

The model was trained to split paragraphs from the BookCorpus dataset.

Metrics

Metric	Value
F1	0.7
Precision	0.79
Recall	0.63
Accuracy	0.99
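As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The snippet below only uses the numbers listed above.

```python
# Reported values from the metrics table.
precision = 0.79
recall = 0.63

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 2))  # prints 0.7
```

The result agrees with the reported F1 of 0.7. The much higher accuracy of 0.99 is plausibly explained by class imbalance, since most tokens are not separators, although the source does not state how accuracy was measured.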

Hardware

The model was fine-tuned on 2x 1080 Ti GPUs.

📄 License

This project is licensed under the MIT license.
