Chonky open-source Transformer model: free to deploy, automatically splits text into semantic chunks, suited to RAG systems

Chonky Distilbert Base Uncased 1

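Developed by mirth

Chonky is a Transformer model that splits text into meaningful semantic chunks, which makes it a good fit for RAG systems.

Sequence labeling

Transformers

Language: English | Open-source license: MIT | #semantic-chunking #RAG-optimization #text-splitting

Downloads: 1,486

Release date: 4/10/2025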

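Model overview

The model processes text and splits it into semantically coherent segments. These chunks can then be fed into an embedding-based retrieval system or a language model as part of a RAG pipeline.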

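Model features

Intelligent semantic chunking

Splits text into meaningful semantic chunks to improve the efficiency of RAG systems.

DistilBERT-based

Uses the lightweight DistilBERT-base-uncased model, balancing performance and efficiency.

Easy to integrate

Offers two ways to use the model: a dedicated Python library and a standard NER pipeline.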

Model capabilities

Text splitting

Semantic analysis

RAG system support

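Use cases

Information retrieval

RAG system preprocessing

Prepare semantically coherent text blocks for embedding-based retrieval systems

Improve retrieval relevance and efficiency

Text processing

Document splitting

Split long documents into meaningful paragraphs

Make downstream analysis and processing easier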

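🚀 Chonky DistilBERT base (uncased) v1

Chonky is a transformer model that intelligently splits text into meaningful semantic chunks. The model can be used in RAG systems.

🚀 Quick Start

The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

✨ Key Features

Splits text into meaningful semantic chunks.
Can be used in RAG systems.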

📦 Installation

A small Python library, chonky, is available for this model.

💻 Usage Examples

Basic usage

from chonky import ParagraphSplitter

# Downloads the transformer model on first run
splitter = ParagraphSplitter(device="cpu")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

for chunk in splitter(text):
    print(chunk)
    print("--")
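Once the text has been split, the chunks can be indexed for retrieval. The following is a minimal sketch of that step, not part of the chonky library: `bag_of_words`, `cosine`, and `rank_chunks` are hypothetical helpers, and the word-count vectors are only a toy stand-in for a real embedding model. The `chunks` list is hand-written here so the example runs without downloading the model; in practice it would hold the strings yielded by `splitter(text)`.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(chunks, query):
    """Return (score, chunk) pairs sorted from most to least similar to the query."""
    query_vec = bag_of_words(query)
    scored = [(cosine(bag_of_words(chunk), query_vec), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical chunks, standing in for the strings yielded by splitter(text).
chunks = [
    "I wrote short stories. My stories were awful.",
    "The first programs I tried writing were on the IBM 1401.",
]
ranked = rank_chunks(chunks, "Which computer were the first programs written on?")
print(ranked[0][1])
```

In a real RAG pipeline the toy vectors would be replaced by dense embeddings and a vector index; the ranking logic stays the same in spirit.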

Advanced usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_distilbert_uncased_1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""

pipe(text)

# Output

[
  {'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
  {'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
]
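The pipeline output marks where separators fall but does not return the chunks themselves. One way to turn those predictions into text segments, sketched below, is to cut the original string after each separator's `end` offset. The helper `split_at_separators` and its `min_score` threshold are assumptions made for illustration, not part of the model or the chonky library, and the sample predictions are hand-written rather than real model output.

```python
def split_at_separators(text, entities, min_score=0.5):
    """Cut text after each predicted separator whose score is at least min_score.

    `entities` is a list of dicts shaped like the pipeline output above,
    each carrying at least "score" and "end" keys.
    """
    chunks = []
    start = 0
    for entity in sorted(entities, key=lambda e: e["end"]):
        if entity["score"] < min_score:
            continue  # skip low-confidence separators
        chunk = text[start:entity["end"]].strip()
        if chunk:
            chunks.append(chunk)
        start = entity["end"]
    # Whatever follows the last separator forms the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


# Synthetic example with hand-written offsets (not real model output).
sample = "First topic ends here. Second topic starts now."
predictions = [
    {"entity_group": "separator", "score": 0.9, "word": "here.", "start": 17, "end": 22},
]
print(split_at_separators(sample, predictions))
# → ['First topic ends here.', 'Second topic starts now.']
```

With the real pipeline, the call would be `split_at_separators(text, pipe(text))`. Raising `min_score` yields fewer, longer chunks because low-confidence separators such as the second one in the output above are ignored.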