🚀 Ruri: Japanese General Text Embeddings
Ruri is a model designed for generating Japanese general text embeddings. It offers multiple versions with different parameter sizes, suitable for various tasks such as sentence similarity and feature extraction. This README provides details on its usage, performance benchmarks, model architecture, training information, and licensing.
🚀 Quick Start
✨ Features
- Multiple Model Versions: v3 models are available, including cl-nagoya/ruri-v3-30m, cl-nagoya/ruri-v3-70m, cl-nagoya/ruri-v3-130m, and cl-nagoya/ruri-v3-310m, with different parameter counts and performance on the JMTEB benchmark.
- Sentence Similarity: Capable of calculating sentence similarity, useful for tasks like information retrieval and text classification.
- Japanese Language Support: Specifically optimized for the Japanese language.
📦 Installation
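First, install the Sentence Transformers library together with the tokenizer dependencies used for Japanese text (fugashi, sentencepiece, unidic-lite):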
pip install -U sentence-transformers fugashi sentencepiece unidic-lite
💻 Usage Examples
Basic Usage
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("cl-nagoya/ruri-small", trust_remote_code=True)
sentences = [
"クエリ: 瑠璃色はどんな色?",
"文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
"クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
"文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]
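# Encode all four texts; the result has one embedding row per input.
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# Broadcasting the two unsqueezed views yields a 4x4 matrix of pairwise cosine similarities.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)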
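The example above does two things that are easy to miss: each input carries a "クエリ: " (query) or "文章: " (passage) prefix, and every embedding is compared with every other one. The sketch below illustrates both steps without loading the model. `add_prefix` and `cosine_matrix` are hypothetical helper names, not part of Sentence Transformers, and the toy vectors only stand in for real embeddings.

```python
import math

QUERY_PREFIX = "クエリ: "    # prefix on the query strings in the example above
PASSAGE_PREFIX = "文章: "   # prefix on the passage strings in the example above


def add_prefix(texts, prefix):
    """Prepend `prefix` to each text unless the text already starts with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def cosine_matrix(vectors):
    """Return the pairwise cosine similarity matrix for a list of vectors.

    Assumes every vector has a non-zero norm; a zero vector would divide by zero.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


if __name__ == "__main__":
    print(add_prefix(["瑠璃色はどんな色?"], QUERY_PREFIX))
    # Toy 2-dimensional "embeddings": two orthogonal vectors and their sum.
    print(cosine_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Entry `[i][j]` of the returned matrix plays the same role as `similarities[i][j]` in the example above, so a query row can be scanned for the passage with the highest score.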
📚 Documentation
Benchmarks
JMTEB
Evaluated with JMTEB.
Model Details
Model Description
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
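In the printout above, `pooling_mode_mean_tokens` is the only pooling mode set to `True`, so the sentence embedding is the mean of the token embeddings. The following is a minimal sketch of that idea in plain Python. It is not the library's implementation, and `mean_pool` is a hypothetical name. It assumes the usual attention-mask convention, where padding tokens have mask value 0 and are left out of the average.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is non-zero.

    token_embeddings: list of equal-length vectors, one per token
    attention_mask:   list of 0/1 flags, one per token (0 marks padding)
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        # Simplification in this sketch: reject inputs with no real tokens.
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


if __name__ == "__main__":
    # Two real tokens and one padding token; the padding vector is ignored.
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(mean_pool(tokens, [1, 1, 0]))
```

In the actual model each token vector has 768 dimensions (`word_embedding_dimension`), so the pooled sentence embedding has 768 dimensions as well.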
Training Details
| Framework | Version |
|---|---|
| Python | 3.10.13 |
| Sentence Transformers | 3.0.0 |
| Transformers | 4.41.2 |
| PyTorch | 2.3.1+cu118 |
| Accelerate | 0.30.1 |
| Datasets | 2.19.1 |
| Tokenizers | 0.19.1 |
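When reproducing results, it can help to compare the local environment against the versions listed above. The sketch below does this with the standard library's `importlib.metadata`. `compare_versions` is a hypothetical helper, and the distribution names in the dictionary are the usual PyPI names for these libraries, which is an assumption rather than something this README states. Python itself is not a package, so its row is left out.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the training details table, keyed by assumed PyPI distribution name.
TRAINING_VERSIONS = {
    "sentence-transformers": "3.0.0",
    "transformers": "4.41.2",
    "torch": "2.3.1+cu118",
    "accelerate": "0.30.1",
    "datasets": "2.19.1",
    "tokenizers": "0.19.1",
}


def compare_versions(expected, lookup=version):
    """Map each package name to (expected, installed, matches).

    `installed` is None when the package is not installed.
    `lookup` is injectable so the function can be exercised without real packages.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in compare_versions(TRAINING_VERSIONS).items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {wanted}, installed {installed} ({status})")
```

A mismatch is not necessarily a problem. The comparison is an exact string match, so a different CUDA build tag such as `+cu121` is reported as a difference.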
🔧 Technical Details
The model is a Sentence Transformers model trained to generate high-quality embeddings for Japanese text. It starts from the base model cl-nagoya/ruri-pt-small and is then fine-tuned on the cl-nagoya/ruri-dataset-ft dataset.
📄 License
This model is published under the Apache License, Version 2.0.
Citation
@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}