🚀 Ruri: Japanese General Text Embeddings
Ruri v3 is a general-purpose Japanese text embedding model built on top of ModernBERT-Ja. It offers state-of-the-art performance for Japanese text embedding tasks, supports sequence lengths up to 8192 tokens, and has an expanded vocabulary of 100K tokens.
🚀 Quick Start
You can use our models directly with the transformers library v4.48.0 or higher. First, install the necessary libraries:
pip install -U "transformers>=4.48.0" sentence-transformers
If your GPUs support Flash Attention 2, we recommend using our models with it:
pip install flash-attn --no-build-isolation
Then you can load this model and run inference:
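import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("cl-nagoya/ruri-v3-310m", device=device)

# Ruri v3 distinguishes input types with a text prefix:
#   ""              general semantic encoding
#   "トピック: "      classification, clustering, topical information
#   "検索クエリ: "    retrieval queries
#   "検索文書: "      documents to be retrieved
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色?",
    "検索文書: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
]

# One embedding per sentence: shape (5, 768).
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())

# Pairwise cosine similarity between all sentences: shape (5, 5).
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)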
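The example inputs above mix unprefixed sentences with ones that begin with `トピック: `, `検索クエリ: `, or `検索文書: `. The sketch below shows one way to attach those prefixes consistently before calling `model.encode`. The prefix strings are copied from the example; the helper name `add_prefix` and the role labels (`semantic`, `topic`, `query`, `document`) are hypothetical and not part of any library.

```python
# Prefix strings copied from the example inputs; the dictionary keys are
# hypothetical labels chosen for this sketch.
PREFIXES = {
    "semantic": "",            # no prefix, as in the first two example sentences
    "topic": "トピック: ",
    "query": "検索クエリ: ",
    "document": "検索文書: ",
}


def add_prefix(texts, role):
    """Return a new list with the prefix for `role` prepended to each text."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    prefix = PREFIXES[role]
    return [prefix + text for text in texts]


queries = add_prefix(["瑠璃色はどんな色?"], "query")
documents = add_prefix(["瑠璃色(るりいろ)は、紫みを帯びた濃い青。"], "document")
print(queries[0])
print(documents[0])
```

The returned lists can be passed to `model.encode` in place of hand-written strings, which keeps queries and documents from accidentally sharing a prefix.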
✨ Features
- State-of-the-art performance for Japanese text embedding tasks.
- Supports sequence lengths up to 8192 tokens.
  - Previous versions of Ruri (v1, v2) were limited to 512 tokens.
- Expanded vocabulary of 100K tokens, compared to 32K in v1 and v2.
  - The larger vocabulary makes input sequences shorter, improving efficiency.
- Integrated FlashAttention, following ModernBERT's architecture.
  - Enables faster inference and fine-tuning.
- Tokenizer based solely on SentencePiece.
  - Unlike previous versions, which relied on Japanese-specific BERT tokenizers and required pre-tokenized input, Ruri v3 tokenizes with SentencePiece alone, so no external word segmentation tool is required.
📚 Documentation
Model Series
We provide Ruri-v3 in several model sizes. Below is a summary of each model.
Benchmarks
JMTEB
Evaluated with JMTEB.
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base model | cl-nagoya/ruri-v3-pt-310m |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 |
| Similarity Function | Cosine Similarity |
| Language | Japanese |
| License | Apache 2.0 |
| Paper | https://arxiv.org/abs/2409.07737 |
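The model compares embeddings with cosine similarity, and the Quick Start example computes a full pairwise matrix with `F.cosine_similarity`. As a rough illustration of what that matrix contains, here is a minimal pure-Python sketch on toy 2-dimensional vectors (real embeddings have 768 dimensions). The function names `cosine_similarity` and `pairwise_cosine` are defined here for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Square matrix whose entry [i][j] compares vectors[i] with vectors[j]."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for embeddings.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = pairwise_cosine(toy)
for row in matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, and vector length does not affect the score: `[0.0, 2.0]` is orthogonal to `[1.0, 0.0]` regardless of its magnitude.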
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
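The pooling layer above has only `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token embeddings. The following is a simplified pure-Python sketch of masked mean pooling, assuming padding positions are marked with `0` in an attention mask. It is not the library's implementation, which operates on batched tensors; the name `mean_pool` is hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention_mask entry is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Three token vectors; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

Excluding padded positions matters when sentences of different lengths are batched together, since averaging over padding would pull short sentences toward the padding vector.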
📄 License
This model is published under the Apache License, Version 2.0.
📖 Citation
@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}},
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737},
}