Ruri-v3-70m (Ruri v3) Open-Source Japanese Text Embedding Model - Support for Long Sequences and Advanced Performance Experience

Ruri V3 70m

Developed by cl-nagoya

Ruri v3 is a Japanese general-purpose text embedding model based on ModernBERT-Ja, supporting sequences up to 8192 tokens long and achieving state-of-the-art performance in Japanese text embedding tasks.

Text Embedding

Safetensors

JapaneseOpen Source License:Apache-2.0 #Japanese Text Embedding #Long Text Support #High-Precision Retrieval

Downloads 865

Release Time : 4/9/2025

Model Overview

Ruri v3 is a high-performance Japanese text embedding model designed for tasks such as Japanese text similarity, retrieval, and classification. It employs a pure SentencePiece tokenizer, supports long sequence processing, and integrates FlashAttention technology for improved efficiency.

Model Features

Long Sequence Support

Supports processing sequences up to 8192 tokens, far exceeding the previous 512-token limit.

Expanded Vocabulary

Vocabulary expanded to 100,000 tokens (previously 32,000), improving processing efficiency.

FlashAttention Integration

Incorporates FlashAttention technology from the ModernBERT architecture for faster inference and fine-tuning.

Pure SentencePiece Tokenization

Eliminates the need for external tokenization tools, simplifying preprocessing.

Multi-Task Prefix Scheme

Uses a 1+3 prefix scheme to distinguish different text input types (semantic, topic, query, document).

Model Capabilities

Japanese Text Embedding

Sentence Similarity Calculation

Text Retrieval

Text Classification

Clustering Analysis

Re-ranking Tasks

Use Cases

Information Retrieval

Document Retrieval

Build efficient retrieval systems using 'search query' and 'search document' prefixes.

Achieved 79.96 points in JMTEB retrieval tasks.

Text Analysis

Topic Classification

Use 'topic' prefix for text topic encoding and classification.

Achieved 76.97 points in JMTEB classification tasks.

Semantic Analysis

Sentence Similarity Calculation

Calculate semantic similarity between two Japanese sentences.

Achieved 79.82 points in JMTEB STS tasks.

🚀 Ruri: Japanese General Text Embeddings

Ruri v3 is a general - purpose Japanese text embedding model built on top of ModernBERT - Ja. It offers a solution for Japanese text embedding tasks, providing state - of - the - art performance and several technical advantages.

🚀 Quick Start

You can use our models directly with the transformers library v4.48.0 or higher:

pip install -U "transformers>=4.48.0" sentence-transformers

Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.

pip install flash-attn --no-build-isolation

Then you can load this model and run inference:

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("cl-nagoya/ruri-v3-70m", device=device)

# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色？",
    "検索文書: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [5, 384]

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9555, 0.8430, 0.6772, 0.7077],
#  [0.9555, 1.0000, 0.8333, 0.6636, 0.6971],
#  [0.8430, 0.8333, 1.0000, 0.8554, 0.8639],
#  [0.6772, 0.6636, 0.8554, 1.0000, 0.9500],
#  [0.7077, 0.6971, 0.8639, 0.9500, 1.0000]]

✨ Features

State - of - the - art performance: Offers top - notch performance for Japanese text embedding tasks.
Longer sequence support: Supports sequence lengths up to 8192 tokens, compared to the previous versions (v1, v2) which were limited to 512.
Expanded vocabulary: Has an expanded vocabulary of 100K tokens, as opposed to 32K in v1 and v2, making input sequences shorter and improving efficiency.
Integrated FlashAttention: Following ModernBERT's architecture, it enables faster inference and fine - tuning.
Simplified tokenizer: Uses a tokenizer based solely on SentencePiece. Unlike previous versions that relied on Japanese - specific BERT tokenizers and required pre - tokenized input, Ruri v3 performs tokenization with SentencePiece only, eliminating the need for an external word segmentation tool.

📦 Installation

pip install -U "transformers>=4.48.0" sentence-transformers

If your GPUs support Flash Attention 2:

pip install flash-attn --no-build-isolation

📚 Documentation

Model Series

We provide Ruri - v3 in several model sizes. Below is a summary of each model:

ID	#Param.	#Param. w/o Emb.	Dim.	#Layers	Avg. JMTEB
cl-nagoya/ruri-v3-30m	37M	10M	256	10	74.51
cl-nagoya/ruri-v3-70m	70M	31M	384	13	75.48
cl-nagoya/ruri-v3-130m	132M	80M	512	19	76.55
cl-nagoya/ruri-v3-310m	315M	236M	768	25	77.24

Benchmarks

JMTEB

Evaluated with JMTEB.

Model	#Param.	Avg.	Retrieval	STS	Classfification	Reranking	Clustering	PairClassification

Ruri-v3-30m	37M	74.51	78.08	82.48	74.80	93.00	52.12	62.40
Ruri-v3-70m (this model)	70M	75.48	79.96	79.82	76.97	93.27	52.70	61.75
Ruri-v3-130m	132M	76.55	81.89	79.25	77.16	93.31	55.36	62.26
Ruri-v3-310m	315M	77.24	81.89	81.22	78.66	93.43	55.69	62.60

sbintuitions/sarashina-embedding-v1-1b	1.22B	75.50	77.61	82.71	78.37	93.74	53.86	62.00
PLaMo-Embedding-1B	1.05B	76.10	79.94	83.14	77.20	93.57	53.47	62.37

OpenAI/text-embedding-ada-002	-	69.48	64.38	79.02	69.75	93.04	48.30	62.40
OpenAI/text-embedding-3-small	-	70.86	66.39	79.46	73.06	92.92	51.06	62.27
OpenAI/text-embedding-3-large	-	73.97	74.48	82.52	77.58	93.58	53.32	62.35

pkshatech/GLuCoSE-base-ja	133M	70.44	59.02	78.71	76.82	91.90	49.78	66.39
pkshatech/GLuCoSE-base-ja-v2	133M	72.23	73.36	82.96	74.21	93.01	48.65	62.37
retrieva-jp/amber-base	130M	72.12	73.40	77.81	76.14	93.27	48.05	64.03
retrieva-jp/amber-large	315M	73.22	75.40	79.32	77.14	93.54	48.73	60.97

sentence-transformers/LaBSE	472M	64.70	40.12	76.56	72.66	91.63	44.88	62.33
intfloat/multilingual-e5-small	118M	69.52	67.27	80.07	67.62	93.03	46.91	62.19
intfloat/multilingual-e5-base	278M	70.12	68.21	79.84	69.30	92.85	48.26	62.26
intfloat/multilingual-e5-large	560M	71.65	70.98	79.70	72.89	92.96	51.24	62.15

Ruri-Small	68M	71.53	69.41	82.79	76.22	93.00	51.19	62.11
Ruri-Small v2	68M	73.30	73.94	82.91	76.17	93.20	51.58	62.32
Ruri-Base	111M	71.91	69.82	82.87	75.58	92.91	54.16	62.38
Ruri-Base v2	111M	72.48	72.33	83.03	75.34	93.17	51.38	62.35
Ruri-Large	337M	73.31	73.02	83.13	77.43	92.99	51.82	62.29
Ruri-Large v2	337M	74.55	76.34	83.17	77.18	93.21	52.14	62.27

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	cl-nagoya/ruri-v3-pt-70m
Maximum Sequence Length	8192 tokens
Output Dimensionality	384
Similarity Function	Cosine Similarity
Language	Japanese
License	Apache 2.0
Paper	https://arxiv.org/abs/2409.07737

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

🔧 Technical Details

Ruri v3 is built on the foundation of ModernBERT - Ja. It has made several improvements in terms of sequence length support, vocabulary size, and tokenization method. The integration of FlashAttention also significantly enhances the efficiency of inference and fine - tuning.

📄 License

This model is published under the Apache License, Version 2.0.

📚 Citation

@misc{
  Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご