Ruri-v3-pt-30m Open-source Japanese Text Embedding Model - Supports Multi-parameter Versions to Handle Diverse Text Tasks

Ruri V3 Pt 30m

Developed by cl-nagoya

Ruri is a Japanese universal text embedding model based on ModernBERT-Ja, offering versions with different parameter scales suitable for various text processing tasks.

Text Embedding

Safetensors

Japanese | Open Source License: Apache-2.0 | #Japanese Text Embedding #Multi-prefix Encoding #Lightweight BERT

Downloads 250

Release Time: 3/20/2025

Model Overview

Ruri is a Japanese universal text embedding model primarily used for sentence similarity calculation and feature extraction. It is based on the ModernBERT-Ja architecture and supports prefix differentiation for various text types.

Model Features

Multiple Parameter Scale Versions

Offers model versions from 30m (37M parameters) to 310m (315M parameters) to fit different compute budgets.

1+3 Prefix Scheme

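Uses one empty prefix plus three special prefixes to differentiate text types: an empty string for semantic encoding, 'トピック: ' for classification, clustering, and topical information, '検索クエリ: ' for search queries, and '検索文書: ' for documents to be retrieved. Each non-empty prefix ends with a space, as in the usage example.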
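The prefix scheme can be wrapped in a small helper so that callers never type the Japanese prefixes by hand. This is a minimal sketch: the names `PREFIXES` and `add_prefix` and the task keys are hypothetical and not part of the Ruri release; only the prefix strings themselves come from the model card.

```python
# Prefix strings as listed in the model card; the dictionary keys are illustrative.
PREFIXES = {
    "semantic": "",            # plain semantic encoding
    "topic": "トピック: ",      # classification, clustering, topical information
    "query": "検索クエリ: ",    # retrieval queries
    "document": "検索文書: ",   # documents to be retrieved
}


def add_prefix(texts, kind):
    """Return a new list with the prefix for `kind` prepended to every text."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown text type: {kind!r}")
    return [PREFIXES[kind] + text for text in texts]


queries = add_prefix(["瑠璃色はどんな色？"], "query")
print(queries)
```

The prefixed strings can then be passed to the model's encoding function unchanged.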

High Performance

The fine-tuned Ruri v3 models achieve average JMTEB scores from 74.51 (30m) to 77.24 (310m), depending on the parameter scale.

Model Capabilities

Sentence Similarity Calculation

Text Feature Extraction

Semantic Encoding

Classification/Clustering Encoding

Search Query Encoding

Document Retrieval Encoding

Use Cases

Information Retrieval

Document Retrieval

Use the '検索クエリ: ' prefix to encode queries and the '検索文書: ' prefix to encode documents, then rank documents by embedding similarity to the query.
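A retrieval step along these lines might look like the sketch below. It assumes cosine similarity as the ranking score and takes the encoder as a caller-supplied function, so the function name `rank_documents` and its parameters are illustrative rather than part of any Ruri API.

```python
import numpy as np

QUERY_PREFIX = "検索クエリ: "
DOC_PREFIX = "検索文書: "


def rank_documents(encode, query, documents, top_k=3):
    """Return (document index, cosine score) pairs, best match first.

    `encode` is any callable that maps a list of strings to a 2-D array-like
    of embeddings (an assumption made for this sketch).
    """
    q = np.asarray(encode([QUERY_PREFIX + query]), dtype=np.float32)
    d = np.asarray(encode([DOC_PREFIX + doc for doc in documents]), dtype=np.float32)
    # L2-normalize so that the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = (d @ q.T).ravel()
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

With sentence-transformers, `model.encode` could be passed as the `encode` argument, since it accepts a list of strings and returns a NumPy array by default. The sketch does not guard against all-zero embeddings.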

Text Analysis

Topic Classification

Use the 'トピック: ' prefix to encode text for topic classification or clustering.
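One simple way to turn topic embeddings into labels is nearest-centroid classification, sketched below. This technique is not described in the model card and is only one option among many; `classify_by_centroid`, its parameters, and the caller-supplied `encode` function are all hypothetical names chosen for illustration.

```python
import numpy as np

TOPIC_PREFIX = "トピック: "


def classify_by_centroid(encode, labeled_examples, texts):
    """Assign each text the label whose example centroid is most similar.

    `labeled_examples` maps a label to a list of example texts.
    `encode` maps a list of strings to a 2-D array-like of embeddings.
    """
    def embed(items):
        vecs = np.asarray(encode([TOPIC_PREFIX + t for t in items]), dtype=np.float32)
        return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    labels = list(labeled_examples)
    # One centroid per label: the mean of its normalized example embeddings.
    centroids = np.stack([embed(labeled_examples[label]).mean(axis=0) for label in labels])
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    scores = embed(texts) @ centroids.T  # cosine similarity to each centroid
    return [labels[i] for i in scores.argmax(axis=1)]
```

A trained classifier on top of the embeddings would usually be preferable when enough labeled data is available; the centroid approach is shown only because it needs no training loop.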

Semantic Similarity Calculation

Compare embedding vectors of different texts to calculate semantic similarity.

🚀 Ruri: Japanese General Text Embeddings

Ruri is a general-purpose Japanese text embedding model, offering various model sizes for different needs.

⚠️ Important Note

This model is a pretrained version and has not been fine-tuned. For the fine-tuned version, please use [cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m)!

✨ Features

Fine-tuned Model Series

Ruri v3 is a general-purpose Japanese text embedding model built on top of [ModernBERT-Ja](https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a). We provide Ruri-v3 in several model sizes. Below is a summary of each model.

| ID | #Param. | #Param. w/o Emb. | Dim. | #Layers | Avg. JMTEB |
| --- | --- | --- | --- | --- | --- |
| [cl-nagoya/ruri-v3-30m](https://huggingface.co/cl-nagoya/ruri-v3-30m) | 37M | 10M | 256 | 10 | 74.51 |
| [cl-nagoya/ruri-v3-70m](https://huggingface.co/cl-nagoya/ruri-v3-70m) | 70M | 31M | 384 | 13 | 75.48 |
| [cl-nagoya/ruri-v3-130m](https://huggingface.co/cl-nagoya/ruri-v3-130m) | 132M | 80M | 512 | 19 | 76.55 |
| [cl-nagoya/ruri-v3-310m](https://huggingface.co/cl-nagoya/ruri-v3-310m) | 315M | 236M | 768 | 25 | 77.24 |

📦 Installation

You can use our models directly with the transformers library v4.48.0 or higher:

pip install -U "transformers>=4.48.0" sentence-transformers

Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.

pip install flash-attn --no-build-isolation
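Once flash-attn is installed, the attention backend still has to be requested when the model is loaded. The sketch below shows one way this might be done, assuming a sentence-transformers version whose `SentenceTransformer` constructor accepts `model_kwargs` and forwards them to transformers. The helper names `attention_kwargs` and `load_model` are hypothetical, and the choice of bfloat16 is an assumption (Flash Attention 2 requires half-precision weights).

```python
import importlib.util

MODEL_ID = "cl-nagoya/ruri-v3-pt-30m"


def attention_kwargs():
    """Return model kwargs enabling Flash Attention 2 when it looks usable."""
    if importlib.util.find_spec("flash_attn") is None:
        return {}  # flash-attn not installed: use the default attention
    import torch

    if not torch.cuda.is_available():
        return {}  # Flash Attention 2 needs a CUDA GPU
    return {
        "attn_implementation": "flash_attention_2",
        "torch_dtype": torch.bfloat16,  # assumption: half precision for FA2
    }


def load_model(model_id=MODEL_ID):
    """Load the model, opting into Flash Attention 2 only when available."""
    from sentence_transformers import SentenceTransformer

    kwargs = attention_kwargs()
    device = "cuda" if kwargs else None  # None lets the library choose
    return SentenceTransformer(model_id, device=device, model_kwargs=kwargs)
```

Falling back to an empty `model_kwargs` keeps the same loading code usable on machines without a supported GPU.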

💻 Usage Examples

Basic Usage

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-v3-pt-30m")

# Ruri v3 employs a 1+3 prefix scheme to distinguish between different types of text inputs:
# "" (empty string) is used for encoding semantic meaning.
# "トピック: " is used for classification, clustering, and encoding topical information.
# "検索クエリ: " is used for queries in retrieval tasks.
# "検索文書: " is used for documents to be retrieved.
sentences = [
    "川べりでサーフボードを持った人たちがいます",
    "サーファーたちが川べりに立っています",
    "トピック: 瑠璃色のサーファー",
    "検索クエリ: 瑠璃色はどんな色？",
    "検索文書: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# [5, 256]

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)

📚 Documentation

Citation

@misc{
  Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}

📄 License

This model is published under the Apache License, Version 2.0.

📋 Information Table

| Property | Details |
| --- | --- |
| Language | Japanese |
| Tags | sentence-similarity, feature-extraction |
| Base Model | sbintuitions/modernbert-ja-30m |
| Pipeline Tag | sentence-similarity |
| License | apache-2.0 |
| Datasets | cl-nagoya/ruri-v3-dataset-pt |
