ruri-base Open-source Japanese Text Embedding Model - Freely Achieve Sentence Similarity and Feature Extraction

Ruri Base

Developed by cl-nagoya

Ruri is a universal text embedding model for Japanese, focusing on sentence similarity and feature extraction tasks.

Text Embedding

Safetensors

Language: Japanese
Open Source License: Apache-2.0
Tags: #Japanese Text Embedding, #Long Text Support, #High JMTEB Score

Downloads 523.56k

Release Time: 8/28/2024

Model Overview

Ruri is a Japanese text embedding model based on the BERT architecture, primarily used for calculating sentence similarity and extracting text features. The model supports adding specific prefixes to query and passage texts for better performance.

Model Features

Japanese Optimization

Specially optimized for Japanese text, excelling in Japanese language tasks

Long Text Support

Accepts input sequences of up to 512 tokens (the v3 models listed under Quick Start extend this to 8192)

High Performance

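Scores 71.91 on average on the JMTEB benchmark, above every open model outside the Ruri family in the benchmark table; only OpenAI/text-embedding-3-large (73.97) and Ruri-Large (73.31) score higher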

Prefix Enhancement

Improves similarity calculation by adding query/passage prefixes
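The prefix convention can be sketched with a small helper. This is a minimal illustration, not part of the model's API: the two prefix strings come from the usage example in this card, while `add_prefix` and `prepare_inputs` are hypothetical names introduced here.

```python
# Prefix strings as given in this model card's usage example.
QUERY_PREFIX = "クエリ: "
PASSAGE_PREFIX = "文章: "


def add_prefix(texts, prefix):
    """Prepend `prefix` to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def prepare_inputs(queries, passages):
    """Return query-side texts followed by passage-side texts, all prefixed."""
    return add_prefix(queries, QUERY_PREFIX) + add_prefix(passages, PASSAGE_PREFIX)


# The resulting list can be passed directly to model.encode(...).
inputs = prepare_inputs(["瑠璃色はどんな色？"], ["瑠璃色は、紫みを帯びた濃い青。"])
for text in inputs:
    print(text)
```

Checking for an existing prefix keeps the helper idempotent, so texts that were already prefixed upstream are not prefixed twice.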

Model Capabilities

Sentence Similarity Calculation

Text Feature Extraction

Semantic Search

Text Clustering

Information Retrieval

Use Cases

Information Retrieval

Q&A System

Implements question-answering functionality by calculating similarity between queries and candidate answers

Achieved a score of 69.82 on JMTEB retrieval tasks
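A rough sketch of how candidate answers might be ranked against a query. It uses toy 3-dimensional vectors in place of real 768-dimensional embeddings so that it runs without downloading the model; `cosine_similarity` and `rank_candidates` are hypothetical helper names, not functions from any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidate_vecs, top_k=3):
    """Return up to `top_k` (index, score) pairs, highest similarity first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for embeddings of one query and three passages.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_candidates(query, candidates, top_k=2)
print([index for index, _ in ranking])  # [1, 2]
```

With the real model, the query vector would come from encoding a "クエリ: "-prefixed question and the candidate vectors from encoding "文章: "-prefixed passages, for example via `model.encode(texts).tolist()`.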

Text Analysis

Text Clustering

Automatically groups similar texts together

Achieved a score of 54.16 on JMTEB clustering tasks
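One simple way to group similar embeddings is greedy threshold clustering, sketched below. This is an illustrative technique chosen for brevity, not the procedure JMTEB uses for its clustering score; the `greedy_cluster` name and the 0.8 threshold are assumptions, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math


def _normalize(v):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed vector has cosine
    similarity >= threshold; otherwise start a new cluster with it as seed."""
    unit = [_normalize(v) for v in vectors]
    seeds = []   # index of the first member (seed) of each cluster
    labels = []  # cluster id assigned to each input vector
    for i, v in enumerate(unit):
        for cluster_id, seed_idx in enumerate(seeds):
            sim = sum(a * b for a, b in zip(v, unit[seed_idx]))
            if sim >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing seed was similar enough: open a new cluster.
            seeds.append(i)
            labels.append(len(seeds) - 1)
    return labels


toy = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
labels = greedy_cluster(toy, threshold=0.8)
print(labels)  # [0, 0, 1, 1]
```

The result depends on input order and on the threshold; for larger collections a standard algorithm such as k-means over the embeddings is the more common choice.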

🚀 Ruri: Japanese General Text Embeddings

Ruri provides general text embeddings for Japanese, useful for tasks like sentence similarity and feature extraction.

🚀 Quick Start

Note: v3 models are out!
We recommend using the following v3 models going forward.

ID	#Param.	Max Len.	Avg. JMTEB
cl-nagoya/ruri-v3-30m	37M	8192	74.51
cl-nagoya/ruri-v3-70m	70M	8192	75.48
cl-nagoya/ruri-v3-130m	132M	8192	76.55
cl-nagoya/ruri-v3-310m	315M	8192	77.24

✨ Features

Sentence Similarity: Can be used to calculate the similarity between Japanese sentences.
Feature Extraction: Extracts features from Japanese texts.

📦 Installation

First install the Sentence Transformers library together with the Japanese tokenizer dependencies (fugashi, sentencepiece, unidic-lite):

pip install -U sentence-transformers fugashi sentencepiece unidic-lite

💻 Usage Examples

Basic Usage

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-base")

# Don't forget to add the prefix "クエリ: " for query-side or "文章: " for passage-side texts.
sentences = [
    "クエリ: 瑠璃色はどんな色？",
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([4, 768])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9421, 0.6844, 0.7167],
#  [0.9421, 1.0000, 0.6626, 0.6863],
#  [0.6844, 0.6626, 1.0000, 0.8785],
#  [0.7167, 0.6863, 0.8785, 1.0000]]

📚 Documentation

Benchmarks

JMTEB

Evaluated with JMTEB.

Model	#Param.	Avg.	Retrieval	STS	Classification	Reranking	Clustering	PairClassification
cl-nagoya/sup-simcse-ja-base	111M	68.56	49.64	82.05	73.47	91.83	51.79	62.57
cl-nagoya/sup-simcse-ja-large	337M	66.51	37.62	83.18	73.73	91.48	50.56	62.51
cl-nagoya/unsup-simcse-ja-base	111M	65.07	40.23	78.72	73.07	91.16	44.77	62.44
cl-nagoya/unsup-simcse-ja-large	337M	66.27	40.53	80.56	74.66	90.95	48.41	62.49
pkshatech/GLuCoSE-base-ja	133M	70.44	59.02	78.71	76.82	91.90	49.78	66.39

sentence-transformers/LaBSE	472M	64.70	40.12	76.56	72.66	91.63	44.88	62.33
intfloat/multilingual-e5-small	118M	69.52	67.27	80.07	67.62	93.03	46.91	62.19
intfloat/multilingual-e5-base	278M	70.12	68.21	79.84	69.30	92.85	48.26	62.26
intfloat/multilingual-e5-large	560M	71.65	70.98	79.70	72.89	92.96	51.24	62.15

OpenAI/text-embedding-ada-002	-	69.48	64.38	79.02	69.75	93.04	48.30	62.40
OpenAI/text-embedding-3-small	-	70.86	66.39	79.46	73.06	92.92	51.06	62.27
OpenAI/text-embedding-3-large	-	73.97	74.48	82.52	77.58	93.58	53.32	62.35

Ruri-Small	68M	71.53	69.41	82.79	76.22	93.00	51.19	62.11
Ruri-Base (this model)	111M	71.91	69.82	82.87	75.58	92.91	54.16	62.38
Ruri-Large	337M	73.31	73.02	83.13	77.43	92.99	51.82	62.29

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	cl-nagoya/ruri-pt-base
Maximum Sequence Length	512 tokens
Output Dimensionality	768
Similarity Function	Cosine Similarity
Language	Japanese
License	Apache 2.0
Paper	https://arxiv.org/abs/2409.07737

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
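The Pooling layer above has only `pooling_mode_mean_tokens` enabled, meaning the sentence embedding is the average of the token vectors. A minimal pure-Python sketch of that idea follows, assuming padding positions are marked with 0 in an attention mask; `mean_pool` is a hypothetical name and this is not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    if count == 0:
        # No real tokens: return the zero vector rather than dividing by zero.
        return total
    return [x / count for x in total]


# Three token vectors of dimension 2; the last position is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

For this model the token vectors would be the 768-dimensional outputs of the BertModel, so the pooled result matches the output dimensionality listed in the Model Description table.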

Framework Versions

Python: 3.10.13
Sentence Transformers: 3.0.0
Transformers: 4.41.2
PyTorch: 2.3.1+cu118
Accelerate: 0.30.1
Datasets: 2.19.1
Tokenizers: 0.19.1

🔧 Technical Details

The model is based on the Sentence Transformer architecture and was fine-tuned on the cl-nagoya/ruri-dataset-ft dataset. It uses cosine similarity to measure the similarity between sentence embeddings.

📄 License

This model is published under the Apache License, Version 2.0.

Citation

@misc{
  Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}
