ruri-small Open-source Japanese Text Embedding Model - Calculate sentence similarity and extract text features for free

Ruri Small

Developed by cl-nagoya

Ruri is a model specialized in Japanese text embedding, capable of efficiently calculating sentence similarity and extracting text features.

Text Embedding

Safetensors

Language: Japanese | Open Source | License: Apache-2.0

Tags: #Japanese Text Embedding #High-precision Similarity Calculation #Long Text Support

Downloads 11.75k

Release Date: 8/28/2024

Model Overview

This model is a general-purpose Japanese text embedding model, primarily used for sentence similarity calculation and feature extraction. It is based on the DistilBERT architecture, accepts inputs of up to 512 tokens, and produces 768-dimensional embeddings.

Model Features

Efficient Japanese Processing

Optimized specifically for Japanese text, accurately understanding Japanese semantic features

High Performance

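Averages 71.53 on JMTEB, ahead of intfloat/multilingual-e5-small (69.52) and intfloat/multilingual-e5-base (70.12), both of which have more parameters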

Lightweight

A small model with only 68M parameters, suitable for resource-limited environments

Long Text Support

Supports a maximum sequence length of 512 tokens

Model Capabilities

Japanese Text Feature Extraction

Sentence Similarity Calculation

Semantic Search

Text Clustering

Use Cases

Information Retrieval

Semantic Search

Find relevant documents based on query semantics

Achieved a score of 69.41 in the JMTEB retrieval task

Text Analysis

Text Clustering

Group semantically similar texts together

Achieved a score of 51.19 in the JMTEB clustering task
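
As a rough illustration of the clustering use case, the sketch below groups vectors whose cosine similarity to a cluster's first member reaches a threshold. This is a simple greedy scheme chosen for brevity, not the method used in the JMTEB evaluation. The name `greedy_cluster`, the threshold of 0.8, and the toy 2-dimensional vectors (standing in for real 768-dimensional embeddings) are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list of indices; index 0 is the representative
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([i])
    return clusters


# Toy vectors: the first two point one way, the last two another.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(greedy_cluster(vectors))  # [[0, 1], [2, 3]]
```

With real embeddings, a library clustering algorithm would usually be a better choice; the point here is only that semantically close texts end up with high cosine similarity and can be grouped on that basis.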

🚀 Ruri: Japanese General Text Embeddings

Ruri is a family of models for generating general-purpose Japanese text embeddings. It is available in several sizes and suits tasks such as sentence similarity and feature extraction. This README covers usage, performance benchmarks, model architecture, training information, and licensing.

🚀 Quick Start

✨ Features

Multiple Model Versions: Newer v3 models are also available, namely cl-nagoya/ruri-v3-30m, cl-nagoya/ruri-v3-70m, cl-nagoya/ruri-v3-130m, and cl-nagoya/ruri-v3-310m. They differ in parameter count and in JMTEB benchmark performance.
Sentence Similarity: Capable of calculating sentence similarity, useful for tasks like information retrieval and text classification.
Japanese Language Support: Specifically optimized for the Japanese language.

📦 Installation

First, install the Sentence Transformers library together with the Japanese tokenizer dependencies:

pip install -U sentence-transformers fugashi sentencepiece unidic-lite

💻 Usage Examples

Basic Usage

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-small", trust_remote_code=True)

# Don't forget to add the prefix "クエリ: " for query-side or "文章: " for passage-side texts.
sentences = [
    "クエリ: 瑠璃色はどんな色？",
    "文章: 瑠璃色（るりいろ）は、紫みを帯びた濃い青。名は、半貴石の瑠璃（ラピスラズリ、英: lapis lazuli）による。JIS慣用色名では「こい紫みの青」（略号 dp-pB）と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

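# Encode all four texts into a single tensor of embeddings.
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([4, 768])

# Broadcast so every embedding is compared with every other one (4 x 4 matrix).
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)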
# [[1.0000, 0.9453, 0.6860, 0.7225],
#  [0.9453, 1.0000, 0.6852, 0.7005],
#  [0.6860, 0.6852, 1.0000, 0.8567],
#  [0.7225, 0.7005, 0.8567, 1.0000]]
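
The prefix convention and the similarity step above can be wrapped in small helpers for retrieval. The following is a minimal sketch, not part of the Ruri codebase: `with_prefix` and `rank_passages` are hypothetical names, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings that `model.encode` would return.

```python
import math

QUERY_PREFIX = "クエリ: "   # prefix for query-side texts
PASSAGE_PREFIX = "文章: "   # prefix for passage-side texts


def with_prefix(text, is_query):
    """Prepend the matching prefix unless the text already starts with it."""
    prefix = QUERY_PREFIX if is_query else PASSAGE_PREFIX
    return text if text.startswith(prefix) else prefix + text


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage index, score) pairs, most similar passage first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumption): in practice these vectors come from model.encode(...).
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

print(with_prefix("瑠璃色はどんな色？", is_query=True))
print([i for i, _ in rank_passages(query_vec, passage_vecs)])  # [1, 2, 0]
```

In a real pipeline, each query would be passed through `with_prefix(..., is_query=True)` and each document through `with_prefix(..., is_query=False)` before encoding, and the ranking would run on the resulting embeddings.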

📚 Documentation

Benchmarks

JMTEB

Evaluated with JMTEB.

Model	#Param.	Avg.	Retrieval	STS	Classification	Reranking	Clustering	Pair Classification
cl-nagoya/sup-simcse-ja-base	111M	68.56	49.64	82.05	73.47	91.83	51.79	62.57
cl-nagoya/sup-simcse-ja-large	337M	66.51	37.62	83.18	73.73	91.48	50.56	62.51
cl-nagoya/unsup-simcse-ja-base	111M	65.07	40.23	78.72	73.07	91.16	44.77	62.44
cl-nagoya/unsup-simcse-ja-large	337M	66.27	40.53	80.56	74.66	90.95	48.41	62.49
pkshatech/GLuCoSE-base-ja	133M	70.44	59.02	78.71	76.82	91.90	49.78	66.39
sentence-transformers/LaBSE	472M	64.70	40.12	76.56	72.66	91.63	44.88	62.33
intfloat/multilingual-e5-small	118M	69.52	67.27	80.07	67.62	93.03	46.91	62.19
intfloat/multilingual-e5-base	278M	70.12	68.21	79.84	69.30	92.85	48.26	62.26
intfloat/multilingual-e5-large	560M	71.65	70.98	79.70	72.89	92.96	51.24	62.15
OpenAI/text-embedding-ada-002	-	69.48	64.38	79.02	69.75	93.04	48.30	62.40
OpenAI/text-embedding-3-small	-	70.86	66.39	79.46	73.06	92.92	51.06	62.27
OpenAI/text-embedding-3-large	-	73.97	74.48	82.52	77.58	93.58	53.32	62.35
Ruri-Small (this model)	68M	71.53	69.41	82.79	76.22	93.00	51.19	62.11
Ruri-Base	111M	71.91	69.82	82.87	75.58	92.91	54.16	62.38
Ruri-Large	337M	73.31	73.02	83.13	77.43	92.99	51.82	62.29

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	cl-nagoya/ruri-pt-small
Maximum Sequence Length	512 tokens
Output Dimensionality	768
Similarity Function	Cosine Similarity
Language	Japanese
License	Apache 2.0
Paper	https://arxiv.org/abs/2409.07737

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
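
In the Pooling module above, only `pooling_mode_mean_tokens` is enabled, so a sentence embedding is the average of the token vectors, with padding positions left out. A minimal pure-Python sketch of that idea follows. `mean_pool` is a hypothetical helper written for illustration, not the Sentence Transformers implementation, and the 2-dimensional vectors stand in for the 768-dimensional token outputs.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1; padding (0) is ignored."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                summed[j] += vec[j]
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Two real tokens followed by one padding token that must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```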

Training Details

Framework	Version
Python	3.10.13
Sentence Transformers	3.0.0
Transformers	4.41.2
PyTorch	2.3.1+cu118
Accelerate	0.30.1
Datasets	2.19.1
Tokenizers	0.19.1

🔧 Technical Details

The model is a Sentence Transformer that generates embeddings for Japanese text. It starts from the base model cl-nagoya/ruri-pt-small and is then fine-tuned on the cl-nagoya/ruri-dataset-ft dataset.

📄 License

This model is published under the Apache License, Version 2.0.

Citation

@misc{Ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}
