🚀 RoSEtta
RoSEtta (RoFormer-based Sentence Encoder through Distillation) is a general-purpose Japanese text embedding model that excels at retrieval tasks. It handles long inputs with a maximum sequence length of 1024 tokens, can run on a CPU, and is designed both to measure semantic similarity between sentences and to serve as a retrieval system that searches passages based on queries.
✨ Features
- Utilizes RoPE (Rotary Position Embedding).
- Supports a maximum sequence length of 1024 tokens.
- Distilled from large sentence embedding models.
- Specialized for retrieval tasks.
During inference, the prefix "query: " or "passage: " is required. Refer to the Usage section for details.
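For example, the prefixes are simply prepended to the raw texts before encoding (an illustrative snippet; the texts are taken from the Usage examples below):

```python
# Prepend the required prefixes to raw texts before encoding (illustration only).
query_text = "日本で一番高い山は?"
passage_texts = ["富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。"]

queries = ["query: " + query_text]
passages = ["passage: " + t for t in passage_texts]
```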
📚 Documentation
Model Description
This model is based on the RoFormer architecture. After pre-training with an MLM loss, it was trained with weak supervision, and then further trained through distillation from several large embedding models and multi-stage contrastive learning (as in GLuCoSE v2).
- Maximum Sequence Length: 1024 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
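These properties can be confirmed directly from the loaded model (a minimal check, assuming sentence-transformers is installed; the expected values come from the list above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)
print(model.max_seq_length)                      # expected: 1024
print(model.get_sentence_embedding_dimension())  # expected: 768
```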
Usage
Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformer with the following code:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Remember to add the prefix "query: " to queries and "passage: " to passages.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)  # torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
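For query-to-passage retrieval, the same embeddings can drive a standard semantic-search loop. The sketch below uses sentence_transformers.util.semantic_search with the two example passages as a toy corpus; it is illustrative rather than a full retrieval setup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

query = "query: 日本で一番高い山は?"
passages = [
    "passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。",
    "passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。",
]

query_emb = model.encode([query], convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)

# Rank the passages by cosine similarity to the query.
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(hit["score"], passages[hit["corpus_id"]])
```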
Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Average the token embeddings, ignoring padded positions.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb


tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Remember to add the prefix "query: " to queries and "passage: " to passages.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

batch_dict = tokenizer(sentences, max_length=1024, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)  # torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
Training Details
RoSEtta was trained through the following steps:
- Pre-training:
- Weakly supervised learning:
- Ensemble distillation:
- Contrastive learning:
  - Triplets were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
  - This training aimed to improve the overall performance as a sentence embedding model (a generic sketch of such a contrastive objective is shown after this list).
- Search-specific contrastive learning:
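The exact losses and hyperparameters used in these stages are not reproduced here, but a contrastive objective over (anchor, positive) pairs with in-batch negatives typically looks like the following PyTorch sketch. The function name, temperature, and loss form are illustrative assumptions, not RoSEtta's actual training code; explicit hard negatives from the triplets could be appended as extra columns of the logits matrix.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Illustrative InfoNCE-style contrastive loss with in-batch negatives.

    anchor_emb:   (batch, dim) embeddings of queries / premises
    positive_emb: (batch, dim) embeddings of the paired positive passages
    Every other positive in the batch serves as a negative for an anchor.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```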
Benchmarks
Retrieval
Evaluated with MIRACL-ja, JQARA, JaCWIR and MLDR-ja.
Note: Results for the OpenAI small embedding model on JQARA and JaCWIR are quoted from the JQARA and JaCWIR benchmarks, respectively.
JMTEB
Evaluated with JMTEB. The average score is the macro-average across the six task categories.
| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| intfloat/multilingual-e5-large | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| cl-nagoya/ruri-large | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| intfloat/multilingual-e5-base | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| cl-nagoya/ruri-base | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 67.29 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| RoSEtta | 0.2B | 72.45 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
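As a quick illustration of the macro-average, RoSEtta's Avg. score is the unweighted mean of its six category scores from the row above:

```python
# Macro-average of RoSEtta's per-category JMTEB scores (values from the table above).
scores = {
    "Retrieval": 73.21, "STS": 81.39, "Classification": 72.41,
    "Reranking": 92.69, "Clustering": 53.23, "PairClassification": 61.74,
}
print(sum(scores.values()) / len(scores))  # approximately 72.445, matching the reported Avg. of 72.45
```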
👨‍💻 Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
📄 License
This model is published under the Apache License, Version 2.0.