🚀 RoSEtta
RoSEtta (RoFormer-based Sentence Encoder through Distillation) is a general-purpose Japanese text embedding model that excels at retrieval tasks. It handles long inputs with a maximum sequence length of 1024 tokens, can run on a CPU, and is designed both to measure semantic similarity between sentences and to serve as a retrieval system that searches passages based on queries.
✨ Features
- Utilizes RoPE (Rotary Position Embedding).
- Supports a maximum sequence length of 1024 tokens.
- Distilled from large sentence embedding models.
- Specialized for retrieval tasks.
During inference, the prefix "query: " or "passage: " is required. Refer to the Usage section for details.
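For example, the prefixes are simply prepended to the raw texts before encoding (an illustrative snippet; the texts are taken from the Usage examples below):

```python
# Prepend the required prefixes to raw texts before encoding (illustration only).
query_text = "日本で一番高い山は?"
passage_texts = ["富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。"]

queries = ["query: " + query_text]
passages = ["passage: " + t for t in passage_texts]
```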
📚 Documentation
Model Description
This model is based on the RoFormer architecture. After pre-training with an MLM loss, it was trained with weak supervision, and then further trained through distillation from several large embedding models and multi-stage contrastive learning (as in GLuCoSE v2).
- Maximum Sequence Length: 1024 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
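These properties can be confirmed directly from the loaded model (a minimal check, assuming sentence-transformers is installed; the expected values come from the list above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)
print(model.max_seq_length)                      # expected: 1024
print(model.get_sentence_embedding_dimension())  # expected: 768
```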
Usage
Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformer with the following code:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Remember to add the prefix "query: " to queries and "passage: " to passages.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)  # torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
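For query-to-passage retrieval, the same embeddings can drive a standard semantic-search loop. The sketch below uses sentence_transformers.util.semantic_search with the two example passages as a toy corpus; it is illustrative rather than a full retrieval setup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

query = "query: 日本で一番高い山は?"
passages = [
    "passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。",
    "passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。",
]

query_emb = model.encode([query], convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)

# Rank the passages by cosine similarity to the query.
hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
for hit in hits:
    print(hit["score"], passages[hit["corpus_id"]])
```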
Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Average the token embeddings, ignoring padded positions.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb


tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Remember to add the prefix "query: " to queries and "passage: " to passages.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

batch_dict = tokenizer(sentences, max_length=1024, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)  # torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
Training Details
RoSEtta was trained through the following steps:
- Pre-training:
- Weakly supervised learning:
- Ensemble distillation:
- Contrastive learning:
  - Triplets were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
  - This training aimed to improve the overall performance as a sentence embedding model (a generic sketch of such a contrastive objective is shown after this list).
- Search-specific contrastive learning:
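The exact losses and hyperparameters used in these stages are not reproduced here, but a contrastive objective over (anchor, positive) pairs with in-batch negatives typically looks like the following PyTorch sketch. The function name, temperature, and loss form are illustrative assumptions, not RoSEtta's actual training code; explicit hard negatives from the triplets could be appended as extra columns of the logits matrix.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(anchor_emb: torch.Tensor,
                  positive_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Illustrative InfoNCE-style contrastive loss with in-batch negatives.

    anchor_emb:   (batch, dim) embeddings of queries / premises
    positive_emb: (batch, dim) embeddings of the paired positive passages
    Every other positive in the batch serves as a negative for an anchor.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    logits = anchor @ positive.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```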
Benchmarks
Retrieval
Evaluated with MIRACL-ja, JQARA, JaCWIR and MLDR-ja.
Note: Results for the OpenAI small embedding model on JQARA and JaCWIR are quoted from the JQARA and JaCWIR benchmarks, respectively.
JMTEB
Evaluated with JMTEB. The average score is the macro-average across the six task categories.
| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| intfloat/multilingual-e5-large | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| cl-nagoya/ruri-large | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| intfloat/multilingual-e5-base | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| cl-nagoya/ruri-base | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 67.29 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| RoSEtta | 0.2B | 72.45 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
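As a quick illustration of the macro-average, RoSEtta's Avg. score is the unweighted mean of its six category scores from the row above:

```python
# Macro-average of RoSEtta's per-category JMTEB scores (values from the table above).
scores = {
    "Retrieval": 73.21, "STS": 81.39, "Classification": 72.41,
    "Reranking": 92.69, "Clustering": 53.23, "PairClassification": 61.74,
}
print(sum(scores.values()) / len(scores))  # approximately 72.445, matching the reported Avg. of 72.45
```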
👨‍💻 Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
📄 License
This model is published under the Apache License, Version 2.0.