🚀 Sarashina-Embedding-v1-1B
"Sarashina-Embedding-v1-1B" is a Japanese text embedding model. It is based on the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". This model maps sentences and paragraphs to a 1792 - dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
Japanese README
🚀 Quick Start
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then, you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    'サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。',
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Compute similarity scores between all pairs of sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# (3, 3)
```
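The same two calls cover semantic search: embed a query and a set of documents, then rank the documents by similarity score. A minimal sketch (the query string below is a made-up illustration, not part of the model card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# No "Query: " / "Document: " prefixes are needed (see the note below).
query = "日本語のテキスト埋め込みモデルについて教えてください。"
documents = [
    "サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。",
    "更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。",
]

query_embedding = model.encode([query])        # shape (1, 1792)
document_embeddings = model.encode(documents)  # shape (2, 1792)

# Similarity scores between the query and each document, shape (1, 2)
scores = model.similarity(query_embedding, document_embeddings)

# Print documents from most to least similar
for idx in scores[0].argsort(descending=True).tolist():
    print(f"{scores[0][idx].item():.3f}\t{documents[idx]}")
```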
⚠️ Important Note
- You do not need to add prefixes such as "Query: " or "Document: " to input sentences.
- This model is licensed under the Sarashina Model NonCommercial License Agreement, which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our contact page.
✨ Features
This model is based on the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". It uses multi-stage contrastive learning and has achieved the state-of-the-art average score across 16 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).
📚 Documentation
Model Details
Model Description
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```
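In this stack, the Pooling module with `'pooling_mode_lasttoken': True` means each text is represented by the hidden state of its last non-padding token rather than a mean over tokens. The sketch below shows roughly what that pooling step does, assuming the underlying LlamaModel can also be loaded directly with Hugging Face Transformers; this is illustrative only, and the supported path is the Sentence Transformers usage above (details such as dtype or special-token handling may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the checkpoint's underlying LlamaModel loads with AutoModel;
# the officially supported usage is via SentenceTransformer (see Quick Start).
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina-embedding-v1-1b")
model = AutoModel.from_pretrained("sbintuitions/sarashina-embedding-v1-1b")
tokenizer.padding_side = "right"  # so the last real token can be located via the attention mask

sentences = ["サラシナエンベディングは日本語埋め込みモデルです。", "更級日記は平安時代の回想録です。"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, 1792)

# Last-token pooling: pick the hidden state at the final non-padding position of each sequence.
last_token_pos = batch["attention_mask"].sum(dim=1) - 1  # index of the last real token
embeddings = hidden[torch.arange(hidden.size(0)), last_token_pos]
print(embeddings.shape)  # torch.Size([2, 1792])
```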
Training
"Sarashina-Embedding-v1-1B" is created through the following two - stage learning process:
Stage 1: Weakly-supervised Learning
To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
Datasets
| Dataset | Counts |
|---|---|
| Auto Wiki QA/NLI | 50,521,135 |
| Web-crawled data (ours) | 47,370,649 |
| MQA | 12,941,472 |
| [llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset) | 9,074,340 |
| Wikipedia | 5,555,212 |
| Quiz dataset (ours) | 988,478 |
| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) | 132,796 |
| JSQuAD | 62,859 |
| [SNOW (T15+T23)](https://aclanthology.org/L18-1185) | 62,758 |
| JaQuAD | 31,746 |
| [MKQA](https://aclanthology.org/2021.tacl-1.82) | 3,318 |
| Total | 126,744,763 |
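The Stage 1 training code itself is not included here, but contrastive training on such paired data is commonly an in-batch-negatives (InfoNCE) objective: each text is pulled toward its paired text and pushed away from every other pair in the batch. A hedged PyTorch sketch of that objective (the temperature and batch size are placeholders, not the actual training configuration):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    anchor_emb, positive_emb: (batch, dim) embeddings of paired texts
    (e.g. question/answer or title/passage pairs from weakly-supervised data).
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix; the diagonal holds the true pairs,
    # every off-diagonal entry acts as a negative.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors of the model's embedding size (1792)
loss = in_batch_contrastive_loss(torch.randn(8, 1792), torch.randn(8, 1792))
print(loss.item())
```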
Stage 2: Supervised Fine-tuning
To enable the model to learn more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.
Datasets
| Dataset | Counts |
|---|---|
| [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) | 141,388 |
| [NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli) | 67,987 |
| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (Japanese subset only) | 3,697 |
| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled) | 20,000 |
| Total | 233,072 |
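The fine-tuning script is likewise not published here. As an illustration only, training on (query, relevant document) pairs such as those above is often done in Sentence Transformers with `MultipleNegativesRankingLoss`; the example pairs and hyperparameters below are placeholders, not the actual recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Hypothetical (query, relevant passage) pairs; the real training data is listed in the table above.
train_examples = [
    InputExample(texts=["日本で一番高い山は?", "富士山は日本最高峰の山である。"]),
    InputExample(texts=["埋め込みモデルとは?", "テキストをベクトルに変換するモデルを埋め込みモデルと呼ぶ。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Ranks each query's paired passage above the other in-batch passages
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```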
Evaluation Results with JMTEB
| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| cl-nagoya/ruri-large | 512 | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (this model) | 8192 | 75.50 | 77.61 | 82.71 | 78.37 | 93.74 | 53.86 | 62.00 |
📄 License
This model is licensed under the Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.
[^oai]: Benchmarked on April 23, 2024.