🚀 Sarashina-Embedding-v1-1B
"Sarashina-Embedding-v1-1B" is a Japanese text embedding model. It is based on the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". This model maps sentences and paragraphs to a 1792 - dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
Japanese README
🚀 Quick Start
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then, you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    'サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。',
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Compute similarity scores between all pairs of sentences
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# (3, 3)
```
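The same two calls cover semantic search: embed a query and a set of documents, then rank the documents by similarity score. A minimal sketch (the query string below is a made-up illustration, not part of the model card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# No "Query: " / "Document: " prefixes are needed (see the note below).
query = "日本語のテキスト埋め込みモデルについて教えてください。"
documents = [
    "サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。",
    "更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。",
]

query_embedding = model.encode([query])        # shape (1, 1792)
document_embeddings = model.encode(documents)  # shape (2, 1792)

# Similarity scores between the query and each document, shape (1, 2)
scores = model.similarity(query_embedding, document_embeddings)

# Print documents from most to least similar
for idx in scores[0].argsort(descending=True).tolist():
    print(f"{scores[0][idx].item():.3f}\t{documents[idx]}")
```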
⚠️ Important Note
- You do not need to add prefixes such as "Query: " or "Document: " to input sentences.
- This model is licensed under the Sarashina Model NonCommercial License Agreement, which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our contact page.
✨ Features
This model is based on the 1.2B-parameter Japanese LLM "Sarashina2.1-1B". It uses multi-stage contrastive learning and has achieved the state-of-the-art average score across 16 datasets in JMTEB (Japanese Massive Text Embedding Benchmark).
📚 Documentation
Model Details
Model Description
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```
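In this stack, the Pooling module with `'pooling_mode_lasttoken': True` means each text is represented by the hidden state of its last non-padding token rather than a mean over tokens. The sketch below shows roughly what that pooling step does, assuming the underlying LlamaModel can also be loaded directly with Hugging Face Transformers; this is illustrative only, and the supported path is the Sentence Transformers usage above (details such as dtype or special-token handling may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: the checkpoint's underlying LlamaModel loads with AutoModel;
# the officially supported usage is via SentenceTransformer (see Quick Start).
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina-embedding-v1-1b")
model = AutoModel.from_pretrained("sbintuitions/sarashina-embedding-v1-1b")
tokenizer.padding_side = "right"  # so the last real token can be located via the attention mask

sentences = ["サラシナエンベディングは日本語埋め込みモデルです。", "更級日記は平安時代の回想録です。"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, 1792)

# Last-token pooling: pick the hidden state at the final non-padding position of each sequence.
last_token_pos = batch["attention_mask"].sum(dim=1) - 1  # index of the last real token
embeddings = hidden[torch.arange(hidden.size(0)), last_token_pos]
print(embeddings.shape)  # torch.Size([2, 1792])
```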
Training
"Sarashina-Embedding-v1-1B" is created through the following two - stage learning process:
Stage 1: Weakly-supervised Learning
To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
Datasets
| Dataset | Counts |
|---|---|
| Auto Wiki QA/NLI | 50,521,135 |
| Web-crawled data (ours) | 47,370,649 |
| MQA | 12,941,472 |
| [llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset) | 9,074,340 |
| Wikipedia | 5,555,212 |
| Quiz dataset (ours) | 988,478 |
| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) | 132,796 |
| JSQuAD | 62,859 |
| [SNOW (T15+T23)](https://aclanthology.org/L18-1185) | 62,758 |
| JaQuAD | 31,746 |
| [MKQA](https://aclanthology.org/2021.tacl-1.82) | 3,318 |
| Total | 126,744,763 |
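The Stage 1 training code itself is not included here, but contrastive training on such paired data is commonly an in-batch-negatives (InfoNCE) objective: each text is pulled toward its paired text and pushed away from every other pair in the batch. A hedged PyTorch sketch of that objective (the temperature and batch size are placeholders, not the actual training configuration):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    anchor_emb, positive_emb: (batch, dim) embeddings of paired texts
    (e.g. question/answer or title/passage pairs from weakly-supervised data).
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix; the diagonal holds the true pairs,
    # every off-diagonal entry acts as a negative.
    logits = anchor @ positive.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors of the model's embedding size (1792)
loss = in_batch_contrastive_loss(torch.randn(8, 1792), torch.randn(8, 1792))
print(loss.item())
```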
Stage 2: Supervised Fine-tuning
To enable the model to learn more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.
Datasets
| Dataset | Counts |
|---|---|
| [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88) | 141,388 |
| [NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli) | 67,987 |
| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (Japanese subset only) | 3,697 |
| [Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled) | 20,000 |
| Total | 233,072 |
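The fine-tuning script is likewise not published here. As an illustration only, training on (query, relevant document) pairs such as those above is often done in Sentence Transformers with `MultipleNegativesRankingLoss`; the example pairs and hyperparameters below are placeholders, not the actual recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Hypothetical (query, relevant passage) pairs; the real training data is listed in the table above.
train_examples = [
    InputExample(texts=["日本で一番高い山は?", "富士山は日本最高峰の山である。"]),
    InputExample(texts=["埋め込みモデルとは?", "テキストをベクトルに変換するモデルを埋め込みモデルと呼ぶ。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Ranks each query's paired passage above the other in-batch passages
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```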
Evaluation Results with JMTEB
| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| cl-nagoya/ruri-large | 512 | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (this model) | 8192 | 75.50 | 77.61 | 82.71 | 78.37 | 93.74 | 53.86 | 62.00 |
📄 License
This model is licensed under the Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.
[^oai]: Benchmarked on April 23, 2024.