fio-base-japanese-v0.1: Open-Source Japanese Embedding Model for Text Similarity and Feature Extraction

Fio Base Japanese V0.1

Developed by bclavie

The first release in the Fio series of Japanese embedding models. It is based on the BERT architecture and targets Japanese text similarity and feature extraction.

Tags: Text Embedding, Transformers, Japanese, Japanese Embedding Model, Sentence Similarity, Multi-task Training

Release date: 12/18/2023

Model Overview

This is a Japanese sentence-transformers model, used primarily for sentence similarity and feature extraction. It is a proof of concept trained on a limited amount of data.

Model Features

Multi-task Training

Trained on several Japanese datasets covering similarity/entailment (JSTS, JSNLI, JNLI, JSICK) and retrieval (MMARCO, Mr.TyDI, MIRACL).

Performance Advantage

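Scores 0.863 on JSTS (valid-v1.1) and 0.894 on the JSICK test set, the highest values in the comparison table. Note that the model saw entailment data, including JSICK, during training.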

Proof of Concept

As the first release of the Fio family of Japanese embeddings, it demonstrates the potential of this approach for Japanese tasks.

Model Capabilities

Japanese Text Embedding

Sentence Similarity Calculation

Text Feature Extraction

Japanese Document Retrieval

Use Cases

Text Similarity

Japanese Text Matching

Calculate the similarity between two Japanese sentences.

Scores 0.863 on JSTS (valid-v1.1).

Information Retrieval

Japanese Document Retrieval

Retrieve the Japanese documents most relevant to a query.

Scores 0.718 on MIRACL dev, below the multilingual-e5 models (0.845 and 0.883).

🚀 fio-base-japanese-v0.1

fio-base-japanese-v0.1 is a proof of concept and the first release of the Fio family of Japanese embeddings. It addresses the need for high-quality Japanese sentence embeddings by fine-tuning a pre-trained model on a limited amount of data.

🚀 Quick Start

This model requires both fugashi and unidic-lite. Install them with the following command:

```shell
pip install -U fugashi unidic-lite
```

If using for a retrieval task, you must prefix your query with "関連記事を取得するために使用できるこの文の表現を生成します: ".
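
The prefix requirement above can be wrapped in a small helper so that it is applied consistently. The following is a minimal sketch and not part of the original model card: `QUERY_PREFIX` is copied from the sentence above, while `prepare_query` and `encode_for_retrieval` are hypothetical names. Encoding documents without the prefix is an assumption, since the text only mentions queries.

```python
# Prefix quoted in the Quick Start text; it applies to retrieval queries.
QUERY_PREFIX = "関連記事を取得するために使用できるこの文の表現を生成します: "


def prepare_query(query: str) -> str:
    """Return the query with the retrieval prefix, adding it only once."""
    if query.startswith(QUERY_PREFIX):
        return query
    return QUERY_PREFIX + query


def encode_for_retrieval(model, queries, documents):
    """Encode prefixed queries and unprefixed documents.

    `model` is any object exposing `encode(list_of_str)`, for example a
    SentenceTransformer loaded from 'bclavie/fio-base-japanese-v0.1'.
    Leaving documents unprefixed is an assumption of this sketch.
    """
    query_embeddings = model.encode([prepare_query(q) for q in queries])
    document_embeddings = model.encode(list(documents))
    return query_embeddings, document_embeddings
```

Because `prepare_query` checks for an existing prefix, calling it twice on the same string does not duplicate the prefix.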

✨ Features

Based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3).
Trained on limited volumes of data on a single GPU.
Applicable for sentence-similarity and retrieval tasks.

📦 Installation

To use this model, you need to install the necessary libraries.

Install `fugashi` and `unidic-lite`

```shell
pip install -U fugashi unidic-lite
```

Install `sentence-transformers` (Optional but recommended)

```shell
pip install -U sentence-transformers
```

💻 Usage Examples

Basic Usage (Sentence-Transformers)

```python
from sentence_transformers import SentenceTransformer

sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

# Load the model from the Hugging Face Hub and embed each sentence
model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')
embeddings = model.encode(sentences)
print(embeddings)
```
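
To turn embeddings like those printed above into similarity scores, a common choice is cosine similarity. The sketch below uses only the standard library, so it runs on plain lists as well as on the rows of the array returned by `model.encode`. The names `cosine_similarity` and `similarity_matrix` are illustrative, and the choice of cosine similarity is an assumption, since the model card does not name a metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities; entry [i][j] compares sentences i and j."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]
```

Passing the three embeddings from the example to `similarity_matrix` yields a 3×3 matrix with ones on the diagonal. `sentence_transformers.util.cos_sim` computes the same quantity on tensors.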

Advanced Usage (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state, shaped (batch, tokens, hidden).
    # CLS pooling keeps the first token's vector; attention_mask is unused here.
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
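
For retrieval, sentence embeddings such as those above are typically compared with a query embedding, and the documents are sorted by score. The following sketch shows one way to do this and is not the author's evaluation code. `l2_normalize` and `rank_documents` are hypothetical helpers that operate on plain lists of floats (a tensor can be converted with `.tolist()`). Normalizing every vector to unit length first makes the dot product equal to the cosine similarity.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_documents(query_embedding, document_embeddings, top_k=3):
    """Return (document_index, score) pairs sorted from best to worst match."""
    query = l2_normalize(query_embedding)
    scored = []
    for index, embedding in enumerate(document_embeddings):
        document = l2_normalize(embedding)
        # Dot product of unit vectors equals their cosine similarity.
        score = sum(q * d for q, d in zip(query, document))
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

When the model is used this way, the query text should carry the retrieval prefix from the Quick Start section before it is embedded.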

📚 Documentation

Datasets

Similarity/Entailment

JSTS (train)
JSNLI (train)
JNLI (train)
JSICK (train)

Retrieval

MMARCO (Multilingual Marco) (train, 124k sentence pairs, <1% of the full data)
Mr.TyDI (train)
MIRACL (train, 50% sample)
~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, to serve as an unseen test set.

Results

⚠️ Important Note

fio-base-japanese-v0.1 has seen textual entailment tasks during its training, which is not the case for the other Japanese-only models in this table. This gives Fio an unfair advantage over the previous best results, cl-nagoya/sup-simcse-ja-[base|large]. During mid-training evaluations, this didn't seem to greatly affect performance. However, JSICK (NLI set) was included in the training data, so it is impossible to fully remove this contamination at the moment. I intend to fix this in a future release, but please keep this in mind as you view the results (see the JSQuAD results in the associated blog post for a fully unseen comparison, although one focused on retrieval).

This table is adapted and truncated (to keep only the most popular models) from oshizo's benchmarking GitHub repo. Please check it out for more information and give it a star, as it was very useful!

Italic denotes the best model for its size when a smaller model outperforms a bigger one (base/large | 768/1024); bold denotes the best overall.

| Property | Details |
|---|---|
| Model Type | fio-base-japanese-v0.1 |
| Training Data | JSTS (train), JSNLI (train), JNLI (train), JSICK (train), MMARCO (train, 124k sentence pairs, <1% of the full data), Mr.TyDI (train), MIRACL (train, 50% sample) |

| Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|---|---|---|---|---|
| bclavie/fio-base-japanese-v0.1 | *0.863* | *0.894* | 0.718 | 0.825 |
| cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 |
| cl-nagoya/sup-simcse-ja-large | 0.831 | 0.831 | 0.507 | 0.723 |
| colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 |
| intfloat/multilingual-e5-base | 0.796 | 0.806 | 0.845 | 0.816 |
| intfloat/multilingual-e5-large | 0.819 | 0.794 | 0.883 | *0.832* |
| pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 |
| text-embedding-ada-002 | 0.790 | 0.789 | 0.7232 | 0.768 |
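
The Average column appears to be the unweighted mean of the three benchmark columns. This is an inference from the numbers rather than something the table states. Two rows (GLuCoSE and ada-002) differ from the recomputed mean in the third decimal, presumably because the underlying scores carry more digits than are shown. A quick check, with all values copied from the table above:

```python
# (JSTS valid-v1.1, JSICK test, MIRACL dev, reported Average), copied from the table.
RESULTS = {
    "bclavie/fio-base-japanese-v0.1": (0.863, 0.894, 0.718, 0.825),
    "cl-nagoya/sup-simcse-ja-base": (0.809, 0.827, 0.527, 0.721),
    "cl-nagoya/sup-simcse-ja-large": (0.831, 0.831, 0.507, 0.723),
    "colorfulscoop/sbert-base-ja": (0.742, 0.657, 0.254, 0.551),
    "intfloat/multilingual-e5-base": (0.796, 0.806, 0.845, 0.816),
    "intfloat/multilingual-e5-large": (0.819, 0.794, 0.883, 0.832),
    "pkshatech/GLuCoSE-base-ja": (0.818, 0.757, 0.692, 0.755),
    "text-embedding-ada-002": (0.790, 0.789, 0.7232, 0.768),
}


def mean_score(jsts, jsick, miracl):
    """Unweighted mean of the three benchmark scores (assumed definition)."""
    return (jsts + jsick + miracl) / 3


for model_name, (jsts, jsick, miracl, reported) in RESULTS.items():
    recomputed = mean_score(jsts, jsick, miracl)
    print(f"{model_name}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```

Under this assumption, Fio's average is lifted by its two similarity scores, while the multilingual-e5 models owe theirs largely to MIRACL.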

📄 License

Citing & Authors

```bibtex
@misc{bclavie-fio-embeddings,
  author = {Benjamin Clavié},
  title = {Fio Japanese Embeddings},
  year = {2023},
  howpublished = {\url{https://ben.clavie.eu/fio}}
}
```
