
Sentence Bert Base Ja Mean Tokens V2

Developed by sonoisa

This is a Japanese-specific Sentence-BERT model. Compared to Version 1, it was trained with an improved loss function and scores 1.5 to 2 percentage points higher in accuracy on the developer's private dataset.

Text Embedding

Safetensors

#Japanese #Japanese semantic similarity #Improved loss function #BERT mean pooling

Downloads 108.15k

Release Time: 3/2/2022

Model Overview

Japanese Sentence-BERT model for generating sentence embeddings, suitable for tasks such as sentence similarity calculation and feature extraction.

Model Features

Optimized loss function

Trained using MultipleNegativesRankingLoss, achieving an accuracy improvement of 1.5 to 2 percentage points over Version 1 on a private dataset

Japanese-specific

Sentence-BERT model specifically optimized for Japanese text

Based on high-quality pre-trained model

Built on cl-tohoku/bert-base-japanese-whole-word-masking

Model Capabilities

Japanese sentence embedding

Sentence similarity calculation

Feature extraction

Use Cases

Text similarity

Semantic search

Implement semantic search by calculating sentence embedding similarity

Duplicate content detection

Identify sentences with similar semantics but different expressions

Information retrieval

Document clustering

Cluster documents based on sentence embeddings
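
The semantic search and duplicate detection use cases above both come down to comparing sentence embeddings, typically with cosine similarity. The following is a minimal sketch of that comparison, assuming the embeddings have already been computed by the model and converted to plain lists of floats. The function names `cosine_similarity` and `rank_by_similarity` and the toy vectors are illustrative assumptions, not part of the model or of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus):
    """Return (index, score) pairs for the corpus vectors, most similar first."""
    scores = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 0.0]
corpus = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, corpus)
print(ranking[0][0])  # index of the corpus vector closest to the query
```

For duplicate detection, the same score can be compared against a threshold chosen on your own data; no particular threshold is recommended by the model card.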

🚀 Japanese Sentence-BERT Model (Version 2)

This is a Japanese Sentence-BERT model (Version 2). It improves on Version 1 by training with a better loss function, MultipleNegativesRankingLoss. On our private dataset, it achieved about 1.5 to 2 percentage points higher accuracy than Version 1.
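
To give a feel for what MultipleNegativesRankingLoss optimizes, here is a minimal pure-Python sketch of the idea: for each (anchor, positive) pair in a batch, the positives of the other pairs serve as negatives, and the loss is a cross-entropy over scaled cosine similarities. The scale of 20 and the use of cosine similarity are assumptions made for this sketch; the actual training configuration of this model is not stated in this document, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every positive in the batch.
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the logits.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the correct positive.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings": aligned pairs versus deliberately swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is small when each anchor is closest to its own positive and large when another sentence in the batch is closer, which is why larger batches supply more negatives per pair.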

The model is built on the pre-trained model cl-tohoku/bert-base-japanese-whole-word-masking. Its tokenizer therefore requires fugashi and ipadic for inference (pip install fugashi ipadic).

🚀 Quick Start

Installation

The tokenizer depends on fugashi and ipadic, which you can install with the following command. The usage example also imports transformers and torch, which must be installed separately.

pip install fugashi ipadic

Usage

Here is an example of how to use the model:

from transformers import BertJapaneseTokenizer, BertModel
import torch


class SentenceBertJapanese:
    def __init__(self, model_name_or_path, device=None):
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name_or_path)
        self.model = BertModel.from_pretrained(model_name_or_path)
        self.model.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.model.to(device)

    def _mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    @torch.no_grad()
    def encode(self, sentences, batch_size=8):
        all_embeddings = []
        iterator = range(0, len(sentences), batch_size)
        for batch_idx in iterator:
            batch = sentences[batch_idx:batch_idx + batch_size]

            # Call the tokenizer directly; this is the current idiom for batch encoding.
            encoded_input = self.tokenizer(batch, padding="longest",
                                           truncation=True, return_tensors="pt").to(self.device)
            model_output = self.model(**encoded_input)
            sentence_embeddings = self._mean_pooling(model_output, encoded_input["attention_mask"]).to('cpu')

            all_embeddings.extend(sentence_embeddings)

        # return torch.stack(all_embeddings).numpy()
        return torch.stack(all_embeddings)


MODEL_NAME = "sonoisa/sentence-bert-base-ja-mean-tokens-v2"  # <- this is v2.
model = SentenceBertJapanese(MODEL_NAME)

sentences = ["暴走したAI", "暴走した人工知能"]
sentence_embeddings = model.encode(sentences, batch_size=8)

print("Sentence embeddings:", sentence_embeddings)
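
The `_mean_pooling` method in the class above averages token embeddings while ignoring padding positions, using the attention mask as a filter. The following pure-Python sketch illustrates the same computation for a single sentence with tiny hand-written vectors. The function name `masked_mean` and the toy values are illustrative assumptions only.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding (mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    # Mirror the clamp in _mean_pooling so an all-padding input does not divide by zero.
    denom = max(count, 1e-9)
    return [t / denom for t in total]


# Two real tokens followed by one padding token with arbitrary values.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
pooled = masked_mean(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

The padding token's values do not affect the result, which is the reason the mask is applied before summing rather than averaging over all positions.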

📚 Documentation

Explanation of the Old Version

You can find the explanation of the old version (Version 1) here. If you change the model name used there to "sonoisa/sentence-bert-base-ja-mean-tokens-v2", the same code will use this model.

📄 License

This project is licensed under the CC BY-SA 4.0 license.

📋 Information Table

Property	Details
Model Type	Sentence-BERT
Training Data	Not specified
License	CC BY-SA 4.0
Tags	sentence-transformers, sentence-bert, feature-extraction, sentence-similarity
