Sentence Bert Base Ja Mean Tokens
Developed by sonoisa
This is a Japanese-specific sentence embedding model based on the BERT architecture, designed to generate semantic vector representations of sentences and calculate sentence similarity.
Downloads 51.01k
Release Time: 3/2/2022
Model Overview
This model is the Japanese version of Sentence-BERT, specifically designed for processing Japanese sentences. It can convert Japanese sentences into high-dimensional vector representations, facilitating the calculation of semantic similarity between sentences.
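As a rough illustration of that workflow, here is a minimal sketch using the Hugging Face transformers library with mean pooling over token embeddings. The repository ID and the Japanese tokenizer dependencies (fugashi, ipadic) are assumptions based on the model name; consult the official model card for the exact loading instructions.

```python
# Minimal sketch: Japanese sentence embeddings via transformers + mean pooling.
# The model ID below and the tokenizer dependencies (fugashi, ipadic) are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "sonoisa/sentence-bert-base-ja-mean-tokens"  # assumed Hugging Face repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()


def encode(sentences: list[str]) -> torch.Tensor:
    """Return one embedding per sentence via mean pooling over token vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    token_embeddings = outputs.last_hidden_state          # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # 1 for real tokens, 0 for padding
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                                # (batch, hidden)


embeddings = encode(["今日は良い天気です。", "明日は雨が降りそうです。"])
print(embeddings.shape)  # e.g. torch.Size([2, 768]) for a BERT-base encoder
```

Masking with the attention mask keeps padding tokens from diluting the average, which is the point of the mean pooling strategy named in the model's title.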
Model Features
Japanese-Specific
A BERT model optimized specifically for Japanese, allowing it to better capture Japanese grammar and semantics.
Mean Pooling
Uses mean pooling to generate sentence embedding vectors, effectively capturing the overall semantics of sentences.
Improved Version
A Version 2 of this model is available, offering an accuracy improvement of approximately 1.5 percentage points.
Model Capabilities
Japanese sentence embedding
Sentence similarity calculation (see the sketch after this list)
Semantic feature extraction
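Similarity between two of these sentence vectors is typically measured with cosine similarity; below is a small self-contained sketch, where the vectors are assumed to come from an encoder like the one sketched above, converted to NumPy arrays.

```python
# Cosine similarity between two embedding vectors (assumed to be NumPy arrays).
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Returns a score in [-1, 1]; higher means the sentences are more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```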
Use Cases
Information Retrieval
Similar Question Search
Finding semantically similar questions in an FAQ system based on a user's query (see the sketch after this list).
Improves the accuracy and efficiency of Q&A systems.
Text Clustering
Document Classification
Automatically classifying documents based on sentence semantic similarity.
Reduces manual classification workload.
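As an illustrative sketch of the similar-question use case, the snippet below embeds a handful of hypothetical FAQ questions and returns the one closest to a user query by cosine similarity; the repository ID and the FAQ contents are assumptions, not part of the official documentation.

```python
# Sketch: FAQ-style similar question search with mean-pooled sentence embeddings.
# The model ID and the FAQ entries below are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "sonoisa/sentence-bert-base-ja-mean-tokens"  # assumed Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()


def encode(sentences: list[str]) -> torch.Tensor:
    """Mean-pooled sentence embeddings, as in the earlier sketch."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


faq_questions = [
    "パスワードを忘れた場合はどうすればいいですか？",  # "What should I do if I forget my password?"
    "返品は可能ですか？",                              # "Can I return an item?"
    "配送にはどのくらい時間がかかりますか？",          # "How long does delivery take?"
]
faq_embeddings = encode(faq_questions)  # in practice, precompute and cache these

query = "パスワードをリセットしたい"  # "I want to reset my password."
query_embedding = encode([query])

scores = F.cosine_similarity(query_embedding, faq_embeddings)  # one score per FAQ entry
best = int(torch.argmax(scores))
print(faq_questions[best], float(scores[best]))
```

In a real FAQ system, the FAQ embeddings would normally be precomputed once and stored in a vector index, so that only the incoming query needs to be encoded at request time.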