🚀 Sentence BERT base Japanese model
This repository houses a Sentence BERT base model tailored for the Japanese language, facilitating tasks such as sentence similarity and feature extraction.
🚀 Quick Start
First, install the necessary dependencies:
$ pip install sentence-transformers==2.0.0
Then, initialize the SentenceTransformer model and use the encode method to convert sentences to vectors:
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("colorfulscoop/sbert-base-ja")
>>> sentences = ["外をランニングするのが好きです", "海外旅行に行くのが趣味です"]
>>> model.encode(sentences)
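The encode method returns one 768-dimensional vector per sentence (see the model description below). As a quick check of sentence similarity, the two vectors can be compared with cosine similarity; this is a usage sketch built on the library's util.pytorch_cos_sim helper:
>>> from sentence_transformers import util
>>> embeddings = model.encode(sentences)
>>> util.pytorch_cos_sim(embeddings[0], embeddings[1])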
✨ Features
- Sentence Similarity: Ideal for calculating the similarity between Japanese sentences.
- Feature Extraction: Capable of extracting features from Japanese text.
📦 Installation
To use this model, install the sentence-transformers library:
$ pip install sentence-transformers==2.0.0
💻 Usage Examples
Basic Usage
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("colorfulscoop/sbert-base-ja")
>>> sentences = ["外をランニングするのが好きです", "海外旅行に行くのが趣味です"]
>>> model.encode(sentences)
📚 Documentation
Pretrained model
This model uses the Japanese BERT model colorfulscoop/bert-base-ja v1.0, released under Creative Commons Attribution-ShareAlike 3.0, as its pretrained model.
Training data
The Japanese SNLI dataset, released under Creative Commons Attribution-ShareAlike 4.0, is used for training. The original training set is split into train/valid sets (a sketch of the split follows the list below). Finally, the following data is prepared:
- Train data: 523,005 samples
- Valid data: 10,000 samples
- Test data: 3,916 samples
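For illustration, the train/valid split described above could be reproduced along the following lines. This is a hedged sketch: the file name jsnli_train.tsv and the seed are assumptions, not the preprocessing code actually used for this model.
import random

random.seed(0)  # assumed seed, for reproducibility of the sketch only
with open("jsnli_train.tsv", encoding="utf-8") as f:  # hypothetical file name
    samples = f.readlines()
random.shuffle(samples)
valid_data = samples[:10000]   # 10,000 validation samples
train_data = samples[10000:]   # remaining 523,005 training samples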
Model description
This model is a SentenceTransformer model from the sentence-transformers library. The model details are as follows:
>>> from sentence_transformers import SentenceTransformer
>>> SentenceTransformer("colorfulscoop/sbert-base-ja")
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
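For reference, an equivalent architecture can be assembled by hand with the modules API of sentence-transformers. This is a sketch of the structure printed above, not the code used to build the published checkpoint:
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("colorfulscoop/bert-base-ja", max_seq_length=512)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768 for this BERT base model
    pooling_mode_mean_tokens=True,                  # mean pooling, as in the repr above
)
model = SentenceTransformer(modules=[word_embedding, pooling])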
Training
This model fine-tunes colorfulscoop/bert-base-ja with a softmax classifier over the 3 SNLI labels. The AdamW optimizer is used with a learning rate of 2e-05, linearly warmed up over the first 10% of the training data. The model is trained for 1 epoch with a batch size of 8.
Note: In the original Sentence BERT paper, the model trained on SNLI and Multi-Genre NLI used a batch size of 16. Because the dataset here is about half the size of the original, the batch size is set to 8, half of the original 16.
Training is conducted on Ubuntu 18.04.5 LTS with one RTX 2080 Ti. After training, the test set accuracy reaches 0.8529. The training code is available in a GitHub repository.
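A minimal sketch of this fine-tuning setup with the sentence-transformers training API follows. The train_samples list is a toy placeholder (the real run uses the full Japanese SNLI train split), and the label encoding over the three NLI classes is an assumption; model.fit uses AdamW by default:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Loading a plain BERT checkpoint adds a mean-pooling layer automatically.
model = SentenceTransformer("colorfulscoop/bert-base-ja")

# Toy placeholders; real training uses the full Japanese SNLI train split.
train_samples = [
    InputExample(texts=["犬が走っている", "動物が動いている"], label=0),  # "A dog is running" / "An animal is moving"
    InputExample(texts=["犬が走っている", "猫が寝ている"], label=1),      # "A dog is running" / "A cat is sleeping"
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=8)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,  # entailment / neutral / contradiction
)
warmup_steps = int(len(train_dataloader) * 0.1)  # linear warmup over 10% of the steps
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 2e-05},  # AdamW is the library default
)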
🔧 Technical Details
- Pretrained Model: Based on colorfulscoop/bert-base-ja v1.0.
- Training Data: Japanese SNLI dataset.
- Model Structure: A SentenceTransformer with transformer and mean-pooling components.
- Training Process: Fine-tuning with a softmax classifier, the AdamW optimizer, and the training parameters described above.
📄 License
Copyright (c) 2021 Colorful Scoop. All the models included in this repository are licensed under Creative Commons Attribution-ShareAlike 4.0.
Disclaimer: Use of this model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output.
This model utilizes the following pretrained model:
| Property | Details |
|----------|---------|
| Model Name | bert-base-ja |
| Credit | (c) 2021 Colorful Scoop |
| License | Creative Commons Attribution-ShareAlike 3.0 |
| Disclaimer | The model may generate text that resembles the training data, text that is not true, or biased text. Use of the model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output. |
| Link | https://huggingface.co/colorfulscoop/bert-base-ja |
This model utilizes the following data for fine-tuning: