🚀 Sentence BERT base Japanese model
This repository houses a Sentence BERT base model tailored for the Japanese language, facilitating tasks such as sentence similarity and feature extraction.
🚀 Quick Start
First, install the necessary dependencies:
$ pip install sentence-transformers==2.0.0
Then, initialize the SentenceTransformer model and use the encode method to convert sentences to vectors:
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("colorfulscoop/sbert-base-ja")
>>> sentences = ["外をランニングするのが好きです", "海外旅行に行くのが趣味です"]
>>> model.encode(sentences)
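The encode method returns one 768-dimensional vector per sentence (see the model description below). As a quick check of sentence similarity, the two vectors can be compared with cosine similarity; this is a usage sketch built on the library's util.pytorch_cos_sim helper:
>>> from sentence_transformers import util
>>> embeddings = model.encode(sentences)
>>> util.pytorch_cos_sim(embeddings[0], embeddings[1])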
✨ Features
- Sentence Similarity: Ideal for calculating the similarity between Japanese sentences.
- Feature Extraction: Capable of extracting features from Japanese text.
📦 Installation
To use this model, install the sentence-transformers library:
$ pip install sentence-transformers==2.0.0
💻 Usage Examples
Basic Usage
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer("colorfulscoop/sbert-base-ja")
>>> sentences = ["外をランニングするのが好きです", "海外旅行に行くのが趣味です"]
>>> model.encode(sentences)
📚 Documentation
Pretrained model
This model uses the Japanese BERT model colorfulscoop/bert-base-ja v1.0, released under Creative Commons Attribution-ShareAlike 3.0, as its pretrained model.
Training data
The Japanese SNLI dataset, released under Creative Commons Attribution-ShareAlike 4.0, is used for training. The original training set is split into train/valid sets (a sketch of the split follows the list below). Finally, the following data is prepared:
- Train data: 523,005 samples
- Valid data: 10,000 samples
- Test data: 3,916 samples
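For illustration, the train/valid split described above could be reproduced along the following lines. This is a hedged sketch: the file name jsnli_train.tsv and the seed are assumptions, not the preprocessing code actually used for this model.
import random

random.seed(0)  # assumed seed, for reproducibility of the sketch only
with open("jsnli_train.tsv", encoding="utf-8") as f:  # hypothetical file name
    samples = f.readlines()
random.shuffle(samples)
valid_data = samples[:10000]   # 10,000 validation samples
train_data = samples[10000:]   # remaining 523,005 training samples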
Model description
This model is a SentenceTransformer model from the sentence-transformers library. The model details are as follows:
>>> from sentence_transformers import SentenceTransformer
>>> SentenceTransformer("colorfulscoop/sbert-base-ja")
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
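For reference, an equivalent architecture can be assembled by hand with the modules API of sentence-transformers. This is a sketch of the structure printed above, not the code used to build the published checkpoint:
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("colorfulscoop/bert-base-ja", max_seq_length=512)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 768 for this BERT base model
    pooling_mode_mean_tokens=True,                  # mean pooling, as in the repr above
)
model = SentenceTransformer(modules=[word_embedding, pooling])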
Training
This model fine-tunes colorfulscoop/bert-base-ja with a softmax classifier over the 3 SNLI labels. The AdamW optimizer is used with a learning rate of 2e-05, linearly warmed up over the first 10% of the training data. The model is trained for 1 epoch with a batch size of 8.
Note: In the original Sentence BERT paper, the model trained on SNLI and Multi-Genre NLI used a batch size of 16. Because the dataset here is about half the size of the original, the batch size is set to 8, half of the original 16.
Training is conducted on Ubuntu 18.04.5 LTS with one RTX 2080 Ti. After training, the test set accuracy reaches 0.8529. The training code is available in a GitHub repository.
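A minimal sketch of this fine-tuning setup with the sentence-transformers training API follows. The train_samples list is a toy placeholder (the real run uses the full Japanese SNLI train split), and the label encoding over the three NLI classes is an assumption; model.fit uses AdamW by default:
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Loading a plain BERT checkpoint adds a mean-pooling layer automatically.
model = SentenceTransformer("colorfulscoop/bert-base-ja")

# Toy placeholders; real training uses the full Japanese SNLI train split.
train_samples = [
    InputExample(texts=["犬が走っている", "動物が動いている"], label=0),  # "A dog is running" / "An animal is moving"
    InputExample(texts=["犬が走っている", "猫が寝ている"], label=1),      # "A dog is running" / "A cat is sleeping"
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=8)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,  # entailment / neutral / contradiction
)
warmup_steps = int(len(train_dataloader) * 0.1)  # linear warmup over 10% of the steps
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=warmup_steps,
    optimizer_params={"lr": 2e-05},  # AdamW is the library default
)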
🔧 Technical Details
- Pretrained Model: Based on colorfulscoop/bert-base-ja v1.0.
- Training Data: Japanese SNLI dataset.
- Model Structure: A SentenceTransformer with transformer and mean-pooling components.
- Training Process: Fine-tuning with a softmax classifier, the AdamW optimizer, and the training parameters described above.
📄 License
Copyright (c) 2021 Colorful Scoop. All the models included in this repository are licensed under Creative Commons Attribution-ShareAlike 4.0.
Disclaimer: Use of this model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output.
This model utilizes the following pretrained model:
| Property | Details |
|----------|---------|
| Model Name | bert-base-ja |
| Credit | (c) 2021 Colorful Scoop |
| License | Creative Commons Attribution-ShareAlike 3.0 |
| Disclaimer | The model may generate text that resembles the training data, text that is not true, or biased text. Use of the model is at your sole risk. Colorful Scoop makes no warranty or guarantee of any outputs from the model. Colorful Scoop is not liable for any trouble, loss, or damage arising from the model output. |
| Link | https://huggingface.co/colorfulscoop/bert-base-ja |
This model utilizes the following data for fine-tuning: