simcse-ja-bert-base-clcmlp
This is a BERT-based Japanese SimCSE model designed to extract high-quality sentence embeddings from Japanese text.
Downloads: 803
Release Time: 12/26/2022
Model Overview
This model is based on the BERT architecture and optimized for Japanese text. It generates high-quality sentence embeddings suitable for tasks such as sentence similarity calculation; see the usage sketch below.
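The model can be used like any sentence-embedding encoder. Below is a minimal sketch assuming the model is published on the Hugging Face Hub and loadable with the sentence-transformers library; the repository ID shown is a placeholder and should be replaced with the model's actual ID.

```python
from sentence_transformers import SentenceTransformer

# Placeholder Hub ID; replace with this model's actual repository name.
model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")

sentences = [
    "今日は天気が良いです。",  # "The weather is nice today."
    "本日は晴天です。",        # "It is sunny today."
]

# Each sentence is mapped to a fixed-size dense vector (its sentence embedding).
embeddings = model.encode(sentences)
print(embeddings.shape)  # e.g. (2, 768) for a BERT-base encoder
```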
Model Features
Japanese Optimization
Trained specifically for Japanese text using the JSNLI dataset
Efficient Embedding
Capable of quickly generating high-quality sentence embeddings
Cosine Similarity Optimization
Trained with a contrastive objective based on cosine similarity, making it particularly suitable for similarity calculation tasks (see the sketch below)
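To illustrate the similarity-oriented training, the sketch below (same assumptions as above: sentence-transformers and a placeholder Hub ID) computes the cosine similarity between two paraphrased Japanese sentences.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")  # placeholder Hub ID

emb1 = model.encode("猫がソファで寝ている。", convert_to_tensor=True)
emb2 = model.encode("ソファの上で猫が眠っている。", convert_to_tensor=True)

# Cosine similarity of the two embeddings; values close to 1.0 indicate
# semantically similar sentences.
score = util.cos_sim(emb1, emb2)
print(float(score))
```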
Model Capabilities
Sentence Embedding Extraction
Sentence Similarity Calculation
Japanese Text Feature Extraction
Use Cases
Text Analysis
Semantic Search
Used for building Japanese semantic search engines
Improves the relevance of search results; see the sketch below
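A minimal semantic-search sketch under the same assumptions (sentence-transformers, placeholder Hub ID): the corpus is embedded once, and queries are ranked against it by cosine similarity. The corpus and query are toy examples.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")  # placeholder Hub ID

corpus = [
    "東京の明日の天気は晴れです。",
    "新しいスマートフォンが発売された。",
    "日本の桜は春に咲きます。",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "桜はいつ咲きますか？"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```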
Text Clustering
Automatic classification and clustering of Japanese text
Enables unsupervised text organization; see the sketch below
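A clustering sketch under the same assumptions, using scikit-learn's KMeans on the sentence embeddings; the four toy sentences and the choice of two clusters are illustrative only.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")  # placeholder Hub ID

texts = [
    "サッカーの試合を観に行った。",
    "野球の練習が雨で中止になった。",
    "新しいノートパソコンを購入した。",
    "スマートフォンのバッテリーが長持ちする。",
]
embeddings = model.encode(texts)

# Group the embeddings into two clusters (sports vs. gadgets in this toy example).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
for text, label in zip(texts, kmeans.labels_):
    print(label, text)
```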
Natural Language Processing
Question Answering Systems
Used for building semantic matching components in Japanese QA systems
Improves the accuracy of question-answer matching; see the sketch below
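A sketch of FAQ-style question matching under the same assumptions: the user's question is matched to the stored FAQ question with the highest cosine similarity, and that entry's answer is returned. The FAQ content here is invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/simcse-ja-bert-base-clcmlp")  # placeholder Hub ID

# Hypothetical FAQ: each stored question maps to a canned answer.
faq = {
    "返品はできますか？": "購入後30日以内であれば返品可能です。",
    "送料はいくらですか？": "5,000円以上のご注文で送料無料です。",
}

faq_questions = list(faq.keys())
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)

user_question = "商品を返したいのですが。"
query_embedding = model.encode(user_question, convert_to_tensor=True)

# Pick the stored question whose embedding is most similar to the user's question.
scores = util.cos_sim(query_embedding, faq_embeddings)[0]
best = int(scores.argmax())
print(faq[faq_questions[best]])
```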