sup-simcse-ja-base
Developed by cl-nagoya
A Japanese sentence embedding model fine-tuned with the supervised SimCSE method, suited to sentence similarity calculation and feature extraction tasks.
Downloads: 3,027
Release date: 10/2/2023
Model Overview
This is a Japanese sentence embedding model based on the BERT architecture and fine-tuned on the JSNLI dataset with the supervised SimCSE method. It produces high-quality sentence embeddings and is suited to natural language processing tasks such as sentence similarity calculation and information retrieval.
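A minimal usage sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub as cl-nagoya/sup-simcse-ja-base and can be loaded through the sentence-transformers library; the underlying Japanese tokenizer typically also requires the fugashi and unidic-lite packages.

```python
from sentence_transformers import SentenceTransformer

# Illustrative Japanese sentences (not from the model release).
sentences = [
    "今日は天気が良いです。",  # "The weather is nice today."
    "本日は晴天です。",        # "It is sunny today."
]

# Assumed Hub identifier for this model.
model = SentenceTransformer("cl-nagoya/sup-simcse-ja-base")

embeddings = model.encode(sentences)
print(embeddings.shape)  # expected (2, 768) for a BERT-base backbone
```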
Model Features
Supervised SimCSE Fine-tuning
Fine-tuned with the supervised SimCSE method, which improves the quality and discriminative power of the sentence embeddings.
Japanese Optimization
Built on the Japanese BERT model cl-tohoku/bert-base-japanese-v3 and optimized specifically for Japanese text.
Efficient Pooling Strategy
Uses CLS-token pooling, with an additional MLP layer during training, to strengthen sentence representations.
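The CLS pooling described above can be reproduced with the plain transformers API, as in the sketch below. It assumes the checkpoint exposes a standard BERT encoder under the same assumed Hub identifier as above; since the MLP layer is noted as a training-time addition, the sketch takes the raw CLS hidden state as the sentence embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hub identifier; the Japanese tokenizer typically needs fugashi and unidic-lite.
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/sup-simcse-ja-base")
model = AutoModel.from_pretrained("cl-nagoya/sup-simcse-ja-base")
model.eval()

batch = tokenizer(["吾輩は猫である。"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

# CLS pooling: take the hidden state of the first ([CLS]) token of each sentence.
sentence_embedding = outputs.last_hidden_state[:, 0]
print(sentence_embedding.shape)  # expected torch.Size([1, 768])
```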
Model Capabilities
Sentence embedding generation
Sentence similarity calculation
Japanese text feature extraction
Information retrieval
Use Cases
Natural Language Processing
Semantic Search
Used to build Japanese semantic search engines that retrieve relevant documents by semantic similarity to a query sentence; a ranking sketch follows this list.
Text Clustering
Performs clustering analysis on Japanese texts to discover similar content or topics.
Question Answering Systems
Serves as a retrieval component in question answering systems, matching questions with relevant knowledge passages.
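Below is a hypothetical sketch of the semantic search use case: it ranks a small, made-up Japanese corpus by cosine similarity to a query (the document texts and the ranking loop are illustrative, not part of the model release).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cl-nagoya/sup-simcse-ja-base")  # assumed Hub identifier

# Illustrative corpus and query.
corpus = [
    "東京は日本の首都です。",          # "Tokyo is the capital of Japan."
    "富士山は日本で一番高い山です。",  # "Mt. Fuji is the highest mountain in Japan."
    "寿司は日本の伝統的な料理です。",  # "Sushi is a traditional Japanese dish."
]
query = "日本で最も高い山は何ですか？"  # "What is the highest mountain in Japan?"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every document, best match first.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {corpus[idx]}")
```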