unsup-simcse-ja-base: Open-Source Model for Japanese Sentence Embeddings

Developed by cl-nagoya

A Japanese sentence embedding model, fine-tuned from cl-tohoku/bert-base-japanese-v3 with unsupervised SimCSE.

Text Embedding

Transformers

Japanese

#Japanese sentence embedding #Unsupervised learning #Sentence similarity calculation

Downloads 190

Release date: 10/2/2023

Model Overview

This model is trained with the unsupervised SimCSE method and converts Japanese sentences into 768-dimensional vectors, suitable for tasks such as sentence similarity calculation.

Model Features

Unsupervised learning

Trained using the unsupervised SimCSE method, requiring no labeled data

Japanese-specific

Sentence embedding model specifically optimized for Japanese text

High-quality embeddings

Generated sentence embeddings effectively capture semantic information

Model Capabilities

Sentence embedding generation

Sentence similarity calculation

Japanese text feature extraction

Use Cases

Information retrieval

Semantic search

Use sentence embeddings to search by meaning rather than by keyword match

Text similarity

Duplicate content detection

Identify texts that are worded differently but have similar meaning
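
As a rough illustration of the duplicate-detection use case, the sketch below flags pairs of texts whose embeddings have a cosine similarity above a cutoff. It assumes the embeddings have already been computed with this model; the helper names, the toy 2-dimensional vectors, and the 0.9 threshold are illustrative assumptions, and a real threshold would need tuning on your own data.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def find_near_duplicates(embeddings, threshold=0.9):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    unit = [l2_normalize(vec) for vec in embeddings]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        similarity = sum(a * b for a, b in zip(unit[i], unit[j]))
        if similarity >= threshold:
            pairs.append((i, j, similarity))
    return pairs


# Toy stand-ins for real 768-dimensional sentence embeddings (assumption).
toy_embeddings = [[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]]
duplicates = find_near_duplicates(toy_embeddings, threshold=0.9)
print(duplicates)  # only the pair (0, 1) passes the threshold
```

The pairwise loop is quadratic in the number of texts, so this form is only practical for small collections.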

🚀 unsup-simcse-ja-base

This model is intended for feature extraction and sentence similarity tasks. It can be used through either sentence-transformers or transformers.

🚀 Quick Start

📦 Installation

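The easiest way to use this model is through sentence-transformers. Install it together with fugashi and unidic-lite, which the Japanese tokenizer requires (the quotes keep shells such as zsh from interpreting the square brackets):

pip install -U "fugashi[unidic-lite]" sentence-transformers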

💻 Usage Examples

Basic Usage

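from sentence_transformers import SentenceTransformer

# Japanese example sentences to embed
sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer("cl-nagoya/unsup-simcse-ja-base")
embeddings = model.encode(sentences)  # NumPy array, one row per sentence
print(embeddings)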
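
Once embeddings are available, semantic search reduces to ranking candidates by cosine similarity to a query embedding. The following is a minimal sketch of that step in plain Python; `rank_by_cosine` is a hypothetical helper, and the short hand-written vectors stand in for the real output of `model.encode`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, corpus_vecs, top_k=3):
    """Return the top_k (index, score) pairs, most similar first."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for embeddings (assumption; real vectors have 768 dimensions).
corpus_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]

ranking = rank_by_cosine(query_vec, corpus_vecs, top_k=2)
for idx, score in ranking:
    print(idx, round(score, 4))
```

With the real model, `query_vec` would be `model.encode([query])[0]` and `corpus_vecs` the result of `model.encode(corpus_sentences)`. For larger collections, the `util.cos_sim` function in sentence-transformers computes the same scores in a vectorized way.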

Advanced Usage

Without sentence-transformers, you can use the model through transformers directly: pass the input through the transformer model, then apply the appropriate pooling operation (here, CLS pooling) to the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


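def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state, shape (batch, seq_len, hidden).
    # CLS pooling keeps only the first token's vector; attention_mask is unused.
    return model_output[0][:, 0]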


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/unsup-simcse-ja-base")
model = AutoModel.from_pretrained("cl-nagoya/unsup-simcse-ja-base")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

📚 Documentation

🔧 Technical Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
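
In the Pooling module above, only `pooling_mode_cls_token` is enabled, so the sentence embedding is the vector of the first ([CLS]) token rather than an average over all tokens. The plain-Python sketch below contrasts the two strategies on a toy batch; the helper names and the tiny 2-dimensional vectors are illustrative assumptions, not part of sentence-transformers.

```python
def cls_pool(token_embeddings):
    """Return the first-token ([CLS]) vector of each sequence in the batch."""
    return [sequence[0] for sequence in token_embeddings]


def mean_pool(token_embeddings, attention_mask):
    """Average the vectors of non-padding tokens (shown only for contrast)."""
    pooled = []
    for sequence, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, keep in zip(sequence, mask) if keep]
        dim = len(sequence[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept)
                       for d in range(dim)])
    return pooled


# Toy batch: 2 sequences x 3 tokens x 2 dimensions (the real model uses 768).
toy_tokens = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [1.5, 2.5], [9.0, 9.0]],  # last token is padding
]
toy_mask = [[1, 1, 1], [1, 1, 0]]

cls_vectors = cls_pool(toy_tokens)
mean_vectors = mean_pool(toy_tokens, toy_mask)
print(cls_vectors)   # [[1.0, 2.0], [0.5, 0.5]]
print(mean_vectors)  # [[3.0, 4.0], [1.0, 1.5]]
```

CLS pooling ignores the attention mask entirely because the first position is never padding, which is why the mask argument goes unused in a CLS-pooling function.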

Model Summary

Property	Details
Fine-tuning method	Unsupervised SimCSE
Base model	cl-tohoku/bert-base-japanese-v3
Training dataset	Wiki40B
Pooling strategy	cls (with an extra MLP layer only during training)
Hidden size	768
Learning rate	5e-5
Batch size	64
Temperature	0.05
Max sequence length	64
Number of training examples	2^20
Validation interval (steps)	2^6
Warmup ratio	0.1
Dtype	BFloat16
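
As a rough sanity check, the hyperparameters in the table imply the step counts below. This assumes a single pass over the training examples with no gradient accumulation, and that warmup steps are rounded down; neither assumption is stated in the table, so treat the numbers as an estimate.

```python
# Values taken from the Model Summary table.
num_examples = 2 ** 20        # 1,048,576 training examples
batch_size = 64
warmup_ratio = 0.1
validation_interval = 2 ** 6  # validate every 64 steps

# Derived quantities (assumption: one pass, no gradient accumulation).
total_steps = num_examples // batch_size
warmup_steps = int(total_steps * warmup_ratio)
num_validations = total_steps // validation_interval

print(total_steps)      # 16384
print(warmup_steps)     # 1638
print(num_validations)  # 256
```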

See the GitHub repository (https://github.com/hppRC/simple-simcse-ja) for the detailed experimental setup.
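
For orientation, the role of the temperature (0.05) in unsupervised SimCSE can be sketched as follows. In the original SimCSE formulation, each sentence is encoded twice with different dropout masks, and the model is trained to pick out its own second encoding among all others in the batch. The plain-Python code below is an assumption-laden illustration of that in-batch contrastive loss on toy vectors, not the training code from the repository; `simcse_loss` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive loss.

    view_a[i] and view_b[i] are two encodings of the same sentence;
    every other view_b[j] serves as a negative for view_a[i].
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine_similarity(anchor, other) / temperature
                  for other in view_b]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy encodings (assumption): two sentences, two slightly different views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = simcse_loss(view_a, view_b)           # pairs match: low loss
mismatched = simcse_loss(view_a, view_b[::-1])  # pairs swapped: high loss
print(aligned < mismatched)  # True
```

Dividing by a small temperature sharpens the softmax, so even modest differences in cosine similarity translate into large differences in loss.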

📄 License

This model is released under the cc-by-sa-4.0 license.

📚 Citing & Authors

@misc{
  hayato-tsukagoshi-2023-simple-simcse-ja,
  author = {Hayato Tsukagoshi},
  title = {Japanese Simple-SimCSE},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hppRC/simple-simcse-ja}}
}
