unsup-simcse-ja-large: Open-Source Japanese Model for High-Quality Sentence Embeddings

Unsup Simcse Ja Large

Developed by cl-nagoya

This is an unsupervised learning-based Japanese sentence embedding model, specifically designed to generate high-quality Japanese sentence embeddings.

Text Embedding

Transformers

Japanese #Japanese sentence embedding #Unsupervised similarity calculation #Large-scale pre-training

Downloads 59

Release Time: 10/2/2023

Model Overview

This model is trained with the unsupervised SimCSE method and converts Japanese sentences into fixed-size dense vectors (1024 dimensions, matching the hidden size of the base model), suitable for tasks such as sentence similarity calculation.

Model Features

Unsupervised Learning

Trained using the unsupervised SimCSE method, capable of learning effective sentence representations without labeled data.

Japanese Optimization

Built on the Japanese BERT model cl-tohoku/bert-large-japanese-v2 and fine-tuned on Wiki40B, so its tokenization and representations are tailored to Japanese text.

High-Quality Embeddings

Generated sentence embeddings can be used for various downstream tasks, such as similarity calculation, clustering, etc.

Model Capabilities

Sentence embedding generation

Sentence similarity calculation

Text feature extraction

Use Cases

Information Retrieval

🚀 unsup-simcse-ja-large

This is a model for feature extraction and sentence similarity tasks, leveraging sentence-transformers and based on the Transformer architecture.

🚀 Quick Start

Prerequisites

Before using the model, you need to install the necessary libraries.

📦 Installation

Install sentence-transformers together with fugashi and unidic-lite, which the Japanese tokenizer depends on. The quotes keep shells such as zsh from interpreting the square brackets:

pip install -U "fugashi[unidic-lite]" sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

model = SentenceTransformer("cl-nagoya/unsup-simcse-ja-large")
embeddings = model.encode(sentences)
print(embeddings)
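Since the embeddings are meant for sentence similarity, a common next step is comparing them with cosine similarity. The block below is a minimal pure-Python sketch rather than part of the model's API: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the 3-dimensional vectors are made-up stand-ins for real embeddings so that the example runs without downloading the model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real sentence embeddings (values are invented).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

# Rank the second and third vectors against the first one.
ranking = rank_by_similarity(toy_embeddings[0], toy_embeddings[1:])
print(ranking)
```

With the real model, the array returned by `model.encode(sentences)` can be converted with `.tolist()` and passed to the same helpers. sentence-transformers also ships `util.cos_sim`, which is the more practical choice for large batches.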

Advanced Usage

Without sentence-transformers, you can use the model as follows: First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


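def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state, shaped (batch, seq_len, hidden).
    # CLS pooling keeps only the first token ([CLS]) of each sequence,
    # so attention_mask is not needed here.
    return model_output[0][:, 0]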


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/unsup-simcse-ja-large")
model = AutoModel.from_pretrained("cl-nagoya/unsup-simcse-ja-large")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
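To make the pooling step concrete, here is a small sketch on nested Python lists instead of tensors. It shows CLS pooling, which is what this model uses, next to mean pooling, which is included only for contrast and is not used by this model. The function names and toy values are illustrative assumptions, and the toy batch assumes padding is added on the right.

```python
from typing import List

Vector = List[float]


def cls_pooling(token_embeddings: List[List[Vector]]) -> List[Vector]:
    """Take the first token's vector ([CLS]) from each sequence."""
    return [sequence[0] for sequence in token_embeddings]


def mean_pooling(
    token_embeddings: List[List[Vector]], attention_mask: List[List[int]]
) -> List[Vector]:
    """Average token vectors per sequence, skipping padding (mask == 0)."""
    pooled = []
    for sequence, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(sequence, mask) if m == 1]
        if not kept:
            raise ValueError("sequence has no unmasked tokens")
        dim = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(dim)])
    return pooled


# Toy batch: 2 sequences, 3 token positions, hidden size 2 (invented values).
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [0.0, 0.0]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

print(cls_pooling(token_embeddings))                   # [[1.0, 2.0], [7.0, 8.0]]
print(mean_pooling(token_embeddings, attention_mask))  # [[3.0, 4.0], [8.0, 9.0]]
```

CLS pooling ignores the attention mask because the [CLS] token always sits at position 0, whereas mean pooling needs the mask to keep padding out of the average.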

🔧 Technical Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Model Summary

Property	Details
Fine-tuning Method	Unsupervised SimCSE
Base Model	cl-tohoku/bert-large-japanese-v2
Training Dataset	Wiki40B
Pooling Strategy	cls (with an extra MLP layer only during training)
Hidden Size	1024
Learning Rate	3e-5
Batch Size	64
Temperature	0.05
Max Sequence Length	64
Number of Training Examples	2^20
Validation Interval (steps)	2^6
Warmup Ratio	0.1
Dtype	BFloat16
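The hyperparameters in the table imply a rough training schedule. The arithmetic below is a back-of-the-envelope sketch, assuming each training example is seen exactly once, no gradient accumulation is used, and warmup steps are rounded down. None of these assumptions is confirmed by the table itself.

```python
# Values copied from the Model Summary table.
num_examples = 2 ** 20        # 1,048,576 training examples
batch_size = 64
warmup_ratio = 0.1
validation_interval = 2 ** 6  # validate every 64 steps

# Assumption: a single pass over the examples, one optimizer step per batch.
total_steps = num_examples // batch_size

# Assumption: warmup length is the ratio times total steps, rounded down.
warmup_steps = int(total_steps * warmup_ratio)

num_validations = total_steps // validation_interval

print(total_steps)      # 16384
print(warmup_steps)     # 1638
print(num_validations)  # 256
```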

See the GitHub repository (https://github.com/hppRC/simple-simcse-ja) for the detailed experimental setup.
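For intuition about how the temperature of 0.05 enters training, the general unsupervised SimCSE objective can be sketched as follows. Each sentence is encoded twice with different dropout masks, and the two resulting vectors are pulled together while the other sentences in the batch act as negatives. This is an illustrative pure-Python sketch of the published objective, not the authors' training code: `simcse_loss` is a hypothetical name, the toy vectors are invented, and real training would use batched tensor operations on encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def simcse_loss(
    view_a: Sequence[Sequence[float]],
    view_b: Sequence[Sequence[float]],
    temperature: float = 0.05,
) -> float:
    """Mean in-batch contrastive loss: row i of view_a should match row i of view_b."""
    if not view_a or len(view_a) != len(view_b):
        raise ValueError("both views must be non-empty and the same size")
    losses = []
    for i, anchor in enumerate(view_a):
        # Temperature-scaled similarities to every sentence in the other view.
        logits = [cosine_similarity(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(v - top) for v in logits))
        # Cross-entropy with the matching index i as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two invented "dropout views" of the same two sentences.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = simcse_loss(view_a, view_b)         # matching pairs: small loss
shuffled_loss = simcse_loss(view_a, view_b[::-1])  # mismatched pairs: large loss
print(aligned_loss, shuffled_loss)
```

A small temperature such as 0.05 sharpens the softmax, so even modest differences in cosine similarity produce large differences in the loss.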

📄 License

This model is released under the cc-by-sa-4.0 license.

Citing & Authors

@misc{
  hayato-tsukagoshi-2023-simple-simcse-ja,
  author = {Hayato Tsukagoshi},
  title = {Japanese Simple-SimCSE},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hppRC/simple-simcse-ja}}
}
