Erlangshen-SimCSE-110M-Chinese Open Source Model - Achieving Precise Chinese Sentence Vector Representations

Erlangshen SimCSE 110M Chinese

Developed by IDEA-CCNL

A Chinese sentence vector representation model based on the unsupervised version of SimCSE, trained with supervised contrastive learning using Chinese NLI data

Text Embedding

Transformers

Chinese

Open Source License: Apache-2.0

#Chinese Sentence Vector #Unsupervised Contrastive Learning #NLI Optimization

Release Time : 11/7/2022

Model Overview

This model is trained with contrastive learning and produces sentence vectors that can be used directly for similarity calculation, which makes it suitable for Chinese sentence-pair matching tasks without fine-tuning.

Model Features

Chinese Optimization

Specially optimized for Chinese language characteristics

Direct Sentence Vector Extraction

No fine-tuning required; similarity can be judged directly from the [CLS] token output

Contrastive Learning Training

Combines unsupervised and supervised contrastive learning methods
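
This page does not spell out the training objective, so the following is only a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind used by SimCSE-like methods. The function name, the temperature value of 0.05, and the use of in-batch negatives are assumptions made for illustration, not the authors' confirmed training code.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss over a batch of sentence embeddings.

    Row i of `positives` is treated as the positive for row i of `anchors`;
    every other row in the batch acts as a negative.
    The temperature default is an assumed value for illustration.
    """
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    # (batch, batch) matrix of cosine similarities, scaled by the temperature.
    logits = anchors @ positives.T / temperature
    # The correct "class" for row i is column i (its own positive).
    targets = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, targets)


# Random vectors stand in for real sentence embeddings.
torch.manual_seed(0)
batch = torch.randn(4, 8)
aligned_loss = in_batch_contrastive_loss(batch, batch.clone())
shuffled_loss = in_batch_contrastive_loss(batch, batch.roll(1, dims=0))
print(aligned_loss.item(), shuffled_loss.item())
```

The loss is low when each anchor is closest to its own positive and high when the pairs are misaligned, which is the signal that pulls matching sentences together in the embedding space.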

Model Capabilities

Chinese sentence vector representation

Sentence similarity calculation

Text matching

Use Cases

Text Matching

Q&A System

Used to match user questions with candidate answers in the knowledge base

Improves Q&A accuracy

Semantic Search

Enhances search engine's understanding of query statements

Improves search result relevance
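
Both text-matching use cases above come down to ranking candidates by their similarity to a query vector. The sketch below shows one way to do that with cosine similarity, assuming the sentence vectors have already been extracted; `top_k_matches` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real sentence embeddings.

```python
import numpy as np


def top_k_matches(query_vec, candidate_vecs, k=3):
    """Return (index, cosine similarity) pairs for the k closest candidates."""
    query = np.asarray(query_vec, dtype=np.float64)
    candidates = np.asarray(candidate_vecs, dtype=np.float64)
    # L2-normalise so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy vectors: in practice each row would be the embedding of a knowledge-base entry.
knowledge_base = np.array([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])
matches = top_k_matches(query, knowledge_base, k=2)
print(matches)
```

For a large knowledge base, the candidate vectors would normally be normalised once and cached, so that each query costs a single matrix-vector product.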

Natural Language Understanding

Text Classification

Used as a feature extractor for text classification tasks
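
As a rough illustration of the feature-extractor use case, the sketch below classifies a sentence vector by comparing it with per-class centroids. The nearest-centroid rule, the helper names, and the toy 2-dimensional vectors are all assumptions chosen for brevity; any standard classifier could be trained on the extracted vectors instead.

```python
import numpy as np


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    return {
        label: embeddings[labels == label].mean(axis=0)
        for label in np.unique(labels).tolist()
    }


def predict(centroids, embedding):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = np.asarray(embedding, dtype=np.float64)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy vectors stand in for sentence embeddings of labelled training texts.
train_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["weather", "weather", "coding", "coding"]
centroids = fit_centroids(train_vecs, train_labels)
print(predict(centroids, [0.8, 0.3]))
```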

🚀 Erlangshen-SimCSE-110M-Chinese

Based on the unsupervised version of SimCSE, this model is further trained on supervised tasks using collected and organized Chinese NLI data, and achieves good results on Chinese sentence-pair tasks.

Main Page: Fengshenbang
GitHub: Fengshenbang-LM

🚀 Quick Start

This section provides a quick overview of the Erlangshen-SimCSE-110M-Chinese model, including its features, installation, and basic usage.

✨ Features

Supervised Training: Based on the unsupervised version of SimCSE, it is trained on supervised tasks using Chinese NLI data.
Good Performance: Achieves good results on Chinese sentence pair tasks.
General Sentence Embedding: Capable of extracting general sentence vectors without fine-tuning.

📦 Installation

To use the Erlangshen-SimCSE-110M-Chinese model, you need to install the transformers library. The usage example below also imports torch and scikit-learn. You can install all three using pip:

pip install transformers torch scikit-learn

💻 Usage Examples

Basic Usage

import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained('IDEA-CCNL/Erlangshen-SimCSE-110M-Chinese')
tokenizer = AutoTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-SimCSE-110M-Chinese')

# Define two sentences
# "The weather is really nice today, let's go for a walk!"
texta = '今天天气真不错，我们去散步吧！'
# "The weather is really awful today, let's just stay home and write bugs!"
textb = '今天天气真糟糕，还是在宅家里写bug吧！'

# Tokenize the sentences
inputs_a = tokenizer(texta, return_tensors="pt")
inputs_b = tokenizer(textb, return_tensors="pt")

# Run the model without tracking gradients
with torch.no_grad():
    outputs_a = model(**inputs_a, output_hidden_states=True)
    outputs_b = model(**inputs_b, output_hidden_states=True)

# Take the last-layer hidden state of the [CLS] token (position 0) as the sentence vector
texta_embedding = outputs_a.hidden_states[-1][:, 0, :].squeeze()
textb_embedding = outputs_b.hidden_states[-1][:, 0, :].squeeze()

# Calculate the cosine similarity
# (if the tensors are on a GPU, call .cpu() before .numpy())
similarity_score = cosine_similarity(
    texta_embedding.reshape(1, -1).numpy(),
    textb_embedding.reshape(1, -1).numpy(),
)[0][0]

print(similarity_score)
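
When more than two sentences are involved, it is usually more convenient to encode them in one batch. The helpers below are a sketch of that pattern, not part of the model's official API: `cls_embeddings` and `similarity_matrix` are hypothetical names, and the code assumes the tokenizer pads on the right (the default for BERT tokenizers), so that [CLS] stays at position 0.

```python
import torch
import torch.nn.functional as F


def cls_embeddings(model, tokenizer, sentences):
    """Encode a list of sentences and return one [CLS] vector per sentence."""
    # padding=True pads every sentence to the longest one in the batch.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Last hidden layer, token position 0 ([CLS]) for every sentence.
    return outputs.hidden_states[-1][:, 0, :]


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between the rows of `embeddings`."""
    normed = F.normalize(embeddings, p=2, dim=1)
    return normed @ normed.T


# Toy vectors stand in for real sentence embeddings.
demo = torch.tensor([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
print(similarity_matrix(demo))
```

With the model and tokenizer loaded as in the basic example, `similarity_matrix(cls_embeddings(model, tokenizer, [texta, textb]))` would return a 2x2 matrix whose off-diagonal entries are the similarity between the two sentences.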

📚 Documentation

Model Taxonomy

Property	Details
Demand	General
Task	Natural Language Understanding (NLU)
Series	Erlangshen
Model	BERT
Parameters	110M
Extra	Chinese

Model Information

To obtain a general-purpose sentence-embedding model, we ran contrastive learning on a large amount of unsupervised and supervised data, starting from the BERT-base model. The resulting model can judge similarity directly from its [CLS] output, without fine-tuning. Unlike approaches that fine-tune BERT for a sentence-similarity task, our model can extract sentence vectors directly after pre-training. Evaluation results on several tasks are as follows:

Model	LCQMC	BQ	PAWSX	ATEC	STS-B
BERT	62	38.62	17.38	28.98	68.27
BERT-large	63.78	37.51	18.63	30.24	68.87
RoBERTa	67.3	39.89	16.79	30.57	69.36
RoBERTa-large	67.25	38.39	19.09	30.85	69.36
RoFormer	63.58	39.9	17.52	29.37	67.32
SimBERT	73.43	40.98	15.87	31.24	72
Erlangshen-SimCSE-110M-Chinese	74.94	56.97	21.84	34.12	70.5

Note: Our model uses the [CLS] output directly, without whitening; the other models use last-layer average pooling followed by whitening.
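
The whitening step mentioned in the note is not described on this page. The sketch below shows one common formulation, given here as an assumption about what the baselines do: centre the sentence vectors and apply a linear map that makes their covariance the identity matrix. `fit_whitening` and `whiten` are hypothetical helper names, and random data stands in for real embeddings.

```python
import numpy as np


def fit_whitening(embeddings):
    """Return (mean, kernel) so that (x - mean) @ kernel has identity covariance."""
    x = np.asarray(embeddings, dtype=np.float64)
    mean = x.mean(axis=0, keepdims=True)
    cov = np.cov(x - mean, rowvar=False)
    # For a symmetric positive-definite covariance, SVD gives cov = u @ diag(s) @ u.T.
    u, s, _ = np.linalg.svd(cov)
    kernel = u @ np.diag(1.0 / np.sqrt(s))
    return mean, kernel


def whiten(embeddings, mean, kernel):
    """Apply a previously fitted whitening transform."""
    return (np.asarray(embeddings, dtype=np.float64) - mean) @ kernel


# Correlated random vectors stand in for a corpus of sentence embeddings.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
mean, kernel = fit_whitening(data)
white = whiten(data, mean, kernel)
print(np.round(np.cov(white, rowvar=False), 3))
```

Cosine similarity would then be computed on the whitened vectors rather than on the raw ones.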

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you use our model in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
