sbert-base-chinese-nli Open Source Model - Free Deployment, Accurately Calculate Chinese Sentence Similarity

Sbert Base Chinese Nli

Developed by uer

Chinese sentence embedding model based on UER-py pretraining, used for calculating sentence similarity

Chinese

Open Source License: Apache-2.0

#Sentence similarity calculation #Chinese semantic understanding #Siamese network architecture

Downloads 8,054

Release Time: 3/2/2022

Model Overview

This model generates sentence embeddings through a Siamese BERT network, primarily used for Chinese text similarity calculation and natural language inference tasks.
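As a rough illustration of how a Siamese BERT setup can turn per-token vectors into one fixed-size sentence embedding, the sketch below averages token vectors while skipping padding. Mean pooling is the default strategy in the Sentence-BERT paper cited below; whether this particular checkpoint uses it is an assumption here, and the function name mean_pool and the toy numbers are purely illustrative.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding.

    Hypothetical helper: token_vectors is a list of equal-length lists,
    attention_mask is a list of 0/1 flags of the same length.
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("need exactly one mask value per token vector")
    kept = [vec for vec, flag in zip(token_vectors, attention_mask) if flag == 1]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy token vectors (not real model output); the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

Both sentences of a pair go through the same encoder and the same pooling step, which is what makes the resulting vectors directly comparable.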

Model Features

Chinese Optimization

Specially optimized and trained for Chinese text

Efficient Similarity Calculation

Compares sentence embeddings by cosine similarity (computed as 1 minus the cosine distance)

Pretrained Model Fine-tuning

Fine-tuned based on the chinese_roberta_L-12_H-768 pretrained model

Model Capabilities

Chinese sentence embedding extraction

Sentence similarity calculation

Natural language inference

Use Cases

Text Matching

Semantic Similarity Judgment

Determines whether two Chinese sentences express the same meaning

Can accurately identify semantically similar but differently phrased sentences

Information Retrieval

Query-Document Matching

Calculates the semantic relevance between a query and a document
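Query-document matching of this kind can be sketched as a ranking by cosine similarity. The example below is a minimal sketch in plain Python: the helper names normalize and rank_documents are hypothetical, and the short vectors are toy stand-ins for the embeddings the model would produce for a query and a set of documents.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def rank_documents(query_embedding, doc_embeddings):
    """Return (doc_index, cosine_similarity) pairs, most similar first."""
    q = normalize(query_embedding)
    scored = []
    for index, doc in enumerate(doc_embeddings):
        d = normalize(doc)
        # For unit vectors the dot product equals the cosine similarity.
        scored.append((index, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for encoded text (not real embeddings).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.1, 0.9],
    [0.5, 0.5, 0.5],
]
ranking = rank_documents(query_vec, doc_vecs)
best_index = ranking[0][0]  # index of the most relevant document
```

Normalizing once and then taking dot products is a common choice when many documents are compared against the same query, since the document vectors can be normalized ahead of time.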

🚀 Chinese Sentence BERT

A sentence embedding model pre-trained for sentence similarity tasks, leveraging pre-training frameworks like UER-py and TencentPretrain.

🚀 Quick Start

You can use this model to extract sentence embeddings for sentence similarity tasks. The following example computes the similarity of two sentence embeddings as 1 minus their cosine distance:

>>> from sentence_transformers import SentenceTransformer
>>> from sklearn.metrics.pairwise import paired_cosine_distances
>>> model = SentenceTransformer('uer/sbert-base-chinese-nli')
>>> # "That person is happy" / "That person is very happy"
>>> sentences = ['那个人很开心', '那个人非常开心']
>>> sentence_embeddings = model.encode(sentences)
>>> # Cosine similarity = 1 - cosine distance
>>> cosine_score = 1 - paired_cosine_distances([sentence_embeddings[0]], [sentence_embeddings[1]])
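To make the final step of the example concrete, here is a minimal sketch of the same quantity in plain Python. The cosine distance of a pair is one minus its cosine similarity, so subtracting the distance from 1 recovers the similarity. The helper name cosine_similarity and the three-dimensional toy vectors are illustrative assumptions and not part of the model's API; real embeddings from this model are much longer.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy stand-ins for two sentence embeddings (not real model output).
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]

similarity = cosine_similarity(emb_a, emb_b)
distance = 1.0 - similarity  # the cosine distance of the same pair
```

Vectors that point in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.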

✨ Features

This is a sentence embedding model pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in Zhao et al. (2019), cited below.
The model can also be pre-trained by TencentPretrain, introduced in Zhao et al. (2023), which inherits UER-py to support models with more than one billion parameters and extends it to a multimodal pre-training framework.

📦 Installation

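The original README does not list installation steps. The Quick Start example imports sentence_transformers and sklearn, so both packages need to be installed first, for example with pip install sentence-transformers scikit-learn.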

📚 Documentation

Training data

ChineseTextualInference is used as training data.

Training procedure

The model is fine-tuned by [UER-py](https://github.com/dbiir/UER-py/) on Tencent Cloud. We fine-tune for five epochs with a sequence length of 128, starting from the pre-trained model [chinese_roberta_L-12_H-768](https://huggingface.co/uer/chinese_roberta_L-12_H-768). At the end of each epoch, the model is saved if it achieves the best performance so far on the development set.

python3 finetune/run_classifier_siamese.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
                                           --vocab_path models/google_zh_vocab.txt \
                                           --config_path models/sbert/base_config.json \
                                           --train_path datasets/ChineseTextualInference/train.tsv \
                                           --dev_path datasets/ChineseTextualInference/dev.tsv \
                                           --learning_rate 5e-5 --epochs_num 5 --batch_size 64
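The checkpointing rule described above (save at the end of an epoch only when the development-set score is the best so far) can be sketched in a few lines. This is not UER-py's actual code: the function name select_best_epoch, the strict-improvement comparison, and the development scores are all assumptions made for illustration.

```python
def select_best_epoch(dev_scores):
    """Return (epoch, score) for the best development score.

    Epochs are 1-indexed. Only a strict improvement replaces the saved
    model, so ties keep the earlier epoch (an assumption of this sketch).
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best_score:
            # In a real training loop, the model would be saved here.
            best_epoch, best_score = epoch, score
    return best_epoch, best_score

# Made-up development-set accuracies for five epochs.
dev_scores = [0.81, 0.84, 0.83, 0.86, 0.85]
best_epoch, best_score = select_best_epoch(dev_scores)
```

With this rule the file on disk always holds the best epoch seen so far, so a later epoch that scores worse on the development set does not overwrite it.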

Finally, we convert the fine-tuned model into Hugging Face's format:

python3 scripts/convert_sbert_from_uer_to_huggingface.py --input_model_path models/finetuned_model.bin \
                                                         --output_model_path pytorch_model.bin \
                                                         --layers_num 12

BibTeX entry and citation info

@article{reimers2019sentence,
  title={Sentence-bert: Sentence embeddings using siamese bert-networks},
  author={Reimers, Nils and Gurevych, Iryna},
  journal={arXiv preprint arXiv:1908.10084},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}

@article{zhao2023tencentpretrain,
  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
  journal={ACL 2023},
  pages={217},
  year={2023}
}

📄 License

This project is licensed under the Apache-2.0 license.
