🚀 shibing624/text2vec-base-chinese-paraphrase
This is a CoSENT (Cosine Sentence) model that maps sentences to a 768-dimensional dense vector space, useful for sentence embeddings, text matching, and semantic search.
🚀 Quick Start
This CoSENT model, shibing624/text2vec-base-chinese-paraphrase, maps sentences to a 768-dimensional dense vector space. It can be applied to tasks such as sentence embeddings, text matching, or semantic search.
✨ Features
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Release Models
- Chinese matching evaluation results of the released models in this project:
Notes:
- Evaluation metric: Spearman correlation coefficient (a minimal scoring sketch follows these notes)
- The shibing624/text2vec-base-chinese model is trained with the CoSENT method on hfl/chinese-macbert-base using the Chinese STS-B data. It achieves good results on the Chinese STS-B test set. You can reproduce the training by running examples/training_sup_text_matching_model.py. The model files have been uploaded to the HF model hub. It is recommended for general Chinese semantic matching tasks.
- The shibing624/text2vec-base-chinese-sentence model is trained with the CoSENT method on nghuyong/ernie-3.0-base-zh using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset. It achieves good results on various Chinese NLI test sets. You can reproduce the training by running examples/training_sup_text_matching_model_jsonl_data.py. The model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- The shibing624/text2vec-base-chinese-paraphrase model is trained with the CoSENT method on nghuyong/ernie-3.0-base-zh using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset. Compared with shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, this dataset adds s2p (sentence-to-paraphrase) data, which strengthens its long-text representation ability. It achieves SOTA results on various Chinese NLI test sets. You can reproduce the training by running examples/training_sup_text_matching_model_jsonl_data.py. The model files have been uploaded to the HF model hub. It is recommended for Chinese s2p (sentence vs. paragraph) semantic matching tasks.
- The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is trained with SBERT and is a multilingual version of the paraphrase-MiniLM-L12-v2 model, supporting Chinese, English, and other languages.
- The w2v-light-tencent-chinese model is a Word2Vec model built from Tencent word vectors, which can be loaded and used on CPU. It is suitable for Chinese literal matching tasks and cold-start situations with limited data.
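The Spearman metric used in these notes ranks predicted similarities against gold labels; below is a minimal scoring sketch with scipy. The numbers are made up for illustration and are not benchmark results.
from scipy.stats import spearmanr

# Hypothetical predicted cosine similarities and gold STS labels for five sentence pairs
predicted = [0.91, 0.35, 0.78, 0.12, 0.66]
gold = [4.8, 1.0, 3.9, 0.2, 3.1]

# Spearman correlation between the two rankings (1.0 here, since the orderings match)
print(spearmanr(predicted, gold).correlation)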
📦 Installation
Install text2vec
pip install -U text2vec
Install transformers
pip install transformers
Install sentence-transformers
pip install -U sentence-transformers
💻 Usage Examples
Usage (text2vec)
from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
model = SentenceModel('shibing624/text2vec-base-chinese-paraphrase')
embeddings = model.encode(sentences)
print(embeddings)
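The same embeddings also support a simple semantic search; the sketch below ranks a small corpus against a query with numpy-based cosine similarity. The corpus and query are made-up examples, not part of the original snippet.
import numpy as np
from text2vec import SentenceModel

model = SentenceModel('shibing624/text2vec-base-chinese-paraphrase')
corpus = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡', '账户余额怎么查询']
query = '怎么修改花呗的银行卡'
corpus_emb = model.encode(corpus)
query_emb = model.encode([query])[0]

# Cosine similarity = dot product of L2-normalized vectors
corpus_emb = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = query_emb / np.linalg.norm(query_emb)
scores = corpus_emb @ query_emb

# Rank corpus sentences by similarity to the query
for idx in np.argsort(-scores):
    print(corpus[idx], float(scores[idx]))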
Usage (HuggingFace Transformers)
from transformers import BertTokenizer, BertModel
import torch
# Mean pooling: average token embeddings, taking the attention mask into account
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese-paraphrase')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese-paraphrase')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
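Continuing from the snippet above, the mean-pooled vectors can be compared with a plain cosine similarity in PyTorch; this follow-up is a sketch and not part of the original example.
import torch.nn.functional as F

# L2-normalize, then the dot product equals cosine similarity
norm_emb = F.normalize(sentence_embeddings, p=2, dim=1)
print("cosine similarity:", float(norm_emb[0] @ norm_emb[1]))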
Usage (sentence-transformers)
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("shibing624/text2vec-base-chinese-paraphrase")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
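To score the pair directly, the library's util.cos_sim helper can be applied to the embeddings above; a short usage follow-up, not part of the original example.
from sentence_transformers import util

# Cosine similarity between the two sentences encoded above
print(util.cos_sim(sentence_embeddings[0], sentence_embeddings[1]))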
📚 Documentation
Full Model Architecture
CoSENT(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ErnieModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
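When the model is loaded through sentence-transformers, this limit is exposed as the max_seq_length attribute and can be inspected or lowered; a small sketch (the attribute comes from sentence-transformers, not from this model card):
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese-paraphrase")
print(m.max_seq_length)  # 256 for this model; longer inputs are truncated
m.max_seq_length = 128   # can be lowered to truncate earlier and speed up encoding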
Training procedure
Pre-training
We use the pretrained nghuyong/ernie-3.0-base-zh
model. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch, then apply a rank loss that compares true pairs against false pairs.
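The rank loss referred to above is the CoSENT objective; the sketch below is a PyTorch re-implementation of a common CoSENT-style loss, assuming the usual scale factor of 20 and pairwise masking scheme found in public CoSENT implementations, not a quote of the actual training script.
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Pairwise differences of predicted similarities: diff[i, j] = scale * (sim_i - sim_j)
    diff = (cos_sim[:, None] - cos_sim[None, :]) * scale
    # Keep only positions where pair i should rank below pair j according to the gold labels
    mask = labels[:, None] < labels[None, :]
    diff = diff[mask]
    # log(1 + sum(exp(...))): append a zero term and take logsumexp
    diff = torch.cat([torch.zeros(1, dtype=diff.dtype, device=diff.device), diff])
    return torch.logsumexp(diff, dim=0)

# Toy batch: three pairs with predicted cosine similarities and binary gold labels
sims = torch.tensor([0.9, 0.2, 0.7])
gold = torch.tensor([1.0, 0.0, 1.0])
print(cosent_loss(sims, gold))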
🔧 Technical Details
The model maps sentences to a 768-dimensional dense vector space. During training, it uses a contrastive objective and applies rank loss. The input text is truncated if it exceeds 256 word pieces.
📄 License
This model is under the Apache-2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
author = {Ming Xu},
title = {text2vec: A Tool for Text to Vector},
year = {2023},
url = {https://github.com/shibing624/text2vec},
}