🚀 shibing624/text2vec-base-chinese-paraphrase
This is a CoSENT (Cosine Sentence) model that maps sentences to a 768-dimensional dense vector space, useful for sentence embeddings, text matching, and semantic search.
🚀 Quick Start
This CoSENT model, shibing624/text2vec-base-chinese-paraphrase, maps sentences to a 768-dimensional dense vector space. It can be applied to tasks such as sentence embeddings, text matching, or semantic search.
✨ Features
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Release Models
- Chinese matching evaluation results of the released models in this project:
Notes:
- Evaluation metric: Spearman correlation coefficient (a minimal scoring sketch follows these notes)
- The shibing624/text2vec-base-chinese model is trained with the CoSENT method on hfl/chinese-macbert-base using the Chinese STS-B data. It achieves good results on the Chinese STS-B test set. You can reproduce the training by running examples/training_sup_text_matching_model.py. The model files have been uploaded to the HF model hub. It is recommended for general Chinese semantic matching tasks.
- The shibing624/text2vec-base-chinese-sentence model is trained with the CoSENT method on nghuyong/ernie-3.0-base-zh using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset. It achieves good results on various Chinese NLI test sets. You can reproduce the training by running examples/training_sup_text_matching_model_jsonl_data.py. The model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- The shibing624/text2vec-base-chinese-paraphrase model is trained with the CoSENT method on nghuyong/ernie-3.0-base-zh using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset. Compared with shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, this dataset adds s2p (sentence-to-paraphrase) data, which strengthens its long-text representation ability. It achieves SOTA results on various Chinese NLI test sets. You can reproduce the training by running examples/training_sup_text_matching_model_jsonl_data.py. The model files have been uploaded to the HF model hub. It is recommended for Chinese s2p (sentence vs. paragraph) semantic matching tasks.
- The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is trained with SBERT and is a multilingual version of the paraphrase-MiniLM-L12-v2 model, supporting Chinese, English, and other languages.
- The w2v-light-tencent-chinese model is a Word2Vec model built from Tencent word vectors, which can be loaded and used on CPU. It is suitable for Chinese literal matching tasks and cold-start situations with limited data.
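The Spearman metric used in these notes ranks predicted similarities against gold labels; below is a minimal scoring sketch with scipy. The numbers are made up for illustration and are not benchmark results.
from scipy.stats import spearmanr

# Hypothetical predicted cosine similarities and gold STS labels for five sentence pairs
predicted = [0.91, 0.35, 0.78, 0.12, 0.66]
gold = [4.8, 1.0, 3.9, 0.2, 3.1]

# Spearman correlation between the two rankings (1.0 here, since the orderings match)
print(spearmanr(predicted, gold).correlation)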
📦 Installation
Install text2vec
pip install -U text2vec
Install transformers
pip install transformers
Install sentence-transformers
pip install -U sentence-transformers
💻 Usage Examples
Usage (text2vec)
from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
model = SentenceModel('shibing624/text2vec-base-chinese-paraphrase')
embeddings = model.encode(sentences)
print(embeddings)
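The same embeddings also support a simple semantic search; the sketch below ranks a small corpus against a query with numpy-based cosine similarity. The corpus and query are made-up examples, not part of the original snippet.
import numpy as np
from text2vec import SentenceModel

model = SentenceModel('shibing624/text2vec-base-chinese-paraphrase')
corpus = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡', '账户余额怎么查询']
query = '怎么修改花呗的银行卡'
corpus_emb = model.encode(corpus)
query_emb = model.encode([query])[0]

# Cosine similarity = dot product of L2-normalized vectors
corpus_emb = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = query_emb / np.linalg.norm(query_emb)
scores = corpus_emb @ query_emb

# Rank corpus sentences by similarity to the query
for idx in np.argsort(-scores):
    print(corpus[idx], float(scores[idx]))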
Usage (HuggingFace Transformers)
from transformers import BertTokenizer, BertModel
import torch
# Mean pooling: average token embeddings, taking the attention mask into account
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese-paraphrase')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese-paraphrase')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
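Continuing from the snippet above, the mean-pooled vectors can be compared with a plain cosine similarity in PyTorch; this follow-up is a sketch and not part of the original example.
import torch.nn.functional as F

# L2-normalize, then the dot product equals cosine similarity
norm_emb = F.normalize(sentence_embeddings, p=2, dim=1)
print("cosine similarity:", float(norm_emb[0] @ norm_emb[1]))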
Usage (sentence-transformers)
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("shibing624/text2vec-base-chinese-paraphrase")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
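To score the pair directly, the library's util.cos_sim helper can be applied to the embeddings above; a short usage follow-up, not part of the original example.
from sentence_transformers import util

# Cosine similarity between the two sentences encoded above
print(util.cos_sim(sentence_embeddings[0], sentence_embeddings[1]))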
📚 Documentation
Full Model Architecture
CoSENT(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ErnieModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
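When the model is loaded through sentence-transformers, this limit is exposed as the max_seq_length attribute and can be inspected or lowered; a small sketch (the attribute comes from sentence-transformers, not from this model card):
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese-paraphrase")
print(m.max_seq_length)  # 256 for this model; longer inputs are truncated
m.max_seq_length = 128   # can be lowered to truncate earlier and speed up encoding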
Training procedure
Pre-training
We use the pretrained nghuyong/ernie-3.0-base-zh
model. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch, then apply a rank loss that compares true pairs against false pairs.
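The rank loss referred to above is the CoSENT objective; the sketch below is a PyTorch re-implementation of a common CoSENT-style loss, assuming the usual scale factor of 20 and pairwise masking scheme found in public CoSENT implementations, not a quote of the actual training script.
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Pairwise differences of predicted similarities: diff[i, j] = scale * (sim_i - sim_j)
    diff = (cos_sim[:, None] - cos_sim[None, :]) * scale
    # Keep only positions where pair i should rank below pair j according to the gold labels
    mask = labels[:, None] < labels[None, :]
    diff = diff[mask]
    # log(1 + sum(exp(...))): append a zero term and take logsumexp
    diff = torch.cat([torch.zeros(1, dtype=diff.dtype, device=diff.device), diff])
    return torch.logsumexp(diff, dim=0)

# Toy batch: three pairs with predicted cosine similarities and binary gold labels
sims = torch.tensor([0.9, 0.2, 0.7])
gold = torch.tensor([1.0, 0.0, 1.0])
print(cosent_loss(sims, gold))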
🔧 Technical Details
The model maps sentences to a 768-dimensional dense vector space. During training, it uses a contrastive objective and applies rank loss. The input text is truncated if it exceeds 256 word pieces.
📄 License
This model is under the Apache-2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
author = {Ming Xu},
title = {text2vec: A Tool for Text to Vector},
year = {2023},
url = {https://github.com/shibing624/text2vec},
}