🚀 shibing624/text2vec-bge-large-chinese
This is a CoSENT (Cosine Sentence) model that maps sentences to a 1024-dimensional dense vector space, suitable for tasks such as sentence embeddings, text matching, or semantic search.
🚀 Quick Start
Key model details:
- Training dataset: https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-paraphrase-dataset
- Base model: https://huggingface.co/BAAI/bge-large-zh-noinstruct
- Max sequence length: 256
- Best epoch: 4
- Sentence embedding dimension: 1024
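If you want to sanity-check these numbers, a minimal sketch with sentence-transformers (installation instructions below) could look like the following; the expected values are taken from the list above:

```python
from sentence_transformers import SentenceTransformer

# Quick check of the model specs listed above.
model = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
print(model.get_sentence_embedding_dimension())  # expected: 1024
print(model.max_seq_length)                      # expected: 256
```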
✨ Features
- Sentence Embedding: Maps sentences to a 1024-dimensional dense vector space.
- Multiple Application Scenarios: Suitable for tasks such as sentence embeddings, text matching, or semantic search.
📦 Installation
Install the required library for whichever usage path you prefer:

Using text2vec:

```bash
pip install -U text2vec
```

Using HuggingFace Transformers:

```bash
pip install transformers
```

Using sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage (text2vec)
```python
from text2vec import SentenceModel

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

model = SentenceModel('shibing624/text2vec-bge-large-chinese')
embeddings = model.encode(sentences)
print(embeddings)
```
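The returned embeddings can be compared directly for text matching. A small follow-up sketch (not part of the original example) that continues from the snippet above and scores the two sentences with NumPy cosine similarity:

```python
import numpy as np

# Cosine similarity between the two sentence embeddings computed above.
a, b = embeddings[0], embeddings[1]
score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {score:.4f}")  # the two 花呗 sentences should score high
```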
Basic Usage (HuggingFace Transformers)
```python
import torch
from transformers import BertTokenizer, BertModel


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-bge-large-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-bge-large-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
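For cosine-similarity-based matching it is common to L2-normalize the pooled embeddings first. A hedged sketch continuing from the snippet above (this step is not part of the original example):

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)  # similarity[0, 1] is the cosine similarity of the two sentences
```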
Basic Usage (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
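Because the model also targets semantic search, a toy retrieval sketch with sentence-transformers' `util.semantic_search` may be helpful; the corpus and query sentences here are made up for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
corpus = ['如何更换花呗绑定银行卡', '忘记密码怎么办', '今天天气真好']
query = '花呗更改绑定银行卡'

# Encode corpus and query, then retrieve the closest corpus sentences.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 4))
```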
📚 Documentation
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Release Models
| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
|:-----|:----------|:------|:----:|:--:|:-----:|:-----:|:-----:|:-------:|:-------:|:---:|:---:|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 3138 |
| CoSENT | BAAI/bge-large-zh-noinstruct | shibing624/text2vec-bge-large-chinese | 38.41 | 61.34 | 71.72 | 35.15 | 76.44 | 71.81 | 63.15 | 59.72 | 844 |
Explanation:
- Evaluation metric: Spearman coefficient.
- The `shibing624/text2vec-base-chinese` model is trained with the CoSENT method on `hfl/chinese-macbert-base` using Chinese STS-B data and achieves good results on the Chinese STS-B test set. You can train it by running `examples/training_sup_text_matching_model.py`; the model files have been uploaded to the HF model hub. It is recommended for general Chinese semantic matching tasks.
- The `shibing624/text2vec-base-chinese-sentence` model is trained with the CoSENT method on `nghuyong/ernie-3.0-base-zh` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset and achieves good results on various Chinese NLI test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- The `shibing624/text2vec-base-chinese-paraphrase` model is trained with the CoSENT method on `nghuyong/ernie-3.0-base-zh` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset, which adds s2p (sentence-to-paraphrase) data on top of shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset and thereby strengthens long-text representation. It achieves SOTA results on various Chinese NLI test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2p (sentence vs. paragraph) semantic matching tasks.
- The `shibing624/text2vec-base-multilingual` model is trained with the CoSENT method on `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` using the manually curated multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset and improves on the original model on Chinese and English test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for multilingual semantic matching tasks.
- The `shibing624/text2vec-bge-large-chinese` model is trained with the CoSENT method on `BAAI/bge-large-zh-noinstruct` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset and improves on the original model on Chinese test sets, especially in short-text discrimination. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- `w2v-light-tencent-chinese` is a Word2Vec model built from Tencent word vectors. It can be loaded and used on the CPU and suits literal Chinese matching tasks and cold-start scenarios with limited data.
- All pre-trained models can be loaded through transformers, e.g. the MacBERT model with `--model_name hfl/chinese-macbert-base` or the RoBERTa model with `--model_name uer/roberta-medium-wwm-chinese-cluecorpussmall`.
- To evaluate model robustness, an untrained SOHU test set is added to test generalization; to achieve out-of-the-box practical results, a variety of collected Chinese matching datasets are used, and these datasets have also been uploaded to HF datasets (see the link below).
- Experiments on Chinese matching tasks show that the optimal pooling methods are `EncoderType.FIRST_LAST_AVG` and `EncoderType.MEAN`, with only a very small difference in prediction performance between the two; a pooling sketch follows this list.
- To reproduce the Chinese matching evaluation results, download the Chinese matching datasets to `examples/data` and run `tests/model_spearman.py`.
- The GPU test environment for QPS is a Tesla V100 with 32 GB of memory.
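Mean pooling is shown in the transformers example above; for reference, the sketch below approximates first-last-average pooling with plain transformers. It averages the first and last hidden layers before mean-pooling over non-padding tokens; the exact implementation behind `EncoderType.FIRST_LAST_AVG` in text2vec may differ in details (e.g. which layer counts as "first"):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-bge-large-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-bge-large-chinese')

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded, output_hidden_states=True)

# hidden_states[0] is the embedding layer; take the first and last transformer layers.
first, last = output.hidden_states[1], output.hidden_states[-1]
token_embeddings = (first + last) / 2.0

# Mask out padding tokens, then average over the sequence dimension.
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # (2, 1024)
```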
Model training experiment report: Experiment Report
🔧 Technical Details
Full Model Architecture
```
CoSENT(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ErnieModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
)
```
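The same stack can be written out with sentence-transformers modules. This is only a rough equivalent of the printed architecture, not a required step, since `SentenceTransformer("shibing624/text2vec-bge-large-chinese")` already loads it:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer encoder (max 256 word pieces) followed by mean pooling, as in the architecture above.
word_embedding = models.Transformer("shibing624/text2vec-bge-large-chinese", max_seq_length=256)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 1024
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding, pooling])
print(model)
```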
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
Training procedure
Pre-training
We use the pre-trained model https://huggingface.co/BAAI/bge-large-zh-noinstruct. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model with a contrastive objective: we compute the cosine similarity for every possible sentence pair in the batch and then apply a ranking loss that compares the scores of true pairs against those of false pairs.
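To make the ranking objective concrete, here is a hedged PyTorch sketch of a CoSENT-style loss over a batch of scored sentence pairs; the exact formulation and hyperparameters used to train this model may differ:

```python
import torch

def cosent_rank_loss(scores: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Sketch of a CoSENT-style ranking loss.

    scores: cosine similarities of the sentence pairs in the batch, shape (n,)
    labels: pair labels, higher means more similar, shape (n,)
    Penalizes every case where a less-similar pair outscores a more-similar pair:
        loss = log(1 + sum_{label_i > label_j} exp(scale * (score_j - score_i)))
    """
    scores = scores * scale
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = score_j - score_i
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)   # keep pairs where i should outrank j
    diff = diff[mask]
    zero = torch.zeros(1, device=scores.device)        # the constant "1 +" term
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)
```

For example, `cosent_rank_loss(torch.tensor([0.9, 0.3]), torch.tensor([1.0, 0.0]))` is close to zero, because the true pair already outranks the false pair.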
📄 License
This model is licensed under the Apache 2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
```bibtex
@software{text2vec,
  author = {Ming Xu},
  title = {text2vec: A Tool for Text to Vector},
  year = {2023},
  url = {https://github.com/shibing624/text2vec},
}
```





