🚀 shibing624/text2vec-bge-large-chinese
This is a CoSENT (Cosine Sentence) model that maps sentences to a 1024-dimensional dense vector space, suitable for tasks such as sentence embeddings, text matching, or semantic search.
🚀 Quick Start
Key model details:
- Training dataset: https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-chinese-paraphrase-dataset
- Base model: https://huggingface.co/BAAI/bge-large-zh-noinstruct
- Max sequence length: 256
- Best epoch: 4
- Sentence embedding dimension: 1024
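If you want to sanity-check these numbers, a minimal sketch with sentence-transformers (installation instructions below) could look like the following; the expected values are taken from the list above:

```python
from sentence_transformers import SentenceTransformer

# Quick check of the model specs listed above.
model = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
print(model.get_sentence_embedding_dimension())  # expected: 1024
print(model.max_seq_length)                      # expected: 256
```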
✨ Features
- Sentence Embedding: Maps sentences to a 1024-dimensional dense vector space.
- Multiple Application Scenarios: Suitable for tasks such as sentence embeddings, text matching, or semantic search.
📦 Installation
Install the required library for whichever usage path you prefer:

Using text2vec:

```bash
pip install -U text2vec
```

Using HuggingFace Transformers:

```bash
pip install transformers
```

Using sentence-transformers:

```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage (text2vec)
```python
from text2vec import SentenceModel

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

model = SentenceModel('shibing624/text2vec-bge-large-chinese')
embeddings = model.encode(sentences)
print(embeddings)
```
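The returned embeddings can be compared directly for text matching. A small follow-up sketch (not part of the original example) that continues from the snippet above and scores the two sentences with NumPy cosine similarity:

```python
import numpy as np

# Cosine similarity between the two sentence embeddings computed above.
a, b = embeddings[0], embeddings[1]
score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {score:.4f}")  # the two 花呗 sentences should score high
```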
Basic Usage (HuggingFace Transformers)
```python
import torch
from transformers import BertTokenizer, BertModel


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-bge-large-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-bge-large-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
```
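For cosine-similarity-based matching it is common to L2-normalize the pooled embeddings first. A hedged sketch continuing from the snippet above (this step is not part of the original example):

```python
import torch.nn.functional as F

# L2-normalize so that dot products equal cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)  # similarity[0, 1] is the cosine similarity of the two sentences
```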
Basic Usage (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
```
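Because the model also targets semantic search, a toy retrieval sketch with sentence-transformers' `util.semantic_search` may be helpful; the corpus and query sentences here are made up for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("shibing624/text2vec-bge-large-chinese")
corpus = ['如何更换花呗绑定银行卡', '忘记密码怎么办', '今天天气真好']
query = '花呗更改绑定银行卡'

# Encode corpus and query, then retrieve the closest corpus sentences.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], round(hit['score'], 4))
```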
📚 Documentation
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Release Models
| Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
|:-----|:----------|:------|:----:|:--:|:-----:|:-----:|:-----:|:-------:|:-------:|:---:|:---:|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
| SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
| CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
| CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 3138 |
| CoSENT | BAAI/bge-large-zh-noinstruct | shibing624/text2vec-bge-large-chinese | 38.41 | 61.34 | 71.72 | 35.15 | 76.44 | 71.81 | 63.15 | 59.72 | 844 |
Explanation:
- Evaluation metric: Spearman coefficient.
- The `shibing624/text2vec-base-chinese` model is trained with the CoSENT method on `hfl/chinese-macbert-base` using Chinese STS-B data and achieves good results on the Chinese STS-B test set. You can train it by running `examples/training_sup_text_matching_model.py`; the model files have been uploaded to the HF model hub. It is recommended for general Chinese semantic matching tasks.
- The `shibing624/text2vec-base-chinese-sentence` model is trained with the CoSENT method on `nghuyong/ernie-3.0-base-zh` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset and achieves good results on various Chinese NLI test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- The `shibing624/text2vec-base-chinese-paraphrase` model is trained with the CoSENT method on `nghuyong/ernie-3.0-base-zh` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset, which adds s2p (sentence-to-paraphrase) data on top of shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset and thereby strengthens long-text representation. It achieves SOTA results on various Chinese NLI test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2p (sentence vs. paragraph) semantic matching tasks.
- The `shibing624/text2vec-base-multilingual` model is trained with the CoSENT method on `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` using the manually curated multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset and improves on the original model on Chinese and English test sets. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for multilingual semantic matching tasks.
- The `shibing624/text2vec-bge-large-chinese` model is trained with the CoSENT method on `BAAI/bge-large-zh-noinstruct` using the manually curated Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset and improves on the original model on Chinese test sets, especially in short-text discrimination. You can train it by running `examples/training_sup_text_matching_model_jsonl_data.py`; the model files have been uploaded to the HF model hub. It is recommended for Chinese s2s (sentence vs. sentence) semantic matching tasks.
- `w2v-light-tencent-chinese` is a Word2Vec model built from Tencent word vectors. It can be loaded and used on the CPU and suits literal Chinese matching tasks and cold-start scenarios with limited data.
- All pre-trained models can be loaded through transformers, e.g. the MacBERT model with `--model_name hfl/chinese-macbert-base` or the RoBERTa model with `--model_name uer/roberta-medium-wwm-chinese-cluecorpussmall`.
- To evaluate model robustness, an untrained SOHU test set is added to test generalization; to achieve out-of-the-box practical results, a variety of collected Chinese matching datasets are used, and these datasets have also been uploaded to HF datasets (see the link below).
- Experiments on Chinese matching tasks show that the optimal pooling methods are `EncoderType.FIRST_LAST_AVG` and `EncoderType.MEAN`, with only a very small difference in prediction performance between the two; a pooling sketch follows this list.
- To reproduce the Chinese matching evaluation results, download the Chinese matching datasets to `examples/data` and run `tests/model_spearman.py`.
- The GPU test environment for QPS is a Tesla V100 with 32 GB of memory.
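Mean pooling is shown in the transformers example above; for reference, the sketch below approximates first-last-average pooling with plain transformers. It averages the first and last hidden layers before mean-pooling over non-padding tokens; the exact implementation behind `EncoderType.FIRST_LAST_AVG` in text2vec may differ in details (e.g. which layer counts as "first"):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-bge-large-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-bge-large-chinese')

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded, output_hidden_states=True)

# hidden_states[0] is the embedding layer; take the first and last transformer layers.
first, last = output.hidden_states[1], output.hidden_states[-1]
token_embeddings = (first + last) / 2.0

# Mask out padding tokens, then average over the sequence dimension.
mask = encoded['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # (2, 1024)
```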
Model training experiment report: Experiment Report
🔧 Technical Details
Full Model Architecture
```
CoSENT(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: ErnieModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_mean_tokens': True})
)
```
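The same stack can be written out with sentence-transformers modules. This is only a rough equivalent of the printed architecture, not a required step, since `SentenceTransformer("shibing624/text2vec-bge-large-chinese")` already loads it:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer encoder (max 256 word pieces) followed by mean pooling, as in the architecture above.
word_embedding = models.Transformer("shibing624/text2vec-bge-large-chinese", max_seq_length=256)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 1024
    pooling_mode_mean_tokens=True,
)
model = SentenceTransformer(modules=[word_embedding, pooling])
print(model)
```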
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
Training procedure
Pre-training
We use the pre-trained model https://huggingface.co/BAAI/bge-large-zh-noinstruct. Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model with a contrastive objective: we compute the cosine similarity for every possible sentence pair in the batch and then apply a ranking loss that compares the scores of true pairs against those of false pairs.
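To make the ranking objective concrete, here is a hedged PyTorch sketch of a CoSENT-style loss over a batch of scored sentence pairs; the exact formulation and hyperparameters used to train this model may differ:

```python
import torch

def cosent_rank_loss(scores: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Sketch of a CoSENT-style ranking loss.

    scores: cosine similarities of the sentence pairs in the batch, shape (n,)
    labels: pair labels, higher means more similar, shape (n,)
    Penalizes every case where a less-similar pair outscores a more-similar pair:
        loss = log(1 + sum_{label_i > label_j} exp(scale * (score_j - score_i)))
    """
    scores = scores * scale
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = score_j - score_i
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)   # keep pairs where i should outrank j
    diff = diff[mask]
    zero = torch.zeros(1, device=scores.device)        # the constant "1 +" term
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)
```

For example, `cosent_rank_loss(torch.tensor([0.9, 0.3]), torch.tensor([1.0, 0.0]))` is close to zero, because the true pair already outranks the false pair.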
📄 License
This model is licensed under the Apache 2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
```bibtex
@software{text2vec,
  author = {Ming Xu},
  title = {text2vec: A Tool for Text to Vector},
  year = {2023},
  url = {https://github.com/shibing624/text2vec},
}
```





