🚀 shibing624/text2vec-base-chinese
This is a CoSENT (Cosine Sentence) model that maps sentences to a 768-dimensional dense vector space. It can be used for tasks such as sentence embeddings, text matching, or semantic search.
✨ Features
- Sentence Embeddings: Maps sentences to a 768-dimensional dense vector space.
- Text Matching: Suitable for tasks such as text matching or semantic search.
- Multilingual Support: Supports multiple languages, including Chinese and English.
📦 Installation
Install text2vec
pip install -U text2vec
Install transformers
pip install transformers
Install sentence-transformers
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage with text2vec
from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
model = SentenceModel('shibing624/text2vec-base-chinese')
embeddings = model.encode(sentences)
print(embeddings)
Usage with HuggingFace Transformers
from transformers import BertTokenizer, BertModel
import torch
def mean_pooling(model_output, attention_mask):
    # Average token embeddings, using the attention mask to ignore padding tokens
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Usage with sentence-transformers
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
Advanced Usage: Model Speed-up
ONNX Optimized (onnx-O4) for GPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="onnx",
    model_kwargs={"file_name": "model_O4.onnx"},
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
OpenVINO (ov) for CPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="openvino",
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
int8 Quantization with OV (ov-qint8) for CPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="onnx",
    model_kwargs={"file_name": "model_qint8_avx512_vnni.onnx"},
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
📚 Documentation
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
- Chinese text matching task:
Model speed-up comparison:

| Model | ATEC | BQ | LCQMC | PAWSX | STSB |
|-------|------|----|-------|-------|------|
| shibing624/text2vec-base-chinese (fp32, baseline) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (onnx-O4, #29) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (ov, #27) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (ov-qint8, #30) | 0.30778 (-3.60%) | 0.43474 (+1.88%) | 0.69620 (-0.77%) | 0.16662 (-3.20%) | 0.79396 (+0.13%) |
In short:
- ✅ shibing624/text2vec-base-chinese (onnx-O4): ONNX optimization at level O4 does not reduce accuracy, and gives a ~2x speedup on GPU.
- ✅ shibing624/text2vec-base-chinese (ov): OpenVINO does not reduce accuracy, and gives a 1.12x speedup on CPU.
- 🟡 shibing624/text2vec-base-chinese (ov-qint8): int8 quantization incurs a small accuracy loss on some tasks and a tiny gain on others when calibrated on Chinese STSB, and gives a 4.78x speedup on CPU.
Intended Uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
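Once sentences are encoded, retrieval and similarity tasks reduce to comparing vectors. As a minimal sketch (the toy 2-dimensional vectors below stand in for the 768-dimensional embeddings returned by `model.encode`), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy vectors standing in for sentence embeddings
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = cosine_similarity(emb, emb)  # (3, 3) similarity matrix
```

Ranking documents by cosine similarity against a query embedding gives a simple semantic-search baseline.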
Training Procedure
Pre-training
We use the pretrained hfl/chinese-macbert-base model. Please refer to its model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective: we compute the cosine similarity for every possible sentence pair in the batch, then apply a rank loss that compares the similarities of true pairs against those of false pairs.
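The rank loss above can be sketched as follows. This is a hedged reimplementation of a CoSENT-style objective, not the actual training code: the scale factor of 20 and the batch layout (one cosine similarity and one binary label per sentence pair) are assumptions. The loss pushes every true pair's similarity above every false pair's.

```python
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """CoSENT-style rank loss over a batch of sentence-pair similarities.

    cos_sim: shape (batch,), cosine similarity of each sentence pair
    labels:  shape (batch,), 1 for true (similar) pairs, 0 for false pairs
    """
    s = cos_sim * scale
    # diff[i, j] = s[j] - s[i]; keep only entries where pair i is true and pair j is false
    diff = s[None, :] - s[:, None]
    mask = labels[:, None] > labels[None, :]
    diff = diff[mask]
    # log(1 + sum(exp(s_false - s_true))); the prepended 0 handles the empty case
    return torch.logsumexp(torch.cat([torch.zeros(1), diff]), dim=0)

# A well-ordered batch (true pair more similar than false pair) yields a near-zero loss
loss = cosent_loss(torch.tensor([0.9, 0.1]), torch.tensor([1.0, 0.0]))
```

Note that the loss only depends on the relative ordering of similarities, which is why evaluation uses rank correlation (Spearman) rather than absolute similarity values.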
Hyperparameters
- Training dataset: https://huggingface.co/datasets/shibing624/nli_zh
- Max_seq_length: 128
- Best epoch: 5
- Sentence embedding dim: 768
🔧 Technical Details
Full Model Architecture
CoSENT(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
📄 License
This model is licensed under the Apache-2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
  author = {Xu Ming},
  title = {text2vec: A Tool for Text to Vector},
  year = {2022},
  url = {https://github.com/shibing624/text2vec},
}