🚀 shibing624/text2vec-base-chinese
This is a CoSENT (Cosine Sentence) model that maps sentences to a 768-dimensional dense vector space. It can be used for tasks such as sentence embeddings, text matching, or semantic search.
✨ Features
- Sentence Embeddings: Maps sentences to a 768-dimensional dense vector space.
- Text Matching: Suitable for tasks such as text matching or semantic search.
- Multilingual Support: Supports multiple languages, including Chinese and English.
📦 Installation
Install text2vec
pip install -U text2vec
Install transformers
pip install transformers
Install sentence-transformers
pip install -U sentence-transformers
💻 Usage Examples
Basic Usage with text2vec
from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
model = SentenceModel('shibing624/text2vec-base-chinese')
embeddings = model.encode(sentences)
print(embeddings)
Usage with HuggingFace Transformers
from transformers import BertTokenizer, BertModel
import torch
def mean_pooling(model_output, attention_mask):
    # Average token embeddings, using the attention mask to ignore padding tokens
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
tokenizer = BertTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = BertModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
Usage with sentence-transformers
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
Advanced Usage: Model Speed-up
ONNX Optimized (onnx-O4) for GPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="onnx",
    model_kwargs={"file_name": "model_O4.onnx"},
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
OpenVINO (ov) for CPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="openvino",
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
int8 Quantization with OV (ov-qint8) for CPU
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(
    "shibing624/text2vec-base-chinese",
    backend="onnx",
    model_kwargs={"file_name": "model_qint8_avx512_vnni.onnx"},
)
embeddings = model.encode(["如何更换花呗绑定银行卡", "花呗更改绑定银行卡", "你是谁"])
print(embeddings.shape)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
📚 Documentation
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
- Chinese text matching task:
Model speed-up comparison:

| Model | ATEC | BQ | LCQMC | PAWSX | STSB |
|-------|------|----|-------|-------|------|
| shibing624/text2vec-base-chinese (fp32, baseline) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (onnx-O4, #29) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (ov, #27) | 0.31928 | 0.42672 | 0.70157 | 0.17214 | 0.79296 |
| shibing624/text2vec-base-chinese (ov-qint8, #30) | 0.30778 (-3.60%) | 0.43474 (+1.88%) | 0.69620 (-0.77%) | 0.16662 (-3.20%) | 0.79396 (+0.13%) |
In short:
- ✅ shibing624/text2vec-base-chinese (onnx-O4): ONNX optimization at level O4 does not reduce accuracy, and gives a ~2x speedup on GPU.
- ✅ shibing624/text2vec-base-chinese (ov): OpenVINO does not reduce accuracy, and gives a 1.12x speedup on CPU.
- 🟡 shibing624/text2vec-base-chinese (ov-qint8): int8 quantization incurs a small accuracy loss on some tasks and a tiny gain on others when calibrated on Chinese STSB, and gives a 4.78x speedup on CPU.
Intended Uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
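Once sentences are encoded, retrieval and similarity tasks reduce to comparing vectors. As a minimal sketch (the toy 2-dimensional vectors below stand in for the 768-dimensional embeddings returned by `model.encode`), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy vectors standing in for sentence embeddings
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = cosine_similarity(emb, emb)  # (3, 3) similarity matrix
```

Ranking documents by cosine similarity against a query embedding gives a simple semantic-search baseline.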
Training Procedure
Pre-training
We use the pretrained hfl/chinese-macbert-base model. Please refer to its model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective: we compute the cosine similarity for every possible sentence pair in the batch, then apply a rank loss that compares the similarities of true pairs against those of false pairs.
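The rank loss above can be sketched as follows. This is a hedged reimplementation of a CoSENT-style objective, not the actual training code: the scale factor of 20 and the batch layout (one cosine similarity and one binary label per sentence pair) are assumptions. The loss pushes every true pair's similarity above every false pair's.

```python
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """CoSENT-style rank loss over a batch of sentence-pair similarities.

    cos_sim: shape (batch,), cosine similarity of each sentence pair
    labels:  shape (batch,), 1 for true (similar) pairs, 0 for false pairs
    """
    s = cos_sim * scale
    # diff[i, j] = s[j] - s[i]; keep only entries where pair i is true and pair j is false
    diff = s[None, :] - s[:, None]
    mask = labels[:, None] > labels[None, :]
    diff = diff[mask]
    # log(1 + sum(exp(s_false - s_true))); the prepended 0 handles the empty case
    return torch.logsumexp(torch.cat([torch.zeros(1), diff]), dim=0)

# A well-ordered batch (true pair more similar than false pair) yields a near-zero loss
loss = cosent_loss(torch.tensor([0.9, 0.1]), torch.tensor([1.0, 0.0]))
```

Note that the loss only depends on the relative ordering of similarities, which is why evaluation uses rank correlation (Spearman) rather than absolute similarity values.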
Hyperparameters
- Training dataset: https://huggingface.co/datasets/shibing624/nli_zh
- Max_seq_length: 128
- Best epoch: 5
- Sentence embedding dim: 768
🔧 Technical Details
Full Model Architecture
CoSENT(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_mean_tokens': True})
)
📄 License
This model is licensed under the Apache-2.0 license.
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
  author = {Xu Ming},
  title = {text2vec: A Tool for Text to Vector},
  year = {2022},
  url = {https://github.com/shibing624/text2vec},
}