
Developed by TaoH

A BERT model built on bert-base-chinese and trained on SimCLUE, a million-scale semantic similarity dataset. It targets general semantic matching scenarios and generalizes well across tasks.

Text Embedding

Transformers

#Chinese Semantic Matching #Strong Generalization Ability #Million-level Training


Release Time: 10/26/2022

Model Overview

This is a Chinese sentence embedding model. It is primarily used to compute semantic similarity between sentences and suits tasks such as semantic search and text matching.

Model Features

Excellent Generalization Ability

Performs well on several public semantic matching datasets and outscores the earlier sbert-chinese-general-v1 on five of the seven benchmarks reported in the Evaluation section.

General Semantic Matching

Designed for general semantic matching scenarios, suitable for various text similarity calculation tasks.

Trained on Large-scale Data

Trained on SimCLUE, a million-scale semantic similarity dataset.

Model Capabilities

Sentence Embedding Vector Extraction

Semantic Similarity Calculation

Text Feature Extraction

Semantic Search

Use Cases

Text Matching

Q&A System

Used to calculate semantic similarity between questions and candidate answers

Information Retrieval

Used to improve relevance ranking in search engines

Text Clustering

Document Classification

Automatically classify documents based on semantic similarity
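The Q&A and retrieval use cases above both come down to ranking candidates by their similarity to a query. The following is a minimal sketch of that step, assuming the embeddings have already been computed. `l2_normalize` and `rank_candidates` are hypothetical helper names, and the toy vectors are invented for illustration rather than produced by the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_candidates(query_vec, candidate_vecs, top_k=3):
    """Return (index, score) pairs for the top_k candidates by cosine similarity."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, cand in enumerate(candidate_vecs):
        c = l2_normalize(cand)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real sentence embeddings (made up).
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [-1.0, 0.0, -1.0]]
ranking = rank_candidates(query, candidates, top_k=2)
print(ranking)
```

Normalizing each vector once and then using dot products is a common design choice because a large candidate set can be normalized ahead of time and reused across queries.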

🚀 DMetaSoul/sbert-chinese-general-v2

This model is built on bert-base-chinese and trained on SimCLUE, a million-scale semantic similarity dataset. It is suited to general semantic matching scenarios and generalizes better across tasks than the earlier sbert-chinese-general-v1.

⚠️ Important Note

A lightweight version of this model has also been open-sourced!

🚀 Quick Start

✨ Features

Based on the bert-base-chinese model.
Trained on the SimCLUE dataset.
Suitable for general semantic matching scenarios.
Better generalization ability on various tasks.

📦 Installation

You can install the necessary library via the following command:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

Use the sentence-transformers framework to load the model and extract text representation vectors:

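from sentence_transformers import SentenceTransformer

# Two paraphrased sentences to embed
sentences = ["我的儿子！他猛然间喊道，我的儿子在哪儿？", "我的儿子呢！他突然喊道，我的儿子在哪里？"]

# Load the model from the HuggingFace Hub and encode each sentence into a vector
model = SentenceTransformer('DMetaSoul/sbert-chinese-general-v2')
embeddings = model.encode(sentences)
print(embeddings)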
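Once the vectors are extracted, semantic similarity is typically scored with cosine similarity. The sketch below shows that calculation in plain Python. `cosine_similarity` is a hypothetical helper written for this example, and the two short vectors are made-up stand-ins for the rows of `embeddings`.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors (any iterables of floats)."""
    a, b = list(vec_a), list(vec_b)
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings[0] and embeddings[1] (values are made up).
emb_a = [0.2, 0.7, -0.1]
emb_b = [0.25, 0.6, 0.0]
score = cosine_similarity(emb_a, emb_b)
print(round(score, 4))
```

With the real model, the same function should accept `embeddings[0]` and `embeddings[1]` directly, since `model.encode` returns an array whose rows can be iterated like lists.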

Advanced Usage

If you don't want to use sentence-transformers, you can load the model and extract text vectors through HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: weight tokens by the attention mask so padding is excluded from the average
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the unmasked token vectors, then divide by the number of real tokens (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["我的儿子！他猛然间喊道，我的儿子在哪儿？", "我的儿子呢！他突然喊道，我的儿子在哪里？"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-general-v2')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-general-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
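To make the pooling step concrete, here is a small worked example of the same masked average for a single sentence, written in plain Python instead of PyTorch. `masked_mean` is a hypothetical name used only for this illustration, and the numbers are invented.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over positions where the mask is 1 (one sentence)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                totals[i] += value
    # Mirror the clamp in mean_pooling: avoid dividing by zero for an all-padding row.
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Three token vectors; the last position is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padded vector `[100.0, 100.0]` is ignored, so the result is the average of the first two tokens only. A plain mean over all three positions would be skewed by the padding.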

📚 Documentation

Evaluation

The model was evaluated on several public semantic matching datasets by computing the correlation coefficient between embedding similarity and the gold labels:

Model	csts_dev	csts_test	afqmc	lcqmc	bqcorpus	pawsx	xiaobu
sbert-chinese-general-v1	84.54%	82.17%	23.80%	65.94%	45.52%	11.52%	48.51%
sbert-chinese-general-v2	77.20%	72.60%	36.80%	76.92%	49.63%	16.24%	63.16%

The table compares this model with the previously released sbert-chinese-general-v1. The v2 model scores higher on five of the seven datasets (afqmc, lcqmc, bqcorpus, pawsx, and xiaobu) and lower on csts_dev and csts_test, which indicates better generalization across tasks.
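The source does not say which correlation coefficient was used. Assuming it is the Spearman rank correlation, a common choice for semantic similarity benchmarks, the metric can be sketched as follows. The helper names and the sample scores and labels are hypothetical and are not taken from the evaluation itself.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(scores, labels):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(scores), average_ranks(labels))


# Made-up cosine similarities and binary match labels for four sentence pairs.
similarities = [0.9, 0.2, 0.8, 0.3]
gold_labels = [1, 0, 1, 0]
rho = spearman(similarities, gold_labels)
print(round(rho, 4))
```

Because the metric depends only on rank order, it rewards a model whose similarity scores sort matching pairs above non-matching ones, regardless of the absolute score values.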

📄 License

Citing & Authors

E-mail: xiaowenbin@dmetasoul.com
