# 🚀 xiaobu-embedding

`xiaobu-embedding` is an embedding model evaluated on multiple tasks of the MTEB benchmark, demonstrating its performance across natural language processing tasks such as semantic textual similarity, classification, clustering, reranking, retrieval, and pair classification.
## 📚 Documentation
### Model Information

| Property | Details |
|---|---|
| Model Name | xiaobu-embedding |
| Tags | mteb |
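For context on how the scores below are obtained, here is a minimal usage sketch with `sentence-transformers`. The Hub repo id `lier007/xiaobu-embedding` is an assumption, not confirmed by this card; substitute the id the model is actually published under.

```python
# Minimal usage sketch -- the repo id below is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
sentences = ["样例数据-1", "样例数据-2"]

# L2-normalized embeddings, so cosine similarity reduces to a dot product.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)
```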
### Performance Metrics

#### 1. STS (Semantic Textual Similarity) Tasks
- C-MTEB/AFQMC (Validation Split)

| Metric Type | Value |
|---|---|
| cos_sim_pearson | 49.37874132528482 |
| cos_sim_spearman | 54.84722470052176 |
| euclidean_pearson | 53.0495882931575 |
| euclidean_spearman | 54.847727301700665 |
| manhattan_pearson | 53.0632140838278 |
| manhattan_spearman | 54.8744258024692 |

- C-MTEB/ATEC (Test Split)

| Metric Type | Value |
|---|---|
| cos_sim_pearson | 48.15992903013723 |
| cos_sim_spearman | 55.13198035464577 |
| euclidean_pearson | 55.435876753245715 |
| euclidean_spearman | 55.13215936702871 |
| manhattan_pearson | 55.41429518223402 |
| manhattan_spearman | 55.13363087679285 |

- C-MTEB/BQ (Test Split)

| Metric Type | Value |
|---|---|
| cos_sim_pearson | 63.517830355554224 |
| cos_sim_spearman | 65.57007801018649 |
| euclidean_pearson | 64.05153340906585 |
| euclidean_spearman | 65.5696865661119 |
| manhattan_pearson | 63.95710619755406 |
| manhattan_spearman | 65.48565785379489 |

- C-MTEB/LCQMC (Test Split)

| Metric Type | Value |
|---|---|
| cos_sim_pearson | 69.96711977441642 |
| cos_sim_spearman | 75.54747609685569 |
| euclidean_pearson | 74.62663478056035 |
| euclidean_spearman | 75.54761576699639 |
| manhattan_pearson | 74.60983904582241 |
| manhattan_spearman | 75.52758938061503 |

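To make the STS metrics above concrete, the sketch below embeds both sides of each sentence pair, scores them with cosine similarity, and correlates the scores against gold ratings. The pairs and ratings are toy placeholders, not C-MTEB data, and the repo id is assumed as before.

```python
# STS evaluation sketch with toy data (not C-MTEB); repo id is an assumption.
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
left = ["今天天气很好", "我喜欢读书", "他在跑步"]
right = ["今天天气不错", "他在踢足球", "他正在跑步"]
gold = [4.5, 0.5, 4.8]  # toy human similarity ratings

a = model.encode(left, normalize_embeddings=True)
b = model.encode(right, normalize_embeddings=True)
cos = (a * b).sum(axis=1)  # cosine similarity of normalized vectors

print("cos_sim_pearson: ", pearsonr(cos, gold)[0])
print("cos_sim_spearman:", spearmanr(cos, gold)[0])
```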
#### 2. Classification Tasks
- mteb/amazon_reviews_multi (Test Split, zh Config)

| Metric Type | Value |
|---|---|
| accuracy | 46.722 |
| f1 | 45.039340641893205 |

- C-MTEB/IFlyTek-classification (Validation Split)

| Metric Type | Value |
|---|---|
| accuracy | 49.74220854174683 |
| f1 | 38.01399980618159 |

- C-MTEB/JDReview-classification (Test Split)

| Metric Type | Value |
|---|---|
| accuracy | 86.73545966228893 |
| ap | 55.72394235169542 |
| f1 | 81.58550390953492 |

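Classification scores of this kind come from training a lightweight classifier on frozen embeddings. Below is a sketch of that setup, assuming a simple logistic-regression probe (MTEB uses a linear classifier of this kind; the exact hyperparameters here are an assumption) and toy review data.

```python
# Classification-over-embeddings sketch; texts, labels, and repo id are assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
train_texts, train_labels = ["很好用", "质量太差", "物流很快", "非常失望"], [1, 0, 1, 0]
test_texts, test_labels = ["十分满意", "不推荐购买"], [1, 0]

# Embeddings stay frozen; only the linear probe is trained.
clf = LogisticRegression(max_iter=1000)
clf.fit(model.encode(train_texts), train_labels)
pred = clf.predict(model.encode(test_texts))

print("accuracy:", accuracy_score(test_labels, pred))
print("f1:      ", f1_score(test_labels, pred, average="macro"))
```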
#### 3. Clustering Tasks
- C-MTEB/CLSClusteringP2P (Test Split)

| Metric Type | Value |
|---|---|
| v_measure | 43.24046498507819 |

- C-MTEB/CLSClusteringS2S (Test Split)

| Metric Type | Value |
|---|---|
| v_measure | 41.22618199372116 |

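The `v_measure` numbers come from clustering the embeddings and comparing the assignment against gold topic labels. A minimal sketch with k-means and toy data (the specific clustering algorithm and data are assumptions for illustration):

```python
# Clustering sketch: v-measure of k-means clusters vs. toy gold labels.
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
texts = ["股市大涨", "央行宣布降息", "球队夺得冠军", "比赛因雨延期"]
labels = [0, 0, 1, 1]  # toy gold topic labels: finance vs. sports

emb = model.encode(texts, normalize_embeddings=True)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print("v_measure:", v_measure_score(labels, pred))
```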
#### 4. Reranking Tasks
- C-MTEB/CMedQAv1-reranking (Test Split)

| Metric Type | Value |
|---|---|
| map | 87.12213224673621 |
| mrr | 89.57150793650794 |

- C-MTEB/CMedQAv2-reranking (Test Split)

| Metric Type | Value |
|---|---|
| map | 87.57290061886421 |
| mrr | 90.19202380952382 |

- C-MTEB/Mmarco-reranking (Dev Split)

| Metric Type | Value |
|---|---|
| map | 28.076927649720986 |
| mrr | 26.98015873015873 |

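Reranking scores candidate passages against a query and sorts them; `map` and `mrr` then measure where the relevant candidates land in that order. A cosine-similarity reranking sketch with toy Q&A strings (placeholders, not CMedQA/Mmarco data):

```python
# Reranking sketch: sort candidates by cosine similarity to the query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
query = "头疼应该吃什么药？"
candidates = ["可以服用布洛芬缓解头痛", "今天的天气很适合出游", "多喝水、注意休息也有帮助"]

q = model.encode([query], normalize_embeddings=True)
c = model.encode(candidates, normalize_embeddings=True)
scores = (q @ c.T).ravel()  # cosine similarities

for i in np.argsort(-scores):  # best candidate first
    print(f"{scores[i]:.4f}  {candidates[i]}")
```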
#### 5. Retrieval Tasks
- C-MTEB/CmedqaRetrieval (Dev Split)

| Metric Type | Value |
|---|---|
| map_at_1 | 25.22 |
| map_at_10 | 37.604 |
| map_at_100 | 39.501 |
| map_at_1000 | 39.614 |
| map_at_3 | 33.378 |
| map_at_5 | 35.774 |
| mrr_at_1 | 38.385000000000005 |
| mrr_at_10 | 46.487 |
| mrr_at_100 | 47.504999999999995 |
| mrr_at_1000 | 47.548 |
| mrr_at_3 | 43.885999999999996 |
| mrr_at_5 | 45.373000000000005 |
| ndcg_at_1 | 38.385000000000005 |
| ndcg_at_10 | 44.224999999999994 |
| ndcg_at_100 | 51.637 |
| ndcg_at_1000 | 53.55799999999999 |
| ndcg_at_3 | 38.845 |
| ndcg_at_5 | 41.163 |
| precision_at_1 | 38.385000000000005 |
| precision_at_10 | 9.812 |
| precision_at_100 | 1.58 |
| precision_at_1000 | 0.183 |
| precision_at_3 | 21.88 |
| precision_at_5 | 15.974 |
| recall_at_1 | 25.22 |
| recall_at_10 | 54.897 |
| recall_at_100 | 85.469 |
| recall_at_1000 | 98.18599999999999 |
| recall_at_3 | 38.815 |
| recall_at_5 | 45.885 |

- C-MTEB/CovidRetrieval (Dev Split)

| Metric Type | Value |
|---|---|
| map_at_1 | 76.87 |
| map_at_10 | 84.502 |
| map_at_100 | 84.615 |
| map_at_1000 | 84.617 |
| map_at_3 | 83.127 |
| map_at_5 | 83.99600000000001 |
| mrr_at_1 | 77.02799999999999 |
| mrr_at_10 | 84.487 |
| mrr_at_100 | 84.59299999999999 |
| mrr_at_1000 | 84.59400000000001 |
| mrr_at_3 | 83.193 |
| mrr_at_5 | 83.994 |
| ndcg_at_1 | 77.134 |
| ndcg_at_10 | 87.68599999999999 |
| ndcg_at_100 | 88.17099999999999 |
| ndcg_at_1000 | 88.21 |
| ndcg_at_3 | 84.993 |
| ndcg_at_5 | 86.519 |
| precision_at_1 | 77.134 |
| precision_at_10 | 9.841999999999999 |
| precision_at_100 | 1.006 |
| precision_at_1000 | 0.101 |
| precision_at_3 | 30.313000000000002 |
| precision_at_5 | 18.945999999999998 |
| recall_at_1 | 76.87 |
| recall_at_10 | 97.418 |
| recall_at_100 | 99.579 |
| recall_at_1000 | 99.895 |
| recall_at_3 | 90.227 |
| recall_at_5 | 93.888 |

- C-MTEB/DuRetrieval (Dev Split)

| Metric Type | Value |
|---|---|
| map_at_1 | 25.941 |
| map_at_10 | 78.793 |
| map_at_100 | 81.57799999999999 |
| map_at_1000 | 81.626 |
| map_at_3 | 54.749 |
| map_at_5 | 69.16 |
| mrr_at_1 | 90.45 |
| mrr_at_10 | 93.406 |
| mrr_at_100 | 93.453 |
| mrr_at_1000 | 93.45700000000001 |
| mrr_at_3 | 93.10000000000001 |
| mrr_at_5 | 93.27499999999999 |
| ndcg_at_1 | 90.45 |
| ndcg_at_10 | 86.44500000000001 |
| ndcg_at_100 | 89.28399999999999 |
| ndcg_at_1000 | 89.739 |
| ndcg_at_3 | 85.62100000000001 |
| ndcg_at_5 | 84.441 |
| precision_at_1 | 90.45 |
| precision_at_10 | 41.19 |
| precision_at_100 | 4.761 |
| precision_at_1000 | 0.48700000000000004 |
| precision_at_3 | 76.583 |
| precision_at_5 | 64.68 |
| recall_at_1 | 25.941 |
| recall_at_10 | 87.443 |
| recall_at_100 | 96.54 |
| recall_at_1000 | 98.906 |
| recall_at_3 | 56.947 |
| recall_at_5 | 73.714 |

- C-MTEB/EcomRetrieval (Dev Split)

| Metric Type | Value |
|---|---|
| map_at_1 | 52.900000000000006 |
| map_at_10 | 63.144 |
| map_at_100 | 63.634 |
| map_at_1000 | 63.644999999999996 |
| map_at_3 | 60.817 |
| map_at_5 | 62.202 |
| mrr_at_1 | 52.900000000000006 |
| mrr_at_10 | 63.144 |
| mrr_at_100 | 63.634 |
| mrr_at_1000 | 63.644999999999996 |
| mrr_at_3 | 60.817 |
| mrr_at_5 | 62.202 |
| ndcg_at_1 | 52.900000000000006 |
| ndcg_at_10 | 68.042 |
| ndcg_at_100 | 70.417 |
| ndcg_at_1000 | 70.722 |
| ndcg_at_3 | 63.287000000000006 |
| ndcg_at_5 | 65.77 |
| precision_at_1 | 52.900000000000006 |
| precision_at_10 | 8.34 |
| precision_at_100 | 0.9450000000000001 |
| precision_at_1000 | 0.097 |
| precision_at_3 | 23.467 |
| precision_at_5 | 15.28 |
| recall_at_1 | 52.900000000000006 |
| recall_at_10 | 83.39999999999999 |
| recall_at_100 | 94.5 |
| recall_at_1000 | 96.89999999999999 |
| recall_at_3 | 70.39999999999999 |
| recall_at_5 | 76.4 |

- C-MTEB/MMarcoRetrieval (Dev Split)

| Metric Type | Value |
|---|---|
| map_at_1 | 65.58 |
| map_at_10 | 74.763 |
| map_at_100 | 75.077 |
| map_at_1000 | 75.091 |
| map_at_3 | 72.982 |
| map_at_5 | 74.155 |
| mrr_at_1 | 67.822 |
| mrr_at_10 | 75.437 |
| mrr_at_100 | 75.702 |
| mrr_at_1000 | 75.715 |
| mrr_at_3 | 73.91 |

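All of the retrieval metrics above (`map`/`ndcg`/`precision`/`recall@k`) derive from ranking a corpus by similarity to each query. A top-k retrieval sketch with a toy corpus (placeholders, not the C-MTEB corpora; repo id assumed as before):

```python
# Retrieval sketch: embed the corpus once, then take top-k by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
corpus = ["新冠疫苗接种注意事项", "高血压患者饮食建议", "流感的常见症状与治疗方法"]
query = "感冒了有哪些症状"

doc_emb = model.encode(corpus, normalize_embeddings=True)
q_emb = model.encode([query], normalize_embeddings=True)
scores = (q_emb @ doc_emb.T).ravel()

k = 2
topk = np.argsort(-scores)[:k]  # indices of the k most similar documents
print(f"top-{k}:", [corpus[i] for i in topk])
```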
#### 6. PairClassification Task
- C-MTEB/CMNLI (Validation Split)

| Metric Type | Value |
|---|---|
| cos_sim_accuracy | 83.22309079975948 |
| cos_sim_ap | 89.94833400328307 |
| cos_sim_f1 | 84.39319055464031 |
| cos_sim_precision | 79.5774647887324 |
| cos_sim_recall | 89.82931961655366 |
| dot_accuracy | 83.22309079975948 |
| dot_ap | 89.95618559578415 |
| dot_f1 | 84.41173239591345 |
| dot_precision | 79.61044343141317 |
| dot_recall | 89.82931961655366 |
| euclidean_accuracy | 83.23511725796753 |
| euclidean_ap | 89.94836342787318 |
| euclidean_f1 | 84.40550133096718 |
| euclidean_precision | 80.29120067524794 |
| euclidean_recall | 88.9642272620996 |
| manhattan_accuracy | 83.23511725796753 |
| manhattan_ap | 89.9450103956978 |
| manhattan_f1 | 84.44444444444444 |
| manhattan_precision | 80.09647651006712 |
| manhattan_recall | 89.29155950432546 |
| max_accuracy | 83.23511725796753 |
| max_ap | 89.95618559578415 |
| max_f1 | 84.44444444444444 |

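Pair classification thresholds a similarity score to decide whether two sentences form a positive pair; the `max_*` rows report the best result across the similarity functions (cosine, dot, Euclidean, Manhattan). A cosine-similarity sketch with toy pairs follows; in the real protocol the threshold is tuned, but it is fixed here for brevity, and the data and repo id are assumptions.

```python
# Pair-classification sketch; pairs, labels, threshold, and repo id are assumptions.
from sklearn.metrics import accuracy_score, average_precision_score, f1_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("lier007/xiaobu-embedding")  # assumed repo id
pairs = [("他在看书", "他正在阅读"), ("他在看书", "她在做饭")]
labels = [1, 0]

a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
b = model.encode([p[1] for p in pairs], normalize_embeddings=True)
cos = (a * b).sum(axis=1)

threshold = 0.75  # tuned on dev data in the real protocol; fixed here
pred = (cos >= threshold).astype(int)
print("cos_sim_accuracy:", accuracy_score(labels, pred))
print("cos_sim_f1:      ", f1_score(labels, pred))
print("cos_sim_ap:      ", average_precision_score(labels, cos))
```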