Acge Text Embedding
Developed by aspire
The ACGE text embedding model generates text embedding vectors and supports a range of natural language processing tasks.
Downloads: 27.12k
Release date: 3/9/2024
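Model overview
This model is mainly used to generate high-quality text embedding vectors and is suited to tasks such as semantic similarity calculation, text classification, clustering, retrieval, and reranking.
Model features
Multi-task support
Supports a range of natural language processing tasks, including semantic similarity calculation, text classification, clustering, retrieval, and reranking.
High performance
Performs well on multiple benchmark datasets, particularly on Chinese text processing tasks (roughly 69 on average across the 35 C-MTEB tasks reported below).
Model capabilities
Text embedding generation
Semantic similarity calculation
Text classification
Text clustering
Text retrieval
Text reranking
Use cases
Semantic similarity calculation
Sentence similarity
Computes the semantic similarity of two sentences; applicable to scenarios such as QA systems and recommendation systems.
On the AFQMC dataset, the Pearson correlation of the cosine similarity is 54.03.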
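The "cosine similarity Pearson" figure can be read as follows: each sentence pair is scored by the cosine similarity of its two embeddings, and those scores are then correlated with the gold similarity labels. A minimal sketch, assuming toy two-dimensional vectors and made-up labels in place of real model output (the helper names are illustrative, not part of any library):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy embeddings for three sentence pairs (not model output).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold_labels = [1.0, 0.5, 0.0]  # hypothetical human similarity judgements

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = pearson(predicted, gold_labels)
print(predicted, score)
```

Benchmark tables usually report this correlation multiplied by 100, which is presumably why the value above appears as 54.03 rather than 0.5403.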
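Text classification
Amazon review classification
Classifies the sentiment of Amazon product reviews.
Accuracy is 48.54% and the F1 score is 46.60%.
Text clustering
Sentence clustering
Clusters semantically similar sentences.
On the CLSClusteringP2P dataset, the V-measure is 47.08%.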
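V-measure is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls into a single cluster). The sketch below is a plain-Python illustration of that definition on toy labels; it is not the evaluation code used for the benchmark, and the function names are chosen here for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_labels)
    h_cluster = entropy(cluster_labels)
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    completeness = 1.0 if h_cluster == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_cluster
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Cluster ids are arbitrary: a perfect clustering with swapped ids still scores 1.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every class evenly carries no information about the classes.
uninformative = v_measure([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, uninformative)
```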
Text retrieval
Medical QA retrieval
Retrieves relevant answers from a medical QA dataset.
On the CmedqaRetrieval dataset, MAP@10 is 40.00%.
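MAP@10 averages, over all queries, the average precision computed on the top 10 retrieved documents. The sketch below assumes the convention of dividing by the total number of relevant documents for the query; evaluation toolkits differ on this detail, so treat it as an illustration and not as the exact benchmark implementation. The document ids and relevance judgements are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k results (ranked_ids assumed duplicate-free)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision_at_k(rankings, relevance, k=10):
    """Mean of per-query average precision; both arguments are dicts keyed by query id."""
    scores = [average_precision_at_k(rankings[q], relevance[q], k) for q in rankings]
    return sum(scores) / len(scores)


# Hypothetical rankings and relevance judgements for two queries.
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4", "d6"]}
relevance = {"q1": {"d1", "d3"}, "q2": {"d4"}}

map_at_10 = mean_average_precision_at_k(rankings, relevance, k=10)
print(map_at_10)
```

For `q1` the relevant documents sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; for `q2` the single relevant document sits at rank 2, giving 1/2.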
Text reranking
Medical QA reranking
Reranks retrieval results to improve relevance.
On the CMedQAv1 dataset, MAP is 88.66%.
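Reranking with an embedding model amounts to scoring each candidate against the query and sorting by that score. A minimal sketch, assuming L2-normalised vectors (so the dot product equals cosine similarity) and toy values in place of real embeddings; `rerank` and `reciprocal_rank` are illustrative helpers, not library functions:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rerank(query_vec, candidates):
    """Return candidate ids sorted by similarity to the query, best first.

    candidates is a list of (doc_id, vector) tuples.
    """
    ordered = sorted(candidates, key=lambda c: dot(query_vec, c[1]), reverse=True)
    return [doc_id for doc_id, _ in ordered]


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical unit-length vectors standing in for model embeddings.
query = [1.0, 0.0]
candidates = [("a", [0.0, 1.0]), ("b", [0.6, 0.8]), ("c", [1.0, 0.0])]

ranked = rerank(query, candidates)
rr = reciprocal_rank(ranked, {"b"})
print(ranked, rr)
```

Averaging `reciprocal_rank` over all queries gives the MRR value that the metadata reports alongside MAP for the reranking tasks.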
pipeline_tag: sentence-similarity
tags:
- mteb
- sentence-transformers
- feature-extraction
- sentence-similarity
model-index:
- name: acge_text_embedding
results:
- task:
type: STS
dataset:
type: C-MTEB/AFQMC
name: MTEB AFQMC
config: default
split: validation
revision: b44c3b011063adb25877c13823db83bb193913c4
metrics:
- type: cos_sim_pearson value: 54.03434872650919
- type: cos_sim_spearman value: 58.80730796688325
- type: euclidean_pearson value: 57.47231387497989
- type: euclidean_spearman value: 58.80775026351807
- type: manhattan_pearson value: 57.46332720141574
- type: manhattan_spearman value: 58.80196022940078
- task:
type: STS
dataset:
type: C-MTEB/ATEC
name: MTEB ATEC
config: default
split: test
revision: 0f319b1142f28d00e055a6770f3f726ae9b7d865
metrics:
- type: cos_sim_pearson value: 53.52621290548175
- type: cos_sim_spearman value: 57.945227768312144
- type: euclidean_pearson value: 61.17041394151802
- type: euclidean_spearman value: 57.94553287835657
- type: manhattan_pearson value: 61.168327500057885
- type: manhattan_spearman value: 57.94477516925043
- task:
type: Classification
dataset:
type: mteb/amazon_reviews_multi
name: MTEB AmazonReviewsClassification (zh)
config: zh
split: test
revision: 1399c76144fd37290681b995c656ef9b2e06e26d
metrics:
- type: accuracy value: 48.538000000000004
- type: f1 value: 46.59920995594044
- task:
type: STS
dataset:
type: C-MTEB/BQ
name: MTEB BQ
config: default
split: test
revision: e3dda5e115e487b39ec7e618c0c6a29137052a55
metrics:
- type: cos_sim_pearson value: 68.27529991817154
- type: cos_sim_spearman value: 70.37095914176643
- type: euclidean_pearson value: 69.42690712802727
- type: euclidean_spearman value: 70.37017971889912
- type: manhattan_pearson value: 69.40264877917839
- type: manhattan_spearman value: 70.34786744049524
- task:
type: Clustering
dataset:
type: C-MTEB/CLSClusteringP2P
name: MTEB CLSClusteringP2P
config: default
split: test
revision: 4b6227591c6c1a73bc76b1055f3b7f3588e72476
metrics:
- type: v_measure value: 47.08027536192709
- task:
type: Clustering
dataset:
type: C-MTEB/CLSClusteringS2S
name: MTEB CLSClusteringS2S
config: default
split: test
revision: e458b3f5414b62b7f9f83499ac1f5497ae2e869f
metrics:
- type: v_measure value: 44.0526024940363
- task:
type: Reranking
dataset:
type: C-MTEB/CMedQAv1-reranking
name: MTEB CMedQAv1
config: default
split: test
revision: 8d7f1e942507dac42dc58017c1a001c3717da7df
metrics:
- type: map value: 88.65974993133156
- type: mrr value: 90.64761904761905
- task:
type: Reranking
dataset:
type: C-MTEB/CMedQAv2-reranking
name: MTEB CMedQAv2
config: default
split: test
revision: 23d186750531a14a0357ca22cd92d712fd512ea0
metrics:
- type: map value: 88.90396838907245
- type: mrr value: 90.90932539682541
- task:
type: Retrieval
dataset:
type: C-MTEB/CmedqaRetrieval
name: MTEB CmedqaRetrieval
config: default
split: dev
revision: cd540c506dae1cf9e9a59c3e06f42030d54e7301
metrics:
- type: map_at_1 value: 26.875
- type: map_at_10 value: 39.995999999999995
- type: map_at_100 value: 41.899
- type: map_at_1000 value: 42.0
- type: map_at_3 value: 35.414
- type: map_at_5 value: 38.019
- type: mrr_at_1 value: 40.635
- type: mrr_at_10 value: 48.827
- type: mrr_at_100 value: 49.805
- type: mrr_at_1000 value: 49.845
- type: mrr_at_3 value: 46.145
- type: mrr_at_5 value: 47.693999999999996
- type: ndcg_at_1 value: 40.635
- type: ndcg_at_10 value: 46.78
- type: ndcg_at_100 value: 53.986999999999995
- type: ndcg_at_1000 value: 55.684
- type: ndcg_at_3 value: 41.018
- type: ndcg_at_5 value: 43.559
- type: precision_at_1 value: 40.635
- type: precision_at_10 value: 10.427999999999999
- type: precision_at_100 value: 1.625
- type: precision_at_1000 value: 0.184
- type: precision_at_3 value: 23.139000000000003
- type: precision_at_5 value: 17.004
- type: recall_at_1 value: 26.875
- type: recall_at_10 value: 57.887
- type: recall_at_100 value: 87.408
- type: recall_at_1000 value: 98.721
- type: recall_at_3 value: 40.812
- type: recall_at_5 value: 48.397
- task:
type: PairClassification
dataset:
type: C-MTEB/CMNLI
name: MTEB Cmnli
config: default
split: validation
revision: 41bc36f332156f7adc9e38f53777c959b2ae9766
metrics:
- type: cos_sim_accuracy value: 83.43956704750451
- type: cos_sim_ap value: 90.49172854352659
- type: cos_sim_f1 value: 84.28475486903963
- type: cos_sim_precision value: 80.84603822203135
- type: cos_sim_recall value: 88.02899228431144
- type: dot_accuracy value: 83.43956704750451
- type: dot_ap value: 90.46317132695233
- type: dot_f1 value: 84.28794294628929
- type: dot_precision value: 80.51948051948052
- type: dot_recall value: 88.4264671498714
- type: euclidean_accuracy value: 83.43956704750451
- type: euclidean_ap value: 90.49171785256486
- type: euclidean_f1 value: 84.28235820561584
- type: euclidean_precision value: 80.8022308022308
- type: euclidean_recall value: 88.07575403320084
- type: manhattan_accuracy value: 83.55983162958509
- type: manhattan_ap value: 90.48046779812815
- type: manhattan_f1 value: 84.45354259069714
- type: manhattan_precision value: 82.21877767936226
- type: manhattan_recall value: 86.81318681318682
- type: max_accuracy value: 83.55983162958509
- type: max_ap value: 90.49172854352659
- type: max_f1 value: 84.45354259069714
- task:
type: Retrieval
dataset:
type: C-MTEB/CovidRetrieval
name: MTEB CovidRetrieval
config: default
split: dev
revision: 1271c7809071a13532e05f25fb53511ffce77117
metrics:
- type: map_at_1 value: 68.54599999999999
- type: map_at_10 value: 77.62400000000001
- type: map_at_100 value: 77.886
- type: map_at_1000 value: 77.89
- type: map_at_3 value: 75.966
- type: map_at_5 value: 76.995
- type: mrr_at_1 value: 68.915
- type: mrr_at_10 value: 77.703
- type: mrr_at_100 value: 77.958
- type: mrr_at_1000 value: 77.962
- type: mrr_at_3 value: 76.08
- type: mrr_at_5 value: 77.118
- type: ndcg_at_1 value: 68.809
- type: ndcg_at_10 value: 81.563
- type: ndcg_at_100 value: 82.758
- type: ndcg_at_1000 value: 82.864
- type: ndcg_at_3 value: 78.29
- type: ndcg_at_5 value: 80.113
- type: precision_at_1 value: 68.809
- type: precision_at_10 value: 9.463000000000001
- type: precision_at_100 value: 1.001
- type: precision_at_1000 value: 0.101
- type: precision_at_3 value: 28.486
- type: precision_at_5 value: 18.019
- type: recall_at_1 value: 68.54599999999999
- type: recall_at_10 value: 93.625
- type: recall_at_100 value: 99.05199999999999
- type: recall_at_1000 value: 99.895
- type: recall_at_3 value: 84.879
- type: recall_at_5 value: 89.252
- task:
type: Retrieval
dataset:
type: C-MTEB/DuRetrieval
name: MTEB DuRetrieval
config: default
split: dev
revision: a1a333e290fe30b10f3f56498e3a0d911a693ced
metrics:
- type: map_at_1 value: 25.653
- type: map_at_10 value: 79.105
- type: map_at_100 value: 81.902
- type: map_at_1000 value: 81.947
- type: map_at_3 value: 54.54599999999999
- type: map_at_5 value: 69.226
- type: mrr_at_1 value: 89.35
- type: mrr_at_10 value: 92.69
- type: mrr_at_100 value: 92.77
- type: mrr_at_1000 value: 92.774
- type: mrr_at_3 value: 92.425
- type: mrr_at_5 value: 92.575
- type: ndcg_at_1 value: 89.35
- type: ndcg_at_10 value: 86.55199999999999
- type: ndcg_at_100 value: 89.35300000000001
- type: ndcg_at_1000 value: 89.782
- type: ndcg_at_3 value: 85.392
- type: ndcg_at_5 value: 84.5
- type: precision_at_1 value: 89.35
- type: precision_at_10 value: 41.589999999999996
- type: precision_at_100 value: 4.781
- type: precision_at_1000 value: 0.488
- type: precision_at_3 value: 76.683
- type: precision_at_5 value: 65.06
- type: recall_at_1 value: 25.653
- type: recall_at_10 value: 87.64999999999999
- type: recall_at_100 value: 96.858
- type: recall_at_1000 value: 99.13300000000001
- type: recall_at_3 value: 56.869
- type: recall_at_5 value: 74.024
- task:
type: Retrieval
dataset:
type: C-MTEB/EcomRetrieval
name: MTEB EcomRetrieval
config: default
split: dev
revision: 687de13dc7294d6fd9be10c6945f9e8fec8166b9
metrics:
- type: map_at_1 value: 52.1
- type: map_at_10 value: 62.629999999999995
- type: map_at_100 value: 63.117000000000004
- type: map_at_1000 value: 63.134
- type: map_at_3 value: 60.267
- type: map_at_5 value: 61.777
- type: mrr_at_1 value: 52.1
- type: mrr_at_10 value: 62.629999999999995
- type: mrr_at_100 value: 63.117000000000004
- type: mrr_at_1000 value: 63.134
- type: mrr_at_3 value: 60.267
- type: mrr_at_5 value: 61.777
- type: ndcg_at_1 value: 52.1
- type: ndcg_at_10 value: 67.596
- type: ndcg_at_100 value: 69.95
- type: ndcg_at_1000 value: 70.33500000000001
- type: ndcg_at_3 value: 62.82600000000001
- type: ndcg_at_5 value: 65.546
- type: precision_at_1 value: 52.1
- type: precision_at_10 value: 8.309999999999999
- type: precision_at_100 value: 0.941
- type: precision_at_1000 value: 0.097
- type: precision_at_3 value: 23.400000000000002
- type: precision_at_5 value: 15.36
- type: recall_at_1 value: 52.1
- type: recall_at_10 value: 83.1
- type: recall_at_100 value: 94.1
- type: recall_at_1000 value: 97.0
- type: recall_at_3 value: 70.19999999999999
- type: recall_at_5 value: 76.8
- task:
type: Classification
dataset:
type: C-MTEB/IFlyTek-classification
name: MTEB IFlyTek
config: default
split: validation
revision: 421605374b29664c5fc098418fe20ada9bd55f8a
metrics:
- type: accuracy value: 51.773759138130046
- type: f1 value: 40.341407912920054
- task:
type: Classification
dataset:
type: C-MTEB/JDReview-classification
name: MTEB JDReview
config: default
split: test
revision: b7c64bd89eb87f8ded463478346f76731f07bf8b
metrics:
- type: accuracy value: 86.69793621013133
- type: ap value: 55.46718958939327
- type: f1 value: 81.48228915952436
- task:
type: STS
dataset:
type: C-MTEB/LCQMC
name: MTEB LCQMC
config: default
split: test
revision: 17f9b096f80380fce5ed12a9be8be7784b337daf
metrics:
- type: cos_sim_pearson value: 71.1397780205448
- type: cos_sim_spearman value: 78.17368193033309
- type: euclidean_pearson value: 77.4849177602368
- type: euclidean_spearman value: 78.17369079663212
- type: manhattan_pearson value: 77.47344305182406
- type: manhattan_spearman value: 78.16454335155387
- task:
type: Reranking
dataset:
type: C-MTEB/Mmarco-reranking
name: MTEB MMarcoReranking
config: default
split: dev
revision: 8e0c766dbe9e16e1d221116a3f36795fbade07f6
metrics:
- type: map value: 27.76160559006673
- type: mrr value: 28.02420634920635
- task:
type: Retrieval
dataset:
type: C-MTEB/MMarcoRetrieval
name: MTEB MMarcoRetrieval
config: default
split: dev
revision: 539bbde593d947e2a124ba72651aafc09eb33fc2
metrics:
- type: map_at_1 value: 65.661
- type: map_at_10 value: 74.752
- type: map_at_100 value: 75.091
- type: map_at_1000 value: 75.104
- type: map_at_3 value: 72.997
- type: map_at_5 value: 74.119
- type: mrr_at_1 value: 67.923
- type: mrr_at_10 value: 75.376
- type: mrr_at_100 value: 75.673
- type: mrr_at_1000 value: 75.685
- type: mrr_at_3 value: 73.856
- type: mrr_at_5 value: 74.82799999999999
- type: ndcg_at_1 value: 67.923
- type: ndcg_at_10 value: 78.424
- type: ndcg_at_100 value: 79.95100000000001
- type: ndcg_at_1000 value: 80.265
- type: ndcg_at_3 value: 75.101
- type: ndcg_at_5 value: 76.992
- type: precision_at_1 value: 67.923
- type: precision_at_10 value: 9.474
- type: precision_at_100 value: 1.023
- type: precision_at_1000 value: 0.105
- type: precision_at_3 value: 28.319
- type: precision_at_5 value: 17.986
- type: recall_at_1 value: 65.661
- type: recall_at_10 value: 89.09899999999999
- type: recall_at_100 value: 96.023
- type: recall_at_1000 value: 98.455
- type: recall_at_3 value: 80.314
- type: recall_at_5 value: 84.81
- task:
type: Classification
dataset:
type: mteb/amazon_massive_intent
name: MTEB MassiveIntentClassification (zh-CN)
config: zh-CN
split: test
revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7
metrics:
- type: accuracy value: 75.86751849361131
- type: f1 value: 73.04918450508
- task:
type: Classification
dataset:
type: mteb/amazon_massive_scenario
name: MTEB MassiveScenarioClassification (zh-CN)
config: zh-CN
split: test
revision: 7d571f92784cd94a019292a1f45445077d0ef634
metrics:
- type: accuracy value: 78.4364492266308
- type: f1 value: 78.120686034844
- task:
type: Retrieval
dataset:
type: C-MTEB/MedicalRetrieval
name: MTEB MedicalRetrieval
config: default
split: dev
revision: 2039188fb5800a9803ba5048df7b76e6fb151fc6
metrics:
- type: map_at_1 value: 55.00000000000001
- type: map_at_10 value: 61.06399999999999
- type: map_at_100 value: 61.622
- type: map_at_1000 value: 61.663000000000004
- type: map_at_3 value: 59.583
- type: map_at_5 value: 60.373
- type: mrr_at_1 value: 55.2
- type: mrr_at_10 value: 61.168
- type: mrr_at_100 value: 61.726000000000006
- type: mrr_at_1000 value: 61.767
- type: mrr_at_3 value: 59.683
- type: mrr_at_5 value: 60.492999999999995
- type: ndcg_at_1 value: 55.00000000000001
- type: ndcg_at_10 value: 64.098
- type: ndcg_at_100 value: 67.05
- type: ndcg_at_1000 value: 68.262
- type: ndcg_at_3 value: 61.00600000000001
- type: ndcg_at_5 value: 62.439
- type: precision_at_1 value: 55.00000000000001
- type: precision_at_10 value: 7.37
- type: precision_at_100 value: 0.881
- type: precision_at_1000 value: 0.098
- type: precision_at_3 value: 21.7
- type: precision_at_5 value: 13.719999999999999
- type: recall_at_1 value: 55.00000000000001
- type: recall_at_10 value: 73.7
- type: recall_at_100 value: 88.1
- type: recall_at_1000 value: 97.8
- type: recall_at_3 value: 65.10000000000001
- type: recall_at_5 value: 68.60000000000001
- task:
type: Classification
dataset:
type: C-MTEB/MultilingualSentiment-classification
name: MTEB MultilingualSentiment
config: default
split: validation
revision: 46958b007a63fdbf239b7672c25d0bea67b5ea1a
metrics:
- type: accuracy value: 77.52666666666667
- type: f1 value: 77.49784731367215
- task:
type: PairClassification
dataset:
type: C-MTEB/OCNLI
name: MTEB Ocnli
config: default
split: validation
revision: 66e76a618a34d6d565d5538088562851e6daa7ec
metrics:
- type: cos_sim_accuracy value: 81.10449377368705
- type: cos_sim_ap value: 85.17742765935606
- type: cos_sim_f1 value: 83.00094966761633
- type: cos_sim_precision value: 75.40983606557377
- type: cos_sim_recall value: 92.29144667370645
- type: dot_accuracy value: 81.10449377368705
- type: dot_ap value: 85.17143850809614
- type: dot_f1 value: 83.01707779886148
- type: dot_precision value: 75.36606373815677
- type: dot_recall value: 92.39704329461456
- type: euclidean_accuracy value: 81.10449377368705
- type: euclidean_ap value: 85.17856775343333
- type: euclidean_f1 value: 83.00094966761633
- type: euclidean_precision value: 75.40983606557377
- type: euclidean_recall value: 92.29144667370645
- type: manhattan_accuracy value: 81.05035192203573
- type: manhattan_ap value: 85.14464459395809
- type: manhattan_f1 value: 82.96155671570953
- type: manhattan_precision value: 75.3448275862069
- type: manhattan_recall value: 92.29144667370645
- type: max_accuracy value: 81.10449377368705
- type: max_ap value: 85.17856775343333
- type: max_f1 value: 83.01707779886148
- task:
type: Classification
dataset:
type: C-MTEB/OnlineShopping-classification
name: MTEB OnlineShopping
config: default
split: test
revision: e610f2ebd179a8fda30ae534c3878750a96db120
metrics:
- type: accuracy value: 93.71000000000001
- type: ap value: 91.83202232349356
- type: f1 value: 93.69900560334331
- task:
type: STS
dataset:
type: C-MTEB/PAWSX
name: MTEB PAWSX
config: default
split: test
revision: 9c6a90e430ac22b5779fb019a23e820b11a8b5e1
metrics:
- type: cos_sim_pearson value: 39.175047651512415
- type: cos_sim_spearman value: 45.51434675777896
- type: euclidean_pearson value: 44.864110004132286
- type: euclidean_spearman value: 45.516433048896076
- type: manhattan_pearson value: 44.87153627706517
- type: manhattan_spearman value: 45.52862617925012
- task:
type: STS
dataset:
type: C-MTEB/QBQTC
name: MTEB QBQTC
config: default
split: test
revision: 790b0510dc52b1553e8c49f3d2afb48c0e5c48b7
metrics:
- type: cos_sim_pearson value: 34.249579701429084
- type: cos_sim_spearman value: 37.30903127368978
- type: euclidean_pearson value: 35.129438425253355
- type: euclidean_spearman value: 37.308544018709085
- type: manhattan_pearson value: 35.08936153503652
- type: manhattan_spearman value: 37.25582901077839
- task:
type: STS
dataset:
type: mteb/sts22-crosslingual-sts
name: MTEB STS22 (zh)
config: zh
split: test
revision: eea2b4fe26a775864c896887d910b76a8098ad3f
metrics:
- type: cos_sim_pearson value: 61.29309637460004
- type: cos_sim_spearman value: 65.85136090376717
- type: euclidean_pearson value: 64.04783990953557
- type: euclidean_spearman value: 65.85036859610366
- type: manhattan_pearson value: 63.995852552712186
- type: manhattan_spearman value: 65.86508416749417
- task:
type: STS
dataset:
type: C-MTEB/STSB
name: MTEB STSB
config: default
split: test
revision: 0cde68302b3541bb8b3c340dc0644b0b745b3dc0
metrics:
- type: cos_sim_pearson value: 81.5595940455587
- type: cos_sim_spearman value: 82.72654634579749
- type: euclidean_pearson value: 82.4892721061365
- type: euclidean_spearman value: 82.72678504228253
- type: manhattan_pearson value: 82.4770861422454
- type: manhattan_spearman value: 82.71137469783162
- task:
type: Reranking
dataset:
type: C-MTEB/T2Reranking
name: MTEB T2Reranking
config: default
split: dev
revision: 76631901a18387f85eaa53e5450019b87ad58ef9
metrics:
- type: map value: 66.6159547610527
- type: mrr value: 76.35739406347057
- task:
type: Retrieval
dataset:
type: C-MTEB/T2Retrieval
name: MTEB T2Retrieval
config: default
split: dev
revision: 8731a845f1bf500a4f111cf1070785c793d10e64
metrics:
- type: map_at_1 value: 27.878999999999998
- type: map_at_10 value: 77.517
- type: map_at_100 value: 81.139
- type: map_at_1000 value: 81.204
- type: map_at_3 value: 54.728
- type: map_at_5 value: 67.128
- type: mrr_at_1 value: 90.509
- type: mrr_at_10 value: 92.964
- type: mrr_at_100 value: 93.045
- type: mrr_at_1000 value: 93.048
- type: mrr_at_3 value: 92.551
- type: mrr_at_5 value: 92.81099999999999
- type: ndcg_at_1 value: 90.509
- type: ndcg_at_10 value: 85.075
- type: ndcg_at_100 value: 88.656
- type: ndcg_at_1000 value: 89.25699999999999
- type: ndcg_at_3 value: 86.58200000000001
- type: ndcg_at_5 value: 85.138
- type: precision_at_1 value: 90.509
- type: precision_at_10 value: 42.05
- type: precision_at_100 value: 5.013999999999999
- type: precision_at_1000 value: 0.516
- type: precision_at_3 value: 75.551
- type: precision_at_5 value: 63.239999999999995
- type: recall_at_1 value: 27.878999999999998
- type: recall_at_10 value: 83.941
- type: recall_at_100 value: 95.568
- type: recall_at_1000 value: 98.55000000000001
- type: recall_at_3 value: 56.374
- type: recall_at_5 value: 70.435
- task:
type: Classification
dataset:
type: C-MTEB/TNews-classification
name: MTEB TNews
config: default
split: validation
revision: 317f262bf1e6126357bbe89e875451e4b0938fe4
metrics:
- type: accuracy value: 53.687
- type: f1 value: 51.86911933364655
- task:
type: Clustering
dataset:
type: C-MTEB/ThuNewsClusteringP2P
name: MTEB ThuNewsClusteringP2P
config: default
split: test
revision: 5798586b105c0434e4f0fe5e767abe619442cf93
metrics:
- type: v_measure value: 74.65887489872564
- task:
type: Clustering
dataset:
type: C-MTEB/ThuNewsClusteringS2S
name: MTEB ThuNewsClusteringS2S
config: default
split: test
revision: 8a8b2caeda43f39e13c4bc5bea0f8a667896e10d
metrics:
- type: v_measure value: 69.00410995984436
- task:
type: Retrieval
dataset:
type: C-MTEB/VideoRetrieval
name: MTEB VideoRetrieval
config: default
split: dev
revision: 58c2597a5943a2ba48f4668c3b90d796283c5639
metrics:
- type: map_at_1 value: 59.4
- type: map_at_10 value: 69.214
- type: map_at_100 value: 69.72699999999999
- type: map_at_1000 value: 69.743
- type: map_at_3 value: 67.717
- type: map_at_5 value: 68.782
- type: mrr_at_1 value: 59.4
- type: mrr_at_10 value: 69.214
- type: mrr_at_100 value: 69.72699999999999
- type: mrr_at_1000 value: 69.743
- type: mrr_at_3 value: 67.717
- type: mrr_at_5 value: 68.782
- type: ndcg_at_1 value: 59.4
- type: ndcg_at_10 value: 73.32300000000001
- type: ndcg_at_100 value: 75.591
- type: ndcg_at_1000 value: 75.98700000000001
- type: ndcg_at_3 value: 70.339
- type: ndcg_at_5 value: 72.246
- type: precision_at_1 value: 59.4
- type: precision_at_10 value: 8.59
- type: precision_at_100 value: 0.96
- type: precision_at_1000 value: 0.099
- type: precision_at_3 value: 25.967000000000002
- type: precision_at_5 value: 16.5
- type: recall_at_1 value: 59.4
- type: recall_at_10 value: 85.9
- type: recall_at_100 value: 96.0
- type: recall_at_1000 value: 99.1
- type: recall_at_3 value: 77.9
- type: recall_at_5 value: 82.5
- task:
type: Classification
dataset:
type: C-MTEB/waimai-classification
name: MTEB Waimai
config: default
split: test
revision: 339287def212450dcaa9df8c22bf93e9980c7023
metrics:
- type: accuracy value: 88.53
- type: ap value: 73.56216166534062
- type: f1 value: 87.06093694294485

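acge model
The acge model comes from the technical team at 合合信息 (Intsig). The public technology trial platform is TextIn, and the open-source repository is on github. 合合信息 is an industry-leading artificial intelligence and big data technology company. Through its core technologies in intelligent text recognition and commercial big data, its consumer and business products, and its industry solutions, it provides innovative digital and intelligent services to enterprises and individual users worldwide.
For technical discussion, contact yanhui_he@intsig.net; for business cooperation, contact simon_liu@intsig.net. You can click the image and scan the QR code to join our WeChat community. If you would like to join 合合信息 and work on "document parsing", "document retrieval", or "document pre-research", send your résumé to min_du@intsig.net, or add the HR contact on WeChat to discuss the roles in detail.
acge is a general-purpose text encoding model. It produces variable-length embedding vectors and uses Matryoshka Representation Learning, as shown in the figure:
The recommended dimensions are 1024 or 1792.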
Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
---|---|---|---|---|---|
acge-text-embedding | 0.65 | [1024, 1792] | 1024 | Chinese | NO |
Metric
C-MTEB leaderboard (Chinese)
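Because of randomness in the data, the GPU, and the inference data type, results differ slightly from run to run. I ran four tests in total, with different GPUs (A10, A100) and different data types. The results are in the result folder, and the run with the lowest score was taken as the final reported score. Following infgrad's suggestion, different input lengths were also tested; a Sequence Length of 512 gave the best result.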
Model Name | GPU | tensor-type | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
acge_text_embedding | NVIDIA TESLA A10 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.76 | 58.22 | 87.82 | 67.67 | 72.48 | 62.24 |
acge_text_embedding | NVIDIA TESLA A100 | bfloat16 | 0.65 | 1792 | 1024 | 68.91 | 72.77 | 58.35 | 87.82 | 67.53 | 72.48 | 62.24 |
acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 1024 | 68.99 | 72.76 | 58.68 | 87.84 | 67.89 | 72.49 | 62.24 |
acge_text_embedding | NVIDIA TESLA A100 | float32 | 0.65 | 1792 | 1024 | 68.98 | 72.76 | 58.58 | 87.83 | 67.91 | 72.49 | 62.24 |
acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 768 | 68.95 | 72.76 | 58.68 | 87.84 | 67.86 | 72.48 | 62.07 |
acge_text_embedding | NVIDIA TESLA A100 | float16 | 0.65 | 1792 | 512 | 69.07 | 72.75 | 58.7 | 87.84 | 67.99 | 72.93 | 62.09 |
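The "Average (35)" column appears to be the mean over all 35 tasks, which is equivalent to weighting each category score by its task count. The check below uses the first row of the table (A10, bfloat16); this reading is an assumption, and because the category scores are already rounded the result can only match the reported figure approximately.

```python
# Task counts taken from the column headers of the table.
task_counts = {
    "Classification": 9,
    "Clustering": 4,
    "Pair Classification": 2,
    "Reranking": 4,
    "Retrieval": 8,
    "STS": 8,
}

# Category scores from the first table row (NVIDIA TESLA A10, bfloat16).
category_scores = {
    "Classification": 72.76,
    "Clustering": 58.22,
    "Pair Classification": 87.82,
    "Reranking": 67.67,
    "Retrieval": 72.48,
    "STS": 62.24,
}

total_tasks = sum(task_counts.values())
weighted_average = sum(task_counts[c] * category_scores[c] for c in task_counts) / total_tasks

print(total_tasks)                 # 35
print(round(weighted_average, 2))  # 68.91, the reported Average (35) for this row
```

A plain mean of the six category scores would instead give about 70.2, so the task-count weighting matters when comparing against leaderboard averages.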
Reproduce our results
C-MTEB:
import torch
import argparse
import functools
import numpy as np  # required by the np.ndarray annotations below
from C_MTEB.tasks import *
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from mteb import MTEB, DRESModel


class RetrievalModel(DRESModel):
    # Wraps a SentenceTransformer so retrieval tasks can encode queries and corpus separately.
    def __init__(self, encoder, **kwargs):
        self.encoder = encoder

    def encode_queries(self, queries: List[str], **kwargs) -> np.ndarray:
        # No instruction prefix is added to the queries.
        input_texts = ['{}'.format(q) for q in queries]
        return self._do_encode(input_texts)

    def encode_corpus(self, corpus: List[Dict[str, str]], **kwargs) -> np.ndarray:
        # Prepend the title (when present) to the document text.
        input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
        input_texts = ['{}'.format(t) for t in input_texts]
        return self._do_encode(input_texts)

    @torch.no_grad()
    def _do_encode(self, input_texts: List[str]) -> np.ndarray:
        return self.encoder.encode(
            sentences=input_texts,
            batch_size=512,
            normalize_embeddings=True,
            convert_to_numpy=True
        )


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', default="acge_text_embedding", type=str)
    parser.add_argument('--task_type', default=None, type=str)
    parser.add_argument('--pooling_method', default='cls', type=str)
    parser.add_argument('--output_dir', default='zh_results',
                        type=str, help='output directory')
    parser.add_argument('--max_len', default=1024, type=int, help='max length')
    return parser.parse_args()


if __name__ == '__main__':
    args = get_args()
    # Load the model in half precision and always return normalized embeddings.
    encoder = SentenceTransformer(args.model_name_or_path).half()
    encoder.encode = functools.partial(encoder.encode, normalize_embeddings=True)
    encoder.max_seq_length = int(args.max_len)

    task_names = [t.description["name"] for t in MTEB(task_types=args.task_type,
                                                      task_langs=['zh', 'zh-CN']).tasks]
    # Retrieval tasks are run through the RetrievalModel wrapper.
    TASKS_WITH_PROMPTS = ["T2Retrieval", "MMarcoRetrieval", "DuRetrieval", "CovidRetrieval", "CmedqaRetrieval",
                          "EcomRetrieval", "MedicalRetrieval", "VideoRetrieval"]
    for task in task_names:
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        if task in TASKS_WITH_PROMPTS:
            evaluation.run(RetrievalModel(encoder), output_folder=args.output_dir, overwrite_results=False)
        else:
            evaluation.run(encoder, output_folder=args.output_dir, overwrite_results=False)
Usage
acge Chinese model series
Usage with the sentence-transformers library:
from sentence_transformers import SentenceTransformer
sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
Usage with the sentence-transformers library, selecting a different embedding dimension:
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer
sentences = ["数据1", "数据2"]
model = SentenceTransformer('acge_text_embedding')
embeddings = model.encode(sentences, normalize_embeddings=False)
matryoshka_dim = 1024
embeddings = embeddings[..., :matryoshka_dim] # Shrink the embedding dimensions
embeddings = normalize(embeddings, norm="l2", axis=1)
print(embeddings.shape)
# => (2, 1024)