Model Overview

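This model is mainly used to compute sentence similarity. It supports French and English and is suited to natural language processing tasks such as clustering, reranking, retrieval, and classification.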

Model Features

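Multilingual support

Supports sentence similarity computation for French and English.

Multi-task use

Applicable to a range of natural language processing tasks, including clustering, reranking, retrieval, and classification.

High performance

Reports a v_measure of 56.73 on MTEB AlloProfClusteringP2P and a MAP of 65.17 on MTEB AlloprofReranking.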

Model Capabilities

Sentence similarity computation

Feature extraction

Text clustering

Text reranking

Text retrieval

Text classification

Use Cases

Education

Educational content clustering

Automatic clustering of educational content, helping to organize and manage educational resources.

Achieved a v_measure of 56.727 on the MTEB AlloProfClusteringP2P dataset.
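For readers unfamiliar with the metric, v_measure is the harmonic mean of homogeneity and completeness, both derived from label entropies. The following is a minimal standard-library sketch of that definition, not the MTEB implementation; the function names are illustrative, and it returns a value in [0, 1], whereas the scores reported here appear to be on a 0 to 100 scale.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_true = entropy(true_labels)
    h_pred = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    # Completeness: all members of a class land in the same cluster.
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)
```

A clustering that matches the true classes up to a renaming of cluster ids scores 1.0, while putting every item in one cluster scores 0.0 when there is more than one true class.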

Legal

Legal document retrieval

Retrieval and reranking of legal documents to make legal research more efficient.

Performance on the MTEB BSARDRetrieval dataset is limited: map_at_100 is 0.011.

News

News classification

Automatic classification of news articles, helping news platforms organize their content.

Achieved an accuracy of 70.521 on the MTEB MasakhaNEWSClassification dataset.

library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
model-index:
- name: bge-fr-en
  results:
- task: type: Clustering dataset: type: lyon-nlp/alloprof name: MTEB AlloProfClusteringP2P config: default split: test revision: 392ba3f5bcc8c51f578786c1fc3dae648662cb9b metrics:
  - type: v_measure value: 56.727459716713
- task: type: Clustering dataset: type: lyon-nlp/alloprof name: MTEB AlloProfClusteringS2S config: default split: test revision: 392ba3f5bcc8c51f578786c1fc3dae648662cb9b metrics:
  - type: v_measure value: 38.19920006179227
- task: type: Reranking dataset: type: lyon-nlp/mteb-fr-reranking-alloprof-s2p name: MTEB AlloprofReranking config: default split: test revision: e40c8a63ce02da43200eccb5b0846fcaa888f562 metrics:
  - type: map value: 65.17465797499942
  - type: mrr value: 66.51400197384653
- task: type: Retrieval dataset: type: lyon-nlp/alloprof name: MTEB AlloprofRetrieval config: default split: test revision: 2df7bee4080bedf2e97de3da6bd5c7bc9fc9c4d2 metrics:
  - type: map_at_1 value: 29.836000000000002
  - type: map_at_10 value: 39.916000000000004
  - type: map_at_100 value: 40.816
  - type: map_at_1000 value: 40.877
  - type: map_at_3 value: 37.294
  - type: map_at_5 value: 38.838
  - type: mrr_at_1 value: 29.836000000000002
  - type: mrr_at_10 value: 39.916000000000004
  - type: mrr_at_100 value: 40.816
  - type: mrr_at_1000 value: 40.877
  - type: mrr_at_3 value: 37.294
  - type: mrr_at_5 value: 38.838
  - type: ndcg_at_1 value: 29.836000000000002
  - type: ndcg_at_10 value: 45.097
  - type: ndcg_at_100 value: 49.683
  - type: ndcg_at_1000 value: 51.429
  - type: ndcg_at_3 value: 39.717
  - type: ndcg_at_5 value: 42.501
  - type: precision_at_1 value: 29.836000000000002
  - type: precision_at_10 value: 6.149
  - type: precision_at_100 value: 0.8340000000000001
  - type: precision_at_1000 value: 0.097
  - type: precision_at_3 value: 15.576
  - type: precision_at_5 value: 10.698
  - type: recall_at_1 value: 29.836000000000002
  - type: recall_at_10 value: 61.485
  - type: recall_at_100 value: 83.428
  - type: recall_at_1000 value: 97.461
  - type: recall_at_3 value: 46.727000000000004
  - type: recall_at_5 value: 53.489
- task: type: Classification dataset: type: mteb/amazon_reviews_multi name: MTEB AmazonReviewsClassification (fr) config: fr split: test revision: 1399c76144fd37290681b995c656ef9b2e06e26d metrics:
  - type: accuracy value: 42.332
  - type: f1 value: 40.801800929404344
- task: type: Retrieval dataset: type: maastrichtlawtech/bsard name: MTEB BSARDRetrieval config: default split: test revision: 5effa1b9b5fa3b0f9e12523e6e43e5f86a6e6d59 metrics:
  - type: map_at_1 value: 0.0
  - type: map_at_10 value: 0.0
  - type: map_at_100 value: 0.011000000000000001
  - type: map_at_1000 value: 0.018000000000000002
  - type: map_at_3 value: 0.0
  - type: map_at_5 value: 0.0
  - type: mrr_at_1 value: 0.0
  - type: mrr_at_10 value: 0.0
  - type: mrr_at_100 value: 0.011000000000000001
  - type: mrr_at_1000 value: 0.018000000000000002
  - type: mrr_at_3 value: 0.0
  - type: mrr_at_5 value: 0.0
  - type: ndcg_at_1 value: 0.0
  - type: ndcg_at_10 value: 0.0
  - type: ndcg_at_100 value: 0.13999999999999999
  - type: ndcg_at_1000 value: 0.457
  - type: ndcg_at_3 value: 0.0
  - type: ndcg_at_5 value: 0.0
  - type: precision_at_1 value: 0.0
  - type: precision_at_10 value: 0.0
  - type: precision_at_100 value: 0.009000000000000001
  - type: precision_at_1000 value: 0.004
  - type: precision_at_3 value: 0.0
  - type: precision_at_5 value: 0.0
  - type: recall_at_1 value: 0.0
  - type: recall_at_10 value: 0.0
  - type: recall_at_100 value: 0.901
  - type: recall_at_1000 value: 3.604
  - type: recall_at_3 value: 0.0
  - type: recall_at_5 value: 0.0
- task: type: Clustering dataset: type: lyon-nlp/clustering-hal-s2s name: MTEB HALClusteringS2S config: default split: test revision: e06ebbbb123f8144bef1a5d18796f3dec9ae2915 metrics:
  - type: v_measure value: 24.1294565929144
- task: type: Clustering dataset: type: mlsum name: MTEB MLSUMClusteringP2P config: default split: test revision: b5d54f8f3b61ae17845046286940f03c6bc79bc7 metrics:
  - type: v_measure value: 42.12040762356958
- task: type: Clustering dataset: type: mlsum name: MTEB MLSUMClusteringS2S config: default split: test revision: b5d54f8f3b61ae17845046286940f03c6bc79bc7 metrics:
  - type: v_measure value: 36.69102548662494
- task: type: Classification dataset: type: mteb/mtop_domain name: MTEB MTOPDomainClassification (fr) config: fr split: test revision: d80d48c1eb48d3562165c59d59d0034df9fff0bf metrics:
  - type: accuracy value: 90.3946132164109
  - type: f1 value: 90.15608090764273
- task: type: Classification dataset: type: mteb/mtop_intent name: MTEB MTOPIntentClassification (fr) config: fr split: test revision: ae001d0e6b1228650b7bd1c2c65fb50ad11a8aba metrics:
  - type: accuracy value: 60.87691825869088
  - type: f1 value: 43.56160799721332
- task: type: Classification dataset: type: masakhane/masakhanews name: MTEB MasakhaNEWSClassification (fra) config: fra split: test revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60 metrics:
  - type: accuracy value: 70.52132701421802
  - type: f1 value: 66.7911493789742
- task: type: Clustering dataset: type: masakhane/masakhanews name: MTEB MasakhaNEWSClusteringP2P (fra) config: fra split: test revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60 metrics:
  - type: v_measure value: 34.60975901092521
- task: type: Clustering dataset: type: masakhane/masakhanews name: MTEB MasakhaNEWSClusteringS2S (fra) config: fra split: test revision: 8ccc72e69e65f40c70e117d8b3c08306bb788b60 metrics:
  - type: v_measure value: 32.8092912406207
- task: type: Classification dataset: type: mteb/amazon_massive_intent name: MTEB MassiveIntentClassification (fr) config: fr split: test revision: 31efe3c427b0bae9c22cbb560b8f15491cc6bed7 metrics:
  - type: accuracy value: 66.70477471418964
  - type: f1 value: 64.4848306188641
- task: type: Classification dataset: type: mteb/amazon_massive_scenario name: MTEB MassiveScenarioClassification (fr) config: fr split: test revision: 7d571f92784cd94a019292a1f45445077d0ef634 metrics:
  - type: accuracy value: 74.57969065232011
  - type: f1 value: 73.58251655418402
- task: type: Retrieval dataset: type: jinaai/mintakaqa name: MTEB MintakaRetrieval (fr) config: fr split: test revision: efa78cc2f74bbcd21eff2261f9e13aebe40b814e metrics:
  - type: map_at_1 value: 14.005
  - type: map_at_10 value: 21.279999999999998
  - type: map_at_100 value: 22.288
  - type: map_at_1000 value: 22.404
  - type: map_at_3 value: 19.151
  - type: map_at_5 value: 20.322000000000003
  - type: mrr_at_1 value: 14.005
  - type: mrr_at_10 value: 21.279999999999998
  - type: mrr_at_100 value: 22.288
  - type: mrr_at_1000 value: 22.404
  - type: mrr_at_3 value: 19.151
  - type: mrr_at_5 value: 20.322000000000003
  - type: ndcg_at_1 value: 14.005
  - type: ndcg_at_10 value: 25.173000000000002
  - type: ndcg_at_100 value: 30.452
  - type: ndcg_at_1000 value: 34.241
  - type: ndcg_at_3 value: 20.768
  - type: ndcg_at_5 value: 22.869
  - type: precision_at_1 value: 14.005
  - type: precision_at_10 value: 3.759
  - type: precision_at_100 value: 0.631
  - type: precision_at_1000 value: 0.095
  - type: precision_at_3 value: 8.477
  - type: precision_at_5 value: 6.101999999999999
  - type: recall_at_1 value: 14.005
  - type: recall_at_10 value: 37.592
  - type: recall_at_100 value: 63.144999999999996
  - type: recall_at_1000 value: 94.513
  - type: recall_at_3 value: 25.430000000000003
  - type: recall_at_5 value: 30.508000000000003
- task: type: PairClassification dataset: type: GEM/opusparcus name: MTEB OpusparcusPC (fr) config: fr split: test revision: 9e9b1f8ef51616073f47f306f7f47dd91663f86a metrics:
  - type: cos_sim_accuracy value: 81.60762942779292
  - type: cos_sim_ap value: 93.33850264444463
  - type: cos_sim_f1 value: 87.24705882352941
  - type: cos_sim_precision value: 82.91592128801432
  - type: cos_sim_recall value: 92.05561072492551
  - type: dot_accuracy value: 81.60762942779292
  - type: dot_ap value: 93.33850264444463
  - type: dot_f1 value: 87.24705882352941
  - type: dot_precision value: 82.91592128801432
  - type: dot_recall value: 92.05561072492551
  - type: euclidean_accuracy value: 81.60762942779292
  - type: euclidean_ap value: 93.3384939260791
  - type: euclidean_f1 value: 87.24705882352941
  - type: euclidean_precision value: 82.91592128801432
  - type: euclidean_recall value: 92.05561072492551
  - type: manhattan_accuracy value: 81.60762942779292
  - type: manhattan_ap value: 93.27064794794664
  - type: manhattan_f1 value: 87.27440999537251
  - type: manhattan_precision value: 81.7157712305026
  - type: manhattan_recall value: 93.64448857994041
  - type: max_accuracy value: 81.60762942779292
  - type: max_ap value: 93.33850264444463
  - type: max_f1 value: 87.27440999537251
- task: type: PairClassification dataset: type: paws-x name: MTEB PawsX (fr) config: fr split: test revision: 8a04d940a42cd40658986fdd8e3da561533a3646 metrics:
  - type: cos_sim_accuracy value: 61.95
  - type: cos_sim_ap value: 60.8497942066519
  - type: cos_sim_f1 value: 62.53032928942807
  - type: cos_sim_precision value: 45.50958627648839
  - type: cos_sim_recall value: 99.88925802879291
  - type: dot_accuracy value: 61.95
  - type: dot_ap value: 60.83772617132806
  - type: dot_f1 value: 62.53032928942807
  - type: dot_precision value: 45.50958627648839
  - type: dot_recall value: 99.88925802879291
  - type: euclidean_accuracy value: 61.95
  - type: euclidean_ap value: 60.8497942066519
  - type: euclidean_f1 value: 62.53032928942807
  - type: euclidean_precision value: 45.50958627648839
  - type: euclidean_recall value: 99.88925802879291
  - type: manhattan_accuracy value: 61.9
  - type: manhattan_ap value: 60.87914286416435
  - type: manhattan_f1 value: 62.491349480968864
  - type: manhattan_precision value: 45.44539506794162
  - type: manhattan_recall value: 100.0
  - type: max_accuracy value: 61.95
  - type: max_ap value: 60.87914286416435
  - type: max_f1 value: 62.53032928942807
- task: type: STS dataset: type: Lajavaness/SICK-fr name: MTEB SICKFr config: default split: test revision: e077ab4cf4774a1e36d86d593b150422fafd8e8a metrics:
  - type: cos_sim_pearson value: 81.24400370393097
  - type: cos_sim_spearman value: 75.50548831172674
  - type: euclidean_pearson value: 77.81039134726188
  - type: euclidean_spearman value: 75.50504199480463
  - type: manhattan_pearson value: 77.79383923445839
  - type: manhattan_spearman value: 75.472882776806
- task: type: STS dataset: type: mteb/sts22-crosslingual-sts name: MTEB STS22 (fr) config: fr split: test revision: eea2b4fe26a775864c896887d910b76a8098ad3f metrics:
  - type: cos_sim_pearson value: 80.48474973785514
  - type: cos_sim_spearman value: 81.69566405041475
  - type: euclidean_pearson value: 78.32784472269549
  - type: euclidean_spearman value: 81.69566405041475
  - type: manhattan_pearson value: 78.2856100079857
  - type: manhattan_spearman value: 81.84463256785325
- task: type: STS dataset: type: PhilipMay/stsb_multi_mt name: MTEB STSBenchmarkMultilingualSTS (fr) config: fr split: test revision: 93d57ef91790589e3ce9c365164337a8a78b7632 metrics:
  - type: cos_sim_pearson value: 80.68785966129913
  - type: cos_sim_spearman value: 81.29936344904975
  - type: euclidean_pearson value: 80.25462090186443
  - type: euclidean_spearman value: 81.29928746010391
  - type: manhattan_pearson value: 80.17083094559602
  - type: manhattan_spearman value: 81.18921827402406
- task: type: Summarization dataset: type: lyon-nlp/summarization-summeval-fr-p2p name: MTEB SummEvalFr config: default split: test revision: b385812de6a9577b6f4d0f88c6a6e35395a94054 metrics:
  - type: cos_sim_pearson value: 31.66113105701837
  - type: cos_sim_spearman value: 30.13316633681715
  - type: dot_pearson value: 31.66113064418324
  - type: dot_spearman value: 30.13316633681715
- task: type: Reranking dataset: type: lyon-nlp/mteb-fr-reranking-syntec-s2p name: MTEB SyntecReranking config: default split: test revision: b205c5084a0934ce8af14338bf03feb19499c84d metrics:
  - type: map value: 85.43333333333334
  - type: mrr value: 85.43333333333334
- task: type: Retrieval dataset: type: lyon-nlp/mteb-fr-retrieval-syntec-s2p name: MTEB SyntecRetrieval config: default split: test revision: aa460cd4d177e6a3c04fcd2affd95e8243289033 metrics:
  - type: map_at_1 value: 65.0
  - type: map_at_10 value: 75.19200000000001
  - type: map_at_100 value: 75.77000000000001
  - type: map_at_1000 value: 75.77000000000001
  - type: map_at_3 value: 73.667
  - type: map_at_5 value: 75.067
  - type: mrr_at_1 value: 65.0
  - type: mrr_at_10 value: 75.19200000000001
  - type: mrr_at_100 value: 75.77000000000001
  - type: mrr_at_1000 value: 75.77000000000001
  - type: mrr_at_3 value: 73.667
  - type: mrr_at_5 value: 75.067
  - type: ndcg_at_1 value: 65.0
  - type: ndcg_at_10 value: 79.145
  - type: ndcg_at_100 value: 81.34400000000001
  - type: ndcg_at_1000 value: 81.34400000000001
  - type: ndcg_at_3 value: 76.333
  - type: ndcg_at_5 value: 78.82900000000001
  - type: precision_at_1 value: 65.0
  - type: precision_at_10 value: 9.1
  - type: precision_at_100 value: 1.0
  - type: precision_at_1000 value: 0.1
  - type: precision_at_3 value: 28.000000000000004
  - type: precision_at_5 value: 18.0
  - type: recall_at_1 value: 65.0
  - type: recall_at_10 value: 91.0
  - type: recall_at_100 value: 100.0
  - type: recall_at_1000 value: 100.0
  - type: recall_at_3 value: 84.0
  - type: recall_at_5 value: 90.0
- task: type: Retrieval dataset: type: jinaai/xpqa name: MTEB XPQARetrieval (fr) config: fr split: test revision: c99d599f0a6ab9b85b065da6f9d94f9cf731679f metrics:
  - type: map_at_1 value: 40.225
  - type: map_at_10 value: 61.833000000000006
  - type: map_at_100 value: 63.20400000000001
  - type: map_at_1000 value: 63.27
  - type: map_at_3 value: 55.593
  - type: map_at_5 value: 59.65200000000001
  - type: mrr_at_1 value: 63.284
  - type: mrr_at_10 value: 71.351
  - type: mrr_at_100 value: 71.772
  - type: mrr_at_1000 value: 71.786
  - type: mrr_at_3 value: 69.381
  - type: mrr_at_5 value: 70.703
  - type: ndcg_at_1 value: 63.284
  - type: ndcg_at_10 value: 68.49199999999999
  - type: ndcg_at_100 value: 72.79299999999999
  - type: ndcg_at_1000 value: 73.735
  - type: ndcg_at_3 value: 63.278
  - type: ndcg_at_5 value: 65.19200000000001
  - type: precision_at_1 value: 63.284
  - type: precision_at_10 value: 15.661
  - type: precision_at_100 value: 1.9349999999999998
  - type: precision_at_1000 value: 0.207
  - type: precision_at_3 value: 38.273
  - type: precision_at_5 value: 27.397
  - type: recall_at_1 value: 40.225
  - type: recall_at_10 value: 77.66999999999999
  - type: recall_at_100 value: 93.887
  - type: recall_at_1000 value: 99.70599999999999
  - type: recall_at_3 value: 61.133
  - type: recall_at_5 value: 69.789

A custom French/English specialized embedding model: manu/bge-fr-en

This is a sentence-transformers model: it maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This model is a fine-tuned version of BGE M3's dense model, trained on custom French and English data.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

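from sentence_transformers import SentenceTransformer

# Sentences to embed (French or English).
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub (downloaded on first use).
model = SentenceTransformer('manu/bge-fr-en')

# encode() returns one 1024-dimensional vector per input sentence.
embeddings = model.encode(sentences)
print(embeddings)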
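Once sentences are embedded, similarity is typically measured with cosine similarity. Because the architecture listed under Full Model Architecture ends with a Normalize() layer, the embeddings should already be unit length, so the dot product and cosine similarity coincide. The sketch below uses only the standard library and toy 3-dimensional vectors in place of real model output; the helper names are illustrative and not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy unit-length vectors standing in for real 1024-dimensional embeddings.
toy_embeddings = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)
```

With the real model, the same helper could be called as `similarity_matrix(model.encode(sentences).tolist())`, assuming `encode` returns a NumPy array as it does by default.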

Evaluation Results

This model was evaluated using the MTEB package.
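The retrieval entries in the metadata report metrics such as recall_at_k and mrr_at_k. As a rough guide to what those names mean, here is a simplified standard-library sketch; it is not the MTEB implementation, the function names and toy document ids are made up for illustration, and it returns fractions in [0, 1], whereas the reported values appear to be percentages.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top-k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Two toy queries: a ranked result list and the set of relevant document ids.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d5", "d9"], {"d9", "d4"}),
]
mean_recall_at_3 = mean_over_queries(recall_at_k, runs, 3)
mrr_at_3 = mean_over_queries(reciprocal_rank_at_k, runs, 3)
```

In this toy run, the first query finds its only relevant document at rank 2, and the second finds one of its two relevant documents at rank 3.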

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
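The architecture printed above has three stages: a Transformer that produces one hidden state per token, a Pooling layer with pooling_mode_cls_token set to True, and a Normalize() layer. The last two stages can be sketched in plain Python as follows. This illustrates the idea under the assumption that CLS pooling takes the first token's vector, and is not the sentence-transformers code itself; the toy 2-dimensional hidden states stand in for the model's 1024-dimensional ones.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the CLS position) as the sentence vector."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy input: 3 tokens with 2-dimensional hidden states.
token_embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
```

Because every output vector has unit length, comparing two embeddings by dot product gives the same ranking as cosine similarity.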