🚀 SentenceTransformer
This is a sentence-transformers model trained on the train_set dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
🚀 Quick Start
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load the model and run inference.
from sentence_transformers import SentenceTransformer
# Download from 🤗 Hub
model = SentenceTransformer("dragonkue/bge-m3-ko")
# Run inference
sentences = [
'수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?',
'내년부터 저소득층 1세 미만 아동의 \n의료비 부담이 더 낮아진다!\n의료급여제도 개요\n□ (목적) 생활유지 능력이 없거나 생활이 어려운 국민들에게 발생하는 질병, 부상, 출산 등에 대해 국가가 의료서비스 제공\n□ (지원대상) 국민기초생활보장 수급권자, 타 법에 의한 수급권자 등\n\n| 구분 | 국민기초생활보장법에 의한 수급권자 | 국민기초생활보장법 이외의 타 법에 의한 수급권자 |\n| --- | --- | --- |\n| 1종 | ○ 국민기초생활보장 수급권자 중 근로능력이 없는 자만으로 구성된 가구 - 18세 미만, 65세 이상 - 4급 이내 장애인 - 임산부, 병역의무이행자 등 | ○ 이재민(재해구호법) ○ 의상자 및 의사자의 유족○ 국내 입양된 18세 미만 아동○ 국가유공자 및 그 유족․가족○ 국가무형문화재 보유자 및 그 가족○ 새터민(북한이탈주민)과 그 가족○ 5․18 민주화운동 관련자 및 그 유가족○ 노숙인 ※ 행려환자 (의료급여법 시행령) |\n| 2종 | ○ 국민기초생활보장 수급권자 중 근로능력이 있는 가구 | - |\n',
'이어 이날 오후 1시30분부터 열릴 예정이던 스노보드 여자 슬로프스타일 예선 경기는 연기를 거듭하다 취소됐다. 조직위는 예선 없이 다음 날 결선에서 참가자 27명이 한번에 경기해 순위를 가리기로 했다.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
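The same embeddings can be used for retrieval. Below is a minimal, hypothetical semantic-search sketch using `sentence_transformers.util.semantic_search`; the query and the two-passage corpus are illustrative placeholders, not data shipped with the model:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Hypothetical query and mini-corpus for illustration; replace with your own data
query = "수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?"
corpus = [
    "근로능력이 없는 자만으로 구성된 가구(임산부 포함)는 의료급여 1종에 해당한다.",
    "스노보드 여자 슬로프스타일 예선 경기는 취소됐다.",
]

query_embedding = model.encode(query)
corpus_embeddings = model.encode(corpus)

# Rank corpus passages by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 4), corpus[hit["corpus_id"]])
```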
✨ Key Features
- The model's ability to handle languages other than Chinese and English is limited, so additional training is needed to optimize it for other languages.
- This model has therefore been additionally trained on a Korean dataset.
📚 Documentation
Model Details
Property | Details |
---|---|
Model Type | Sentence Transformer |
Maximum Sequence Length | 8192 tokens |
Output Dimensionality | 1024 dimensions |
Similarity Function | Cosine Similarity |
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
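The architecture above corresponds to an XLM-RoBERTa encoder with CLS-token pooling followed by L2 normalization. As a rough illustration only (the `SentenceTransformer` API above is the supported way to use the model), the equivalent manual pipeline with 🤗 Transformers would look roughly like this, assuming the transformer weights load directly with `AutoModel`, as is typical for Sentence Transformers checkpoints:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dragonkue/bge-m3-ko")
encoder = AutoModel.from_pretrained("dragonkue/bge-m3-ko")  # XLMRobertaModel

sentences = ["첫 번째 문장", "두 번째 문장"]  # placeholder sentences
batch = tokenizer(sentences, padding=True, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# (1) Pooling with pooling_mode_cls_token=True: take the first token's hidden state
cls_embeddings = outputs.last_hidden_state[:, 0]
# (2) Normalize(): L2-normalize so that dot product equals cosine similarity
embeddings = F.normalize(cls_embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])
```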
🔧 Technical Details
Evaluation Metrics
- NDCG, MRR, and mAP are rank-aware metrics, while accuracy, precision, and recall are rank-unaware. (For example, when evaluating the top 10 retrieved documents, a correct document ranked 1st receives a different rank-aware score than one ranked 10th; as long as both fall within the top 10, accuracy, precision, and recall are identical.) A small worked example of this difference is sketched below.
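A minimal worked example of the difference, assuming a single relevant document with binary relevance (purely illustrative):

```python
import math

def metrics_at_10(rank: int):
    """One relevant document retrieved at the given 1-based rank within the top 10."""
    recall_at_10 = 1.0                      # rank-unaware: the document is in the top 10
    mrr_at_10 = 1.0 / rank                  # rank-aware: reciprocal rank
    ndcg_at_10 = 1.0 / math.log2(rank + 1)  # rank-aware: ideal DCG is 1 here
    return recall_at_10, mrr_at_10, ndcg_at_10

print(metrics_at_10(1))   # (1.0, 1.0, 1.0)
print(metrics_at_10(10))  # (1.0, 0.1, ~0.289)
```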
Information Retrieval
- The Korean embedding benchmark uses a corpus of relatively long strings: the third quartile (75th percentile) of string lengths is 1024 characters.
Korean Embedding Benchmark with AutoRAG
This is a benchmark of Korean embedding models (https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark).
Top-k 1

| Model Name | F1 | Recall | Precision | mAP | mRR | NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| paraphrase-multilingual-mpnet-base-v2 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 |
| KoSimCSE-roberta | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298 |
| Cohere embed-multilingual-v3.0 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 |
| openai ada 002 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737 |
| multilingual-e5-large-instruct | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649 |
| Upstage Embedding | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579 |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982 |
| openai_embed_3_small | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439 |
| ko-sroberta-multitask | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211 |
| openai_embed_3_large | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053 |
| KU-HIAI-ONTHEIT-large-v1 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105 |
| KU-HIAI-ONTHEIT-large-v1.1 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193 |
| kf-deberta-multitask | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561 |
| gte-multilingual-base | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877 |
| KoE5 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018 |
| BGE-m3 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578 |
| bge-m3-korean | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351 |
| BGE-m3-ko | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456 |
Top-k 3

| Model Name | F1 | Recall | Precision | mAP | mRR | NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| paraphrase-multilingual-mpnet-base-v2 | 0.2368 | 0.4737 | 0.1579 | 0.2032 | 0.2032 | 0.2712 |
| KoSimCSE-roberta | 0.3026 | 0.6053 | 0.2018 | 0.2661 | 0.2661 | 0.3515 |
| Cohere embed-multilingual-v3.0 | 0.2851 | 0.5702 | 0.1901 | 0.2515 | 0.2515 | 0.3321 |
| openai ada 002 | 0.3553 | 0.7105 | 0.2368 | 0.3202 | 0.3202 | 0.4186 |
| multilingual-e5-large-instruct | 0.3333 | 0.6667 | 0.2222 | 0.2909 | 0.2909 | 0.3856 |
| Upstage Embedding | 0.4211 | 0.8421 | 0.2807 | 0.3509 | 0.3509 | 0.4743 |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.2061 | 0.4123 | 0.1374 | 0.1740 | 0.1740 | 0.2340 |
| openai_embed_3_small | 0.3640 | 0.7281 | 0.2427 | 0.3026 | 0.3026 | 0.4097 |
| ko-sroberta-multitask | 0.2939 | 0.5877 | 0.1959 | 0.2500 | 0.2500 | 0.3351 |
| openai_embed_3_large | 0.3947 | 0.7895 | 0.2632 | 0.3348 | 0.3348 | 0.4491 |
| KU-HIAI-ONTHEIT-large-v1 | 0.4386 | 0.8772 | 0.2924 | 0.3421 | 0.3421 | 0.4766 |
| KU-HIAI-ONTHEIT-large-v1.1 | 0.4430 | 0.8860 | 0.2953 | 0.3406 | 0.3406 | 0.4778 |
| kf-deberta-multitask | 0.3158 | 0.6316 | 0.2105 | 0.2792 | 0.2792 | 0.3679 |
| gte-multilingual-base | 0.4035 | 0.8070 | 0.2690 | 0.3450 | 0.3450 | 0.4614 |
| KoE5 | 0.4254 | 0.8509 | 0.2836 | 0.3173 | 0.3173 | 0.4514 |
| BGE-m3 | 0.4254 | 0.8508 | 0.2836 | 0.3421 | 0.3421 | 0.4701 |
| bge-m3-korean | 0.3684 | 0.7368 | 0.2456 | 0.3143 | 0.3143 | 0.4207 |
| BGE-m3-ko | 0.4517 | 0.9035 | 0.3011 | 0.3494 | 0.3494 | 0.4886 |
Top-k 5

| Model Name | F1 | Recall | Precision | mAP | mRR | NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| paraphrase-multilingual-mpnet-base-v2 | 0.1813 | 0.5439 | 0.1088 | 0.1575 | 0.1575 | 0.2491 |
| KoSimCSE-roberta | 0.2164 | 0.6491 | 0.1298 | 0.1751 | 0.1751 | 0.2873 |
| Cohere embed-multilingual-v3.0 | 0.2076 | 0.6228 | 0.1246 | 0.1640 | 0.1640 | 0.2731 |
| openai ada 002 | 0.2602 | 0.7807 | 0.1561 | 0.2139 | 0.2139 | 0.3486 |
| multilingual-e5-large-instruct | 0.2544 | 0.7632 | 0.1526 | 0.2194 | 0.2194 | 0.3487 |
| Upstage Embedding | 0.2982 | 0.8947 | 0.1789 | 0.2237 | 0.2237 | 0.3822 |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.1637 | 0.4912 | 0.0982 | 0.1437 | 0.1437 | 0.2264 |
| openai_embed_3_small | 0.2690 | 0.8070 | 0.1614 | 0.2148 | 0.2148 | 0.3553 |
| ko-sroberta-multitask | 0.2164 | 0.6491 | 0.1298 | 0.1697 | 0.1697 | 0.2835 |
| openai_embed_3_large | 0.2807 | 0.8421 | 0.1684 | 0.2088 | 0.2088 | 0.3586 |
| KU-HIAI-ONTHEIT-large-v1 | 0.3041 | 0.9123 | 0.1825 | 0.2137 | 0.2137 | 0.3783 |
| KU-HIAI-ONTHEIT-large-v1.1 | 0.3099 | 0.9298 | 0.1860 | 0.2148 | 0.2148 | 0.3834 |
| kf-deberta-multitask | 0.2281 | 0.6842 | 0.1368 | 0.1724 | 0.1724 | 0.2939 |
| gte-multilingual-base | 0.2865 | 0.8596 | 0.1719 | 0.2096 | 0.2096 | 0.3637 |
| KoE5 | 0.2982 | 0.8947 | 0.1789 | 0.2054 | 0.2054 | 0.3678 |
| BGE-m3 | 0.3041 | 0.9123 | 0.1825 | 0.2193 | 0.2193 | 0.3832 |
| bge-m3-korean | 0.2661 | 0.7982 | 0.1596 | 0.2116 | 0.2116 | 0.3504 |
| BGE-m3-ko | 0.3099 | 0.9298 | 0.1860 | 0.2098 | 0.2098 | 0.3793 |
Top-k 10

| Model Name | F1 | Recall | Precision | mAP | mRR | NDCG |
| --- | --- | --- | --- | --- | --- | --- |
| paraphrase-multilingual-mpnet-base-v2 | 0.1212 | 0.6667 | 0.0667 | 0.1197 | 0.1197 | 0.2382 |
| KoSimCSE-roberta | 0.1324 | 0.7281 | 0.0728 | 0.1080 | 0.1080 | 0.2411 |
| Cohere embed-multilingual-v3.0 | 0.1324 | 0.7281 | 0.0728 | 0.1150 | 0.1150 | 0.2473 |
| openai ada 002 | 0.1563 | 0.8596 | 0.0860 | 0.1051 | 0.1051 | 0.2673 |
| multilingual-e5-large-instruct | 0.1483 | 0.8158 | 0.0816 | 0.0980 | 0.0980 | 0.2520 |
| Upstage Embedding | 0.1707 | 0.9386 | 0.0939 | 0.1078 | 0.1078 | 0.2848 |
| paraphrase-multilingual-MiniLM-L12-v2 | 0.1053 | 0.5789 | 0.0579 | 0.0961 | 0.0961 | 0.2006 |
| openai_embed_3_small | 0.1547 | 0.8509 | 0.0851 | 0.0984 | 0.0984 | 0.2593 |
| ko-sroberta-multitask | 0.1276 | 0.7018 | 0.0702 | 0.0986 | 0.0986 | 0.2275 |
| openai_embed_3_large | 0.1643 | 0.9035 | 0.0904 | 0.1180 | 0.1180 | 0.2855 |
| KU-HIAI-ONTHEIT-large-v1 | 0.1707 | 0.9386 | 0.0939 | 0.1105 | 0.1105 | 0.2860 |
| KU-HIAI-ONTHEIT-large-v1.1 | 0.1722 | 0.9474 | 0.0947 | 0.1033 | 0.1033 | 0.2822 |
| kf-deberta-multitask | 0.1388 | 0.7632 | 0.0763 | 0.1 | 0.1 | 0.2422 |
| gte-multilingual-base | 0.1675 | 0.9211 | 0.0921 | 0.1066 | 0.1066 | 0.2805 |
| KoE5 | 0.1675 | 0.9211 | 0.0921 | 0.1011 | 0.1011 | 0.2750 |
| BGE-m3 | 0.1707 | 0.9386 | 0.0939 | 0.1130 | 0.1130 | 0.2884 |
| bge-m3-korean | 0.1579 | 0.8684 | 0.0868 | 0.1093 | 0.1093 | 0.2721 |
| BGE-m3-ko | 0.1770 | 0.9736 | 0.0974 | 0.1097 | 0.1097 | 0.2932 |
Information Retrieval
- Dataset: miracl-ko (https://github.com/project-miracl/miracl). The miracl benchmark is built on the Korean Wikipedia dataset and uses a corpus of relatively short strings: the third quartile of string lengths is 220 characters.
- Evaluated with InformationRetrievalEvaluator (a minimal usage sketch follows the results table below).
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6103 |
cosine_accuracy@3 | 0.8169 |
cosine_accuracy@5 | 0.8732 |
cosine_accuracy@10 | 0.9202 |
cosine_precision@1 | 0.6103 |
cosine_precision@3 | 0.3787 |
cosine_precision@5 | 0.2761 |
cosine_precision@10 | 0.1728 |
cosine_recall@1 | 0.3847 |
cosine_recall@3 | 0.5902 |
cosine_recall@5 | 0.6794 |
cosine_recall@10 | 0.7695 |
cosine_ndcg@10 | 0.6833 |
cosine_mrr@10 | 0.7262 |
cosine_map@100 | 0.6074 |
dot_accuracy@1 | 0.6103 |
dot_accuracy@3 | 0.8169 |
dot_accuracy@5 | 0.8732 |
dot_accuracy@10 | 0.9202 |
dot_precision@1 | 0.6103 |
dot_precision@3 | 0.3787 |
dot_precision@5 | 0.2761 |
dot_precision@10 | 0.1728 |
dot_recall@1 | 0.3847 |
dot_recall@3 | 0.5902 |
dot_recall@5 | 0.6794 |
dot_recall@10 | 0.7695 |
dot_ndcg@10 | 0.6723 |
dot_mrr@10 | 0.7262 |
dot_map@100 | 0.6074 |
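As referenced above, these metrics come from Sentence Transformers' `InformationRetrievalEvaluator`. A minimal sketch of how such an evaluation can be run; the tiny queries/corpus/relevance dicts below are placeholders, not the actual miracl-ko data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Placeholder data; the reported numbers use the miracl-ko queries and corpus
queries = {"q1": "수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?"}
corpus = {
    "d1": "근로능력이 없는 자만으로 구성된 가구(임산부 포함)는 의료급여 1종에 해당한다.",
    "d2": "스노보드 여자 슬로프스타일 예선 경기는 취소됐다.",
}
relevant_docs = {"q1": {"d1"}}  # query id -> set of relevant corpus ids

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="miracl-ko-sample",
)
results = evaluator(model)
print(results)  # per-metric scores such as cosine_accuracy@k, cosine_recall@k, cosine_ndcg@10, ...
```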
Bias, Risks and Limitations
- Because evaluation results vary by domain, you should compare and evaluate models on your own domain. On the Miracl benchmark, which evaluates against the Korean Wikipedia corpus, the cosine_ndcg@10 score drops by 0.02 after the additional training; on the finance-domain AutoRAG benchmark, however, the NDCG score at rank 1 improves by 0.09. The model may therefore be more advantageous for domain-specific use.
- In addition, since the miracl benchmark consists of a corpus of relatively short strings while the Korean embedding benchmark consists of longer ones, this model may be the better choice if the corpus you plan to use contains longer texts.
Training Hyperparameters
Non-Default Hyperparameters
The batch size follows the paper Text Embeddings by Weakly-Supervised Contrastive Pre-training (https://arxiv.org/pdf/2212.03533). A minimal training sketch using these settings follows the list below.
- eval_strategy: steps
- per_device_train_batch_size: 32768
- per_device_eval_batch_size: 32768
- learning_rate: 3e-05
- warmup_ratio: 0.03333333333333333
- fp16: True
- batch_sampler: no_duplicates
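A minimal training sketch under these settings, assuming a BGE-M3 base checkpoint, a hypothetical `train_set.csv` of (anchor, positive) pairs, and an in-batch-negatives loss; the card documents only the hyperparameter values, so the dataset, loss, and base model here are illustrative assumptions:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# Assumed base checkpoint; the card describes additional Korean training on top of BGE-M3
model = SentenceTransformer("BAAI/bge-m3")

# Hypothetical (anchor, positive) pairs standing in for the undocumented `train_set`
dataset = load_dataset("csv", data_files="train_set.csv")["train"].train_test_split(test_size=0.01)
train_dataset, eval_dataset = dataset["train"], dataset["test"]

# In-batch-negatives loss is a common pairing with the NO_DUPLICATES sampler;
# the card does not state the loss, so treat this as an assumption.
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-ko",
    num_train_epochs=3,
    per_device_train_batch_size=32768,
    per_device_eval_batch_size=32768,
    learning_rate=3e-5,
    warmup_ratio=0.03333333333333333,
    fp16=True,
    eval_strategy="steps",
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate texts within a batch
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()
```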
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 32768
- per_device_eval_batch_size: 32768
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 3e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.03333333333333333
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
📄 License
This project is licensed under the Apache-2.0 License.
📚 Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}







