🚀 SentenceTransformer
This is a sentence-transformers model trained on the train_set dataset. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
🚀 Quick Start
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from 🤗 Hub
model = SentenceTransformer("dragonkue/bge-m3-ko")
# Run inference
sentences = [
'수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?',
'내년부터 저소득층 1세 미만 아동의 \n의료비 부담이 더 낮아진다!\n의료급여제도 개요\n□ (목적) 생활유지 능력이 없거나 생활이 어려운 국민들에게 발생하는 질병, 부상, 출산 등에 대해 국가가 의료서비스 제공\n□ (지원대상) 국민기초생활보장 수급권자, 타 법에 의한 수급권자 등\n\n| 구분 | 국민기초생활보장법에 의한 수급권자 | 국민기초생활보장법 이외의 타 법에 의한 수급권자 |\n| --- | --- | --- |\n| 1종 | ○ 국민기초생활보장 수급권자 중 근로능력이 없는 자만으로 구성된 가구 - 18세 미만, 65세 이상 - 4급 이내 장애인 - 임산부, 병역의무이행자 등 | ○ 이재민(재해구호법) ○ 의상자 및 의사자의 유족○ 국내 입양된 18세 미만 아동○ 국가유공자 및 그 유족․가족○ 국가무형문화재 보유자 및 그 가족○ 새터민(북한이탈주민)과 그 가족○ 5․18 민주화운동 관련자 및 그 유가족○ 노숙인 ※ 행려환자 (의료급여법 시행령) |\n| 2종 | ○ 국민기초생활보장 수급권자 중 근로능력이 있는 가구 | - |\n',
'이어 이날 오후 1시30분부터 열릴 예정이던 스노보드 여자 슬로프스타일 예선 경기는 연기를 거듭하다 취소됐다. 조직위는 예선 없이 다음 날 결선에서 참가자 27명이 한번에 경기해 순위를 가리기로 했다.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
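For retrieval-style use, you can also encode a query and a set of passages separately and rank the passages by similarity. The following is a minimal sketch, not an official snippet from the model authors; the query and (abbreviated) passages are reused from the example above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/bge-m3-ko")

query = "수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?"
passages = [
    "내년부터 저소득층 1세 미만 아동의 의료비 부담이 더 낮아진다! 의료급여제도 개요 ...",
    "이어 이날 오후 1시30분부터 열릴 예정이던 스노보드 여자 슬로프스타일 예선 경기는 연기를 거듭하다 취소됐다. ...",
]

# Encode the query and passages, then rank the passages by similarity to the query
query_emb = model.encode([query])                   # shape [1, 1024]
passage_emb = model.encode(passages)                # shape [2, 1024]
scores = model.similarity(query_emb, passage_emb)   # shape [1, 2]
best = int(scores[0].argmax())
print(best, float(scores[0][best]))                 # index and score of the most relevant passage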
✨ Key Features
- Its ability to handle languages other than Chinese and English is limited, so additional training is needed to optimize it for other languages.
- This model has been additionally trained on a Korean dataset.
📚 Documentation
Model Details
Attribute | Details |
---|---|
Model Type | Sentence Transformer (Transformer encoder) |
Maximum Sequence Length | 8192 tokens |
Output Dimensionality | 1024 dimensions |
Similarity Function | Cosine Similarity |
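These details can be checked programmatically. A small sketch, assuming sentence-transformers >= 3.0 for the similarity_fn_name attribute:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/bge-m3-ko")
print(model.max_seq_length)                      # 8192
print(model.get_sentence_embedding_dimension())  # 1024
print(model.similarity_fn_name)                  # "cosine"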
模型来源
- 文档:Sentence Transformers 文档
- 仓库:GitHub 上的 Sentence Transformers
- Hugging Face:Hugging Face 上的 Sentence Transformers
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
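For reference, the same three-module pipeline can be assembled by hand with the sentence-transformers modules API. This is only an illustrative sketch: the base checkpoint name BAAI/bge-m3 is an assumption, and in practice you would simply load the released model as shown in the Quick Start.

from sentence_transformers import SentenceTransformer, models

# Assumed base checkpoint; load the released model directly for real use.
transformer = models.Transformer("BAAI/bge-m3", max_seq_length=8192)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),  # 1024
    pooling_mode="cls",                          # CLS-token pooling, matching the architecture above
)
normalize = models.Normalize()                   # unit-normalize the sentence embeddings

model = SentenceTransformer(modules=[transformer, pooling, normalize])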
🔧 Technical Details
Evaluation Metrics
- NDCG, MRR, and MAP are rank-aware metrics, while accuracy, precision, and recall are rank-agnostic. (For example, when evaluating the top 10 retrieved results, a correct document ranked 1st scores differently from one ranked 10th under the rank-aware metrics, whereas accuracy, precision, and recall give the same score as long as it appears anywhere in the top 10; see the sketch below.)
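A tiny, self-contained illustration (not part of the benchmark code) of how the two metric families behave when a single relevant document sits at rank 1 versus rank 10:

import numpy as np

def recall_at_k(ranked_ids, relevant_id, k=10):
    # Rank-agnostic: only checks whether the relevant document is in the top k
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    # Rank-aware: the contribution decays with the position of the relevant document
    for pos, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / np.log2(pos + 1)
    return 0.0

ranking_a = ["rel"] + [f"d{i}" for i in range(9)]   # relevant document at rank 1
ranking_b = [f"d{i}" for i in range(9)] + ["rel"]   # relevant document at rank 10

print(recall_at_k(ranking_a, "rel"), recall_at_k(ranking_b, "rel"))  # 1.0 1.0
print(ndcg_at_k(ranking_a, "rel"), ndcg_at_k(ranking_b, "rel"))      # 1.0 vs ~0.289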
Information Retrieval
- The Korean embedding benchmark has a relatively long corpus: the third quartile (75th percentile) of string lengths is 1024 characters.
Korean Embedding Benchmark with AutoRAG
This is a benchmark of Korean embedding models. (https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark)
Top-k 1

Model Name | F1 | Recall | Precision | mAP | mRR | NDCG
--- | --- | --- | --- | --- | --- | ---
paraphrase-multilingual-mpnet-base-v2 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596
KoSimCSE-roberta | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298 | 0.4298
Cohere embed-multilingual-v3.0 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596 | 0.3596
openai ada 002 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737 | 0.4737
multilingual-e5-large-instruct | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649 | 0.4649
Upstage Embedding | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579 | 0.6579
paraphrase-multilingual-MiniLM-L12-v2 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982 | 0.2982
openai_embed_3_small | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439 | 0.5439
ko-sroberta-multitask | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211 | 0.4211
openai_embed_3_large | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053 | 0.6053
KU-HIAI-ONTHEIT-large-v1 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105 | 0.7105
KU-HIAI-ONTHEIT-large-v1.1 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193 | 0.7193
kf-deberta-multitask | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561 | 0.4561
gte-multilingual-base | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877 | 0.5877
KoE5 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018 | 0.7018
BGE-m3 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578 | 0.6578
bge-m3-korean | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351 | 0.5351
BGE-m3-ko | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456 | 0.7456
Top-k 3

Model Name | F1 | Recall | Precision | mAP | mRR | NDCG
--- | --- | --- | --- | --- | --- | ---
paraphrase-multilingual-mpnet-base-v2 | 0.2368 | 0.4737 | 0.1579 | 0.2032 | 0.2032 | 0.2712
KoSimCSE-roberta | 0.3026 | 0.6053 | 0.2018 | 0.2661 | 0.2661 | 0.3515
Cohere embed-multilingual-v3.0 | 0.2851 | 0.5702 | 0.1901 | 0.2515 | 0.2515 | 0.3321
openai ada 002 | 0.3553 | 0.7105 | 0.2368 | 0.3202 | 0.3202 | 0.4186
multilingual-e5-large-instruct | 0.3333 | 0.6667 | 0.2222 | 0.2909 | 0.2909 | 0.3856
Upstage Embedding | 0.4211 | 0.8421 | 0.2807 | 0.3509 | 0.3509 | 0.4743
paraphrase-multilingual-MiniLM-L12-v2 | 0.2061 | 0.4123 | 0.1374 | 0.1740 | 0.1740 | 0.2340
openai_embed_3_small | 0.3640 | 0.7281 | 0.2427 | 0.3026 | 0.3026 | 0.4097
ko-sroberta-multitask | 0.2939 | 0.5877 | 0.1959 | 0.2500 | 0.2500 | 0.3351
openai_embed_3_large | 0.3947 | 0.7895 | 0.2632 | 0.3348 | 0.3348 | 0.4491
KU-HIAI-ONTHEIT-large-v1 | 0.4386 | 0.8772 | 0.2924 | 0.3421 | 0.3421 | 0.4766
KU-HIAI-ONTHEIT-large-v1.1 | 0.4430 | 0.8860 | 0.2953 | 0.3406 | 0.3406 | 0.4778
kf-deberta-multitask | 0.3158 | 0.6316 | 0.2105 | 0.2792 | 0.2792 | 0.3679
gte-multilingual-base | 0.4035 | 0.8070 | 0.2690 | 0.3450 | 0.3450 | 0.4614
KoE5 | 0.4254 | 0.8509 | 0.2836 | 0.3173 | 0.3173 | 0.4514
BGE-m3 | 0.4254 | 0.8508 | 0.2836 | 0.3421 | 0.3421 | 0.4701
bge-m3-korean | 0.3684 | 0.7368 | 0.2456 | 0.3143 | 0.3143 | 0.4207
BGE-m3-ko | 0.4517 | 0.9035 | 0.3011 | 0.3494 | 0.3494 | 0.4886
Top-k 5

Model Name | F1 | Recall | Precision | mAP | mRR | NDCG
--- | --- | --- | --- | --- | --- | ---
paraphrase-multilingual-mpnet-base-v2 | 0.1813 | 0.5439 | 0.1088 | 0.1575 | 0.1575 | 0.2491
KoSimCSE-roberta | 0.2164 | 0.6491 | 0.1298 | 0.1751 | 0.1751 | 0.2873
Cohere embed-multilingual-v3.0 | 0.2076 | 0.6228 | 0.1246 | 0.1640 | 0.1640 | 0.2731
openai ada 002 | 0.2602 | 0.7807 | 0.1561 | 0.2139 | 0.2139 | 0.3486
multilingual-e5-large-instruct | 0.2544 | 0.7632 | 0.1526 | 0.2194 | 0.2194 | 0.3487
Upstage Embedding | 0.2982 | 0.8947 | 0.1789 | 0.2237 | 0.2237 | 0.3822
paraphrase-multilingual-MiniLM-L12-v2 | 0.1637 | 0.4912 | 0.0982 | 0.1437 | 0.1437 | 0.2264
openai_embed_3_small | 0.2690 | 0.8070 | 0.1614 | 0.2148 | 0.2148 | 0.3553
ko-sroberta-multitask | 0.2164 | 0.6491 | 0.1298 | 0.1697 | 0.1697 | 0.2835
openai_embed_3_large | 0.2807 | 0.8421 | 0.1684 | 0.2088 | 0.2088 | 0.3586
KU-HIAI-ONTHEIT-large-v1 | 0.3041 | 0.9123 | 0.1825 | 0.2137 | 0.2137 | 0.3783
KU-HIAI-ONTHEIT-large-v1.1 | 0.3099 | 0.9298 | 0.1860 | 0.2148 | 0.2148 | 0.3834
kf-deberta-multitask | 0.2281 | 0.6842 | 0.1368 | 0.1724 | 0.1724 | 0.2939
gte-multilingual-base | 0.2865 | 0.8596 | 0.1719 | 0.2096 | 0.2096 | 0.3637
KoE5 | 0.2982 | 0.8947 | 0.1789 | 0.2054 | 0.2054 | 0.3678
BGE-m3 | 0.3041 | 0.9123 | 0.1825 | 0.2193 | 0.2193 | 0.3832
bge-m3-korean | 0.2661 | 0.7982 | 0.1596 | 0.2116 | 0.2116 | 0.3504
BGE-m3-ko | 0.3099 | 0.9298 | 0.1860 | 0.2098 | 0.2098 | 0.3793
Top-k 10

Model Name | F1 | Recall | Precision | mAP | mRR | NDCG
--- | --- | --- | --- | --- | --- | ---
paraphrase-multilingual-mpnet-base-v2 | 0.1212 | 0.6667 | 0.0667 | 0.1197 | 0.1197 | 0.2382
KoSimCSE-roberta | 0.1324 | 0.7281 | 0.0728 | 0.1080 | 0.1080 | 0.2411
Cohere embed-multilingual-v3.0 | 0.1324 | 0.7281 | 0.0728 | 0.1150 | 0.1150 | 0.2473
openai ada 002 | 0.1563 | 0.8596 | 0.0860 | 0.1051 | 0.1051 | 0.2673
multilingual-e5-large-instruct | 0.1483 | 0.8158 | 0.0816 | 0.0980 | 0.0980 | 0.2520
Upstage Embedding | 0.1707 | 0.9386 | 0.0939 | 0.1078 | 0.1078 | 0.2848
paraphrase-multilingual-MiniLM-L12-v2 | 0.1053 | 0.5789 | 0.0579 | 0.0961 | 0.0961 | 0.2006
openai_embed_3_small | 0.1547 | 0.8509 | 0.0851 | 0.0984 | 0.0984 | 0.2593
ko-sroberta-multitask | 0.1276 | 0.7018 | 0.0702 | 0.0986 | 0.0986 | 0.2275
openai_embed_3_large | 0.1643 | 0.9035 | 0.0904 | 0.1180 | 0.1180 | 0.2855
KU-HIAI-ONTHEIT-large-v1 | 0.1707 | 0.9386 | 0.0939 | 0.1105 | 0.1105 | 0.2860
KU-HIAI-ONTHEIT-large-v1.1 | 0.1722 | 0.9474 | 0.0947 | 0.1033 | 0.1033 | 0.2822
kf-deberta-multitask | 0.1388 | 0.7632 | 0.0763 | 0.1 | 0.1 | 0.2422
gte-multilingual-base | 0.1675 | 0.9211 | 0.0921 | 0.1066 | 0.1066 | 0.2805
KoE5 | 0.1675 | 0.9211 | 0.0921 | 0.1011 | 0.1011 | 0.2750
BGE-m3 | 0.1707 | 0.9386 | 0.0939 | 0.1130 | 0.1130 | 0.2884
bge-m3-korean | 0.1579 | 0.8684 | 0.0868 | 0.1093 | 0.1093 | 0.2721
BGE-m3-ko | 0.1770 | 0.9736 | 0.0974 | 0.1097 | 0.1097 | 0.2932
Information Retrieval
- Dataset: miracl-ko (https://github.com/project-miracl/miracl). The MIRACL benchmark, built on the Korean Wikipedia dataset, has a relatively short corpus: the third quartile of string lengths is 220 characters.
- Evaluated with InformationRetrievalEvaluator, as sketched below.
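A minimal sketch of running this evaluator with sentence-transformers. The toy queries/corpus/qrels below are placeholders; the actual benchmark builds these dictionaries from the miracl-ko queries, corpus, and relevance judgments.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dragonkue/bge-m3-ko")

# Placeholder data standing in for the miracl-ko splits
queries = {"q1": "수급권자 중 근로 능력이 없는 임산부는 몇 종에 해당하니?"}
corpus = {"d1": "의료급여제도 개요 ...", "d2": "스노보드 여자 슬로프스타일 예선 경기 ..."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="miracl-ko")
results = evaluator(model)
print(results)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100 per similarity function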
Metric | Value |
---|---|
cosine_accuracy@1 | 0.6103 |
cosine_accuracy@3 | 0.8169 |
cosine_accuracy@5 | 0.8732 |
cosine_accuracy@10 | 0.9202 |
cosine_precision@1 | 0.6103 |
cosine_precision@3 | 0.3787 |
cosine_precision@5 | 0.2761 |
cosine_precision@10 | 0.1728 |
cosine_recall@1 | 0.3847 |
cosine_recall@3 | 0.5902 |
cosine_recall@5 | 0.6794 |
cosine_recall@10 | 0.7695 |
cosine_ndcg@10 | 0.6833 |
cosine_mrr@10 | 0.7262 |
cosine_map@100 | 0.6074 |
dot_accuracy@1 | 0.6103 |
dot_accuracy@3 | 0.8169 |
dot_accuracy@5 | 0.8732 |
dot_accuracy@10 | 0.9202 |
dot_precision@1 | 0.6103 |
dot_precision@3 | 0.3787 |
dot_precision@5 | 0.2761 |
dot_precision@10 | 0.1728 |
dot_recall@1 | 0.3847 |
dot_recall@3 | 0.5902 |
dot_recall@5 | 0.6794 |
dot_recall@10 | 0.7695 |
dot_ndcg@10 | 0.6723 |
dot_mrr@10 | 0.7262 |
dot_map@100 | 0.6074 |
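Because the architecture ends with a Normalize() module, the embeddings are unit-length, which is why the dot_* rows above are essentially identical to the cosine_* rows. A quick sanity-check sketch:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dragonkue/bge-m3-ko")
emb = model.encode(["의료급여제도 개요", "스노보드 슬로프스타일"])

print(np.linalg.norm(emb, axis=1))       # approximately [1. 1.]: unit-length vectors
dot = emb @ emb.T                        # dot-product similarity
norms = np.linalg.norm(emb, axis=1)
cos = dot / np.outer(norms, norms)       # cosine similarity
print(np.allclose(dot, cos, atol=1e-5))  # True, since the norms are ~1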
Bias, Risks and Limitations
- Because evaluation results vary by domain, you should compare and evaluate models on your own domain. On the MIRACL benchmark, which uses Korean Wikipedia as the corpus, the cosine_ndcg@10 score dropped by 0.02 points after this additional training. On the finance-domain AutoRAG benchmark, however, the NDCG score at rank 1 improved by 0.09 points. This model may therefore be particularly advantageous for domain-specific use.
- In addition, since the MIRACL benchmark consists of relatively short strings while the Korean embedding benchmark consists of longer strings, this model may be more advantageous if the corpus you intend to use contains longer texts.
Training Hyperparameters
Non-Default Hyperparameters
The batch size follows the paper Text Embeddings by Weakly-Supervised Contrastive Pre-training (https://arxiv.org/pdf/2212.03533); a configuration sketch follows the list below.
- eval_strategy: steps
- per_device_train_batch_size: 32768
- per_device_eval_batch_size: 32768
- learning_rate: 3e-05
- warmup_ratio: 0.03333333333333333
- fp16: True
- batch_sampler: no_duplicates
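A minimal sketch of how these non-default values map onto the sentence-transformers v3 training API. The loss function is not stated in this card, so CachedMultipleNegativesRankingLoss is only an assumption used to illustrate large-batch, in-batch-negative training, and the base checkpoint name is likewise assumed.

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-m3")        # assumed base checkpoint
loss = CachedMultipleNegativesRankingLoss(model)  # assumed loss (not stated in the card)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-ko",
    eval_strategy="steps",
    per_device_train_batch_size=32768,
    per_device_eval_batch_size=32768,
    learning_rate=3e-05,
    warmup_ratio=0.03333333333333333,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

# trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_set, loss=loss)
# trainer.train()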
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 32768
- per_device_eval_batch_size: 32768
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- learning_rate: 3e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.03333333333333333
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: True
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
📄 License
This project is released under the Apache-2.0 license.
📚 Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}