fine-tuned-embedding-model: an open-source sentence transformer for computing semantic similarity between texts

Fine Tuned Embedding Model

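Developed by svb01

A sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps text to a 384-dimensional vector space and supports tasks such as semantic similarity computation.

Text embedding

Safetensors

#short-text semantic matching #multiple-negatives contrastive learning #risk-management text analysis

Downloads: 17

Released: 9/23/2024

Model overview

The model maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.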

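Model features

Efficient semantic encoding

Encodes text into 384-dimensional vectors that preserve semantic information

Multi-task support

Supports downstream tasks such as semantic similarity computation, text classification, and clustering

Lightweight model

Built on the MiniLM architecture, which reduces compute requirements while maintaining performance

Model capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Feature extraction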

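Use cases

Information retrieval

Document similarity matching

Compute the semantic similarity between documents in order to recommend related documents

Content management

Duplicate content detection

Identify content that is semantically similar enough to count as a duplicate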
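
As a rough illustration of the two use cases above, the following sketch ranks documents against a query and flags near-duplicate pairs using cosine similarity. The three-dimensional vectors, the helper names (`cosine`, `rank_documents`, `find_duplicates`), and the 0.95 threshold are all hypothetical stand-ins. In practice the vectors would be the 384-dimensional embeddings produced by the model, and the threshold would need tuning on real data.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def find_duplicates(doc_vecs, threshold=0.95):
    """Return index pairs whose cosine similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(doc_vecs)), 2)
        if cosine(doc_vecs[i], doc_vecs[j]) >= threshold
    ]


# Toy 3-dimensional stand-ins for real 384-dimensional embeddings.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [0.0, 0.9, 0.1]

ranking = rank_documents(query, docs)
duplicates = find_duplicates(docs)
print(ranking)
print(duplicates)
```

With these toy vectors, the third document ranks first for the query, and the first two documents are the only pair flagged as duplicates.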

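🚀 Sentence transformer model based on sentence-transformers/all-MiniLM-L6-v2

This model was fine-tuned from sentence-transformers/all-MiniLM-L6-v2 using the sentence-transformers framework. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

🚀 Quick start

Because the model was fine-tuned with the sentence-transformers framework, it can be loaded with that library as follows:

Install the Sentence Transformers library

pip install -U sentence-transformers

Load the model and run inference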

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub (replace the placeholder with the actual model ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'What does this text say about data privacy?',
    'information during GAI training and maintenance. \nHuman-AI Conﬁguration; Obscene, \nDegrading, and/or Abusive \nContent; Value Chain and \nComponent Integration; \nDangerous, Violent, or Hateful \nContent \nMS-2.6-002 \nAssess existence or levels of harmful bias, intellectual property infringement, \ndata privacy violations, obscenity, extremism, violence, or CBRN information in \nsystem training data. \nData Privacy; Intellectual Property; \nObscene, Degrading, and/or \nAbusive Content; Harmful Bias and \nHomogenization; Dangerous, \nViolent, or Hateful Content; CBRN \nInformation or Capabilities \nMS-2.6-003 Re-evaluate safety features of ﬁne-tuned models when the negative risk exceeds \norganizational risk tolerance. \nDangerous, Violent, or Hateful \nContent \nMS-2.6-004 Review GAI system outputs for validity and safety: Review generated code to \nassess risks that may arise from unreliable downstream decision-making. \nValue Chain and Component \nIntegration; Dangerous, Violent, or \nHateful Content',
    'Scheurer, J. et al. (2023) Technical report: Large language models can strategically deceive their users \nwhen put under pressure. arXiv. https://arxiv.org/abs/2311.07590 \nShelby, R. et al. (2023) Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm \nReduction. arXiv. https://arxiv.org/pdf/2210.05791 \nShevlane, T. et al. (2023) Model evaluation for extreme risks. arXiv. https://arxiv.org/pdf/2305.15324 \nShumailov, I. et al. (2023) The curse of recursion: training on generated data makes models forget. arXiv. \nhttps://arxiv.org/pdf/2305.17493v2 \nSmith, A. et al. (2023) Hallucination or Confabulation? Neuroanatomy as metaphor in Large Language \nModels. PLOS Digital Health. \nhttps://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000388 \nSoice, E. et al. (2023) Can large language models democratize access to dual-use biotechnology? arXiv. \nhttps://arxiv.org/abs/2306.03809',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

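✨ Key features

High-dimensional vector mapping: maps sentences and paragraphs to a 384-dimensional dense vector space.
Multi-task support: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.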


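📚 Detailed documentation

Model details

Model description

Property	Details
Model type	Sentence Transformer
Base model	sentence-transformers/all-MiniLM-L6-v2
Maximum sequence length	256 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity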
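
The architecture listing later on this page ends with a `Normalize()` module, so the embeddings should have unit length. In that case cosine similarity reduces to a plain dot product. The sketch below checks this equivalence on hypothetical toy vectors; the helper names are illustrative and not part of the library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical toy vectors; real embeddings would have 384 components.
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]

cos_uv = cosine_similarity(u, v)
dot_unit = dot(l2_normalize(u), l2_normalize(v))
print(round(cos_uv, 6), round(dot_unit, 6))
```

Both values agree up to floating-point error, which is why a dot product over pre-normalized embeddings is a common shortcut for cosine similarity.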

Model resources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
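
The printout above describes three stages: a BertModel that produces one vector per token, mean pooling over those token vectors (`pooling_mode_mean_tokens: True`), and L2 normalization. The following plain-Python sketch mimics only the last two stages on toy data. The 2-dimensional token vectors and the function names are hypothetical; the real model works with 384-dimensional vectors and tensor operations.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length, as the Normalize stage does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy input: 3 token positions with 2-dimensional vectors; the last is padding.
token_vectors = [[2.0, 0.0], [4.0, 8.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
sentence_embedding = l2_normalize(pooled)
print(pooled)
print(sentence_embedding)
```

The padded position is excluded by the mask, so it does not distort the average, and the final vector has length 1.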

Training details

Training dataset

Unnamed dataset

Size: 555 training samples
Columns: sentence_0 and sentence_1
Approximate statistics (based on the first 555 samples):

	sentence_0	sentence_1
Type	string	string
Details	min: 10 tokens; mean: 11.2 tokens; max: 12 tokens	min: 156 tokens; mean: 199.37 tokens; max: 256 tokens

Samples:

sentence_0	sentence_1
`What does this text say about trustworthiness?`	other systems. Information Integrity; Value Chain and Component Integration MP-2.2-002 Observe and analyze how the GAI system interacts with external networks, and identify any potential for negative externalities, particularly where content provenance might be compromised. Information Integrity AI Actor Tasks: End Users MAP 2.3: Scientiﬁc integrity and TEVV considerations are identiﬁed and documented, including those related to experimental design, data collection and selection (e.g., availability, representativeness, suitability), system trustworthiness, and construct validation Action ID Suggested Action GAI Risks MP-2.3-001 Assess the accuracy, quality, reliability, and authenticity of GAI output by comparing it to a set of known ground truth data and by using a variety of evaluation methods (e.g., human oversight and automated evaluation, proven cryptographic techniques, review of content inputs). Information Integrity 25
`What does this text say about unclassified?`	training and TEVV data; Filtering of hate speech or content in GAI system training data; Prevalence of GAI-generated data in GAI system training data. Harmful Bias and Homogenization 15 Winogender Schemas is a sample set of paired sentences which diﬀer only by gender of the pronouns used, which can be used to evaluate gender bias in natural language processing coreference resolution systems. 37 MS-2.11-005 Assess the proportion of synthetic to non-synthetic training data and verify training data is not overly homogenous or GAI-produced to mitigate concerns of model collapse. Harmful Bias and Homogenization AI Actor Tasks: AI Deployment, AI Impact Assessment, Aﬀected Individuals and Communities, Domain Experts, End-Users, Operation and Monitoring, TEVV MEASURE 2.12: Environmental impact and sustainability of AI model training and management activities – as identiﬁed in the MAP function – are assessed and documented. Action ID Suggested Action GAI Risks
`What does this text say about unclassified?`	Padmakumar, V. et al. (2024) Does writing with language models reduce content diversity? ICLR. https://arxiv.org/pdf/2309.05196 Park, P. et. al. (2024) AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5). arXiv. https://arxiv.org/pdf/2308.14752 Partnership on AI (2023) Building a Glossary for Synthetic Media Transparency Methods, Part 1: Indirect Disclosure. https://partnershiponai.org/glossary-for-synthetic-media-transparency-methods-part-1- indirect-disclosure/ Qu, Y. et al. (2023) Unsafe Diﬀusion: On the Generation of Unsafe Images and Hateful Memes From Text- To-Image Models. arXiv. https://arxiv.org/pdf/2305.13873 Rafat, K. et al. (2023) Mitigating carbon footprint for knowledge distillation based deep learning model compression. PLOS One. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285668 Said, I. et al. (2022) Nonconsensual Distribution of Intimate Images: Exploring the Role of Legal Attitudes

Training loss function

Training used MultipleNegativesRankingLoss with the following parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
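
To make these parameters concrete: for each (sentence_0, sentence_1) pair in a batch, MultipleNegativesRankingLoss treats the sentence_1 entries of the other pairs as negatives and applies cross-entropy over the scaled cosine similarities. The sketch below is a simplified plain-Python illustration of that idea using `scale = 20.0` and cosine similarity. It is not the library's implementation, and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct "class" for anchor i is positives[i].

    Every other positive in the batch acts as a negative for anchor i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for a batch of two (sentence_0, sentence_1) pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor is closest to its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor is closest to the wrong positive

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is close to zero when each anchor is most similar to its own positive and close to the scale value when it is most similar to another pair's positive. Because negatives come from the rest of the batch, the batch size directly determines how many negatives each pair sees.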

Training hyperparameters

Non-default hyperparameters

per_device_train_batch_size: 16
per_device_eval_batch_size: 16
multi_dataset_batch_sampler: round_robin

All hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
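
As a back-of-the-envelope check on what these settings imply, the sketch below estimates the number of optimizer steps and the learning rate under a linear schedule. The sample count (555) comes from the training dataset section and the rest from the listing above. It assumes a single training device, which the page does not state explicitly, and `linear_lr` is a hypothetical helper that only approximates how a linear scheduler with no warmup behaves.

```python
import math

# Values taken from the training dataset section and the hyperparameter listing.
num_samples = 555
per_device_train_batch_size = 16
gradient_accumulation_steps = 1
num_train_epochs = 3
learning_rate = 5e-05
warmup_steps = 0

# Assumption: a single training device, so the effective batch size is 16.
num_devices = 1
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# dataloader_drop_last is False, so the final partial batch still counts.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs


def linear_lr(step):
    """Linear warmup (none here) followed by linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Under these assumptions the run amounts to 35 steps per epoch and 105 steps in total, with the learning rate decaying from 5e-05 to zero.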

Framework versions

Python: 3.11.5
Sentence Transformers: 3.1.1
Transformers: 4.44.2
PyTorch: 2.4.1+cpu
Accelerate: 0.34.2
Datasets: 3.0.0
Tokenizers: 0.19.1

🔧 Technical details

The model was fine-tuned from sentence-transformers/all-MiniLM-L6-v2 with the sentence-transformers framework. Its structure consists of a Transformer layer (BertModel), a mean-pooling layer, and a Normalize layer, as printed in the full architecture listing earlier on this page.

Training used the MultipleNegativesRankingLoss loss function with the hyperparameters listed above.

📄 License

The source documentation does not state a license.

📖 Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}