fine-tuned-embedding-model开源句子转换器 - 免费实现文本语义相似度计算

首页

Fine Tuned Embedding Model

由 svb01 开发

这是一个基于sentence-transformers/all-MiniLM-L6-v2微调的句子转换器模型，用于将文本映射到384维向量空间，支持语义相似度计算等任务。

文本嵌入

Safetensors

#短文本语义匹配 #多负例对比学习 #风险管理文本分析

下载量 17

发布时间 : 9/23/2024

模型简介

该模型将句子和段落映射到384维密集向量空间，可用于语义文本相似性、语义搜索、释义挖掘、文本分类、聚类等任务。

模型特点

高效语义编码

将文本高效编码为384维向量，保留语义信息

多任务支持

支持语义相似度计算、文本分类、聚类等多种下游任务

轻量级模型

基于MiniLM架构，在保持性能的同时减少计算资源需求

模型能力

语义文本相似度计算

语义搜索

释义挖掘

文本分类

文本聚类

特征提取

使用案例

信息检索

文档相似度匹配

计算文档间的语义相似度，用于推荐相关文档

内容管理

重复内容检测

识别语义相似的重复内容

🚀 基于sentence-transformers/all-MiniLM-L6-v2的句子转换器模型

本项目是一个基于 sentence-transformers 框架，从 sentence-transformers/all-MiniLM-L6-v2 微调而来的模型。它能将句子和段落映射到384维的密集向量空间，可用于语义文本相似度计算、语义搜索、释义挖掘、文本分类、聚类等任务。

🚀 快速开始

本模型是基于 sentence-transformers 框架微调的，以下是使用该模型的具体步骤：

安装Sentence Transformers库

pip install -U sentence-transformers

加载模型并进行推理

from sentence_transformers import SentenceTransformer

# 从🤗 Hub下载模型
model = SentenceTransformer("sentence_transformers_model_id")
# 进行推理
sentences = [
    'What does this text say about data privacy?',
    'information during GAI training and maintenance. \nHuman-AI Conﬁguration; Obscene, \nDegrading, and/or Abusive \nContent; Value Chain and \nComponent Integration; \nDangerous, Violent, or Hateful \nContent \nMS-2.6-002 \nAssess existence or levels of harmful bias, intellectual property infringement, \ndata privacy violations, obscenity, extremism, violence, or CBRN information in \nsystem training data. \nData Privacy; Intellectual Property; \nObscene, Degrading, and/or \nAbusive Content; Harmful Bias and \nHomogenization; Dangerous, \nViolent, or Hateful Content; CBRN \nInformation or Capabilities \nMS-2.6-003 Re-evaluate safety features of ﬁne-tuned models when the negative risk exceeds \norganizational risk tolerance. \nDangerous, Violent, or Hateful \nContent \nMS-2.6-004 Review GAI system outputs for validity and safety: Review generated code to \nassess risks that may arise from unreliable downstream decision-making. \nValue Chain and Component \nIntegration; Dangerous, Violent, or \nHateful Content',
    'Scheurer, J. et al. (2023) Technical report: Large language models can strategically deceive their users \nwhen put under pressure. arXiv. https://arxiv.org/abs/2311.07590 \nShelby, R. et al. (2023) Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm \nReduction. arXiv. https://arxiv.org/pdf/2210.05791 \nShevlane, T. et al. (2023) Model evaluation for extreme risks. arXiv. https://arxiv.org/pdf/2305.15324 \nShumailov, I. et al. (2023) The curse of recursion: training on generated data makes models forget. arXiv. \nhttps://arxiv.org/pdf/2305.17493v2 \nSmith, A. et al. (2023) Hallucination or Confabulation? Neuroanatomy as metaphor in Large Language \nModels. PLOS Digital Health. \nhttps://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000388 \nSoice, E. et al. (2023) Can large language models democratize access to dual-use biotechnology? arXiv. \nhttps://arxiv.org/abs/2306.03809',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 获取嵌入向量的相似度分数
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

✨ 主要特性

高维向量映射：能够将句子和段落映射到384维的密集向量空间。
多任务支持：可用于语义文本相似度计算、语义搜索、释义挖掘、文本分类、聚类等多种自然语言处理任务。

📦 安装指南

要使用本模型，需要安装 sentence-transformers 库，可使用以下命令进行安装：

pip install -U sentence-transformers

💻 使用示例

基础用法

from sentence_transformers import SentenceTransformer

# 从🤗 Hub下载模型
model = SentenceTransformer("sentence_transformers_model_id")
# 进行推理
sentences = [
    'What does this text say about data privacy?',
    'information during GAI training and maintenance. \nHuman-AI Conﬁguration; Obscene, \nDegrading, and/or Abusive \nContent; Value Chain and \nComponent Integration; \nDangerous, Violent, or Hateful \nContent \nMS-2.6-002 \nAssess existence or levels of harmful bias, intellectual property infringement, \ndata privacy violations, obscenity, extremism, violence, or CBRN information in \nsystem training data. \nData Privacy; Intellectual Property; \nObscene, Degrading, and/or \nAbusive Content; Harmful Bias and \nHomogenization; Dangerous, \nViolent, or Hateful Content; CBRN \nInformation or Capabilities \nMS-2.6-003 Re-evaluate safety features of ﬁne-tuned models when the negative risk exceeds \norganizational risk tolerance. \nDangerous, Violent, or Hateful \nContent \nMS-2.6-004 Review GAI system outputs for validity and safety: Review generated code to \nassess risks that may arise from unreliable downstream decision-making. \nValue Chain and Component \nIntegration; Dangerous, Violent, or \nHateful Content',
    'Scheurer, J. et al. (2023) Technical report: Large language models can strategically deceive their users \nwhen put under pressure. arXiv. https://arxiv.org/abs/2311.07590 \nShelby, R. et al. (2023) Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm \nReduction. arXiv. https://arxiv.org/pdf/2210.05791 \nShevlane, T. et al. (2023) Model evaluation for extreme risks. arXiv. https://arxiv.org/pdf/2305.15324 \nShumailov, I. et al. (2023) The curse of recursion: training on generated data makes models forget. arXiv. \nhttps://arxiv.org/pdf/2305.17493v2 \nSmith, A. et al. (2023) Hallucination or Confabulation? Neuroanatomy as metaphor in Large Language \nModels. PLOS Digital Health. \nhttps://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000388 \nSoice, E. et al. (2023) Can large language models democratize access to dual-use biotechnology? arXiv. \nhttps://arxiv.org/abs/2306.03809',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 获取嵌入向量的相似度分数
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

📚 详细文档

模型详情

模型描述

属性	详情
模型类型	句子转换器
基础模型	sentence-transformers/all-MiniLM-L6-v2
最大序列长度	256个词元
输出维度	384个词元
相似度函数	余弦相似度

模型资源

文档：Sentence Transformers文档
代码仓库：GitHub上的Sentence Transformers
Hugging Face：Hugging Face上的Sentence Transformers

完整模型架构

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

训练详情

训练数据集

未命名数据集

数据集大小：555个训练样本
列信息：<code>sentence_0</code> 和 <code>sentence_1</code>
近似统计信息（基于前555个样本）：
sentence_0 sentence_1
类型字符串字符串
详情
最小：10个词元
平均：11.2个词元
最大：12个词元
最小：156个词元
平均：199.37个词元
最大：256个词元

	sentence_0	sentence_1
类型	字符串	字符串
详情	最小：10个词元平均：11.2个词元最大：12个词元	最小：156个词元平均：199.37个词元最大：256个词元

样本示例：

sentence_0	sentence_1
`What does this text say about trustworthiness?`	other systems. Information Integrity; Value Chain and Component Integration MP-2.2-002 Observe and analyze how the GAI system interacts with external networks, and identify any potential for negative externalities, particularly where content provenance might be compromised. Information Integrity AI Actor Tasks: End Users MAP 2.3: Scientiﬁc integrity and TEVV considerations are identiﬁed and documented, including those related to experimental design, data collection and selection (e.g., availability, representativeness, suitability), system trustworthiness, and construct validation Action ID Suggested Action GAI Risks MP-2.3-001 Assess the accuracy, quality, reliability, and authenticity of GAI output by comparing it to a set of known ground truth data and by using a variety of evaluation methods (e.g., human oversight and automated evaluation, proven cryptographic techniques, review of content inputs). Information Integrity 25
`What does this text say about unclassified?`	training and TEVV data; Filtering of hate speech or content in GAI system training data; Prevalence of GAI-generated data in GAI system training data. Harmful Bias and Homogenization 15 Winogender Schemas is a sample set of paired sentences which diﬀer only by gender of the pronouns used, which can be used to evaluate gender bias in natural language processing coreference resolution systems. 37 MS-2.11-005 Assess the proportion of synthetic to non-synthetic training data and verify training data is not overly homogenous or GAI-produced to mitigate concerns of model collapse. Harmful Bias and Homogenization AI Actor Tasks: AI Deployment, AI Impact Assessment, Aﬀected Individuals and Communities, Domain Experts, End-Users, Operation and Monitoring, TEVV MEASURE 2.12: Environmental impact and sustainability of AI model training and management activities – as identiﬁed in the MAP function – are assessed and documented. Action ID Suggested Action GAI Risks
`What does this text say about unclassified?`	Padmakumar, V. et al. (2024) Does writing with language models reduce content diversity? ICLR. https://arxiv.org/pdf/2309.05196 Park, P. et. al. (2024) AI deception: A survey of examples, risks, and potential solutions. Patterns, 5(5). arXiv. https://arxiv.org/pdf/2308.14752 Partnership on AI (2023) Building a Glossary for Synthetic Media Transparency Methods, Part 1: Indirect Disclosure. https://partnershiponai.org/glossary-for-synthetic-media-transparency-methods-part-1- indirect-disclosure/ Qu, Y. et al. (2023) Unsafe Diﬀusion: On the Generation of Unsafe Images and Hateful Memes From Text- To-Image Models. arXiv. https://arxiv.org/pdf/2305.13873 Rafat, K. et al. (2023) Mitigating carbon footprint for knowledge distillation based deep learning model compression. PLOS One. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0285668 Said, I. et al. (2022) Nonconsensual Distribution of Intimate Images: Exploring the Role of Legal Attitudes

训练损失函数

使用 MultipleNegativesRankingLoss 损失函数，参数如下：

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

训练超参数

非默认超参数

per_device_train_batch_size: 16
per_device_eval_batch_size: 16
multi_dataset_batch_sampler: round_robin

所有超参数

点击展开

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

框架版本

Python：3.11.5
Sentence Transformers：3.1.1
Transformers：4.44.2
PyTorch：2.4.1+cpu
Accelerate：0.34.2
Datasets：3.0.0
Tokenizers：0.19.1

🔧 技术细节

本模型基于 sentence-transformers 框架，从 sentence-transformers/all-MiniLM-L6-v2 微调而来。模型结构包含 Transformer 层、Pooling 层和 Normalize 层，具体如下：

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

在训练过程中，使用了 MultipleNegativesRankingLoss 损失函数，并设置了相应的超参数，以优化模型性能。

📄 许可证

文档中未提及相关许可证信息。

📖 引用

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}