Sentest open-source sentence transformer model: free sentence similarity computation and semantic search

Sentest

Developed by palusi

A BERT-based sentence transformer model for computing sentence similarity and performing semantic search.

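Released: 2/14/2025

Model Overview

Fine-tuned on the QQP_triplets dataset, the model maps sentences and paragraphs to a 768-dimensional dense vector space. It is suited to semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.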

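Model Features

Efficient sentence embeddings

Converts sentences into 768-dimensional dense vectors that preserve semantic information

High-accuracy similarity computation

Reaches 98.83% cosine accuracy on the sentest evaluation set

Long-input support

Accepts input sequences of up to 512 tokens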

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

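Use Cases

Question answering

Similar-question matching

Measures the semantic similarity between a user's question and the questions in a knowledge base

Matches similar questions with high accuracy

Information retrieval

Semantic search

Returns results based on the meaning of the query rather than on keyword matching

Improves the relevance of search results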
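
To make the retrieval step concrete, here is a minimal sketch of semantic search over precomputed embeddings. The 3-dimensional vectors are made-up stand-ins for the model's 768-dimensional output, and `cosine` and `top_k` are hypothetical helper names, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus_vecs, k=2):
    """Return (corpus_index, score) pairs for the k most similar entries."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings (assumed values, for illustration only).
corpus = [
    [0.9, 0.1, 0.0],  # e.g. a question about a forgotten PC password
    [0.0, 1.0, 0.2],  # e.g. a question about pasta restaurants
    [0.7, 0.3, 0.1],  # e.g. a question about unlocking a phone
]
query = [1.0, 0.0, 0.0]

hits = top_k(query, corpus, k=2)
for index, score in hits:
    print(index, round(score, 3))
```

With real data, the corpus vectors would be encoded once and stored, and only the query would be encoded at search time.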

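🚀 Sentence transformer based on google-bert/bert-base-uncased

This is a sentence-transformers model fine-tuned from google-bert/bert-base-uncased. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

🚀 Quick Start

Direct usage (Sentence Transformers)

First, install the Sentence Transformers library: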

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("palusi/sentest")
# Run inference
sentences = [
    'How can I open my computer if I forget my password?',
    'I forget my PC password what should I do to open it?',
    'I forgot my security code on my Nokia 206 how can I unlock it?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
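
As a rough illustration of what the `similarities` matrix contains, the sketch below computes pairwise cosine similarity (the similarity function listed for this model) in plain Python. The 2-D vectors are toy stand-ins for real embeddings and `cosine_matrix` is a hypothetical helper; in practice `model.similarity` does this work for you.

```python
import math

def cosine_matrix(vectors):
    """Pairwise cosine similarity: entry [i][j] compares vectors i and j."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norms[i] * norms[j])
            for j, v in enumerate(vectors)
        ]
        for i, u in enumerate(vectors)
    ]

# Toy embeddings (assumed values, for illustration only).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = cosine_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal, and vector length does not affect the scores, only direction does.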

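✨ Key Features

Multi-purpose: usable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.
Fine-tuned: fine-tuned from google-bert/bert-base-uncased on qqp_triplets, so it is adapted to question-similarity tasks.
High accuracy: cosine accuracy of 0.9883 on the sentest evaluation set.

📦 Installation

To use this model, install the Sentence Transformers library with the following command: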

pip install -U sentence-transformers

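📚 Detailed Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	google-bert/bert-base-uncased
Maximum sequence length	512 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Training dataset	qqp_triplets
Language	English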

Model Sources

Documentation: Sentence Transformers documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
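
The Pooling module above has only `pooling_mode_mean_tokens` enabled, so a sentence embedding is the average of the token embeddings. A minimal sketch of that idea follows, assuming padding positions are excluded through an attention mask. `mean_pool` is a hypothetical helper, and the 2-D token vectors are toy values rather than 768-dimensional BERT outputs.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy token embeddings (assumed values); the last row is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

Because the padded row is masked out, padding a short sentence to a longer batch length does not change its embedding.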

Evaluation

Metrics

Triplet

Dataset: sentest
Evaluated with TripletEvaluator

Metric	Value
Cosine accuracy	0.9883
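
Cosine accuracy for triplets can be read as the share of triplets in which the anchor is more similar to the positive than to the negative. The sketch below follows that reading on toy vectors. It is an illustration, not the `TripletEvaluator` source, and `cosine` and `triplet_cosine_accuracy` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        1 for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)

# Toy (anchor, positive, negative) embeddings, assumed for illustration.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),   # negative is closer: wrong
    ([0.0, 1.0], [0.2, 1.0], [1.0, 0.0]),   # correct
    ([1.0, 1.0], [1.0, 0.9], [-1.0, 0.0]),  # correct
]

accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)  # 0.75
```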

Training Details

Training Dataset

qqp_triplets

Dataset: qqp_triplets at revision f475d9c
Size: 101,762 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
Type	string	string	string
Min tokens	6	5	6
Mean tokens	13.96	13.99	14.49
Max tokens	54	52	73

Samples:

anchor	positive	negative
`Who are Mona Punjabi?`	`Who are Mona punjabis?`	`Why are Punjabis so proud of their Punjabi-hood?`
`What are some of the best books on/by Bill Gates?`	`What are the best books of Bill Gates?`	`Are there any films about Bill Gates?`
`Where can I get best pasta in Bangalore?`	`Where can I get best pasta in Bangalore ?`	`Where can I get best street food in Bangalore?`

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}
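
With these parameters, the per-triplet loss is commonly written as max(0, d(a, p) - d(a, n) + margin), where d is the Euclidean distance and the margin is 5. A minimal plain-Python sketch on toy vectors follows. `euclidean` and `triplet_loss` are hypothetical helpers, and the real training code operates on batches of tensors.

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge on the gap between positive and negative distances."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Toy embeddings (assumed values, for illustration only).
anchor = [0.0, 0.0]
positive = [3.0, 4.0]        # distance 5 from the anchor
near_negative = [0.0, 6.0]   # distance 6 from the anchor
far_negative = [0.0, 12.0]   # distance 12 from the anchor

print(triplet_loss(anchor, positive, near_negative))  # 5 - 6 + 5 = 4.0
print(triplet_loss(anchor, positive, far_negative))   # 5 - 12 + 5 < 0, so 0.0
```

The loss reaches zero only once the negative is at least `margin` farther from the anchor than the positive, which is why the near negative still contributes a loss even though it is already ranked correctly.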

Evaluation Dataset

qqp_triplets

Dataset: qqp_triplets at revision f475d9c
Size: 101,762 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
Type	string	string	string
Min tokens	6	6	6
Mean tokens	13.99	13.76	14.75
Max tokens	61	49	78

Samples:

anchor	positive	negative
`How do l study efficiently?`	`How do you study effectively?`	`Why can't I study efficiently?`
`How do you commit suicide?`	`What is the easiest way to commite suicide?`	`What is a way to commit suicide and not damaging your organs so that they can be donated?`
`How do you learn to speak a foreign language?`	`What is the quickest way a person can learn to speak a new language fluently?`	`What's the easiest foreign language for a native English speaker, living in America, to learn to speak?`

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
weight_decay: 0.01
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
load_best_model_at_end: True
push_to_hub: True
hub_model_id: palusi/sentest
batch_sampler: no_duplicates
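
Together, `learning_rate` 2e-05, `warmup_ratio` 0.1, and the linear scheduler mean the learning rate ramps up over the first 10% of steps and then decays linearly to zero. Here is a sketch of that shape, assuming a hypothetical total of 5,000 optimizer steps (the card does not state the exact total). `linear_warmup_lr` is an illustrative helper, not a library function.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 5000  # assumed for illustration only

for step in (0, 250, 500, 2750, 5000):
    print(step, linear_warmup_lr(step, TOTAL_STEPS))
```

The rate peaks at the end of warmup and is back to zero at the final step, so the early training-loss values in the log are produced at a much smaller learning rate than the configured 2e-05.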

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: True
resume_from_checkpoint: None
hub_model_id: palusi/sentest
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	sentest cosine accuracy
-1	-1	-	-	0.8806
0.0983	500	2.5691	-	-
0.1965	1000	1.2284	0.6712	0.9645
0.2948	1500	0.8769	-	-
0.3930	2000	0.7151	0.4490	0.9787
0.4913	2500	0.6506	-	-
0.5895	3000	0.5855	0.3519	0.9848
0.6878	3500	0.5397	-	-
0.7860	4000	0.4998	0.3079	0.9871
0.8843	4500	0.4885	-	-
0.9825	5000	0.483	0.288	0.9883

The bold row denotes the saved checkpoint.

Framework Versions

Python: 3.11.11
Sentence Transformers: 3.4.1
Transformers: 4.48.2
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.2.0
Tokenizers: 0.21.0

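📄 License

The documentation does not mention any license information.

🔧 Technical Details

The model is fine-tuned from google-bert/bert-base-uncased with the TripletLoss loss function. Training uses (anchor, positive, negative) triplets: the loss pulls each anchor toward its positive and pushes it away from its negative until the negative is farther away than the positive by at least the margin (Euclidean distance, margin 5). The model outputs one 768-dimensional vector per input, which can be used to compute the similarity between sentences.

📄 Citation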

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}