allmini-ai-embedding-similarity open-source model - accurate matching of job descriptions and skill requirements

Allmini Ai Embedding Similarity

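Developed by Mubin

This is a sentence embedding model fine-tuned from sentence-transformers/all-MiniLM-L6-v2 for matching job descriptions against skill requirements by semantic similarity.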

Text Embedding

Safetensors

#JobDescriptionSemanticMatching #HighPrecisionSentenceEmbedding #CloudSkillRecognition

Downloads: 88

Released: 1/23/2025

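Model Overview

The model fine-tunes the sentence-transformers/all-MiniLM-L6-v2 base model to compute semantic similarity between job descriptions and skill requirements, which suits recruiting and job-matching scenarios.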

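Model Features

Embeddings specialized for job descriptions

Optimized for job descriptions and skill requirements in AI and data engineering

Efficient semantic matching

Captures the semantic relationships between technical terms and skill requirements

Small, efficient model

Built on the MiniLM architecture, so the model stays small while reaching 0.98 cosine accuracy on the ai-job-test triplets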

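Model Capabilities

Computing sentence similarity

Extracting sentence embedding features

Job description matching

Skill requirement analysis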

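Use Cases

Talent Recruitment

Job matching system

Automatically scores how well a candidate's resume fits a job's requirements

Improves recruiting efficiency and matching accuracy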
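As a rough illustration of the job matching use case, the sketch below ranks resumes by cosine similarity to a job description. The helper names (`cosine`, `rank_candidates`, `embed`) are hypothetical and not part of the model card, and the toy 3-dimensional vectors stand in for real 384-dimensional embeddings so the ranking logic can be read on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(job_vec, resume_vecs):
    """Return (resume_index, score) pairs sorted from best to worst match."""
    scores = [(i, cosine(job_vec, vec)) for i, vec in enumerate(resume_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed(texts):
    """Encode texts with the model (hypothetical helper, not called below).

    The import is lazy so the ranking helper above has no dependencies.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Mubin/allmini-ai-embedding-similarity")
    return model.encode(texts).tolist()


# Toy vectors: one job description and three resumes.
job = [1.0, 0.0, 0.0]
resumes = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_candidates(job, resumes)
print([index for index, _ in ranking])  # prints [1, 2, 0]
```

With real data, one would presumably pass `embed([job_text])[0]` and `embed(resume_texts)` instead of the toy vectors.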

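Skill gap analysis

Analyzes the gap between a team's current skills and a project's requirements

Helps shape training and development plans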
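One possible way to approach skill gap analysis is to flag every required skill whose best match among the team's skills falls below a similarity threshold. This is only a sketch: `find_skill_gaps`, the skill names, and the 0.7 threshold are illustrative assumptions, and the vectors are toy unit-length stand-ins for embeddings (for unit-length vectors the dot product equals the cosine similarity).

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def find_skill_gaps(required, team, threshold=0.7):
    """Return (skill, best_score) for required skills the team does not cover.

    required and team map a skill name to its embedding vector.
    The threshold is a hypothetical value that would need tuning on real data.
    """
    gaps = []
    for name, req_vec in required.items():
        best = max((dot(req_vec, vec) for vec in team.values()), default=0.0)
        if best < threshold:
            gaps.append((name, round(best, 3)))
    return gaps


# Toy unit-length vectors in place of real embeddings.
required = {
    "Azure Data Factory": [1.0, 0.0, 0.0],
    "Power BI reporting": [0.0, 1.0, 0.0],
}
team = {
    "ETL pipelines": [0.8, 0.6, 0.0],
    "Python": [0.0, 0.0, 1.0],
}
gaps = find_skill_gaps(required, team)
print(gaps)  # prints [('Power BI reporting', 0.6)]
```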

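HR Analytics

Job clustering analysis

Groups similar positions to streamline the organizational structure

Reveals potential redundancies or gaps within the organization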
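Job clustering can be sketched with a simple greedy pass over the embeddings. The function below is an assumption-laden illustration rather than the method used by any particular system: `group_similar_jobs`, the job titles, and the 0.8 threshold are made up, and the 2-dimensional unit-length vectors replace real embeddings.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def group_similar_jobs(titles, vectors, threshold=0.8):
    """Greedy single pass: a job joins the first group whose first member is similar enough."""
    groups = []  # each group is a list of indices into titles/vectors
    for i, vec in enumerate(vectors):
        for group in groups:
            if dot(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([i])
    return [[titles[i] for i in group] for group in groups]


titles = ["Data Engineer", "ML Engineer", "ETL Developer", "NLP Scientist"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.96, 0.28], [0.28, 0.96]]
groups = group_similar_jobs(titles, vectors)
print(groups)
# prints [['Data Engineer', 'ETL Developer'], ['ML Engineer', 'NLP Scientist']]
```

A greedy pass depends on input order; for larger collections a proper clustering algorithm would likely be a better fit.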

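🚀 Sentence Transformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2 on the ai-job-embedding-finetuning dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.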

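✨ Key Features

Strong semantic understanding: captures the meaning of sentences and paragraphs, which supports semantic textual similarity and semantic search.
Multi-task applicability: the same embeddings can feed other natural language processing tasks such as text classification and clustering.
Efficient vector mapping: maps input text to a 384-dimensional dense vector space that is convenient for downstream computation and analysis.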

📦 Installation

First, install the sentence-transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Mubin/allmini-ai-embedding-similarity")
# Run inference
sentences = [
    'NLP algorithm development, statistical modeling, biomedical informatics',
    "skills for this position are:Natural Language Processing (NLP)Python (Programming Language)Statistical ModelingHigh-Performance Liquid Chromatography (HPLC)Java Job Description:We are seeking a highly skilled NLP Scientist to develop our innovative and cutting-edge NLP/AI solutions to empower life science. This involves working directly with our clients, as well as cross-functional Biomedical Science, Engineering, and Business leaders, to identify, prioritize, and develop NLP/AI and Advanced analytics products from inception to delivery.Key requirements and design innovative NLP/AI solutions.Develop and validate cutting-edge NLP algorithms, including large language models tailored for healthcare and biopharma use cases.Translate complex technical insights into accessible language for non-technical stakeholders.Mentor junior team members, fostering a culture of continuous learning and growth.Publish findings in peer-reviewed journals and conferences.Engage with the broader scientific community by attending conferences, workshops, and collaborating on research projects. Qualifications:Ph.D. or master's degree in biomedical NLP, Computer Science, Biomedical Informatics, Computational Linguistics, Mathematics, or other related fieldsPublication records in leading computer science or biomedical informatics journals and conferences are highly desirable\n\nRegards,Guru Prasath M US IT RecruiterPSRTEK Inc.Princeton, NJ 08540guru@psrtek.comNo: 609-917-9967 Ext:114",
    'Skills :\na) Azure Data Factory – Min 3 years of project experiencea. Design of pipelinesb. Use of project with On-prem to Cloud Data Migrationc. Understanding of ETLd. Change Data Capture from Multiple Sourcese. Job Schedulingb) Azure Data Lake – Min 3 years of project experiencea. All steps from design to deliverb. Understanding of different Zones and design principalc) Data Modeling experience Min 5 Yearsa. Data Mart/Warehouseb. Columnar Data design and modelingd) Reporting using PowerBI Min 3 yearsa. Analytical Reportingb. Business Domain Modeling and data dictionary\nInterested please apply to the job, looking only for W2 candidates.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Compute similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

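📚 Documentation

Model Details

Property	Details
Model type	Sentence Transformer
Base model	sentence-transformers/all-MiniLM-L6-v2
Maximum sequence length	256 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity
Training dataset	ai-job-embedding-finetuning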

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
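The Pooling and Normalize stages listed above (mean pooling over tokens, then scaling to unit length) can be illustrated with a small pure-Python sketch. The function names and the toy 2-dimensional token vectors are assumptions for illustration; the real layers operate on batched tensors with 384 dimensions per token.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, so padding tokens are skipped."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length, which is the role of the Normalize() stage."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Three tokens with 2-dimensional vectors; the last token is padding.
tokens = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
sentence_vec = l2_normalize(pooled)
print(pooled)        # prints [3.0, 4.0]
print(sentence_vec)  # prints [0.6, 0.8]
```

Because every output vector has unit length, the dot product of two embeddings equals their cosine similarity.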

Evaluation

Triplet

Datasets: ai-job-validation and ai-job-test
Method: evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
Cosine accuracy	0.9703	0.9804
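Cosine accuracy on triplets is the share of (query, positive, negative) triplets in which the query is more similar to the positive than to the negative. The sketch below computes it on toy unit-length vectors; `triplet_cosine_accuracy` is a hypothetical helper, not the TripletEvaluator implementation. The last line assumes that the 101 evaluation samples listed under the training details are the ai-job-validation set, in which case 0.9703 would correspond to 98 correct triplets.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def triplet_cosine_accuracy(triplets):
    """Fraction of (query, positive, negative) triplets ranked correctly."""
    correct = sum(1 for q, pos, neg in triplets if dot(q, pos) > dot(q, neg))
    return correct / len(triplets)


# Toy unit-length vectors; the third triplet is ranked incorrectly on purpose.
triplets = [
    ([1.0, 0.0], [0.8, 0.6], [0.0, 1.0]),
    ([0.0, 1.0], [0.6, 0.8], [1.0, 0.0]),
    ([1.0, 0.0], [0.6, 0.8], [0.8, 0.6]),
    ([0.0, 1.0], [0.0, 1.0], [0.6, 0.8]),
]
accuracy = triplet_cosine_accuracy(triplets)
print(accuracy)  # prints 0.75

# Assumption: 101 validation triplets, 98 of them ranked correctly.
print(round(98 / 101, 4))  # prints 0.9703
```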

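Training Details

Training Dataset

Dataset: ai-job-embedding-finetuning
Size: 812 training samples
Columns: query, job_description_pos, and job_description_neg
Samples: each row contains a query, a positive job description, and a negative job description

Evaluation Dataset

Dataset: ai-job-embedding-finetuning
Size: 101 evaluation samples
Columns: query, job_description_pos, and job_description_neg
Samples: each row contains a query, a positive job description, and a negative job description

Training Hyperparameters

Non-default hyperparameters: eval_strategy, per_device_train_batch_size, and others
All hyperparameters: the complete list of training arguments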

Training Logs

Epoch	Step	ai-job-validation cosine accuracy	ai-job-test cosine accuracy
0	0	0.9307	-
1.0	51	0.9703	0.9804
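The log shows one epoch finishing at step 51, and the training set has 812 samples. As a back-of-the-envelope check, the snippet below looks for common batch sizes that would produce 51 steps per epoch. It assumes a single device, no gradient accumulation, and that the last partial batch is kept; the batch size itself is not stated on this page, so the result is an inference, not a documented value.

```python
import math

train_samples = 812   # from the training dataset section
steps_per_epoch = 51  # from the training log (epoch 1.0 at step 51)

# Hypothetical candidates: common batch sizes that yield 51 steps per epoch.
candidates = [
    batch_size
    for batch_size in (8, 16, 32, 64)
    if math.ceil(train_samples / batch_size) == steps_per_epoch
]
print(candidates)  # prints [16]
```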

Framework Versions

Python: 3.11.11
Sentence Transformers: 3.3.1
Transformers: 4.47.1
PyTorch: 2.5.1+cu121
Accelerate: 1.2.1
Datasets: 3.2.0
Tokenizers: 0.21.0

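🔧 Technical Details

Fine-Tuning

The model is fine-tuned from sentence-transformers/all-MiniLM-L6-v2 on the ai-job-embedding-finetuning dataset with the MultipleNegativesRankingLoss loss function, which adapts the embeddings to matching queries against job descriptions.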
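The idea behind MultipleNegativesRankingLoss can be sketched in a few lines: for each query, its own positive must score highest among all positives and negatives in the batch, and the loss is the cross-entropy over the scaled similarity scores. The code below is a simplified pure-Python illustration on toy unit-length vectors, not the library's implementation; the function name `mnr_loss` is hypothetical, and the scale of 20.0 is assumed to match the library default.

```python
import math


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = positives + negatives  # in-batch positives plus explicit negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, cand) for cand in candidates]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[0.0, -1.0], [-1.0, 0.0]]

good_loss = mnr_loss(anchors, positives, negatives)       # positives aligned with anchors
bad_loss = mnr_loss(anchors, positives[::-1], negatives)  # positives swapped
print(good_loss < bad_loss)  # prints True
```

In this toy setup the aligned batch gives a loss near zero, while swapping the positives pushes it to roughly 20, which is the scaled similarity gap the model would be trained to close.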

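Vector Mapping

The model maps input sentences and paragraphs to a 384-dimensional dense vector space and measures semantic similarity as the cosine similarity between vectors. During training, the model learns to represent different texts as discriminative vectors, so that related queries and job descriptions end up close together.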
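For reference, the cosine similarity used here can be written out as follows. Since the architecture listed earlier ends with a Normalize() layer, the embeddings should have unit length, in which case the cosine reduces to a plain dot product and is tied directly to the Euclidean distance. The symbols u and v are generic embedding vectors introduced for this illustration.

```latex
\[
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\qquad\Longrightarrow\qquad
\cos(u, v) = u \cdot v \quad \text{when } \lVert u \rVert = \lVert v \rVert = 1
\]
\[
\lVert u - v \rVert^2 = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\, u \cdot v = 2 - 2\cos(u, v)
\]
```

Under that assumption, ranking by cosine similarity, by dot product, or by smallest Euclidean distance gives the same order.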

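Training Process

Training used specific hyperparameter settings, such as the learning rate and batch size, chosen so that the model converges stably. According to the training log, cosine accuracy on ai-job-validation rose from 0.9307 before training to 0.9703 after one epoch (step 51), and the model reached 0.9804 on ai-job-test at the end of training.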

📄 License

Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}