resume-job-matcher-all-MiniLM-L6-v2 open-source model: accurate sentence similarity and efficient feature extraction

Resume Job Matcher All MiniLM L6 V2

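Developed by anass1209

A sentence embedding model based on the MiniLM-L6-v2 architecture, designed for computing sentence similarity and extracting features.

Text Embedding

Safetensors

#ResumeSemanticMatching #NLPEmbeddingOptimization #CosineSimilarityEvaluation

Downloads: 124

Released: 4/16/2025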

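Model Overview

The model converts sentences into dense 384-dimensional vectors that can be compared to measure semantic similarity. It is suited to resume matching, information retrieval, and similar tasks.

Model Features

Efficient sentence embedding

Quickly converts sentences into high-quality vector representations, making it suitable for real-time applications.

Optimized similarity computation

Trained with a cosine similarity loss, which directly optimizes semantic similarity scoring between sentences.

Lightweight architecture

Built on the MiniLM architecture, which reduces model complexity while retaining relatively strong performance.

Model Capabilities

Sentence vectorization

Semantic similarity computation

Feature extraction

Text matching

Use Cases

Human resources

Resume matching

Semantically matches candidate resumes against job descriptions to make recruiting more efficient.

Pearson correlation (cosine similarity): 0.5379

Information retrieval

Document similarity search

Finds semantically similar documents in large document collections.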

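🚀 SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.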

🚀 Quick Start

Install the Sentence Transformers library

pip install -U sentence-transformers

Load the model and run inference

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("anass1209/resume-job-matcher-all-MiniLM-L6-v2")
# Run inference
sentences = [
    'Developed and maintained core backend services using Python and Django, focusing on scalability and efficiency. Implemented RESTful APIs for data retrieval and manipulation.  Worked extensively with PostgreSQL for data storage and retrieval.  Responsible for optimizing database queries and improving API response times.  Experience with model fine-tuning for semantic search and document retrieval using pre-trained embedding models like Sentence Transformers or similar libraries, specifically for improving the relevance of search results and document matching within the web application.  Experience using vector databases (e.g., ChromaDB, Weaviate) preferred.',
    '## Senior Backend Engineer\n\n*   **ABC Corp** | 2020 - Present\n*   Led development of a new REST API for user authentication and profile management using Python and Django.\n*   Managed a PostgreSQL database, optimizing queries and schema design for improved performance, resulting in a 20% reduction in average API response time.\n*   Improved system scalability through efficient code design and load balancing techniques.\n*   Experience using pre-trained embedding models (BERT) for natural language processing tasks to improve search accuracy, with focus on keyphrase extraction and content similarity comparison for the recommendations engine. Proficient in Flask.',
    "PhD in Computer Science, University of California, Berkeley (2018-2023). Dissertation: 'Adversarial Robustness in NLP for Cybersecurity Applications.' Focused on fine-tuning BERT for malware detection and social engineering attacks. Proficient in Python, TensorFlow, and AWS. Published in top-tier NLP and security conferences. Experienced with large datasets and model evaluation metrics.\n\nMaster of Science in Cybersecurity, Johns Hopkins University (2016-2018). Relevant coursework included Machine Learning, Data Mining, and Network Security. Developed a system for anomaly detection using a recurrent neural network (RNN). Familiar with Python and cloud computing platforms. Good understanding of NLP concepts, but limited experience fine-tuning transformer models. Strong understanding of Information Security Principles.\n\nBachelor of Science in Computer Engineering, Carnegie Mellon University (2012-2016). Relevant coursework: Artificial Intelligence, Database Management, and Software Engineering. Project experience: Developed a web application using Python. No direct experience with fine-tuning NLP models, but a strong foundation in programming and data structures.  Familiar with cloud infrastructure concepts. Possess CISSP certification.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
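
For intuition about what the 3×3 similarity matrix holds, here is a minimal sketch of pairwise cosine similarity in plain Python. The 3-dimensional toy vectors are made up for illustration (real embeddings from this model have 384 dimensions), and `cosine_similarity` and `similarity_matrix` are hypothetical helper names, not part of the sentence-transformers API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Entry [i][j] is the cosine similarity of vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy stand-ins for sentence embeddings (illustrative values only).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(score, 2) for score in row])
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric, so for a job description and two resumes the interesting entries are the off-diagonal scores in the first row.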

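✨ Key Features

Semantic textual similarity: computes semantic similarity between sentences and paragraphs.
Multi-task use: supports semantic search, paraphrase mining, text classification, clustering, and other NLP tasks.
Dense vector mapping: maps sentences and paragraphs to a 384-dimensional dense vector space.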
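
As a rough sketch of the semantic search use case, the snippet below ranks candidate vectors against a query vector. It assumes the vectors are already unit-length (the model's architecture ends with a normalization step), so a plain dot product equals cosine similarity. The 2-dimensional vectors and the `rank_by_similarity` helper are hypothetical and only stand in for real 384-dimensional embeddings.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_similarity(query_vec, candidate_vecs, top_k=2):
    """Return (candidate_index, score) pairs, best match first."""
    scored = [(idx, dot(query_vec, vec)) for idx, vec in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Illustrative unit vectors: one job description and three resumes.
job_vec = [1.0, 0.0]
resume_vecs = [
    [0.0, 1.0],  # unrelated resume
    [0.8, 0.6],  # closest match
    [0.6, 0.8],  # partial match
]

ranking = rank_by_similarity(job_vec, resume_vecs, top_k=2)
print(ranking)
```

For large collections a vector index would normally replace the linear scan; the scan is kept here only because it makes the ranking logic explicit.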

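📚 Documentation

Model Details

Model Description

Property	Details
Model type	Sentence Transformer
Base model	sentence-transformers/all-MiniLM-L6-v2
Maximum sequence length	256 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture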

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
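
The Pooling and Normalize stages in the architecture above can be sketched in plain Python. This is a simplified illustration of mean pooling over non-padding tokens followed by L2 normalization, not the library's implementation; `mean_pool`, `l2_normalize`, and the 2-dimensional token vectors are hypothetical (the real token embeddings have 384 dimensions).

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; padding tokens are ignored."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Two real tokens plus one padding token (illustrative values only).
token_vectors = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
sentence_vec = l2_normalize(pooled)
print(pooled, sentence_vec)
```

Because the output is unit-length, cosine similarity between two sentence vectors reduces to their dot product.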

Evaluation

Metrics

Semantic Similarity

Datasets: dev_evaluation and test_evaluation
Evaluation method: EmbeddingSimilarityEvaluator

Metric	dev_evaluation	test_evaluation
pearson_cosine	0.5379	0.5379
spearman_cosine	0.6213	0.6213
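
To make the two metrics concrete, here is a small sketch that computes Pearson and Spearman correlation between gold labels and predicted cosine scores in plain Python. It is meant only to show what the numbers measure, not to reproduce EmbeddingSimilarityEvaluator. The `gold` and `pred` values are made up, and `pearson`, `average_ranks`, and `spearman` are hypothetical helper names.

```python
import math

def pearson(xs, ys):
    """Linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up labels and predicted cosine scores for four sentence pairs.
gold = [0.5, 0.7, 0.8, 0.96]
pred = [0.1, 0.2, 0.9, 0.95]

print(round(pearson(gold, pred), 3), round(spearman(gold, pred), 3))
```

In this toy data the predictions preserve the ordering of the labels exactly, so Spearman is 1 while Pearson is lower. The same effect is one plausible reason the reported spearman_cosine (0.6213) exceeds pearson_cosine (0.5379): ranking quality can be better than the linear fit of the raw scores.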

Training Details

Training Dataset

Unnamed dataset

Size: 958 training samples
Columns: sentence_0, sentence_1, and label
Approximate statistics based on the first 958 samples:

	sentence_0	sentence_1	label
Type	string	string	float
Details	min: 41 tokens, mean: 110.12 tokens, max: 234 tokens	min: 25 tokens, mean: 134.18 tokens, max: 256 tokens	min: 0.5, mean: 0.78, max: 0.96
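
Each training row pairs two texts with a float label, which is the shape of data a cosine similarity loss expects. The sketch below assumes the common formulation used by sentence-transformers' CosineSimilarityLoss (mean squared error between the cosine similarity of the two embeddings and the label); the card does not spell out the exact loss configuration, so treat this as an illustration. The vectors, labels, and the `cosine_similarity_loss` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs):
    """Mean squared error between cosine(u, v) and the float label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)

# (embedding of sentence_0, embedding of sentence_1, label); toy values only.
batch = [
    ([1.0, 0.0], [1.0, 0.0], 0.9),  # cosine is 1.0, so the error is 0.1
    ([1.0, 0.0], [0.6, 0.8], 0.5),  # cosine is 0.6, so the error is 0.1
]

loss = cosine_similarity_loss(batch)
print(round(loss, 4))
```

Under this formulation the model is pushed to output cosine scores that match the labels numerically, which for this dataset means scores in roughly the 0.5 to 0.96 range.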

Training Hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 50
multi_dataset_batch_sampler: round_robin
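
As a quick sanity check, the batch size and epoch count can be related to the step counts in the training log. This sketch assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these are stated in the card.

```python
import math

num_samples = 958   # training set size reported in this card
batch_size = 16     # per_device_train_batch_size
num_epochs = 50     # num_train_epochs

# The last batch of each epoch is smaller (958 = 59 * 16 + 14).
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the result is 60 steps per epoch and 3000 steps in total, which matches the training log (step 60 at epoch 1.0, step 3000 at epoch 50.0).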

Framework versions

Python: 3.11.11
Sentence Transformers: 4.1.0
Transformers: 4.51.1
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.5.0
Tokenizers: 0.21.0

🔧 Technical Details

Training Logs

Epoch	Step	Training loss	dev_evaluation_spearman_cosine	test_evaluation_spearman_cosine
1.0	60	-	0.4867	-
2.0	120	-	0.5612	-
3.0	180	-	0.5929	-
4.0	240	-	0.6229	-
5.0	300	-	0.6377	-
6.0	360	-	0.6434	-
7.0	420	-	0.6104	-
8.0	480	-	0.6064	-
8.3333	500	0.0122	-	-
9.0	540	-	0.6005	-
10.0	600	-	0.6064	-
11.0	660	-	0.5973	-
12.0	720	-	0.6097	-
13.0	780	-	0.5907	-
14.0	840	-	0.5870	-
15.0	900	-	0.5989	-
16.0	960	-	0.6018	-
16.6667	1000	0.0019	-	-
17.0	1020	-	0.6208	-
18.0	1080	-	0.6133	-
19.0	1140	-	0.6200	-
20.0	1200	-	0.5960	-
21.0	1260	-	0.5999	-
22.0	1320	-	0.5995	-
23.0	1380	-	0.6177	-
24.0	1440	-	0.6201	-
25.0	1500	0.0009	0.6110	-
26.0	1560	-	0.6184	-
27.0	1620	-	0.6133	-
28.0	1680	-	0.6287	-
29.0	1740	-	0.6200	-
30.0	1800	-	0.6272	-
31.0	1860	-	0.6222	-
32.0	1920	-	0.6199	-
33.0	1980	-	0.6141	-
33.3333	2000	0.0006	-	-
34.0	2040	-	0.6228	-
35.0	2100	-	0.6275	-
36.0	2160	-	0.6167	-
37.0	2220	-	0.6140	-
38.0	2280	-	0.6217	-
39.0	2340	-	0.6280	-
40.0	2400	-	0.6254	-
41.0	2460	-	0.6186	-
41.6667	2500	0.0005	-	-
42.0	2520	-	0.6185	-
43.0	2580	-	0.6242	-
44.0	2640	-	0.6183	-
45.0	2700	-	0.6213	-
46.0	2760	-	0.6220	-
47.0	2820	-	0.6213	-
48.0	2880	-	0.6213	-
49.0	2940	-	0.6214	-
50.0	3000	0.0004	0.6213	-
-1	-1	-	-	0.6213
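
The log can also be read programmatically to see where the dev score peaks. The sketch below uses a subset of rows copied from the table above (epoch mapped to dev_evaluation_spearman_cosine); the variable names are illustrative.

```python
# Subset of (epoch -> dev spearman_cosine) rows copied from the training log.
dev_spearman_by_epoch = {
    1: 0.4867,
    5: 0.6377,
    6: 0.6434,
    7: 0.6104,
    28: 0.6287,
    39: 0.6280,
    50: 0.6213,
}

best_epoch = max(dev_spearman_by_epoch, key=dev_spearman_by_epoch.get)
best_score = dev_spearman_by_epoch[best_epoch]
final_score = dev_spearman_by_epoch[50]
gap = best_score - final_score

print(best_epoch, best_score, round(gap, 4))
```

The dev score peaks at epoch 6 (0.6434) and ends at 0.6213 after epoch 50, while the training loss keeps falling to 0.0004. The card does not say which checkpoint was published; if it is the final one, keeping the best checkpoint instead (for example with the trainer's `load_best_model_at_end` option) might be worth trying when retraining.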

📄 License

The model documentation does not state a license.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}