resume-job-matcher-all-MiniLM-L6-v2: an open-source model for sentence similarity and feature extraction

Resume Job Matcher All MiniLM L6 V2

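Developed by anass1209

A sentence embedding model based on the MiniLM-L6-v2 architecture, built for computing sentence similarity and extracting features.

Text embedding

Safetensors

#ResumeSemanticMatching #NLPEmbeddingOptimization #CosineSimilarityEvaluation

Downloads: 124

Published: 4/16/2025

Model overview

The model converts sentences into 384-dimensional dense vectors that are compared to measure semantic similarity between sentences. It is suited to resume matching, information retrieval, and similar tasks.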

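Model features

Efficient sentence embeddings

Converts sentences into vector representations quickly enough for real-time applications.

Optimized similarity computation

Trained with a cosine similarity loss, so the embedding space is tuned specifically for scoring semantic similarity between sentences.

Lightweight architecture

Built on the MiniLM architecture, which reduces model complexity while keeping relatively high performance.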
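
The cosine similarity loss mentioned above can be illustrated with a minimal sketch. The card does not name the exact loss class, so this assumes the common formulation: the mean squared error between the cosine similarity of two embeddings and the gold label. The 2-D vectors and labels are hypothetical toy values, not model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between predicted cosine similarity and the label.

    Assumption: this mirrors the usual cosine-similarity loss; the card
    does not confirm the exact training objective.
    """
    errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(pairs, labels)
    ]
    return sum(errors) / len(errors)


# Hypothetical toy embeddings: identical vectors, then orthogonal vectors.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.5]

loss = cosine_similarity_loss(pairs, labels)
print(loss)
```

The first pair matches its label exactly and contributes no error; the second pair has cosine similarity 0 against a label of 0.5, so training would pull those two embeddings closer together.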

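Model capabilities

Sentence vectorization

Semantic similarity computation

Feature extraction

Text matching

Use cases

Human resources

Resume matching

Semantically matches candidate resumes against job descriptions to make recruiting more efficient.

Reported result: the Pearson correlation between predicted cosine similarities and gold labels is 0.5379.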
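
As a rough illustration of the resume-matching use case, the sketch below ranks resumes against a job description by cosine similarity. It assumes the embeddings have already been computed; the 3-D vectors, the resume identifiers, and the helper name `rank_resumes` are hypothetical stand-ins for the model's real 384-dimensional output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_resumes(job_vec, resume_vecs):
    """Return (resume_id, score) pairs sorted from best to worst match."""
    scored = [
        (resume_id, cosine_similarity(job_vec, vec))
        for resume_id, vec in resume_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings; in practice these would come from model.encode().
job_vec = [0.9, 0.1, 0.3]
resume_vecs = {
    "resume_a": [0.8, 0.2, 0.4],
    "resume_b": [0.1, 0.9, 0.2],
    "resume_c": [0.5, 0.5, 0.5],
}

ranking = rank_resumes(job_vec, resume_vecs)
for resume_id, score in ranking:
    print(f"{resume_id}: {score:.3f}")
```

Given the moderate correlation reported on the evaluation sets, scores like these are probably better used to shortlist candidates for human review than as a final decision.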

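Information retrieval

Document similarity search

Finds semantically similar documents in a large document collection.

🚀 Sentence Transformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.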

🚀 Quick start

Install the Sentence Transformers library

pip install -U sentence-transformers

Load the model and run inference

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("anass1209/resume-job-matcher-all-MiniLM-L6-v2")
# Run inference
sentences = [
    'Developed and maintained core backend services using Python and Django, focusing on scalability and efficiency. Implemented RESTful APIs for data retrieval and manipulation.  Worked extensively with PostgreSQL for data storage and retrieval.  Responsible for optimizing database queries and improving API response times.  Experience with model fine-tuning for semantic search and document retrieval using pre-trained embedding models like Sentence Transformers or similar libraries, specifically for improving the relevance of search results and document matching within the web application.  Experience using vector databases (e.g., ChromaDB, Weaviate) preferred.',
    '## Senior Backend Engineer\n\n*   **ABC Corp** | 2020 - Present\n*   Led development of a new REST API for user authentication and profile management using Python and Django.\n*   Managed a PostgreSQL database, optimizing queries and schema design for improved performance, resulting in a 20% reduction in average API response time.\n*   Improved system scalability through efficient code design and load balancing techniques.\n*   Experience using pre-trained embedding models (BERT) for natural language processing tasks to improve search accuracy, with focus on keyphrase extraction and content similarity comparison for the recommendations engine. Proficient in Flask.',
    "PhD in Computer Science, University of California, Berkeley (2018-2023). Dissertation: 'Adversarial Robustness in NLP for Cybersecurity Applications.' Focused on fine-tuning BERT for malware detection and social engineering attacks. Proficient in Python, TensorFlow, and AWS. Published in top-tier NLP and security conferences. Experienced with large datasets and model evaluation metrics.\n\nMaster of Science in Cybersecurity, Johns Hopkins University (2016-2018). Relevant coursework included Machine Learning, Data Mining, and Network Security. Developed a system for anomaly detection using a recurrent neural network (RNN). Familiar with Python and cloud computing platforms. Good understanding of NLP concepts, but limited experience fine-tuning transformer models. Strong understanding of Information Security Principles.\n\nBachelor of Science in Computer Engineering, Carnegie Mellon University (2012-2016). Relevant coursework: Artificial Intelligence, Database Management, and Software Engineering. Project experience: Developed a web application using Python. No direct experience with fine-tuning NLP models, but a strong foundation in programming and data structures.  Familiar with cloud infrastructure concepts. Possess CISSP certification.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

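✨ Key features

Semantic textual similarity: computes semantic similarity between sentences and paragraphs.
Multiple applications: usable for semantic search, paraphrase mining, text classification, clustering, and other natural language processing tasks.
Dense vector mapping: maps sentences and paragraphs to a 384-dimensional dense vector space.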

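📚 Detailed documentation

Model details

Model description

Property	Details
Model type	Sentence Transformer
Base model	sentence-transformers/all-MiniLM-L6-v2
Maximum sequence length	256 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity

Model sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full model architecture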

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
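
The Pooling and Normalize stages in the architecture above can be sketched in plain Python. This is an illustration of mean pooling over non-padding tokens followed by L2 normalization, not the library's implementation; the 2-D token vectors and the function names are hypothetical, and the real model works on 384-dimensional vectors.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + x for s, x in zip(summed, vec)]
            count += 1
    return [s / count for s in summed]


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical token embeddings; the third position is padding.
tokens = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # mean of the first two tokens only
sentence_vec = l2_normalize(pooled)   # unit-length sentence embedding
print(pooled, sentence_vec)
```

Because the final stage outputs unit-length vectors, the cosine similarity of two embeddings equals their dot product.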

Evaluation

Metrics

Semantic similarity

Datasets: dev_evaluation and test_evaluation
Evaluation method: EmbeddingSimilarityEvaluator

Metric	dev_evaluation	test_evaluation
pearson_cosine	0.5379	0.5379
spearman_cosine	0.6213	0.6213
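
The two metrics above measure different things: Pearson captures linear agreement between predicted cosine similarities and gold labels, while Spearman only compares their rank order. The sketch below computes both from scratch on hypothetical scores (not data from this model) to show how a ranking can be perfect while the linear fit is not, which is one way Spearman can exceed Pearson.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical predicted similarities and gold labels (same order, non-linear).
predicted = [0.1, 0.2, 0.3, 0.9]
gold = [0.5, 0.6, 0.7, 0.75]

pearson_r = pearson(predicted, gold)
spearman_r = spearman(predicted, gold)
print(f"pearson={pearson_r:.4f} spearman={spearman_r:.4f}")
```

For a matching task that only needs candidates ordered correctly, the Spearman figure is arguably the more relevant of the two.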

Training details

Training dataset

Unnamed dataset

Size: 958 training samples
Columns: sentence_0, sentence_1, and label
Approximate statistics based on the first 958 samples:

Statistic	sentence_0	sentence_1	label
Type	string	string	float
Min	41 tokens	25 tokens	0.5
Mean	110.12 tokens	134.18 tokens	0.78
Max	234 tokens	256 tokens	0.96

Training hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 50
multi_dataset_batch_sampler: round_robin
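
The hyperparameters above determine the step counts that appear in the training log. A quick arithmetic check, assuming a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped (none of which the card states explicitly):

```python
import math

train_samples = 958   # size of the training dataset
batch_size = 16       # per_device_train_batch_size
epochs = 50           # num_train_epochs

# Assumption: one device, so the per-device batch size is the global batch size.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Fractional epoch reached when the loss is logged at step 500.
epoch_at_step_500 = round(500 / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at_step_500)
```

Under these assumptions the result is 60 steps per epoch and 3000 steps in total, which agrees with the log entries at step 60 (epoch 1.0), step 500 (epoch 8.3333), and step 3000 (epoch 50.0).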

Framework versions

Python: 3.11.11
Sentence Transformers: 4.1.0
Transformers: 4.51.1
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.5.0
Tokenizers: 0.21.0

🔧 Technical details

Training logs

Epoch	Step	Training loss	dev_evaluation_spearman_cosine	test_evaluation_spearman_cosine
1.0	60	-	0.4867	-
2.0	120	-	0.5612	-
3.0	180	-	0.5929	-
4.0	240	-	0.6229	-
5.0	300	-	0.6377	-
6.0	360	-	0.6434	-
7.0	420	-	0.6104	-
8.0	480	-	0.6064	-
8.3333	500	0.0122	-	-
9.0	540	-	0.6005	-
10.0	600	-	0.6064	-
11.0	660	-	0.5973	-
12.0	720	-	0.6097	-
13.0	780	-	0.5907	-
14.0	840	-	0.5870	-
15.0	900	-	0.5989	-
16.0	960	-	0.6018	-
16.6667	1000	0.0019	-	-
17.0	1020	-	0.6208	-
18.0	1080	-	0.6133	-
19.0	1140	-	0.6200	-
20.0	1200	-	0.5960	-
21.0	1260	-	0.5999	-
22.0	1320	-	0.5995	-
23.0	1380	-	0.6177	-
24.0	1440	-	0.6201	-
25.0	1500	0.0009	0.6110	-
26.0	1560	-	0.6184	-
27.0	1620	-	0.6133	-
28.0	1680	-	0.6287	-
29.0	1740	-	0.6200	-
30.0	1800	-	0.6272	-
31.0	1860	-	0.6222	-
32.0	1920	-	0.6199	-
33.0	1980	-	0.6141	-
33.3333	2000	0.0006	-	-
34.0	2040	-	0.6228	-
35.0	2100	-	0.6275	-
36.0	2160	-	0.6167	-
37.0	2220	-	0.6140	-
38.0	2280	-	0.6217	-
39.0	2340	-	0.6280	-
40.0	2400	-	0.6254	-
41.0	2460	-	0.6186	-
41.6667	2500	0.0005	-	-
42.0	2520	-	0.6185	-
43.0	2580	-	0.6242	-
44.0	2640	-	0.6183	-
45.0	2700	-	0.6213	-
46.0	2760	-	0.6220	-
47.0	2820	-	0.6213	-
48.0	2880	-	0.6213	-
49.0	2940	-	0.6214	-
50.0	3000	0.0004	0.6213	-
-1	-1	-	-	0.6213
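
One way to read the log above is to locate the epoch with the highest dev score and compare it with the final epoch. The sketch below does this on a subset of rows copied from the table; the variable names are illustrative, and the card does not say which checkpoint was published.

```python
# (epoch, dev_evaluation_spearman_cosine) rows copied from the log above.
# Only a subset is listed here to keep the example short.
dev_log = [
    (1.0, 0.4867),
    (2.0, 0.5612),
    (3.0, 0.5929),
    (4.0, 0.6229),
    (5.0, 0.6377),
    (6.0, 0.6434),
    (7.0, 0.6104),
    (8.0, 0.6064),
    (25.0, 0.6110),
    (50.0, 0.6213),
]

# Row with the highest dev Spearman correlation.
best_epoch, best_score = max(dev_log, key=lambda row: row[1])

# Last row in the list, i.e. the end of training.
final_epoch, final_score = dev_log[-1]
gap = best_score - final_score

print(f"best: epoch {best_epoch} ({best_score}), final: epoch {final_epoch} ({final_score})")
print(f"gap: {gap:.4f}")
```

The dev score peaks at epoch 6 while the training loss keeps falling afterwards, a pattern that may indicate overfitting on the 958 training samples; an earlier checkpoint could be worth evaluating.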

📄 License

The model card does not specify a license.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}