allmini-ai-embedding-similarity open-source model: precise matching of job descriptions and skill requirements

Allmini Ai Embedding Similarity

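Developed by Mubin

A sentence embedding model fine-tuned from sentence-transformers/all-MiniLM-L6-v2, built for similarity matching between job descriptions and skill requirements.

Text Embedding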

Safetensors

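#Job description semantic matching #High-accuracy sentence embeddings #Cloud technology skill recognition

Downloads: 88

Released: 1/23/2025

Model Overview

The model fine-tunes the sentence-transformers/all-MiniLM-L6-v2 base model to compute semantic similarity between job descriptions and skill requirements. It is suited to recruiting and job-matching scenarios.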

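Model Features

Embeddings specialized for job descriptions

Optimized specifically for job descriptions and skill requirements in AI and data engineering.

Efficient semantic matching

Captures the semantic relationships between technical terms and skill requirements.

Small, efficient model

Built on the MiniLM architecture, it keeps the model size small while maintaining strong performance.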

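Model Capabilities

Computing sentence similarity

Extracting sentence embedding features

Job description matching

Skill requirement analysis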

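Use Cases

Talent Recruiting

Job matching system

Automatically scores how well a candidate's resume fits the job requirements

Improves recruiting efficiency and matching accuracy

Skill gap analysis

Analyzes the gap between an existing team's skills and project requirements

Helps shape training and development plans

HR Analytics

Job clustering analysis

Groups similar positions to optimize organizational structure

Surfaces potential redundancies or gaps within the organization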
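
The skill gap use case above can be sketched as a nearest-match check: for each required skill, find the most similar team skill and flag the requirement when even the best match is weak. This is only an illustration. The function name `find_skill_gaps`, the `0.5` threshold, and the 3-dimensional toy vectors (standing in for the model's 384-dimensional embeddings) are all assumptions, not part of the model card.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_skill_gaps(required, team, threshold=0.5):
    """Return required skills whose best match among team skills is below threshold.

    `required` and `team` map a skill name to its embedding vector.
    The threshold is a hypothetical value and would need tuning on real data.
    """
    gaps = []
    for name, vec in required.items():
        best = max((cosine(vec, t) for t in team.values()), default=0.0)
        if best < threshold:
            gaps.append(name)
    return gaps

# Toy vectors standing in for real embeddings (assumption).
required = {"azure data factory": [0.9, 0.1, 0.0], "nlp": [0.0, 0.2, 0.9]}
team = {"etl pipelines": [0.8, 0.3, 0.1], "power bi": [0.5, 0.8, 0.0]}
gaps = find_skill_gaps(required, team)
print(gaps)
```

In practice the vectors would come from `model.encode(...)`, and a sensible threshold depends on how similarity scores are distributed for the texts in question.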

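🚀 Sentence Transformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2 on the ai-job-embedding-finetuning dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

✨ Key Features

Strong semantic understanding: captures the meaning of sentences and paragraphs for tasks such as semantic textual similarity and semantic search.
Multi-task applicability: supports a range of natural language processing tasks, including text classification and clustering.
Efficient vector mapping: maps input text to a 384-dimensional dense vector space for downstream computation and analysis.

📦 Installation

First, install the sentence-transformers library: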

pip install -U sentence-transformers

💻 Usage Example

Basic usage

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Mubin/allmini-ai-embedding-similarity")
# Run inference
sentences = [
    'NLP algorithm development, statistical modeling, biomedical informatics',
    "skills for this position are:Natural Language Processing (NLP)Python (Programming Language)Statistical ModelingHigh-Performance Liquid Chromatography (HPLC)Java Job Description:We are seeking a highly skilled NLP Scientist to develop our innovative and cutting-edge NLP/AI solutions to empower life science. This involves working directly with our clients, as well as cross-functional Biomedical Science, Engineering, and Business leaders, to identify, prioritize, and develop NLP/AI and Advanced analytics products from inception to delivery.Key requirements and design innovative NLP/AI solutions.Develop and validate cutting-edge NLP algorithms, including large language models tailored for healthcare and biopharma use cases.Translate complex technical insights into accessible language for non-technical stakeholders.Mentor junior team members, fostering a culture of continuous learning and growth.Publish findings in peer-reviewed journals and conferences.Engage with the broader scientific community by attending conferences, workshops, and collaborating on research projects. Qualifications:Ph.D. or master's degree in biomedical NLP, Computer Science, Biomedical Informatics, Computational Linguistics, Mathematics, or other related fieldsPublication records in leading computer science or biomedical informatics journals and conferences are highly desirable\n\nRegards,Guru Prasath M US IT RecruiterPSRTEK Inc.Princeton, NJ 08540guru@psrtek.comNo: 609-917-9967 Ext:114",
    'Skills :\na) Azure Data Factory – Min 3 years of project experiencea. Design of pipelinesb. Use of project with On-prem to Cloud Data Migrationc. Understanding of ETLd. Change Data Capture from Multiple Sourcese. Job Schedulingb) Azure Data Lake – Min 3 years of project experiencea. All steps from design to deliverb. Understanding of different Zones and design principalc) Data Modeling experience Min 5 Yearsa. Data Mart/Warehouseb. Columnar Data design and modelingd) Reporting using PowerBI Min 3 yearsa. Analytical Reportingb. Business Domain Modeling and data dictionary\nInterested please apply to the job, looking only for W2 candidates.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
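
Once embeddings are available, a typical next step for job matching is to rank job descriptions against a query by similarity. The sketch below shows that ranking step in plain Python so it runs without downloading the model. The helper `rank_by_similarity` and the 3-dimensional toy vectors are assumptions for illustration. With the real model, the vectors would be rows of `model.encode(...)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k documents, best match first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for a query embedding and three job-description embeddings.
query = [0.9, 0.1, 0.0]
docs = [[0.1, 0.9, 0.1], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
ranking = rank_by_similarity(query, docs, top_k=2)
for index, score in ranking:
    print(index, round(score, 3))
```

For large collections, a vectorized or approximate nearest-neighbor search would likely be preferable to this linear scan.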

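📚 Detailed Documentation

Model Details

Property	Details
Model type	Sentence Transformer
Base model	sentence-transformers/all-MiniLM-L6-v2
Maximum sequence length	256 tokens
Output dimensionality	384 dimensions
Similarity function	Cosine similarity
Training dataset	ai-job-embedding-finetuning

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture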

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
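
The Pooling and Normalize stages in the architecture above can be illustrated with a small pure-Python sketch: mean pooling averages the token vectors that are not padding, and normalization rescales the result to unit length. The function names and the 2-dimensional toy vectors are assumptions. The real modules operate on batched tensors with 384 dimensions.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding tokens."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    # Guard against an all-padding input to avoid division by zero.
    return [x / max(count, 1) for x in total]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
sentence_vec = l2_normalize(pooled)
print(pooled, sentence_vec)
```

Because the final stage normalizes every embedding, the dot product of two output vectors equals their cosine similarity.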

Evaluation Metrics

Triplet

Datasets: ai-job-validation and ai-job-test
Evaluation method: evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
Cosine accuracy	0.9703	0.9804
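
As a rough illustration of what a triplet cosine accuracy measures, the sketch below counts the fraction of (query, positive, negative) triplets in which the query is more similar to the positive than to the negative. It is a simplified stand-in written for this page, not the TripletEvaluator implementation, and the toy vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(query, positive) > sim(query, negative)."""
    correct = sum(
        1 for query, pos, neg in triplets if cosine(query, pos) > cosine(query, neg)
    )
    return correct / len(triplets)

# Toy triplets: the first two are ranked correctly, the third is not.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    ([0.0, 1.0], [0.2, 0.8], [0.8, 0.2]),
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),
]
accuracy = triplet_cosine_accuracy(triplets)
print(round(accuracy, 4))
```

For scale, 98 correct triplets out of 101 rounds to 0.9703, which is consistent with the validation figure in the table, although the source does not report raw counts.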

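Training Details

Training Dataset

Dataset: ai-job-embedding-finetuning
Size: 812 training samples
Columns: query, job_description_pos, and job_description_neg
Samples: each sample consists of a query, a positive example, and a negative example

Evaluation Dataset

Dataset: ai-job-embedding-finetuning
Size: 101 evaluation samples
Columns: query, job_description_pos, and job_description_neg
Samples: each sample consists of a query, a positive example, and a negative example

Training Hyperparameters

Non-default hyperparameters: eval_strategy, per_device_train_batch_size, and others
All hyperparameters: the full set of training parameter settings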

Training Logs

Epoch	Step	ai-job-validation cosine accuracy	ai-job-test cosine accuracy
0	0	0.9307	-
1.0	51	0.9703	0.9804

Framework Versions

Python: 3.11.11
Sentence Transformers: 3.3.1
Transformers: 4.47.1
PyTorch: 2.5.1+cu121
Accelerate: 1.2.1
Datasets: 3.2.0
Tokenizers: 0.21.0

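🔧 Technical Details

Model Fine-tuning

This model is fine-tuned from sentence-transformers/all-MiniLM-L6-v2 with the MultipleNegativesRankingLoss loss function. Training on the ai-job-embedding-finetuning dataset adapts the model to matching queries against job descriptions.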
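
To give a feel for how a multiple-negatives ranking objective works, here is a simplified pure-Python sketch: each query is scored against every candidate in the batch, and cross-entropy rewards the case where its own positive scores highest. This is an illustration of the idea from the cited Henderson et al. paper, not the library's implementation. The function name `mnr_loss`, the toy vectors, and the scale of 20 (believed to match the library default, but treat it as an assumption) are all chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the positive for anchors[i].

    Every other candidate in the batch acts as a negative for anchors[i].
    Explicit hard negatives can be appended after the positives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives are paired with the wrong anchor
loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each query is closest to its own positive and grows when another candidate in the batch scores higher, which is what pushes matching pairs together during fine-tuning.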

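Vector Mapping

The model maps input sentences and paragraphs to a 384-dimensional dense vector space and measures semantic similarity as the cosine similarity between vectors. During training it learns to represent different texts as discriminative vectors, which is how it captures semantic information.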
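
A small worked example may make the cosine measure concrete. For vectors u and v, cosine similarity is the dot product divided by the product of their lengths. When both vectors are first scaled to unit length, the plain dot product gives the same value. The 2-dimensional vectors below are toy values chosen for easy arithmetic, not real embeddings.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    length = norm(v)
    return [x / length for x in v]

u = [3.0, 4.0]
v = [4.0, 3.0]

# Cosine similarity from the definition: (3*4 + 4*3) / (5 * 5) = 24 / 25.
cos_uv = dot(u, v) / (norm(u) * norm(v))

# The same quantity as a plain dot product of unit-length vectors.
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(cos_uv, dot_normalized)
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.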

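Training Process

Training used specific hyperparameter settings, such as learning rate and batch size, chosen so that the model converges stably and performs well. The model was evaluated on the validation set before and after training and on the test set at the end (see the training logs), which gives a check on how well it generalizes.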

📄 License

Citation

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}