🚀 CodeBERT-base-mlm
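CodeBERT-base-mlm provides pretrained weights for tasks that involve both programming languages and natural language. It is based on the pretrained CodeBERT model and is aimed at code understanding tasks.
🚀 Quick Start
This model provides the pretrained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages. A usage example follows: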
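from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline
# Load the pretrained MLM weights and the matching tokenizer.
model = RobertaForMaskedLM.from_pretrained('microsoft/codebert-base-mlm')
tokenizer = RobertaTokenizer.from_pretrained('microsoft/codebert-base-mlm')
# The <mask> token marks the position the model should fill in.
code_example = "if (x is not None) <mask> (x>1)"
# Build a fill-mask pipeline and print one candidate per line.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
outputs = fill_mask(code_example)
for candidate in outputs:
    print(candidate)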
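Expected output (one candidate per line; exact fields and scores may differ across transformers versions):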
{'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
{'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
{'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
{'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
{'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
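Each candidate is a plain dictionary, so the result is easy to post-process. The following is a minimal sketch that hard-codes the five candidates listed above instead of calling the model, keeps those above a score threshold, and strips the `<s>`/`</s>` markers. The helper names `keep_confident` and `strip_special_tokens` and the 0.1 cutoff are hypothetical choices for illustration, not part of the model card.

```python
# Candidates copied from the expected output above (no model call needed).
candidates = [
    {"sequence": "<s> if (x is not None) and (x>1)</s>", "score": 0.6049249172210693, "token": 8},
    {"sequence": "<s> if (x is not None) or (x>1)</s>", "score": 0.30680200457572937, "token": 50},
    {"sequence": "<s> if (x is not None) if (x>1)</s>", "score": 0.02133703976869583, "token": 114},
    {"sequence": "<s> if (x is not None) then (x>1)</s>", "score": 0.018607674166560173, "token": 172},
    {"sequence": "<s> if (x is not None) AND (x>1)</s>", "score": 0.007619690150022507, "token": 4248},
]


def keep_confident(results, min_score=0.1):
    """Return candidates scoring at least min_score, highest score first.

    min_score=0.1 is an arbitrary illustrative cutoff (assumption).
    """
    kept = [r for r in results if r["score"] >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)


def strip_special_tokens(sequence):
    """Remove the <s> and </s> markers shown in the sequences above."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


confident = keep_confident(candidates)
for r in confident:
    print(f"{r['score']:.3f}  {strip_special_tokens(r['sequence'])}")
```

With this cutoff only the `and` and `or` completions remain, which matches the intuition that both are plausible connectives for this condition.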
📦 Installation
The original documentation does not describe installation steps; refer to the official documentation of the transformers library.
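🔧 Technical Details
Training data
The model is trained on the code corpus of CodeSearchNet.
Training objective
The model is initialized from RoBERTa-base and trained with a simple MLM (masked language modeling) objective.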
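To make the MLM objective concrete, here is a minimal, library-free sketch of how a masked training example can be built: some tokens are hidden behind `<mask>`, and the model is trained to recover the originals at those positions. The 15% masking rate, the whitespace tokenization, and the helper name `mask_tokens` are assumptions for illustration; the source does not state the exact masking procedure used for this model.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one MLM example from a token list.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at masked positions and None elsewhere. Simplification: a selected
    token is always replaced by MASK (assumption, not the model's recipe).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position is not scored by the loss
    # Guarantee at least one masked position so the example carries a signal.
    if tokens and all(label is None for label in labels):
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


tokens = "if ( x is not None ) and ( x > 1 )".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The training loss would be computed only at positions where `labels` is not `None`, which is the same kind of prediction the fill-mask example performs at inference time.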
📚 Documentation
References
- Bimodal CodeBERT trained with the MLM+RTD objective (suitable for code search and document generation)
- 🤗 Hugging Face's CodeBERTa (small size, 6 layers)
Citation
@misc{feng2020codebert,
title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
year={2020},
eprint={2002.08155},
archivePrefix={arXiv},
primaryClass={cs.CL}
}