UniXcoder-base: a free, open-source code model that pre-trains code representations on multimodal data

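Developed by Microsoft

UniXcoder is a unified multimodal pre-trained model that uses multimodal data, such as code comments and abstract syntax trees (ASTs), to pre-train code representations.

Multimodal fusion

Transformers

English. Open-source license: Apache-2.0. #multimodal-code-understanding #zero-shot-code-tasks #cross-modal-pre-training

Downloads: 347.45k

Released: 3/23/2022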

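Model overview

UniXcoder is a RoBERTa-based multimodal pre-trained model built for code representation learning. It supports a range of code-related tasks.

Model features

Multimodal pre-training

Pre-trained on multimodal data, such as code comments and abstract syntax trees, to strengthen code representations

Multi-task support

Supports three modes (encoder-only, decoder-only, and encoder-decoder) to suit different code-related tasks

Zero-shot learning

Performs well on a variety of code-related tasks without fine-tuning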

Model capabilities

Code search

Code completion

Function name prediction

API recommendation

Code summarization

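Use cases

Code understanding

Code search

Searches for relevant code snippets from a natural-language query

Distinguishes code that is semantically similar but functionally different

Code generation

Code completion

Completes code automatically based on the surrounding context

Generates plausible code that fits the context

Code documentation

Function name prediction

Predicts a suitable function name from the function body

Predicts semantically accurate function names

Code summarization

Generates natural-language descriptions of code snippets

Generates concise, accurate code descriptions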

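🚀 UniXcoder-base model card

UniXcoder is a unified cross-modal pre-trained model. It uses multimodal data (code comments and abstract syntax trees) to pre-train code representations, which can then be applied to code-related tasks.

🚀 Quick start

Installing dependencies

Install the required dependencies with the following commands: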

pip install torch
pip install transformers
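
After installation, it can help to confirm that both packages are importable before loading the model. The snippet below is a minimal sketch that uses only the Python standard library; the helper name `missing_packages` is hypothetical and is not part of UniXcoder.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Check the two dependencies installed by the pip commands above.
missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```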

Getting started

We provide a class that wraps UniXcoder. Download it with the command below, then build the model with the Python code that follows:

wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py

import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

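Next, we give zero-shot examples for each mode: code search (encoder-only), code completion (decoder-only), and function name prediction, API recommendation, and code summarization (all encoder-decoder).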
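
The pairing of tasks and modes listed above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary `TASK_SETTINGS` and the helper `settings_for` are hypothetical names, while the mode strings and the `decoder_only` flag are the values passed to `model.tokenize` and `model.generate` in the examples on this page.

```python
# Hypothetical lookup table. The mode strings and decoder_only values mirror
# the tokenize()/generate() calls used in this model card's examples.
TASK_SETTINGS = {
    # decoder_only is None here because generate() is not called in encoder-only mode.
    "code search":              {"mode": "<encoder-only>",    "decoder_only": None},
    "code completion":          {"mode": "<decoder-only>",    "decoder_only": True},
    "function name prediction": {"mode": "<encoder-decoder>", "decoder_only": False},
    "api recommendation":       {"mode": "<encoder-decoder>", "decoder_only": False},
    "code summarization":       {"mode": "<encoder-decoder>", "decoder_only": False},
}

def settings_for(task):
    """Look up the mode token and generate() flag for a task name (case-insensitive)."""
    key = task.strip().lower()
    if key not in TASK_SETTINGS:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_SETTINGS[key]

print(settings_for("Code completion"))
```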

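✨ Key features

Encoder-only mode

Code and natural-language embeddings

The following example obtains embeddings for code fragments and a natural-language (NL) query from UniXcoder: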

# Encode maximum function
func = "def f(a,b): if a>b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,max_func_embedding = model(source_ids)

# Encode minimum function
func = "def f(a,b): if a<b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,min_func_embedding = model(source_ids)

# Encode NL
nl = "return maximum value"
tokens_ids = model.tokenize([nl],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,nl_embedding = model(source_ids)

print(max_func_embedding.shape)
print(max_func_embedding)

torch.Size([1, 768])
tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01,  4.2652e-01, -5.3696e-01,
         -1.5521e-01,  5.3770e-01,  3.4199e-01,  3.6305e-01, -3.9391e-01,
         -1.1816e+00,  2.6010e+00, -7.7133e-01,  1.8441e+00,  2.3645e+00,
         ...,
         -2.9188e+00,  1.2555e+00, -1.9953e+00, -1.9795e+00,  1.7279e+00,
          6.4590e-01, -5.2769e-02,  2.4965e-01,  2.3962e-02,  5.9996e-02,
          2.5659e+00,  3.6533e+00,  2.0301e+00]], device='cuda:0',
       grad_fn=<DivBackward0>)

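Similarity between code and natural language

Now we compute the cosine similarity between the NL query and each of the two functions. Although the two functions differ by only one operator (< versus >), UniXcoder can tell them apart.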

# Normalize embedding
norm_max_func_embedding = torch.nn.functional.normalize(max_func_embedding, p=2, dim=1)
norm_min_func_embedding = torch.nn.functional.normalize(min_func_embedding, p=2, dim=1)
norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)

max_func_nl_similarity = torch.einsum("ac,bc->ab",norm_max_func_embedding,norm_nl_embedding)
min_func_nl_similarity = torch.einsum("ac,bc->ab",norm_min_func_embedding,norm_nl_embedding)

print(max_func_nl_similarity)
print(min_func_nl_similarity)

tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward>)
tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward>)
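
The same cosine-similarity comparison extends to ranking several candidate snippets against one query, which is the core of zero-shot code search. The sketch below uses plain Python lists and made-up three-dimensional vectors in place of the 768-dimensional model embeddings so that it runs without any dependencies. The function names `cosine_similarity` and `rank_candidates` and the toy vectors are assumptions for illustration, not UniXcoder outputs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """Sort (label, vector) pairs by similarity to the query, best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
candidates = [
    ("min function", [0.1, 0.9, 0.2]),
    ("max function", [0.9, 0.1, 0.8]),
]
ranking = rank_candidates(query, candidates)
for label, score in ranking:
    print(f"{label}: {score:.4f}")
```

With real embeddings, each vector would come from `model(source_ids)` as in the encoder-only example, and the candidate with the highest score would be returned as the search result.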

Decoder-only mode

The following is a code completion example:

context = """
def f(data,file_path):
    # write json data into file_path in python language
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<decoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=True, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print(context+predictions[0][0])

def f(data,file_path):
    # write json data into file_path in python language
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
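
Generated completions are not guaranteed to be valid code, so a lightweight syntax check can be a useful filter before using them. The following is a minimal sketch built on Python's standard `ast` module. The helper `is_valid_python` is a hypothetical name, and the check covers syntax only: it would not notice, for example, that `json` is never imported in the completion above.

```python
import ast

def is_valid_python(source):
    """Return True if the source parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True

# The completed function from the example above.
completed = '''
def f(data,file_path):
    # write json data into file_path in python language
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
'''
# A hypothetical completion that was cut off mid-statement.
truncated = "def f(data,file_path):\n    with open(file_path, 'w') as"

print(is_valid_python(completed))
print(is_valid_python(truncated))
```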

Encoder-decoder mode

Function name prediction

context = """
def <mask0>(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['write_json', 'write_file', 'to_json']
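
Writing the masked input by hand works for a single example, but the function name can also be masked programmatically. The sketch below is one possible approach based on a regular expression. The helper `mask_function_name` is hypothetical, it masks only the first `def` it finds, and it assumes the `<mask0>` placeholder is used exactly as in the example above.

```python
import re

# Matches the first "def <name>(" and captures the pieces around the name.
_DEF_PATTERN = re.compile(r"(\bdef\s+)([A-Za-z_]\w*)(\s*\()")

def mask_function_name(source, mask_token="<mask0>"):
    """Replace the name of the first function definition with mask_token.

    Returns (masked_source, original_name). original_name is None when no
    function definition is found, in which case the source is unchanged.
    """
    match = _DEF_PATTERN.search(source)
    if match is None:
        return source, None
    masked = source[:match.start(2)] + mask_token + source[match.end(2):]
    return masked, match.group(2)

source = '''
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
'''
masked, name = mask_function_name(source)
print(masked)
```

Keeping the original name alongside the masked source makes it possible to compare the model's predictions with the name the author actually chose.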

API recommendation

context = """
def write_json(data,file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['json.dumps', 'json.loads', 'str']

Code summarization

context = """
# <mask0>
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['Write JSON to file', 'Write json to file', 'Write a json file']
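
The candidates returned with `beam_size=3` can differ only in capitalization, as the first two summaries above do. If near-duplicates are unwanted, they can be collapsed in a post-processing step. This is a sketch of one simple approach: the helper `clean_predictions` is hypothetical, the `raw` list is an assumed shape for undecoded predictions, and two strings count as duplicates when they match after removing the mask token, trimming whitespace, and lowercasing.

```python
def clean_predictions(predictions, mask_token="<mask0>"):
    """Strip the mask token and drop case-insensitive duplicates, keeping order."""
    seen = set()
    cleaned = []
    for text in predictions:
        candidate = text.replace(mask_token, "").strip()
        key = candidate.lower()
        if candidate and key not in seen:
            seen.add(key)
            cleaned.append(candidate)
    return cleaned

# Hypothetical raw predictions that still carry the mask token.
raw = ["<mask0>Write JSON to file", "<mask0>Write json to file", "<mask0>Write a json file"]
print(clean_predictions(raw))  # ['Write JSON to file', 'Write a json file']
```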

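📚 Detailed documentation

Model details

Attribute	Details
Developed by	Microsoft team
Shared by	Hugging Face
Model type	Feature engineering
Language	English
License	Apache-2.0
Related models	Parent model: RoBERTa
Resources for more information	Associated paper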

📄 License

This model is released under the Apache-2.0 license.

🔗 Citation

If you use this code or UniXcoder, please consider citing us:

@article{guo2022unixcoder,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
  journal={arXiv preprint arXiv:2203.03850},
  year={2022}
}