UniXcoder-base Open-source Code Model - Freely Utilize Multi-modal Data for Pre-training Code Representations

UniXcoder Base

Developed by Microsoft

UniXcoder is a unified multimodal pretrained model that leverages multimodal data such as code comments and abstract syntax trees for pretraining code representations.

Multimodal Fusion

Transformers

English

Open Source License: Apache-2.0

#Multimodal Code Understanding #Zero-shot Code Tasks #Cross-modal Pretraining

Downloads 347.45k

Release Time: 3/23/2022

Model Overview

UniXcoder is a RoBERTa-based multimodal pretrained model specifically designed for code representation learning, supporting various code-related tasks.

Model Features

Multimodal Pretraining

Utilizes multimodal data such as code comments and abstract syntax trees for pretraining to enhance code representation capabilities.

Multi-task Support

Supports three modes: encoder, decoder, and encoder-decoder, adapting to different code-related tasks.

Zero-shot Learning

Can be applied without fine-tuning to tasks such as code search, code completion, function name prediction, API recommendation, and code summarization.

Model Capabilities

Code Search

Code Completion

Function Name Prediction

API Recommendation

Code Summarization

Use Cases

Code Understanding

Code Search

Search for relevant code snippets based on natural language queries.

Can accurately distinguish between semantically similar but functionally different code.

Code Generation

Code Completion

Automatically complete code based on context.

Can generate reasonable code that fits the context.

Code Documentation

Function Name Prediction

Predict appropriate function names based on the function body.

Can predict semantically accurate function names.

Code Summarization

Generate natural language descriptions for code snippets.

Can generate concise and accurate code descriptions.

🚀 Model Card for UniXcoder-base

UniXcoder is a unified cross-modal pre-trained model. It uses multimodal data such as code comments and abstract syntax trees (ASTs) to pre-train code representations, and it supports code understanding and generation tasks through encoder-only, decoder-only, and encoder-decoder modes.

🚀 Quick Start

Dependency

pip install torch
pip install transformers

Quick Tour

UniXcoder is used through a small wrapper class. First download the class file, then build the model with the Python code that follows:

wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py

import torch
from unixcoder import UniXcoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

In the following, we'll provide zero-shot examples for several tasks under different modes, including code search (encoder-only), code completion (decoder-only), function name prediction (encoder-decoder), API recommendation (encoder-decoder), and code summarization (encoder-decoder).
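The three modes differ mainly in which tokens may attend to which. As a conceptual sketch only (the function name `attention_mask` and the list-of-lists layout are assumptions for illustration, not part of `unixcoder.py`), the following builds the attention pattern that the UniXcoder paper describes for each mode: fully bidirectional for encoder-only, causal for decoder-only, and a bidirectional source with a causal target for encoder-decoder.

```python
def attention_mask(mode, n_source, n_target=0):
    """Return a 0/1 matrix; mask[i][j] == 1 means token i may attend to token j.

    Illustrative only: `mode` reuses the prefix strings from the examples,
    and tokens are laid out as n_source source tokens followed by n_target
    target tokens.
    """
    n = n_source + n_target
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "<encoder-only>":
                allowed = True  # every token sees every token
            elif mode == "<decoder-only>":
                allowed = j <= i  # each token sees itself and earlier tokens
            elif mode == "<encoder-decoder>":
                # source tokens are visible to everyone; target tokens are
                # visible only to themselves and to later target tokens
                allowed = j < n_source or (i >= n_source and j <= i)
            else:
                raise ValueError(f"unknown mode: {mode}")
            mask[i][j] = int(allowed)
    return mask


for row in attention_mask("<encoder-decoder>", n_source=2, n_target=2):
    print(row)
```

In the actual model the mode is selected by the `mode` argument passed to `model.tokenize`, as the examples in this card show; the sketch only makes the resulting visibility pattern concrete.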

✨ Features

UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e., code comments and ASTs) to pre-train code representations.

💻 Usage Examples

Basic Usage

Encoder-only Mode - Code Search

Code and NL Embeddings

# Encode maximum function
func = "def f(a,b): if a>b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,max_func_embedding = model(source_ids)

# Encode minimum function
func = "def f(a,b): if a<b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,min_func_embedding = model(source_ids)

# Encode NL
nl = "return maximum value"
tokens_ids = model.tokenize([nl],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,nl_embedding = model(source_ids)

print(max_func_embedding.shape)
print(max_func_embedding)

torch.Size([1, 768])
tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01,  4.2652e-01, -5.3696e-01,
         -1.5521e-01,  5.3770e-01,  3.4199e-01,  3.6305e-01, -3.9391e-01,
         -1.1816e+00,  2.6010e+00, -7.7133e-01,  1.8441e+00,  2.3645e+00,
         ...,
         -2.9188e+00,  1.2555e+00, -1.9953e+00, -1.9795e+00,  1.7279e+00,
          6.4590e-01, -5.2769e-02,  2.4965e-01,  2.3962e-02,  5.9996e-02,
          2.5659e+00,  3.6533e+00,  2.0301e+00]], device='cuda:0',
       grad_fn=<DivBackward0>)

Similarity between code and NL

# Normalize embedding
norm_max_func_embedding = torch.nn.functional.normalize(max_func_embedding, p=2, dim=1)
norm_min_func_embedding = torch.nn.functional.normalize(min_func_embedding, p=2, dim=1)
norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)

max_func_nl_similarity = torch.einsum("ac,bc->ab",norm_max_func_embedding,norm_nl_embedding)
min_func_nl_similarity = torch.einsum("ac,bc->ab",norm_min_func_embedding,norm_nl_embedding)

print(max_func_nl_similarity)
print(min_func_nl_similarity)

tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward>)
tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward>)
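The same normalize-then-dot-product step extends to ranking several code candidates against one query. The sketch below is a plain-Python illustration with hypothetical names (`cosine`, `rank_candidates`) and toy 3-dimensional vectors; with the real model you would pass the 768-dimensional embeddings instead, for example after converting them with `max_func_embedding[0].detach().cpu().tolist()`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model embeddings (not real UniXcoder output).
query = [1.0, 0.0, 1.0]
candidates = {
    "max_func": [0.9, 0.1, 0.8],
    "min_func": [-0.5, 1.0, 0.2],
}
ranking = rank_candidates(query, candidates)
print(ranking[0][0])
```

With the similarities printed above (0.3002 for the maximum function, 0.1881 for the minimum function), the same ranking would place the maximum function first for the query "return maximum value".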

Decoder-only Mode - Code Completion

context = """
def f(data,file_path):
    # write json data into file_path in python language
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<decoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=True, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print(context+predictions[0][0])

def f(data,file_path):
    # write json data into file_path in python language
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
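Generated completions are not guaranteed to be syntactically valid, so it can be useful to filter beam candidates before using them. The following is one possible post-check, not something UniXcoder provides: the helper name `is_valid_python` and the candidate strings are hypothetical, and the check covers syntax only (it would not notice, for instance, that `json` is never imported).

```python
import ast


def is_valid_python(source):
    """Return True if `source` parses as Python; this checks syntax only."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


context = (
    "def f(data,file_path):\n"
    "    # write json data into file_path in python language\n"
)
# Hypothetical candidates standing in for the strings in predictions[0].
candidates = [
    "    data = json.dumps(data)\n    with open(file_path, 'w') as f:\n        f.write(data)\n",
    "    data = json.dumps(data\n",
]
valid = [c for c in candidates if is_valid_python(context + c)]
print(len(valid))
```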

Encoder-Decoder Mode

Function Name Prediction

context = """
def <mask0>(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['write_json', 'write_file', 'to_json']

API Recommendation

context = """
def write_json(data,file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['json.dumps', 'json.loads', 'str']

Code Summarization

context = """
# <mask0>
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])

['Write JSON to file', 'Write json to file', 'Write a json file']

📚 Documentation

Model Details

Model Description

UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e., code comments and ASTs) to pre-train code representations.

Property	Details
Developed by	Microsoft Team
Shared by	Hugging Face
Model Type	Feature Engineering
Language(s) (NLP)	en
License	Apache-2.0
Parent Model	RoBERTa
Resources for more information	Associated Paper

📄 License

This project is licensed under the Apache-2.0 license.

📖 Reference

If you use this code or UniXcoder, please consider citing us.

@article{guo2022unixcoder,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
  journal={arXiv preprint arXiv:2203.03850},
  year={2022}
}
