ERNIE-Code
ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages with 6 programming languages. It is pre-trained with span-corruption language modeling and pivot-based translation language modeling, and it outperforms previous multilingual LLMs on a range of code intelligence tasks.
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages with 6 programming languages. We employ two pre-training methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation.
ACL 2023 (Findings) | arXiv
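To make the span-corruption objective mentioned in the abstract more concrete, here is a minimal toy sketch in the T5 style. It is not the authors' implementation: the function name `span_corrupt`, the whitespace tokenization, the hand-picked spans, and the `<extra_id_k>` sentinel names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Mask each [start, end) span with a sentinel and return (inputs, targets)."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for k, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{k}>"  # assumed sentinel naming (T5 convention)
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted prefix
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)             # the target repeats the sentinel ...
        targets.extend(tokens[start:end])    # ... followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "def add ( a , b ) : return a + b".split()
inputs, targets = span_corrupt(tokens, [(1, 2), (9, 12)])
print(" ".join(inputs))   # def <extra_id_0> ( a , b ) : return <extra_id_1>
print(" ".join(targets))  # <extra_id_0> add <extra_id_1> a + b <extra_id_2>
```

In real pre-training the spans are sampled at random and the tokens come from a subword tokenizer; the fixed spans here only keep the example deterministic.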
Quick Start
ERNIE-Code can be integrated into your projects using the transformers library. The steps below show how to get started.
Usage Examples
Basic Usage
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "baidu/ernie-code-560m"

# Load the checkpoint as a sequence-to-sequence model, together with its tokenizer.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def format_code_with_spm_compatablity(line: str):
    # Replace the spaces that directly follow each newline (the indentation)
    # with explicit "<|space|>" tokens.
    format_dict = {
        " ": "<|space|>"
    }
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if line[i] == "\n":
            while i+1 < len(tokens) and tokens[i+1] == " ":
                tokens[i+1] = format_dict.get(" ")
                i += 1
        i += 1
    formatted_line = ''.join(tokens)
    return formatted_line

# Alternative: a code input.
# TYPE = "code"
# source = "arr.sort()"
# prompt = "translate python to java: \n%s" % (source)

TYPE = "text"  # input type, one of ("code", "text")
source = "quick sort"
prompt = "translate English to Japanese: \n%s" % (source)  # your prompt here

assert TYPE in ("code", "text")

# Code inputs need their indentation encoded before tokenization.
if TYPE == "code":
    prompt = format_code_with_spm_compatablity(prompt)

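model_inputs = tokenizer(prompt, max_length=512, padding=False, truncation=True, return_tensors="pt")

# Run on the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_ids = model_inputs.input_ids.to(device)
attention_mask = model_inputs.attention_mask.to(device)

output = model.generate(input_ids=input_ids, attention_mask=attention_mask,
                        num_beams=5, max_length=20)  # adjust to your needs
# A single sequence is generated here, so decode() yields one string.
output = tokenizer.decode(output.flatten(), skip_special_tokens=True)
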
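def clean_up_code_spaces(s: str):
    # Remove a single space that follows each of these tokens.
    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
    for tok in new_tokens:
        s = s.replace(f"{tok} ", tok)
    # Drop leftover special tokens, then turn "<|space|>" back into real spaces.
    cleaned_tokens = ["<pad>", "</s>", "<unk>"]
    for tok in cleaned_tokens:
        s = s.replace(tok, "")
    s = s.replace("<|space|>", " ")
    return s

# `output` is one decoded string, so clean it directly rather than iterating over it.
output = clean_up_code_spaces(output)
print(output)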
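The pre- and post-processing helpers above can be tried without loading the model. The sketch below re-expresses the same idea compactly so that it runs on its own; the names `encode_indentation` and `decode_indentation` are hypothetical, and `decoded_like` is a hand-written stand-in for what a decoder might emit, not real model output.

```python
import re

SPACE = "<|space|>"  # placeholder token used by the helpers above


def encode_indentation(code: str) -> str:
    """Replace each run of spaces that directly follows a newline with SPACE tokens."""
    return re.sub(r"\n( +)", lambda m: "\n" + SPACE * len(m.group(1)), code)


def decode_indentation(text: str) -> str:
    """Drop one space after newline, tab, and SPACE tokens, then restore real spaces."""
    for tok in ("\n", "\t", SPACE * 4, SPACE * 2, SPACE):
        text = text.replace(f"{tok} ", tok)
    return text.replace(SPACE, " ")


snippet = "def f():\n    return 1"
encoded = encode_indentation(snippet)
print(repr(encoded))

# Hand-written example with stray spaces around the placeholder tokens.
decoded_like = "def f():\n " + SPACE * 4 + " return 1"
print(repr(decode_indentation(decoded_like)))
```

Only indentation is encoded: the space inside `return 1` is left untouched, and both the encoded snippet and the noisy variant decode back to the original text.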
Advanced Usage
You can adapt existing seq2seq translation code for fine-tuning.
You can also check the official inference code on PaddleNLP.
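As a starting point for fine-tuning, the sketch below formats parallel examples into prompt/target records that reuse the prompt style from the basic usage example. The function name `build_records`, the `source` and `target` field names, and the sample pair are assumptions for illustration; the source does not confirm which format the authors used for fine-tuning, so adjust the fields to whatever your training script expects.

```python
import json
from typing import Dict, Iterable, List, Tuple


def build_records(
    pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str
) -> List[Dict[str, str]]:
    """Turn (source, target) pairs into prompt/target records for seq2seq training."""
    # Same prompt style as the inference example: "translate X to Y: \n<input>".
    prefix = f"translate {src_lang} to {tgt_lang}: \n"
    return [{"source": prefix + src, "target": tgt} for src, tgt in pairs]


# Illustrative pair only; replace with your own parallel data.
pairs = [("arr.sort()", "Arrays.sort(arr);")]
records = build_records(pairs, "python", "java")

# One JSON object per line is a common input format for training scripts.
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

For code inputs, apply the same indentation encoding used at inference time before tokenizing, so that training and inference see consistently formatted text.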
Features
- Multilingual Connectivity: Connects 116 natural languages with 6 programming languages.
- Advanced Pre-training Methods: Employs span-corruption language modeling and pivot-based translation language modeling.
- Excellent Performance: Outperforms previous multilingual LLMs on code-to-text, text-to-code, code-to-code, and text-to-text generation tasks.
- Zero-shot Advantage: Shows advantages in zero-shot prompting for multilingual code summarization and text-to-text translation.
Zero-shot Examples
[Figure: Multilingual code-to-text generation (zero-shot)]
[Figure: Multilingual text-to-text translation (zero-shot)]
License
This project is released under the MIT license.
BibTeX
@inproceedings{chai-etal-2023-ernie,
    title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages",
    author = "Chai, Yekun and
      Wang, Shuohuan and
      Pang, Chao and
      Sun, Yu and
      Tian, Hao and
      Wu, Hua",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.676",
    pages = "10628--10650",
    abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.",
}