ernie-code-560m open-source model: connects 116 natural languages and 6 programming languages and supports cross-lingual tasks

Ernie Code 560m

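Developed by Baidu

ERNIE-Code is a unified large language model that connects 116 natural languages and 6 programming languages and supports a variety of cross-lingual tasks.

Large Language Model

Transformers

Open-source license: MIT #multilingual-code-generation #cross-lingual-pretraining #zero-shot-translation

Downloads: 69

Release date: 3/9/2024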

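Model Overview

ERNIE-Code is a large language model that supports multiple natural languages and programming languages. It is pretrained with span-corruption language modeling and pivot-based translation language modeling, and it can be applied to tasks such as code-to-text and text-to-code generation.

Model Features

Multilingual support

Supports 116 natural languages and 6 programming languages, covering a wide range of cross-lingual tasks.

Cross-lingual pretraining

Uses span-corruption language modeling, which learns patterns from monolingual natural-language or programming-language data, and pivot-based translation language modeling, which relies on parallel data, to improve performance on multilingual tasks.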
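
To make the span-corruption objective more concrete, here is a minimal sketch of how masked spans could be turned into an input/target pair. It assumes T5-style sentinel tokens named `<extra_id_N>` and whitespace tokenization; the helper `span_corrupt` and the hand-picked spans are hypothetical and are not taken from the ERNIE-Code training code, which samples spans randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Mask the given spans and build a (corrupted input, target) pair.

    `spans` is a sorted list of non-overlapping, half-open (start, end)
    index ranges. Each span is replaced by one sentinel in the input, and
    the target lists each sentinel followed by the tokens it hides.
    """
    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"  # assumed sentinel naming
        corrupted.extend(tokens[cursor:start])  # keep the text before the span
        corrupted.append(sentinel)              # hide the span in the input
        target.append(sentinel)                 # announce the span in the target
        target.extend(tokens[start:end])        # then spell out its tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the text after the last span
    return corrupted, target


tokens = "def add ( a , b ) : return a + b".split()
corrupted, target = span_corrupt(tokens, [(1, 2), (8, 11)])
print(" ".join(corrupted))
print(" ".join(target))
```

The same procedure applies unchanged to natural-language text, which is what lets one objective cover both kinds of monolingual data.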

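Zero-shot capability

Shows strong zero-shot prompting ability on code summarization and text translation tasks.

Model Capabilities

Multilingual code-to-text generation

Multilingual text-to-code generation

Multilingual code-to-code generation

Multilingual text-to-text translation

Use Cases

Code intelligence

Code summarization

Generates natural-language descriptions for code written in multiple programming languages.

Performs strongly on multilingual code summarization tasks.

Code translation

Translates code from one programming language into another.

Outperforms other multilingual models on code-to-code generation tasks.

Natural language processing

Text translation

Supports text translation between multiple natural languages.

Shows an advantage on zero-shot text translation tasks.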

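🚀 ERNIE-Code

ERNIE-Code is a unified large language model (LLM) that connects 116 natural languages and 6 programming languages. It outperforms previous multilingual LLMs across a wide range of code-intelligence end tasks.

🚀 Quick Start

ERNIE-Code uses two methods for universal cross-lingual pretraining: span-corruption language modeling and pivot-based translation language modeling. Extensive results show that it outperforms previous multilingual LLMs across a wide range of code-intelligence end tasks.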

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

[Figure: ernie-code-comp]

ACL 2023 (Findings) | arXiv

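✨ Key Features

ERNIE-Code offers the following key features:

A unified large language model that connects 116 natural languages and 6 programming languages
Outperforms previous multilingual LLMs across a wide range of code-intelligence end tasks
Handles multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation
Shows the advantage of zero-shot prompting on multilingual code summarization and text translation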
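
The task types listed above are all driven by a text prompt. As a rough sketch, the prompt pattern from this card's usage example (`translate <source> to <target>: \n<content>`) can be wrapped in a small helper. The function name `build_prompt` is hypothetical. Only the code-to-code and text-to-text prompts below appear in the card; the phrasing for code-to-text or text-to-code prompts is not documented here and would be an assumption.

```python
def build_prompt(source_lang: str, target_lang: str, content: str) -> str:
    """Build a prompt in the 'translate X to Y' style used by the usage example."""
    return "translate %s to %s: \n%s" % (source_lang, target_lang, content)


# Code-to-code: translate a Python snippet to Java
code_to_code = build_prompt("python", "java", "arr.sort()")

# Text-to-text: translate English text to Japanese
text_to_text = build_prompt("English", "Japanese", "quick sort")

print(repr(code_to_code))
print(repr(text_to_text))
```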

💻 Usage Examples

Basic usage

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "baidu/ernie-code-560m"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: the `clean_up_code_spaces` helper defined below can be used to clean up code strings.


def format_code_with_spm_compatablity(line: str):
    """Replace the run of spaces after each newline with `<|space|>` markers."""
    format_dict = {
        " ": "<|space|>"
    }
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if tokens[i] == "\n":
            # Convert the indentation that follows this newline, one space at a time
            while i + 1 < len(tokens) and tokens[i + 1] == " ":
                tokens[i + 1] = format_dict.get(" ")
                i += 1
        i += 1
    formatted_line = "".join(tokens)
    return formatted_line

# Code example (uncomment to use it instead of the text example below):
# TYPE = "code"  # input type, one of ("code", "text")
# input_text = "arr.sort()"
# prompt = "translate python to java: \n%s" % input_text  # your prompt here

TYPE = "text"  # input type, one of ("code", "text")
input_text = "quick sort"
prompt = "translate English to Japanese: \n%s" % input_text  # your prompt here

assert TYPE in ("code", "text")

# preprocess for code input
if TYPE=="code":
    prompt = format_code_with_spm_compatablity(prompt)

model_inputs = tokenizer(prompt, max_length=512, padding=False, truncation=True, return_tensors="pt")

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_ids = model_inputs.input_ids.to(device)
attention_mask = model_inputs.attention_mask.to(device)

output = model.generate(input_ids=input_ids, attention_mask=attention_mask, 
        num_beams=5, max_length=20) # change to your needs

# `batch_decode` returns a list with one string per generated sequence.
# Adapt the `clean_up_code_spaces` post-processing below to your needs.
output = tokenizer.batch_decode(output, skip_special_tokens=True)


# post-process the code generation
def clean_up_code_spaces(s: str):
    # post process
    # ===========================
    new_tokens = ["<pad>", "</s>", "<unk>", "\n", "\t", "<|space|>"*4, "<|space|>"*2, "<|space|>"]
    for tok in new_tokens:
        s = s.replace(f"{tok} ", tok)

    cleaned_tokens = ["<pad>", "</s>", "<unk>"]
    for tok in cleaned_tokens:
        s = s.replace(tok, "")
    s = s.replace("<|space|>", " ")
    return s
output = [clean_up_code_spaces(pred) for pred in output]
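
The two helpers in the example can be checked without loading the model. The sketch below restates them compactly so it runs on its own, then verifies that marking indentation and cleaning it up again returns the original code. The `decoded` string is a made-up stand-in for tokenizer output (stray spaces after marker tokens plus an end token), not real model output.

```python
SPACE = "<|space|>"


def format_code_with_spm_compatablity(line: str) -> str:
    # Restated from the usage example: mark the spaces that follow a newline.
    tokens = list(line)
    i = 0
    while i < len(tokens):
        if tokens[i] == "\n":
            while i + 1 < len(tokens) and tokens[i + 1] == " ":
                tokens[i + 1] = SPACE
                i += 1
        i += 1
    return "".join(tokens)


def clean_up_code_spaces(s: str) -> str:
    # Restated from the usage example: drop the space that follows each
    # marker token, remove leftover special tokens, then restore real spaces.
    for tok in ["<pad>", "</s>", "<unk>", "\n", "\t", SPACE * 4, SPACE * 2, SPACE]:
        s = s.replace(f"{tok} ", tok)
    for tok in ["<pad>", "</s>", "<unk>"]:
        s = s.replace(tok, "")
    return s.replace(SPACE, " ")


code = "def f(x):\n    return x"
marked = format_code_with_spm_compatablity(code)

# Hypothetical decoder output: same content, with stray spaces and an end token
decoded = "def f(x):\n " + SPACE * 4 + " return x</s>"

restored = clean_up_code_spaces(marked)
cleaned = clean_up_code_spaces(decoded)
print(repr(marked))
print(restored == code, cleaned == code)
```

Note that only spaces directly after a newline are marked, so indentation on the very first line of the input is left as ordinary spaces.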

Advanced usage

# For fine-tuning, you can adapt the [seq2seq translation code](https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation).
# You can also refer to the official inference code in [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-code/README.en.md).
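
As a starting point for adapting the translation example, the sketch below prepares training pairs in the JSON Lines layout that the Hugging Face translation example documents for custom data: one object per line with a `translation` field keyed by language. The keys `python` and `java`, the helper `to_translation_records`, and the sample pair are illustrative assumptions; the keys must match whatever source and target language names are passed to the fine-tuning script.

```python
import json


def to_translation_records(pairs, source_key, target_key):
    """Wrap (source, target) string pairs as {"translation": {...}} records."""
    return [
        {"translation": {source_key: src, target_key: tgt}}
        for src, tgt in pairs
    ]


# Hypothetical parallel data: a Python snippet and a Java equivalent
pairs = [("arr.sort()", "Arrays.sort(arr);")]
records = to_translation_records(pairs, "python", "java")

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable
jsonl_text = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl_text)
```

Writing `jsonl_text` to a `.json` or `.jsonl` file gives a dataset that can be supplied as a custom training file. Whether the code-space preprocessing from the basic example should also be applied to each side is left to the reader to verify against the official PaddleNLP code.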

📚 Documentation

Zero-shot examples

Multilingual code-to-text generation (zero-shot)

[Figure: code-to-text-examples]

[Figure: zh_code-to-text_examples-1]

Multilingual text translation (zero-shot)

[Figure: zero-shot-mt-examples]

📄 License

This project is released under the MIT License.

BibTeX

@inproceedings{chai-etal-2023-ernie,
    title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages",
    author = "Chai, Yekun  and
      Wang, Shuohuan  and
      Pang, Chao  and
      Sun, Yu  and
      Tian, Hao  and
      Wu, Hua",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.676",
    pages = "10628--10650",
    abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.",
}