llm4decompile-6.7b-v2 open-source model: efficient decompilation of x86 assembly instructions into C code

LLM4Decompile 6.7B V2

Developed by LLM4Binary

LLM4Decompile is a model specialized in decompiling x86 assembly instructions into C code. The V2 release substantially improves performance over earlier versions.

Large language model

Transformers

Open-source license: MIT #x86 decompilation optimization #LLM-assisted reverse engineering #Ghidra enhancement

Downloads: 2,370

Release date: 6/18/2024

Model Overview

LLM4Decompile decompiles x86 assembly instructions into C code. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves performance over earlier models by up to 100%.

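Model Features

Strong decompilation capability

LLM4Decompile focuses on decompiling x86 assembly instructions into C code, and the V2 series delivers a substantial performance gain.

Training on a large dataset

The V2 series is trained on a larger dataset of 2B tokens, with a maximum token length of 4,096.

Performance improvement

Performance improves by up to 100% over earlier models.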

Model Capabilities

Decompilation of x86 assembly instructions

Generation of refined C code

Handling of long sequences (maximum token length of 4,096)

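Use Cases

Reverse engineering

Binary decompilation

Decompiles compiled binaries into readable C code, making them easier to analyze and modify.

Achieves a substantially higher re-executability rate than Ghidra alone: in the evaluation results, refining Ghidra output with LLM4Decompile-Ref-6.7B averages 0.5274, against 0.2012 for Ghidra by itself.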

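Security analysis

Vulnerability analysis

Decompiles binaries so that potential security vulnerabilities can be analyzed.

Produces a clearer code structure, which makes vulnerabilities easier to identify.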

🚀 LLM4Decompile

LLM4Decompile aims to decompile x86 assembly instructions into C code. The newly released V2 series is trained on a larger dataset (2 billion tokens) with a maximum token length of 4,096, and improves performance over earlier models by up to 100%.

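🚀 Quick Start

Example usage of the model (V2 only; for earlier versions, see the corresponding model pages on Hugging Face)

Install Ghidra

Download Ghidra to the current folder; other versions are listed on the Ghidra releases page on GitHub. Then unzip the archive in the current folder. In bash, the following commands can be used: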

cd LLM4Decompile/ghidra
wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
unzip ghidra_11.0.3_PUBLIC_20240410.zip

Install Java SDK 17

Ghidra 11 depends on JDK 17. A simple way to install it on Ubuntu is:

apt-get update
apt-get upgrade
apt install openjdk-17-jdk openjdk-17-jre

For other platforms, refer to the Ghidra installation guide.
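Before running the headless analyzer, it can help to confirm that the tools from the two installation steps are reachable. The following is a minimal sketch and not part of the original demo: `missing_tools` is a hypothetical helper, the choice of `gcc` and `java` as the commands to check is an assumption based on the steps in this guide, and the check does not verify that the Java version is 17.

```python
import os
import shutil


def missing_tools(commands, files=()):
    """Return commands that are not on PATH and files that do not exist.

    Hypothetical helper for illustration; not part of LLM4Decompile.
    """
    missing = [cmd for cmd in commands if shutil.which(cmd) is None]
    missing += [path for path in files if not os.path.isfile(path)]
    return missing


if __name__ == "__main__":
    # Path of the headless analyzer inside the unpacked Ghidra 11.0.3 archive;
    # adjust it to wherever the archive was extracted.
    headless = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"
    problems = missing_tools(["gcc", "java"], files=[headless])
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites found.")
```

An empty result only means the executables and the analyzer script were found; it says nothing about whether Ghidra will accept the installed JDK.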

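Decompile a binary with Ghidra Headless (demo.py)

Note: replace func0 with the name of the function you want to decompile.

Preprocessing: compile the C code into a binary, then run Ghidra's headless analyzer to decompile the binary into pseudo-code.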

import os
import subprocess
import tempfile  # required by tempfile.TemporaryDirectory() below

OPT = ["O0", "O1", "O2", "O3"]
timeout_duration = 10

ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"#path to the headless analyzer, change the path accordingly
postscript = "./decompile.py"#path to the decompiler helper function, change the path accordingly
project_path = "."#path to temp folder for analysis, change the path accordingly
project_name = "tmp_ghidra_proj"
func_path = "../samples/sample.c"#path to c code for compiling and decompiling, change the path accordingly
fileName = "sample"

with tempfile.TemporaryDirectory() as temp_dir:
    pid = os.getpid()
    asm_all = {}
    for opt in [OPT[0]]:
        executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
        # Compile the C source at the chosen optimization level; -lm links the math library.
        cmd = ["gcc", f"-{opt}", "-o", executable_path, func_path, "-lm"]
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,  # Suppress stdout
            stderr=subprocess.DEVNULL,  # Suppress stderr
            timeout=timeout_duration,
        )

        output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
        command = [
            ghidra_path,
            temp_dir,
            project_name,
            "-import", executable_path,
            "-postScript", postscript, output_path,
            "-deleteProject",  # WARNING: This will delete the project after analysis
        ]
        result = subprocess.run(command, text=True, capture_output=True, check=True)
        with open(output_path,'r') as f:
            c_decompile = f.read()
        c_func = []
        flag = 0
        for line in c_decompile.split('\n'):
            if "Function: func0" in line:  # Replace func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)
                continue
            if flag:
                if '// Function:' in line:
                    if len(c_func) > 1:
                        break
                c_func.append(line)
        if flag == 0:
            raise ValueError('bad case no function found')
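        # Skip the marker and comment lines that precede the function signature.
        idx_tmp = 0  # keeps the slice below valid even if nothing follows the marker
        for idx_tmp in range(1, len(c_func)):
            if 'func0' in c_func[idx_tmp]:  # replace func0 here as well
                break
        c_func = c_func[idx_tmp:]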
        input_asm = '\n'.join(c_func).strip()

        before = f"# This is the assembly code:\n"#prompt
        after = "\n# What is the source code?\n"#prompt
        input_asm_prompt = before+input_asm.strip()+after
        with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
            f.write(input_asm_prompt)
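The extraction and prompt-building steps in the script above can be isolated into two small functions, which makes them easier to test without Ghidra installed. This is a sketch under stated assumptions: `extract_function` and `build_prompt` are hypothetical names, and the marker format (a line reading `// Function: <name>` before each function) is inferred from the string checks in the demo, since the `decompile.py` post-script itself is not shown here. Unlike the demo's substring test, the sketch matches the marker line exactly.

```python
def extract_function(decompiled: str, func_name: str) -> str:
    """Return the pseudo-code of one function from marker-delimited text.

    Assumes each function is preceded by a line '// Function: <name>'.
    """
    lines = decompiled.split("\n")
    start = None
    for i, line in enumerate(lines):
        if line.strip() == f"// Function: {func_name}":
            start = i
            break
    if start is None:
        raise ValueError(f"function {func_name!r} not found")

    # Collect lines until the next function marker.
    body = []
    for line in lines[start + 1:]:
        if line.startswith("// Function:"):
            break
        body.append(line)

    # Drop anything before the signature line, as the demo does.
    for i, line in enumerate(body):
        if func_name in line:
            body = body[i:]
            break
    return "\n".join(body).strip()


def build_prompt(pseudo_code: str) -> str:
    """Wrap pseudo-code in the prompt strings used by the demo."""
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    return before + pseudo_code.strip() + after


if __name__ == "__main__":
    sample = (
        "// Function: helper\nint helper(void)\n{\n  return 1;\n}\n\n"
        "// Function: func0\n\nint func0(int a)\n{\n  return a + 1;\n}\n\n"
        "// Function: main\nint main(void)\n{\n  return 0;\n}\n"
    )
    print(build_prompt(extract_function(sample, "func0")))
```

Keeping the prompt strings identical to the demo matters more than the extraction details, because the model was presumably trained with that exact wording.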

An example of the pseudo-code produced by Ghidra:

undefined4 func0(float param_1,long param_2,int param_3)
{
  int local_28;
  int local_24;
  
  local_24 = 0;
  do {
    local_28 = local_24;
    if (param_3 <= local_24) {
      return 0;
    }
    while (local_28 = local_28 + 1, local_28 < param_3) {
      if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
                                  *(float *)(param_2 + (long)local_28 * 4)) &
                  SUB168(_DAT_00402010,0)) < (double)param_1) {
        return 1;
      }
    }
    local_24 = local_24 + 1;
  } while( true );
}
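Read loosely, the pseudo-code above appears to compare every pair of elements in a float array and return 1 as soon as two of them differ by less than `param_1`. The `& SUB168(_DAT_00402010,0)` mask is most likely a sign-bit clear, that is, an absolute value, although that is an interpretation rather than something the listing states. A minimal Python sketch of this reading, using hypothetical names (`has_close_pair`, `values`, `threshold`) that do not appear in the source:

```python
def has_close_pair(values, threshold):
    """Return True if any two elements differ by less than `threshold`.

    Mirrors the loop structure of the Ghidra pseudo-code:
    i plays the role of local_24, j of local_28, len(values) of param_3.
    Python floats are double precision, so single-precision rounding
    in the original is not reproduced.
    """
    n = len(values)
    for i in range(n):                 # outer do/while over local_24
        for j in range(i + 1, n):      # inner while over local_28
            # The mask in the pseudo-code is assumed to implement fabs().
            if abs(values[i] - values[j]) < threshold:
                return True
    return False


if __name__ == "__main__":
    print(has_close_pair([1.0, 2.0, 3.0], 0.5))
    print(has_close_pair([1.0, 2.8, 3.0, 4.0], 0.3))
```

Recovering this kind of readable intent from casts, masks, and generic variable names is the refinement step the model is meant to perform.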

Refine the pseudo-code with LLM4Decompile (demo.py)

Decompilation: use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C code.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
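    # The maximum length is 4096 tokens, so max_new_tokens must leave room for the prompt.
    outputs = model.generate(**inputs, max_new_tokens=2048)
# Drop the prompt tokens and the final end-of-sequence token before decoding.
c_func_decompile = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:-1])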

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:  # Ghidra pseudo-code prompt written in the previous step
    func = f.read()

# Only one function is decompiled here; the original binary may contain several.
print(f'pseudo function:\n{func}')
print(f'refined function:\n{c_func_decompile}')
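The fixed `max_new_tokens=2048` in the script works as long as the prompt is short enough. For longer functions, the generation length could be capped from the prompt length instead. The sketch below assumes that the 4096-token limit applies to prompt and generated tokens combined; `clamp_new_tokens` is a hypothetical helper, not part of the model's API.

```python
def clamp_new_tokens(prompt_len, requested=2048, context_window=4096):
    """Cap the generation length so prompt + new tokens fit the window.

    Hypothetical helper; the 4096 default is the maximum token length
    stated for the V2 models.
    """
    available = context_window - prompt_len
    if available <= 0:
        raise ValueError(
            f"prompt uses {prompt_len} tokens, leaving no room in a "
            f"{context_window}-token window"
        )
    return min(requested, available)


# In the demo this could be used as follows (not executed here):
#   prompt_len = inputs["input_ids"].shape[1]
#   outputs = model.generate(**inputs, max_new_tokens=clamp_new_tokens(prompt_len))
print(clamp_new_tokens(1000))  # 2048
print(clamp_new_tokens(3000))  # 1096
```

Raising an error for an over-long prompt is a design choice; truncating the pseudo-code instead would silently change the function being decompiled.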

📚 Documentation

Evaluation Results

Metric	Re-executability Rate					Edit Similarity
Optimization Level	O0	O1	O2	O3	AVG	O0	O1	O2	O3	AVG
LLM4Decompile-End-6.7B	0.6805	0.3951	0.3671	0.3720	0.4537	0.1557	0.1292	0.1293	0.1269	0.1353
Ghidra	0.3476	0.1646	0.1524	0.1402	0.2012	0.0699	0.0613	0.0619	0.0547	0.0620
+GPT-4o	0.4695	0.3415	0.2866	0.3110	0.3522	0.0660	0.0563	0.0567	0.0499	0.0572
+LLM4Decompile-Ref-1.3B	0.6890	0.3720	0.4085	0.3720	0.4604	0.1517	0.1325	0.1292	0.1267	0.1350
+LLM4Decompile-Ref-6.7B	0.7439	0.4695	0.4756	0.4207	0.5274	0.1559	0.1353	0.1342	0.1273	0.1382
+LLM4Decompile-Ref-33B	0.7073	0.4756	0.4390	0.4146	0.5091	0.1540	0.1379	0.1363	0.1307	0.1397
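The re-executability figures in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the values from the table and assumes that the average column is the plain arithmetic mean of the four optimization levels; `RE_EXEC`, `mean`, and `relative_gain` are names introduced here for illustration.

```python
# Re-executability rates from the table: ([O0, O1, O2, O3], reported average).
RE_EXEC = {
    "LLM4Decompile-End-6.7B": ([0.6805, 0.3951, 0.3671, 0.3720], 0.4537),
    "Ghidra": ([0.3476, 0.1646, 0.1524, 0.1402], 0.2012),
    "+GPT-4o": ([0.4695, 0.3415, 0.2866, 0.3110], 0.3522),
    "+LLM4Decompile-Ref-1.3B": ([0.6890, 0.3720, 0.4085, 0.3720], 0.4604),
    "+LLM4Decompile-Ref-6.7B": ([0.7439, 0.4695, 0.4756, 0.4207], 0.5274),
    "+LLM4Decompile-Ref-33B": ([0.7073, 0.4756, 0.4390, 0.4146], 0.5091),
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def relative_gain(new, base):
    """Relative improvement of `new` over `base` (0.5 means +50%)."""
    return (new - base) / base


for name, (levels, reported) in RE_EXEC.items():
    print(f"{name}: recomputed {mean(levels):.4f}, reported {reported:.4f}")

gain = relative_gain(RE_EXEC["+LLM4Decompile-Ref-6.7B"][1], RE_EXEC["Ghidra"][1])
print(f"Ref-6.7B over Ghidra alone: {gain:+.1%}")
```

On these numbers, refining Ghidra output with the 6.7B model yields an average re-executability roughly 2.6 times that of Ghidra alone.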