llm4decompile-6.7b-v2 open-source model: efficient decompilation of x86 assembly into C code

Llm4decompile 6.7b V2

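Developed by LLM4Binary

LLM4Decompile is a model focused on decompiling x86 assembly instructions into C code; the V2 release significantly improves performance.

Large language model

Transformers

License: MIT #x86-decompilation #LLM-assisted-reverse-engineering #Ghidra-enhancement

Downloads: 2,370

Release date: June 18, 2024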

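Model overview

LLM4Decompile aims to decompile x86 assembly instructions into C code. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves performance by up to 100% over the previous models.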

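Model features

Strong decompilation capability

LLM4Decompile decompiles x86 assembly instructions into C code, and the V2 series performs significantly better than earlier releases.

Large-scale training data

The V2 series is trained on a larger dataset of 2B tokens, with a maximum token length of 4,096.

Performance gains

Compared with the previous models, performance improves by up to 100%.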

Model capabilities

Decompiles x86 assembly instructions

Generates refined C code

Handles long sequences (maximum token length of 4,096)

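Use cases

Reverse engineering

Binary decompilation

Decompile compiled binaries into readable C code that is easier to analyze and modify.

The re-executability rate is well above that of traditional tools such as Ghidra: in the evaluation results, refining Ghidra output with LLM4Decompile-Ref-6.7B averages 0.5274, versus 0.2012 for Ghidra alone.

Security analysis

Vulnerability analysis

Decompile binaries to analyze potential security vulnerabilities.

The output has a clearer code structure, which makes vulnerabilities easier to identify.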

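🚀 Quick start

Model usage example (applies to V2 only; for older versions, see the corresponding model pages on Hugging Face)

Install Ghidra

Download Ghidra to the current folder; other versions are available on the Ghidra releases page on GitHub. Extract the archive into the current folder. In bash, you can use the following commands: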

cd LLM4Decompile/ghidra
wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
unzip ghidra_11.0.3_PUBLIC_20240410.zip

Install Java-SDK-17

Ghidra 11 depends on Java-SDK-17. A simple way to install the SDK on Ubuntu is:

apt-get update
apt-get upgrade
apt install openjdk-17-jdk openjdk-17-jre

For other platforms, see the Ghidra installation guide.
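Before running the headless analyzer, it can help to confirm that the required command-line tools are on `PATH`. The following is a minimal sketch and not part of the original demo; the helper name `missing_tools` is hypothetical, and the tool list (`gcc`, `java`) is an assumption drawn from the steps in this guide.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumption: gcc compiles the sample and java runs Ghidra.
    missing = missing_tools(["gcc", "java"])
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

This only checks that the executables exist; it does not verify that the installed Java version is 17.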

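Decompile the binary with Ghidra Headless (demo.py)

Note: replace func0 with the name of the function you want to decompile.

Preprocessing: compile the C code into a binary, then decompile the binary with Ghidra into pseudo-code.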

import os
import subprocess
import tempfile  # required for tempfile.TemporaryDirectory() below

OPT = ["O0", "O1", "O2", "O3"]
timeout_duration = 10

ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"  # path to the headless analyzer, change the path accordingly
postscript = "./decompile.py"  # path to the decompiler helper function, change the path accordingly
project_path = "."  # path to temp folder for analysis, change the path accordingly
project_name = "tmp_ghidra_proj"
func_path = "../samples/sample.c"  # path to c code for compiling and decompiling, change the path accordingly
fileName = "sample"

with tempfile.TemporaryDirectory() as temp_dir:
    pid = os.getpid()
    asm_all = {}
    for opt in [OPT[0]]:
        executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
        cmd = f'gcc -{opt} -o {executable_path} {func_path} -lm'
        subprocess.run(
            cmd.split(' '),
            check=True,
            stdout=subprocess.DEVNULL,  # Suppress stdout
            stderr=subprocess.DEVNULL,  # Suppress stderr
            timeout=timeout_duration,
        )

        output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
        command = [
            ghidra_path,
            temp_dir,
            project_name,
            "-import", executable_path,
            "-postScript", postscript, output_path,
            "-deleteProject",  # WARNING: This will delete the project after analysis
        ]
        result = subprocess.run(command, text=True, capture_output=True, check=True)
        with open(output_path,'r') as f:
            c_decompile = f.read()
        c_func = []
        flag = 0
        for line in c_decompile.split('\n'):
            if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)
                continue
            if flag:
                if '// Function:' in line:
                    if len(c_func) > 1:
                        break
                c_func.append(line)
        if flag == 0:
            raise ValueError('bad case no function found')
        for idx_tmp in range(1,len(c_func)):##########remove the comments
            if 'func0' in c_func[idx_tmp]:
                break
        c_func = c_func[idx_tmp:]
        input_asm = '\n'.join(c_func).strip()

        before = "# This is the assembly code:\n"  # prompt prefix
        after = "\n# What is the source code?\n"  # prompt suffix
        input_asm_prompt = before + input_asm.strip() + after
        with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
            f.write(input_asm_prompt)

An example of the Ghidra pseudo-code:

undefined4 func0(float param_1,long param_2,int param_3)
{
  int local_28;
  int local_24;
  
  local_24 = 0;
  do {
    local_28 = local_24;
    if (param_3 <= local_24) {
      return 0;
    }
    while (local_28 = local_28 + 1, local_28 < param_3) {
      if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
                                  *(float *)(param_2 + (long)local_28 * 4)) &
                  SUB168(_DAT_00402010,0)) < (double)param_1) {
        return 1;
      }
    }
    local_24 = local_24 + 1;
  } while( true );
}
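The extraction and prompt-building steps in the preprocessing script can be factored into two small functions, which makes them easier to test without running Ghidra. This is a sketch under stated assumptions: the helper names `extract_function` and `build_prompt` are hypothetical, and it assumes the helper script writes one `// Function: <name>` marker line before each decompiled function, as the checks in the script suggest. Unlike the substring test in the script, it compares the whole function name, so `func0` does not match `func01`.

```python
def extract_function(decompiled: str, func_name: str) -> str:
    """Return the pseudo-code of `func_name`, starting at its signature line."""
    collected = []
    found = False
    for line in decompiled.split("\n"):
        if "// Function:" in line:
            if found:
                break  # marker of the next function: stop collecting
            rest = line.split("// Function:", 1)[1]
            found = rest.split()[:1] == [func_name]
            continue
        if found:
            collected.append(line)
    if not found:
        raise ValueError(f"function {func_name!r} not found")
    # Skip any comment lines that precede the signature.
    for index, line in enumerate(collected):
        if func_name in line:
            return "\n".join(collected[index:]).strip()
    raise ValueError(f"no signature line found for {func_name!r}")


def build_prompt(pseudo_code: str) -> str:
    """Wrap pseudo-code in the prompt format used by the script above."""
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    return before + pseudo_code.strip() + after
```

For example, `build_prompt(extract_function(c_decompile, "func0"))` would produce the same kind of text that the script writes to the `.pseudo` file.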

Refine the pseudo-code with LLM4Decompile (demo.py)

Decompilation: use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)  # maximum length is 4096 tokens; keep max_new_tokens within that range
# Decode only the generated tokens: skip the prompt and drop the final token.
c_func_decompile = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:-1])

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
    func = f.read()

print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
print(f'refined function:\n{c_func_decompile}')
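The comment in the script notes that the maximum length is 4096 tokens and that `max_new_tokens` should stay within that range. One way to enforce this is to derive the generation budget from the prompt length. The sketch below is illustrative only: the helper name `new_token_budget` is hypothetical, and it assumes the 4096-token limit covers the prompt and the generated tokens together.

```python
def new_token_budget(prompt_tokens: int, context_limit: int = 4096, cap: int = 2048) -> int:
    """Return how many new tokens can be generated without exceeding the limit.

    prompt_tokens: number of tokens in the tokenized prompt.
    context_limit: assumed total budget for prompt plus generated tokens.
    cap: upper bound on new tokens (2048 in the script above).
    """
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt already fills the context limit")
    return min(cap, remaining)
```

In the script above, this could be used as `max_new_tokens=new_token_budget(inputs["input_ids"].shape[1])`.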

📚 Documentation

Evaluation results

Metric	Re-executability rate					Edit similarity
Optimization level	O0	O1	O2	O3	Avg.	O0	O1	O2	O3	Avg.
LLM4Decompile-End-6.7B	0.6805	0.3951	0.3671	0.3720	0.4537	0.1557	0.1292	0.1293	0.1269	0.1353
Ghidra	0.3476	0.1646	0.1524	0.1402	0.2012	0.0699	0.0613	0.0619	0.0547	0.0620
+GPT-4o	0.4695	0.3415	0.2866	0.3110	0.3522	0.0660	0.0563	0.0567	0.0499	0.0572
+LLM4Decompile-Ref-1.3B	0.6890	0.3720	0.4085	0.3720	0.4604	0.1517	0.1325	0.1292	0.1267	0.1350
+LLM4Decompile-Ref-6.7B	0.7439	0.4695	0.4756	0.4207	0.5274	0.1559	0.1353	0.1342	0.1273	0.1382
+LLM4Decompile-Ref-33B	0.7073	0.4756	0.4390	0.4146	0.5091	0.1540	0.1379	0.1363	0.1307	0.1397
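The "Avg." column is the arithmetic mean of the four optimization levels, which can be checked directly from the table. The short sketch below recomputes the re-executability averages for three rows and the ratio between the Ghidra baseline and Ghidra refined by LLM4Decompile-Ref-6.7B; the values are copied from the table, and the variable names are illustrative.

```python
# Re-executability rates for O0, O1, O2, O3, copied from the table above.
reexecutability = {
    "LLM4Decompile-End-6.7B": [0.6805, 0.3951, 0.3671, 0.3720],
    "Ghidra": [0.3476, 0.1646, 0.1524, 0.1402],
    "+LLM4Decompile-Ref-6.7B": [0.7439, 0.4695, 0.4756, 0.4207],
}


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for name, scores in reexecutability.items():
    print(f"{name}: {average(scores):.4f}")

# Ratio of the refined result to the Ghidra baseline.
ratio = average(reexecutability["+LLM4Decompile-Ref-6.7B"]) / average(reexecutability["Ghidra"])
print(f"Ref-6.7B vs. Ghidra: {ratio:.2f}x")
```

On these numbers, refining Ghidra output with the 6.7B model gives an average re-executability roughly 2.6 times that of Ghidra alone.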