llm4decompile-6.7b-v2 open-source model - efficiently decompiles x86 assembly instructions into C code

Llm4decompile 6.7b V2

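Developed by LLM4Binary

LLM4Decompile is a model dedicated to decompiling x86 assembly instructions into C code. The V2 release improves performance significantly over earlier versions.

Large language model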

Transformers

License: MIT #x86 decompilation optimization #LLM-assisted reverse engineering #Ghidra enhancement

Downloads: 2,370

Released: June 18, 2024

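Model overview

LLM4Decompile aims to decompile x86 assembly instructions into C code. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves decompilation performance significantly over the previous models (by up to 100%).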

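Model features

Strong decompilation capability

LLM4Decompile focuses on decompiling x86 assembly instructions into C code, and the newly released V2 series improves performance significantly.

Large-scale training data

The V2 series is trained on a larger dataset of 2B tokens, with a maximum token length of 4,096.

Performance gains

Compared with the previous models, performance improves significantly (by up to 100%).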

Model capabilities

Decompiles x86 assembly instructions

Generates refined C code

Handles long sequences (maximum token length of 4,096)

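Use cases

Reverse engineering

Binary decompilation

Decompiles compiled binaries into readable C code that is easier to analyze and modify.

The re-executability rate is markedly higher than that of traditional tools such as Ghidra (0.5274 versus 0.2012 on average when LLM4Decompile-Ref-6.7B refines Ghidra's output; see the evaluation results).

Security analysis

Vulnerability analysis

Decompiles binaries so that potential security vulnerabilities can be analyzed.

Produces a clearer code structure, which makes vulnerabilities easier to spot.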

🚀 LLM4Decompile

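LLM4Decompile aims to decompile x86 assembly instructions into C code. The newly released V2 series is trained on a larger dataset (2B tokens) with a maximum token length of 4,096, and improves decompilation performance significantly over the previous models (by up to 100%).

🚀 Quick start

Model usage example (V2 only; for earlier versions, see the corresponding model pages on Hugging Face)

Install Ghidra: download Ghidra into the current folder (other versions are listed on the Ghidra releases page) and unzip the archive into the same folder. In bash, you can use the following commands: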

cd LLM4Decompile/ghidra
wget https://github.com/NationalSecurityAgency/ghidra/releases/download/Ghidra_11.0.3_build/ghidra_11.0.3_PUBLIC_20240410.zip
unzip ghidra_11.0.3_PUBLIC_20240410.zip
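
Before moving on, it can help to confirm that the unzip step produced the headless analyzer that demo.py invokes later. The following is a minimal sketch and not part of the original instructions: the helper name `locate_headless` is hypothetical, and the `support/analyzeHeadless` layout is taken from the `ghidra_path` value used in demo.py.

```python
import os
from pathlib import Path


def locate_headless(ghidra_root):
    """Return the path to analyzeHeadless under an unpacked Ghidra folder.

    Assumes the layout <ghidra_root>/support/analyzeHeadless, as used by
    the ghidra_path variable in demo.py.
    """
    candidate = Path(ghidra_root) / "support" / "analyzeHeadless"
    if not candidate.is_file():
        raise FileNotFoundError(f"analyzeHeadless not found under {ghidra_root}")
    if not os.access(candidate, os.X_OK):
        raise PermissionError(f"{candidate} exists but is not executable")
    return candidate
```

For the archive above, this would be called as `locate_headless("./ghidra_11.0.3_PUBLIC")`.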

Install Java-SDK-17: Ghidra 11 depends on Java-SDK-17. A simple way to install the SDK on Ubuntu is:

apt-get update
apt-get upgrade
apt install openjdk-17-jdk openjdk-17-jre

For other platforms, see the Ghidra installation guide.
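
Since Ghidra 11 depends on Java 17, a quick version check can catch a mismatched JDK before the headless run fails. This is a hedged sketch: the function names are hypothetical, and it assumes the usual `java -version` banner of the form `openjdk version "17.0.x"` (or the legacy `"1.8.0_x"` form), which the JVM writes to stderr.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8.0_...".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Return the major version of the `java` on PATH, or None if absent."""
    if shutil.which("java") is None:
        return None
    # `java -version` prints its banner to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

A caller could compare `installed_java_major()` against 17 and stop early with a clear message when it differs.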

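Decompile the binary with Ghidra Headless (demo.py)

Note: replace func0 with the name of the function you want to decompile.

Preprocessing: compile the C code into a binary, then decompile the binary into pseudo-code with Ghidra Headless.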

import os
import subprocess
import tempfile  # needed for tempfile.TemporaryDirectory() below

OPT = ["O0", "O1", "O2", "O3"]
timeout_duration = 10

ghidra_path = "./ghidra_11.0.3_PUBLIC/support/analyzeHeadless"  # path to the headless analyzer, change the path accordingly
postscript = "./decompile.py"  # path to the decompiler helper function, change the path accordingly
project_path = "."  # path to temp folder for analysis, change the path accordingly
project_name = "tmp_ghidra_proj"
func_path = "../samples/sample.c"  # path to c code for compiling and decompiling, change the path accordingly
fileName = "sample"

with tempfile.TemporaryDirectory() as temp_dir:
    pid = os.getpid()
    asm_all = {}
    for opt in [OPT[0]]:
        executable_path = os.path.join(temp_dir, f"{pid}_{opt}.o")
        # Build the compiler command as a list so paths containing spaces are passed intact
        cmd = ["gcc", f"-{opt}", "-o", executable_path, func_path, "-lm"]
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,  # Suppress stdout
            stderr=subprocess.DEVNULL,  # Suppress stderr
            timeout=timeout_duration,
        )

        output_path = os.path.join(temp_dir, f"{pid}_{opt}.c")
        command = [
            ghidra_path,
            temp_dir,
            project_name,
            "-import", executable_path,
            "-postScript", postscript, output_path,
            "-deleteProject",  # WARNING: This will delete the project after analysis
        ]
        result = subprocess.run(command, text=True, capture_output=True, check=True)
        with open(output_path,'r') as f:
            c_decompile = f.read()
        c_func = []
        flag = 0
        for line in c_decompile.split('\n'):
            if "Function: func0" in line:#**Replace** func0 with the function name you want to decompile.
                flag = 1
                c_func.append(line)
                continue
            if flag:
                if '// Function:' in line:
                    if len(c_func) > 1:
                        break
                c_func.append(line)
        if flag == 0:
            raise ValueError('bad case no function found')
        for idx_tmp in range(1,len(c_func)):##########remove the comments
            if 'func0' in c_func[idx_tmp]:
                break
        c_func = c_func[idx_tmp:]
        input_asm = '\n'.join(c_func).strip()

        before = f"# This is the assembly code:\n"#prompt
        after = "\n# What is the source code?\n"#prompt
        input_asm_prompt = before+input_asm.strip()+after
        with open(fileName +'_' + opt +'.pseudo','w',encoding='utf-8') as f:
            f.write(input_asm_prompt)

An example of the Ghidra pseudo-code:

undefined4 func0(float param_1,long param_2,int param_3)
{
  int local_28;
  int local_24;
  
  local_24 = 0;
  do {
    local_28 = local_24;
    if (param_3 <= local_24) {
      return 0;
    }
    while (local_28 = local_28 + 1, local_28 < param_3) {
      if ((double)((ulong)(double)(*(float *)(param_2 + (long)local_24 * 4) -
                                  *(float *)(param_2 + (long)local_28 * 4)) &
                  SUB168(_DAT_00402010,0)) < (double)param_1) {
        return 1;
      }
    }
    local_24 = local_24 + 1;
  } while( true );
}
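
The extraction step in demo.py hard-codes func0 in two places, which is why the note asks you to replace it by hand. The same idea can be sketched as a reusable function that takes the function name as a parameter. This is a simplified illustration, not the authors' code: `extract_function` and `build_prompt` are hypothetical names, and the `// Function: <name>` header format is an assumption inferred from the string checks in demo.py.

```python
import re


def extract_function(decompiled, func_name):
    """Return the pseudo-code of one function from Ghidra's dumped output.

    Assumes each function is preceded by a header line containing
    "// Function: <name>" (inferred from demo.py, not verified).
    """
    lines = decompiled.split("\n")
    # \b stops "func0" from also matching "func01".
    header = re.compile(rf"//\s*Function:\s*{re.escape(func_name)}\b")
    start = next((i for i, line in enumerate(lines) if header.search(line)), None)
    if start is None:
        raise ValueError(f"function {func_name!r} not found")

    body = []
    for line in lines[start + 1:]:
        if "// Function:" in line:  # next function begins here
            break
        body.append(line)

    # Skip leading comment lines until the signature mentions the name.
    for i, line in enumerate(body):
        if func_name in line:
            return "\n".join(body[i:]).strip()
    raise ValueError(f"no signature found for {func_name!r}")


def build_prompt(func_source):
    """Wrap pseudo-code in the prompt text used by demo.py."""
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    return before + func_source.strip() + after
```

With these helpers, the body of the loop in demo.py would reduce to `build_prompt(extract_function(c_decompile, "func0"))`.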

Refine the pseudo-code with LLM4Decompile (demo.py)

Decompilation: use LLM4Decompile-Ref to refine the Ghidra pseudo-code into C code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v2' # V2 Model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)### max length to 4096, max new tokens should be below the range
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

with open(fileName +'_' + OPT[0] +'.pseudo','r') as f:#original file
    func = f.read()

print(f'pseudo function:\n{func}')# Note we only decompile one function, where the original file may contain multiple functions
print(f'refined function:\n{c_func_decompile}')
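
The comment on the `generate` call notes that the maximum length is 4096 and that `max_new_tokens` must stay within that range. One way to make this explicit is to derive the budget from the prompt length. The sketch below is an illustration under that reading: `new_token_budget` is a hypothetical helper, and treating 4096 as a combined limit on prompt plus generated tokens is an assumption.

```python
CONTEXT_LIMIT = 4096  # maximum token length stated for the V2 models


def new_token_budget(prompt_tokens, requested=2048, context_limit=CONTEXT_LIMIT):
    """Cap the number of generated tokens so prompt + output fits the limit."""
    remaining = context_limit - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, leaving no room under {context_limit}"
        )
    return min(requested, remaining)
```

In the script above, `prompt_tokens` would be `inputs["input_ids"].shape[1]`, and the result would be passed as `max_new_tokens`.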

📚 Documentation

Evaluation results

Re-executability rate and edit similarity at each optimization level (O0 to O3), with the average across levels. Rows prefixed with + apply the named model to Ghidra's pseudo-code.

Model	Re-exec O0	Re-exec O1	Re-exec O2	Re-exec O3	Re-exec Avg	Edit-sim O0	Edit-sim O1	Edit-sim O2	Edit-sim O3	Edit-sim Avg
LLM4Decompile-End-6.7B	0.6805	0.3951	0.3671	0.3720	0.4537	0.1557	0.1292	0.1293	0.1269	0.1353
Ghidra	0.3476	0.1646	0.1524	0.1402	0.2012	0.0699	0.0613	0.0619	0.0547	0.0620
+GPT-4o	0.4695	0.3415	0.2866	0.3110	0.3522	0.0660	0.0563	0.0567	0.0499	0.0572
+LLM4Decompile-Ref-1.3B	0.6890	0.3720	0.4085	0.3720	0.4604	0.1517	0.1325	0.1292	0.1267	0.1350
+LLM4Decompile-Ref-6.7B	0.7439	0.4695	0.4756	0.4207	0.5274	0.1559	0.1353	0.1342	0.1273	0.1382
+LLM4Decompile-Ref-33B	0.7073	0.4756	0.4390	0.4146	0.5091	0.1540	0.1379	0.1363	0.1307	0.1397
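
To make the table easier to read, the short sketch below recomputes each average re-executability rate from the four per-level values and expresses one comparison as a relative gain. The numbers are copied from the table; the helper names `mean` and `relative_gain` are illustrative only, and the assumption that the average column is a plain arithmetic mean of O0 to O3 is checked by the code rather than stated by the source.

```python
# Per-level re-executability (O0, O1, O2, O3) and the reported average.
REEXEC = {
    "LLM4Decompile-End-6.7B": ([0.6805, 0.3951, 0.3671, 0.3720], 0.4537),
    "Ghidra": ([0.3476, 0.1646, 0.1524, 0.1402], 0.2012),
    "+GPT-4o": ([0.4695, 0.3415, 0.2866, 0.3110], 0.3522),
    "+LLM4Decompile-Ref-1.3B": ([0.6890, 0.3720, 0.4085, 0.3720], 0.4604),
    "+LLM4Decompile-Ref-6.7B": ([0.7439, 0.4695, 0.4756, 0.4207], 0.5274),
    "+LLM4Decompile-Ref-33B": ([0.7073, 0.4756, 0.4390, 0.4146], 0.5091),
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def relative_gain(new, base):
    """Fractional improvement of `new` over `base` (0.5 means +50%)."""
    return (new - base) / base


for name, (levels, reported) in REEXEC.items():
    print(f"{name}: recomputed {mean(levels):.4f}, reported {reported:.4f}")

ghidra_avg = REEXEC["Ghidra"][1]
ref_avg = REEXEC["+LLM4Decompile-Ref-6.7B"][1]
print(f"Ref-6.7B over Ghidra: {relative_gain(ref_avg, ghidra_avg):+.1%}")
```

The recomputed means agree with the reported averages to within rounding, so the average column can be read as an unweighted mean over the four optimization levels.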