LLM4Decompile
LLM4Decompile decompiles x86 assembly instructions into C. The V1.5 series is trained on a larger dataset (15B tokens) and supports a maximum token length of 4,096. It improves performance by up to 100% over the previous model.
Quick Start
Features
- Decompile x86 assembly instructions into C.
- The V1.5 series is trained on a larger dataset (15B tokens) and supports a maximum token length of 4,096.
- Improves performance by up to 100% over the previous model.
Documentation
Evaluation Results
| Model/Benchmark | HumanEval-Decompile | | | | | ExeBench | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Optimization Level | O0 | O1 | O2 | O3 | AVG | O0 | O1 | O2 | O3 | AVG |
| DeepSeek-Coder-6.7B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| GPT-4o | 0.3049 | 0.1159 | 0.1037 | 0.1159 | 0.1601 | 0.0443 | 0.0328 | 0.0397 | 0.0343 | 0.0378 |
| LLM4Decompile-End-1.3B | 0.4720 | 0.2061 | 0.2122 | 0.2024 | 0.2732 | 0.1786 | 0.1362 | 0.1320 | 0.1328 | 0.1449 |
| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 | 0.2289 | 0.1660 | 0.1618 | 0.1625 | 0.1798 |
| LLM4Decompile-End-33B | 0.5168 | 0.2956 | 0.2815 | 0.2675 | 0.3404 | 0.1886 | 0.1465 | 0.1396 | 0.1411 | 0.1540 |
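The AVG columns appear to be the arithmetic mean of the four optimization levels. The short check below is a sketch that assumes this reading; it uses the values copied from the LLM4Decompile-End-6.7B row, and the `average` helper is a hypothetical name introduced for illustration.

```python
# Values copied from the LLM4Decompile-End-6.7B row of the results table.
humaneval = {"O0": 0.6805, "O1": 0.3951, "O2": 0.3671, "O3": 0.3720}
exebench = {"O0": 0.2289, "O1": 0.1660, "O2": 0.1618, "O3": 0.1625}


def average(scores):
    """Arithmetic mean over the optimization levels, rounded to 4 decimals."""
    return round(sum(scores.values()) / len(scores), 4)


humaneval_avg = average(humaneval)
exebench_avg = average(exebench)
print(humaneval_avg, exebench_avg)  # 0.4537 0.1798, matching the AVG columns
```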
Usage Examples
Basic Usage
Preprocessing: Compile the C code into binary, and disassemble the binary into assembly instructions.
import subprocess
import os

OPT = ["O0", "O1", "O2", "O3"]
fileName = 'samples/sample'  # path to the C source, without the .c extension
FUNC_LABEL = '<func0>:'  # func0 is the function to decompile; change it to match your source

for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    # Compile the C code with GCC at the given optimization level.
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'
    subprocess.run(compile_command, shell=True, check=True)
    # Disassemble the binary into assembly instructions.
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'
    subprocess.run(compile_command, shell=True, check=True)

    with open(output_file + '.s') as f:
        asm = f.read()
    if FUNC_LABEL not in asm:
        raise ValueError("compile fails")
    # Keep only the target function: from its label up to the next blank line.
    asm = FUNC_LABEL + asm.split(FUNC_LABEL)[-1].split('\n\n')[0]

    asm_clean = ""
    for tmp in asm.split("\n"):
        # Skip continuation lines that hold only leftover opcode bytes.
        if len(tmp.split("\t")) < 3 and '00' in tmp:
            continue
        idx = min(len(tmp.split("\t")) - 1, 2)
        tmp_asm = "\t".join(tmp.split("\t")[idx:])  # drop the address and raw bytes
        tmp_asm = tmp_asm.split("#")[0].strip()  # drop objdump comments
        asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()

    # Wrap the cleaned assembly in the prompt format.
    before = "# This is the assembly code:\n"
    after = "\n# What is the source code?\n"
    input_asm_prompt = before + input_asm + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)
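To see what the cleaning step in the preprocessing script does without running GCC or objdump, the sketch below applies the same logic to a small hand-written disassembly. The sample listing, its addresses, and the helper names `clean_objdump_function` and `build_prompt` are assumptions made for illustration; only the string handling mirrors the script.

```python
def clean_objdump_function(objdump_text, func_name="func0"):
    """Extract one function from `objdump -d` text; drop addresses, bytes and comments."""
    label = f"<{func_name}>:"
    if label not in objdump_text:
        raise ValueError(f"{func_name} not found in disassembly")
    # Keep everything from the label up to the first blank line.
    body = label + objdump_text.split(label)[-1].split("\n\n")[0]
    cleaned = []
    for line in body.split("\n"):
        fields = line.split("\t")
        # Continuation lines carry only leftover opcode bytes, so skip them.
        if len(fields) < 3 and "00" in line:
            continue
        # Fields are address, raw bytes, instruction; keep the instruction only.
        instruction = "\t".join(fields[min(len(fields) - 1, 2):])
        cleaned.append(instruction.split("#")[0].strip())
    return "\n".join(cleaned).strip()


def build_prompt(clean_asm):
    """Wrap cleaned assembly in the prompt format used by the preprocessing script."""
    return "# This is the assembly code:\n" + clean_asm + "\n# What is the source code?\n"


# Hand-written sample in objdump's tab-separated layout (illustrative only).
sample = (
    "0000000000001129 <func0>:\n"
    "    1129:\t55                   \tpush   %rbp\n"
    "    112a:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "    112d:\t48 b8 00 00 00 00 01 \tmovabs $0x100000000,%rax\n"
    "    1134:\t00 00 00 \n"
    "    1137:\t8b 05 d3 2e 00 00    \tmov    0x2ed3(%rip),%eax        # 4010 <counter>\n"
    "    113d:\t5d                   \tpop    %rbp\n"
    "    113e:\tc3                   \tret    \n"
    "\n"
    "000000000000113f <main>:\n"
    "    113f:\tc3                   \tret    \n"
)

clean_asm = clean_objdump_function(sample)
prompt = build_prompt(clean_asm)
print(prompt)
```

The `"00" in line` test is a heuristic: a continuation line that contains no `00` byte would not be skipped, and its raw bytes would end up in the prompt.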
Advanced Usage
Decompilation: Use LLM4Decompile to translate the assembly instructions into C:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v1.5'  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

# fileName and OPT come from the preprocessing script; OPT[0] selects the O0 prompt.
with open(fileName + '_' + OPT[0] + '.asm', 'r') as f:
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4000)
# Skip the prompt tokens and the final token, then decode the generated C code.
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])

# Load the original source for a side-by-side comparison.
with open(fileName + '.c', 'r') as f:
    func = f.read()

print(f'original function:\n{func}')
print(f'decompiled function:\n{c_func_decompile}')
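Two details of the generation step are easy to miss: how many tokens are left for the output, and what the slice passed to `tokenizer.decode` keeps. The sketch below illustrates both with plain Python lists. It assumes that the 4,096-token limit covers the prompt and the generated code together, and the token IDs and the helper names `new_token_budget` and `strip_prompt` are made up for illustration.

```python
MAX_CONTEXT_TOKENS = 4096  # maximum token length stated for the V1.5 series


def new_token_budget(prompt_token_count, max_context=MAX_CONTEXT_TOKENS):
    """Tokens left for generation, assuming the limit covers prompt plus output."""
    return max(0, max_context - prompt_token_count)


def strip_prompt(sequence, prompt_token_count):
    """Drop the echoed prompt and the final token, as the decode slice does."""
    return sequence[prompt_token_count:-1]


prompt_ids = [101, 7, 8, 9]      # hypothetical prompt token IDs
generated_ids = [42, 43, 44, 2]  # hypothetical new tokens; 2 stands in for end-of-sequence
sequence = prompt_ids + generated_ids  # a causal LM returns the prompt followed by new tokens

budget = new_token_budget(len(prompt_ids))
new_tokens = strip_prompt(sequence, len(prompt_ids))
print(budget)      # 4092
print(new_tokens)  # [42, 43, 44]
```

If generation stops at `max_new_tokens` rather than at an end-of-sequence token, the trailing `-1` removes a real token, so the decoded function may be cut short by one token.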
License
This code repository is licensed under the MIT License.
Contact
If you have any questions, please raise an issue.