VirtualCompiler Open-Source Model - Freely Compile Any Programming Language into Low-Level Assembly Code

Developed by elsagranger

A large language model based on the 34-billion-parameter CodeLlama, capable of compiling any programming language into low-level assembly code

Open Source License: Apache-2.0

Tags: Assembly Code Generation, Large Language Model, Code Search Enhancement

Release Time : 5/25/2024

Model Overview

The Virtual Compiler is a large language model that simulates the behavior of a real compiler: it converts high-level source code into low-level assembly code. Its effectiveness is validated on assembly code search tasks.

Model Features

Virtual Compilation Capability

Simulates real compiler behavior to convert high-level language code into equivalent assembly code

Large-scale Parameters

Built on the 34-billion-parameter CodeLlama model with powerful code comprehension capabilities

Assembly Code Search

Generated virtual assembly code can be used for efficient code search tasks

Model Capabilities

Programming language compilation

Assembly code generation

Code semantic understanding

Assembly code search

Use Cases

Reverse Engineering

Binary Code Analysis

Assists in analyzing binary programs through generated assembly code

Improves reverse engineering efficiency

Code Security

Vulnerability Detection

Identifies potential security vulnerabilities through assembly code patterns

Enhances code security analysis capabilities

🚀 Virtual Compiler Is All You Need For Assembly Code Search

This repository provides models and corresponding evaluation datasets for the ACL 2024 paper "Virtual Compiler Is All You Need For Assembly Code Search", aiming to revolutionize assembly code search through a virtual compiler.

🚀 Quick Start

A virtual compiler is an LLM that is capable of compiling any programming language into the underlying assembly code. The virtual compiler model is available at elsagranger/VirtualCompiler and is based on the 34B CodeLlama.

We evaluate the similarity between the virtual assembly code generated by the virtual compiler and the real assembly code using force execution, via the script force-exec.py. The corresponding evaluation dataset is available at virtual_assembly_and_ground_truth.

We also evaluate the effectiveness of the virtual compiler on a downstream task, assembly code search. The evaluation dataset is available at elsagranger/AssemblyCodeSearchEval.

✨ Features

LLM-based Compilation: The virtual compiler, powered by a 34B CodeLlama-based LLM, can compile any programming language into assembly code.
Similarity Evaluation: Use a script to evaluate the similarity between virtual and real assembly code, with corresponding datasets provided.
Downstream Task Evaluation: Evaluate the effectiveness of the virtual compiler through assembly code search, with a dedicated evaluation dataset.

📦 Installation

This project uses FastChat with a vLLM worker to host the model. Run each of the following commands in a separate terminal (for example, separate tmux panes).

LOGDIR="" python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 --port 8080 \
    --controller-address http://localhost:21000

LOGDIR="" python3 -m fastchat.serve.controller \
    --host 0.0.0.0 --port 21000

LOGDIR="" RAY_LOG_TO_STDERR=1 \
    python3 -m fastchat.serve.vllm_worker \
    --model-path ./VirtualCompiler \
    --num-gpus 8 \
    --controller http://localhost:21000 \
    --max-num-batched-tokens 40960 \
    --disable-log-requests \
    --host 0.0.0.0 --port 22000 \
    --worker-address http://localhost:22000 \
    --model-names "VirtualCompiler"
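
Since the three processes run in separate terminals, it can help to confirm that each one is accepting connections before sending requests. The following is a minimal sketch, not part of the repository: the port numbers are taken from the commands above, while the helper names `port_open` and `check_services` are hypothetical.

```python
import socket

# Ports from the commands above: controller, vLLM worker, OpenAI-compatible API server.
SERVICES = {"controller": 21000, "vllm_worker": 22000, "openai_api_server": 8080}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

An open port only shows that the process is listening; it does not guarantee that the worker has finished loading the model weights.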

💻 Usage Examples

Basic Usage

After hosting the model, use do_request.py to make requests to the model.

~/C/VirtualCompiler (main)> python3 do_request.py
test rdx, rdx
setz al
movzx eax, al
neg eax
retn
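
For readers who want to call the hosted model directly, the sketch below shows what a request to the OpenAI-compatible endpoint might look like. It is not the repository's do_request.py. The base URL and model name come from the hosting commands above; the prompt text, `max_tokens`, and `temperature` values, and the helper names are assumptions for illustration. The prompt format the model actually expects is defined in do_request.py.

```python
import json
import urllib.request

API_BASE = "http://localhost:8080/v1"  # port passed to fastchat.serve.openai_api_server
MODEL_NAME = "VirtualCompiler"         # value passed to --model-names


def build_completion_request(prompt, max_tokens=512, temperature=0.0):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,    # assumed value, tune as needed
        "temperature": temperature,  # assumed value, tune as needed
    }
    return urllib.request.Request(
        f"{API_BASE}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (requires the server to be running)."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("<prompt in the format used by do_request.py>")` would return the generated assembly as a string.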

Advanced Usage

Here is an example of using the assembly code search encoder.

import torch

# `logits` holds one row per query; the first `pos_cnt` columns are the positives.
def calc_map_at_k(logits, pos_cnt, ks=(10,)):
    _, indices = torch.sort(logits, dim=1, descending=True)

    # [batch_size, pos_cnt]
    ranks = torch.nonzero(
        indices < pos_cnt,
        as_tuple=False
    )[:, 1].reshape(logits.shape[0], -1)

    # [batch_size]
    mrr = torch.mean(1 / (ranks + 1), dim=1)

    res = {}

    for k in ks:
        res[k] = (
            torch.sum((ranks < k).float(), dim=1) / min(k, pos_cnt)
        ).cpu().numpy()

    return ranks.cpu().numpy(), res, mrr.cpu().numpy()

pos_asm_cnt = 1

query = ["List all files in a directory"]

# Extracted by the process_asm.py script (see the Documentation section)
anchor_asm = [ {"1": "endbr64", "2": "mov eax, 0" }, ... ]
neg_anchor_asm = [ {"1": "push rbp", "2": "mov rbp, rsp", ... }, ... ]

query_embs = text_encoder(**text_tokenizer(query))

kwargs = dict(padding=True, pad_to_multiple_of=8, return_tensors="pt")
anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(pos) for pos in anchor_asm], **kwargs)
neg_anchor_asm_ids = asm_tokenizer.pad([asm_tokenizer(neg) for neg in neg_anchor_asm], **kwargs)

asm_embs = asm_encoder(**anchor_asm_ids)
asm_neg_emb = asm_encoder(**neg_anchor_asm_ids)

# query_embs: [query_cnt, emb_dim]
# asm_embs: [pos_asm_cnt, emb_dim]

# logits_pos: [query_cnt, pos_asm_cnt]
logits_pos = torch.einsum(
    "ic,jc->ij", [query_embs, asm_embs])
# logits_neg: [query_cnt, neg_asm_cnt]
logits_neg = torch.einsum(
    "ic,jc->ij", [query_embs, asm_neg_emb[pos_asm_cnt:]]
)
logits = torch.cat([logits_pos, logits_neg], dim=1)

ranks, map_at_k, mrr = calc_map_at_k(
    logits, pos_asm_cnt, [1, 5, 10, 20, 50, 100])
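
To make the scoring and ranking steps above concrete, here is a dependency-free sketch that mirrors them with plain Python lists. It is an illustration, not the authors' code: `similarity_matrix` and `rank_metrics` are hypothetical names, and the 2-dimensional embeddings are made up. As in `calc_map_at_k`, the per-`k` score is the number of positives ranked in the top `k` divided by `min(k, pos_cnt)`.

```python
def similarity_matrix(queries, candidates):
    """Dot-product scores; the list equivalent of einsum("ic,jc->ij")."""
    return [[sum(q * c for q, c in zip(qv, cv)) for cv in candidates]
            for qv in queries]


def rank_metrics(logits, pos_cnt, ks=(1, 5, 10)):
    """Per-query ranks of the positives, hit rate at each k, and mean reciprocal rank.

    Assumes the first `pos_cnt` columns of each row are the positives.
    """
    ranks, mrr = [], []
    hits = {k: [] for k in ks}
    for row in logits:
        # Candidate indices ordered from highest to lowest score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # 0-based positions at which the positives appear in that ordering.
        row_ranks = [r for r, j in enumerate(order) if j < pos_cnt]
        ranks.append(row_ranks)
        mrr.append(sum(1.0 / (r + 1) for r in row_ranks) / len(row_ranks))
        for k in ks:
            hits[k].append(sum(r < k for r in row_ranks) / min(k, pos_cnt))
    return ranks, hits, mrr


# Toy embeddings (made up): one query, one positive, three negatives.
query_embs = [[1.0, 0.0]]
pos_embs = [[0.5, 0.5]]
neg_embs = [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]]

# Positives go first, matching torch.cat([logits_pos, logits_neg], dim=1).
logits = similarity_matrix(query_embs, pos_embs + neg_embs)
ranks, hits_at_k, mrr = rank_metrics(logits, pos_cnt=1, ks=(1, 3))
```

In this toy case two negatives score higher than the positive, so the positive lands at 0-based rank 2: the hit rate is 0.0 at k=1 and 1.0 at k=3, and the reciprocal rank is 1/3.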

📚 Documentation

Because Hugging Face does not support loading a remote model from inside a folder, the model trained on the assembly code search dataset augmented by the Virtual Compiler is hosted separately in vic-encoder. You can use model.py to test loading the custom model.

The Advanced Usage section shows an example that uses the text encoder and the asm encoder. To extract the assembly code from a binary, refer to the script process_asm.py.

📄 License

This project is licensed under the Apache-2.0 license.
