# CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
CLAP is a framework that learns binary code representations through natural language supervision. By aligning binary code with natural-language explanations, it improves analysis performance in few-shot and zero-shot scenarios. Backed by a dataset engine that generates 195 million pairs of code snippets and descriptions, CLAP provides highly transferable representations for binary code analysis.
## Quick Start
This guide walks you through setting up the CLAP model and using it, without further training, for tasks such as fine-grained classification of sorting algorithms, malware, and cryptographic algorithms.
### Requirements
- Python 3.6 or higher
- [PyTorch](https://pytorch.org/get-started/locally/)
- Transformers library
- A CUDA-enabled GPU is highly recommended for faster processing.
Make sure you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:
```bash
pip install transformers
```
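If you want to confirm that the environment is ready before downloading the models, a quick check such as the following can help (a minimal, illustrative sketch, not part of the CLAP codebase):

```python
# Environment sanity check (illustrative only).
import torch
import transformers

print(f"PyTorch version:      {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available:       {torch.cuda.is_available()}")  # a GPU is strongly recommended
```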
### Preparing Tokenizers and Models
Import the necessary libraries and initialize the tokenizers and models:
```python
import json   # used below to load the example assembly dataset
import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenizers and encoders for assembly code and natural-language text
asm_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)
```
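As a quick sanity check, you can embed a single prompt and inspect the output shape. This is a minimal sketch; the prompt text is arbitrary, and the `.last_hidden_state` access simply mirrors the classification example below:

```python
# Sanity check (illustrative): embed one prompt and print the embedding shape.
with torch.no_grad():
    demo_input = text_tokenizer(["This is a function related to bubble sort"],
                                padding=True, return_tensors="pt").to(device)
    demo_embedding = text_encoder(**demo_input).last_hidden_state
print(demo_embedding.shape)  # expected: (1, embedding_dim)
```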
## Usage Examples
### Basic Usage
Here is an example of fine-grained sorting algorithm classification (zero-shot):
- Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing assembly code snippets related to bubble sort:

```python
with open("./CaseStudy/bubblesort.json") as fp:
    asm = json.load(fp)
```
- Define your classification prompts:

```python
prompts = [
    "This is a function related to bubble sort",
    "This is a function related to selection sort",
    ...
]
```
- Encode the assembly code and the prompts, then perform zero-shot classification:

```python
# Embed the assembly snippet and the candidate text prompts
asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
asm_embedding = asm_encoder(**asm_input)
text_input = text_tokenizer(prompts, padding=True, return_tensors="pt").to(device)
text_embeddings = text_encoder(**text_input)

# Similarity between the assembly embedding and each prompt embedding,
# scaled by a temperature of 0.07 and converted to probabilities
logits = torch.einsum("nc,ck->nk", [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()

for i, prompt in enumerate(prompts):
    print(f"Probability: {preds[i]*100:.3f}%, Text: {prompt}")
```
You can repeat this process for other classification tasks, such as malware classification or cryptographic algorithm identification, by loading the corresponding dataset and defining the relevant natural-language prompts.
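Because the encode-and-compare steps are the same for every task, it can be convenient to wrap them in a small helper. The sketch below is illustrative: the function name `zero_shot_classify`, the `./CaseStudy/malware.json` path, and the malware prompts are assumptions, not part of the released code.

```python
def zero_shot_classify(asm, prompts):
    """Return (prompt, probability) pairs for one assembly snippet (illustrative helper)."""
    asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
    text_input = text_tokenizer(prompts, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        asm_embedding = asm_encoder(**asm_input).last_hidden_state
        text_embeddings = text_encoder(**text_input).last_hidden_state
    logits = torch.einsum("nc,ck->nk", [asm_embedding, text_embeddings.T])
    probs = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()
    return list(zip(prompts, probs))

# Hypothetical usage for malware classification (file name and prompts are examples):
# with open("./CaseStudy/malware.json") as fp:
#     malware_asm = json.load(fp)
# for prompt, prob in zero_shot_classify(malware_asm, ["This is a ransomware-related function",
#                                                      "This is a benign function"]):
#     print(f"Probability: {prob*100:.3f}%, Text: {prompt}")
```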
## Features
CLAP (Contrastive Language-Assembly Pre-training) is a framework that learns binary code representations through natural language supervision. By aligning binary code with natural language explanations, it improves analysis performance in few-shot and zero-shot scenarios. Built on a dataset engine that automatically generates 195 million pairs of code snippets and their descriptions, CLAP achieves exceptional transferability in binary code analysis.

## News
- [2024/2/29] CLAP is available on the Hugging Face Model Hub ([clap-asm](https://huggingface.co/hustcw/clap-asm) and [clap-text](https://huggingface.co/hustcw/clap-text)).
- [2024/2/28] CLAP is now on arXiv.
## License
This project is licensed under the MIT license.
## Citation
If this work is helpful for your research, please consider giving the repository a star and citing our work.
```bibtex
@misc{wang2024clap,
      title={CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision},
      author={Hao Wang and Zeyu Gao and Chao Zhang and Zihan Sha and Mingyang Sun and Yuchen Zhou and Wenyu Zhu and Wenju Sun and Han Qiu and Xi Xiao},
      year={2024},
      eprint={2402.16928},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
```