# CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision
CLAP is a framework that learns binary code representations through natural language supervision. By aligning binary code with natural-language explanations, it improves analysis performance in few-shot and zero-shot scenarios. Backed by a dataset engine that generates 195 million pairs of code snippets and descriptions, CLAP provides highly transferable representations for binary code analysis.
## Quick Start
This guide walks you through setting up the CLAP model and using it, without further training, for tasks such as fine-grained classification of sorting algorithms, malware, and cryptographic algorithms.
### Requirements
- Python 3.6 or higher
- [PyTorch](https://pytorch.org/get-started/locally/)
- Transformers library
- A CUDA-enabled GPU is highly recommended for faster processing.
Make sure you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:
```bash
pip install transformers
```
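If you want to confirm that the environment is ready before downloading the models, a quick check such as the following can help (a minimal, illustrative sketch, not part of the CLAP codebase):

```python
# Environment sanity check (illustrative only).
import torch
import transformers

print(f"PyTorch version:      {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available:       {torch.cuda.is_available()}")  # a GPU is strongly recommended
```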
### Preparing Tokenizers and Models
Import the necessary libraries and initialize the tokenizers and models:
```python
import json   # used below to load the example assembly dataset
import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tokenizers and encoders for assembly code and natural-language text
asm_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)
```
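As a quick sanity check, you can embed a single prompt and inspect the output shape. This is a minimal sketch; the prompt text is arbitrary, and the `.last_hidden_state` access simply mirrors the classification example below:

```python
# Sanity check (illustrative): embed one prompt and print the embedding shape.
with torch.no_grad():
    demo_input = text_tokenizer(["This is a function related to bubble sort"],
                                padding=True, return_tensors="pt").to(device)
    demo_embedding = text_encoder(**demo_input).last_hidden_state
print(demo_embedding.shape)  # expected: (1, embedding_dim)
```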
## Usage Examples
### Basic Usage
Here is an example of fine-grained sorting algorithm classification (zero-shot):
- Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing assembly code snippets related to bubble sort:

```python
with open("./CaseStudy/bubblesort.json") as fp:
    asm = json.load(fp)
```
- Define your classification prompts:

```python
prompts = [
    "This is a function related to bubble sort",
    "This is a function related to selection sort",
    ...
]
```
- Encode the assembly code and the prompts, then perform zero-shot classification:

```python
# Embed the assembly snippet and the candidate text prompts
asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
asm_embedding = asm_encoder(**asm_input)
text_input = text_tokenizer(prompts, padding=True, return_tensors="pt").to(device)
text_embeddings = text_encoder(**text_input)

# Similarity between the assembly embedding and each prompt embedding,
# scaled by a temperature of 0.07 and converted to probabilities
logits = torch.einsum("nc,ck->nk", [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()

for i, prompt in enumerate(prompts):
    print(f"Probability: {preds[i]*100:.3f}%, Text: {prompt}")
```
You can repeat this process for other classification tasks, such as malware classification or cryptographic algorithm identification, by loading the corresponding dataset and defining the relevant natural-language prompts.
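Because the encode-and-compare steps are the same for every task, it can be convenient to wrap them in a small helper. The sketch below is illustrative: the function name `zero_shot_classify`, the `./CaseStudy/malware.json` path, and the malware prompts are assumptions, not part of the released code.

```python
def zero_shot_classify(asm, prompts):
    """Return (prompt, probability) pairs for one assembly snippet (illustrative helper)."""
    asm_input = asm_tokenizer([asm], padding=True, return_tensors="pt").to(device)
    text_input = text_tokenizer(prompts, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        asm_embedding = asm_encoder(**asm_input).last_hidden_state
        text_embeddings = text_encoder(**text_input).last_hidden_state
    logits = torch.einsum("nc,ck->nk", [asm_embedding, text_embeddings.T])
    probs = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()
    return list(zip(prompts, probs))

# Hypothetical usage for malware classification (file name and prompts are examples):
# with open("./CaseStudy/malware.json") as fp:
#     malware_asm = json.load(fp)
# for prompt, prob in zero_shot_classify(malware_asm, ["This is a ransomware-related function",
#                                                      "This is a benign function"]):
#     print(f"Probability: {prob*100:.3f}%, Text: {prompt}")
```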
## Features
CLAP (Contrastive Language-Assembly Pre-training) is a framework that learns binary code representations through natural language supervision. By aligning binary code with natural language explanations, it improves analysis performance in few-shot and zero-shot scenarios. Built on a dataset engine that automatically generates 195 million pairs of code snippets and their descriptions, CLAP achieves exceptional transferability in binary code analysis.

## News
- [2024/2/29] CLAP is available on the Hugging Face Model Hub ([clap-asm](https://huggingface.co/hustcw/clap-asm) and [clap-text](https://huggingface.co/hustcw/clap-text)).
- [2024/2/28] CLAP is now on arXiv.
## License
This project is licensed under the MIT license.
## Citation
If this work is helpful for your research, please consider giving the repository a star and citing our work.
```bibtex
@misc{wang2024clap,
      title={CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision},
      author={Hao Wang and Zeyu Gao and Chao Zhang and Zihan Sha and Mingyang Sun and Yuchen Zhou and Wenyu Zhu and Wenju Sun and Han Qiu and Xi Xiao},
      year={2024},
      eprint={2402.16928},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
```