# Nova: Generative Language Model for Assembly Code
Nova is a generative large language model for assembly code. It uses a hierarchical attention mechanism and contrastive learning objectives to address the unique challenges of assembly code, and shows promising abilities on assembly generation and understanding tasks.
## Quick Start

### Environment Setup

You can set up the environment using conda or a Docker image.
#### Using Conda

```shell
conda create -n nova python=3.10
conda activate nova
pip install -r requirements.txt
```
#### Using Docker

```shell
docker pull jiang719/nova
docker run --gpus all -it jiang719/nova
```
## Features
- Hierarchical Attention Mechanism: Builds attention summaries to capture the semantics of assembly code more effectively.
- Contrastive Learning Objectives: Trains LLMs to learn assembly optimization.
- Performance Superiority: Outperforms existing techniques on binary code decompilation and similarity detection tasks.
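The hierarchical attention idea can be illustrated with a toy attention mask (an illustrative sketch only, not Nova's implementation; the token layout and the trailing per-instruction summary slot are assumptions made for this example): ordinary tokens attend causally within their own instruction, while each instruction's summary token aggregates that instruction and attends to earlier summaries, capturing instruction-level semantics.

```python
import numpy as np

def hierarchical_mask(instr_lengths):
    """Toy hierarchical attention mask for a sequence of instructions.

    Layout: each instruction's tokens are followed by one summary slot.
    - Regular tokens attend causally only within their own instruction.
    - A summary token attends to its instruction's tokens, to all earlier
      summary tokens, and to itself.
    Returns a boolean matrix where mask[q, k] means query q may attend to key k.
    """
    total = sum(n + 1 for n in instr_lengths)  # +1 summary slot per instruction
    mask = np.zeros((total, total), dtype=bool)
    summary_positions = []
    start = 0
    for n in instr_lengths:
        end = start + n          # tokens occupy [start, end)
        s = end                  # summary slot right after the tokens
        for q in range(start, end):
            mask[q, start:q + 1] = True      # causal, within-instruction only
        mask[s, start:end] = True            # summary sees its own tokens
        for prev in summary_positions:
            mask[s, prev] = True             # and all earlier summaries
        mask[s, s] = True
        summary_positions.append(s)
        start = s + 1
    return mask

# Two instructions with 2 and 3 tokens: indices 0-1 tokens, 2 summary,
# 3-5 tokens, 6 summary.
m = hierarchical_mask([2, 3])
```

Cross-instruction token attention is blocked (e.g. `m[3, 1]` is `False`), so instruction-level information flows only through the summary tokens.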
## Documentation

### Abstract
Binary code analysis is fundamental to crucial security-related tasks. Although large language models (LLMs) have significantly improved source-code tasks, they cannot be directly applied to assembly code due to its unique challenges: (1) low information density and (2) diverse optimizations. To address these issues, this work proposes a hierarchical attention mechanism and contrastive learning objectives. Based on these techniques, Nova, a generative LLM for assembly code, is developed. Nova outperforms existing techniques on binary code decompilation, with Pass@1 and Pass@10 up to 14.84%-21.58% higher, and on binary code similarity detection, with Recall@1 up to 6.17% higher, demonstrating its potential in assembly generation and understanding.
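To make the contrastive idea concrete, here is a minimal InfoNCE-style loss sketch (an assumption-laden illustration, not Nova's training code; Nova's actual objectives over optimization levels differ in detail): embeddings of matching pairs, such as the same function compiled at different optimization levels, are pulled together while mismatched pairs are pushed apart.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Toy InfoNCE loss: anchors[i] should match positives[i] against
    all other positives in the batch (in-batch negatives)."""
    # L2-normalize so similarity is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # numerically stable softmax cross-entropy; correct class is the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Aligned pairs give a near-zero loss; shuffled pairs give a large one.
loss_matched = info_nce(np.eye(4), np.eye(4))
loss_mismatched = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Trained this way, a model's embeddings become invariant to the nuisance variation chosen for the positive pairs, which is what enables strong binary code similarity detection across optimization levels.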
### Introduction of Nova
Nova is pre-trained with the language modeling objective starting from DeepSeek-Coder checkpoints. It uses the disassembly code from AnghaBench and C/C++ programs compiled from [The-Stack](https://huggingface.co/datasets/bigcode/the-stack).
This repository contains the foundation model of Nova, which has 6.7B parameters. Other models in this series are as follows:
- [Nova-1.3b](https://huggingface.co/lt-asset/nova-1.3b): A foundation model for binary code with 1.3B parameters.
- [Nova-1.3b-bcr](https://huggingface.co/lt-asset/nova-1.3b-bcr): The Nova-1.3b model further instruction-tuned for binary code recovery.
- [Nova-6.7b-bcr](https://huggingface.co/lt-asset/nova-6.7b-bcr): The Nova-6.7b model further instruction-tuned for binary code recovery.
## License

The project is licensed under the BSD 3-Clause Clear License.
## Citation
```bibtex
@misc{jiang2024nova,
      title={Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning},
      author={Nan Jiang and Chengxiao Wang and Kevin Liu and Xiangzhe Xu and Lin Tan and Xiangyu Zhang},
      year={2024},
      eprint={2311.13721},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2311.13721},
}
```