# Nova: Generative Language Model for Assembly Code
Nova is a generative large language model for assembly code. It uses a hierarchical attention mechanism and contrastive learning objectives to address the unique challenges of assembly code, and shows promising abilities on assembly generation and understanding tasks.
## Quick Start

### Environment Setup

You can set up the environment using conda or a Docker image.
#### Using Conda

```shell
conda create -n nova python=3.10
conda activate nova
pip install -r requirements.txt
```
#### Using Docker

```shell
docker pull jiang719/nova
docker run --gpus all -it jiang719/nova
```
## Features
- Hierarchical Attention Mechanism: Builds attention summaries to capture the semantics of assembly code more effectively.
- Contrastive Learning Objectives: Trains LLMs to learn assembly optimization.
- Performance Superiority: Outperforms existing techniques on binary code decompilation and similarity detection tasks.
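The hierarchical attention idea can be illustrated with a toy attention mask (an illustrative sketch only, not Nova's implementation; the token layout and the trailing per-instruction summary slot are assumptions made for this example): ordinary tokens attend causally within their own instruction, while each instruction's summary token aggregates that instruction and attends to earlier summaries, capturing instruction-level semantics.

```python
import numpy as np

def hierarchical_mask(instr_lengths):
    """Toy hierarchical attention mask for a sequence of instructions.

    Layout: each instruction's tokens are followed by one summary slot.
    - Regular tokens attend causally only within their own instruction.
    - A summary token attends to its instruction's tokens, to all earlier
      summary tokens, and to itself.
    Returns a boolean matrix where mask[q, k] means query q may attend to key k.
    """
    total = sum(n + 1 for n in instr_lengths)  # +1 summary slot per instruction
    mask = np.zeros((total, total), dtype=bool)
    summary_positions = []
    start = 0
    for n in instr_lengths:
        end = start + n          # tokens occupy [start, end)
        s = end                  # summary slot right after the tokens
        for q in range(start, end):
            mask[q, start:q + 1] = True      # causal, within-instruction only
        mask[s, start:end] = True            # summary sees its own tokens
        for prev in summary_positions:
            mask[s, prev] = True             # and all earlier summaries
        mask[s, s] = True
        summary_positions.append(s)
        start = s + 1
    return mask

# Two instructions with 2 and 3 tokens: indices 0-1 tokens, 2 summary,
# 3-5 tokens, 6 summary.
m = hierarchical_mask([2, 3])
```

Cross-instruction token attention is blocked (e.g. `m[3, 1]` is `False`), so instruction-level information flows only through the summary tokens.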
## Documentation

### Abstract
Binary code analysis is fundamental to crucial security-related tasks. Although large language models (LLMs) have significantly improved source-code tasks, they cannot be directly applied to assembly code due to its unique challenges: (1) low information density and (2) diverse optimizations. To address these issues, this work proposes a hierarchical attention mechanism and contrastive learning objectives. Based on these techniques, Nova, a generative LLM for assembly code, is developed. Nova outperforms existing techniques on binary code decompilation, with Pass@1 and Pass@10 up to 14.84%-21.58% higher, and on binary code similarity detection, with Recall@1 up to 6.17% higher, demonstrating its potential in assembly generation and understanding.
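To make the contrastive idea concrete, here is a minimal InfoNCE-style loss sketch (an assumption-laden illustration, not Nova's training code; Nova's actual objectives over optimization levels differ in detail): embeddings of matching pairs, such as the same function compiled at different optimization levels, are pulled together while mismatched pairs are pushed apart.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Toy InfoNCE loss: anchors[i] should match positives[i] against
    all other positives in the batch (in-batch negatives)."""
    # L2-normalize so similarity is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature
    # numerically stable softmax cross-entropy; correct class is the diagonal
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Aligned pairs give a near-zero loss; shuffled pairs give a large one.
loss_matched = info_nce(np.eye(4), np.eye(4))
loss_mismatched = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Trained this way, a model's embeddings become invariant to the nuisance variation chosen for the positive pairs, which is what enables strong binary code similarity detection across optimization levels.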
### Introduction of Nova
Nova is pre-trained with the language modeling objective starting from DeepSeek-Coder checkpoints. It uses the disassembly code from AnghaBench and C/C++ programs compiled from [The-Stack](https://huggingface.co/datasets/bigcode/the-stack).
This repository contains the foundation model of Nova, which has 6.7B parameters. Other models in this series are as follows:
- [Nova-1.3b](https://huggingface.co/lt-asset/nova-1.3b): A foundation model for binary code with 1.3B parameters.
- [Nova-1.3b-bcr](https://huggingface.co/lt-asset/nova-1.3b-bcr): The Nova-1.3b model further instruction-tuned for binary code recovery.
- [Nova-6.7b-bcr](https://huggingface.co/lt-asset/nova-6.7b-bcr): The Nova-6.7b model further instruction-tuned for binary code recovery.
## License

The project is licensed under the BSD 3-Clause Clear License.
## Citation
```bibtex
@misc{jiang2024nova,
      title={Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning},
      author={Nan Jiang and Chengxiao Wang and Kevin Liu and Xiangzhe Xu and Lin Tan and Xiangyu Zhang},
      year={2024},
      eprint={2311.13721},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2311.13721},
}
```