VeriGen
A fine-tuned model for Verilog code generation, based on CodeGen-multi-16B and trained on a Verilog code dataset.
Quick Start
This README provides a detailed introduction to the VeriGen model, including its summary, usage, limitations, training details, license, and citation information.
Features
- Fine-tuned from CodeGen-multi-16B on a Verilog code dataset.
- Capable of generating Verilog code snippets given some context.
- Can serve as a Verilog teaching assistant with appropriate prompts.
Installation
The original README gives no installation steps. The usage example below imports torch and transformers, so both packages must be available.
Usage Examples
Basic Usage
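```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Prompt: a Verilog comment describing the module to generate.
prompt = "//module half adder "
device = "cuda"

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
model_name = "shailja/fine-tuned-codegen-16B-Verilog"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Tokenize the prompt and sample a completion.
# do_sample=True is required for temperature and top_p to take effect.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
sample = model.generate(input_ids, max_length=128, do_sample=True, temperature=0.5, top_p=0.9)

# Cut the output at the first "endmodule", then append the keyword back.
print(tokenizer.decode(sample[0], truncate_before_pattern=[r"endmodule"]) + "endmodule")
```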
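The last line of the example cuts the decoded text at the first `endmodule` and then appends the keyword back. As a rough, tokenizer-independent sketch of that post-processing step, the same effect can be reproduced with the standard `re` module. The helper name `truncate_at_endmodule` is hypothetical and not part of `transformers`, and the input string is a hand-written stand-in, not an actual model generation.

```python
import re


def truncate_at_endmodule(text: str) -> str:
    """Keep everything before the first 'endmodule', then close the module.

    Hypothetical helper for illustration only; it approximates the
    decode-and-append step of the basic example.
    """
    match = re.search(r"endmodule", text)
    body = text[: match.start()] if match else text
    return body + "endmodule"


# Hand-written stand-in for decoded model output (not a real generation).
generated = (
    "//module half adder\n"
    "module half_adder(input a, input b, output s, output c);\n"
    "  assign s = a ^ b;\n"
    "  assign c = a & b;\n"
    "endmodule\n"
    "module leftover(input x);\n"
)

# Everything after the first 'endmodule' is dropped.
print(truncate_at_endmodule(generated))
```

Truncating this way keeps a single complete module and discards any trailing text the model produces before it reaches `max_length`.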
Advanced Usage
The model is not an instruction-following model. Adding a partial module header such as "module mux" after the text in the prompt lets it act as a capable Verilog teaching assistant.
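As one possible way to assemble such a prompt, the sketch below puts a natural-language description in Verilog comment lines and follows it with a partial module header. The function name `build_prompt` and the exact layout are assumptions made for illustration; this README only states that a partial header such as "module mux" should accompany the prompt text.

```python
def build_prompt(description: str, module_header: str) -> str:
    """Combine a comment-style description with a partial module header.

    Hypothetical helper: the layout (one '//' comment per description line,
    followed by the header) is an assumption, not a documented format.
    """
    comment_lines = [
        f"// {line.strip()}" for line in description.splitlines() if line.strip()
    ]
    return "\n".join(comment_lines + [module_header])


prompt = build_prompt("Implement a 2-to-1 multiplexer.", "module mux")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `prompt` in the basic example.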
Documentation
Model Summary
The VeriGen model is a 16B-parameter fine-tuned version of CodeGen-multi-16B, trained on a Verilog code dataset.
Use
Intended use
The model was trained on Verilog from GitHub and textbooks. It is not an instruction-following model, and commands like "Write a module that implements a 2-to-1 Mux." do not work well. However, adding a partial module header such as "module mux" after the text in the prompt turns it into a capable Verilog teaching assistant.
Feel free to share your generations in the Community tab!
Attribution & Other Requirements
The pretraining dataset of the model was not filtered for permissive licenses only, and the model can generate source code verbatim from that dataset. The license of such code might require attribution and/or impose other specific requirements that must be respected.
Limitations
The model has been trained on Verilog source code from open sources. The predominant natural language in the source code is English, although other languages are also present. The model can generate Verilog snippets when given some context, but the generated code is not guaranteed to work as intended: it can be inefficient and can contain bugs or exploits. See [the paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view) for an in-depth discussion of the model's limitations.
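Since generated code is not guaranteed to work, it is worth checking before use. The sketch below shows one minimal way to do that, under stated assumptions: `strip_comments` and `has_balanced_modules` are hypothetical helpers that perform only a cheap structural check, and `compiles_with_iverilog` assumes the Icarus Verilog compiler (`iverilog`) is installed and on `PATH`, returning `None` when it is not. Neither check establishes functional correctness; that still requires a testbench.

```python
import re
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def strip_comments(code: str) -> str:
    """Remove /* block */ and // line comments so keywords in comments are ignored."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)


def has_balanced_modules(code: str) -> bool:
    """Structural check: at least one 'module', and as many 'endmodule' keywords."""
    stripped = strip_comments(code)
    opens = len(re.findall(r"\bmodule\b", stripped))
    closes = len(re.findall(r"\bendmodule\b", stripped))
    return opens > 0 and opens == closes


def compiles_with_iverilog(code: str) -> Optional[bool]:
    """Return True/False from an iverilog compile, or None if iverilog is absent.

    Assumes Icarus Verilog is the available compiler; swap in another tool as needed.
    """
    if shutil.which("iverilog") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "generated.v"
        src.write_text(code)
        result = subprocess.run(
            ["iverilog", "-o", str(Path(tmp) / "generated.out"), str(src)],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


# Hand-written stand-in for a generation (not real model output).
candidate = "//module half adder\nmodule ha(input a, input b, output s);\n  assign s = a ^ b;\nendmodule"
print("balanced:", has_balanced_modules(candidate))
print("compiles:", compiles_with_iverilog(candidate))
```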
Training
Model
- Architecture: GPT-2 model with multi-query attention
- Pretraining steps: 150k
- Pretraining tokens: ~72B
- Precision: fp16
Hardware
- GPUs: 4 Tesla A100
- Training time: 15 days
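Two back-of-the-envelope figures follow from the numbers listed above and from the 16B parameter count in the model summary, taking all of them at face value: the average number of tokens consumed per step, and the memory needed just to hold the weights. These are rough estimates that ignore activations, optimizer state, and framework overhead; they are not figures reported by the authors.

```python
# Figures quoted in this README, taken at face value.
PRETRAINING_TOKENS = 72e9    # "~72B"
PRETRAINING_STEPS = 150_000  # "150k"
PARAMETERS = 16e9            # "16B" parameters
BYTES_PER_PARAM = {"fp16": 2, "fp32": 4}

# Average tokens processed per optimization step.
tokens_per_step = PRETRAINING_TOKENS / PRETRAINING_STEPS


def weight_memory_gb(parameters: float, precision: str) -> float:
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return parameters * BYTES_PER_PARAM[precision] / 1e9


print(f"tokens per step: ~{tokens_per_step:,.0f}")
for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(PARAMETERS, precision):.0f} GB for weights")
```

Under these assumptions the weights occupy about 32 GB in fp16 and about 64 GB in fp32, so loading in half precision (for example through the `torch_dtype` argument of `from_pretrained`) roughly halves the weight footprint.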
License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.
Citation
```bibtex
@misc{https://doi.org/10.48550/arxiv.2212.11140,
  doi = {10.48550/ARXIV.2212.11140},
  url = {https://arxiv.org/abs/2212.11140},
  author = {Thakur, Shailja and Ahmad, Baleegh and Fan, Zhenxing and Pearce, Hammond and Tan, Benjamin and Karri, Ramesh and Dolan-Gavitt, Brendan and Garg, Siddharth},
  title = {Benchmarking Large Language Models for Automated Verilog RTL Code Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```