CodeAstra-7b: Open Source State-of-the-Art Vulnerability Detection Model
CodeAstra-7b is a cutting-edge open-source model designed for detecting vulnerabilities in source code. It offers high-performance vulnerability detection across multiple programming languages, assisting developers, security researchers, and code auditors.
Features
- Multi-language Support: Detects vulnerabilities across a wide range of programming languages, including Go, Python, C, C++, Fortran, Ruby, Java, Kotlin, C#, PHP, Swift, JavaScript, and TypeScript.
- State-of-the-Art Performance: Achieves leading results among open models on vulnerability detection tasks.
- Custom Dataset: Trained on a proprietary dataset specifically curated for comprehensive vulnerability detection.
- Large-scale Training: Fine-tuned on NVIDIA A100 GPUs.
Usage Examples
Basic Usage
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the adapter configuration and the 4-bit quantized base model
peft_model_id = "rootxhacker/CodeAstra-7B"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    load_in_4bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Apply the CodeAstra LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, peft_model_id)

def get_completion(query, model, tokenizer):
    # Move inputs to the model's device (device_map='auto' may place it on GPU)
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

code_to_analyze = """
def user_input():
    name = input("Enter your name: ")
    print("Hello, " + name + "!")

user_input()
"""

query = f"Analyze this code for vulnerabilities and quality issues:\n{code_to_analyze}"
result = get_completion(query, model, tokenizer)
print(result)
```
Documentation
Model Description
CodeAstra-7b is a state-of-the-art language model fine-tuned for vulnerability detection in multiple programming languages. Based on the powerful Mistral-7B-Instruct-v0.2 model, it has been specifically trained to identify potential security vulnerabilities across a wide range of popular programming languages.
Model Details

| Property | Details |
|----------|---------|
| Model Type | Fine-tuned language model for vulnerability detection |
| Training Data | Proprietary dataset curated for comprehensive vulnerability detection |

Performance Comparison

CodeAstra-7b outperforms all open models evaluated, trailing only GPT-4o:

| Model | Accuracy (%) |
|-------|--------------|
| GPT-4o | 88.78 |
| CodeAstra-7b | 83.00 |
| codebert-base-finetuned-detect-insecure-code | 65.30 |
| CodeBERT | 62.08 |
| RoBERTa | 61.05 |
| TextCNN | 60.69 |
| BiLSTM | 59.37 |
As shown in the table, CodeAstra-7b achieves 83% accuracy, substantially surpassing the other open models evaluated; among the listed models, only GPT-4o scores higher.
Intended Use
CodeAstra-7b is designed to assist developers, security researchers, and code auditors in identifying potential security vulnerabilities in source code. It can be integrated into development workflows, code review processes, or used as a standalone tool for code analysis.
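As one illustration of workflow integration, a review script might collect the source files in a repository and feed each one to the model. The helpers below are a hypothetical sketch, not part of the released tooling; the `analyze` parameter stands in for a wrapper around the `get_completion` function from the usage example, so the scan logic itself can be exercised without loading the 7B model:

```python
import os

# File extensions for a subset of the languages CodeAstra-7b supports
SOURCE_EXTENSIONS = {".py", ".go", ".c", ".cpp", ".java", ".js", ".ts", ".php", ".rb"}

def collect_source_files(root, extensions=SOURCE_EXTENSIONS):
    """Walk a directory tree and return sorted paths of source files to review."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1] in extensions:
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

def build_review_query(code):
    """Build the analysis prompt in the same shape as the usage example."""
    return f"Analyze this code for vulnerabilities and quality issues:\n{code}"

def review_tree(root, analyze):
    """Run `analyze` (e.g. a wrapper around get_completion) on every file."""
    reports = {}
    for path in collect_source_files(root):
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            reports[path] = analyze(build_review_query(fh.read()))
    return reports
```

Passing the analysis function as a parameter keeps the scan logic decoupled from model loading, which also makes it straightforward to unit-test.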
Multiple Vulnerability Scenarios
While CodeAstra-7b excels at finding security issues in most cases, its performance may vary when multiple vulnerabilities are present in the same code snippet. Where two or three vulnerabilities coexist, the model may not identify all of them correctly. Users should be aware of this limitation and use the model as part of a broader, multi-faceted security review process.
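One way to build such a multi-faceted process is to pair the model with a cheap rule-based pass, so obvious dangerous patterns are flagged even if the model misses one of several coexisting issues. The pattern list below is a tiny illustrative sample written for this sketch, not an exhaustive or official scanner:

```python
import re

# A few well-known dangerous Python constructs (illustrative only)
RISKY_PATTERNS = {
    "use of eval": re.compile(r"\beval\s*\("),
    "use of exec": re.compile(r"\bexec\s*\("),
    "shell=True subprocess": re.compile(r"shell\s*=\s*True"),
    "hard-coded password": re.compile(r"password\s*=\s*['\"]"),
}

def rule_based_findings(code):
    """Return the names of risky patterns present in a code snippet."""
    return sorted(name for name, pat in RISKY_PATTERNS.items() if pat.search(code))

def combined_review(code, model_findings):
    """Union model-reported findings with rule-based hits, deduplicated."""
    return sorted(set(model_findings) | set(rule_based_findings(code)))
```

The union means a finding surfaces if either pass catches it, trading some false positives for better recall when several vulnerabilities coexist.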
Limitations
- The model may not catch all vulnerabilities or code quality issues and should be used as part of a comprehensive security and code review strategy.
- In cases where multiple vulnerabilities (two or three) are present in the same code snippet, the model might not identify all of them correctly.
- False positives are possible, and results should be verified by human experts.
- The model's performance may vary depending on the complexity and context of the code being analyzed.
- The model's performance also depends on the length of the input code snippet.
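Because performance depends on snippet length, one possible mitigation (a sketch, not part of the released tooling) is to split long files into overlapping line windows and analyze each window separately:

```python
def split_into_windows(code, max_lines=60, overlap=10):
    """Split source code into overlapping windows of at most `max_lines` lines.

    The overlap keeps context that spans a window boundary visible in both
    neighboring chunks, at the cost of possibly reporting an issue twice.
    """
    lines = code.splitlines()
    if len(lines) <= max_lines:
        return ["\n".join(lines)]
    step = max_lines - overlap
    windows = []
    for start in range(0, len(lines), step):
        windows.append("\n".join(lines[start:start + max_lines]))
        if start + max_lines >= len(lines):
            break
    return windows
```

Each window can then be sent through `get_completion` from the usage example; findings reported in the overlap region should be deduplicated before presenting results.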
Test Apparatus
CodeAstra-7b was evaluated on code snippets drawn from datasets such as CVEfixes, the YesWeHack vulnerable-code repository, synthetically generated code produced by LLMs, and the OWASP Juice Shop source code. The same vulnerable scripts were also run through models such as GPT-4 and GPT-4o for comparison.
License
CodeAstra-7b is released under the Apache License 2.0.
Copyright 2024 Harish Santhanalakshmi Ganesan
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Acknowledgements
We would like to thank the Mistral AI team for their excellent base model, which served as the foundation for CodeAstra-7b.