🚀 UniXcoder for Code Vulnerability Detection
This model, based on Microsoft's UniXcoder, is fine-tuned to detect vulnerabilities in C/C++ code. It is trained on the DetectVul/devign dataset and classifies code snippets as either safe or vulnerable, reaching 68.34% accuracy.
✨ Features
- Optimized for C/C++ code vulnerability detection.
- Trained on the DetectVul/devign dataset.
- Achieves 68.34% accuracy and an F1 score of 62.14%.
- Classifies code snippets as safe (0) or vulnerable (1).
📦 Installation
No dedicated installation steps are provided. The usage example below depends on the transformers and torch Python packages.
💻 Usage Examples
Basic Usage
Use the following code to load the model and run inference on a sample code snippet:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Tokenizer comes from the base model; weights come from the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/unixcoder-code-vulnerability-detector")

code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Truncate or pad the input to 512 tokens
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely label
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
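The example above takes the argmax of the softmax output, which for two classes amounts to a 0.5 cutoff on the vulnerable-class probability (ties aside). As a minimal sketch, assuming you want to tune that cutoff yourself, the same decision can be written in plain Python. The helper names softmax and label_from_logits and the threshold parameter are hypothetical and are not part of the model or the transformers API.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits, threshold=0.5):
    """Return 1 (vulnerable) if P(class 1) >= threshold, else 0 (safe).

    `threshold` is a hypothetical tuning knob, not a model setting.
    """
    p_vulnerable = softmax(logits)[1]
    return 1 if p_vulnerable >= threshold else 0

# Illustrative logits only; real values would come from outputs.logits[0].tolist()
example_logits = [0.2, 0.9]
print(label_from_logits(example_logits))        # default 0.5 cutoff
print(label_from_logits(example_logits, 0.8))   # stricter cutoff
```

A lower threshold would be expected to trade precision for recall. Any chosen value should be validated on held-out data before it is relied on.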
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | mahdin70 (Mukit Mahdin) |
| Finetuned from | microsoft/unixcoder-base |
| Language(s) | English (for code comments & metadata), C/C++ |
| License | MIT |
| Task | Code vulnerability detection |
| Dataset Used | DetectVul/devign |
| Architecture | Transformer-based sequence classification |
Uses
Direct Use
This model can be used for static code analysis, security audits, and automatic vulnerability detection in software repositories. It benefits:
- Developers: To analyze their code for potential security flaws.
- Security Teams: To scan repositories for known vulnerabilities.
- Researchers: To study vulnerability detection in AI-powered systems.
Downstream Use
This model can be integrated into IDE plugins, CI/CD pipelines, or security scanners to provide real-time vulnerability detection.
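One way such an integration could look is a small gate that runs a classifier over a set of functions and fails the build when any is flagged. The sketch below is an illustration under assumptions: scan_snippets, exit_code, the classify callable, and the snippet names are all hypothetical, and toy_classify merely stands in for a wrapper around the model inference shown in Basic Usage.

```python
from typing import Callable, Dict, List

def scan_snippets(snippets: Dict[str, str],
                  classify: Callable[[str], int]) -> List[str]:
    """Return the names of snippets the classifier labels 1 (vulnerable)."""
    return [name for name, code in snippets.items() if classify(code) == 1]

def exit_code(flagged: List[str]) -> int:
    """Map scan results to a process exit code: non-zero fails a CI job."""
    return 1 if flagged else 0

# Stand-in classifier for demonstration only; a real pipeline would call
# the fine-tuned model here instead of matching on a function name.
def toy_classify(code: str) -> int:
    return 1 if "strcpy(" in code else 0

snippets = {
    "process": "void process(char *input) { char buffer[50]; strcpy(buffer, input); }",
    "add": "int add(int a, int b) { return a + b; }",
}
flagged = scan_snippets(snippets, toy_classify)
print(flagged, exit_code(flagged))
```

Passing the classifier in as a callable keeps the gate logic testable without loading the model. Given the false positives noted elsewhere in this card, a real pipeline might prefer to report findings for review instead of failing the build outright.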
Out-of-Scope Use
- The model is not meant to replace human security experts.
- It may not generalize well to languages other than C/C++.
- False positives/negatives may occur due to dataset limitations.
Bias, Risks, and Limitations
- False Positives & False Negatives: The model may flag safe code as vulnerable or miss actual vulnerabilities.
- Limited to C/C++: The model was trained on a dataset primarily composed of C and C++ code. It may not perform well on other languages.
- Dataset Bias: The training data may not cover all possible vulnerabilities.
Recommendations
💡 Usage Tip
Users should not rely solely on the model for security assessments. Instead, it should be used alongside manual code review and static analysis tools.
Training Details
Training Data
- Dataset: DetectVul/devign
- Classes: 0 (Safe), 1 (Vulnerable)
- Size: 17483 code snippets
Training Procedure
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss
- Batch Size: 8
- Learning Rate: 2e-5
- Epochs: 3
- Hardware Used: 2x T4 GPU
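For a rough sense of training length, the figures above imply a certain number of optimizer steps. This is back-of-the-envelope arithmetic under two assumptions not stated in the source: that all 17483 snippets were used for training, and that 8 is the effective global batch size and not a per-GPU value.

```python
import math

num_snippets = 17483   # from Training Data (assumed to be all training examples)
batch_size = 8         # assumed to be the effective global batch size
epochs = 3

steps_per_epoch = math.ceil(num_snippets / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2186 6558
```

If 8 is instead a per-GPU batch size across the two T4s, the step counts would be roughly halved.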
Metrics
| Metric | Score |
|---|---|
| Train Loss | 0.4835 |
| Evaluation Loss | 0.6855 |
| Accuracy | 68.34% |
| F1 Score | 62.14% |
| Precision | 69.18% |
| Recall | 56.40% |
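The F1 score is the harmonic mean of precision and recall, so the reported values can be cross-checked against each other. A short sketch follows; f1_score here is a local helper defined for this check, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision = 0.6918  # reported precision
recall = 0.5640     # reported recall
f1 = f1_score(precision, recall)
print(f"{f1:.2%}")  # 62.14%
```

The result matches the reported F1 score. The gap between precision and recall indicates that the model misses more vulnerable snippets than it wrongly flags safe ones.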
Environmental Impact
| Factor | Value |
|---|---|
| GPU Used | 2x T4 GPU |
| Training Time | ~1 hour |
📄 License
This model is released under the MIT license.