Unixcoder Code Vulnerability Detector - Open-source C/C++ code vulnerability detection with 68.34% accuracy

Unixcoder Code Vulnerability Detector

Developed by mahdin70

A C/C++ code vulnerability detection model fine-tuned from Microsoft's UniXcoder, with an accuracy of 68.34% and an F1 score of 62.14%.

Text Classification

Transformers

English

#C/C++ Vulnerability Detection #Static Code Analysis #Security Audit

Downloads 416

Release Time: 3/1/2025

Model Overview

Designed to detect vulnerabilities in C/C++ code by classifying code snippets as safe or vulnerable.

Model Features

Optimized for C/C++

Fine-tuned from UniXcoder specifically for C/C++ code vulnerability detection.

Static Code Analysis

Can be used for static code analysis and security audits.

Friendly Integration

Can be integrated into IDE plugins or CI/CD pipelines to provide real-time detection.

Model Capabilities

C/C++ Code Vulnerability Detection

Code Security Classification

Static Code Analysis

Use Cases

Development Security

Developer Self-check

Developers analyze their own code for potential security vulnerabilities.

Identify potentially vulnerable code

Security Scan

Security teams scan repositories to find known vulnerabilities.

Discover security risks in the codebase

Research Application

AI Vulnerability Detection Research

Researchers study AI-powered vulnerability detection methods.

🚀 UniXcoder for Code Vulnerability Detection

This model, fine-tuned from Microsoft's UniXcoder, detects vulnerabilities in C/C++ code. It was trained on the DetectVul/devign dataset and classifies code snippets as either safe or vulnerable, with a reported accuracy of 68.34%.

✨ Features

Optimized for C/C++ code vulnerability detection.
Trained on the DetectVul/devign dataset.
Achieves 68.34% accuracy and an F1 score of 62.14%.
Classifies code snippets as safe (0) or vulnerable (1).

📦 Installation

The usage example below requires the `transformers` and `torch` Python packages.

💻 Usage Examples

Basic Usage

Use the following code to load the model and run inference on a sample code snippet:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base UniXcoder tokenizer and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/unixcoder-code-vulnerability-detector")

# Sample code snippet
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
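
The example above takes the `argmax` over the two classes, which amounts to a 0.5 cut-off on the vulnerable-class probability. Because the reported recall (56.40%) is lower than the reported precision (69.18%), a caller might prefer to look at the probability itself and choose a different cut-off. The following is a minimal, dependency-free sketch of that idea. The function names and the `threshold` parameter are assumptions for illustration, not part of the model's API, and the logits are made-up example values rather than real model outputs.

```python
import math

def vulnerable_probability(logits):
    """Softmax over [safe_logit, vulnerable_logit]; returns P(vulnerable)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)

def is_vulnerable(logits, threshold=0.5):
    """Flag the snippet when P(vulnerable) reaches the chosen threshold."""
    return vulnerable_probability(logits) >= threshold

# Hypothetical logits in the order [safe, vulnerable]
example_logits = [0.3, -0.1]
print(f"P(vulnerable) = {vulnerable_probability(example_logits):.3f}")  # 0.401
print(is_vulnerable(example_logits))                 # False at the default 0.5
print(is_vulnerable(example_logits, threshold=0.4))  # True at a lower cut-off
```

With the usage example above, the list of logits would come from `outputs.logits[0].tolist()`. Lowering the threshold trades precision for recall, so any value other than 0.5 should be validated on held-out data.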

📚 Documentation

Model Details

Property	Details
Developed by	mahdin70 (Mukit Mahdin)
Finetuned from	`microsoft/unixcoder-base`
Language(s)	English (for code comments & metadata), C/C++
License	MIT
Task	Code vulnerability detection
Dataset Used	`DetectVul/devign`
Architecture	Transformer-based sequence classification

Uses

Direct Use

This model can be used for static code analysis, security audits, and automatic vulnerability detection in software repositories. It benefits:

Developers: To analyze their code for potential security flaws.
Security Teams: To scan repositories for known vulnerabilities.
Researchers: To study vulnerability detection in AI-powered systems.

Downstream Use

This model can be integrated into IDE plugins, CI/CD pipelines, or security scanners to provide real-time vulnerability detection.
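
As a rough illustration of CI/CD integration, the sketch below walks a source tree, passes each C/C++ file to a classifier callable, and collects the files that were flagged. Everything in it is an assumption for illustration: `scan_tree`, `exit_code`, the `classify` callable (a hypothetical wrapper around the model that returns 1 for vulnerable and 0 for safe), and the extension list do not come from the model card. The model was trained on code snippets, so feeding it whole files is itself a simplification.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of extensions to scan; adjust for the project at hand.
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".hpp"}

def scan_tree(root: str, classify: Callable[[str], int]) -> List[str]:
    """Return paths of C/C++ files under `root` that `classify` labels 1."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in C_EXTENSIONS:
            source = path.read_text(encoding="utf-8", errors="replace")
            if classify(source) == 1:
                flagged.append(str(path))
    return flagged

def exit_code(flagged: List[str]) -> int:
    """Non-zero exit status when anything was flagged, so a CI job can fail."""
    return 1 if flagged else 0
```

A pipeline step could end with `sys.exit(exit_code(scan_tree(".", classify)))`. Given the false-positive and false-negative rates discussed in this document, treating flagged files as review prompts rather than hard failures may be the more practical choice.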

Out-of-Scope Use

The model is not meant to replace human security experts.
It may not generalize well to languages other than C/C++.
False positives/negatives may occur due to dataset limitations.

Bias, Risks, and Limitations

False Positives & False Negatives: The model may flag safe code as vulnerable or miss actual vulnerabilities.
Limited to C/C++: The model was trained on a dataset primarily composed of C and C++ code. It may not perform well on other languages.
Dataset Bias: The training data may not cover all possible vulnerabilities.

Recommendations

💡 Usage Tip

Users should not rely solely on the model for security assessments. Instead, it should be used alongside manual code review and static analysis tools.

Training Details

Training Data

Dataset: DetectVul/devign
Classes: 0 (Safe), 1 (Vulnerable)
Size: 17483 code snippets

Training Procedure

Optimizer: AdamW
Loss Function: Cross-Entropy Loss
Batch Size: 8
Learning Rate: 2e-5
Epochs: 3
Hardware Used: 2x T4 GPU
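
For a sense of scale, the hyperparameters above imply a fairly short run. The arithmetic below is a back-of-the-envelope sketch under two assumptions that the model card does not confirm: that all 17483 snippets were used for training (no held-out split subtracted), and that 8 is the effective batch size across both GPUs.

```python
import math

dataset_size = 17483  # from the Training Data section; assumed to be all training data
batch_size = 8        # assumed to be the effective (global) batch size
epochs = 3

steps_per_epoch = math.ceil(dataset_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 2186 6558
```

If 8 is instead the per-device batch size on two GPUs, the step counts would be roughly halved.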

Metrics

Metric	Score
Train Loss	0.4835
Evaluation Loss	0.6855
Accuracy	68.34%
F1 Score	62.14%
Precision	69.18%
Recall	56.40%
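
The F1 score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The short check below recomputes it from the precision and recall reported in the table to confirm that the three figures are mutually consistent. It uses only values from the table.

```python
precision = 0.6918  # reported precision
recall = 0.5640     # reported recall

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # F1 = 62.14%
```

A recall of 56.40% also means that 43.60% of the vulnerable snippets in the evaluation data were not flagged, which is worth keeping in mind when reading the limitations above.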

Environmental Impact

Factor	Value
GPU Used	2x T4 GPU
Training Time	~1 hour

📄 License

This model is released under the MIT license.
