SecureBERT_Plus: Open-Source Language Model for Parsing Cybersecurity Text

SecureBERT Plus

Developed by ehsanaghaei

SecureBERT+ is an enhanced version of SecureBERT that specializes in parsing and representing cybersecurity text. It was trained on a corpus eight times larger than its predecessor's and improves masked language modeling (MLM) performance by 9% on average.

Large Language Model

Transformers

English

#Cybersecurity Text Understanding #Malware Analysis #System Call Parsing

Downloads 682

Release Time: 8/9/2023

Model Overview

SecureBERT+ is a domain-specific language model based on the RoBERTa architecture. It is trained and fine-tuned on a large corpus of cybersecurity text, with a focus on language understanding and representation learning in the cybersecurity domain.

Model Features

Enhanced Performance

The training corpus is eight times larger than that of the previous version, yielding an average 9% improvement on MLM tasks.

Cybersecurity-Specific

The model is designed specifically for the cybersecurity domain, so it understands and represents cybersecurity text better than a general-purpose model.

Large-Scale Training

Trained on the enlarged corpus using 8 A100 GPUs.

Model Capabilities

Cybersecurity Text Understanding

Masked Language Modeling

Cybersecurity Domain Language Representation

Use Cases

Cybersecurity Analysis

Native API Function Analysis

Analyze native API functions and their usage in user-mode applications.

Malware Distribution Analysis

Identify and analyze malware distribution tools (e.g., GuLoader) and the types of malware they distribute.

Secure DLL Search Patterns

Analyze the implementation of secure DLL search patterns and their impact on system security.

🚀 SecureBERT+

SecureBERT+ is an enhanced version of the SecureBERT model. It is trained on a corpus eight times larger than its predecessor, using 8xA100 GPUs. This version shows an average 9% improvement in the Masked Language Model (MLM) task, significantly advancing language understanding and representation learning in the cybersecurity domain.

SecureBERT is a domain-specific language model based on RoBERTa. It is trained on a large amount of cybersecurity data and fine-tuned to understand and represent cybersecurity text.

📚 Documentation

Dataset

[Image: dataset figure, not available in this text version]

Load Model

SecureBERT+ is available on the Hugging Face Hub and can be loaded with the transformers library:

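from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Tokenize a sentence and run it through the encoder without tracking gradients.
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, shape (batch_size, sequence_length, hidden_size).
last_hidden_states = outputs.last_hidden_state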

Fill Mask (MLM)

Use the following code to predict the masked word in a given sentence. Mark each word to predict with the tokenizer's mask token, <mask>:

# Install the dependencies first if needed:
#   pip install transformers torch tokenizers

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")


def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k predicted words for each <mask> token in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every mask token in the encoded sentence.
    is_mask = token_ids.squeeze(0) == tokenizer.mask_token_id
    masked_pos = is_mask.nonzero().flatten().tolist()

    with torch.no_grad():
        output = model(token_ids)

    # output[0] holds the MLM logits: (sequence_length, vocab_size) after squeezing.
    logits = output[0].squeeze(0)

    predictions = []
    for index, mask_index in enumerate(masked_pos):
        # Vocabulary ids with the highest scores at this mask position.
        idx = torch.topk(logits[mask_index], k=topk, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip().replace(" ", "") for i in idx]
        predictions.append(words)
        if print_results:
            print(f"Mask {index + 1} predictions: {words}")

    # One list of candidate words per mask, in order of appearance.
    return predictions


# Interactive loop; stop it with Ctrl+C.
while True:
    sent = input("Text here: \t")
    print("SecureBERT: ")
    predict_mask(sent, tokenizer, model)
    print("===========================\n")
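The fill-mask code above ranks candidate words by their raw scores (logits) at the mask position. To read those scores as probabilities, a softmax can be applied before ranking. The following is a minimal sketch in plain Python so the arithmetic is visible; the vocabulary and logits are toy values invented for illustration, and the helper names `softmax` and `top_k_with_probs` are hypothetical. On real model output, `torch.softmax` followed by `torch.topk` performs the same steps on tensors.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_with_probs(logits, vocab, k=3):
    """Return the k most likely (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], probs[i]) for i in ranked[:k]]


# Toy vocabulary and logits, invented for illustration only.
vocab = ["malware", "payloads", "files", "banana"]
logits = [4.0, 3.0, 1.0, -2.0]

for token, prob in top_k_with_probs(logits, vocab, k=3):
    print(f"{token}: {prob:.3f}")
```

Because softmax preserves the ordering of the logits, the ranking matches a top-k over raw scores; only the reported numbers change.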

Other model variants

📄 License

The model is licensed under cc-by-nc-4.0.

📋 Widget Examples

Native API functions

Native API functions such as <mask>, may be directly invoked via system calls/syscalls, but these features are also often exposed to user-mode applications via interfaces and libraries.

Assigning the PPID of a new process

One way of explicitly assigning the PPID of a new process is via the <mask> API call, which supports a parameter that defines the PPID to use.

Enable Safe DLL Search Mode

Enable Safe DLL Search Mode to force search for system DLLs in directories with greater restrictions (e.g. %<mask>%) to be used before local directory DLLs (e.g. a user's home directory).

GuLoader is a file downloader

GuLoader is a file downloader that has been used since at least December 2019 to distribute a variety of <mask>, including NETWIRE, Agent Tesla, NanoCore, and FormBook.
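These widget sentences can also be tried locally. The sketch below uses the transformers `fill-mask` pipeline as a shorter alternative to the manual code in the Fill Mask section. It assumes each sentence contains exactly one `<mask>` token (RoBERTa's mask string), and the names `WIDGET_EXAMPLES` and `run_widget_examples` are hypothetical helpers, not part of the model's API.

```python
# Widget sentences; each is assumed to contain exactly one <mask> token.
WIDGET_EXAMPLES = {
    "Native API functions": (
        "Native API functions such as <mask>, may be directly invoked via "
        "system calls/syscalls, but these features are also often exposed "
        "to user-mode applications via interfaces and libraries."
    ),
    "Assigning the PPID of a new process": (
        "One way of explicitly assigning the PPID of a new process is via "
        "the <mask> API call, which supports a parameter that defines the "
        "PPID to use."
    ),
    "Enable Safe DLL Search Mode": (
        "Enable Safe DLL Search Mode to force search for system DLLs in "
        "directories with greater restrictions (e.g. %<mask>%) to be used "
        "before local directory DLLs (e.g. a user's home directory)."
    ),
    "GuLoader is a file downloader": (
        "GuLoader is a file downloader that has been used since at least "
        "December 2019 to distribute a variety of <mask>, including "
        "NETWIRE, Agent Tesla, NanoCore, and FormBook."
    ),
}


def run_widget_examples(model_name="ehsanaghaei/SecureBERT_Plus", top_k=5):
    """Print the top_k fill-mask predictions for every widget sentence."""
    # Imported lazily so the sentences can be inspected without transformers.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    for title, sentence in WIDGET_EXAMPLES.items():
        print(title)
        # With a single mask, the pipeline returns one list of candidate dicts.
        for pred in fill_mask(sentence, top_k=top_k):
            print(f"  {pred['token_str'].strip()}  ({pred['score']:.3f})")
```

Calling `run_widget_examples()` downloads the model on first use and prints the five highest-scoring candidates for each sentence.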

📖 Reference

@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
