SecureBERT_Plus: an open-source model for analyzing cybersecurity text, with markedly improved performance over its predecessor

SecureBERT Plus

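Developed by ehsanaghaei

SecureBERT+ is an enhanced version of SecureBERT. Its training corpus is 8 times larger than its predecessor's, it improves masked language modeling (MLM) performance by 9% on average, and it is specialized for analyzing and representing cybersecurity text.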

Large language model

Transformers

English #Cybersecurity text understanding #Malware analysis #System call analysis

Downloads: 682

Release date: 8/9/2023

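Model overview

SecureBERT+ is a domain-specific language model based on the RoBERTa architecture. It is trained on a large volume of cybersecurity text and focuses on language understanding and representation learning in the cybersecurity domain.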

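Model features

Improved performance

The training corpus is 8 times larger than its predecessor's, and MLM performance improves by 9% on average.

Built for cybersecurity

Designed specifically for the cybersecurity domain, so it understands and represents cybersecurity text better than a general-purpose model.

Large-scale training

Trained on 8 A100 GPUs, which substantially improved the model's capabilities.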

Model capabilities

Cybersecurity text understanding

Masked language modeling

Language representation for the cybersecurity domain

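Use cases

Cybersecurity analysis

Native API function analysis

Analyze native API functions and how user-mode applications use them.

Malware distribution analysis

Identify and analyze malware delivery tools such as GuLoader and the types of malware they distribute.

Safe DLL search mode

Analyze how safe DLL search mode is implemented and how it affects system security.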

🚀 SecureBERT+

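SecureBERT+ is an improved version of the SecureBERT model. It was trained on a corpus 8 times larger than its predecessor's, using the computing power of 8×A100 GPUs. SecureBERT+ improves performance on the masked language modeling (MLM) task by 9% on average. This is a significant step toward more advanced language understanding and representation learning in the cybersecurity domain.

SecureBERT is a domain-specific language model based on RoBERTa. It is trained on a large amount of cybersecurity data and fine-tuned to understand and represent cybersecurity text.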

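🚀 Quick Start

✨ Key Features

SecureBERT+ is specialized for language understanding and representation learning in the cybersecurity domain. Compared with its predecessor, it was trained on a larger dataset and performs better.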

📦 Installation

SecureBERT+ is loaded through the Hugging Face transformers library. Install the required packages first:

pip install transformers torch tokenizers

Once the packages are installed, the model and tokenizer can be loaded by name ("ehsanaghaei/SecureBERT_Plus"), as in the usage examples that follow.

💻 Usage Examples

Basic usage

# Load the model and tokenizer, then encode one sentence
from transformers import RobertaTokenizer, RobertaModel
import torch

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
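The basic example stops at `last_hidden_states`, which holds one vector per token. A common way to turn those into a single sentence vector is attention-masked mean pooling. The sketch below is a minimal illustration in plain Python and is not part of the original model card: the function name `mean_pool` and the toy values are hypothetical, and the nested lists stand in for the model's tensors.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (hypothetical helper)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Drop padding positions so they do not dilute the average
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy stand-in for one sentence: 3 token positions, hidden size 2
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is padding and is ignored
sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

With the real model, the inputs could be obtained as `outputs.last_hidden_state[0].detach().tolist()` and `inputs["attention_mask"][0].tolist()`. Mean pooling is one common choice, not something the model card prescribes.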

Advanced usage

# Predict masked words
# !pip install transformers
# !pip install torch
# !pip install tokenizers

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")


def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k predicted words for each <mask> token in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every mask token in the encoded sentence
    masked_pos = (token_ids.squeeze(0) == tokenizer.mask_token_id).nonzero().flatten().tolist()

    with torch.no_grad():
        output = model(token_ids)

    # output[0] holds the vocabulary logits; after squeeze: (sequence_length, vocab_size)
    logits = output[0].squeeze(0)

    predictions = []
    for index, mask_index in enumerate(masked_pos):
        top_ids = torch.topk(logits[mask_index], k=topk, dim=0)[1]
        words = [tokenizer.decode(i.item()).strip().replace(" ", "") for i in top_ids]
        predictions.append(words)
        if print_results:
            print(f"Mask {index + 1} predictions: {words}")

    # One list of candidate words per mask, in sentence order
    return predictions


# Interactive loop: enter an empty line to exit
while True:
    sent = input("Text here: \t")
    if not sent:
        break
    print("SecureBERT: ")
    predict_mask(sent, tokenizer, model)
    print("===========================\n")
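The masked-word example ranks the model's raw output scores at each mask position and keeps the top k. The sketch below shows that ranking step on its own, in plain Python, with a softmax added so that each candidate comes with a probability. The function name `top_k_tokens`, the four-word vocabulary, and the scores are hypothetical and chosen only for illustration. Softmax preserves the order of the scores, so the ranking matches what a top-k over raw logits would give.

```python
import math
from typing import Dict, List, Tuple


def top_k_tokens(logits: List[float], id_to_token: Dict[int, str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring tokens with their softmax probabilities."""
    if k <= 0:
        raise ValueError("k must be positive")
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids from most to least likely and keep the first k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Hypothetical toy vocabulary and scores for a single mask position
toy_vocab = {0: "malware", 1: "firewall", 2: "banana", 3: "exploit"}
toy_logits = [2.0, 0.5, -3.0, 1.0]

for token, prob in top_k_tokens(toy_logits, toy_vocab, k=2):
    print(f"{token}: {prob:.3f}")
```

Reporting probabilities rather than bare ranks makes it easier to see when the model is unsure, for example when the top few candidates have nearly equal scores.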

📚 Documentation

Dataset

[Image: dataset figure, not reproduced here]

Loading the model

SecureBERT+ is hosted on the Hugging Face Hub and can be loaded with the transformers library.

Other model variants

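🔧 Technical Details

SecureBERT+ is a RoBERTa-based, domain-specific language model trained on a large amount of cybersecurity data. It is specialized for the masked language modeling (MLM) task. Compared with its predecessor, its training dataset is 8 times larger, and it was trained using the computing power of 8×A100 GPUs. As a result, MLM performance improves by 9% on average.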
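To make the MLM task concrete, the sketch below builds one training example using the standard BERT/RoBERTa masking recipe: each token is selected with probability 0.15, and a selected token is replaced by the mask token 80% of the time, by a random vocabulary token 10% of the time, and left unchanged 10% of the time. This is an assumption for illustration. The model card does not state the masking settings used for SecureBERT+, and the function name `mask_tokens` and the sample sentence are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (model inputs, labels); a label is None where no prediction is required."""
    rng = rng or random.Random()
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK_TOKEN)         # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)              # 10%: keep the token unchanged
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels


# Hypothetical sample sentence, already split into tokens
sample = ["the", "loader", "drops", "a", "malicious", "dll"]
masked_inputs, masked_labels = mask_tokens(sample, vocab=sample, rng=random.Random(0))
print(masked_inputs)
print(masked_labels)
```

The loss is computed only at positions whose label is not None, which is why an MLM score reflects how well the model recovers hidden words from their context.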

📄 License

This model is released under the CC BY-NC 4.0 license.

Reference

@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}