🚀 CmdCaliper-large
CmdCaliper, developed by CyCraft AI Lab, is the first embedding model designed specifically for command-line embedding. Evaluation results show that even the smallest CmdCaliper variant, with roughly 30 million parameters, outperforms state-of-the-art sentence embedding models with more than ten times as many parameters (335M) across a range of command-line-specific tasks. CmdCaliper is released in three sizes, offering flexible options for different hardware budgets.
🚀 Quick Start
CmdCaliper is purpose-built for command-line embedding and delivers strong performance out of the box. See the usage examples below to get started.
✨ Key Features
- Purpose-built: the first embedding model designed specifically for command-line embedding.
- Strong performance: even the small variant outperforms state-of-the-art models with more than ten times as many parameters on command-line-specific tasks.
- Multiple sizes: available as CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small to fit different hardware constraints (see the loading sketch after this list).
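A minimal sketch of loading whichever size fits your hardware. Only the base repository ID appears elsewhere in this card; the small and large IDs below are assumed by analogy and should be verified on the CyCraftAI Hugging Face page.

```python
from sentence_transformers import SentenceTransformer

# NOTE: the small/large repository IDs are assumptions inferred from the
# base checkpoint's naming; confirm them before use.
MODEL_IDS = {
    "small": "CyCraftAI/CmdCaliper-small",  # ~0.03B parameters
    "base": "CyCraftAI/CmdCaliper-base",    # ~0.11B parameters
    "large": "CyCraftAI/CmdCaliper-large",  # ~0.34B parameters
}

model = SentenceTransformer(MODEL_IDS["small"])  # pick a size for your hardware
```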
📊 Evaluation Metrics
| Method | Model Size (Params) | MRR@3 | MRR@10 | Top@3 | Top@10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| E5-small | Small (0.03B) | 81.59 | 82.60 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.10 |
| GTE-base | Base (0.11B) | 78.20 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
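The card does not spell out the evaluation protocol behind these numbers, but MRR@k and Top@k are standard retrieval metrics. A minimal sketch of how they are commonly computed, assuming each query has exactly one correct command and you know its 1-based retrieval rank:

```python
from typing import Sequence

def mrr_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top k.

    `ranks` holds the 1-based rank of the correct command for each query
    (use a large sentinel such as 10**9 when it was not retrieved).
    """
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)

def top_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct command appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Example: three queries whose correct commands ranked 1st, 4th, and 12th.
ranks = [1, 4, 12]
print(mrr_at_k(ranks, 3))   # (1/1 + 0 + 0) / 3 ≈ 0.333
print(top_at_k(ranks, 10))  # 2 of 3 queries hit within top 10 ≈ 0.667
```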
💻 Usage Examples
Basic Usage
HuggingFace Transformers
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then mean-pool over the sequence dimension.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep the Windows backslashes from being treated as escapes.
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize; inputs longer than 512 tokens are truncated.
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# L2-normalize so the dot product below equals cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Raw strings keep the Windows backslashes from being treated as escapes.
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, embedding_dim)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
⚠️ Limitations
⚠️ Important Note
This model focuses exclusively on Windows command lines. In addition, any input longer than 512 tokens will be truncated.
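If you want to know ahead of time whether a command line will be truncated, one option is to count its tokens with the model's tokenizer. A minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")

command = r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00'
n_tokens = len(tokenizer(command)["input_ids"])
if n_tokens > 512:
    # Everything past token 512 is dropped before embedding.
    print(f"Command is {n_tokens} tokens and will be truncated.")
```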
📄 Citation
```bibtex
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```