lsg-bart-large-4096: open-source long-text model for a range of long-text tasks

Lsg Bart Large 4096

Developed by ccdv

The LSG model is a long-sequence model adapted from BART-large. It uses local + sparse + global attention to process long texts efficiently.

Text generation

Transformers

English #long-text summarization #local + sparse + global attention #efficient sequence processing

Downloads: 15

Released: 3/2/2022

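Model overview

The model is optimized for encoder-decoder tasks and handles long input sequences efficiently. It is described as faster and more efficient than other long-sequence models such as Longformer (LED) and BigBird (Pegasus).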

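Model features

Efficient long-sequence processing

Uses the local + sparse + global (LSG) attention mechanism to make long-text processing more efficient

Adaptive sequence length

Can automatically pad sequences to a multiple of the block size, as the model requires

Multiple sparse selection modes

Offers 6 sparse selection modes (e.g. BOS pooling, LSH clustering) to suit different tasks

Compatible with the original architecture

Keeps the same number of parameters and layers as BART-large and shares the same tokenizer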

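Model capabilities

Long-text summarization

Sequence-to-sequence conversion

Efficient processing of inputs up to 4096 tokens

Text classification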

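Use cases

Text summarization

Automatic summarization of long documents

Generates summaries for very long texts such as research papers and long articles

Reported to be noticeably faster than conventional long-sequence models

Text processing

Long-text classification

Classifies very long documents

Aims to keep accuracy high while reducing memory consumption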

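🚀 LSG model

The LSG model is adapted from BART-large for encoder-decoder tasks, with no additional pretraining. It handles long sequences and is faster and more efficient than Longformer (LED) or BigBird (Pegasus) from the model hub. It relies on local + sparse + global (LSG) attention.

⚠️ Important

This model relies on a custom modeling file, so you need to pass trust_remote_code=True to use it. It requires Transformers >= 4.36.1. See #13467.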

🚀 Quick start

This model relies on a custom modeling file, so pass trust_remote_code=True when loading it.

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ccdv/lsg-bart-large-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-large-4096")

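✨ Main features

The model is adapted from BART-large for encoder-decoder tasks, with no additional pretraining. It uses the same number of parameters and layers and the same tokenizer.
It handles long sequences and is faster and more efficient than Longformer (LED) or BigBird (Pegasus), relying on local + sparse + global (LSG) attention.
The model requires the sequence length to be a multiple of the block size. It can pad sequences automatically when needed (adaptive=True in the config), but it is recommended to truncate inputs with the tokenizer (truncation=True) and, optionally, to pad to a multiple of the block size (pad_to_multiple_of=...).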
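
The length rule above comes down to simple arithmetic, sketched here. The helper name `padded_length` is hypothetical and not part of the model's code, and the default block size of 128 is taken from the parameter list later in this document.

```python
def padded_length(seq_len: int, block_size: int = 128) -> int:
    """Return the smallest multiple of block_size that is >= seq_len."""
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    # Ceiling division without floats, then scale back up to a token count.
    return -(-seq_len // block_size) * block_size


for n in (1, 128, 300, 4096):
    print(n, "->", padded_length(n))
```

With a Hugging Face tokenizer, the same effect is typically obtained by enabling padding and passing `pad_to_multiple_of` set to the block size.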

📦 Installation

This model relies on a custom modeling file, so pass trust_remote_code=True when loading it. It also requires Transformers >= 4.36.1.
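
A quick way to confirm the version requirement before loading the model is sketched below. The helpers `parse_release` and `meets_minimum` are hypothetical names, and the parsing is deliberately simplified: it ignores pre-release suffixes, so a dedicated version-parsing library would be more robust.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text: str) -> tuple:
    """Turn '4.36.1' or '4.37.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.36.1") -> bool:
    """True when the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed = version("transformers")
    print(installed, "ok" if meets_minimum(installed) else "too old")
except PackageNotFoundError:
    print("transformers is not installed")
```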

💻 Usage examples

Basic usage

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ccdv/lsg-bart-large-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-large-4096")

Advanced usage

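from transformers import AutoModel

# Override attention settings from the config at load time.
model = AutoModel.from_pretrained("ccdv/lsg-bart-large-4096",
    trust_remote_code=True,            # required by the custom modeling file
    num_global_tokens=16,              # number of global tokens
    block_size=64,                     # local block size
    sparse_block_size=64,              # sparse block size
    attention_probs_dropout_prob=0.0,  # dropout on the attention scores
    sparsity_factor=4,                 # sparsity factor
    sparsity_type="none",              # "none" means only local attention is used
    mask_first_token=True              # mask the first token (redundant with the first global token)
)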

Seq2Seq summarization example

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("ccdv/lsg-bart-large-4096", 
    trust_remote_code=True, 
    pass_global_tokens_to_decoder=True, # Pass encoder global tokens to decoder
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-large-4096")

SENTENCE = "This is a test sequence to test the model. " * 300
token_ids = tokenizer(
    SENTENCE, 
    return_tensors="pt", 
    #pad_to_multiple_of=... # Optional
    truncation=True
    )
output = model(**token_ids)
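
The forward pass above returns raw model outputs. To obtain text, one would normally call `generate` and decode the result. The sketch below wraps that in a hypothetical `summarize` helper: `generate` and `batch_decode` are standard Transformers methods, while the helper name and the `max_new_tokens` and `num_beams` values are illustrative assumptions. It also assumes a checkpoint that has been fine-tuned for summarization; whether the base checkpoint produces useful summaries is not stated here.

```python
def summarize(text, model, tokenizer, max_new_tokens=128, num_beams=4):
    """Generate a summary string for one document (illustrative helper)."""
    # Truncate to the tokenizer's maximum length, as recommended above.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # generate() runs the decoder autoregressively and returns token ids.
    summary_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
    )
    # Convert ids back to text and drop special tokens such as <s> and </s>.
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0]
```

With `model` and `tokenizer` loaded as in the example above, `summarize(SENTENCE, model, tokenizer)` would return a single string.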

Classification example

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-bart-large-4096", 
    trust_remote_code=True, 
    pass_global_tokens_to_decoder=True, # Pass encoder global tokens to decoder
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-large-4096")

SENTENCE = "This is a test sequence to test the model. " * 300
token_ids = tokenizer(
    SENTENCE, 
    return_tensors="pt", 
    padding="max_length", # Optional but recommended
    truncation=True # Optional but recommended
    )
output = model(**token_ids)

> SequenceClassifierOutput(loss=None, logits=tensor([[-0.3051, -0.1762]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
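
The logits printed above are unnormalized scores. A softmax turns them into class probabilities; the sketch below does this in plain Python with the two values from that output. This checkpoint is not described as fine-tuned for classification, so the numbers presumably illustrate the arithmetic only.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-0.3051, -0.1762])
predicted = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs], "-> class", predicted)
```

With PyTorch tensors, `torch.softmax(output.logits, dim=-1)` performs the same computation.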

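🔧 Technical details

Parameters

Various parameters can be changed, for example:

Number of global tokens (num_global_tokens=1)
Local block size (block_size=128)
Sparse block size (sparse_block_size=128)
Sparsity factor (sparsity_factor=2)
Masking of the first token (mask_first_token), since it is redundant with the first global token
See the config.json file for more parameters

The default parameters work well in practice. If you run out of memory, reduce the block sizes, increase the sparsity factor, and remove dropout from the attention score matrix.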
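
The memory advice above can be expressed as a set of configuration overrides. In the sketch below, the parameter names come from the examples in this document; the concrete values (block sizes of 32, a sparsity factor of 8) and the helper names are illustrative assumptions, not recommended settings.

```python
MODEL_ID = "ccdv/lsg-bart-large-4096"


def low_memory_overrides(block_size=32, sparse_block_size=32, sparsity_factor=8):
    """Build keyword overrides that follow the low-memory advice above."""
    return {
        "trust_remote_code": True,             # required by the custom modeling file
        "block_size": block_size,              # smaller local blocks
        "sparse_block_size": sparse_block_size,  # smaller sparse blocks
        "sparsity_factor": sparsity_factor,    # larger sparsity factor
        "attention_probs_dropout_prob": 0.0,   # no dropout on attention scores
    }


def load_low_memory_model():
    """Load the model with the overrides (downloads weights on first call)."""
    from transformers import AutoModelForSeq2SeqLM  # imported lazily

    return AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID, **low_memory_overrides())


print(low_memory_overrides())
```

Keeping the overrides in one function makes it easy to compare several settings against the defaults on the same input.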

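Sparse selection type

There are 6 different sparse selection modes. The best type depends on the task.

If sparse_block_size=0 or sparsity_type="none", only local attention is used.
Note that for sequences with length < 2 * block size, the sparse selection type has no effect.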
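
The two rules above can be written as a small predicate. `sparse_selection_matters` is a hypothetical helper that only restates the conditions in the text; it does not call into the model. It assumes that "block size" in the length rule refers to `block_size`, and the `"norm"` default in the signature is just a placeholder value.

```python
def sparse_selection_matters(seq_len, block_size=128, sparse_block_size=128,
                             sparsity_type="norm"):
    """Return True when the choice of sparsity_type can affect attention."""
    # Sparse attention is switched off entirely in these two cases.
    if sparse_block_size == 0 or sparsity_type == "none":
        return False
    # Sequences shorter than two blocks are unaffected by the selection type.
    return seq_len >= 2 * block_size


for length in (100, 255, 256, 4096):
    print(length, sparse_selection_matters(length))
```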

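Sparse selection type	Description	Suitable sparsity factor	Additional parameters
`sparsity_type="bos_pooling"` (new)	Weighted average pooling using the BOS token	Usually large (8, 16, 32)	None
`sparsity_type="norm"`	Selects the tokens with the highest norm	Small (2 to 4)	None
`sparsity_type="pooling"`	Merges tokens with average pooling	Small (2 to 4)	None
`sparsity_type="lsh"`	Clusters similar tokens with the LSH algorithm	Large (4+)	`lsg_num_pre_rounds=1` (merges tokens n times before computing centroids)
`sparsity_type="stride"`	Each head uses different tokens strided by the sparsity factor	Not recommended if `sparsity_factor > num_heads`	None
`sparsity_type="block_stride"`	Each head uses blocks of tokens strided by the sparsity factor	Not recommended if `sparsity_factor > num_heads`	None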
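
The table can also be captured as a lookup, which makes it easy to catch typos in `sparsity_type` before loading the model. The sketch below is an illustration under assumptions: the helper `check_sparsity` is hypothetical, the accepted names are copied from the table plus `"none"`, and `num_heads=16` is the attention-head count of BART-large.

```python
SPARSITY_TYPES = {"none", "bos_pooling", "norm", "pooling", "lsh",
                  "stride", "block_stride"}
STRIDED_TYPES = {"stride", "block_stride"}


def check_sparsity(sparsity_type, sparsity_factor, num_heads=16):
    """Return a list of warnings for a sparsity setting (empty if none)."""
    if sparsity_type not in SPARSITY_TYPES:
        raise ValueError(f"unknown sparsity_type: {sparsity_type!r}")
    warnings = []
    # The table advises against a factor above the head count for strided types.
    if sparsity_type in STRIDED_TYPES and sparsity_factor > num_heads:
        warnings.append("sparsity_factor > num_heads is not recommended")
    return warnings


print(check_sparsity("stride", 32))
print(check_sparsity("lsh", 8))
```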

📚 Documentation

LSG ArXiv paper.
The GitHub repository and conversion script are available at this link.

📄 License

BART citation

@article{DBLP:journals/corr/abs-1910-13461,
  author    = {Mike Lewis and
               Yinhan Liu and
               Naman Goyal and
               Marjan Ghazvininejad and
               Abdelrahman Mohamed and
               Omer Levy and
               Veselin Stoyanov and
               Luke Zettlemoyer},
  title     = {{BART:} Denoising Sequence-to-Sequence Pre-training for Natural Language
               Generation, Translation, and Comprehension},
  journal   = {CoRR},
  volume    = {abs/1910.13461},
  year      = {2019},
  url       = {http://arxiv.org/abs/1910.13461},
  eprinttype = {arXiv},
  eprint    = {1910.13461},
  timestamp = {Thu, 31 Oct 2019 14:02:26 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1910-13461.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}