t5-base-finetuned-sarcasm-twitter open-source model - free to deploy for detecting sarcasm in text

T5 Base Finetuned Sarcasm Twitter

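Developed by mrm8488

This is a text classification model, fine-tuned from the T5-base architecture on a Twitter sarcasm dataset, that detects sarcasm in text.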

Text classification

Transformers

English #Twitter sarcasm detection #Generative text classification #Context-sensitive analysis

Downloads: 1,779

Released: 3/2/2022

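Model overview

By recasting sequence classification as text generation, the model identifies sarcastic statements in Twitter conversations. It is suited to social media content analysis.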

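Model features

Unified text-to-text framework

Handles the classification task in T5's text-generation form, so that tasks share one format

Context-aware analysis

Uses the conversation context when judging sarcasm, which improves detection accuracy

Lightweight fine-tuning

Fine-tuned efficiently from the pretrained T5 model, which suits domain-specific tasks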

Model capabilities

Sarcasm detection

Text classification

Conversation understanding

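Use cases

Social media analysis

Twitter sarcasm filtering

Automatically identifies sarcastic statements in user-generated content

F1 score of 0.83 on the test set

Augmenting conversational sentiment analysis

Serves as a complementary module in a sentiment analysis system to identify ironic expressions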

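🚀 T5-base fine-tuned for sarcasm detection 🙄

This project fine-tunes Google's T5 base model on a Twitter sarcasm dataset for the downstream task of **sequence classification (as text generation)**.

🚀 Quick start

Code example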

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-sarcasm-twitter")

# T5 is an encoder-decoder model, so load it with the seq2seq auto class.
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-sarcasm-twitter")

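def eval_conversation(text):
  # Encode the conversation, terminated by T5's end-of-sequence token.
  input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')

  # The label is a single short word, so only a few tokens are generated.
  output = model.generate(input_ids=input_ids, max_length=3)

  # Decode each generated sequence, dropping special tokens such as padding.
  dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output]

  label = dec[0]

  return label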

# To match the training data, replace user mentions in tweets with the @USER token and URLs with the URL token.

# Parentheses let adjacent string literals continue across lines.
twit1 = (
  "Trump just suspended the visa program that allowed me to move to the US to start @USER!"
  " Unfortunately, I won’t be able to vote in a few months but if you can, please vote him out, "
  "he's destroying what made America great in so many different ways!"
)

twit2 = (
  "@USER @USER @USER We have far more cases than any other country, "
  "so leaving remote workers in would be disastrous. Makes Trump sense."
)

twit3 = "My worry is that i wouldn’t be surprised if half the country actually agrees with this move..."

me = "Trump doing so??? It must be a mistake... XDDD"

conversation = twit1 + twit2

eval_conversation(conversation) #Output: 'derison'

conversation = twit1 + twit3

eval_conversation(conversation) #Output: 'normal'

conversation = twit1 + me

eval_conversation(conversation) #Output: 'derison'

# We will get 'normal' when sarcasm is not detected and 'derison' when detected
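
The comment in the example above asks for user mentions and URLs to be replaced with the `@USER` and `URL` tokens before inference. A minimal sketch of that preprocessing step follows. The helper name `normalize_tweet` and both regular expressions are assumptions made for illustration; they are not the preprocessing that was used to build the training data.

```python
import re

# Hypothetical patterns (assumptions): a mention is "@" followed by word
# characters and not preceded by a word character (to skip e-mail addresses);
# a URL starts with http:// or https:// and runs to the next whitespace.
MENTION_RE = re.compile(r"(?<!\w)@\w+")
URL_RE = re.compile(r"https?://\S+")


def normalize_tweet(text: str) -> str:
    """Replace URLs with 'URL' and user mentions with '@USER'."""
    # URLs are replaced first so that an "@" inside a URL is not treated
    # as a mention.
    text = URL_RE.sub("URL", text)
    text = MENTION_RE.sub("@USER", text)
    return text


cleaned = normalize_tweet("@bob check https://t.co/abc now")
```

The cleaned string can then be passed to `eval_conversation` in place of the raw tweet.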

✨ Key features

Fine-tuned from Google's T5 base model for sarcasm detection.
Performs sequence classification in the form of text generation.
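
Because classification is done by generating text, the model returns a word (`'normal'` or `'derison'`) rather than a class index. The sketch below shows one way to turn that word into the dataset's `SARCASM` / `NOT_SARCASM` labels. The mapping and the helper name `to_dataset_label` are assumptions for illustration; the source does not state how the generated words correspond to the dataset labels.

```python
# Assumed mapping from generated words to dataset labels (not confirmed
# by the source).
GENERATED_TO_LABEL = {"derison": "SARCASM", "normal": "NOT_SARCASM"}


def to_dataset_label(generated: str) -> str:
    """Map the decoded model output to a dataset label (hypothetical helper)."""
    # Remove T5 special tokens that may remain if the output was decoded
    # without skip_special_tokens=True, then normalize whitespace and case.
    key = generated.replace("<pad>", "").replace("</s>", "").strip().lower()
    if key not in GENERATED_TO_LABEL:
        # A generative classifier can emit text outside the label set.
        raise ValueError(f"unexpected model output: {generated!r}")
    return GENERATED_TO_LABEL[key]
```

Raising on unknown output is a deliberate choice here: a generative model is not constrained to the label vocabulary, so silently defaulting to one class could hide errors.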

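📚 Detailed documentation

T5 model details

T5 was proposed by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu in the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The paper's abstract follows:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.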

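Downstream task details (sequence classification as text generation) - Dataset 📚

Twitter sarcasm dataset

The dataset provides Twitter training and test sets for sarcasm detection in jsonlines format.

Each line contains one JSON object with the following fields:

label: SARCASM or NOT_SARCASM
- Not present in the test data
id: string identifier for the sample. This id is required when submitting results.
- Present only in the test data
response: the response tweet whose label is to be predicted (the dataset describes it as the sarcastic response)
context: the conversation context of the response
- Note that the context is an ordered list of dialogue turns: if it contains three elements c1, c2, c3, then c2 is a reply to c1 and c3 is a reply to c2. Further, if the response is r, then r is a reply to c3.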

For example, here is one training example:

{"label": "SARCASM", "response": "Did Kelly just call someone else messy? Baaaahaaahahahaha", "context": ["X is looking a First Lady should . #classact", "didn't think it was tailored enough it looked messy"]}

The response tweet "Did Kelly..." is a reply to its immediate context "didn't think it was tailored...", which in turn is a reply to "X is looking...". The goal is to predict the label of "response" while using the context, either the immediate context alone or the full context.
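
The record layout described above can be turned into a single input string as in the sketch below. It parses one jsonlines record and joins either the immediate context or the full context with the response. The helper name `build_input`, its `use_full_context` parameter, and the single-space separator are assumptions for illustration; the source does not say how the context and response were combined during fine-tuning.

```python
import json


def build_input(record: dict, use_full_context: bool = True) -> str:
    """Join context turns and the response into one string (hypothetical helper)."""
    context = record["context"]  # ordered: each turn replies to the previous one
    # The immediate context is the last turn, the one the response replies to.
    turns = context if use_full_context else context[-1:]
    # Assumption: turns are separated by a single space.
    return " ".join(turns + [record["response"]])


# One jsonlines record, serialized and parsed again to mimic reading a line
# from the dataset file.
example_line = json.dumps({
    "label": "SARCASM",
    "response": "Did Kelly just call someone else messy? Baaaahaaahahahaha",
    "context": [
        "X is looking a First Lady should . #classact",
        "didn't think it was tailored enough it looked messy",
    ],
})
record = json.loads(example_line)

immediate = build_input(record, use_full_context=False)
full = build_input(record)
```

Test records carry an `id` field and no `label`, so code that reads both splits should use `record.get("label")` rather than indexing the key directly.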

Dataset size statistics: