t5-base-finetuned-sarcasm-twitter open-source model - free deployment for detecting sarcasm in text


T5 Base Finetuned Sarcasm Twitter

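Developed by mrm8488

A text classification model based on the T5-base architecture, fine-tuned on a Twitter sarcasm dataset to detect sarcasm in text.

Text classification

Transformers

English #Twitter sarcasm detection #Generative text classification #Context-sensitive analysis

Downloads: 1,779

Released: 3/2/2022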

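Model Overview

By casting sequence classification as text generation, the model identifies sarcastic statements in Twitter conversations. It is suited to social media content analysis.

Model Features

Unified text-to-text framework

Handles classification through T5's text-generation formulation, so the task fits the same interface as other text-to-text tasks

Context-aware analysis

Takes the conversation context into account when judging sarcasm, which improves detection accuracy

Lightweight fine-tuning

Fine-tunes the pretrained T5 model efficiently for a domain-specific task

Model Capabilities

Sarcasm detection

Text classification

Conversation understanding

Use Cases

Social media analysis

Filtering sarcastic content on Twitter

Automatically identifies sarcastic statements in user-generated content

F1 score of 0.83 (on the test set)

Enhancing conversational sentiment analysis

Serves as a complementary module in a sentiment analysis system to recognize ironic expressions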

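🚀 T5-base fine-tuned for Sarcasm Detection 🙄

This project fine-tunes Google's T5 base model on a Twitter sarcasm dataset for the downstream task of **sequence classification (as text generation)**.

🚀 Quick Start

Code example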

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is its replacement for T5 checkpoints.
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-sarcasm-twitter")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-sarcasm-twitter")

def eval_conversation(text):
  # Encode the conversation, appending the end-of-sequence token as in the original example.
  input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')
  # The label is only a few tokens long, so a short max_length is enough.
  output = model.generate(input_ids=input_ids, max_length=3)
  # Drop special tokens such as <pad> and </s> so that only the label text remains.
  dec = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output]
  label = dec[0]
  return label

# To match the training data, replace user mentions in tweets with the @USER token and URLs with the URL token.

# Parentheses let the adjacent string literals span several lines.
twit1 = ("Trump just suspended the visa program that allowed me to move to the US to start @USER!"
         " Unfortunately, I won’t be able to vote in a few months but if you can, please vote him out, "
         "he's destroying what made America great in so many different ways!")

twit2 = ("@USER @USER @USER We have far more cases than any other country, "
         "so leaving remote workers in would be disastrous. Makes Trump sense.")

twit3 = "My worry is that i wouldn’t be surprised if half the country actually agrees with this move..."

me = "Trump doing so??? It must be a mistake... XDDD"

conversation = twit1 + twit2
eval_conversation(conversation) #Output: 'derison'

conversation = twit1 + twit3
eval_conversation(conversation) #Output: 'normal'

conversation = twit1 + me
eval_conversation(conversation) #Output: 'derison'

# We get 'normal' when no sarcasm is detected and 'derison' (sic) when it is detected.
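The comment in the example above says that user mentions and URLs should be replaced with the @USER and URL tokens before inference. A minimal sketch of such a preprocessing step follows. The function name `normalize_tweet` and both regular expressions are assumptions made for illustration; the patterns actually used to build the training data are not given in the source.

```python
import re

# Hypothetical patterns: the exact rules used to prepare the training data are not documented here.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def normalize_tweet(text: str) -> str:
    """Replace URLs with the URL token and user mentions with the @USER token."""
    # URLs are replaced first so that an '@' inside a link is never treated as a mention.
    text = URL_RE.sub("URL", text)
    text = MENTION_RE.sub("@USER", text)
    return text

cleaned = normalize_tweet("@some_user have you read https://example.com/story yet?")
print(cleaned)
```

The mention pattern is deliberately simple and would also rewrite the domain part of an e-mail address, so real data may need a stricter rule.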

✨ Key Features

Fine-tuned from Google's T5 base model for the sarcasm detection task.
Performs sequence classification in the form of text generation.
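To make the "classification as text generation" idea concrete, here is a small sketch of how class labels can be expressed as target strings for a text-to-text model and how a generated string can be mapped back to a class. The mapping between the dataset labels (SARCASM, NOT_SARCASM) and the output strings ('derison', 'normal') is an assumption inferred from this page, and the helper names are hypothetical; this is not the author's training code.

```python
from typing import Optional, Tuple

# Assumed mapping: dataset label -> string the model is trained to generate.
LABEL_TO_TARGET = {"SARCASM": "derison", "NOT_SARCASM": "normal"}
TARGET_TO_LABEL = {target: label for label, target in LABEL_TO_TARGET.items()}

def to_text_pair(conversation: str, label: str) -> Tuple[str, str]:
    """Turn a labeled example into a (source text, target text) pair."""
    return conversation, LABEL_TO_TARGET[label]

def parse_generated(text: str) -> Optional[str]:
    """Map generated text back to a dataset label, or None if it matches no known label."""
    # Strip special tokens that a raw decode may leave in the string.
    cleaned = text.replace("<pad>", "").replace("</s>", "").strip()
    return TARGET_TO_LABEL.get(cleaned)

source, target = to_text_pair("Oh great, another Monday.", "SARCASM")
predicted = parse_generated("<pad> derison</s>")
```

Because the model generates free text, returning None for an unrecognized string keeps unexpected outputs visible instead of silently assigning them to a class.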

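📚 Detailed Documentation

T5 Model Details

The T5 model was presented in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Here is the abstract:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

Model image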

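Details of the downstream task (sequence classification as text generation) - Dataset 📚

Twitter Sarcasm Dataset

For the sarcasm detection task, the dataset provides Twitter training and test data in jsonlines format.

Each line contains a JSON object with the following fields:

label: SARCASM or NOT_SARCASM
- NOT present in the test data
id: string identifier for the sample. This id is required when submitting results.
- ONLY present in the test data
response: the response Tweet to classify (described in the dataset as the sarcastic response)
context: the conversation context of the response
- Note that the context is an ordered list of dialogue turns: if the context contains three elements c1, c2, c3, then c2 is a reply to c1 and c3 is a reply to c2. Further, if the response is r, then r is a reply to c3.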

For example, here is one training example:

{"label": "SARCASM", "response": "Did Kelly just call someone else messy? Baaaahaaahahahaha", "context": ["X is looking a First Lady should . #classact", "didn't think it was tailored enough it looked messy"]}

The response Tweet "Did Kelly..." is a reply to its immediate context "didn't think it was tailored...", which in turn is a reply to "X is looking...". The goal is to predict the label of "response" while also using the context (either the immediate context or the full context).
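The record layout described above can be read with the standard `json` module. The sketch below parses one jsonlines record and builds a single input string from either the immediate context or the full context plus the response. Joining the turns with a single space is an assumption made for illustration; the source does not state how context and response were combined for training. `build_input` is a hypothetical helper name.

```python
import json

def build_input(record: dict, use_full_context: bool = True) -> str:
    """Concatenate the context turns (oldest first) and the response into one string."""
    context = record["context"]
    # The immediate context is the last element of the ordered context list.
    turns = context if use_full_context else context[-1:]
    # Assumption: turns are joined with a single space.
    return " ".join(list(turns) + [record["response"]])

# One line of the jsonlines training file, reproduced from the example above.
example_line = json.dumps({
    "label": "SARCASM",
    "response": "Did Kelly just call someone else messy? Baaaahaaahahahaha",
    "context": [
        "X is looking a First Lady should . #classact",
        "didn't think it was tailored enough it looked messy",
    ],
})

record = json.loads(example_line)
full_input = build_input(record)                               # full context + response
immediate_input = build_input(record, use_full_context=False)  # last turn + response
```

For a whole file, the same `json.loads` call would be applied to each non-empty line in turn.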

Dataset size statistics: