mT5_multilingual_XLSum Open-Source Model - Free Deployment, Summarization in 45 Languages

Mt5 Multilingual XLSum

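Developed by csebuetnlp

An mT5 model fine-tuned on the 45 languages of the XL-Sum dataset for multilingual summarization

Text generation

Transformers

Supports multiple languages  #multilingual summarization  #45 languages  #news summarization

Downloads: 73.34k

Released: 3/2/2022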

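Model Overview

This is a multilingual summarization model based on the mT5 architecture. It was fine-tuned on the XL-Sum dataset and supports text summarization in 45 languages.

Model Features

Multilingual support

Supports summarization in 45 languages

High performance

Achieves a ROUGE-1 score of 36.5002 on the XL-Sum test set

Based on mT5

Built on the pretrained mT5 architecture, which is suited to multilingual tasks

Model Capabilities

Text summarization

Multilingual processing

Long-text understanding

Use Cases

News summarization

News article summarization

Automatically condenses long news articles into concise summaries

Produces short summaries that accurately reflect the source content

Content management

Social media content summarization

Generates content summaries for social media platforms

Helps users quickly grasp long content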

🚀 mT5-multilingual-XLSum

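This repository contains the mT5 checkpoint fine-tuned on the 45 languages of the XL-Sum dataset. For fine-tuning details and scripts, see the paper and the official repository.

🚀 Quick Start

Requirements

This model was tested with the transformers library, version 4.11.0.dev0.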
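
Since the card names a specific tested version, it can be useful to compare it against the locally installed one before running the example. The following is a minimal sketch, not part of the original card: the helper names `release_tuple`, `installed_transformers_version`, and `is_at_least_tested` are hypothetical, and the comparison deliberately ignores pre-release suffixes such as `.dev0`.

```python
import re
from importlib import metadata

# Version named in this model card.
TESTED_VERSION = "4.11.0.dev0"


def release_tuple(version):
    """Return the numeric release part, e.g. '4.11.0.dev0' -> (4, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    # Pad to three components so '4.11' and '4.11.0' compare as equal.
    return tuple((parts + [0, 0, 0])[:3])


def installed_transformers_version():
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def is_at_least_tested(installed, tested=TESTED_VERSION):
    """True if the installed release is not older than the tested one."""
    return release_tuple(installed) >= release_tuple(tested)


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif is_at_least_tested(installed):
    print(f"transformers {installed} is at or above the tested {TESTED_VERSION}")
else:
    print(f"transformers {installed} is older than the tested {TESTED_VERSION}")
```

A newer release is not guaranteed to behave identically to the tested one; the check only flags installations older than it.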

Code example

import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Collapse newlines and any other runs of whitespace into single spaces.
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

article_text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said.  The policy includes the termination of accounts of anti-vaccine influencers.  Tech giants have been criticised for not doing more to counter false health information on their sites.  In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue.  YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines.  In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B.  "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

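# Tokenize the cleaned article, padding or truncating it to exactly 512 tokens.
input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

# Generate a summary of at most 84 tokens with 4-beam search, blocking repeated bigrams.
output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]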

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)
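
The example above summarizes a single article. To summarize several at once, the same calls can be wrapped in a helper. This is a sketch under stated assumptions rather than code from the original card: `normalize_whitespace` and `summarize_batch` are hypothetical names, the generation settings are copied from the example, and the batch is padded only to its longest article, with the attention mask passed explicitly so that padding is ignored.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace, including newlines, into one space."""
    return re.sub(r"\s+", " ", text.strip())


def summarize_batch(texts, tokenizer, model,
                    max_input_tokens=512, max_summary_tokens=84):
    """Summarize several articles with a single generate() call.

    `tokenizer` and `model` are expected to be the objects loaded in the
    quick-start example (AutoTokenizer / AutoModelForSeq2SeqLM).
    """
    batch = tokenizer(
        [normalize_whitespace(t) for t in texts],
        return_tensors="pt",
        padding=True,          # pad only to the longest article in the batch
        truncation=True,       # longer articles are cut at max_input_tokens
        max_length=max_input_tokens,
    )
    output_ids = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],  # mask out padding tokens
        max_length=max_summary_tokens,
        no_repeat_ngram_size=2,
        num_beams=4,
    )
    # One decoded summary per input article, in the same order.
    return tokenizer.batch_decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
```

With `tokenizer` and `model` loaded as in the example, a call such as `summarize_batch([article_text, another_article], tokenizer, model)` would return one summary per article, where `another_article` stands for any second input string.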

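✨ Key Features

Multilingual support: the model was fine-tuned on the 45 languages of the XL-Sum dataset and supports text summarization across them.
Strong results: ROUGE-1 scores on the XL-Sum test sets range from 15.96 (Burmese) to 48.15 (Japanese); see the benchmark table for per-language figures.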

📚 Documentation

Model information

Property	Details
Model name	csebuetnlp/mT5_multilingual_XLSum
Model type	Multilingual summarization model
Training dataset	XL-Sum
Evaluation metrics	ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-LSUM, loss, gen_len

Benchmarks

Scores on the XL-Sum test sets:

Language	ROUGE-1 / ROUGE-2 / ROUGE-L
Amharic	20.0485 / 7.4111 / 18.0753
Arabic	34.9107 / 14.7937 / 29.1623
Azerbaijani	21.4227 / 9.5214 / 19.3331
Bengali	29.5653 / 12.1095 / 25.1315
Burmese	15.9626 / 5.1477 / 14.1819
Chinese (Simplified)	39.4071 / 17.7913 / 33.406
Chinese (Traditional)	37.1866 / 17.1432 / 31.6184
English	37.601 / 15.1536 / 29.8817
French	35.3398 / 16.1739 / 28.2041
Gujarati	21.9619 / 7.7417 / 19.86
Hausa	39.4375 / 17.6786 / 31.6667
Hindi	38.5882 / 16.8802 / 32.0132
Igbo	31.6148 / 10.1605 / 24.5309
Indonesian	37.0049 / 17.0181 / 30.7561
Japanese	48.1544 / 23.8482 / 37.3636
Kirundi	31.9907 / 14.3685 / 25.8305
Korean	23.6745 / 11.4478 / 22.3619
Kyrgyz	18.3751 / 7.9608 / 16.5033
Marathi	22.0141 / 9.5439 / 19.9208
Nepali	26.6547 / 10.2479 / 24.2847
Oromo	18.7025 / 6.1694 / 16.1862
Pashto	38.4743 / 15.5475 / 31.9065
Persian	36.9425 / 16.1934 / 30.0701
Pidgin	37.9574 / 15.1234 / 29.872
Portuguese	37.1676 / 15.9022 / 28.5586
Punjabi	30.6973 / 12.2058 / 25.515
Russian	32.2164 / 13.6386 / 26.1689
Scottish Gaelic	29.0231 / 10.9893 / 22.8814
Serbian (Cyrillic)	23.7841 / 7.9816 / 20.1379
Serbian (Latin)	21.6443 / 6.6573 / 18.2336
Sinhala	27.2901 / 13.3815 / 23.4699
Somali	31.5563 / 11.5818 / 24.2232
Spanish	31.5071 / 11.8767 / 24.0746
Swahili	37.6673 / 17.8534 / 30.9146
Tamil	24.3326 / 11.0553 / 22.0741
Telugu	19.8571 / 7.0337 / 17.6101
Thai	37.3951 / 17.275 / 28.8796
Tigrinya	25.321 / 8.0157 / 21.1729
Turkish	32.9304 / 15.5709 / 29.2622
Ukrainian	23.9908 / 10.1431 / 20.9199
Urdu	39.5579 / 18.3733 / 32.8442
Uzbek	16.8281 / 6.3406 / 15.4055
Vietnamese	32.8826 / 16.2247 / 26.0844
Welsh	32.6599 / 11.596 / 26.1164
Yoruba	31.6595 / 11.6599 / 25.0898
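
To make the table easier to read, here is a minimal sketch of how a ROUGE-N F-measure is computed from n-gram overlap. It is an illustration only: it assumes plain whitespace tokenization, the function names are hypothetical, and it does not reproduce the language-specific tokenization that a multilingual evaluation would need, so it will not regenerate the numbers above. The table values are presumably F-measures scaled to 0-100, whereas this sketch returns values between 0 and 1. ROUGE-L, which is based on the longest common subsequence, is not covered.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """ROUGE-N F-measure between a reference summary and a candidate summary."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    if not ref or not cand:
        return 0.0
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n_f1(reference, candidate, n=1), 4))  # 5 of 6 unigrams match
print(round(rouge_n_f1(reference, candidate, n=2), 4))  # 3 of 5 bigrams match
```

In this toy pair, five of the six unigrams and three of the five bigrams overlap, so ROUGE-1 comes out higher than ROUGE-2, the same ordering seen in every row of the table.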

📄 License

This model is released under the cc-by-nc-sa-4.0 license.

📝 Citation

If you use this model, please cite the following paper:

@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid  and
      Bhattacharjee, Abhik  and
      Islam, Md. Saiful  and
      Mubasshir, Kazi  and
      Li, Yuan-Fang  and
      Kang, Yong-Bin  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}