bert-base-arabic-camelbert-msa-did-madar-twitter5 open-source model

Bert Base Arabic Camelbert Msa Did Madar Twitter5

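Developed by CAMeL-Lab

An Arabic dialect identification model fine-tuned from CAMeLBERT-MSA that classifies text into 21 dialect labels

Text classification

Transformers

Language: Arabic · License: Apache-2.0 · #Arabic dialect identification #Social media text analysis #Multi-dialect classification

Downloads: 90

Released: 3/2/2022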

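Model Overview

This model was built by fine-tuning CAMeLBERT-MSA for Arabic dialect identification. It was trained on the MADAR Twitter-5 dataset and distinguishes 21 Arabic dialect labels.

Model Features

Multi-dialect support

Recognizes 21 Arabic dialect variants, including those of Egypt and Kuwait

Domain optimization

Fine-tuned on Twitter data, so it is suited to informal Arabic as written on social media

Academic validation

The training method and performance are documented in a paper published by the Association for Computational Linguistics (Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021)

Model Capabilities

Arabic dialect classification

Social media text analysis

Multi-dialect variant identification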

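Use Cases

Social media analysis

Geographic analysis of Twitter users

Infer a user's likely geographic origin from the dialect features of their posts

Distinguishes 21 Arabic dialects; accuracy varies from dialect to dialect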
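
As a rough illustration of the user-level inference described above, the sketch below sums per-tweet scores by label and picks the label with the largest total. It assumes predictions in the shape returned by a transformers text-classification pipeline (a list of dictionaries with 'label' and 'score'). The function name `guess_user_dialect` and the sample values are hypothetical and are not part of the model or of CAMeL Tools.

```python
from collections import defaultdict


def guess_user_dialect(predictions):
    """Sum per-tweet scores by label; return (best_label, share_of_total_score)."""
    totals = defaultdict(float)
    for pred in predictions:
        totals[pred["label"]] += pred["score"]
    if not totals:
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())


# Hypothetical top-1 predictions for three tweets by one user
# (illustrative values, not real model output).
tweets = [
    {"label": "Egypt", "score": 0.6},
    {"label": "Egypt", "score": 0.5},
    {"label": "Kuwait", "score": 0.4},
]
label, share = guess_user_dialect(tweets)
print(label, round(share, 2))
```

Summing scores rather than counting votes lets confident predictions weigh more than uncertain ones; a simple majority vote would be an equally reasonable choice.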

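Linguistic research

Dialect distribution studies

Analyze how often each dialect is used, and how usage is distributed, within a given topic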
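
A distribution study of this kind could be sketched as a frequency count over per-tweet predictions. The example below again assumes pipeline-style output (dictionaries with 'label' and 'score'). The helper `dialect_distribution`, its `min_score` cutoff, and the sample data are assumptions made for illustration only.

```python
from collections import Counter


def dialect_distribution(predictions, min_score=0.5):
    """Return {label: fraction} over predictions scoring at least min_score."""
    counts = Counter(p["label"] for p in predictions if p["score"] >= min_score)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders labels from most to least frequent.
    return {label: n / total for label, n in counts.most_common()}


# Hypothetical predictions for tweets on one topic (not real model output).
topic_predictions = [
    {"label": "Egypt", "score": 0.9},
    {"label": "Egypt", "score": 0.7},
    {"label": "Kuwait", "score": 0.8},
    {"label": "Kuwait", "score": 0.3},  # dropped: below the cutoff
]
distribution = dialect_distribution(topic_predictions)
print(distribution)
```

Discarding low-confidence predictions trades coverage for cleaner counts; the right cutoff depends on the data and is not specified by the model authors.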

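🚀 CAMeLBERT-MSA DID MADAR Twitter-5 Model

This is a dialect identification (DID) model built by fine-tuning the CAMeLBERT-MSA model on the MADAR Twitter-5 dataset, which has 21 labels.

🚀 Quick Start

You can use the CAMeLBERT-MSA DID MADAR Twitter-5 model as part of a transformers pipeline. The model will also be available in CAMeL Tools soon.

✨ Key Features

Fine-tuned: built by fine-tuning CAMeLBERT-MSA, so it inherits what the pre-trained model has already learned.
Task-specific dataset: fine-tuned on the MADAR Twitter-5 dataset, which contains 21 labels.
Reproducible: the fine-tuning procedure and the hyperparameters are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models", and the fine-tuning code is available here.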

📦 Installation

To download the model you need transformers>=3.5.0. Otherwise, you can download the model manually.
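
One way to check the version requirement before loading the model is sketched below. It reads the installed version with the standard-library `importlib.metadata` (Python 3.8+). The helpers `parse_version` and `meets_minimum` are hypothetical, and the parsing is deliberately simplified: it keeps only the leading digits of the first three version components, so a pre-release such as 3.5.0rc1 is treated as 3.5.0.

```python
import re
from importlib import metadata

MIN_VERSION = (3, 5, 0)  # minimum stated in this model card


def parse_version(text):
    """Turn '4.41.2' or '4.41.0.dev0' into a tuple of three integers."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '3.5'
    return tuple(parts)


def meets_minimum(version_text, minimum=MIN_VERSION):
    """True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


try:
    installed = metadata.version("transformers")
    status = "OK" if meets_minimum(installed) else "too old: upgrade or download manually"
    print(installed, status)
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```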

💻 Usage Examples

Basic usage

>>> from transformers import pipeline
>>> did = pipeline('text-classification', model='CAMeL-Lab/bert-base-arabic-camelbert-msa-did-madar-twitter5')
>>> sentences = ['عامل ايه ؟', 'شلونك ؟ شخبارك ؟']
>>> did(sentences)
[{'label': 'Egypt', 'score': 0.5741344094276428},
 {'label': 'Kuwait', 'score': 0.5225679278373718}]
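
The pipeline call above returns only the top label for each sentence. When the runner-up labels matter, one option is to load the model directly and rank every label by probability. The sketch below assumes PyTorch is installed; the helper names `softmax`, `top_k_labels`, and `classify` are hypothetical. The heavy imports are deferred into `classify` so the ranking helpers can be used on their own.

```python
import math

MODEL_NAME = "CAMeL-Lab/bert-base-arabic-camelbert-msa-did-madar-twitter5"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(probs, range(len(probs))), reverse=True)[:k]
    return [(id2label[index], prob) for prob, index in ranked]


def classify(sentence, k=3):
    """Load the model (downloaded on first use) and rank labels for one sentence."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Index 0 of the model output holds the logits; [0] again selects
        # the single sentence in the batch.
        logits = model(**inputs)[0][0].tolist()
    return top_k_labels(logits, model.config.id2label, k)


# Example call (downloads the model, so it is left commented out):
# classify('عامل ايه ؟')
```

In real use the tokenizer and model would be loaded once and reused across sentences; they are loaded inside `classify` here only to keep the sketch short.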

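📚 Detailed Documentation

Model description

The CAMeLBERT-MSA DID MADAR Twitter-5 model is a dialect identification (DID) model built by fine-tuning the CAMeLBERT-MSA model. Fine-tuning used the MADAR Twitter-5 dataset, which contains 21 labels. The fine-tuning procedure and the hyperparameters are described in the paper "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models", and the fine-tuning code is available here.

Intended uses

You can use the model as part of a transformers pipeline. The model will also be available in CAMeL Tools soon.

📄 License

This project is licensed under the Apache-2.0 license.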

📚 Citation

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}