mT5-base-finetuned-tydiQA-xqa Open-source Multilingual Q&A Model

Developed by Narrativa

This multilingual Q&A model is Google's mT5-base fine-tuned on the TyDi QA dataset. The mT5 base model was pretrained on text in 101 languages; TyDi QA, the fine-tuning dataset, covers 11 typologically diverse languages.

Question Answering System

Transformers

Other #Multilingual Q&A #101 Language Support #TyDiQA Fine-tuning

Downloads 368

Release Time: 3/2/2022

Model Overview

A model specifically designed for multilingual Q&A tasks, capable of understanding questions and extracting answers from given contexts in multiple languages.

Model Features

Multilingual Support

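Builds on mT5-base, whose pretraining corpus covers 101 languages, including most major languages worldwide. Fine-tuning used TyDi QA, which covers 11 languages, and results for languages outside that set are not reported here.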

Real-world Q&A

Trained on the TyDi QA dataset with questions posed by real users, avoiding biases introduced by translation.

Typological Diversity

Training data includes typologically highly diverse languages, enhancing the model's generalization capabilities.

Model Capabilities

Multilingual Q&A

Context Understanding

Information Extraction

Use Cases

Multilingual Information Retrieval

Cross-lingual Knowledge Q&A

Search and answer user questions in documents of different languages

Reported exact match (EM) score of 60.88 on the TyDi QA validation set

Multilingual Customer Support

Automatically answer customer inquiries in different languages

🚀 mT5-base fine-tuned on TyDiQA for multilingual QA 🗺📖❓

🚀 Quick Start

This is Google's mT5-base fine-tuned on the TyDi QA secondary task (Gold Passage, GoldP) for the multilingual Q&A downstream task.

✨ Features

Details of mT5

Google's mT5 is pretrained on the mC4 corpus, which covers 101 languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Burmese, Catalan, Cebuano, Chichewa, Chinese, Corsican, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Igbo, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish, Kyrgyz, Lao, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Nepali, Norwegian, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Scottish Gaelic, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Sotho, Spanish, Sundanese, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, West Frisian, Xhosa, Yiddish, Yoruba, Zulu.

Note: mT5 itself was pre-trained only on mC4, without any supervised training, so the base model has to be fine-tuned before it is usable on a downstream task. The checkpoint described here is such a fine-tuned version.

Pretraining Dataset: mC4
Other Community Checkpoints: here
Paper: mT5: A massively multilingual pre-trained text-to-text transformer
Authors: Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel

Details of the dataset 📚

TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology (the set of linguistic features that each language expresses), so models that perform well on this set are expected to generalize across a large number of the world's languages. The dataset contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but don't know it yet (unlike SQuAD and its descendants), and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).

Property	Details
Dataset	TyDi QA
Task	GoldP
Split (train)	49881 samples
Split (valid)	5077 samples
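
As a rough illustration of how GoldP examples could be turned into text-to-text pairs, here is a minimal sketch. It makes several assumptions that are not stated in this card: that the data is available through the Hugging Face `datasets` library under the identifier `google-research-datasets/tydiqa` with the `secondary_task` configuration, that each example follows the SQuAD-style layout (`question`, `context`, `answers["text"]`), and that fine-tuning used the same `question: ... context: ...` template as the usage example in this card. The helper names `load_goldp` and `to_text_pair` are hypothetical.

```python
def load_goldp(dataset_name="google-research-datasets/tydiqa"):
    """Load the GoldP (secondary task) splits.

    Assumption: the dataset identifier and the "secondary_task" config name
    are not given in this card and may differ in your environment.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`
    return load_dataset(dataset_name, "secondary_task")


def to_text_pair(example):
    """Convert one SQuAD-style example into a (source, target) text pair.

    Assumption: the source template mirrors the inference prompt used in
    this card's usage example, and the first gold answer is the target.
    """
    source = 'question: %s  context: %s' % (example["question"], example["context"])
    target = example["answers"]["text"][0]
    return source, target


# Hand-written example in the assumed layout (not taken from the dataset)
sample = {
    "question": "Where was the paper presented?",
    "context": "The paper was presented at EMNLP2020.",
    "answers": {"text": ["EMNLP2020"], "answer_start": [27]},
}
source, target = to_text_pair(sample)
```

If the assumptions hold, `len(load_goldp()["train"])` and `len(load_goldp()["validation"])` should match the split sizes in the table above.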

Results on validation dataset 📝

Metric	Value
EM	60.88
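
For readers unfamiliar with the metric, exact match (EM) is the percentage of predictions that equal one of the gold answers after light normalization. The sketch below is a simplified illustration, not the evaluation script used to produce the number above: the normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption, and the function names are hypothetical.

```python
import string


def normalize_answer(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, references):
    """Return 1.0 if the prediction equals any reference after normalization, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def em_score(predictions, references_list):
    """Average exact match over a dataset, expressed as a percentage."""
    if not predictions:
        return 0.0
    total = sum(exact_match(p, refs) for p, refs in zip(predictions, references_list))
    return 100.0 * total / len(predictions)


# Two toy predictions: the first matches its reference, the second does not
score = em_score(["EMNLP 2020", "Paris"], [["emnlp 2020."], ["London"]])
```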

💻 Usage Examples

Basic Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = AutoTokenizer.from_pretrained("Narrativa/mT5-base-finetuned-tydiQA-xqa")
# mT5 is an encoder-decoder model, so it is loaded with the seq2seq class
model = AutoModelForSeq2SeqLM.from_pretrained("Narrativa/mT5-base-finetuned-tydiQA-xqa").to(device)

def get_response(question, context, max_length=32):
  # The question and context are joined into a single text prompt
  input_text = 'question: %s  context: %s' % (question, context)
  features = tokenizer([input_text], return_tensors='pt')

  output = model.generate(input_ids=features['input_ids'].to(device),
               attention_mask=features['attention_mask'].to(device),
               max_length=max_length)

  # Skip special tokens such as <pad> and </s> so only the answer text is returned
  return tokenizer.decode(output[0], skip_special_tokens=True)
  
# Some examples in different languages

context = 'HuggingFace won the best Demo paper at EMNLP2020.'
question = 'What won HuggingFace?'
get_response(question, context)

context = 'HuggingFace ganó la mejor demostración con su paper en la EMNLP2020.'
question = 'Qué ganó HuggingFace?'
get_response(question, context)

context = 'HuggingFace выиграл лучшую демонстрационную работу на EMNLP2020.'
question = 'Что победило в HuggingFace?'
get_response(question, context)

Advanced Usage

The code under Basic Usage covers the common case. For other scenarios, adjust the generation parameters: for example, raise max_length (the maximum number of generated tokens, 32 by default in get_response) if answers are being cut off.
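
As one possible extension, the sketch below answers several question/context pairs in a single batch and uses beam search. It is an illustration under stated assumptions rather than a recommended configuration: the helper names `format_input` and `answer_batch` are hypothetical, and the defaults `max_input_length=512` and `num_beams=4` are arbitrary choices that are not taken from this card. The `model` and `tokenizer` arguments are the objects loaded in the basic example.

```python
def format_input(question, context):
    # Same prompt template as the basic example, including the two spaces before "context:"
    return 'question: %s  context: %s' % (question, context)


def answer_batch(model, tokenizer, pairs, device='cpu',
                 max_input_length=512, max_length=32, num_beams=4):
    """Answer a list of (question, context) pairs in one generate() call.

    The default values here are illustrative assumptions, not tuned settings.
    """
    texts = [format_input(question, context) for question, context in pairs]
    # Pad to the longest prompt in the batch and truncate overly long contexts
    features = tokenizer(texts, return_tensors='pt', padding=True,
                         truncation=True, max_length=max_input_length)
    output = model.generate(input_ids=features['input_ids'].to(device),
                            attention_mask=features['attention_mask'].to(device),
                            max_length=max_length,
                            num_beams=num_beams,
                            early_stopping=True)
    # Decode every sequence in the batch and drop special tokens
    return tokenizer.batch_decode(output, skip_special_tokens=True)
```

A call would look like `answer_batch(model, tokenizer, [(question, context)], device=device)`. Note that truncation can remove the part of the context that contains the answer, so long documents may need to be split into shorter passages first.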

📚 Documentation

This model was created by Narrativa.

About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI
