🚀 Vietnamese and English Question Answering Model
This model is designed for extractive question-answering in both Vietnamese and English, leveraging XLM-RoBERTa fine-tuned on multiple datasets.
🚀 Quick Start
📦 Installation
The original README does not provide installation instructions. To use this model you will need the Hugging Face transformers library; see its official documentation for full setup details.
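As a minimal, assumed setup (not an official requirement list from the author), the usage examples below only need transformers and a PyTorch backend:

```bash
pip install transformers torch
```

The advanced example additionally requires the infer and model.mrc_model helper modules from the project repository.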
💻 Usage Examples
🔍 Basic Usage (Hugging Face pipeline style - NOT using sum features strategy)
```python
from transformers import pipeline

model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)

QA_input = {
    # "What is Binh an expert in?"
    'question': "Bình là chuyên gia về gì ?",
    # "Binh Nguyen is an enthusiast in the field of natural language processing.
    #  He received the Google Developer Expert certificate in 2020."
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

res = nlp(QA_input)
print('pipeline: {}'.format(res))
```
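The returned `res` is a dictionary containing the predicted answer text together with its confidence score and the start/end character offsets of the span in the context.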
🔎 Advanced Usage (more accurate inference - using the sum features strategy)
```python
# Helper code (infer, model.mrc_model) comes from the project repository.
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = "nguyenvulebinh/vi-mrc-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
    # "What title was Binh recognized with?"
    'question': "Bình được công nhận với danh hiệu gì ?",
    # "Binh Nguyen is an enthusiast in the field of natural language processing.
    #  He received the Google Developer Expert certificate in 2020."
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)
print(answer)
```
✨ Features
This model is mainly intended for Vietnamese QA, but it also works well for English. On the VLSP MRC 2021 test set, it achieved TOP 1 on the leaderboard:
| Model | EM | F1 |
|-------|-----|-----|
| large public_test_set | 85.847 | 83.826 |
| large private_test_set | 82.072 | 78.071 |
*Public leaderboard and private leaderboard screenshots are not reproduced here.*
📚 Documentation
Model Details
MRCQuestionAnswering uses XLM-RoBERTa as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In this implementation, however, the sub-word representations (after being encoded by the XLM-RoBERTa encoder) are recombined into word representations using the sum strategy.
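To make the sum strategy concrete, here is a minimal sketch (not the repository's actual code) of pooling sub-word vectors back into word vectors; it assumes a word_ids mapping such as the one returned by a Hugging Face fast tokenizer's word_ids() method:

```python
import torch

def sum_subword_features(hidden_states, word_ids, num_words):
    """Pool sub-word vectors into word vectors by summation.

    hidden_states: tensor of shape (seq_len, hidden_dim) from the encoder
    word_ids:      list of length seq_len; word_ids[i] is the index of the word
                   that sub-word token i belongs to, or None for special tokens
    num_words:     number of words in the original whitespace-split text
    """
    word_features = torch.zeros(num_words, hidden_states.size(-1),
                                dtype=hidden_states.dtype)
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is not None:  # skip <s>, </s> and padding tokens
            word_features[word_idx] += hidden_states[token_idx]
    return word_features
```

With word-level vectors like these, the answer span's start and end can be predicted over word positions rather than sub-word positions.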
📄 License
This model is licensed under the cc-by-nc-4.0 license.
About
Built by Binh Nguyen
For more details, visit the project repository.
