🚀 Vietnamese and English Question Answering Model
This model is designed for extractive question-answering in both Vietnamese and English, leveraging XLM-RoBERTa fine-tuned on multiple datasets.
🚀 Quick Start
📦 Installation
The original README does not provide installation instructions. To use this model you will need the Hugging Face transformers library; see its official documentation for full setup details.
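As a minimal, assumed setup (not an official requirement list from the author), the usage examples below only need transformers and a PyTorch backend:

```bash
pip install transformers torch
```

The advanced example additionally requires the infer and model.mrc_model helper modules from the project repository.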
💻 Usage Examples
🔍 Basic Usage (Hugging Face pipeline style - NOT using sum features strategy)
```python
from transformers import pipeline

model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)

QA_input = {
    # "What is Binh an expert in?"
    'question': "Bình là chuyên gia về gì ?",
    # "Binh Nguyen is an enthusiast in the field of natural language processing.
    #  He received the Google Developer Expert certificate in 2020."
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

res = nlp(QA_input)
print('pipeline: {}'.format(res))
```
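The returned `res` is a dictionary containing the predicted answer text together with its confidence score and the start/end character offsets of the span in the context.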
🔎 Advanced Usage (more accurate inference - using the sum features strategy)
```python
# Helper code (infer, model.mrc_model) comes from the project repository.
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = "nguyenvulebinh/vi-mrc-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
    # "What title was Binh recognized with?"
    'question': "Bình được công nhận với danh hiệu gì ?",
    # "Binh Nguyen is an enthusiast in the field of natural language processing.
    #  He received the Google Developer Expert certificate in 2020."
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)
print(answer)
```
✨ Features
This model is mainly intended for Vietnamese QA, but it also works well for English. On the VLSP MRC 2021 test set, it achieved TOP 1 on the leaderboard:
| Model | EM | F1 |
|-------|-----|-----|
| large public_test_set | 85.847 | 83.826 |
| large private_test_set | 82.072 | 78.071 |
*Public leaderboard and private leaderboard screenshots are not reproduced here.*
📚 Documentation
Model Details
MRCQuestionAnswering uses XLM-RoBERTa as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In this implementation, however, the sub-word representations (after being encoded by the XLM-RoBERTa encoder) are recombined into word representations using the sum strategy.
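To make the sum strategy concrete, here is a minimal sketch (not the repository's actual code) of pooling sub-word vectors back into word vectors; it assumes a word_ids mapping such as the one returned by a Hugging Face fast tokenizer's word_ids() method:

```python
import torch

def sum_subword_features(hidden_states, word_ids, num_words):
    """Pool sub-word vectors into word vectors by summation.

    hidden_states: tensor of shape (seq_len, hidden_dim) from the encoder
    word_ids:      list of length seq_len; word_ids[i] is the index of the word
                   that sub-word token i belongs to, or None for special tokens
    num_words:     number of words in the original whitespace-split text
    """
    word_features = torch.zeros(num_words, hidden_states.size(-1),
                                dtype=hidden_states.dtype)
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is not None:  # skip <s>, </s> and padding tokens
            word_features[word_idx] += hidden_states[token_idx]
    return word_features
```

With word-level vectors like these, the answer span's start and end can be predicted over word positions rather than sub-word positions.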
📄 License
This model is licensed under the cc-by-nc-4.0 license.
About
Built by Binh Nguyen
For more details, visit the project repository.
