🚀 Question Answering Model
This is a question answering model supporting Vietnamese and English, fine-tuned on multiple datasets for extractive QA tasks.
🚀 Quick Start
You can quickly start using this pre-trained model in either of the following ways:
Using Hugging Face Pipeline
```python
from transformers import pipeline

model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)
QA_input = {
    'question': "Bình là chuyên gia về gì ?",
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}
res = nlp(QA_input)
print('pipeline: {}'.format(res))
```
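The pipeline returns a dictionary with the standard `transformers` question-answering output fields, so the pieces of the prediction can be read out directly (a small illustrative sketch; the exact score and span depend on the model and input):

```python
# Standard question-answering pipeline output fields:
print(res['answer'])             # the extracted answer span from the context
print(res['score'])              # the model's confidence for that span
print(res['start'], res['end'])  # character offsets of the span in the context
```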
More Accurate Inference Process
```python
from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
    'question': "Bình được công nhận với danh hiệu gì ?",
    'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

inputs = [tokenize_function(*QA_input)]
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)
print(answer)
```
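The `infer` and `model.mrc_model` imports above are local modules shipped with the project repository (not part of `transformers`), so this snippet assumes it is run from a checkout of that repository. For pure inference, it is also standard PyTorch practice to put the model in eval mode and disable gradient tracking; a minimal sketch of that variation, not something the original example requires:

```python
import torch

# Optional: eval mode disables dropout and no_grad skips the autograd graph,
# both standard for inference (not specific to this model).
model.eval()
with torch.no_grad():
    outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)
print(answer)
```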
✨ Features
- Multi-language Support: Supports both Vietnamese and English for question-answering tasks.
- Fine-tuned on Multiple Datasets: Fine-tuned on a combination of English and Vietnamese datasets, including SQuAD 2.0, mailong25, UIT-ViQuAD, and MultiLingual Question Answering.
- High Performance: Achieves strong evaluation results on the Vietnamese validation set (see below).
📚 Documentation
Model Description
- Language model: XLM-RoBERTa
- Fine-tune: MRCQuestionAnswering
- Languages: Vietnamese, English
- Downstream task: Extractive QA
- Datasets (combined English and Vietnamese): SQuAD 2.0, mailong25, UIT-ViQuAD, MultiLingual Question Answering
This model is intended for QA in Vietnamese, so the validation set is Vietnamese only (although English works fine as well). The evaluation results below use 10% of the Vietnamese dataset.
| Model | EM    | F1    |
|-------|-------|-------|
| base  | 76.43 | 84.16 |
| large | 77.32 | 85.46 |
Model Evaluation
The evaluation results are based on 10% of the Vietnamese dataset, showing the model's performance in terms of Exact Match (EM) and F1 score.
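For reference, here is a minimal sketch of how EM and token-level F1 are conventionally computed for extractive QA (standard SQuAD-style definitions without the usual answer normalization; this is not the project's own evaluation script):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the predicted span matches the reference exactly, else 0.0.
    return float(prediction.strip() == reference.strip())

def f1(prediction: str, reference: str) -> float:
    # Token-level F1 between the predicted and the reference answer span.
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```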
📄 License
This model is released under the CC BY-NC 4.0 license.
About
Built by Binh Nguyen
For more details, visit the project repository.
