bert-base-multilingual-cased-finetuned-polish-squad2 Open Source Model

Developed by henryk

A Polish QA system fine-tuned on a multilingual BERT model, trained on a machine-translated Polish SQuAD2.0 dataset

Tags: Question Answering System, Polish QA, Multilingual BERT, Machine Translation Adaptation

Downloads 71

Release Time: 3/2/2022

Model Overview

This model is Google's multilingual BERT fine-tuned for Polish extractive question answering: given a question and a passage, it extracts the answer span from the passage.

Model Features

Multilingual Support

Based on a multilingual model trained on 104 languages, specifically optimized for Polish

Question Answering Capability

Extracts answer spans from a given text and supports no-answer detection for questions the text does not answer

High Performance

Achieves 70.76% exact match and 72.92% F1 on the Polish SQuAD2.0 dev set

Model Capabilities

Polish QA

Text Understanding

Answer Extraction

No-Answer Detection

Use Cases

Education

Learning Assistance

Helps students quickly find answers to questions from textbooks

Improves learning efficiency

Customer Service

FAQ Auto-Response

Automatically answers common questions from Polish-speaking customers

Reduces workload for human customer service

🚀 Multilingual + Polish SQuAD2.0

This model is the multilingual BERT model provided by Google's research team, fine-tuned for the Polish Q&A downstream task.

📚 Documentation

Language Model Details

Language model (bert-base-multilingual-cased):

12-layer, 768-hidden, 12-heads, 110M parameters.
Trained on cased text in the top 104 languages with the largest Wikipedias.

Downstream Task Details

SQuAD2.0 was machine-translated using the mtranslate Python module. To find the start tokens, the direct translations of the answers were searched for in the corresponding translated paragraphs. Because a translation depends on its context, and an answer translated in isolation lacks that context, the translated answer could not always be found in the text, so those question-answer examples were lost. This is a potential source of errors in the dataset.
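
The alignment step described above can be sketched as follows. This is a minimal illustration, not the author's actual script: the dictionary keys (`context`, `question`, `answer`), the function names, the use of character offsets (as in SQuAD's `answer_start` field), and the choice to keep unanswerable questions unchanged are all assumptions made for the example.

```python
from typing import Optional


def locate_answer(context: str, answer: str) -> Optional[int]:
    """Return the character offset of `answer` in `context`, or None if absent."""
    idx = context.find(answer)
    return idx if idx >= 0 else None


def align_examples(examples):
    """Keep only examples whose translated answer occurs verbatim in the context.

    `examples` is an iterable of dicts with the hypothetical keys
    'context', 'question' and 'answer' (all already translated).
    Unanswerable questions (answer is None) are kept as-is; this is an
    assumption of the sketch, not something the source states.
    """
    kept = []
    for ex in examples:
        if ex["answer"] is None:
            kept.append({**ex, "answer_start": None})
            continue
        start = locate_answer(ex["context"], ex["answer"])
        if start is not None:
            kept.append({**ex, "answer_start": start})
        # If the translated answer is not found, the example is dropped.
    return kept
```

An exact substring search is the simplest possible matching rule; it explains why any context-dependent difference in translation (inflection, word order) removes the example from the dataset.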

Dataset	# Q&A
SQuAD2.0 Train	130 K
Polish SQuAD2.0 Train	83.1 K
SQuAD2.0 Dev	12 K
Polish SQuAD2.0 Dev	8.5 K
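
To put the loss of examples in perspective, the retained share of each split can be computed from the table. The sketch below uses the rounded counts exactly as listed (in thousands), so the resulting fractions are approximate; the variable and function names are illustrative only.

```python
# Rounded counts taken from the table above (thousands of Q&A pairs).
counts = {
    "train": {"original": 130.0, "polish": 83.1},
    "dev": {"original": 12.0, "polish": 8.5},
}


def retained_fraction(original: float, polish: float) -> float:
    """Share of the original examples that survived translation and alignment."""
    return polish / original


retention = {split: retained_fraction(**c) for split, c in counts.items()}

for split, frac in retention.items():
    print(f"{split}: {frac:.1%} retained")
```

With these rounded figures, roughly 64% of the training examples and 71% of the dev examples remain after translation.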

Model Benchmark

Model	EM/F1	HasAns (EM/F1)	NoAns
SlavicBERT	69.35/71.51	47.02/54.09	79.20
polBERT	67.33/69.80	45.73/53.80	76.87
multiBERT	70.76/72.92	45.00/52.04	82.13

🔧 Technical Details

Model Training

The model was trained on a Tesla V100 GPU with the following command:

export SQUAD_DIR=path/to/pl_squad

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --do_train \
  --do_eval \
  --version_2_with_negative \
  --train_file $SQUAD_DIR/pl_squadv2_train.json \
  --predict_file $SQUAD_DIR/pl_squadv2_dev.json \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --save_steps=8000 \
  --output_dir ../../output \
  --overwrite_cache \
  --overwrite_output_dir

Results:

{'exact': 70.76671723655035, 'f1': 72.92156947155917, 'total': 8569, 'HasAns_exact': 45.00762195121951, 'HasAns_f1': 52.04456128116991, 'HasAns_total': 2624, 'NoAns_exact': 82.13624894869638, 'NoAns_f1': 82.13624894869638, 'NoAns_total': 5945, 'best_exact': 71.72365503559342, 'best_exact_thresh': 0.0, 'best_f1': 73.62662512059369, 'best_f1_thresh': 0.0}
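
The overall scores are the count-weighted averages of the HasAns and NoAns subsets, which is why the overall numbers sit much closer to the NoAns score: 5945 of the 8569 dev questions are unanswerable. The following sketch checks this using only the values reported above; the helper name `overall` is illustrative.

```python
# Values copied from the evaluation output above.
results = {
    "exact": 70.76671723655035,
    "f1": 72.92156947155917,
    "total": 8569,
    "HasAns_exact": 45.00762195121951,
    "HasAns_f1": 52.04456128116991,
    "HasAns_total": 2624,
    "NoAns_exact": 82.13624894869638,
    "NoAns_f1": 82.13624894869638,
    "NoAns_total": 5945,
}


def overall(metric: str, r: dict) -> float:
    """Count-weighted average of the HasAns and NoAns scores for `metric`."""
    weighted = (
        r[f"HasAns_{metric}"] * r["HasAns_total"]
        + r[f"NoAns_{metric}"] * r["NoAns_total"]
    )
    return weighted / r["total"]


overall_exact = overall("exact", results)
overall_f1 = overall("f1", results)
print(f"exact={overall_exact:.2f}, f1={overall_f1:.2f}")
```

When comparing the models in the benchmark table, the HasAns columns are therefore the more informative measure of span-extraction quality.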

💻 Usage Examples

Basic Usage

from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)

qa_pipeline({
    'context': "Warszawa jest największym miastem w Polsce pod względem liczby ludności i powierzchni",
    'question': "Jakie jest największe miasto w Polsce?"})

Output

{
  "score": 0.9986,
  "start": 0, 
  "end": 8,
  "answer": "Warszawa"
}
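
Because the model was trained with unanswerable questions, it can also be asked to abstain. The sketch below is one possible wrapper, not part of the original model card: `answer_or_none` and `min_score` are hypothetical names, and it assumes the Transformers question-answering pipeline is called with its `handle_impossible_answer=True` option, in which case an empty `answer` string is taken to mean "no answer".

```python
from typing import Callable, Optional


def answer_or_none(
    qa: Callable[..., dict],
    question: str,
    context: str,
    min_score: float = 0.0,
) -> Optional[str]:
    """Return the extracted answer, or None when the model abstains.

    `qa` is any callable with the question-answering pipeline's keyword
    interface. `min_score` is an optional, hypothetical confidence cut-off.
    """
    result = qa(question=question, context=context, handle_impossible_answer=True)
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer
```

With the `qa_pipeline` object from Basic Usage, this would be called as `answer_or_none(qa_pipeline, "Jakie jest największe miasto w Polsce?", context)`; a suitable `min_score` depends on the application and should be tuned on held-out data.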

📞 Contact

Please do not hesitate to contact me via LinkedIn if you want to discuss or get access to the Polish version of SQuAD.
