🚀 BETO (Spanish BERT) + Spanish SQuAD2.0 + distillation using 'bert-base-multilingual-cased' as teacher
This project presents a fine-tuned and distilled version of BETO (Spanish BERT) for Q&A tasks. It leverages the SQuAD-es-v2.0 dataset. The distillation process makes the model smaller, faster, cheaper, and lighter compared to bert-base-spanish-wwm-cased-finetuned-spa-squad2-es.
🚀 Quick Start
This BETO-based model is ready for Q&A tasks. You can quickly start using it with the transformers pipelines, as in the sketch below.
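A minimal quick-start sketch, assuming transformers is already installed and using the model id published on the Hugging Face Hub (the question/context pair is purely illustrative):

```python
from transformers import pipeline

# Load the distilled BETO checkpoint as a question-answering pipeline
qa = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
)

# Illustrative question/context pair
print(qa(question='¿Quién creó el modelo?', context='El modelo fue creado por Manuel Romero.'))
```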
✨ Features
- Distilled Model: The model is distilled, making it more efficient than its non-distilled counterparts.
- Fine-Tuned on a Spanish Q&A Dataset: It is fine-tuned on SQuAD-es-v2.0, making it suitable for Spanish Q&A.
- Fast Inference: On average, it is twice as fast as mBERT-base due to the distillation process (see the timing sketch below).
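Inference speed depends on your hardware and sequence lengths; here is a rough, illustrative timing sketch (not a formal benchmark) that you can adapt, swapping in another checkpoint to compare against:

```python
import time
from transformers import pipeline

# Model to time; swap in another checkpoint (e.g. an mBERT-based one) to compare
model_id = 'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es'
qa = pipeline('question-answering', model=model_id)

question = '¿Quién creó el modelo?'
context = 'El modelo fue creado por Manuel Romero y publicado en Hugging Face.'

qa(question=question, context=context)  # warm-up so first-call overhead is excluded

n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    qa(question=question, context=context)
elapsed = time.perf_counter() - start
print(f"{model_id}: {elapsed / n_runs * 1000:.1f} ms per question over {n_runs} runs")
```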
📦 Installation
The model can be installed and used with the transformers library.
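A minimal environment sketch, assuming Python and git are available (the distillation script used in the training command below ships inside the transformers repository, so the repo is cloned as well):

```bash
# Install the transformers library (pin a version if you need reproducibility)
pip install transformers
# Clone the repo to get examples/distillation/run_squad_w_distillation.py
git clone https://github.com/huggingface/transformers.git
```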
You can then reproduce the training (fine-tuning with distillation) using the following command on a Tesla P100 GPU with 25GB of RAM:
```bash
# Run from a Colab/Jupyter notebook (hence the leading '!'); drop it in a plain shell
!export SQUAD_DIR=/path/to/squad-v2_spanish \
&& python transformers/examples/distillation/run_squad_w_distillation.py \
  --model_type bert \
  --model_name_or_path dccuchile/bert-base-spanish-wwm-cased \
  --teacher_type bert \
  --teacher_name_or_path bert-base-multilingual-cased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v2.json \
  --predict_file $SQUAD_DIR/dev-v2.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 5.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/model_output \
  --save_steps 5000 \
  --threads 4 \
  --version_2_with_negative
```
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline

# Load the distilled BETO model as a question-answering pipeline
# (use_fast=False keeps the slow Python tokenizer, as in the original setup)
nlp = pipeline(
    'question-answering',
    model='mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
    tokenizer=(
        'mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es',
        {"use_fast": False}
    )
)

# Ask a question about a Spanish context
nlp(
    {
        'question': '¿Para qué lenguaje está trabajando?',
        'context': 'Manuel Romero está colaborando activamente con huggingface/transformers '
                   'para traer el poder de las últimas técnicas de procesamiento de lenguaje '
                   'natural al idioma español'
    }
)
```
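If you prefer not to go through the pipeline helper, here is a minimal sketch using the standard Auto classes (same model id as above; the span decoding is deliberately simplified):

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

question = "¿Para qué lenguaje está trabajando?"
context = (
    "Manuel Romero está colaborando activamente con huggingface/transformers "
    "para traer el poder de las últimas técnicas de procesamiento de lenguaje "
    "natural al idioma español"
)

# Encode the (question, context) pair and run a forward pass
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start/end positions and decode the answer span
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)
```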
You can play with this model and the pipelines in a Colab notebook.

📚 Documentation
Details of the downstream task (Q&A) - Dataset
SQuAD-es-v2.0
| Dataset                 | # Q&A |
| ----------------------- | ----- |
| SQuAD2.0 Train          | 130 K |
| SQuAD2.0-es-v2.0        | 111 K |
| SQuAD2.0 Dev            | 12 K  |
| SQuAD-es-v2.0-small Dev | 69 K  |
Model training
The model was trained on a Tesla P100 GPU with 25GB of RAM using the command shown in the Installation section.
More about Hugging Face pipelines
Check out the Colab notebook for more examples of Hugging Face pipelines.

📄 License
This project is licensed under the Apache-2.0 license.
Created by Manuel Romero/@mrm8488
Made with ♥ in Spain