Bertin-RoBERTa-base-Spanish Open-source Model: A Free Text Analysis Tool for Spanish Processing

Bertin Roberta Base Spanish

Developed by bertin-project

BERTIN is a series of Spanish BERT-based models. The current model is a RoBERTa-base model trained from scratch on a portion of the Spanish mC4 dataset using Flax.

Tags: Large Language Model, Spanish, Spanish RoBERTa, Perplexity Sampling, Efficient Pretraining

Release Time : 3/2/2022

Model Overview

BERTIN is an efficient Spanish pretrained language model that uses perplexity sampling to optimize the training process, suitable for natural language processing tasks such as masked language modeling.

Model Features

Perplexity Sampling Technique

Uses an innovative perplexity sampling method to significantly reduce training data volume and time while maintaining model performance.

Efficient Pretraining

Training was completed during a Flax/JAX community event, demonstrating the feasibility of small teams efficiently training large language models.

Spanish Language Optimization

Specifically designed and optimized for Spanish, filling a gap in monolingual Spanish models.

Model Capabilities

Text Understanding

Masked Language Modeling

Spanish Natural Language Processing

Use Cases

Text Processing

Text Completion

Automatically completes missing parts of a sentence, such as 'I went to the bookstore and bought a <mask>.'

Language Research

Spanish Language Model Research

Provides a foundational model for Spanish NLP research.

🚀 BERTIN

BERTIN is a series of BERT-based models designed for the Spanish language. It addresses the scarcity of high-quality Spanish-specific NLP models, offering efficient pre-training techniques and making large-scale model training more accessible to smaller groups.

🚀 Quick Start

The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using Flax. All code and scripts are included.

You can access different versions of the model:

Version v2 (default): April 28th, 2022
Version v1: July 26th, 2021
Version v1-512: July 26th, 2021
Version beta: July 15th, 2021

✨ Features

Monolingual Focus: Specifically designed for the Spanish language, which is the second most-spoken language by native speakers globally.
Efficient Pre-training: Utilizes a novel perplexity sampling technique to reduce training data size and steps while maintaining model performance.
Open-source: All code and scripts are open-source, promoting transparency and collaboration.

📦 Installation

No specific installation steps are provided in the original README.

💻 Usage Examples

Basic Usage

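from datasets import load_dataset

# Each config is one of the ~50M-sample subsets of mC4-es.
for config in ("random", "stepwise", "gaussian"):
    # streaming=True reads samples lazily instead of downloading the whole subset.
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True,
    ).shuffle(buffer_size=1000)
    # Print the first shuffled sample of each config, then move on.
    for sample in mc4es:
        print(config, sample)
        break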
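
The snippet above streams the sampled datasets. Since the model is tagged for fill-mask, it can also be queried for masked-token prediction. The following is a minimal sketch, not taken from the original README: it assumes the Hub identifier `bertin-project/bertin-roberta-base-spanish` (inferred from the page title and developer name) and the `transformers` library with a supported backend. The helper names `load_fill_mask`, `top_tokens`, and `main` are hypothetical.

```python
def load_fill_mask(model_id="bertin-project/bertin-roberta-base-spanish"):
    """Build a fill-mask pipeline; downloads the model on first use.

    The default model_id is an assumption inferred from the page title.
    """
    # Lazy import so top_tokens() below works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs.

    `predictions` is a list of dicts with "token_str" and "score" keys,
    which is the shape a fill-mask pipeline returns for a single mask.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), p["score"]) for p in ranked[:k]]


def main():
    fill_mask = load_fill_mask()
    # RoBERTa-style tokenizers use "<mask>" as the mask token.
    predictions = fill_mask("Fui a la librería y compré un <mask>.")
    for token, score in top_tokens(predictions):
        print(f"{token}\t{score:.4f}")
```

Calling `main()` needs network access to fetch the model weights; `top_tokens` on its own only post-processes the pipeline output.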

📚 Documentation

Team members

Javier de la Rosa (versae)
Eduardo González (edugp)
Paulo Villegas (paulo)
Pablo González de Prado (Pablogps)
Manu Romero (mrm8488)
María Grandury (mariagrandury)

Citation and Related Information

To cite this model:

@article{BERTIN,
    author = {Javier De la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González de Prado Salas y María Grandury},
    title = {BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling},
    journal = {Procesamiento del Lenguaje Natural},
    volume = {68},
    number = {0},
    year = {2022},
    keywords = {},
    abstract = {The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pretraining sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling that enables the pre-training of language models in roughly half the amount of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget.},
    issn = {1989-7553},
    url = {http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403},
    pages = {13--23}
}

See also https://arxiv.org/abs/2207.06814.

If you use this model, we would love to hear about it! Reach out on Twitter, GitHub, Discord, or shoot us an email.

Acknowledgements

This project would not have been possible without the compute generously provided by Hugging Face and Google through the TPU Research Cloud, or without the early access to the Cloud TPU VM provided by the Cloud TPU team.

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models be liable for any results arising from the use made by third parties of these models.

🔧 Technical Details

Motivation

According to Wikipedia, Spanish is the second most-spoken language in the world by native speakers (>470 million speakers), only after Chinese, and the fourth including those who speak it as a second language. However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to be available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.

At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own RoBERTa model. The precise timing suggests our work precipitated its publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.

Monolingual Spanish models are hard to come by and, when they do appear, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technology companies and organizations. This motivated the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of large language models.

Spanish mC4

The dataset mC4 is a multilingual variant of C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web scrape and was used to train mT5, the multilingual version of T5.

The Spanish portion of mC4 (mC4-es) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.

$ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
416057992

$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc
235303687795
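
For readers without `jq`, the same two counts can be approximated in Python. This is a sketch under assumptions: it takes gzip-compressed JSON-lines input with a `text` field (as the shell commands above imply), and the function name `count_docs_and_words` is hypothetical. Splitting on a single space is chosen to resemble the `split(" ")` call in the `jq` command; edge cases such as empty strings may be counted differently.

```python
import gzip
import json


def count_docs_and_words(sources):
    """Count documents and space-separated words in gzipped JSON-lines sources.

    `sources` is an iterable of file paths or binary file-like objects.
    """
    n_docs = 0
    n_words = 0
    for source in sources:
        with gzip.open(source, mode="rt", encoding="utf-8") as lines:
            for line in lines:
                if not line.strip():
                    continue  # skip blank lines
                text = json.loads(line)["text"]
                n_docs += 1
                # Single-space split, similar to jq's split(" ").
                n_words += len(text.split(" "))
    return n_docs, n_words
```

With local shards, the call might look like `count_docs_and_words(glob.glob("c4/multilingual/c4-es*.json.gz"))`, where the path pattern is illustrative only.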

Perplexity sampling

The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that would allow for the training of well-performing models with roughly one eighth of the data (~50M samples) and at approximately half the training steps.

In order to efficiently build this subset of data, we decided to leverage a technique we call perplexity sampling, whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their high-quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models (Ney et al., 1994) for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
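
To make the quantity concrete, perplexity can be derived from the total log-probability that a language model assigns to a text. The sketch below is illustrative only: it assumes the model reports a base-10 log-probability for the whole text (as n-gram toolkits such as KenLM commonly do) and a known token count. The function name is hypothetical, and this is not necessarily the exact normalization the authors used.

```python
def perplexity_from_log10(log10_prob, num_tokens):
    """Perplexity of a text from its total log10 probability and token count.

    Lower values mean the text looks more like the model's training data.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Perplexity is the inverse probability, normalized per token.
    return 10.0 ** (-log10_prob / num_tokens)
```

For example, a three-token text with a total log10 probability of -6 has an average per-token probability of 0.01, so its perplexity is 100.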

Figure 1. Perplexity distributions by percentage of the CCNet corpus.

In this work, we tested the hypothesis that perplexity sampling might help reduce training-data size and training times, while keeping the performance of the final model.

Methodology

In order to test our hypothesis, we first calculated the perplexity of each document in a random subset (roughly a quarter of the data) of mC4-es and extracted their distribution and quartiles (see Figure 2).
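
Extracting the quartile boundaries from a list of per-document perplexities can be sketched with the Python standard library. The function name and the choice of the "inclusive" interpolation method are assumptions made for illustration; the original text does not say how the quartiles were computed.

```python
import statistics


def perplexity_quartiles(perplexities):
    """Return the (Q1, Q2, Q3) boundaries of a sequence of perplexity values."""
    # n=4 yields the three cut points that split the data into quartiles.
    q1, q2, q3 = statistics.quantiles(perplexities, n=4, method="inclusive")
    return q1, q2, q3
```

The three boundaries returned here are the inputs a quartile-based sampling function would need.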

Figure 2. Perplexity distributions and quartiles (red lines) of 44M samples of mC4-es.

With the extracted perplexity quartiles, we created two functions to oversample the central quartiles, with the idea of biasing against samples whose perplexity is either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3).

The first function, Stepwise, simply oversamples the central quartiles using the quartile boundaries and a factor for the desired sampling frequency of each quartile, giving larger frequencies to the middle quartiles (oversampling Q2 and Q3, subsampling Q1 and Q4). The second function weights the perplexity distribution by a Gaussian-like function, to smooth out the sharp boundaries of the Stepwise function and give a better approximation to the desired underlying distribution (see Figure 4).
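
The two sampling functions can be sketched as keep-probabilities applied to each document's perplexity. This is a simplified illustration, not the authors' released implementation: the function names, the per-quartile probabilities, and the exact Gaussian-like form (centered on the median, with relative distance scaled by `width`) are all assumptions.

```python
import math
import random


def stepwise_keep_probability(perplexity, boundaries, factors):
    """Keep-probability that is constant within each quartile.

    boundaries: (q1, q2, q3) perplexity quartile boundaries.
    factors: four probabilities for Q1..Q4 (hypothetical values,
    chosen larger for the middle quartiles).
    """
    q1, q2, q3 = boundaries
    if perplexity <= q1:
        return factors[0]
    if perplexity <= q2:
        return factors[1]
    if perplexity <= q3:
        return factors[2]
    return factors[3]


def gaussian_keep_probability(perplexity, median, factor, width):
    """Gaussian-like keep-probability that peaks at the median perplexity."""
    # Relative distance from the median, so the scale is unit-free.
    z = (perplexity - median) / median
    return min(1.0, factor * math.exp(-(z ** 2) / width))


def subsample(docs_with_perplexity, keep_probability, seed=0):
    """Keep each (doc, perplexity) pair with the given probability function."""
    rng = random.Random(seed)
    return [
        doc
        for doc, perplexity in docs_with_perplexity
        if rng.random() < keep_probability(perplexity)
    ]
```

In this sketch, lowering `factor` shrinks the expected subset size, while `width` controls how quickly the Gaussian variant discards documents far from the median.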

We adjusted the factor parameter of the Stepwise function, and the factor and width parameters of the Gaussian function, to be able to sample roughly 50M samples from the 416M in mC4-es (see Figure 4). For comparison, we also randomly sampled mC4-es up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB. We released the code to sample from mC4 on the fly when streaming for any language under the dataset bertin-project/mc4-sampling.

Figure 3. Expected perplexity distributions of the sample mC4-es after applying the Stepwise function.

Figure 4. Expected perplexity distributions of the sample mC4-es after applying the Gaussian function.

Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the bertin-project/mc4-es-sampled dataset. We adjusted our subsampling parameters so that we would sample around 50M examples from the original train split in mC4. However, when these parameters were applied to the validation split, they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. Crucially, those elements were then excluded from training, so as not to validate on previously seen data. In the mc4-es-sampled dataset, the train split contains the full 50M samples, while validation is retrieved as is from the original mC4.
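
One way to picture the on-the-fly validation scheme is a single stream that alternates between training chunks and held-out evaluation chunks, so that no evaluation sample is ever used for training. This is a sketch under assumptions, not the authors' training script: the function name and chunk sizes are hypothetical.

```python
from itertools import islice


def train_with_holdout(stream, train_size, eval_size):
    """Yield (train_chunk, eval_chunk) pairs drawn from one stream.

    Items consumed for evaluation are never yielded for training,
    because both chunks advance the same underlying iterator.
    """
    iterator = iter(stream)
    while True:
        train_chunk = list(islice(iterator, train_size))
        if not train_chunk:
            return  # stream exhausted
        eval_chunk = list(islice(iterator, eval_size))
        yield train_chunk, eval_chunk
```

In the setting described above, `eval_size` would be 50,000 and `train_size` would be the number of samples consumed between evaluations.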

Figure 5. Experimental perplexity distributions of the sampled mC4-es after applying the Gaussian and Stepwise functions, and the Random control sample.

📄 License

The model is released under the CC-BY-4.0 license.

Property	Details
Model Type	RoBERTa-base
Training Data	Spanish portion of mC4 (`bertin-project/mc4-es-sampled`)
Pipeline Tag	fill-mask
Tags	spanish, roberta
