BERTIN RoBERTa Large Spanish
BERTIN is a series of Spanish language models based on BERT. This model follows the RoBERTa-large architecture, trained from scratch using the Flax framework, with data sourced from the Spanish portion of the mC4 corpus.
Downloads: 26
Release Time: 3/2/2022
Model Overview
This is a Spanish pre-trained model based on the RoBERTa-large architecture. It was trained with a masked language modeling objective and is suited to Spanish natural language processing applications.
Model Features
Trained from scratch
Pre-trained from scratch using the Flax framework rather than fine-tuned from an existing model.
Large-scale training data
Based on the Spanish portion of the mC4 corpus, approximately 416 million documents totaling around 235 billion words (see the streaming sketch after this list).
Community-driven development
Developed as part of the HuggingFace Flax/JAX Community Week, with TPU resources provided by Google Cloud.
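For orientation, here is a minimal sketch of streaming that corpus with the datasets library. The allenai/c4 repository and its "es" config are assumptions about where the mC4 Spanish split is hosted on the Hub, not something stated in this card:

```python
from datasets import load_dataset

# Stream the Spanish portion of mC4 instead of downloading ~235B words.
# "allenai/c4" with the "es" config is assumed to expose the mC4 Spanish split.
mc4_es = load_dataset("allenai/c4", "es", split="train", streaming=True)

# Peek at the first few records; each has "text", "timestamp", and "url".
for i, record in enumerate(mc4_es):
    print(record["text"][:80].replace("\n", " "))
    if i == 2:
        break
```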
Model Capabilities
Spanish text understanding
Masked token prediction
Contextual semantic analysis
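A minimal sketch of exercising these capabilities through the transformers fill-mask pipeline. The Hub ID flax-community/bertin-roberta-large-spanish is an assumption based on the model name; substitute the actual repository name if it differs:

```python
from transformers import pipeline

# Assumed Hub ID; replace with the actual repository name if it differs.
MODEL_ID = "flax-community/bertin-roberta-large-spanish"

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# RoBERTa-style models use "<mask>" as the mask token.
for pred in fill_mask("Fui a la librería a comprar un <mask>."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```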
Use Cases
Natural Language Processing
Text completion
Predicting masked words in sentences; see the sketch below.
Example: in 'Fui a la librería a comprar un <mask>.' ('I went to the bookstore to buy a <mask>.'), the mask could be filled with 'libro' ('book') or another plausible word.
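The same completion can be run against the raw masked-LM head when the full score distribution is needed rather than the pipeline's pre-ranked output. This sketch reuses the hypothetical Hub ID from above:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "flax-community/bertin-roberta-large-spanish"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

inputs = tokenizer("Fui a la librería a comprar un <mask>.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and rank the five most likely fillers.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```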
Semantic analysis
Understanding contextual meaning in Spanish text.
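As one illustration of contextual semantics, the encoder's hidden states can be mean-pooled into sentence vectors and compared. This is a generic sketch using the same assumed Hub ID, not a recipe taken from this model card:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "flax-community/bertin-roberta-large-spanish"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

sentences = [
    "El banco cerró sus puertas a las cinco.",  # "banco" as bank
    "Nos sentamos en el banco del parque.",     # "banco" as bench
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over real (non-padding) tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

sim = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {sim:.3f}")
```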