
BabyBERTa-3

Developed by phueb
BabyBERTa is a lightweight variant of RoBERTa designed for language acquisition research, trained on a 5-million-word corpus of American English child-directed input.
Released: 3/2/2022

Model Overview

BabyBERTa is a lightweight language model based on the RoBERTa architecture, developed specifically for studying child language acquisition. Because the model is small, it can be trained and run on a single desktop computer with a single GPU, with no need for high-performance computing infrastructure.
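
The following is a minimal usage sketch, assuming the checkpoint is published on the Hugging Face Hub as phueb/BabyBERTa-3 and that, like the earlier BabyBERTa releases, the tokenizer should be loaded with add_prefix_space=True; both details are assumptions rather than facts stated above.

```python
# Minimal usage sketch; repository id and add_prefix_space=True are assumptions.
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

tokenizer = RobertaTokenizerFast.from_pretrained(
    "phueb/BabyBERTa-3", add_prefix_space=True
)
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa-3")

# BabyBERTa uses a RoBERTa-style tokenizer, so the mask token is <mask>.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("look at the <mask> dog."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```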

Model Features

Lightweight Design
The model runs on a single desktop computer with a single GPU, so no high-performance computing infrastructure is required.
Child-Directed Input
The training data consists of a 5-million-word corpus of American English child-directed input, making it suitable for language acquisition research.
Grammar Knowledge Learning
The model is developed specifically to learn grammatical knowledge from child-directed input; that knowledge is evaluated with the Zorro test suite of grammatical minimal pairs.
Training Optimization
During training, the model never predicts unmasked tokens: the unmask_prob parameter is set to zero, so no prediction target is ever left as the original, unmasked token (see the sketch after this list).
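
To make the unmask_prob behavior concrete, here is a hedged masking sketch in PyTorch. Only unmask_prob = 0 comes from the description above; the 15% selection rate, the function name, and the omission of random-token replacement and special-token handling are illustrative simplifications.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15, unmask_prob=0.0):
    labels = input_ids.clone()
    # Choose which positions become prediction targets.
    selected = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~selected] = -100  # non-selected positions are ignored by the loss
    # With unmask_prob = 0, no selected token is ever left unchanged,
    # so the model never predicts an unmasked token.
    keep = torch.bernoulli(torch.full(input_ids.shape, unmask_prob)).bool()
    corrupted = input_ids.clone()
    corrupted[selected & ~keep] = mask_token_id
    return corrupted, labels

# Illustrative call; the token ids and mask id are made up for the example.
ids = torch.tensor([[0, 45, 12, 99, 7, 2]])
masked_ids, labels = mask_tokens(ids, mask_token_id=4)
```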

Model Capabilities

Language modeling
Grammar knowledge learning
Child language acquisition research

Use Cases

Language Acquisition Research
Child Language Development Study
Using BabyBERTa to analyze how grammatical knowledge is learned from child-directed input.
The model achieves an overall accuracy of 80.3% on the Zorro test suite; a sketch of the minimal-pair scoring idea follows.
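
Zorro scores a model on minimal pairs, crediting it when it prefers the grammatical sentence of each pair. Below is a hedged sketch using pseudo-log-likelihood (PLL), one common scoring convention for masked language models; the exact procedure behind the 80.3% figure, and the example pair, are assumptions for illustration.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(
    "phueb/BabyBERTa-3", add_prefix_space=True
)
model = RobertaForMaskedLM.from_pretrained("phueb/BabyBERTa-3").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum each token's log-probability when that token alone is masked."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# One invented subject-verb agreement pair: (grammatical, ungrammatical).
pairs = [("the dog on the boats is brown.", "the dog on the boats are brown.")]
correct = sum(pseudo_log_likelihood(g) > pseudo_log_likelihood(b) for g, b in pairs)
print(f"accuracy: {correct / len(pairs):.1%}")
```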