Bert L12 H240 A12
Developed by eli4s
A BERT variant pre-trained with knowledge distillation, using a hidden dimension of 240 and 12 attention heads, suited to masked language modeling tasks.
Downloads 7
Release Time: 3/2/2022
Model Overview
This model is a variant of the BERT architecture, pre-trained with knowledge distillation. It uses a nonstandard hidden dimension and attention head configuration and is intended mainly for masked language modeling.
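As a rough sketch of how such a checkpoint could be loaded for masked language modeling, assuming the Hugging Face Hub repository id eli4s/Bert-L12-h240-A12 and a standard masked-LM head (both assumptions, not confirmed by this card):

```python
# Minimal sketch: load the checkpoint and predict a masked token.
# Assumptions: repo id "eli4s/Bert-L12-h240-A12" and a standard BERT masked-LM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "eli4s/Bert-L12-h240-A12"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary token there.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```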
Model Features
Knowledge Distillation Pre-training
Pre-trained with knowledge distillation, so it may inherit desirable behavior from its teacher model.
Unique Dimension Configuration
The hidden dimension is 240 with 12 attention heads, giving each head a dimension of 20, unlike the standard BERT-base configuration (hidden dimension 768, 12 heads of dimension 64); see the configuration sketch after this list.
Multiple Loss Functions
Multiple loss functions are combined during knowledge distillation, which may improve the resulting model's performance.
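The dimension configuration above can be expressed as a BertConfig. Only hidden_size, num_hidden_layers, and num_attention_heads come from this card; the remaining values are assumptions following the usual BERT ratios:

```python
# Sketch of the nonstandard dimensions described above, expressed as a BertConfig.
from transformers import BertConfig

config = BertConfig(
    hidden_size=240,          # instead of BERT-base's 768
    num_hidden_layers=12,
    num_attention_heads=12,   # per-head dimension = 240 / 12 = 20
    intermediate_size=960,    # assumed: 4 * hidden_size, the usual BERT ratio
)
assert config.hidden_size // config.num_attention_heads == 20
```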
Model Capabilities
Masked Language Prediction
Text Understanding
Contextual Semantic Analysis
Use Cases
Natural Language Processing
Text Filling
Predicts masked words in text, supporting text completion and understanding tasks; see the example after these use cases.
Semantic Analysis
Uses masked prediction to capture contextual semantics, which can support question-answering systems or text classification.
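For the text-filling use case, a hedged example of invoking masked prediction through the fill-mask pipeline, again assuming the eli4s/Bert-L12-h240-A12 repo id and a pipeline-compatible masked-LM head:

```python
# Sketch: top-k candidates for a masked position, usable for text completion.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eli4s/Bert-L12-h240-A12")  # assumed repo id
for candidate in fill_mask("The movie was absolutely [MASK].", top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```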