I-BERT base model
I-BERT base model is an integer-only quantized version of RoBERTa, enabling faster inference with integer arithmetic.
This model, `ibert-roberta-base`, was introduced in the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321).
I-BERT stores all parameters in INT8 representation and carries out the entire inference using integer-only arithmetic.
In particular, I-BERT replaces all floating-point operations in the Transformer architecture (e.g., MatMul, GELU, Softmax, and LayerNorm) with closely approximating integer operations.
This can result in up to 4x inference speedup compared to the floating-point counterpart when tested on an Nvidia T4 GPU.
The best model parameters found via quantization-aware finetuning can then be exported (e.g., to TensorRT) for integer-only deployment of the model.
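As a quick sanity check before any finetuning, the checkpoint can be loaded like any other Hugging Face Transformers model. The snippet below is a minimal sketch, not part of the original card; it assumes a transformers release with I-BERT support (4.4.0 or later) and PyTorch installed.

```python
# Minimal sketch: load the I-BERT base checkpoint with Hugging Face Transformers.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("kssteven/ibert-roberta-base")
model = AutoModel.from_pretrained("kssteven/ibert-roberta-base")  # resolves to the I-BERT model class

inputs = tokenizer("I-BERT carries out inference with integer arithmetic.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768)
```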
Quick Start
The following introduces the finetuning process of I-BERT, which consists of three stages: full-precision finetuning, model quantization, and integer-only finetuning.
Features
- Integer-only Quantization: I-BERT stores all parameters in INT8 representation and performs inference using integer-only arithmetic, replacing floating-point operations in Transformer architectures with approximate integer operations.
- Inference Speedup: It can achieve up to 4x inference speedup compared to its floating-point counterpart on an Nvidia T4 GPU.
- Quantization-aware Finetuning: The model can be fine-tuned using quantization-aware techniques, and the best parameters can be exported for integer-only deployment.
Usage Examples
Basic Usage
The usage of I-BERT mainly involves the finetuning process, which consists of three stages: full-precision finetuning, model quantization, and integer-only finetuning.
Full-precision finetuning
Full-precision finetuning of I-BERT is similar to RoBERTa finetuning.
For instance, you can run the following command to finetune on the MRPC text classification task.
```bash
python examples/text-classification/run_glue.py \
--model_name_or_path kssteven/ibert-roberta-base \
--task_name MRPC \
--do_eval \
--do_train \
--evaluation_strategy epoch \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--save_steps 115 \
--learning_rate 2e-5 \
--num_train_epochs 10 \
--output_dir $OUTPUT_DIR
```
Model Quantization
Once you are done with full-precision finetuning, open `config.json` in your checkpoint directory and set the `quant_mode` attribute to `true`, as in the example below.
```json
{
"_name_or_path": "kssteven/ibert-roberta-base",
"architectures": [
"IBertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"finetuning_task": "mrpc",
"force_dequant": "none",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "ibert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"quant_mode": true,
"tokenizer_class": "RobertaTokenizer",
"transformers_version": "4.4.0.dev0",
"type_vocab_size": 1,
"vocab_size": 50265
}
```
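If you prefer not to edit the file by hand, the flag can also be flipped with a short script. This is a minimal sketch, not part of the original card; `checkpoint_dir` is a hypothetical placeholder for your full-precision checkpoint directory.

```python
# Hedged sketch: enable integer-only mode by setting "quant_mode": true in the checkpoint's config.json.
import json
import os

checkpoint_dir = "path/to/your/checkpoint"  # placeholder, point this at your finetuned checkpoint

config_path = os.path.join(checkpoint_dir, "config.json")
with open(config_path) as f:
    config = json.load(f)

config["quant_mode"] = True  # switch the model to integer-only (quantized) mode

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```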
Then, your model will automatically run in integer-only mode when you load the checkpoint.
Also, make sure to delete `optimizer.pt`, `scheduler.pt`, and `trainer_state.json` in the same directory.
Otherwise, the Hugging Face Trainer will not reset the optimizer, scheduler, or trainer state for the following integer-only finetuning.
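The cleanup can also be scripted; below is a minimal sketch that again assumes a hypothetical `checkpoint_dir` placeholder.

```python
# Hedged sketch: remove leftover Trainer state files so integer-only finetuning starts from a clean state.
import os

checkpoint_dir = "path/to/your/checkpoint"  # placeholder, point this at the checkpoint you just modified

for filename in ("optimizer.pt", "scheduler.pt", "trainer_state.json"):
    path = os.path.join(checkpoint_dir, filename)
    if os.path.exists(path):
        os.remove(path)
        print(f"Removed {path}")
```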
Integer-only finetuning (Quantization-aware training)
Finally, you can run integer-only finetuning simply by loading the checkpoint you modified.
Note that the only difference in the example command below is `model_name_or_path`.
```bash
python examples/text-classification/run_glue.py \
--model_name_or_path $CHECKPOINT_DIR \
--task_name MRPC \
--do_eval \
--do_train \
--evaluation_strategy epoch \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--save_steps 115 \
--learning_rate 1e-6 \
--num_train_epochs 10 \
--output_dir $OUTPUT_DIR
```
Technical Details
I-BERT is an integer-only quantized version of RoBERTa. It stores all parameters in INT8 representation and performs the entire inference using integer-only arithmetic. By replacing floating-point operations in Transformer architectures with approximate integer operations, it can achieve significant inference speedup. The finetuning process involves three stages: full-precision finetuning, model quantization, and integer-only finetuning, which helps to find the best model parameters for integer-only deployment.
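To make "INT8 representation" concrete, here is an illustrative sketch of symmetric per-tensor quantization: the basic idea of mapping floating-point values to 8-bit integers with a single scale. It only illustrates the general technique and is not I-BERT's internal implementation.

```python
# Illustrative sketch of symmetric per-tensor INT8 quantization (not I-BERT's internal implementation).
import torch

def quantize_int8(x: torch.Tensor):
    """Map a float tensor to INT8 values plus a per-tensor scale."""
    scale = x.abs().max() / 127.0                      # largest magnitude maps to the INT8 limit
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximation of the original float tensor."""
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())                         # small quantization error
```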
Citation info
If you use I-BERT, please cite our paper.
```bibtex
@article{kim2021bert,
title={I-BERT: Integer-only BERT Quantization},
author={Kim, Sehoon and Gholami, Amir and Yao, Zhewei and Mahoney, Michael W and Keutzer, Kurt},
journal={arXiv preprint arXiv:2101.01321},
year={2021}
}
```