I-BERT RoBERTa-large: an open-source model with integer-only quantization for faster inference

Ibert Roberta Large

Developed by kssteven

I-BERT is an integer-only quantized version of RoBERTa-large. It stores parameters in INT8 and runs inference with integer operations, achieving up to 4x faster inference than the floating-point model on an Nvidia T4 GPU.

Large Language Model

Transformers

#INT8 quantization #Pure integer inference #Text classification

Release Time : 3/2/2022

Model Overview

An integer-quantized model based on the RoBERTa architecture, designed for efficient inference, suitable for tasks requiring fast text processing.

Model Features

Pure integer operations

All parameters are stored in INT8 format, executing inference entirely with integer operations, eliminating the need for floating-point computation units.

Quantization-aware training

Supports a three-stage fine-tuning process (full precision → quantization → integer fine-tuning) to maximize post-quantization accuracy.

4x inference acceleration

Achieves up to 4x inference speed improvement compared to the floating-point version on Nvidia T4 GPUs.

Model Capabilities

Text classification

Semantic understanding

Efficient inference

Use Cases

Text processing

Semantic similarity judgment

E.g., sentence pair similarity classification in MRPC tasks

Maintains accuracy close to the full-precision model after quantization.

🚀 I-BERT large model

The ibert-roberta-large model is an integer-only quantized version of RoBERTa, offering efficient inference with integer arithmetic.

🚀 Quick Start

The ibert-roberta-large model is an integer-only quantized version of RoBERTa, introduced in the paper "I-BERT: Integer-only BERT Quantization" (Kim et al., 2021). I-BERT stores all parameters in INT8 and carries out the entire inference using integer-only arithmetic. In particular, it replaces all floating-point operations in the Transformer architecture (e.g., MatMul, GELU, Softmax, and LayerNorm) with closely approximating integer operations. This can yield up to a 4x inference speedup over the floating-point counterpart when tested on an Nvidia T4 GPU. The best model parameters found via quantization-aware finetuning can then be exported (e.g., to TensorRT) for integer-only deployment.

✨ Features

Integer-only quantization: Stores all parameters in INT8 and performs inference using integer arithmetic.
Transformer operation replacement: Replaces floating point operations in Transformer architectures with integer operations.
Inference speed up: Can achieve up to 4x faster inference on an Nvidia T4 GPU compared to floating point models.
Exportable for deployment: Best parameters can be exported for integer-only deployment.
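
As a rough illustration of the idea behind INT8 storage and integer arithmetic, the sketch below quantizes two small vectors symmetrically and computes their dot product with integers only. It is a minimal example under stated assumptions, not I-BERT's implementation: the helper names `quantize` and `int_dot` and the sample values are hypothetical, and the final rescaling is done in floating point purely to compare against the exact result.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats to signed integers plus one float scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(q_a, q_b):
    """Dot product computed with integer multiplications and additions only."""
    return sum(a * b for a, b in zip(q_a, q_b))


# Hypothetical sample data, chosen only for illustration.
weights = [0.42, -1.27, 0.05, 0.88]
activations = [1.5, 0.25, -0.75, 2.0]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(activations)

acc = int_dot(q_w, q_x)      # integer accumulator
approx = acc * s_w * s_x     # rescale once, here in float for comparison only
exact = sum(w * x for w, x in zip(weights, activations))

print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer result tracks the floating-point one closely because each vector loses at most half a quantization step per element. Nonlinear operations such as GELU, Softmax, and LayerNorm need dedicated integer approximations, which this sketch does not cover.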

📦 Installation

No specific installation steps are provided in the original README.

💻 Usage Examples

Basic Usage

The usage mainly involves the finetuning process. Here are the steps:

Full-precision finetuning

Full-precision finetuning of I-BERT is similar to RoBERTa finetuning. For instance, you can run the following command to finetune on the MRPC text classification task.

python examples/text-classification/run_glue.py \
         --model_name_or_path kssteven/ibert-roberta-large \
         --task_name MRPC \
         --do_eval \
         --do_train \
         --evaluation_strategy epoch \
         --max_seq_length 128 \
         --per_device_train_batch_size 32 \
         --save_steps 115 \
         --learning_rate 2e-5 \
         --num_train_epochs 10 \
         --output_dir $OUTPUT_DIR

Model Quantization

Once you are done with full-precision finetuning, open config.json in your checkpoint directory and set the quant_mode attribute to true.

{
  "_name_or_path": "kssteven/ibert-roberta-large",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "transformers_version": "4.4.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}
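
If you prefer to script this step rather than edit the file by hand, something along these lines should work. It is a minimal sketch that assumes the checkpoint directory contains a standard config.json; the function name `enable_quant_mode` is hypothetical and not part of any library.

```python
import json
from pathlib import Path


def enable_quant_mode(checkpoint_dir):
    """Set "quant_mode": true in the checkpoint's config.json and return the updated config."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    config["quant_mode"] = True  # all other keys are left untouched
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Usage (the path is a placeholder for your own checkpoint directory):
# enable_quant_mode("path/to/checkpoint")
```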

Your model will then automatically run in integer-only mode when you load the checkpoint. Also make sure to delete optimizer.pt, scheduler.pt, and trainer_state.json from the same directory. Otherwise, the Hugging Face Trainer will not reset the optimizer, scheduler, or trainer state for the subsequent integer-only finetuning.
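
The cleanup can also be scripted. The helper below is a small sketch, with the hypothetical name `remove_trainer_state`, that deletes the three files named above if they exist and leaves everything else in the directory alone.

```python
from pathlib import Path

# File names taken from the instructions above.
STALE_FILES = ("optimizer.pt", "scheduler.pt", "trainer_state.json")


def remove_trainer_state(checkpoint_dir):
    """Delete leftover optimizer, scheduler, and trainer state files; return the names removed."""
    removed = []
    for name in STALE_FILES:
        path = Path(checkpoint_dir) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed


# Usage (the path is a placeholder for your own checkpoint directory):
# remove_trainer_state("path/to/checkpoint")
```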

Integer-only finetuning (Quantization-aware training)

Finally, you can run integer-only finetuning simply by loading the checkpoint you modified. Note that the example command below differs from the full-precision one only in model_name_or_path and in the lower learning rate (1e-6 instead of 2e-5).

python examples/text-classification/run_glue.py \
         --model_name_or_path $CHECKPOINT_DIR \
         --task_name MRPC \
         --do_eval \
         --do_train \
         --evaluation_strategy epoch \
         --max_seq_length 128 \
         --per_device_train_batch_size 32 \
         --save_steps 115 \
         --learning_rate 1e-6 \
         --num_train_epochs 10 \
         --output_dir $OUTPUT_DIR

📚 Documentation

The finetuning procedure of I-BERT consists of three stages: (1) full-precision finetuning from the pretrained model on a downstream task, (2) model quantization, and (3) integer-only finetuning (i.e., quantization-aware training) of the quantized model.
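
To make the relationship between stages 1 and 3 concrete, the sketch below builds both run_glue.py invocations from one helper and lists the arguments that differ. The flags and values are the ones used in the commands in this README; the function name `run_glue_command` and the `out/...` paths are hypothetical placeholders.

```python
def run_glue_command(model_name_or_path, learning_rate, output_dir):
    """Build the run_glue.py argument list used in this README for MRPC."""
    return [
        "python", "examples/text-classification/run_glue.py",
        "--model_name_or_path", model_name_or_path,
        "--task_name", "MRPC",
        "--do_eval",
        "--do_train",
        "--evaluation_strategy", "epoch",
        "--max_seq_length", "128",
        "--per_device_train_batch_size", "32",
        "--save_steps", "115",
        "--learning_rate", learning_rate,
        "--num_train_epochs", "10",
        "--output_dir", output_dir,
    ]


# Stage 1: full-precision finetuning from the pretrained checkpoint.
stage1 = run_glue_command("kssteven/ibert-roberta-large", "2e-5", "out/full_precision")
# Stage 3: integer-only finetuning from the checkpoint edited in stage 2.
stage3 = run_glue_command("out/full_precision/checkpoint", "1e-6", "out/integer_only")

# Arguments of stage 1 that change in stage 3.
changed = [a for a, b in zip(stage1, stage3) if a != b]
print(changed)
```

Each list can be passed to `subprocess.run(..., check=True)` from the root of a transformers checkout. Stage 2 happens between the two runs: edit config.json and remove the stale trainer files in the checkpoint directory.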

📄 License

No license information is provided in the original README.

📚 Citation info

If you use I-BERT, please cite our paper.

@article{kim2021bert,
  title={I-BERT: Integer-only BERT Quantization},
  author={Kim, Sehoon and Gholami, Amir and Yao, Zhewei and Mahoney, Michael W and Keutzer, Kurt},
  journal={arXiv preprint arXiv:2101.01321},
  year={2021}
}
