# 🚀 BGE-M3 ONNX Model
The BGE-M3 model converted to ONNX weights using HF Optimum, ensuring compatibility with tools like ONNX Runtime.
This ONNX model simultaneously outputs dense, sparse, and ColBERT embedding representations. The output is a list of numpy arrays in the order of the representations mentioned above.
## ⚠️ Important Note
Dense and ColBERT embeddings are normalized, following the default behavior of the original FlagEmbedding library. If you need unnormalized outputs, modify the code in `bgem3_model.py` and re-run the ONNX export using the `export_onnx.py` script.

This ONNX model also has "O2"-level graph optimizations applied. You can find more information about optimization levels here. If you want an ONNX model with different optimizations or no optimizations at all, re-run the ONNX export script `export_onnx.py` with the appropriate optimization argument.
## 🚀 Quick Start
### ✨ Features
- Outputs dense, sparse, and ColBERT embedding representations simultaneously.
- Supports "O2" level graph optimizations.
### 📦 Installation
If you haven't already, install the ONNX Runtime Python library with pip:
```bash
pip install onnxruntime==1.17.0
```
For tokenization, install HF Transformers with pip:
```bash
pip install transformers==4.37.2
```
Clone this repository with Git LFS to obtain the ONNX model files.
### 💻 Usage Examples
#### Basic Usage
You can use the model to compute embeddings as follows:
```python
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ort_session = ort.InferenceSession("model.onnx")

inputs = tokenizer("BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", padding="longest", return_tensors="np")
inputs_onnx = {k: ort.OrtValue.ortvalue_from_numpy(v) for k, v in inputs.items()}

# Returns a list of NumPy arrays: dense, sparse, and ColBERT representations, in that order.
outputs = ort_session.run(None, inputs_onnx)
```
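The session returns the three representations in the order described above, so you can unpack them directly. The short sketch below also uses the fact that the dense vectors are normalized, so a plain dot product already gives cosine similarities (the shape comments are indicative rather than guaranteed):

```python
# Unpack in the documented order: dense, sparse, ColBERT.
dense_vecs, sparse_vecs, colbert_vecs = outputs

print(dense_vecs.shape)    # roughly (batch_size, hidden_dim)
print(sparse_vecs.shape)   # roughly (batch_size, sequence_length, 1) -- per-token weights
print(colbert_vecs.shape)  # roughly (batch_size, sequence_length, hidden_dim)

# Dense vectors are normalized, so dot products are cosine similarities.
similarities = dense_vecs @ dense_vecs.T
```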
#### Advanced Usage
You can use the following sparse token weight processor from FlagEmbedding to turn the ONNX model's sparse output into the same per-token weight dictionaries that the original library returns:
```python
import numpy as np
from collections import defaultdict


def process_token_weights(token_weights: np.ndarray, input_ids: list):
    # Convert per-token weights into a {token_id: weight} dict, keeping the
    # highest weight seen for each token and skipping special tokens.
    result = defaultdict(int)
    unused_tokens = set(
        [
            tokenizer.cls_token_id,
            tokenizer.eos_token_id,
            tokenizer.pad_token_id,
            tokenizer.unk_token_id,
        ]
    )
    for w, idx in zip(token_weights, input_ids):
        if idx not in unused_tokens and w > 0:
            idx = str(idx)
            if w > result[idx]:
                result[idx] = w
    return result


# The second output is the sparse representation: one weight per input token.
token_weights = outputs[1].squeeze(-1)
lexical_weights = list(
    map(process_token_weights, token_weights, inputs["input_ids"].tolist())
)
```
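As a small illustration of how these weights can be used, the sketch below mirrors FlagEmbedding's lexical matching score, i.e. the sum of weight products over token ids shared by two texts; the helper name is ours and not part of this repository:

```python
def lexical_matching_score(weights_1: dict, weights_2: dict) -> float:
    # Sum the products of the weights of token ids present in both texts.
    return sum(w * weights_2[token] for token, w in weights_1.items() if token in weights_2)


# Example: a text scored against itself gives its maximum lexical score.
score = lexical_matching_score(lexical_weights[0], lexical_weights[0])
```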
## 📦 Export ONNX weights
You can export ONNX weights using the provided custom BGE-M3 PyTorch model file `bgem3_model.py` and the ONNX weight export script `export_onnx.py`, which leverages HF Optimum. If necessary, modify the `bgem3_model.py` model configuration, for example to remove embedding normalization or to change the number of output representations. If you change the number of output representations, also modify the ONNX output config `BGEM3OnnxConfig` in `export_onnx.py`.
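For orientation, an Optimum-style ONNX output config maps each output name to its dynamic axes. The snippet below is only a hypothetical sketch of what such a class can look like; the actual class body, output names, and axes in `export_onnx.py` may differ:

```python
from optimum.exporters.onnx.model_configs import XLMRobertaOnnxConfig


class BGEM3OnnxConfig(XLMRobertaOnnxConfig):
    @property
    def outputs(self):
        # One entry per exported representation; remove (or add) entries here
        # if you change the outputs returned by the model in bgem3_model.py.
        return {
            "dense_vecs": {0: "batch_size"},
            "sparse_vecs": {0: "batch_size", 1: "sequence_length"},
            "colbert_vecs": {0: "batch_size", 1: "sequence_length"},
        }
```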
First, install the required Python packages:
```bash
pip install -r requirements.txt
```
Then, export ONNX weights:
```bash
python export_onnx.py --output . --opset 17 --device cpu --optimize O2
```
You can read more about the optional optimization levels here.
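Once the export completes, a quick sanity check is to load the exported graph with ONNX Runtime and list its inputs and outputs (a minimal sketch, assuming the export wrote `model.onnx` into the chosen `--output` directory):

```python
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
print([i.name for i in sess.get_inputs()])   # tokenizer inputs, e.g. input_ids and attention_mask
print([o.name for o in sess.get_outputs()])  # the exported embedding representations
```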
## 📄 License
This project is licensed under the MIT license.