# hotchpotch/japanese-reranker-tiny-v2
A very small and fast Japanese reranker model series (v2).
For more information about rerankers, technical reports, and evaluations, please refer to the following links:

## 🚀 Quick Start
### Prerequisites
Running the model requires version 4.48 or later of the transformers library.

```bash
pip install -U "transformers>=4.48.0" sentence-transformers sentencepiece
```
If your GPU supports FlashAttention-2, you can install the flash-attn library for faster inference:

```bash
pip install flash-attn --no-build-isolation
```
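Once flash-attn is installed, here is a minimal sketch of how it might be enabled when loading the model directly with transformers. This assumes a CUDA GPU supported by flash-attn; `attn_implementation` is a standard `from_pretrained` argument, and FlashAttention-2 requires half-precision weights.

```python
# Sketch: load the reranker with FlashAttention-2 (assumes a supported CUDA GPU)
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "hotchpotch/japanese-reranker-tiny-v2",
    torch_dtype=torch.float16,  # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")
```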
## 💻 Usage Examples
### SentenceTransformers
```python
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

model = CrossEncoder(MODEL_NAME)
# Half precision is faster on CUDA and MPS devices
if model.device.type in ("cuda", "mps"):
    model.model.half()

query = "感動的な映画について"
passages = [
    "深いテーマを持ちながらも、観る人の心を揺さぶる名作。登場人物の心情描写が秀逸で、ラストは涙なしでは見られない。",
    "重要なメッセージ性は評価できるが、暗い話が続くので気分が落ち込んでしまった。もう少し明るい要素があればよかった。",
    "どうにもリアリティに欠ける展開が気になった。もっと深みのある人間ドラマが見たかった。",
    "アクションシーンが楽しすぎる。見ていて飽きない。ストーリーはシンプルだが、それが逆に良い。",
]

scores = model.predict(
    [(query, passage) for passage in passages],
    show_progress_bar=True,
)
print("Scores:", scores)
```
### SentenceTransformers + ONNX
If you want to run the model faster on CPU or in an ARM environment, you can use the ONNX version or a quantized ONNX model.

```bash
pip install onnx onnxruntime accelerate optimum
```
```python
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

# Specify a quantized ONNX file, or set to None to use the default ONNX model
onnx_filename = "onnx/model_qint8_arm64.onnx"

if onnx_filename:
    model = CrossEncoder(
        MODEL_NAME,
        device="cpu",
        backend="onnx",
        model_kwargs={"file_name": onnx_filename},
    )
else:
    model = CrossEncoder(MODEL_NAME, device="cpu", backend="onnx")

...
```
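Whether ONNX actually helps depends on your hardware, so it is worth measuring. Here is a minimal sketch comparing the default PyTorch backend against the ONNX backend on CPU; the pair data and counts are placeholders, not the benchmark from this card.

```python
# Sketch: rough CPU latency comparison between backends (illustrative only)
import time
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"
pairs = [("感動的な映画について", "アクションシーンが楽しすぎる。見ていて飽きない。")] * 100

for backend in ("torch", "onnx"):
    model = CrossEncoder(MODEL_NAME, device="cpu", backend=backend)
    model.predict(pairs[:8])  # warm-up
    start = time.perf_counter()
    model.predict(pairs)
    print(f"{backend}: {time.perf_counter() - start:.2f}s for {len(pairs)} pairs")
```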
### HuggingFace transformers
```python
import torch
from torch.nn import Sigmoid
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

def detect_device():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = detect_device()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.to(device)
model.eval()
if device == "cuda":
    model.half()

query = "感動的な映画について"
passages = [
    "深いテーマを持ちながらも、観る人の心を揺さぶる名作。登場人物の心情描写が秀逸で、ラストは涙なしでは見られない。",
    "重要なメッセージ性は評価できるが、暗い話が続くので気分が落ち込んでしまった。もう少し明るい要素があればよかった。",
    "どうにもリアリティに欠ける展開が気になった。もっと深みのある人間ドラマが見たかった。",
    "アクションシーンが楽しすぎる。見ていて飽きない。ストーリーはシンプルだが、それが逆に良い。",
]

inputs = tokenizer(
    [(query, passage) for passage in passages],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid maps the raw logits to relevance scores in [0, 1]
activation = Sigmoid()
scores = activation(logits).squeeze().tolist()
print("Scores:", scores)
```
## ✨ Features
### Characteristics of Small Rerankers
japanese-reranker-tiny-v2 and japanese-reranker-xsmall-v2 are small reranker models with the following features:

- They run at practical speeds even on CPUs and Apple Silicon.
- They can improve the accuracy of RAG systems without expensive GPU resources (a reranking sketch follows this list).
- They can be deployed on edge devices and used in production environments that require low latency.
- They are based on ruri-v3-pt-30m, a ModernBERT-based model.
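As referenced in the list above, here is a minimal sketch of the usual two-stage RAG retrieval pattern, with the reranker refining the candidates from a cheap vector search. The bi-encoder name is a placeholder, not a recommendation from this card, and the top-k values are arbitrary.

```python
# Sketch: rerank bi-encoder retrieval candidates before passing them to an LLM.
# "your-embedding-model" is a placeholder; swap in any Japanese bi-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("your-embedding-model")
reranker = CrossEncoder("hotchpotch/japanese-reranker-tiny-v2")

corpus = ["..."]  # your document chunks
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

query = "感動的な映画について"
query_emb = retriever.encode(query, convert_to_tensor=True)

# Stage 1: cheap vector search over the whole corpus
hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]

# Stage 2: precise cross-encoder scoring of the short candidate list
scores = reranker.predict([(query, c) for c in candidates])
top_passages = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:5]]
```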
### Evaluation Results
#### Inference Speed
Inference speed was measured by reranking approximately 150,000 pairs, counting pure model inference time and excluding tokenization. MPS (Apple Silicon) and CPU results were measured on an M4 Max, and GPU results on an RTX 5090 with FlashAttention-2 enabled.
The script used for the inference speed benchmark is available here.
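The linked script is the authoritative reference; the following is only a sketch of the methodology described (tokenize everything up front, then time the forward passes alone), using placeholder data rather than the real benchmark set.

```python
# Sketch: time pure model inference, excluding tokenization (placeholder data)
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

pairs = [("クエリ", "パッセージ")] * 1024  # stand-in pairs, not the real benchmark set
batch_size = 64

# Tokenize up front so tokenization cost is excluded from the timing
batches = [
    tokenizer(pairs[i : i + batch_size], padding=True, truncation=True,
              max_length=512, return_tensors="pt")
    for i in range(0, len(pairs), batch_size)
]

start = time.perf_counter()
with torch.no_grad():
    for batch in batches:
        model(**batch)
elapsed = time.perf_counter() - start
print(f"{len(pairs)} pairs in {elapsed:.2f}s ({len(pairs) / elapsed:.0f} pairs/s)")
```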
## 📄 License
This project is licensed under the MIT License.