# hotchpotch/japanese-reranker-tiny-v2
A very small and fast Japanese reranker model series (v2).
For more information about rerankers, technical reports, and evaluations, please refer to the following links:

## 🚀 Quick Start
### Prerequisites
Running the model requires version 4.48 or later of the transformers library.

```bash
pip install -U "transformers>=4.48.0" sentence-transformers sentencepiece
```
If your GPU supports FlashAttention-2, you can install the flash-attn library for faster inference:

```bash
pip install flash-attn --no-build-isolation
```
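Once flash-attn is installed, here is a minimal sketch of how it might be enabled when loading the model directly with transformers. This assumes a CUDA GPU supported by flash-attn; `attn_implementation` is a standard `from_pretrained` argument, and FlashAttention-2 requires half-precision weights.

```python
# Sketch: load the reranker with FlashAttention-2 (assumes a supported CUDA GPU)
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "hotchpotch/japanese-reranker-tiny-v2",
    torch_dtype=torch.float16,  # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",
).to("cuda")
```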
## 💻 Usage Examples
### SentenceTransformers
```python
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

model = CrossEncoder(MODEL_NAME)
# Half precision is faster on CUDA and MPS devices
if model.device.type in ("cuda", "mps"):
    model.model.half()

query = "感動的な映画について"
passages = [
    "深いテーマを持ちながらも、観る人の心を揺さぶる名作。登場人物の心情描写が秀逸で、ラストは涙なしでは見られない。",
    "重要なメッセージ性は評価できるが、暗い話が続くので気分が落ち込んでしまった。もう少し明るい要素があればよかった。",
    "どうにもリアリティに欠ける展開が気になった。もっと深みのある人間ドラマが見たかった。",
    "アクションシーンが楽しすぎる。見ていて飽きない。ストーリーはシンプルだが、それが逆に良い。",
]

scores = model.predict(
    [(query, passage) for passage in passages],
    show_progress_bar=True,
)
print("Scores:", scores)
```
### SentenceTransformers + ONNX
If you want to run the model faster on CPU or in an ARM environment, you can use the ONNX version or a quantized ONNX model.

```bash
pip install onnx onnxruntime accelerate optimum
```
```python
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

# Specify a quantized ONNX file, or set to None to use the default ONNX model
onnx_filename = "onnx/model_qint8_arm64.onnx"

if onnx_filename:
    model = CrossEncoder(
        MODEL_NAME,
        device="cpu",
        backend="onnx",
        model_kwargs={"file_name": onnx_filename},
    )
else:
    model = CrossEncoder(MODEL_NAME, device="cpu", backend="onnx")

...
```
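Whether ONNX actually helps depends on your hardware, so it is worth measuring. Here is a minimal sketch comparing the default PyTorch backend against the ONNX backend on CPU; the pair data and counts are placeholders, not the benchmark from this card.

```python
# Sketch: rough CPU latency comparison between backends (illustrative only)
import time
from sentence_transformers import CrossEncoder

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"
pairs = [("感動的な映画について", "アクションシーンが楽しすぎる。見ていて飽きない。")] * 100

for backend in ("torch", "onnx"):
    model = CrossEncoder(MODEL_NAME, device="cpu", backend=backend)
    model.predict(pairs[:8])  # warm-up
    start = time.perf_counter()
    model.predict(pairs)
    print(f"{backend}: {time.perf_counter() - start:.2f}s for {len(pairs)} pairs")
```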
### HuggingFace transformers
```python
import torch
from torch.nn import Sigmoid
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"

def detect_device():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = detect_device()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.to(device)
model.eval()
if device == "cuda":
    model.half()

query = "感動的な映画について"
passages = [
    "深いテーマを持ちながらも、観る人の心を揺さぶる名作。登場人物の心情描写が秀逸で、ラストは涙なしでは見られない。",
    "重要なメッセージ性は評価できるが、暗い話が続くので気分が落ち込んでしまった。もう少し明るい要素があればよかった。",
    "どうにもリアリティに欠ける展開が気になった。もっと深みのある人間ドラマが見たかった。",
    "アクションシーンが楽しすぎる。見ていて飽きない。ストーリーはシンプルだが、それが逆に良い。",
]

inputs = tokenizer(
    [(query, passage) for passage in passages],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

# Sigmoid maps the raw logits to relevance scores in [0, 1]
activation = Sigmoid()
scores = activation(logits).squeeze().tolist()
print("Scores:", scores)
```
## ✨ Features
### Characteristics of Small Rerankers
japanese-reranker-tiny-v2 and japanese-reranker-xsmall-v2 are small reranker models with the following features:

- They run at practical speeds even on CPUs and Apple Silicon.
- They can improve the accuracy of RAG systems without expensive GPU resources (a reranking sketch follows this list).
- They can be deployed on edge devices and used in production environments that require low latency.
- They are based on ruri-v3-pt-30m, a ModernBERT-based model.
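As referenced in the list above, here is a minimal sketch of the usual two-stage RAG retrieval pattern, with the reranker refining the candidates from a cheap vector search. The bi-encoder name is a placeholder, not a recommendation from this card, and the top-k values are arbitrary.

```python
# Sketch: rerank bi-encoder retrieval candidates before passing them to an LLM.
# "your-embedding-model" is a placeholder; swap in any Japanese bi-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("your-embedding-model")
reranker = CrossEncoder("hotchpotch/japanese-reranker-tiny-v2")

corpus = ["..."]  # your document chunks
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

query = "感動的な映画について"
query_emb = retriever.encode(query, convert_to_tensor=True)

# Stage 1: cheap vector search over the whole corpus
hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]

# Stage 2: precise cross-encoder scoring of the short candidate list
scores = reranker.predict([(query, c) for c in candidates])
top_passages = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:5]]
```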
### Evaluation Results
#### Inference Speed
Inference speed was measured by reranking approximately 150,000 pairs, counting pure model inference time and excluding tokenization. MPS (Apple Silicon) and CPU results were measured on an M4 Max, and GPU results on an RTX 5090 with FlashAttention-2 enabled.
The script used for the inference speed benchmark is available here.
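The linked script is the authoritative reference; the following is only a sketch of the methodology described (tokenize everything up front, then time the forward passes alone), using placeholder data rather than the real benchmark set.

```python
# Sketch: time pure model inference, excluding tokenization (placeholder data)
import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "hotchpotch/japanese-reranker-tiny-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

pairs = [("クエリ", "パッセージ")] * 1024  # stand-in pairs, not the real benchmark set
batch_size = 64

# Tokenize up front so tokenization cost is excluded from the timing
batches = [
    tokenizer(pairs[i : i + batch_size], padding=True, truncation=True,
              max_length=512, return_tensors="pt")
    for i in range(0, len(pairs), batch_size)
]

start = time.perf_counter()
with torch.no_grad():
    for batch in batches:
        model(**batch)
elapsed = time.perf_counter() - start
print(f"{len(pairs)} pairs in {elapsed:.2f}s ({len(pairs) / elapsed:.0f} pairs/s)")
```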
## 📄 License
This project is licensed under the MIT License.