ko-reranker-8k Open-source Text Re-ranking Model - Fine-tuned with Korean data for precise text content ranking

Ko Reranker 8k

Developed by upskyy

A text ranking model fine-tuned with Korean data based on BAAI/bge-reranker-v2-m3

Supports Multiple Languages

Open Source License: Apache-2.0

#Korean Re-ranking #Multilingual Support #High-precision Relevance Scoring

Downloads 14

Release Time: 8/16/2024

Model Overview

This text ranking model is optimized for Korean and also handles multilingual text. It computes a relevance score between a query and a text passage.

Model Features

Korean Optimization

Fine-tuned with Korean data, especially suitable for Korean text ranking tasks

Multilingual Support

Supports multiple languages in addition to Korean

Efficient Computation

Supports FP16 inference (use_fp16=True), which speeds up computation at the cost of a slight drop in accuracy

Score Normalization

Optional mapping of relevance scores to a 0-1 range for easier comparison
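The normalization described above is a sigmoid applied to the raw score. The following is a minimal standalone sketch of that mapping, not the library's own code; the helper name `sigmoid` is an illustrative assumption, and the raw scores are the ones reported in the usage examples in this document.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw relevance score to the open interval (0, 1)."""
    # Use a numerically stable form so large negative inputs do not overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores taken from the usage examples in this document.
raw_scores = [-8.3828125, -11.2265625, 8.6875]
normalized = [sigmoid(s) for s in raw_scores]

for raw, norm in zip(raw_scores, normalized):
    print(f"{raw} -> {norm}")
```

Because the sigmoid is monotonically increasing, normalizing changes the scale of the scores but not the order in which passages are ranked.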

Model Capabilities

Text Relevance Scoring

Multilingual Text Processing

Query-Passage Matching

Use Cases

Information Retrieval

Search Engine Result Ranking

Ranking search engine results by relevance

Improves the relevance of search results

Q&A Systems

Selecting the most relevant answer from candidate responses

Enhances the accuracy of Q&A systems

Content Recommendation

News Recommendation

Recommending the most relevant news content based on user queries

Improves the precision of content recommendations
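All of the use cases above follow the same pattern: score each candidate against the query, then sort by score. The sketch below illustrates that pattern under stated assumptions. The names `rerank`, `ScoreFn`, and `toy_overlap_scorer` are hypothetical, and the toy scorer is only a stand-in so the example runs without downloading the model. In practice, a scoring call that accepts a list of [query, passage] pairs and returns one score per pair (as `compute_score` does in the usage examples in this document) would take its place.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer receives [query, passage] pairs and returns one score per pair.
ScoreFn = Callable[[List[List[str]]], List[float]]


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: ScoreFn,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k passages sorted by descending relevance score."""
    pairs = [[query, passage] for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_overlap_scorer(pairs: List[List[str]]) -> List[float]:
    """Stand-in scorer (NOT the model): counts shared lowercase words."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(float(len(query_words & passage_words)))
    return scores


candidates = [
    "hi",
    "The giant panda is a bear species endemic to China.",
    "A panda is what people often call the giant panda.",
]
top = rerank("what is panda?", candidates, toy_overlap_scorer, top_k=2)
for passage, score in top:
    print(score, passage)
```

Keeping the scorer as a parameter means the same ranking logic can serve search results, candidate answers, or news items; only the candidate list changes.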

🚀 upskyy/ko-reranker-8k

The ko-reranker-8k is a model fine-tuned on the BAAI/bge-reranker-v2-m3 model with Korean data.

🚀 Quick Start

✨ Features

This model is fine-tuned from BAAI/bge-reranker-v2-m3 using Korean data, making it suitable for text ranking tasks in Korean and multilingual scenarios.

📦 Installation

Using FlagEmbedding

pip install -U FlagEmbedding

💻 Usage Examples

Basic Usage

Using FlagEmbedding

Get relevance scores (higher scores indicate more relevance):

from FlagEmbedding import FlagReranker


reranker = FlagReranker('upskyy/ko-reranker-8k', use_fp16=True) # use_fp16=True speeds up computation at the cost of a slight drop in accuracy

score = reranker.compute_score(['query', 'passage'])
print(score) # -8.3828125

# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score) # 0.000228713314721116

scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores) # [-11.2265625, 8.6875]

# Set normalize=True to map the scores into the 0-1 range by applying a sigmoid function
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores) # [1.3315579521758342e-05, 0.9998313472460109]

Using Hugging Face Transformers

Get relevance scores (higher scores indicate more relevance):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker-8k')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker-8k')
model.eval()

pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

📚 Documentation

Citation

@misc{li2023making,
      title={Making Large Language Models A Better Foundation For Dense Retrieval}, 
      author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
      year={2023},
      eprint={2312.15503},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{chen2024bge,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation}, 
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

📄 License

This project is licensed under the Apache-2.0 license.
