ViRanker Open-Source Vietnamese Text Re-ranking Model - Free Deployment with Direct Output of Query-Document Relevance Scores

ViRanker

Developed by namdp-ptit

ViRanker is a cross-encoder model for Vietnamese text re-ranking, which can directly output the relevance score between the query and the document.

Text Embedding

Transformers

Open Source License: Apache-2.0

#Vietnamese text re-ranking #Cross-encoder model #High relevance score

Downloads 692

Release Time : 8/14/2024

Model Overview

This model takes a query and a passage as input and directly outputs a relevance score rather than an embedding vector. The raw score is a real number that can be mapped to the [0,1] interval with the sigmoid function. The model is suited to Vietnamese text ranking tasks.

Model Features

Direct relevance scoring

Directly outputs the relevance score between a query and a document without generating embedding vectors.

High precision

On the MS MMarco Passage Reranking - Vi - Dev dataset it reaches an NDCG@3 of 0.6815, the highest among the cross-encoders compared in the Performance section.

Support for FP16 acceleration

Supports FP16 computation, which speeds up inference at the cost of a slight loss in accuracy.

Model Capabilities

Text relevance scoring

Vietnamese text processing

Query-document matching

Use Cases

Information retrieval

Search engine result sorting

Re-rank the results returned by the search engine to improve the ranking of the most relevant results.

Can significantly improve the accuracy of the top results

Question-answering system

Answer relevance evaluation

Evaluate the relevance between the candidate answers and the question and select the most appropriate answer.

Improve the accuracy of the question-answering system
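Both use cases above reduce to the same step: score every (query, candidate) pair and sort by score. The following is a minimal sketch of that step, not the author's implementation. The names rerank and overlap_score are hypothetical, and overlap_score is only a toy word-overlap stand-in so the example runs without downloading the model; in practice the scoring function would call ViRanker.

```python
from typing import Callable, List, Optional, Tuple


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: Optional[int] = None) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair and return candidates sorted by score, best first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a real reranker: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


query = 'ai là vị vua cuối cùng của việt nam'
candidates = [
    'lý nam đế là vị vua đầu tiên của nước ta',
    'vua bảo đại là vị vua cuối cùng của nước ta',
]

ranked = rerank(query, candidates, overlap_score)
for doc, score in ranked:
    print(f'{score:.3f}  {doc}')

# For question answering, the selected answer is simply the top-ranked candidate.
best_answer = ranked[0][0]
```

Keeping the scorer as a parameter means the same function serves search-result re-ranking (keep the top_k entries) and answer selection (take the first entry).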

🚀 Reranker

A reranker differs from an embedding model: it takes a query and a passage as input and directly outputs a relevance score instead of an embedding. This score can be mapped to a float value in the range [0,1] with the sigmoid function.
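The sigmoid mapping can be checked by hand. The sketch below applies the standard logistic function to the two raw scores quoted in the usage examples of this page (13.71875 and -8.53125); the helper name sigmoid is our own and is not part of any library API.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw relevance scores taken from the usage examples on this page.
raw_scores = [13.71875, -8.53125]
normalized = [sigmoid(score) for score in raw_scores]
print(normalized)  # roughly 0.999999 for the relevant pair and 0.000197 for the irrelevant one
```

Because the sigmoid is monotonic, normalizing changes the scale of the scores but never the ranking order.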

🚀 Quick Start


📦 Installation

There are two ways to install the necessary libraries for using the reranker:

Using FlagEmbedding

pip install -U FlagEmbedding

Using Huggingface transformers

pip install -U transformers torch

💻 Usage Examples

Basic Usage

Using FlagEmbedding

from FlagEmbedding import FlagReranker

reranker = FlagReranker('namdp-ptit/ViRanker',
                        use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

score = reranker.compute_score(['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'])
print(score)  # 13.71875

# You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
score = reranker.compute_score(['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
                               normalize=True)
print(score)  # 0.99999889840464

scores = reranker.compute_score(
    [
        ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
        ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta']
    ]
)
print(scores)  # [13.7265625, -8.53125]

# You can map the scores into 0-1 by setting "normalize=True", which applies the sigmoid function to the score
scores = reranker.compute_score(
    [
        ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
        ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta']
    ],
    normalize=True
)
print(scores)  # [0.99999889840464, 0.00019716942196222918]
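The single-pair call above prints 13.71875 while the batched call prints 13.7265625 for the same pair. The difference, 0.0078125, is exactly the spacing between adjacent half-precision values in that range. One plausible reading, which the source does not confirm, is that this is rounding noise from use_fp16=True. The sketch below only verifies the arithmetic using Python's struct module; next_float16 is a hypothetical helper name.

```python
import struct


def next_float16(x: float) -> float:
    """Return the next IEEE 754 half-precision value above x (valid for positive, finite x)."""
    # Reinterpret the 16-bit float as an unsigned integer, add one, and convert back.
    (bits,) = struct.unpack('<H', struct.pack('<e', x))
    return struct.unpack('<e', struct.pack('<H', bits + 1))[0]


single_pair_score = 13.71875   # printed by the single-pair call
batched_score = 13.7265625     # printed by the batched call

step = next_float16(single_pair_score) - single_pair_score
print(step)
print(next_float16(single_pair_score) == batched_score)
```

A gap of this size does not change the ranking in the example, since the competing score is about 22 points lower.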

Using Huggingface transformers

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('namdp-ptit/ViRanker')
model = AutoModelForSequenceClassification.from_pretrained('namdp-ptit/ViRanker')
model.eval()

pairs = [
    ['ai là vị vua cuối cùng của việt nam', 'vua bảo đại là vị vua cuối cùng của nước ta'],
    ['ai là vị vua cuối cùng của việt nam', 'lý nam đế là vị vua đầu tiên của nước ta']
]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)

🔧 Technical Details

Fine-tuning

Data Format

Training data should be a JSON Lines file, in which each line is a JSON object of the following form:

{"query": str, "pos": List[str], "neg": List[str]}

query is the query text, pos is a list of positive texts, and neg is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus to use as negatives. In addition, for each query in the training data, we used LLMs to generate hard negatives by asking them to create a document that is the opposite of one of the documents in 'pos'.
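As an illustration of this format, the sketch below builds one training record with randomly sampled negatives and serializes it as a single JSON line. It covers only the random-sampling fallback, not the LLM-generated hard negatives. The helper name build_record, the tiny corpus, and the choice of two negatives are assumptions made for the example.

```python
import json
import random
from typing import Dict, List


def build_record(query: str, positives: List[str], corpus: List[str],
                 num_negatives: int, rng: random.Random) -> Dict[str, object]:
    """Build one {"query", "pos", "neg"} record, sampling negatives from the non-positive corpus."""
    pool = [doc for doc in corpus if doc not in positives]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    return {"query": query, "pos": list(positives), "neg": negatives}


# Illustrative corpus; the last two sentences are made up for this example.
corpus = [
    'vua bảo đại là vị vua cuối cùng của nước ta',
    'lý nam đế là vị vua đầu tiên của nước ta',
    'hà nội là thủ đô của việt nam',
    'sông hồng chảy qua hà nội',
]

rng = random.Random(0)  # fixed seed so the sample is reproducible
record = build_record(
    query='ai là vị vua cuối cùng của việt nam',
    positives=['vua bảo đại là vị vua cuối cùng của nước ta'],
    corpus=corpus,
    num_negatives=2,
    rng=rng,
)

# ensure_ascii=False keeps the Vietnamese text readable in the output.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per query to a file yields the training file described above.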

Performance

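The table below compares our results with those of other pre-trained cross-encoders on the MS MMarco Passage Reranking - Vi - Dev dataset. ViRanker scores highest on NDCG@3 and on all three MRR cutoffs, while itdainb/PhoRanker scores highest on NDCG@5 and NDCG@10.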

Model Name	NDCG@3	MRR@3	NDCG@5	MRR@5	NDCG@10	MRR@10
namdp-ptit/ViRanker	0.6815	0.6641	0.6983	0.6894	0.7302	0.7107
itdainb/PhoRanker	0.6625	0.6458	0.7147	0.6731	0.7422	0.6830
kien-vu-uet/finetuned-phobert-passage-rerank-best-eval	0.0963	0.0883	0.1396	0.1131	0.1681	0.1246
BAAI/bge-reranker-v2-m3	0.6087	0.5841	0.6513	0.6062	0.6872	0.6209
BAAI/bge-reranker-v2-gemma	0.6088	0.5908	0.6446	0.6108	0.6785	0.6249
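For readers unfamiliar with the metrics in the table, the sketch below computes MRR@k and NDCG@k for a single query using the common textbook definitions with binary relevance labels. The source does not describe its evaluation script, so the exact variant used for the table (gain function, tie handling) is an assumption, and the relevance list is invented for illustration. Reported figures are normally averages of these per-query values.

```python
import math
from typing import List


def mrr_at_k(relevance: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item within the top k, or 0 if there is none."""
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg_at_k(relevance: List[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance: List[int], k: int) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the passages in the order the reranker returned them (1 = relevant).
ranked_relevance = [0, 1, 0, 0, 1]

mrr3 = mrr_at_k(ranked_relevance, 3)
ndcg3 = ndcg_at_k(ranked_relevance, 3)
print(f'MRR@3 = {mrr3:.4f}, NDCG@3 = {ndcg3:.4f}')
```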

📄 License

This project is licensed under the Apache-2.0 license.

Contact

Email: phuongnamdpn2k2@gmail.com
LinkedIn: Dang Phuong Nam
Facebook: Phương Nam

Support The Project

If you find this project helpful and wish to support its ongoing development, here are some ways you can contribute:

Star the Repository: Show your appreciation by starring the repository. Your support motivates further development and enhancements.
Contribute: We welcome your contributions! You can help by reporting bugs, submitting pull requests, or suggesting new features.
Donate: If you’d like to support financially, consider making a donation. You can donate through:
- Vietcombank: 9912692172 - DANG PHUONG NAM

Thank you for your support!

Citation

Please cite as

@misc{ViRanker,
  title={ViRanker: A Cross-encoder Model for Vietnamese Text Ranking},
  author={Nam Dang Phuong},
  year={2024},
  publisher={Huggingface},
}
