JaColBERT: An Open-Source Japanese Document Retrieval Model - Accurately Find Information with Strong Out-of-Domain Generalization Ability

Developed by bclavie

JaColBERT is the first Japanese-specific document retrieval model based on ColBERT, featuring strong out-of-domain generalization capabilities.

Language: Japanese | License: MIT | Tags: Japanese Document Retrieval, ColBERT Architecture, Semantic Retrieval

Release Time : 12/25/2023

Model Overview

JaColBERT is the first Japanese-specific document retrieval model based on ColBERT. By representing documents as sets of embedding vectors, it achieves excellent performance and strong out-of-domain generalization capabilities at a low computational cost.

Model Features

Strong Out-of-Domain Generalization

Despite being evaluated on out-of-domain datasets, JaColBERT surpasses previously common Japanese document retrieval models and approaches the performance of multilingual models.

Efficient Training

Trained on only 10 million triplets from a single dataset, requiring far less data than dense embedding models.

High Computational Efficiency

By representing documents as sets of embedding vectors, it achieves superior performance at a much lower computational cost compared to cross-encoders.

Model Capabilities

Japanese Document Retrieval

Sentence Similarity Calculation

Semantic Search

Use Cases

Information Retrieval

Question Answering System

Used to build Japanese question-answering systems, quickly retrieving relevant documents to answer questions.

Achieved R@1 of 0.906 on the JSQuAD dataset

Document Search

Used for semantic search of Japanese documents, improving search relevance.

Reached R@10 of 0.645 on MIRACL and 0.821 on MrTyDi, despite not having been trained on either dataset

🚀 JaColBERT v1: Japanese Document Retrieval Model

Welcome to JaColBERT version 1, an initial release of a Japanese-only document retrieval model based on ColBERT. This model outperforms previous Japanese models for document retrieval and approaches the performance of multilingual models, demonstrating the strong generalization potential of ColBERT-based models for the Japanese language.

🚀 Quick Start

To see how to use the model, jump to the Usage Examples section below.

✨ Features

Outperforms Previous Models: JaColBERT outperforms previous common Japanese models used for document retrieval and gets close to the performance of multilingual models.
Strong Generalization: Despite being trained on a single dataset, JaColBERT shows strong generalization potential, even on out-of-domain evaluation datasets.
Efficient Approach: It uses a ColBERT-like approach, which combines the best of traditional sparse, cross-encoder, and dense retrieval methods, offering a good balance between performance and computational cost.

📦 Installation

JaColBERT works using ColBERT+RAGatouille. You can install it and all its necessary dependencies by running:

pip install -U ragatouille

For further examples of how to use RAGatouille with ColBERT models, see the examples section in the GitHub repository.

💻 Usage Examples

Basic Usage

Encoding and querying documents without an index

Using JaColBERT without building an index is simple: load the model, encode() some documents, and then call search_encoded_documents():

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")

RAG.encode(['document_1', 'document_2', ...])
RAG.search_encoded_documents(query="your search query")

Subsequent calls to encode() will add to the existing in-memory collection. If you want to empty it, simply run RAG.clear_encoded_docs().

Indexing

In order for the late-interaction retrieval approach used by ColBERT to work, you must first build your index.

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("bclavie/JaColBERT")
documents = [ "マクドナルドのフライドポテトの少量のカロリーはいくつですか？マクドナルドの小さなフライドポテトのカロリーマクドナルドのウェブサイトには、次のように記載されています。フライドポテトの小さな注文で230カロリーケチャップで25カロリー、ケチャップパケットで15カロリー。",]
RAG.index(name="My_first_index", collection=documents)

The index files are stored, by default, at .ragatouille/colbert/indexes/{index_name}.
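Since the default location follows a fixed pattern, the path of an index can be derived from its name. The helper below is a minimal sketch of that pattern using only the standard library; the function name `index_path` and its `root` parameter are illustrative assumptions, not part of RAGatouille.

```python
from pathlib import Path


def index_path(index_name: str, root: str = ".ragatouille") -> Path:
    """Build the default on-disk location described above for a named index.

    `root` is a hypothetical parameter for cases where the default
    directory has been moved; it is not a RAGatouille setting.
    """
    return Path(root) / "colbert" / "indexes" / index_name


path = index_path("My_first_index")
# Check that the index exists before trying to load it from disk.
if not path.exists():
    print(f"No index found at {path.as_posix()}; build it first.")
```

A check like this avoids a confusing load error when a session starts in a different working directory than the one where the index was built.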

Searching

Once you have created an index, searching through it is just as simple! If you're in the same session and RAG is still loaded, you can directly search the newly created index. Otherwise, you'll want to load it from disk:

RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/My_first_index")

And then query it:

RAG.search(query="What animation studio did Miyazaki found?")
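Search results usually need light post-processing before they are shown or passed to a downstream component. The sketch below assumes each result is a dictionary with `content`, `score` and `rank` keys, which is how RAGatouille's documentation describes its search output. Treat that shape, the helper name `format_results`, and the sample data as assumptions rather than a guaranteed API.

```python
from typing import Dict, List


def format_results(results: List[Dict], max_chars: int = 40) -> List[str]:
    """Render results as 'rank. [score] snippet' lines, best rank first."""
    ordered = sorted(results, key=lambda r: r["rank"])
    lines = []
    for r in ordered:
        snippet = r["content"][:max_chars]
        lines.append(f"{r['rank']}. [{r['score']:.2f}] {snippet}")
    return lines


# Hypothetical results, shaped like the assumed output of RAG.search(...).
sample_results = [
    {"content": "マクドナルドのフライドポテトのカロリー。", "score": 12.3, "rank": 2},
    {"content": "宮崎駿は1985年にスタジオジブリを設立した。", "score": 25.9, "rank": 1},
]

for line in format_results(sample_results):
    print(line)
```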

📚 Documentation

Intro

For details, see the arXiv report.

Welcome to JaColBERT version 1, the initial release of JaColBERT, a Japanese-only document retrieval model based on ColBERT.

It outperforms previous common Japanese models used for document retrieval, and gets close to the performance of multilingual models, despite the evaluation datasets being out-of-domain for our models but in-domain for multilingual approaches. This showcases the strong generalisation potential of ColBERT-based models, even applied to Japanese!

JaColBERT is only an initial release: it is trained on only 10 million triplets from a single dataset. This is a first version, hopefully demonstrating the strong potential of this approach.

The information on this model card is minimal and intended to give an overview. I've been asked to provide a citable version: please refer to the Technical Report for more information.

Why use a ColBERT-like approach for your RAG application?

Most retrieval methods have strong tradeoffs:

Traditional sparse approaches, such as BM25, are strong baselines, but do not leverage any semantic understanding, and thus hit a hard ceiling.
Cross-encoder retriever methods are powerful, but prohibitively expensive over large datasets: they must process the query against every single known document to be able to output scores.
Dense retrieval methods, using dense embeddings in vector databases, are lightweight and perform well, but are not data-efficient (they often require hundreds of millions, if not billions, of training example pairs to reach state-of-the-art performance) and generalise poorly in many cases. This makes sense: compressing every aspect of a document into a single vector, so that it can be matched to any potential query, is an extremely hard problem.

ColBERT and its variants, including JaColBERT, aim to combine the best of all worlds: by representing the documents as essentially bags-of-embeddings, we obtain superior performance and strong out-of-domain generalisation at much lower compute cost than cross-encoders.
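To make the bag-of-embeddings idea concrete, the sketch below shows the late-interaction (MaxSim) scoring used by ColBERT-style models. Each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The two-dimensional vectors are made up for illustration, and this plain-Python version is a simplification, not JaColBERT's actual implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], docs: List[List[Vector]]) -> List[int]:
    """Return document indices ordered from highest to lowest MaxSim score."""
    scored = [(maxsim_score(query_vecs, vecs), idx) for idx, vecs in enumerate(docs)]
    return [idx for _, idx in sorted(scored, key=lambda p: p[0], reverse=True)]


# Toy token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
doc_b = [[0.5, 0.0], [0.0, 0.25]]             # weaker matches

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 0.75
print(rank_documents(query, [doc_b, doc_a]))  # [1, 0]
```

Because document embeddings do not depend on the query, they can be computed once at indexing time. Only the query needs to be encoded at search time, which is where the cost advantage over cross-encoders comes from.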

The strong out-of-domain performance can be seen in our results: JaColBERT, despite not having been trained on Mr.TyDi and MIRACL, nearly matches e5 dense retrievers, which have been trained on these datasets.

On JSQuAD, which is partially out-of-domain for e5 (it has only been exposed to the English version) and entirely out-of-domain for JaColBERT, it outperforms all e5 models.

Moreover, this approach requires considerably less data than dense embeddings: to reach its current performance, JaColBERT v1 is trained on only 10M triplets, compared to billions of examples for the multilingual e5 models.

Training

Training Data

The model is trained on the Japanese split of MMARCO, augmented with hard negatives. The data, including the hard negatives, is available on Hugging Face Datasets.
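For readers unfamiliar with triplet training data, the sketch below shows how a query, its positive passage and a set of hard negatives can be expanded into (query, positive, negative) triplets. The field names and the sample record are hypothetical and do not reflect the actual schema of the released dataset.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]


def build_triplets(example: Dict) -> List[Triplet]:
    """Pair the query and its positive passage with each hard negative."""
    return [
        (example["query"], example["positive"], negative)
        for negative in example["hard_negatives"]
    ]


# Hypothetical record; the real dataset layout may differ.
example = {
    "query": "フライドポテトのカロリーは？",
    "positive": "フライドポテトの小さな注文で230カロリー。",
    "hard_negatives": [
        "ケチャップパケットで15カロリー。",  # topically close, but not the answer
        "ケチャップで25カロリー。",
    ],
}

for triplet in build_triplets(example):
    print(triplet)
```

Hard negatives are passages that look relevant to the query but do not answer it, which gives the model a more informative training signal than randomly sampled negatives.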

We do not train nor perform data augmentation on any other dataset at this stage. We hope to do so in future work, or support practitioners intending to do so (feel free to reach out).

Training Method

JaColBERT is trained for a single epoch (one pass over every triplet) on 8 NVIDIA L4 GPUs. Total training time was around 10 hours.

JaColBERT is initialised from Tohoku University's excellent bert-base-japanese-v3 and benefitted strongly from Nagoya University's work on building strong Japanese SimCSE models, among other work.

We attempted to train JaColBERT with a variety of settings, including different batch sizes (8, 16, 32 per GPU) and learning rates (3e-6, 5e-6, 1e-5, 2e-5). The best results were obtained with 5e-6, though results were very close when using 3e-6. Any higher learning rate consistently resulted in lower performance in early evaluations and was discarded. In all cases, we applied warmup steps equal to 10% of the total steps.
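The warmup setting can be sketched as a simple schedule. The version below ramps the learning rate linearly from zero to the peak value over the first 10% of steps. What happens after warmup is not stated above, so the linear decay to zero used here is an assumption, as are the function name and the example step count.

```python
def learning_rate(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-6,
    warmup_fraction: float = 0.1,
) -> float:
    """Linear warmup to peak_lr, then (assumed) linear decay to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step index.
        return peak_lr * step / warmup_steps
    # Assumption: decay linearly from the peak down to zero at the last step.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


# Illustrative run of 1,000 optimizer steps (not the real step count).
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```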

In-batch negative loss was applied, and we did not use any distillation methods (using the scores from an existing model).
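One common formulation of an in-batch negative loss is sketched below. Each query is scored against every passage in the batch, the passage paired with that query is treated as the positive, and all other passages act as negatives in a softmax cross-entropy. This illustrates the general technique under that assumption. The exact loss used to train JaColBERT may differ in detail.

```python
import math
from typing import List


def in_batch_negative_loss(scores: List[List[float]]) -> float:
    """Mean cross-entropy where scores[i][i] is the positive for query i.

    scores[i][j] is the relevance score of passage j for query i.
    """
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all passages in the batch.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_z - row[i]
    return total / len(scores)


# Toy score matrices for a batch of two queries (illustrative values).
good = [[10.0, 0.0], [0.0, 10.0]]  # positives score highest
bad = [[0.0, 10.0], [10.0, 0.0]]   # negatives score highest

print(in_batch_negative_loss(good) < in_batch_negative_loss(bad))  # True
```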

Results

See the table below for an overview of results, vs previous Japanese-only models and the current multilingual state-of-the-art (multilingual-e5).

Worth noting: JaColBERT is evaluated out-of-domain on all three datasets, whereas JSQuAD is partially (English version) and MIRACL & Mr.TyDi are fully in-domain for e5, likely contributing to their strong performance. In a real-world setting, I'm hopeful this gap could be bridged with moderate, quick (<2hrs) fine-tuning.

(refer to the technical report for exact evaluation method + code. * indicates the best monolingual/out-of-domain result. bold is best overall result. italic indicates the task is in-domain for the model.)

| Model | JSQuAD R@1 | JSQuAD R@5 | JSQuAD R@10 | MIRACL R@3 | MIRACL R@5 | MIRACL R@10 | MrTyDi R@3 | MrTyDi R@5 | MrTyDi R@10 | Average R@1/R@3 | Average R@5 | Average R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JaColBERT | 0.906* | 0.968* | 0.978* | 0.464* | 0.546* | 0.645* | 0.744* | 0.781* | 0.821* | 0.705* | 0.765* | 0.813* |
| m-e5-large (in-domain) | 0.865 | 0.966 | 0.977 | 0.522 | 0.600 | 0.697 | 0.813 | 0.856 | 0.893 | 0.730 | 0.807 | 0.856 |
| m-e5-base (in-domain) | 0.838 | 0.955 | 0.973 | 0.482 | 0.553 | 0.632 | 0.777 | 0.815 | 0.857 | 0.699 | 0.775 | 0.820 |
| m-e5-small (in-domain) | 0.840 | 0.954 | 0.973 | 0.464 | 0.540 | 0.640 | 0.767 | 0.794 | 0.844 | 0.690 | 0.763 | 0.819 |
| GLuCoSE | 0.645 | 0.846 | 0.897 | 0.369 | 0.432 | 0.515 | 0.617 | 0.670 | 0.735 | 0.544 | 0.649 | 0.716 |
| sentence-bert-base-ja-v2 | 0.654 | 0.863 | 0.914 | 0.172 | 0.224 | 0.338 | 0.488 | 0.549 | 0.611 | 0.435 | 0.545 | 0.621 |
| sup-simcse-ja-base | 0.632 | 0.849 | 0.897 | 0.133 | 0.177 | 0.264 | 0.454 | 0.514 | 0.580 | 0.406 | 0.513 | 0.580 |
| sup-simcse-ja-large | 0.603 | 0.833 | 0.889 | 0.159 | 0.212 | 0.295 | 0.457 | 0.517 | 0.581 | 0.406 | 0.521 | 0.588 |
| fio-base-v0.1 | 0.700 | 0.879 | 0.924 | 0.279 | 0.358 | 0.462 | 0.582 | 0.649 | 0.712 | 0.520 | 0.629 | 0.699 |
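For readers unfamiliar with the R@k columns, the sketch below computes a recall-at-k figure over a set of queries. It assumes the "hit rate" reading of the metric, where a query counts as a success if at least one relevant document appears in its top k results. The exact evaluation method is defined in the technical report and may differ. The document IDs are made up.

```python
from typing import List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked document IDs, relevant IDs)


def hit_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def mean_recall_at_k(runs: List[Run], k: int) -> float:
    """Average hit_at_k over all queries."""
    return sum(hit_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Three toy queries with their ranked results and relevant documents.
toy_runs: List[Run] = [
    (["d3", "d1", "d7"], {"d1"}),  # relevant document at rank 2
    (["d2", "d9", "d4"], {"d4"}),  # relevant document at rank 3
    (["d5", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

for k in (1, 2, 3):
    print(f"R@{k} = {mean_recall_at_k(toy_runs, k):.3f}")
```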

📄 License

The license of this model is MIT.
