# 🚀 Bloomz-560m-retriever-v2
This is a bi-encoder model based on Bloomz-560m-dpo-chat, which projects articles and queries into the same vector space, making it suitable for Open Domain Question Answering (ODQA).
## 🚀 Quick Start
The Bloomz-560m-retriever-v2 model is a powerful tool for information retrieval. It handles queries in both French and English effectively, making it suitable for multilingual scenarios.
## ✨ Features
- Multilingual Compatibility: Supports both French and English, enabling queries and articles in different languages to be compared in the same vector space.
- Efficient Retrieval: Much more efficient than the previous version, using cosine distance as its metric.
- Suitable for ODQA: Ideal for Open Domain Question Answering tasks, and can be complemented by rerankers.
## 📦 Installation
No specific installation steps are provided in the original document.
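The dependencies can, however, be inferred from the usage examples below; as a minimal sketch (the package list is an assumption based on the imports, not taken from the original card), `pip install transformers torch scipy numpy` is enough to run them.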
## 💻 Usage Examples
### Basic Usage
```python
from typing import Union, List

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cdist

tokenizer = AutoTokenizer.from_pretrained('cmarkea/bloomz-560m-retriever-v2')
model = AutoModel.from_pretrained('cmarkea/bloomz-560m-retriever-v2')

def infer(txt: Union[str, List[str]]):
    tok = tokenizer(txt, padding=True, return_tensors='pt')
    with torch.inference_mode():
        embedding = model(**tok)
    # The embedding of a sequence is the hidden state of its last token.
    return embedding.get('last_hidden_state')[:, -1, :].numpy()

list_of_contexts: List[str] = [...]
emb_contexts = infer(list_of_contexts)
list_of_queries: List[str] = [...]
emb_queries = infer(list_of_queries)

# Cosine distance between every query and every context.
dist = cdist(emb_queries, emb_contexts, 'cosine')
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]
top_contexts = top_k(5)
```
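Since `cdist` returns cosine distances (smaller means more similar), `dist.argsort(axis=-1)` orders the contexts from best to worst match for each query, so the first `x` indices select the top-`x` retrieved articles.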
### Advanced Usage
```python
from typing import List

import numpy as np
from transformers import pipeline
from scipy.spatial.distance import cdist

retriever = pipeline('feature-extraction', 'cmarkea/bloomz-560m-retriever-v2')

# The pipeline returns per-token features; keep only the last token's vector.
infer = lambda x: [np.array(ii[0][-1]).reshape(1, -1) for ii in retriever(x)]

list_of_contexts: List[str] = [...]
emb_contexts = np.concatenate(infer(list_of_contexts), axis=0)
list_of_queries: List[str] = [...]
emb_queries = np.concatenate(infer(list_of_queries), axis=0)

dist = cdist(emb_queries, emb_contexts, 'cosine')
top_k = lambda x: [
    [list_of_contexts[qq] for qq in ii]
    for ii in dist.argsort(axis=-1)[:, :x]
]
top_contexts = top_k(5)
```
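In both variants, the embedding of a text is read from the hidden state of its final token (`[:, -1, :]` in the basic example, `ii[0][-1]` with the pipeline). Since Bloomz is a causal decoder, the last token is the only position whose representation has attended to the entire input sequence.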
## 📚 Documentation
### Model Introduction
We introduce the Bloomz-560m-retriever-v2 model, based on the Bloomz-560m-dpo-chat model. This bi-encoder projects articles and queries into the same vector space, ensuring that queries lie close to related articles. The model is language-agnostic for French and English: a query in either language will be close to a relevant article regardless of whether the article is in French or English. It is ideal for Open Domain Question Answering (ODQA) and can be complemented by the rerankers Bloomz-560m-reranking or Bloomz-3b-reranking.
### Training Details
The training dataset is a variant of mMARCO enabling contrastive learning and filtering out false negatives. The filtering threshold was set at 0.8, and each positive observation is confronted with 10 hard negatives, ordered by decreasing score (the 10 hardest). The model was trained on a uniform distribution of languages (1/4 French-French, 1/4 French-English, 1/4 English-French, 1/4 English-English query-article pairs). The learning objective is of the InfoNCE type with a trainable temperature parameter, as introduced for the CLIP model.
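For illustration, here is a minimal sketch of an InfoNCE objective with a trainable temperature in the CLIP style, adapted to the hard-negative setup described above. The class name, tensor shapes, and positive-at-index-0 convention are assumptions made for the example, not the actual training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCELoss(nn.Module):
    """Sketch of InfoNCE with a trainable temperature, as in CLIP."""

    def __init__(self):
        super().__init__()
        # CLIP parameterizes the temperature through its log for stability.
        self.log_temp = nn.Parameter(torch.tensor(0.0))

    def forward(self, queries: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        # queries: (batch, dim); candidates: (batch, 1 + n_negatives, dim),
        # where index 0 is the positive article and the rest are hard negatives.
        queries = F.normalize(queries, dim=-1)
        candidates = F.normalize(candidates, dim=-1)
        # Temperature-scaled cosine similarities.
        logits = torch.einsum('bd,bnd->bn', queries, candidates) * self.log_temp.exp()
        # The positive candidate always sits at index 0.
        targets = torch.zeros(queries.size(0), dtype=torch.long, device=queries.device)
        return F.cross_entropy(logits, targets)
```

With the 10 hard negatives mentioned above, `candidates` would hold 11 articles per query.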
### Note
Unlike the Bloomz-560m-retriever, this much more efficient model uses cosine distance as its metric (instead of the L2 distance used previously).
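When adapting code written for the v1 model, the main change is the metric passed to `cdist`; a quick sketch of the two variants, assuming `emb_queries` and `emb_contexts` are computed as in the usage examples above:

```python
from scipy.spatial.distance import cdist

# v1 (Bloomz-560m-retriever) ranked articles by Euclidean (L2) distance:
dist_l2 = cdist(emb_queries, emb_contexts, 'euclidean')

# v2 ranks them by cosine distance (1 - cosine similarity):
dist_cos = cdist(emb_queries, emb_contexts, 'cosine')
```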
### Benchmark
The performance evaluation is based on the evaluation portion of SQuAD (5921 queries over 1204 articles across 35 different topics). An interesting aspect of this dataset is that multiple articles are associated with a single theme, representing challenging contexts where a query may be close to several relevant articles. On average, there are about thirty articles per theme (see Bloomz-560m-reranking for the exact distribution).
We compare performances using the average rank of the article targeted by each query (Top-mean), the standard deviation of those ranks (Top-std), the percentage of correct articles within the Top-1, Top-5, and Top-10 results, and finally the mean reciprocal rank (MRR) across the 1204 articles.
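As an illustration of these metrics, here is a minimal sketch of how they can be computed once the rank of each query's target article is known. The `ranks` array (1-indexed, one entry per query) and the function name are hypothetical, not part of the published evaluation code:

```python
import numpy as np

def retrieval_metrics(ranks: np.ndarray) -> dict:
    """ranks[i] is the 1-indexed position of the article targeted by query i."""
    return {
        'Top-mean': ranks.mean(),
        'Top-std': ranks.std(),
        'Top-1 (%)': 100 * (ranks <= 1).mean(),
        'Top-5 (%)': 100 * (ranks <= 5).mean(),
        'Top-10 (%)': 100 * (ranks <= 10).mean(),
        'MRR (%)': 100 * (1 / ranks).mean(),
    }
```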
| Model (FR/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) | MRR (%) |
|:---|---:|---:|---:|---:|---:|---:|
| BM25 | 16.8 | 100.8 | 71.7 | 88.3 | 91.8 | 79.2 |
| CamemBERT | 269.6 | 303.0 | 5.6 | 12.5 | 16.5 | 9.7 |
| [STS-CamemBERT](h4c5/sts-camembert-base) | 23.1 | 85.5 | 36.0 | 63.0 | 74.0 | 48.5 |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 10.2 | 40.1 | 43.9 | 73.9 | 84.0 | 57.3 |
| [E5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 6.1 | 29.7 | 59.9 | 84.9 | 91.0 | 71.1 |
| [E5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 5.2 | 29.2 | 67.0 | 89.2 | 93.7 | 76.7 |
| Bloomz-560m-retriever | 10.2 | 46.6 | 51.5 | 78.1 | 86.2 | 63.5 |
| Bloomz-3b-retriever | 8.8 | 36.4 | 49.2 | 77.5 | 86.1 | 62.0 |
| Bloomz-560m-retriever-v2 | 4.0 | 17.1 | 68.0 | 89.9 | 94.4 | 77.7 |
| Bloomz-3b-retriever-v2 | 2.8 | 14.8 | 76.5 | 94.4 | 97.2 | 84.4 |
| Model (EN/FR) | Top-mean | Top-std | Top-1 (%) | Top-5 (%) | Top-10 (%) | MRR (%) |
|:---|---:|---:|---:|---:|---:|---:|
| BM25 | 280.7 | 371.8 | 23.9 | 37.4 | 43.3 | 30.4 |
| CamemBERT | 355.0 | 328.3 | 0.9 | 3.7 | 6.4 | 3.13 |
| [STS-CamemBERT](h4c5/sts-camembert-base) | 102.2 | 196.9 | 13.1 | 30.5 | 40.7 | 22.1 |
| [Sentence-BERT](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 10.6 | 41.2 | 43.3 | 72.4 | 82.7 | 56.5 |
| [E5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 9.9 | 38.1 | 49.8 | 77.2 | 85.4 | 62.6 |
| [E5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 5.6 | 26.9 | 62.9 | 86.9 | 92.5 | 73.8 |
| Bloomz-560m-retriever | 11.0 | 47.8 | 48.3 | 75.7 | 84.7 | 60.4 |
| Bloomz-3b-retriever | 8.9 | 37.6 | 48.8 | 77.4 | 86.1 | 61.6 |
| Bloomz-560m-retriever-v2 | 4.4 | 18.9 | 66.6 | 89.3 | 94.1 | 76.6 |
| Bloomz-3b-retriever-v2 | 2.7 | 14.2 | 75.7 | 94.5 | 97.1 | 83.9 |
## 📄 License
The model uses the bigscience-bloom-rail-1.0 license.
| Property | Details |
|:---|:---|
| Model Type | Bloomz-560m-retriever-v2 |
| Training Data | A variant of [mMARCO](https://huggingface.co/datasets/cmarkea/mmarco-contrastive) enabling contrastive learning and filtering out false negatives |
| License | bigscience-bloom-rail-1.0 |
## 📖 Citation
```bibtex
@online{DeBloomzRetv2,
  AUTHOR = {Cyrile Delestre},
  ORGANIZATION = {Cr{\'e}dit Mutuel Ark{\'e}a},
  URL = {https://huggingface.co/cmarkea/bloomz-560m-retriever-v2},
  YEAR = {2024},
  KEYWORDS = {NLP ; Transformers ; LLM ; Bloomz},
}
```