# Model Card for vectorizer.raspberry
This model, developed by Sinequa, is a vectorizer that generates an embedding vector for a given passage or query. Passage vectors are stored in our vector index, and the query vector is used at query time to search for relevant passages in the index.
## Quick Start

This model is ready to use for generating embedding vectors for passages and queries.
## Features
- Multilingual Support: Trained and tested in multiple languages including English, French, German, Spanish, Italian, Dutch, Japanese, Portuguese, and Simplified Chinese. Also offers basic support for 91 additional languages used in the base model's pretraining.
- Efficient Inference: Provides fast inference times on various NVIDIA GPUs with different quantization types and batch sizes.
- Low Memory Consumption: Consumes relatively low GPU memory, with clear details provided for different quantization types.
## Installation

### Requirements
- Minimal Sinequa version: 11.10.0
- Minimal Sinequa version for using FP16 models and GPUs with CUDA compute capability of 8.9+ (like NVIDIA L4): 11.11.0
- CUDA compute capability: above 5.0 (above 6.0 for FP16 use)
## Usage Examples

### Basic Usage

The model can be used to generate embedding vectors for passages and queries. Here is a high-level sketch of how it might be used in Python (the actual implementation may vary based on the integration):
```python
# Hypothetical API sketch: the real integration is handled by the Sinequa platform
from sinequa_vectorizer import Vectorizer

vectorizer = Vectorizer(model_name='vectorizer.raspberry')

passage = "This is a sample passage."
passage_vector = vectorizer.get_vector(passage)  # embedding vector for the passage
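At query time, the query vector is compared against the stored passage vectors to find relevant passages. As an illustration of that retrieval step, here is a minimal sketch using cosine similarity over random stand-in vectors (NumPy only; in practice the search is performed by Sinequa's vector index, and the 256-dimensional vectors come from the model):

```python
import numpy as np

# Random stand-ins for real 256-dimensional embeddings:
# 5 indexed passage vectors and one query vector.
rng = np.random.default_rng(0)
passage_vectors = rng.normal(size=(5, 256))
query_vector = rng.normal(size=256)

# Normalize so the dot product equals cosine similarity
passage_vectors /= np.linalg.norm(passage_vectors, axis=1, keepdims=True)
query_vector /= np.linalg.norm(query_vector)

similarities = passage_vectors @ query_vector
ranking = np.argsort(-similarities)  # most similar passage first
print(ranking)
```

Because all vectors are normalized, the dot product equals cosine similarity, a common scoring choice for dense retrieval.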
## Documentation

### Supported Languages
The model was trained and tested in the following languages:
- English
- French
- German
- Spanish
- Italian
- Dutch
- Japanese
- Portuguese
- Chinese (simplified)
Besides these languages, basic support can be expected for the 91 additional languages that were used during the pretraining of the base model (see Appendix A of the XLM-R paper).
### Scores

| Metric | Value |
|---|---|
| Relevance (Recall@100) | 0.613 |

Note that the relevance score is computed as an average over 14 retrieval datasets (see details below).
### Inference Times

| GPU | Quantization type | Batch size 1 | Batch size 32 |
|---|---|---|---|
| NVIDIA A10 | FP16 | 1 ms | 5 ms |
| NVIDIA A10 | FP32 | 2 ms | 18 ms |
| NVIDIA T4 | FP16 | 1 ms | 12 ms |
| NVIDIA T4 | FP32 | 3 ms | 52 ms |
| NVIDIA L4 | FP16 | 2 ms | 5 ms |
| NVIDIA L4 | FP32 | 4 ms | 24 ms |
### GPU Memory Usage

| Quantization type | Memory |
|---|---|
| FP16 | 550 MiB |
| FP32 | 1050 MiB |

Note that GPU memory usage only includes how much GPU memory the model itself consumes on an NVIDIA T4 GPU with a batch size of 32. It does not include the fixed amount of memory consumed by the ONNX Runtime upon initialization, which can be around 0.5 to 1 GiB depending on the GPU used.
### Model Details

#### Overview

| Property | Details |
|---|---|
| Number of parameters | 107 million |
| Base language model | [mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) (Paper, GitHub) |
| Case and accent sensitivity | Insensitive |
| Output dimensions | 256 (reduced with an additional dense layer) |
| Training procedure | Query-passage-negative triplets for datasets that have mined hard negative data; query-passage pairs for the rest. The number of negatives is augmented with an in-batch negative strategy |
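The in-batch negative strategy pairs each query with its own passage as the positive, while every other passage in the batch serves as an additional negative. The following is a hedged sketch of that idea with random unit vectors and a softmax cross-entropy over the similarity matrix; it is an illustration of the technique, not the actual training code:

```python
import numpy as np

# Random stand-ins for encoded query and passage embeddings (batch of 4)
rng = np.random.default_rng(42)
batch = 4
q = rng.normal(size=(batch, 256))
p = rng.normal(size=(batch, 256))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p /= np.linalg.norm(p, axis=1, keepdims=True)

logits = q @ p.T            # (batch, batch) similarity matrix
labels = np.arange(batch)   # the diagonal holds the positive pairs

# Softmax cross-entropy against the diagonal positives:
# off-diagonal entries act as in-batch negatives.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
loss = -np.log(probs[labels, labels]).mean()
print(loss)
```

This way a batch of N pairs yields N - 1 negatives per query at no extra encoding cost, which is why the strategy is widely used for training dense retrievers.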
#### Training Data

The model has been trained using all datasets that are cited in the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. In addition, this model has been trained on the datasets cited in this paper, covering the 9 aforementioned languages.
### Evaluation Metrics

To determine the relevance score, we averaged the results that we obtained when evaluating on the datasets of the [BEIR benchmark](https://github.com/beir-cellar/beir). Note that all these datasets are in English.
| Dataset | Recall@100 |
|---|---|
| Average | 0.613 |
| Arguana | 0.957 |
| CLIMATE-FEVER | 0.468 |
| DBPedia Entity | 0.377 |
| FEVER | 0.820 |
| FiQA-2018 | 0.639 |
| HotpotQA | 0.560 |
| MS MARCO | 0.845 |
| NFCorpus | 0.287 |
| NQ | 0.756 |
| Quora | 0.992 |
| SCIDOCS | 0.456 |
| SciFact | 0.906 |
| TREC-COVID | 0.100 |
| Webis-Touche-2020 | 0.413 |
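As a quick sanity check, the reported average can be reproduced from the per-dataset scores in the table above:

```python
# Recall@100 values for the 14 BEIR datasets from the table above
scores = {
    "Arguana": 0.957, "CLIMATE-FEVER": 0.468, "DBPedia Entity": 0.377,
    "FEVER": 0.820, "FiQA-2018": 0.639, "HotpotQA": 0.560,
    "MS MARCO": 0.845, "NFCorpus": 0.287, "NQ": 0.756,
    "Quora": 0.992, "SCIDOCS": 0.456, "SciFact": 0.906,
    "TREC-COVID": 0.100, "Webis-Touche-2020": 0.413,
}

# Unweighted mean over the 14 datasets, rounded to three decimals
average = sum(scores.values()) / len(scores)
print(round(average, 3))  # 0.613
```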
We evaluated the model on the datasets of the [MIRACL benchmark](https://github.com/project-miracl/miracl) to test its multilingual capabilities. Note that not all training languages are part of the benchmark, so we only report the metrics for the languages that are.
| Language | Recall@100 |
|---|---|
| French | 0.650 |
| German | 0.528 |
| Spanish | 0.602 |
| Japanese | 0.614 |
| Chinese (simplified) | 0.680 |
## Technical Details

The model is based on the [mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) base model. It has 107 million parameters and is designed to be insensitive to casing and accents. The output dimensions are reduced to 256 using an additional dense layer. The training procedure uses query-passage-negative triplets for datasets with mined hard negative data and query-passage pairs for the rest, and the number of negatives is augmented using an in-batch negative strategy.