E5-base-multilingual-4096: Open-Source Multilingual Text Embedding Model for Inputs of up to 4096 Tokens

E5 Base Multilingual 4096

Developed by efederici

E5-base-multilingual-4096 is the Local-Sparse-Global version of intfloat/multilingual-e5-base, a multilingual text embedding model. It can process inputs of up to 4096 tokens.

Text Embedding

Transformers

Supports Multiple Languages #Multilingual Text Embedding #Long Text Processing #Cross-Language Retrieval

Downloads 340

Release Time: 6/15/2023

Model Overview

This is a multilingual text embedding model for sentence-similarity tasks: it maps text in any of the supported languages to an embedding vector that can be compared with other embeddings.

Model Features

Multilingual Support

Supports text embedding for more than 90 languages (see the supported-language list below), including major world languages and many lower-resource languages.

Long Text Processing

Capable of processing long texts up to 4096 tokens, suitable for handling lengthy documents and paragraphs.
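
For documents that exceed the 4096-token limit, one common workaround is to split the token sequence into windows and embed each window separately. The sketch below is only an illustration of that idea: `chunk_token_ids` is a hypothetical helper (not part of this model or of transformers), the overlap value is an arbitrary choice, and the caller is assumed to leave room for special tokens and any text prefix when picking `max_tokens`.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int], max_tokens: int = 4096, overlap: int = 0
) -> List[List[int]]:
    """Split a token-ID sequence into windows of at most max_tokens tokens.

    Hypothetical helper for illustration; consecutive windows share
    `overlap` tokens so that context is not cut abruptly at a boundary.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once the current window already reaches the end of the input
        if start + max_tokens >= len(token_ids):
            break
    return chunks


# Toy input: 10,000 fake token IDs, windows of 4096 with 96 tokens of overlap
ids = list(range(10_000))
windows = chunk_token_ids(ids, max_tokens=4096, overlap=96)
print([len(w) for w in windows])  # [4096, 4096, 2000]
```

Each window could then be embedded on its own and the resulting vectors averaged or searched individually; which strategy works better depends on the application.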

High-Quality Embeddings

Generates text embedding vectors using the weakly supervised contrastive pre-training approach of the E5 models (Wang et al., 2022; see Citation).

Model Capabilities

Multilingual Text Embedding

Sentence Similarity Calculation

Cross-Language Information Retrieval

Use Cases

Information Retrieval

Cross-Language Document Retrieval

This model can be used to retrieve documents in different languages that have similar content.

Intended effect: more accurate and efficient cross-language retrieval.
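
Retrieval with an embedding model usually comes down to ranking documents by the similarity of their vectors to the query vector. The following minimal sketch shows that ranking step with cosine similarity. The three-dimensional vectors are toy stand-ins for real embeddings, and `cosine_similarity` and `rank_documents` are hypothetical helper names, not part of the model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of a query and three documents
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
print([i for i, _ in ranking])  # [1, 2, 0]
```

Because the ranking depends only on the vectors, the query and the documents can be in different languages as long as they were embedded by the same model.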

Question Answering Systems

Multilingual Question Answering

Build a question-answering system that supports multiple languages, capable of understanding queries in different languages and returning relevant answers.

Intended effect: broader language coverage for question-answering systems.

🚀 E5-base-multilingual-4096

This is the Local-Sparse-Global version of intfloat/multilingual-e5-base. It can handle up to 4096 tokens, which is useful for sentence-similarity tasks on long inputs.

🚀 Quick Start

Supported Languages

The model supports a wide range of languages, including:

multilingual, af, am, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fr, fy, ga, gd, gl, gu, ha, he, hi, hr, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lo, lt, lv, mg, mk, ml, mn, mr, ms, my, ne, nl, no, om, or, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, so, sq, sr, su, sv, sw, ta, te, th, tl, tr, ug, uk, ur, uz, vi, xh, yi, zh

Pipeline Tag

The pipeline tag of this model is sentence-similarity.

💻 Usage Examples

Basic Usage

Below is an example that encodes queries and passages from the MS MARCO passage ranking dataset.

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

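def average_pool(
    last_hidden_states: Tensor,
    attention_mask: Tensor
) -> Tensor:
    """Mean-pool token states over non-padding positions."""
    # Zero out padding positions so they do not contribute to the sum
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    # Divide by the number of real (non-padding) tokens in each sequence
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]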

# Each input text carries a "query: " or "passage: " prefix
input_texts = [
  'query: how much protein should a female eat',
  'query: summit define',
  "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
  "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]

tokenizer = AutoTokenizer.from_pretrained('efederici/e5-base-multilingual-4096')
model = AutoModel.from_pretrained('efederici/e5-base-multilingual-4096', trust_remote_code=True)

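# Tokenize the batch, truncating any text longer than 4096 tokens
batch_dict = tokenizer(input_texts, max_length=4096, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Pool token states into one embedding vector per input text
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of each query (rows) to each passage (columns), scaled by 100
scores = (embeddings[:2] @ embeddings[2:].T) * 100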

print(scores.tolist())
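
To make the pooling step above more concrete, here is a dependency-free sketch of what `average_pool` computes for a single sequence: token vectors at padding positions (mask value 0) are ignored, and the remaining vectors are averaged. The function name `masked_mean` and the toy numbers are assumptions chosen for illustration; the real code operates on batched tensors.

```python
from typing import List


def masked_mean(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1; skip padding (0)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# One sequence with 4 positions and 2-dimensional hidden states;
# the last two positions are padding and must not affect the result.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]
attention_mask = [1, 1, 0, 0]
pooled = masked_mean(hidden_states, attention_mask)
print(pooled)  # [2.0, 3.0]
```

Without the mask, the large padding values would dominate the average, which is why the tensor version zeroes them out and divides by the count of real tokens.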

📚 Documentation

Citation

@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
