
M2 BERT 8k Retrieval Encoder V1

Developed by hazyresearch
M2-BERT-8K is an 80-million-parameter long-context retrieval model based on the architecture proposed in the paper 'Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT'.
Downloads: 52
Release date: 5/22/2024

Model Overview

M2-BERT-8K is a BERT variant designed specifically for long-context retrieval: it supports input sequences of up to 8192 tokens and produces 768-dimensional embedding vectors for retrieval tasks.

Model Features

Long Context Support
Supports sequences up to 8192 tokens, making it suitable for long-document retrieval tasks.
Efficient Retrieval
Generates 768-dimensional embedding vectors optimized for retrieval efficiency.
Custom Architecture
A BERT variant that swaps standard Transformer mixing layers for sub-quadratic Monarch Mixer layers.
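Even with an 8192-token window, documents longer than the window must be split before embedding. A minimal chunking sketch in pure Python (the function name, the whitespace-token representation, and the overlap value are illustrative choices, not part of the model's API):

```python
def chunk_tokens(tokens, max_len=8192, overlap=256):
    """Split a token list into windows of at most max_len tokens,
    with `overlap` tokens shared between consecutive windows so that
    passages straddling a boundary are not lost."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# Example: a 20000-"token" document against the 8192-token window.
doc = [f"tok{i}" for i in range(20000)]
windows = chunk_tokens(doc)
```

Each window would then be embedded separately, with the overlap guarding against relevant passages being cut at a chunk boundary.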

Model Capabilities

Text Embedding Generation
Long Document Retrieval
Masked Language Modeling

Use Cases

Information Retrieval
Document Retrieval System
Building a retrieval system that supports long documents.
Capable of effectively processing documents up to 8192 tokens in length.
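Once documents are embedded, retrieval reduces to comparing 768-dimensional vectors, typically by cosine similarity. A self-contained sketch using NumPy (the embeddings here are random placeholders standing in for M2-BERT-8K outputs; `top_k` is an illustrative helper, not part of any library API):

```python
import numpy as np

def top_k(query_emb, doc_embs, k=3):
    """Rank documents by cosine similarity to the query embedding.
    query_emb: shape (768,); doc_embs: shape (n_docs, 768)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarities
    order = np.argsort(scores)[::-1][:k]  # indices of best matches
    return order, scores[order]

# Placeholder corpus: 10 random 768-dim "document embeddings".
rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 768))
# A query that is a near-duplicate of document 4.
query = docs[4] + 0.01 * rng.normal(size=768)
idx, scores = top_k(query, docs)
```

In a real system the same model embeds both queries and documents, and the document vectors are usually precomputed and stored in a vector index.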