LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
LLM2Vec is a simple method to transform decoder-only LLMs into text encoders. It consists of three straightforward steps: enabling bidirectional attention, masked next token prediction (MNTP), and unsupervised contrastive learning. The resulting model can be further fine-tuned to reach state-of-the-art performance.
- Repository: https://github.com/McGill-NLP/llm2vec
- Paper: https://arxiv.org/abs/2404.05961
Quick Start
LLM2Vec offers a convenient way to convert decoder-only LLMs into powerful text encoders. Install the package and start using it by following the steps below.
Features
- Converts decoder-only LLMs into text encoders (a minimal pooling sketch follows this list).
- Consists of three simple steps: bidirectional attention, masked next token prediction, and unsupervised contrastive learning.
- Can be fine-tuned for state-of-the-art performance.
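The core idea is easy to picture: instead of generating the next token, the LLM's token-level hidden states are pooled into a single vector per input text. The sketch below is illustrative only, not the LLM2Vec implementation; it mean-pools the hidden states of an off-the-shelf decoder-only model (gpt2, chosen purely as a small stand-in). LLM2Vec goes further by enabling bidirectional attention and training with MNTP and unsupervised contrastive learning.

```python
# Illustrative sketch only (not the LLM2Vec implementation): mean-pool the
# hidden states of a decoder-only LM into one fixed-size embedding per text.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

texts = ["how much protein should a female eat", "summit define"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Average only over real tokens, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```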
Installation
```bash
pip install llm2vec
```
Usage Examples
Basic Usage
```python
from llm2vec import LLM2Vec

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
# Load the tokenizer, config, and base model. trust_remote_code loads the
# model's custom code (which enables bidirectional attention) from the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp"
)
config = AutoConfig.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

# Load the MNTP LoRA weights on top of the base model.
model = PeftModel.from_pretrained(
    model,
    "McGill-NLP/LLM2Vec-Meta-Llama-31-8B-Instruct-mntp",
)

# Wrap the model and tokenizer; embeddings are mean-pooled over tokens.
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)

# Encode queries: each entry is an [instruction, query] pair.
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encode documents; no instruction is needed for documents.
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute pairwise cosine similarities between query and document embeddings.
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))
print(cos_sim)
"""
tensor([[0.7724, 0.5563],
        [0.4845, 0.5003]])
"""
```
Documentation
If you have any questions about the code, feel free to email Parishad (parishad.behnamghader@mila.quebec) and Vaibhav (vaibhav.adlakha@mila.quebec).
License
This project is licensed under the MIT license.