FRIDA is a general-purpose text embedding model based on the T5 denoising architecture and fine-tuned with full parameters. It is primarily used for feature extraction and semantic understanding tasks in Russian and English texts.
Model Features
Bilingual support
Supports Russian and English text processing, suitable for bilingual application scenarios.
Multi-task prefixes
Provides multiple prefix options for different task scenarios, such as retrieval, paraphrasing, classification, etc.
GGUF format
Offers GGUF format models for easy deployment and use in local environments.
Model Capabilities
Text feature extraction
Semantic similarity calculation
Text retrieval
Text classification
Sentiment analysis
Topic clustering
Use Cases
Information retrieval
Answer retrieval
Use the 'search_query:' and 'search_document:' prefixes to match questions to relevant answers or passages.
Text similarity
Semantic similarity calculation
Use 'paraphrase:' prefix to calculate semantic similarity between texts.
Text classification
Sentiment analysis
Use 'categorize_sentiment:' prefix for text sentiment analysis.
Topic classification
Use 'categorize_topic:' prefix for text topic classification.
🚀 Model Card for FRIDA GGUF
FRIDA is a fully fine-tuned general-purpose text embedding model. Inspired by the T5 denoising architecture, it builds on the encoder part of the FRED-T5 model and continues the research on text embedding models such as ruMTEB and [ru-en-RoSBERTa](https://huggingface.co/ai-forever/ru-en-RoSBERTa). It was pre-trained on a Russian-English dataset and fine-tuned for better performance on the target task.
For more model details, please refer to our technical report [TODO].
🚀 Quick Start
The model can be used with prefixes, and it's recommended to use CLS pooling. The choice of prefix and pooling depends on the task.
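As a minimal sketch of what "prefix plus CLS pooling" means in practice, the snippet below encodes one prefixed sentence with the Transformers library. It assumes the upstream PyTorch checkpoint ai-forever/FRIDA; the GGUF build in this repository is instead served through Ollama, as shown in the usage examples further down.
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumption: upstream PyTorch weights at ai-forever/FRIDA (the GGUF build is used via Ollama instead).
tokenizer = AutoTokenizer.from_pretrained("ai-forever/FRIDA")
model = T5EncoderModel.from_pretrained("ai-forever/FRIDA")

text = "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?"
batch = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # shape: (1, seq_len, hidden_dim)
embedding = torch.nn.functional.normalize(hidden[:, 0], dim=-1)  # CLS pooling: take the first token
print(embedding.shape)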
✨ Features
Versatile Prefix Usage: Different prefixes are available for various tasks, such as retrieval, paraphrasing, categorization, and more.
Fine-Tuning Capability: Can be fine-tuned with high-quality Russian and English datasets to better suit specific needs.
📦 Installation
Ollama
ollama pull evilfreelancer/FRIDA:f16
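After pulling, you can verify from Python that the model is registered in the local Ollama instance. This sketch assumes Ollama is running on its default port 11434 and that the /api/tags endpoint lists local models under a "models" key.
import requests

# List models known to the local Ollama server and look for the FRIDA GGUF build.
tags = requests.get("http://localhost:11434/api/tags").json()
names = [m["name"] for m in tags.get("models", [])]
print([n for n in names if "FRIDA" in n])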
💻 Usage Examples
Basic Usage
We use the following basic rules to choose a prefix (a small helper sketch applying them follows the list):
"search_query: " and "search_document: " prefixes are for answer or relevant paragraph retrieval
"paraphrase: " prefix is for symmetric paraphrasing related tasks (STS, paraphrase mining, deduplication)
"categorize: " prefix is for asymmetric matching of document title and body (e.g. news, scientific papers, social posts)
"categorize_sentiment: " prefix is for any tasks that rely on sentiment features (e.g. hate, toxic, emotion)
"categorize_topic: " prefix is intended for tasks where you need to group texts by topic
"categorize_entailment: " prefix is for textual entailment task (NLI)
Advanced Usage
Below is an example of encoding texts through the Ollama embeddings REST API using the requests library; a SentenceTransformers sketch follows it.
import json

import numpy as np
import requests

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "evilfreelancer/FRIDA:f16"


def get_embedding(text):
    # Request an embedding for a single (already prefixed) text from the Ollama API.
    payload = {"model": MODEL_NAME, "input": text}
    response = requests.post(
        f"{OLLAMA_HOST}/api/embed",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return np.array(response.json()["embeddings"][0])


def normalize(vectors):
    # L2-normalize each row; guard against division by zero.
    vectors = np.atleast_2d(vectors)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return vectors / norms


def cosine_diag_similarity(a, b):
    # Row-wise cosine similarity of two matrices of unit-length vectors.
    return np.sum(a * b, axis=1)


inputs = [
    # "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
    "categorize_entailment: Женщину доставили в больницу, за ее жизнь сейчас борются врачи.",
    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
    # "paraphrase: Ярославским баням разрешили работать без посетителей",
    "categorize_entailment: Женщину спасают врачи.",
    "search_document: Чтобы вкрутить лампочку нужно три программиста.",
]

size = len(inputs) // 2
embeddings = normalize(np.array([get_embedding(text) for text in inputs]))
sim_scores = cosine_diag_similarity(embeddings[:size], embeddings[size:])
print(sim_scores.tolist())
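For comparison, here is a minimal SentenceTransformers sketch of the same prefix-based encoding. It assumes the upstream PyTorch checkpoint ai-forever/FRIDA (not the GGUF build) and that the sentence-transformers package is installed; the pooling mode (CLS) is taken from the model's own configuration.
from sentence_transformers import SentenceTransformer

# Assumption: upstream PyTorch weights; prefixes are prepended manually, exactly as in the Ollama example.
model = SentenceTransformer("ai-forever/FRIDA")

query = model.encode(
    ["search_query: Сколько программистов нужно, чтобы вкрутить лампочку?"],
    normalize_embeddings=True,
)
document = model.encode(
    ["search_document: Чтобы вкрутить лампочку нужно три программиста."],
    normalize_embeddings=True,
)
print((query @ document.T).item())  # cosine similarity of the normalized embeddings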
📚 Documentation
To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.
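As a starting point, a contrastive fine-tuning run on query–document pairs can be sketched with the sentence-transformers training API. The checkpoint name, the toy pair, and all hyperparameters below are placeholders rather than a recommended recipe, and the GGUF file itself is not trainable.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Assumption: fine-tuning the upstream PyTorch checkpoint, not the GGUF build.
model = SentenceTransformer("ai-forever/FRIDA")

# Toy (query, relevant document) pair; real training needs a large, high-quality dataset.
train_examples = [
    InputExample(texts=[
        "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
        "search_document: Чтобы вкрутить лампочку нужно три программиста.",
    ]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
model.save("frida-finetuned")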
🔧 Technical Details
The model is based on the encoder part of the FRED-T5 model and continues research on text embedding models. It was pre-trained on a Russian-English dataset and fine-tuned for improved performance on the target task.