🚀 Sentence Transformers for RAG and More
This README describes testing various models with AnythingLLM (ALLM) using LM Studio as a server, and offers guidance on using embedders for Retrieval-Augmented Generation (RAG). It also includes tips on document preparation, system prompts, and more.
🚀 Quick Start
All models have been tested with ALLM using LM Studio as the server. They should also work with Ollama; the setup for local documents is almost the same. GPT4All offers only one embedder (nomic), and koboldcpp support is still under development.
⚠️ Important Note
Sometimes, the results are more accurate when the "chat with document only" option is used. Also keep in mind that the embedder is just one part of a good RAG system.
✨ Features
Model Impressions
Models such as nomic-embed-text (up to 2048t context length), mxbai-embed-large, mug-b-1.6, snowflake-arctic-embed-l-v2.0 (up to 8192t context length), Ger-RAG-BGE-M3 (German, up to 8192t context length), german-roberta, and bge-m3 (up to 8192t context length) work well. Other models' performance may vary.
Similarity of Embedders
With the same settings, these embedders find 6-7 of the same 10 snippets from a book, meaning only 3-4 snippets differ between them.
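This overlap can be measured mechanically: given the top-10 snippet IDs returned by two embedders for the same query, count the shared IDs. A minimal sketch (the snippet IDs below are made-up placeholders, not real retrieval results):

```python
def snippet_overlap(top_a, top_b):
    """Count how many retrieved snippet IDs two embedders share."""
    return len(set(top_a) & set(top_b))

# Hypothetical top-10 results from two different embedders for one query.
embedder_a = [3, 7, 12, 18, 25, 31, 44, 52, 60, 71]
embedder_b = [3, 7, 12, 18, 25, 31, 44, 90, 91, 92]

shared = snippet_overlap(embedder_a, embedder_b)
print(f"{shared} of 10 snippets are identical")  # 7 of 10, in the observed 6-7 range
```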
💻 Usage Examples
Using with Large Context
Set the main model's context length (Max Tokens) to 16000t, set the embedder model's (Max Embedding Chunk Length) to 1024t, and set (Max Context Snippets) to 14. In ALLM, also set (Text Splitting & Chunking Preferences - Text Chunk Size) to 1024-character parts and (Search Preference) to "accuracy".
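These numbers have to fit together: 14 snippets of up to 1024 tokens each already consume most of a 16000-token context window. A quick sanity check (token counts are rough; real tokenizers vary, and the answer also shares the window):

```python
MAX_CONTEXT = 16000   # main model context length (Max Tokens)
CHUNK_TOKENS = 1024   # Max Embedding Chunk Length
MAX_SNIPPETS = 14     # Max Context Snippets

snippet_budget = CHUNK_TOKENS * MAX_SNIPPETS  # worst-case tokens used by snippets
remaining = MAX_CONTEXT - snippet_budget      # left for system prompt, question, answer

print(f"snippets: up to {snippet_budget}t, remaining: {remaining}t")
# snippets: up to 14336t, remaining: 1664t
```

If the remaining budget is too small for your system prompt and expected answer, reduce either the snippet count or the chunk length.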
Understanding Embedding and Search
When you ask a question about a document, the system searches for keywords or semantically similar terms. If it finds relevant terms, it cuts out a 1024-token text snippet around them and passes it to the model for the answer.
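The retrieval step can be sketched with toy bag-of-words vectors standing in for a real embedder (a production system would use one of the models listed below; the example texts here are invented):

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a real embedder: bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "The treaty was signed in 1648 after long negotiations.",
    "Cooking pasta requires salted boiling water.",
    "Negotiations over the treaty lasted several years.",
]
query = "When was the treaty signed?"

q = embed(query)
ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
print(ranked[0])  # the chunk sharing the most query terms ranks first
```

A real embedder replaces the word counts with dense vectors, so "treaty" would also match semantically related terms like "agreement".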
💡 Usage Tip
- If you expect multiple matches in your docs, try 16 or more snippets; if you expect only 2, don't request more.
- A chunk length of ~1024t gives more context per snippet, while ~256t gives more isolated facts but takes longer because there are more chunks to search.
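The chunk-size trade-off above can be seen directly: the same document yields far more chunks at ~256 characters than at ~1024, which means more embeddings to compute and search. A naive character-based splitter (ALLM's splitter handles boundaries more carefully than this sketch):

```python
def chunk_text(text, size):
    """Naive fixed-size chunking; real splitters respect sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

document = "x" * 10240  # stand-in for a 10,240-character document

print(len(chunk_text(document, 1024)))  # 10 chunks
print(len(chunk_text(document, 256)))   # 40 chunks
```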
📚 Documentation
Main Model Importance
The main model is crucial, especially for handling long context. Some models degrade even with relatively small inputs, while well-developed ones hold up.
System Prompts
System prompts can significantly influence the output. Here are some examples:
- "You are a helpful assistant who provides an overview of ... under the aspects of ... . You use attached excerpts from the collection to generate your answers! Weight each individual excerpt in order, with the most important excerpts at the top and the less important ones further down. The context of the entire article should not be given too much weight. Answer the user's question! After your answer, briefly explain why you included excerpts (1 to X) in your response and justify briefly if you considered some of them unimportant!"
- "You are an imaginative storyteller who crafts compelling narratives with depth, creativity, and coherence. Your goal is to develop rich, engaging stories that captivate readers, staying true to the themes, tone, and style appropriate for the given prompt. You use attached excerpts from the collection to generate your answers! When generating stories, ensure the coherence in characters, setting, and plot progression. Be creative and introduce imaginative twists and unique perspectives."
- "You are a warm and engaging companion who loves to talk about cooking, recipes, and the joy of food. Your aim is to share delicious recipes, cooking tips, and the stories behind different food cultures in a personal, welcoming, and knowledgeable way."
Document Preparation
Prepare your DOC/PDF documents carefully: bad input leads to bad output. Python-based PDF parsers such as pdfplumber, fitz/PyMuPDF, and Camelot work well for simple text and table conversion.
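A minimal pdfplumber sketch for turning a PDF into plain text before indexing (`report.pdf` and `report.txt` are placeholder names; install with `pip install pdfplumber` first). The whitespace clean-up step is an extra suggestion here, not part of pdfplumber itself:

```python
import re

def clean(text):
    """Collapse the stray line breaks and double spaces PDF extraction often leaves."""
    return re.sub(r"\s+", " ", text).strip()

def pdf_to_txt(pdf_path, txt_path):
    """Extract each page's text and write one cleaned paragraph per page."""
    import pdfplumber  # third-party: pip install pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write("\n\n".join(clean(p) for p in pages))

# pdf_to_txt("report.pdf", "report.txt")  # placeholder filenames

print(clean("Bad   input\nleads  to\nbad output"))  # Bad input leads to bad output
```

For tables, `page.extract_tables()` in pdfplumber returns row lists that you can flatten into text yourself.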
Indexing Option
For fast search across thousands of PDFs, you can use JabRef (https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file) or DocFetcher (https://docfetcher.sourceforge.io/en/index.html).
📄 License
All licenses and terms of use remain with the models' original authors.
List of Models
- avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)
- maidalun1020/bce-embedding-base_v1 (English and Chinese)
- maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese, and Korean)
- BAAI/bge-reranker-v2-m3 (English and Chinese)
- BAAI/bge-reranker-v2-gemma (English and Chinese)
- BAAI/bge-m3 (English and Chinese)
- avsolatorio/GIST-large-Embedding-v0 (English)
- ibm-granite/granite-embedding-278m-multilingual (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese)
- ibm-granite/granite-embedding-125m-english
- Labib11/MUG-B-1.6 (?)
- mixedbread-ai/mxbai-embed-large-v1 (multi)
- nomic-ai/nomic-embed-text-v1.5 (English, multi)
- Snowflake/snowflake-arctic-embed-l-v2.0 (English, multi)
- intfloat/multilingual-e5-large-instruct (100 languages)
- T-Systems-onsite/german-roberta-sentence-transformer-v2
- mixedbread-ai/mxbai-embed-2d-large-v1
- jinaai/jina-embeddings-v2-base-en