Amber Base
🚀 RetrievaEmbedding-01: AMBER
AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model developed by Retrieva, Inc. It is designed primarily for Japanese text and also supports English. The model was trained on a diverse range of Japanese and English datasets. With 132M parameters (base size), it is well equipped for a variety of text-related tasks.
🚀 Quick Start
Install Library
First, install the necessary Python libraries using `pip`:

```bash
pip install sentence-transformers sentencepiece
```
Run Inference
After installation, you can load the model and perform inference. You can specify the prompt at inference time by passing an argument called `prompt` to `model.encode`. The prompts used in the Japanese benchmark are described in `jmteb/tasks`, and those used in the English benchmark are described in `mteb/models/retrieva_en.py`.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-base")

# Run inference
queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their respective retrieval prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between each query and each document
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
# => torch.Size([2, 2])
```
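On top of the similarity matrix, a simple ranking step picks the best document for each query. A minimal sketch, assuming the variables from the example above (`model.similarity` returns a PyTorch tensor):

```python
import torch

# similarities has shape (num_queries, num_documents); higher means more similar.
scores = model.similarity(queries_embeddings, documents_embeddings)
best_doc_idx = torch.argmax(scores, dim=1)  # best-matching document index per query
for query, idx in zip(queries, best_doc_idx.tolist()):
    print(query, "->", documents[idx])
```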
✨ Features
- Multilingual Support: Primarily designed for Japanese, with additional support for English.
- Adaptive Embeddings: Natural-language prompts were included during training, enabling the model to generate task-specific embeddings (see the sketch after this list).
- Versatile Usage: Can be used for various NLP tasks such as retrieval, classification, and more.
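Because the task prompts ship with the model configuration, you can inspect which prompt names are available before encoding. A minimal sketch using the standard sentence-transformers attributes for prompt-configured models:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

# Prompts loaded from config_sentence_transformers.json: {prompt_name: prompt_text}
print(model.prompts)
# Default prompt name used when encode() is called without prompt/prompt_name (may be None)
print(model.default_prompt_name)
```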
📦 Installation
Install the required Python libraries with the following command:

```bash
pip install sentence-transformers sentencepiece
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-base")

queries = ["自然言語処理とはなんですか?"]
documents = ["自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。"]

# Encode the query and the document with their respective retrieval prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between the query and the document
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities)
```
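If you prefer not to rely on the prompt names stored with the model, `model.encode` also accepts the prompt text directly via the `prompt` argument. A minimal sketch, assuming the model is loaded as above; the prompt string below is an illustrative placeholder, the actual prompts AMBER was trained with are listed in `config_sentence_transformers.json`:

```python
# Sketch: passing the prompt text directly instead of a stored prompt_name.
# NOTE: the prompt string is a hypothetical placeholder for illustration;
# use the prompts from config_sentence_transformers.json for best results.
query_embeddings = model.encode(
    ["自然言語処理とはなんですか?"],
    prompt="次の質問に関連する文書を検索してください: ",  # hypothetical prompt text
)
print(query_embeddings.shape)
```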
📚 Documentation
Model Details
- Developed by: Retrieva, Inc.
- Model type: Based on the ModernBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
- Finetuned from model:
sbintuitions/modernbert-ja-130m
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 512 dimensions
- Similarity Function: Cosine Similarity
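These properties can be checked directly on the loaded model. A minimal sketch; the attribute names follow the standard sentence-transformers API, and the expected values in the comments come from the list above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

print(model.max_seq_length)                       # expected: 512 (tokens)
print(model.get_sentence_embedding_dimension())   # expected: 512 (output dimensions)
print(model.similarity_fn_name)                   # expected: "cosine"
```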
Training Details
Training Data
For Japanese, data was selected from [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval), [llm-japanese-dataset](https://github.com/masanorihirano/llm-japanese-dataset), and hpprc/emb. For English, datasets used in Asai et al. (2023) were mainly used, supplemented by English datasets from [the sentence-transformers repository](https://huggingface.co/sentence-transformers) and kilt-tasks. To cover cross-lingual aspects between Japanese and English, translation datasets between the two languages were also utilized. For Japanese, synthetic data created by an LLM was used to ensure a sufficient amount of training data.
Evaluation
The model was evaluated on the following benchmarks:
- Japanese Benchmark: JMTEB
- Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
- English Benchmark: [MTEB(eng, v2)](https://github.com/embeddings-benchmark/mteb)
Japanese Benchmark: JMTEB
The `Mean (TaskType)` column in the following leaderboard is equivalent to `Avg.` in the original JMTEB leaderboard. The evaluation files are stored in the `jmteb` directory.
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| base models | < 300M | | | | | | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| AMBER-base (this model) | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| large models | > 300M | | | | | | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
The evaluation files for MLDR are stored in the `mldr` directory. The prompts used in JQaRA and JaCWIR are `Retrieval-query` and `Retrieval-passage` as described in `config_sentence_transformers.json`.
| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| base models | < 300M | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 58.4 | 83.3 | 32.77 |
| AMBER-base (this model) | 130M | 57.1 | 81.6 | 35.69 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 60.6 | 85.3 | 33.99 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 47.1 | 85.3 | 25.46 |
| large models | > 300M | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 62.5 | 82.4 | 34.57 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 62.8 | 82.5 | 34.78 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 55.4 | 87.3 | 29.95 |
English Benchmark: MTEB(eng, v2)
The evaluation files are stored in the `mteb` directory.
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| base models | < 300M | | | | | | | | | |
| AMBER-base (this model) | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| large models | > 300M | | | | | | | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |
🔧 Technical Details
The AMBER model is based on the [sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m) architecture. Natural-language prompts were incorporated during training, enabling the model to generate embeddings tailored to specific tasks. This approach allows the model to adapt well to different NLP tasks, whether in Japanese or English.
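To see the effect of prompt conditioning, you can embed the same sentence under two different prompt names and compare the resulting vectors. A minimal sketch, assuming the model is loaded as in the usage examples; the exact similarity value will vary:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

text = ["自然言語処理とはなんですか?"]
as_query = model.encode(text, prompt_name="Retrieval-query")
as_passage = model.encode(text, prompt_name="Retrieval-passage")

# The two embeddings differ because the prompt changes the input the model sees.
print(model.similarity(as_query, as_passage))
```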
📄 License
The AMBER model is licensed under the Apache 2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{amber2025,
  title = {インストラクションと複数タスクを利用した日本語向け分散表現モデルの構築},
  author = {勝又智 and 木村大翼 and 西鳥羽二郎},
  booktitle = {言語処理学会第31回年次大会発表論文集},
  year = {2025},
}
```
More Information
For more information, please visit https://note.com/retrieva/n/n4ee9d304f44d (in Japanese).
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
You can contact the model card maintainers at pr[at]retrieva.jp





