🚀 RetrievaEmbedding-01: AMBER
AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model developed by Retrieva, Inc. It is primarily tailored for Japanese text but also supports English. The model is trained on a diverse mix of Japanese and English datasets and has 315M parameters, placing it in the large-size category.
🚀 Quick Start
Install the Required Libraries
First, install the necessary Python libraries using pip:

```bash
pip install sentence-transformers sentencepiece
```
Run Inference
You can load the model and perform inference. A prompt can be specified at inference time by passing the `prompt_name` argument to `model.encode`. The prompts used in the Japanese benchmark are defined in `jmteb/tasks`, and those used in the English benchmark are in `mteb/models/retrieva_en.py`.
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("retrieva-jp/amber-large")

# Japanese queries: "What is natural language processing?" and "Tell me about Retrieva, Inc."
queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
# Japanese documents: a definition of NLP and a short description of Retrieva, Inc.
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their task-specific prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Compute the query-document cosine similarity matrix
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
```
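The similarity matrix has shape `(len(queries), len(documents))`, so the highest-scoring document for each query can be read off with a row-wise argmax. A minimal sketch continuing from the code above:

```python
# Each row of `similarities` scores one query against every document;
# the argmax along a row therefore picks that query's best match.
best = similarities.argmax(dim=1)
for i, query in enumerate(queries):
    j = best[i].item()
    print(f"query {i} -> document {j} (score = {similarities[i, j].item():.3f})")
```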
✨ Features
- Multilingual Support: Primarily designed for Japanese, with optional support for English.
- Prompt-Based Inference: Allows specifying prompts at inference time for task-specific embeddings (the configured prompt names can be inspected as shown in the snippet after this list).
- Sentence Transformer: Based on the Sentence Transformers architecture, enabling effective sentence-level embeddings.
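The prompt names accepted by `prompt_name` come from the model's Sentence Transformers configuration and can be inspected on the loaded model. A minimal sketch; the exact names available depend on the published configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

# model.prompts maps each configured prompt name (e.g. "Retrieval-query")
# to the instruction text that is prepended to the input before encoding.
for name, text in model.prompts.items():
    print(f"{name}: {text!r}")
```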
📦 Installation
To use the model, you need to install the required Python libraries. Run the following command:
```bash
pip install sentence-transformers sentencepiece
```
💻 Usage Examples
Basic Usage
The following code demonstrates how to load the model and perform basic inference:
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("retrieva-jp/amber-large")

queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their task-specific prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Compute the query-document cosine similarity matrix
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
```
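For retrieval over a larger document collection, the embeddings can also be passed to the `semantic_search` utility bundled with Sentence Transformers, which returns the top-k documents per query. A minimal sketch continuing from the example above (the `top_k` value is arbitrary):

```python
from sentence_transformers import util

# Re-encode as tensors so the search utility can score them directly
query_emb = model.encode(queries, prompt_name="Retrieval-query", convert_to_tensor=True)
doc_emb = model.encode(documents, prompt_name="Retrieval-passage", convert_to_tensor=True)

# For each query, returns a ranked list of {"corpus_id": ..., "score": ...}
hits = util.semantic_search(query_emb, doc_emb, top_k=2)
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print(f"  document {hit['corpus_id']}: score = {hit['score']:.3f}")
```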
📚 Documentation
Model Details
- Developed by: Retrieva, Inc.
- Model type: Based on the ModernBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
- Finetuned from model: [sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity (see the snippet after this list for a quick check of these properties)
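These properties can be checked directly on the loaded model; a minimal sketch using standard SentenceTransformer attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 768

emb = model.encode(["テスト文です。"], prompt_name="Retrieval-query")
print(emb.shape)                                 # expected: (1, 768)
```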
Training Details
Training Data
Multiple datasets were used for training. For Japanese, datasets were selected from [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval), [llm-japanese-dataset](https://github.com/masanorihirano/llm-japanese-dataset), and hpprc/emb. For English, the training data mainly consists of a subset of the datasets from Asai et al. (2023), along with partial data from [the sentence-transformers repository](https://huggingface.co/sentence-transformers) and kilt-tasks. Japanese-English translation datasets were also used to cover cross-lingual aspects, and synthetic data generated with an LLM was added to ensure sufficient Japanese training data.
Evaluation
The model was evaluated on the following benchmarks:
Japanese Benchmark: JMTEB
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| **base models** | < 300M | | | | | | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| **large models** | 300M < | | | | | | | | |
| AMBER-large (this model) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
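The two aggregate columns seem to differ only in how scores are averaged: Mean (TaskType) matches the unweighted mean of the six task-type columns, whereas Mean (Task) averages over the individual JMTEB tasks (this reading is inferred from the numbers rather than stated explicitly here). A quick arithmetic check for the AMBER-large row:

```python
# Per-task-type scores for AMBER-large, copied from the table above
scores = {
    "Retrieval": 75.40,
    "STS": 79.32,
    "Classification": 77.14,
    "Reranking": 93.54,
    "Clustering": 48.73,
    "PairClassification": 60.97,
}

mean_task_type = sum(scores.values()) / len(scores)
print(round(mean_task_type, 2))  # 72.52, matching the Mean (TaskType) column
```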
Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| **base models** | < 300M | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 58.4 | 83.3 | 32.77 |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 57.1 | 81.6 | 35.69 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 60.6 | 85.3 | 33.99 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 47.1 | 85.3 | 25.46 |
| **large models** | 300M < | | | |
| AMBER-large (this model) | 315M | 62.5 | 82.4 | 34.57 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 62.8 | 82.5 | 34.78 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 55.4 | 87.3 | 29.95 |
English Benchmark: MTEB(eng, v2)
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| **base models** | < 300M | | | | | | | | | |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| **large models** | 300M < | | | | | | | | | |
| AMBER-large (this model) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |
🔧 Technical Details
The AMBER model is based on [sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m), a ModernBERT model designed for Japanese text. During training, natural-language prompts were incorporated, enabling the model to generate embeddings tailored to specific tasks.
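Because the prompt becomes part of the model input, the same text encoded under different prompt names produces different embeddings. A minimal sketch illustrating this with the prompt names from the retrieval example above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

text = "株式会社レトリバについて教えて"  # "Tell me about Retrieva, Inc."

# Encode the same text once as a query and once as a passage
as_query = model.encode([text], prompt_name="Retrieval-query")
as_passage = model.encode([text], prompt_name="Retrieval-passage")

# The prompt changes the embedding, so the similarity is generally below 1.0
print(model.similarity(as_query, as_passage))
```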
📄 License
This model is licensed under the Apache 2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{amber2025,
  title = {インストラクションと複数タスクを利用した日本語向け分散表現モデルの構築},
  author = {勝又智 and 木村大翼 and 西鳥羽二郎},
  booktitle = {言語処理学会第31回年次大会発表論文集},
  year = {2025},
}
```
More Information
For more details, refer to this link (in Japanese).
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
Contact via pr[at]retrieva.jp