🚀 RetrievaEmbedding-01: AMBER
AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model developed by Retrieva, Inc. It is primarily tailored for Japanese text but also supports English. The model is trained on a diverse mix of Japanese and English datasets and has 315M parameters, placing it in the large-size category.
🚀 Quick Start
Install the Required Libraries
First, install the necessary Python libraries using pip:

```bash
pip install sentence-transformers sentencepiece
```
Run Inference
You can load the model and perform inference. A prompt can be specified at inference time by passing the `prompt_name` argument to `model.encode`. The prompts used in the Japanese benchmark are defined in `jmteb/tasks`, and those used in the English benchmark are in `mteb/models/retrieva_en.py`.
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("retrieva-jp/amber-large")

# Japanese queries: "What is natural language processing?" and "Tell me about Retrieva, Inc."
queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
# Japanese documents: a definition of NLP and a short description of Retrieva, Inc.
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their task-specific prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Compute the query-document cosine similarity matrix
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
```
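The similarity matrix has shape `(len(queries), len(documents))`, so the highest-scoring document for each query can be read off with a row-wise argmax. A minimal sketch continuing from the code above:

```python
# Each row of `similarities` scores one query against every document;
# the argmax along a row therefore picks that query's best match.
best = similarities.argmax(dim=1)
for i, query in enumerate(queries):
    j = best[i].item()
    print(f"query {i} -> document {j} (score = {similarities[i, j].item():.3f})")
```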
✨ Features
- Multilingual Support: Primarily designed for Japanese, with optional support for English.
- Prompt-Based Inference: Allows specifying prompts at inference time for task-specific embeddings (the configured prompt names can be inspected as shown in the snippet after this list).
- Sentence Transformer: Based on the Sentence Transformers architecture, enabling effective sentence-level embeddings.
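The prompt names accepted by `prompt_name` come from the model's Sentence Transformers configuration and can be inspected on the loaded model. A minimal sketch; the exact names available depend on the published configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

# model.prompts maps each configured prompt name (e.g. "Retrieval-query")
# to the instruction text that is prepended to the input before encoding.
for name, text in model.prompts.items():
    print(f"{name}: {text!r}")
```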
📦 Installation
To use the model, you need to install the required Python libraries. Run the following command:
```bash
pip install sentence-transformers sentencepiece
```
💻 Usage Examples
Basic Usage
The following code demonstrates how to load the model and perform basic inference:
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("retrieva-jp/amber-large")

queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their task-specific prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Compute the query-document cosine similarity matrix
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
```
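For retrieval over a larger document collection, the embeddings can also be passed to the `semantic_search` utility bundled with Sentence Transformers, which returns the top-k documents per query. A minimal sketch continuing from the example above (the `top_k` value is arbitrary):

```python
from sentence_transformers import util

# Re-encode as tensors so the search utility can score them directly
query_emb = model.encode(queries, prompt_name="Retrieval-query", convert_to_tensor=True)
doc_emb = model.encode(documents, prompt_name="Retrieval-passage", convert_to_tensor=True)

# For each query, returns a ranked list of {"corpus_id": ..., "score": ...}
hits = util.semantic_search(query_emb, doc_emb, top_k=2)
for query, query_hits in zip(queries, hits):
    print(query)
    for hit in query_hits:
        print(f"  document {hit['corpus_id']}: score = {hit['score']:.3f}")
```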
📚 Documentation
Model Details
- Developed by: Retrieva, Inc.
- Model type: Based on the ModernBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
- Finetuned from model: [sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity (see the snippet after this list for a quick check of these properties)
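These properties can be checked directly on the loaded model; a minimal sketch using standard SentenceTransformer attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 768

emb = model.encode(["テスト文です。"], prompt_name="Retrieval-query")
print(emb.shape)                                 # expected: (1, 768)
```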
Training Details
Training Data
Multiple datasets were used for training. For Japanese, datasets were selected from [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval), [llm-japanese-dataset](https://github.com/masanorihirano/llm-japanese-dataset), and hpprc/emb. For English, the training data mainly consists of a subset of the datasets from Asai et al. (2023), along with partial data from [the sentence-transformers repository](https://huggingface.co/sentence-transformers) and kilt-tasks. Japanese-English translation datasets were also used to cover cross-lingual aspects, and synthetic data generated with an LLM was added to ensure sufficient Japanese training data.
Evaluation
The model was evaluated on the following benchmarks:
Japanese Benchmark: JMTEB
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| **base models** | < 300M | | | | | | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| **large models** | 300M < | | | | | | | | |
| AMBER-large (this model) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
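The two aggregate columns seem to differ only in how scores are averaged: Mean (TaskType) matches the unweighted mean of the six task-type columns, whereas Mean (Task) averages over the individual JMTEB tasks (this reading is inferred from the numbers rather than stated explicitly here). A quick arithmetic check for the AMBER-large row:

```python
# Per-task-type scores for AMBER-large, copied from the table above
scores = {
    "Retrieval": 75.40,
    "STS": 79.32,
    "Classification": 77.14,
    "Reranking": 93.54,
    "Clustering": 48.73,
    "PairClassification": 60.97,
}

mean_task_type = sum(scores.values()) / len(scores)
print(round(mean_task_type, 2))  # 72.52, matching the Mean (TaskType) column
```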
Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| **base models** | < 300M | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 58.4 | 83.3 | 32.77 |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 57.1 | 81.6 | 35.69 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 60.6 | 85.3 | 33.99 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 47.1 | 85.3 | 25.46 |
| **large models** | 300M < | | | |
| AMBER-large (this model) | 315M | 62.5 | 82.4 | 34.57 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 62.8 | 82.5 | 34.78 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 55.4 | 87.3 | 29.95 |
English Benchmark: MTEB(eng, v2)
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| **base models** | < 300M | | | | | | | | | |
| [AMBER-base](https://huggingface.co/retrieva-jp/amber-base) | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| **large models** | 300M < | | | | | | | | | |
| AMBER-large (this model) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |
🔧 Technical Details
The AMBER model is based on [sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m), a ModernBERT model designed for Japanese text. During training, natural-language prompts were incorporated, enabling the model to generate embeddings tailored to specific tasks.
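Because the prompt becomes part of the model input, the same text encoded under different prompt names produces different embeddings. A minimal sketch illustrating this with the prompt names from the retrieval example above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-large")

text = "株式会社レトリバについて教えて"  # "Tell me about Retrieva, Inc."

# Encode the same text once as a query and once as a passage
as_query = model.encode([text], prompt_name="Retrieval-query")
as_passage = model.encode([text], prompt_name="Retrieval-passage")

# The prompt changes the embedding, so the similarity is generally below 1.0
print(model.similarity(as_query, as_passage))
```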
📄 License
This model is licensed under the Apache 2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{amber2025,
  title = {インストラクションと複数タスクを利用した日本語向け分散表現モデルの構築},
  author = {勝又智 and 木村大翼 and 西鳥羽二郎},
  booktitle = {言語処理学会第31回年次大会発表論文集},
  year = {2025},
}
```
More Information
For more details, refer to this link (in Japanese).
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
Contact via pr[at]retrieva.jp