Amber Base
🚀 RetrievaEmbedding-01: AMBER
AMBER (Adaptive Multitask Bilingual Embedding Representations) is a text embedding model developed by Retrieva, Inc. It is designed primarily for Japanese text and also supports English. The model was trained on a diverse range of Japanese and English datasets. With 132M parameters (base size), it is well equipped for a variety of text-related tasks.
🚀 Quick Start
Install Library
First, install the necessary Python libraries using `pip`:

```bash
pip install sentence-transformers sentencepiece
```
Run Inference
After installation, you can load the model and perform inference. You can specify the prompt at inference time by passing an argument called `prompt` to `model.encode`. The prompts used in the Japanese benchmark are described in `jmteb/tasks`, and those used in the English benchmark are described in `mteb/models/retrieva_en.py`.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-base")

# Run inference
queries = [
    "自然言語処理とはなんですか?",
    "株式会社レトリバについて教えて",
]
documents = [
    "自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。",
    "株式会社レトリバは、自然言語処理と機械学習を核としたAI技術で組織の課題解決を支援するテクノロジー企業である。",
]

# Encode queries and documents with their respective retrieval prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between each query and each document
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities.shape)
# => torch.Size([2, 2])
```
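On top of the similarity matrix, a simple ranking step picks the best document for each query. A minimal sketch, assuming the variables from the example above (`model.similarity` returns a PyTorch tensor):

```python
import torch

# similarities has shape (num_queries, num_documents); higher means more similar.
scores = model.similarity(queries_embeddings, documents_embeddings)
best_doc_idx = torch.argmax(scores, dim=1)  # best-matching document index per query
for query, idx in zip(queries, best_doc_idx.tolist()):
    print(query, "->", documents[idx])
```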
✨ Features
- Multilingual Support: Primarily designed for Japanese, with additional support for English.
- Adaptive Embeddings: Natural-language prompts were included during training, enabling the model to generate task-specific embeddings (see the sketch after this list).
- Versatile Usage: Can be used for various NLP tasks such as retrieval, classification, and more.
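Because the task prompts ship with the model configuration, you can inspect which prompt names are available before encoding. A minimal sketch using the standard sentence-transformers attributes for prompt-configured models:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

# Prompts loaded from config_sentence_transformers.json: {prompt_name: prompt_text}
print(model.prompts)
# Default prompt name used when encode() is called without prompt/prompt_name (may be None)
print(model.default_prompt_name)
```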
📦 Installation
Install the required Python libraries with the following command:

```bash
pip install sentence-transformers sentencepiece
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Load the model from the 🤗 Hub
model = SentenceTransformer("retrieva-jp/amber-base")

queries = ["自然言語処理とはなんですか?"]
documents = ["自然言語処理(しぜんげんごしょり、英語: Natural language processing、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。"]

# Encode the query and the document with their respective retrieval prompts
queries_embeddings = model.encode(queries, prompt_name="Retrieval-query")
documents_embeddings = model.encode(documents, prompt_name="Retrieval-passage")

# Cosine similarity between the query and the document
similarities = model.similarity(queries_embeddings, documents_embeddings)
print(similarities)
```
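If you prefer not to rely on the prompt names stored with the model, `model.encode` also accepts the prompt text directly via the `prompt` argument. A minimal sketch, assuming the model is loaded as above; the prompt string below is an illustrative placeholder, the actual prompts AMBER was trained with are listed in `config_sentence_transformers.json`:

```python
# Sketch: passing the prompt text directly instead of a stored prompt_name.
# NOTE: the prompt string is a hypothetical placeholder for illustration;
# use the prompts from config_sentence_transformers.json for best results.
query_embeddings = model.encode(
    ["自然言語処理とはなんですか?"],
    prompt="次の質問に関連する文書を検索してください: ",  # hypothetical prompt text
)
print(query_embeddings.shape)
```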
📚 Documentation
Model Details
- Developed by: Retrieva, Inc.
- Model type: Based on the ModernBERT Architecture.
- Language(s) (NLP): Primarily Japanese (optional support for English).
- License: Apache 2.0
- Finetuned from model:
sbintuitions/modernbert-ja-130m
- Model Type: Sentence Transformer
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 512 dimensions
- Similarity Function: Cosine Similarity
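These properties can be checked directly on the loaded model. A minimal sketch; the attribute names follow the standard sentence-transformers API, and the expected values in the comments come from the list above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

print(model.max_seq_length)                       # expected: 512 (tokens)
print(model.get_sentence_embedding_dimension())   # expected: 512 (output dimensions)
print(model.similarity_fn_name)                   # expected: "cosine"
```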
Training Details
Training Data
For Japanese, data was selected from [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval), [llm-japanese-dataset](https://github.com/masanorihirano/llm-japanese-dataset), and hpprc/emb. For English, datasets used in Asai et al. (2023) were mainly used, supplemented by English datasets from [the sentence-transformers repository](https://huggingface.co/sentence-transformers) and kilt-tasks. To cover cross-lingual aspects between Japanese and English, translation datasets between the two languages were also utilized. For Japanese, synthetic data created by an LLM was used to ensure a sufficient amount of training data.
Evaluation
The model was evaluated on the following benchmarks:
- Japanese Benchmark: JMTEB
- Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
- English Benchmark: [MTEB(eng, v2)](https://github.com/embeddings-benchmark/mteb)
Japanese Benchmark: JMTEB
The `Mean (TaskType)` column in the following leaderboard is equivalent to `Avg.` in the original JMTEB leaderboard. The evaluation files are stored in the `jmteb` directory.
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|---|
| base models | < 300M | | | | | | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 72.60 | 71.56 | 69.53 | 82.87 | 75.49 | 92.91 | 52.40 | 62.38 |
| AMBER-base (this model) | 130M | 72.12 | 72.12 | 73.40 | 77.81 | 76.14 | 93.27 | 48.05 | 64.03 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 72.89 | 72.47 | 73.03 | 82.96 | 74.02 | 93.01 | 51.96 | 62.37 |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 190M | 72.49 | 72.05 | 73.14 | 81.39 | 72.37 | 92.69 | 53.60 | 61.74 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 71.11 | 69.72 | 69.45 | 80.45 | 69.86 | 92.90 | 51.62 | 62.35 |
| large models | > 300M | | | | | | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 72.52 | 73.22 | 75.40 | 79.32 | 77.14 | 93.54 | 48.73 | 60.97 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 73.20 | 73.06 | 72.86 | 83.14 | 77.15 | 93.00 | 50.78 | 62.29 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 72.06 | 71.29 | 71.71 | 80.87 | 72.45 | 93.29 | 51.59 | 62.42 |
Japanese Retrieval Tasks: JQaRA, JaCWIR, MLDR Japanese Subset
The evaluation files for MLDR are stored in the `mldr` directory. The prompts used in JQaRA and JaCWIR are `Retrieval-query` and `Retrieval-passage` as described in `config_sentence_transformers.json`.
| Model | # Parameters | JQaRA (nDCG@10) | JaCWIR (MAP@10) | MLDR Japanese Subset (nDCG@10) |
|---|---|---|---|---|
| base models | < 300M | | | |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 111M | 58.4 | 83.3 | 32.77 |
| AMBER-base (this model) | 130M | 57.1 | 81.6 | 35.69 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 133M | 60.6 | 85.3 | 33.99 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 47.1 | 85.3 | 25.46 |
| large models | > 300M | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 62.5 | 82.4 | 34.57 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 337M | 62.8 | 82.5 | 34.78 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 55.4 | 87.3 | 29.95 |
English Benchmark: MTEB(eng, v2)
The evaluation files are stored in the `mteb` directory.
| Model | # Parameters | Mean (TaskType) | Mean (Task) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| base models | < 300M | | | | | | | | | |
| AMBER-base (this model) | 130M | 54.75 | 58.20 | 40.11 | 81.29 | 70.39 | 42.98 | 42.27 | 80.12 | 26.08 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | 56.21 | 59.75 | 43.22 | 80.50 | 73.84 | 43.87 | 42.19 | 83.74 | 26.10 |
| large models | > 300M | | | | | | | | | |
| [AMBER-large](https://huggingface.co/retrieva-jp/amber-large) | 315M | 56.08 | 59.13 | 41.04 | 81.52 | 72.23 | 43.83 | 42.71 | 81.00 | 30.21 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | 57.06 | 60.84 | 46.17 | 81.11 | 74.88 | 44.31 | 41.91 | 84.33 | 26.67 |
🔧 Technical Details
The AMBER model is based on the [sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m) architecture. Natural-language prompts were incorporated during training, enabling the model to generate embeddings tailored to specific tasks. This approach allows the model to adapt well to different NLP tasks, whether in Japanese or English.
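To see the effect of prompt conditioning, you can embed the same sentence under two different prompt names and compare the resulting vectors. A minimal sketch, assuming the model is loaded as in the usage examples; the exact similarity value will vary:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("retrieva-jp/amber-base")

text = ["自然言語処理とはなんですか?"]
as_query = model.encode(text, prompt_name="Retrieval-query")
as_passage = model.encode(text, prompt_name="Retrieval-passage")

# The two embeddings differ because the prompt changes the input the model sees.
print(model.similarity(as_query, as_passage))
```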
📄 License
The AMBER model is licensed under the Apache 2.0 license.
Citation
BibTeX:
```bibtex
@inproceedings{amber2025,
  title = {インストラクションと複数タスクを利用した日本語向け分散表現モデルの構築},
  author = {勝又智 and 木村大翼 and 西鳥羽二郎},
  booktitle = {言語処理学会第31回年次大会発表論文集},
  year = {2025},
}
```
More Information
For more information, please visit https://note.com/retrieva/n/n4ee9d304f44d (in Japanese).
Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
Model Card Contact
You can contact the model card maintainers at pr[at]retrieva.jp





