🚀 GLuCoSE v2
GLuCoSE v2 is a general Japanese text embedding model designed for retrieval tasks. It can run on CPU and effectively measure semantic similarity between sentences.
🚀 Quick Start
GLuCoSE v2 is a powerful Japanese text embedding model. It can measure semantic similarity between sentences and retrieve relevant passages for a query. During inference, every input text must be prefixed with "query: " or "passage: ".
✨ Features
- Specialized for Retrieval: Achieves the highest performance among similarly sized models on MIRACL and other retrieval tasks.
- Japanese Text Optimization: Optimized for Japanese text processing.
- CPU Compatibility: Can run on CPU.
📚 Documentation
Model Description
The model is based on GLuCoSE and fine-tuned through distillation using several large-scale embedding models and multi-stage contrastive learning.
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
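These properties can be verified directly with Sentence Transformers. The snippet below is a minimal sanity check; the expected values in the comments come from the list above.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 768

# A single prefixed input yields one 768-dimensional embedding.
embedding = model.encode(["query: 日本で一番高い山は?"])
print(embedding.shape)  # expected: (1, 768)
```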
Usage
Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformer with the following code:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Each input must be prefixed with "query: " or "passage: ".
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# => torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Average the token embeddings, ignoring padding positions.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb


tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")

# Each input must be prefixed with "query: " or "passage: ".
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# => torch.Size([4, 768])

# Pairwise cosine similarities between all sentences.
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
```
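For retrieval, the same embeddings are simply ranked by cosine similarity against a prefixed query. The snippet below is a minimal sketch using the example texts from above; the ranking logic is illustrative, not a dedicated retrieval API of this model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

query = "query: 日本で一番高い山は?"
passages = [
    "passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。",
    "passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。",
]

# Encode the query and the candidate passages, then rank passages by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
passage_embs = model.encode(passages, convert_to_tensor=True)
scores = util.cos_sim(query_emb, passage_embs)[0]

for idx in scores.argsort(descending=True):
    print(f"{float(scores[idx]):.4f}  {passages[int(idx)]}")
```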
Training Details
The fine-tuning of GLuCoSE v2 is carried out through the following steps:
Step 1: Ensemble distillation
- The embedded representation was distilled using [E5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct), [gte-Qwen2](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct), and [mE5-large](https://huggingface.co/intfloat/multilingual-e5-large) as teacher models.
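The card does not include the distillation code itself. As a rough, hypothetical illustration of the idea (pulling student embeddings toward an ensemble of teacher signals), one common formulation matches pairwise similarity matrices, which also sidesteps the teachers' differing embedding dimensions. The function below is an assumption for illustration only, not the actual training recipe.

```python
import torch
import torch.nn.functional as F


def ensemble_distillation_loss(student_emb: torch.Tensor,
                               teacher_embs: list[torch.Tensor]) -> torch.Tensor:
    """Hypothetical sketch: align the student's pairwise cosine-similarity
    matrix with the average of the teachers' similarity matrices."""
    def sim_matrix(x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        return x @ x.T

    teacher_sim = torch.stack([sim_matrix(t) for t in teacher_embs]).mean(dim=0)
    return F.mse_loss(sim_matrix(student_emb), teacher_sim)


# Random tensors stand in for real batches; the teacher dimensions are arbitrary,
# chosen only to show that the similarity-matrix trick works across dimensions.
student = torch.randn(8, 768)
teachers = [torch.randn(8, d) for d in (1024, 2048, 4096)]
print(ensemble_distillation_loss(student, teachers))
```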
Step 2: Contrastive learning
- Triplets were created from [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88), [MNLI](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7), [PAWS-X](https://huggingface.co/datasets/paws-x), JSeM and [Mr.TyDi](https://huggingface.co/datasets/castorini/mr-tydi) and used for training.
- This training aimed to improve the overall performance as a sentence embedding model.
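A minimal sketch of this kind of triplet-based contrastive training with Sentence Transformers is shown below. The example triplet, starting checkpoint, and hyperparameters are placeholders that only illustrate the setup, not the actual training configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder triplet (anchor, positive, negative) standing in for triplets
# mined from JSNLI / MNLI / PAWS-X / JSeM / Mr.TyDi.
train_examples = [
    InputExample(texts=[
        "犬が公園を走っている。",      # anchor
        "公園で犬が駆け回っている。",  # positive (paraphrase / entailment)
        "猫がソファで眠っている。",    # negative (unrelated)
    ]),
]

# The v1 GLuCoSE checkpoint is assumed as the starting point, per the model description.
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.TripletLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```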
Step 3: Search - specific contrastive learning
- To make the model more robust for retrieval, an additional two-stage training phase with QA and retrieval data was conducted.
- In the first stage, the synthetic dataset [auto-wiki-qa](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa) was used for training, while in the second stage, JQaRA, [MQA](https://huggingface.co/datasets/hpprc/mqa-ja), Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, Quiz Works and Quiz No Mori were used.
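A hedged sketch of this retrieval-oriented stage, again with Sentence Transformers: (query, positive passage) pairs are trained with in-batch negatives, and the "query: " / "passage: " prefixes from the usage section are assumed to be applied to the training text as well. The data and settings below are placeholders, not the real training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder (query, positive passage) pairs standing in for the QA and
# retrieval datasets listed above; passages belonging to other queries in the
# same batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["query: 日本で一番高い山は?",
                        "passage: 富士山は標高3776mで、日本最高峰の独立峰である。"]),
    InputExample(texts=["query: PKSHAはどんな会社ですか?",
                        "passage: 研究開発したアルゴリズムを多くの企業のソフトウエアに導入している会社です。"]),
]

model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```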
Benchmarks
Retrieval
Evaluated with MIRACL-ja, JQaRA, JaCWIR and MLDR-ja.
| Model | Size | MIRACL Recall@5 | JQaRA nDCG@10 | JaCWIR MAP@10 | MLDR nDCG@10 |
|:--|:--|--:|--:|--:|--:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |
Note: Results for the OpenAI text-embedding-3-small model on JQaRA and JaCWIR are quoted from the JQaRA and JaCWIR repositories.
JMTEB
Evaluated with JMTEB. The average score is a macro-average.
| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:--|:--|--:|--:|--:|--:|--:|--:|--:|
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| [cl-nagoya/ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| [pkshatech/GLuCoSE-base-ja](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 67.29 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| GLuCoSE v2 | 0.1B | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the [cl-nagoya/ruri-base model card](https://huggingface.co/cl-nagoya/ruri-base/blob/main/README.md).
👥 Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
📄 License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).