Sentence-Luke-Japanese-Base-Lite Open-Source Model - Efficient Japanese Sentence Embedding with Results Comparable to Sentence-BERT

Sentence Luke Japanese Base Lite

Developed by sonoisa

This is a Japanese sentence embedding model based on the LUKE architecture. On the developer's private dataset, it performed on par with or slightly better than the Japanese Sentence-BERT model.

Text Embedding

Safetensors

Japanese

Open Source License: Apache-2.0

#Japanese sentence embedding #Semantic similarity calculation #LUKE architecture optimization

Downloads 2,690

Release Time: 3/19/2023

Model Overview

This model generates embedding vectors for Japanese sentences and is suited to tasks such as sentence similarity calculation and feature extraction.

Model Features

Performance comparable to or better than Sentence-BERT

On a private dataset, this model's quantitative accuracy was comparable to or about 0.5 points higher than that of the Japanese Sentence-BERT model, and its qualitative evaluation results were better.

Based on LUKE architecture

Uses studio-ousia/luke-japanese-base-lite as the pre-trained base model.

Sentence-level embedding

Produces one embedding per sentence by mean-pooling the token embeddings, which suits sentence similarity calculation tasks.

Model Capabilities

Japanese sentence embedding

Sentence similarity calculation

Feature extraction

Use Cases

Text similarity

Semantic search

Improves search results by calculating semantic similarity between queries and documents

Enhances relevance of search results
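The semantic search use case above can be sketched as ranking documents by cosine similarity to a query embedding. This is a minimal illustration under stated assumptions, not the model author's implementation: `cosine_similarity` and `rank_documents` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (sequences of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
]
ranking = rank_documents(query, docs, top_k=2)
print(ranking)
```

With real embeddings, the query and each document would first be encoded by the model, and the resulting vectors passed to `rank_documents` in place of the toy values.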

Duplicate content detection

Identifies texts with different expressions but similar semantics

Effectively detects duplicate or highly similar content
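One simple way to approach duplicate detection is to flag every pair of texts whose embeddings exceed a similarity threshold. The sketch below assumes that approach. `find_near_duplicates` is a hypothetical helper, the threshold of 0.9 is an arbitrary illustrative value that would need tuning on real data, and the 2-dimensional vectors are toy stand-ins for sentence embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_near_duplicates(vectors, threshold=0.9):
    """Return index pairs (i, j) with i < j whose cosine similarity is >= threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine_similarity(a, b) >= threshold
    ]


# Toy embeddings: the first two point in nearly the same direction.
vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
]
pairs = find_near_duplicates(vectors, threshold=0.9)
print(pairs)
```

Comparing all pairs costs quadratic time in the number of texts, so this form is only practical for small collections.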

Information retrieval

Document clustering

Automatically groups documents based on semantic similarity

Achieves more accurate document classification and organization
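Grouping documents by semantic similarity can be illustrated with a very simple greedy scheme: each embedding joins the first cluster whose seed vector is similar enough, otherwise it starts a new cluster. This is a sketch for illustration only. `greedy_cluster` is a hypothetical helper, the 0.8 threshold is arbitrary, and real projects would more likely use an established clustering algorithm.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_cluster(vectors, threshold=0.8):
    """Assign each vector to the first cluster whose seed is similar enough,
    or start a new cluster with that vector as its seed."""
    unit = [normalize(v) for v in vectors]
    seeds = []   # index of the first member of each cluster
    labels = []  # cluster id assigned to each input vector
    for i, v in enumerate(unit):
        for cluster_id, seed_idx in enumerate(seeds):
            if dot(v, unit[seed_idx]) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster matched, so this vector seeds a new one.
            seeds.append(i)
            labels.append(len(seeds) - 1)
    return labels


# Toy embeddings forming two obvious groups.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = greedy_cluster(vectors, threshold=0.8)
print(labels)
```

The result depends on input order and on the threshold, which is the main trade-off of this greedy approach compared with iterative methods.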

🚀 Japanese Sentence-LUKE Model

This is a Japanese Sentence-LUKE model. It was trained with the same dataset and settings as the Japanese Sentence-BERT model. On our private dataset, its quantitative accuracy was comparable to or about 0.5 points higher than that of the Japanese Sentence-BERT model, and its qualitative accuracy was higher.

We used the pre-trained model studio-ousia/luke-japanese-base-lite.

You need SentencePiece to run inference (pip install sentencepiece).
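Before loading the model, it can help to confirm that the dependency is available. The following is a small sketch using only the standard library. `is_installed` is a hypothetical helper name, and the check assumes the package is importable under the module name `sentencepiece`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("sentencepiece"):
    print("sentencepiece is missing; install it with: pip install sentencepiece")
```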

📚 Documentation

Property	Details
Model Type	Japanese Sentence-LUKE model
Training Data	Same as Japanese Sentence-BERT model
Pre-trained Model	studio-ousia/luke-japanese-base-lite
Inference Requirement	SentencePiece (`pip install sentencepiece`)

💻 Usage Examples

Basic Usage

from transformers import MLukeTokenizer, LukeModel
import torch


class SentenceLukeJapanese:
    def __init__(self, model_name_or_path, device=None):
        self.tokenizer = MLukeTokenizer.from_pretrained(model_name_or_path)
        self.model = LukeModel.from_pretrained(model_name_or_path)
        self.model.eval()

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.model.to(self.device)

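    def _mean_pooling(self, model_output, attention_mask):
        # model_output[0] is the last hidden state: one embedding per token
        token_embeddings = model_output[0]
        # Expand the mask to the embedding shape so padded tokens contribute zero
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum over the token axis, then divide by the real-token count (clamped to avoid division by zero)
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)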

    @torch.no_grad()
    def encode(self, sentences, batch_size=8):
        all_embeddings = []
        iterator = range(0, len(sentences), batch_size)
        for batch_idx in iterator:
            batch = sentences[batch_idx:batch_idx + batch_size]

            # Tokenize the batch by calling the tokenizer directly (preferred over the older batch_encode_plus)
            encoded_input = self.tokenizer(batch, padding="longest",
                                           truncation=True, return_tensors="pt").to(self.device)
            model_output = self.model(**encoded_input)
            sentence_embeddings = self._mean_pooling(model_output, encoded_input["attention_mask"]).to('cpu')

            all_embeddings.extend(sentence_embeddings)

        return torch.stack(all_embeddings)


MODEL_NAME = "sonoisa/sentence-luke-japanese-base-lite"
model = SentenceLukeJapanese(MODEL_NAME)

sentences = ["暴走したAI", "暴走した人工知能"]
sentence_embeddings = model.encode(sentences, batch_size=8)

print("Sentence embeddings:", sentence_embeddings)
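To make the `_mean_pooling` step above concrete, here is a minimal pure-Python sketch of masked mean pooling on toy numbers. It is an illustration only, not the model's code path: `mean_pool` and the sample values are hypothetical, and plain lists stand in for tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padded positions (mask 0) are ignored."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for d in range(dim):
                summed[d] += vec[d]
    # Mirrors the clamp in the class above: avoids division by zero for an all-padding row.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding token whose values must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding token's large values are excluded, so the sentence vector is the average of the two real tokens only.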

📄 License

This project is licensed under the Apache-2.0 license.
