gemma-2-2b-jpn-it-translate Open-source Translation Model - Free Deployment for Accurate Japanese-English Translation

Gemma 2 2b Jpn It Translate

Developed by webbigdata

A model fine-tuned for translation tasks based on Google's gemma-2-2b-jpn-it, specializing in Japanese-English bidirectional translation

Machine Translation

Safetensors

Supports Multiple Languages#Japanese-English translation optimization #Compact size with high efficiency #Long text sentence segmentation

Downloads 60

Release Time : 10/7/2024

Model Overview

This model is optimized for bidirectional translation between Japanese and English, featuring efficient processing capabilities and potential for long-text translation

Model Features

Efficient translation

Achieves translation quality comparable to 7B-parameter models with only 2B parameters

Long text processing

Capable of handling potentially unlimited text length translation (recommended with sentence segmentation preprocessing)

Fast operation

Compact file size of approximately 5GB enables rapid execution

Automatic prompt templates

Utilizes apply_chat_template technology to eliminate manual prompt template creation

Model Capabilities

Japanese-to-English translation

English-to-Japanese translation

Long text processing

Business document translation

Use Cases

Business translation

Business email translation

Bidirectional translation of business email content between Japanese and English

Sample output demonstrates strong business terminology handling

Corporate announcement translation

Translation of formal documents like financial reports and development plans

Accurately processes critical information including numbers and dates

Technical documentation translation

Bidirectional translation of technical documents between Japanese and English

🚀 Model Card for gemma-2-2b-jpn-it-translate

This model is tuned for translation tasks based on Google's gemma-2-2b-jpn-it, offering high - quality translation with relatively small size and fast execution.

The gemma-2-2b-jpn-it-translate is a model fine - tuned for translation tasks. It is based on google/gemma-2-2b-jpn-it released by Google. Although it has only 2 billion (2B) parameters, in some fields, it can provide translation quality approaching that of the 7 billion (7B) model from a year ago. With a relatively small file size of around 5GB, it allows for fast execution.

✨ Features

Model Description

This model is fine - tuned from "gemma-2-2b-jpn-it", a Japanese - specific model released by Google. The goal is to enable high - speed translation of texts of unlimited length.

It is trained to output translated text (Japanese/English) in response to user input after being given an initial system prompt - like text (Japanese/English). Additionally, by using apply_chat_template, it eliminates the need for manual writing of prompt templates, which can be error - prone.

However, since the model is trained to translate sentence by sentence, passing a long text with line breaks at once will lead to a decrease in quality. When translating a long text, please pre - process it by dividing it into sentences before feeding it to the model.

Batch Translation Sample Colab Script

If you have a Google account, you can check it out for free by clicking the "Open In Colab" button at the link below. gemma_2_2b_jpn_it_tranlate_batch_translation_sample.ipynb

💻 Usage Examples

Basic Usage

Japanese - English Translation sample script

# Install the necessary library
pip install -U transformers

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_torch_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        prop = torch.cuda.get_device_properties(device)
        # Ampere (Compute Capability 8.0 above), for example L4 support bfloat16, but T4 not support.
        if prop.major >= 8:
            return torch.bfloat16
    return torch.float16


model_name = "webbigdata/gemma-2-2b-jpn-it-translate"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=get_torch_dtype(),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token

system_prompt = "You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n"
instruct = """Translate Japanese to English.\nWhen translating, please use the following hints:\n[writing_style: casual]"""

# Function to split sentences
def split_sentences(text):
    sentences = []
    last = 0
    # Split sentences by punctuation marks
    for match in re.finditer(r'[。！？…]', text):
        end = match.end()
        # Include the newline character immediately following the punctuation mark
        while end < len(text) and text[end] == '\n':
            end += 1
        sentence = text[last:end]
        sentences.append(sentence)
        last = end
    # Add the remaining text
    if last < len(text):
        remaining = text[last:]
        sentences.append(remaining)
    # Split newlines within each sentence appropriately
    final_sentences = []
    for s in sentences:
        if '\n' in s:
            parts = s.split('\n')
            for i, part in enumerate(parts):
                if part:
                    # Add a newline if it's not the last part
                    if i < len(parts) - 1:
                        final_sentences.append(part + '\n')
                    else:
                        final_sentences.append(part)
                # Keep the newline itself
                if i < len(parts) - 1:
                    final_sentences.append('\n')
        else:
            final_sentences.append(s)
    return final_sentences

# Function to translate a sentence
def translate_sentence(sentence, previous_context):
    if sentence.strip() == '':
        return sentence

    messages = previous_context + [
        {"role": "user", "content": sentence}
    ]

    # Generate a prompt using apply_chat_template
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    translation = ""
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs,
            num_beams=3, max_new_tokens=1200, do_sample=True, temperature=0.5, top_p=0.3
        )
        full_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        translation = full_output.split('\nmodel\n')[-1].strip()
    return translation

from collections import deque

# Main processing function
def main(text):
    sentences = split_sentences(text)
    translated_sentences = []

    # Initialize context with system prompt
    context = deque([
        {"role": "user", "content": system_prompt + instruct},
        {"role": "assistant", "content": "OK"}
    ], maxlen=6)  # Maximum 6 elements

    for i, sentence in enumerate(sentences):
        # For the first sentence, use the full context including system prompt
        if i == 0:
            translation_context = list(context)
        else:
            # For subsequent sentences, exclude the system prompt
            translation_context = list(context)[2:]

        translated_sentence = translate_sentence(sentence, translation_context)
        translated_sentences.append(translated_sentence)

        # Add new interactions to the context
        if sentence.strip() != '':
            context.append({"role": "user", "content": sentence})
        else:
            context.append({"role": "user", "content": sentence})

        if translated_sentence.strip() != '':
            context.append({"role": "assistant", "content": translated_sentence})
        else:
            context.append({"role": "assistant", "content": translated_sentence})

    return translated_sentences


text = """こんにちは。私は田中です。今日はとても良い天気ですね。朝ごはんはパンとコーヒーを食べました。そのあとに散歩に行きました。公園にはたくさんの人がいました。子供たちは遊んでいました。
犬を連れている人もいました。私はベンチに座って本を読みました。風がとても気持ちよかったです。その後、友達とカフェに行きました。
カフェではコーヒーを飲みながらおしゃべりをしました。友達は最近引っ越したばかりだと言いました。新しい家の写真を見せてくれました。
とてもきれいな家でした。時間が経つのがあっという間でした。夕方になり、私は家に帰りました。夕食にはカレーを作りました。カレーはとても美味しかったです。今日一日、とても楽しかったです。"""
translated = main(text)
print(translated)

English - Japanese Translation sample script

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_torch_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        prop = torch.cuda.get_device_properties(device)
        # Ampere (Compute Capability 8.0 above), for example L4 support bfloat16, but T4 not support.
        if prop.major >= 8:
            return torch.bfloat16
    return torch.float16

model_name = "gemma-2-2b-jpn-it-translate"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=get_torch_dtype(),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token

system_prompt = "You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n"
instruct = """Translate English to Japanese.\nWhen translating, please use the following hints:\n[writing_style: business]"""

# Function to split English sentences
def split_sentences(text):
    sentences = []
    # Split by newlines, periods, exclamation marks, question marks, or two or more consecutive spaces
    pattern = r'(?:\r?\n|\.|\!|\?|(?:\s{2,}))'
    splits = re.split(pattern, text)

    for split in splits:
        split = split.strip()
        if split:
            sentences.append(split)

    return sentences

# Function to translate a sentence
def translate_sentence(sentence, previous_context):
    if sentence.strip() == '':
        return sentence

    messages = previous_context + [
        {"role": "user", "content": sentence}
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    translation = ""
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs,
            num_beams=3, max_new_tokens=1200, do_sample=True, temperature=0.5, top_p=0.3
        )
        full_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        translation = full_output.split('\nmodel\n')[-1].strip()
    return translation

from collections import deque

# Main processing function
def main(text):
    sentences = split_sentences(text)
    translated_sentences = []

    context = deque([
        {"role": "user", "content": system_prompt + instruct},
        {"role": "assistant", "content": "OK"}
    ], maxlen=6)

    for i, sentence in enumerate(sentences):
        if i == 0:
            translation_context = list(context)
        else:
            translation_context = list(context)[2:]

        translated_sentence = translate_sentence(sentence, translation_context)
        translated_sentences.append(translated_sentence)

        if sentence.strip() != '':
            context.append({"role": "user", "content": sentence})
        else:
            context.append({"role": "user", "content": sentence})

        if translated_sentence.strip() != '':
            context.append({"role": "assistant", "content": translated_sentence})
        else:
            context.append({"role": "assistant", "content": translated_sentence})

    return translated_sentences

# Sample English text for translation (business context)
text = """Dear valued clients and partners,

I hope this email finds you well. I am writing to provide you with an i

📚 Documentation

Model Information

Property	Details
Model Type	Fine - tuned from "gemma-2-2b-jpn-it", a Japanese - specific model released by Google
Training Data	Not provided in the original document

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご