A large language model (LLM) customized for translation applications, dedicated to providing a convenient, native-level experience of multilingual information.
WiNGPT-Babel is trained with a closed-loop, human-in-the-loop data production strategy, supports translation of multiple text formats, and aims to eliminate language barriers so that users can access global internet information more conveniently.
Model Features
Human-in-the-loop Training
Constructs training sets by collecting tool usage logs via APIs, applies rejection sampling with the WiNGPT-2.6 model and a reward model, and supplements this with manual review to ensure data quality.
Multi-format Compatibility
Supports translation of various text formats such as web pages, social media, academic papers, video subtitles, and datasets.
Precise Output
Based on advanced LLM architecture, delivers accurate, natural, and fluent translation results.
Efficient Response
Utilizes a 1.5B parameter model to meet stringent speed requirements for scenarios like real-time subtitle translation.
Extensive Language Support
Currently supports over 20 languages, with ongoing expansion of language coverage.
Tool Adaptation
Already compatible with practical tools such as Immersive Translate and VideoLingo.
Model Capabilities
Text Translation
Multilingual Translation
Web Content Translation
Academic Literature Translation
Social Media Content Translation
Video Subtitle Translation
Dataset Preprocessing
Use Cases
Web Page Translation
Foreign Web Page Translation
Achieves native-language rendering of foreign web pages through immersive translation tools.
Provides a smooth native-language reading experience.
Academic Research
Paper Translation
Assists researchers in understanding foreign-language literature.
Enhances reading efficiency of cross-border academic literature.
Social Media
Social Media Content Translation
Converts content across languages for social interactions.
Facilitates cross-language social engagement.
Video Content
Subtitle Translation
Generates translated subtitle files in real-time.
Enables barrier-free viewing of foreign-language videos.
Hard-coded Subtitles
Produces videos with hard-coded subtitles.
Provides multilingual video content.
Data Processing
Dataset Translation
Preprocesses multilingual data.
Facilitates multilingual data analysis.
WiNGPT-Babel
WiNGPT-Babel (Tower of Babel) is a model customized for translation applications and built on large language models (LLMs). It is dedicated to offering a seamless, native-level experience for accessing multilingual information.
The key differentiator of WiNGPT-Babel from other machine translation models is its training strategy, which uses a human-in-the-loop closed loop of data production and collection. As a result, WiNGPT-Babel is better adapted to real-world usage scenarios, such as translating news articles and research findings or providing real-time translated subtitles for videos. Through a suite of tool plugins, WiNGPT-Babel translates this content into the user's native language and presents it in a more accessible format.
Our goal is to leverage advanced LLM technology to break down language barriers, enabling users to effortlessly access global internet information across various data formats, including academic papers, social media posts, web content, and video subtitles. While achieving this goal may take time, the rapid advancement of LLM technology makes it a feasible endeavor.
Features
Human-in-the-loop: Initially, a small dataset is used for preliminary training. Subsequently, log data from our various tools is collected via API and used to construct new training data. The WiNGPT-2.6 model and a reward model are then employed for rejection sampling, supplemented by manual review to ensure data quality. Through several rounds of iterative training, the model's performance gradually improves until it reaches the desired level (a minimal sketch of the rejection-sampling step appears after this feature list).
Multi-format translation: Supports translation of various text formats, including web pages, social media content, academic papers, video subtitles, and datasets.
High-precision translation: Built on an advanced LLM architecture, the model strives to deliver accurate, natural, and fluent translation results.
High-performance translation: Using a 1.5B-parameter model, it supports real-time subtitle translation and other latency-sensitive applications, meeting users' demand for instant translation.
Multilingual support: Currently supports over 20 languages, with continuous expansion of language coverage.
Application adaptation: Currently compatible with tools such as Immersive Translate and VideoLingo.
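To make the rejection-sampling step above concrete, here is a minimal, hypothetical sketch; generate_fn and reward_fn stand in for the WiNGPT-2.6 generator and the reward model, whose interfaces are not published here, and the candidate count and threshold are purely illustrative.

# Hypothetical rejection-sampling sketch: draw several candidate translations,
# score them with a reward model, and keep only the best candidate if it clears
# a quality threshold. generate_fn and reward_fn are assumed helper functions.
def rejection_sample(source_text, generate_fn, reward_fn, n_candidates=4, threshold=0.8):
    candidates = [generate_fn(source_text) for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: reward_fn(source_text, c))
    if reward_fn(source_text, best) >= threshold:
        return best   # accepted: added to the next round's training data (after manual review)
    return None       # rejected: discarded or routed to human annotators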
Use Cases
Web content translation: Ideal for daily web browsing, enabling quick comprehension of web information.
Academic paper translation: Assists in understanding multilingual research papers, enhancing reading efficiency.
News and information translation: Facilitates rapid access to global news and up-to-date information.
Video subtitle translation: Aids in watching foreign-language videos by providing translated subtitles.
Multilingual dataset processing: Supports initial translation of multilingual datasets, assisting in data analysis.
Language Support (more languages to be verified)
English ↔ Chinese | Japanese → Chinese
Quick Start
WiNGPT-Babel uses Qwen2.5-1.5B as its base model, a choice that balances inference speed and translation quality after testing models of various parameter scales. In various application scenarios, its translation speed can match or even exceed that of Google Translate, which is crucial for a satisfying user experience. To help you get started quickly, we provide the following examples using the Hugging Face Transformers library for loading and inference. We also recommend using inference tools or frameworks such as vllm, llama.cpp, and ollama:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "winninghealth/WiNGPT-Babel"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the chat input with the default translation system prompt
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "Translate the following content between Chinese and English"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the translation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
# Keep only the newly generated tokens (strip the prompt portion)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
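The model can also be served behind an OpenAI-compatible endpoint and called with the standard openai client. The sketch below is a minimal example assuming a local vLLM server started with "vllm serve winninghealth/WiNGPT-Babel" and listening on http://localhost:8000/v1; the address, API key placeholder, and sample input are illustrative.

# Minimal sketch assuming an OpenAI-compatible server (e.g. vLLM) is already running locally
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="winninghealth/WiNGPT-Babel",
    messages=[
        {"role": "system", "content": "Translate the following content between Chinese and English"},
        {"role": "user", "content": "大语言模型正在改变人们获取信息的方式。"},
    ],
    max_tokens=512,
)
print(completion.choices[0].message.content)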
A quick example of using llama.cpp for inference:
llama-cli -m WiNGPT-Babel-Q4_K_M.gguf -co -i -if -p "<|im_start|>system\nTranslate the following content between Chinese and English<|im_end|>\n" --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" -fa -ngl 80 -n 512
Note: The default system prompt for WiNGPT-Babel is simply "Translate the following content between Chinese and English". The model automatically translates the user's input into the corresponding language without the need for additional complex instructions. It supports a maximum context length of 8192 tokens and can handle multi-round conversations.
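Because multi-round conversation is supported, a follow-up translation request can simply append the previous turn to the message list and reuse the same chat template. A minimal continuation of the Transformers example above; the second user message is just an illustrative input.

# Continuing the Transformers example: append the assistant's reply and a new user turn,
# then re-apply the chat template for the next round of translation.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "它们通常基于 Transformer 架构。"})

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=4096)
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])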
Examples
The following are some application scenarios demonstrating how to use the model for translation.
Web page translation:
Scenario: Users translate foreign web content into their native language using tools and simple system prompts.
Tool: Immersive Translate
Academic paper translation:
Scenario: Users translate foreign research papers using tools to assist in their research.
Tool: Immersive Translate
Social media translation:
Scenario: Users can translate social media content in different languages into their native language using the model.
Tool: Immersive Translate
Video subtitle translation:
Scenario: Users translate subtitle files directly using tools in conjunction with the model and save them as files.
Tool: Immersive Translate
PDF file translation:
Scenario: Users translate PDF documents or create bilingual versions using tools and the model.
Tool: PDFMathTranslate
Dataset translation:
Scenario: Users translate foreign-language datasets using the model (a minimal batch-translation sketch appears after the note at the end of this section).
Tool: wingpt-web-client
Real-time video website translation:
Scenario: Users generate real-time subtitles while watching online videos using tools and the model.
Tool: Immersive Translate
Video translation and subtitle embedding:
Scenario: Users generate videos with translated subtitles from foreign language videos using tools and the model.
Tool: VideoLingo
Note: The above examples illustrate how to use tools in combination with the WiNGPT-Babel model for text translation. You can adapt these tools to more scenarios based on your needs and preferences.
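As a complement to the dataset translation scenario, the following is a minimal, hypothetical sketch of batch-translating one text column of a JSONL dataset, reusing the model and tokenizer loaded in the Quick Start; the file name and field names are placeholders.

# Hypothetical batch-translation sketch: reuse model and tokenizer from the Quick Start
# to translate one text field of a JSONL dataset. File and field names are placeholders.
import json

def translate(text):
    messages = [
        {"role": "system", "content": "Translate the following content between Chinese and English"},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]  # keep only generated tokens
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

with open("dataset.jsonl", encoding="utf-8") as fin, \
     open("dataset_translated.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        record["text_translated"] = translate(record["text"])
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")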
Limitations
Specialized term translation: Translation results may deviate in highly specialized fields such as law and medicine, as well as in code translation.
Literary work translation: The model may struggle to fully convey the rhetorical and metaphorical nuances of literary works.
Long text translation: When processing extremely long texts, translation errors or hallucinations may occur, necessitating segmentation of the input (see the sketch after this list).
Multilingual adaptation: Currently, the model is primarily used in Chinese-English language scenarios, and more testing and feedback are required for other languages.
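For the long-text limitation above, a simple mitigation is to split the input into paragraph-sized chunks, translate each chunk separately, and rejoin the results. A minimal sketch, where translate stands for any helper built on the inference code shown earlier and the chunk size is illustrative:

# Minimal segmentation sketch: split long input on blank lines into bounded-size chunks,
# translate each chunk separately, and reassemble the output.
def translate_long_text(text, translate, max_chars=2000):
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return "\n\n".join(translate(chunk) for chunk in chunks)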
License
This project is licensed under the Apache License 2.0.
When using this project, including the model weights, please cite: https://huggingface.co/winninghealth/WiNGPT-Babel