🚀 LingoWhale-8B
LingoWhale-8B is the first open-source Chinese-English bilingual large language model in the LingoWhale series launched by DeepLangAI. It has strong foundational capabilities and achieves leading results on multiple public evaluation benchmarks.
🚀 Quick Start
This README provides detailed information about the LingoWhale-8B model, including a model introduction, evaluation results, generation examples, deployment and inference, fine-tuning methods, and the open-source license. You can quickly get started with the LingoWhale-8B model based on this document.
✨ Features
- Powerful Foundation: Pre-trained on trillions of high-quality Chinese and English tokens, achieving leading results on multiple public evaluation benchmarks.
- Long Context Handling: Trained with an 8K context length, capable of handling longer context understanding and generation tasks.
- Academic and Commercial Use: Completely open for academic research; after applying by email and obtaining an official commercial license, users may also use it commercially free of charge.
- Developer-Friendly: Provides Huggingface inference interfaces and parameter-efficient fine-tuning examples like LoRA for quick development.
📦 Installation
Install Dependencies
```bash
pip install -r requirements.txt
```
- If you need to use flash attention, install flash-attn.
- If you need to use parameter-efficient fine-tuning methods like LoRA, additionally install peft (a quick way to check these optional dependencies is sketched below).
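If you are unsure whether these optional packages are present, a quick check like the following can help; this snippet is only an illustration and is not part of the repository.
```python
# A minimal check (not part of the official repository) for the optional dependencies.
import importlib.util

def has_package(name: str) -> bool:
    """Return True if the package can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# flash-attn is only needed for flash attention; peft only for LoRA-style fine-tuning.
print("flash_attn available:", has_package("flash_attn"))
print("peft available:", has_package("peft"))
```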
💻 Usage Examples
Basic Usage
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("deeplang-ai/LingoWhale-8B", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("deeplang-ai/LingoWhale-8B", device_map="auto", trust_remote_code=True)
>>> inputs = tokenizer("陋室铭\n唐 刘禹锡\n", return_tensors="pt")  # prompt: "Inscription of a Humble Room", Liu Yuxi (Tang dynasty)
>>> inputs = inputs.to("cuda:0")
>>> pred = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.1)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
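If you want more varied outputs, the standard Hugging Face generation interface also supports sampling; the settings below are illustrative defaults, not values recommended by the model authors.
```python
>>> # Sampling-based generation (illustrative settings, not official recommendations)
>>> pred = model.generate(
...     **inputs,
...     max_new_tokens=100,
...     do_sample=True,
...     temperature=0.8,
...     top_p=0.9,
...     repetition_penalty=1.1,
... )
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```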
Advanced Usage
Here are some generation examples of the LingoWhale-8B model:
Mathematical Calculation
Question: Given that $f(x)$ is an even function, $f(1) = 3$, and $f(2) = 4$, what is the value of $f(1)+f(-2)^{2}$? Please explain step by step.
Answer: First, based on the conditions given in the question, we know that:
- $f(x)$ is an even function, so $f(-x)=f(x)$.
- $f(1) = 3$.
- $f(2) = 4$.
Next, we need to find the value of $f(1)+f(-2)^{2}$. Since $f(x)$ is an even function, $f(-2)=f(2)$, so $f(-2) = 4$. Substituting the values of $f(1)$ and $f(-2)$ into $f(1)+f(-2)^{2}$, we get $f(1)+f(-2)^{2}=3 + 4^{2}=3 + 16=19$. Therefore, the value of $f(1)+f(-2)^{2}$ is 19.
Role-Playing
Question: As a brand manager with rich marketing experience, please provide a brand marketing strategy for the following scenario: During the 618 shopping festival, increase brand sales. Brand positioning: Domestic high-quality eyebrow pencils. Suggestion requirements: Provide an innovative and effective brand marketing strategy to help increase brand sales and include a live-streaming e-commerce plan.
Answer: During the 618 shopping festival, we can adopt the following brand marketing strategies to increase brand sales:
- Limited-time Promotions: Launch limited-time promotions on the brand's official website during the 618 event, such as buying an eyebrow pencil and getting a free gift or offering discounts to attract consumers and boost sales.
- Social Media Promotion: Publish brand promotion content on social media platforms (such as Weibo, WeChat, and Douyin), including product introductions, usage tips, and user reviews, to increase brand exposure and user interaction.
- Collaboration with Influencers: Invite well-known beauty bloggers or influencers to try our products and post recommendation videos or graphics on their social media to attract more target audiences.
- Live-Streaming E-commerce: Invite well-known live-streaming hosts or celebrities to promote our products through live-streaming e-commerce, attracting more consumers to pay attention and make purchases through their influence.
- Participation in E-commerce Platform Events: Actively participate in promotional events on major e-commerce platforms, such as spend-and-save and gift-with-purchase promotions, to attract more consumers to buy our products.
In summary, the above brand marketing strategies can help us increase brand sales during the 618 shopping festival. The live-streaming e-commerce plan can better attract the target audience and improve sales effectiveness.
Poetry Creation
Question: Write a seven-character poem describing Yue Fei. The poem is expected to be in the style of a Song Dynasty poem.
Answer: Yue Fei
Iron horses and golden spears, a hero's soul, Loyal to the country, his ambition soars like clouds. Throughout the ages, who can match this hero? The Yue Family Army shakes the world.
Cultural Knowledge
Question: I'm going to visit the Wuhou Shrine in Chengdu. Can you give me a detailed introduction to the life and deeds of Wuhou?
Answer: Zhuge Liang (181 - 234), styled Kongming, also known as Wolong, was from Yangdu, Langya, Xuzhou (now Yinan County, Linyi City, Shandong Province). He was the prime minister of Shu Han during the Three Kingdoms period and an outstanding statesman, military strategist, litterateur, calligrapher, and inventor. In his early years, he went to Jingzhou with his uncle Zhuge Xuan. After Zhuge Xuan's death, Zhuge Liang lived in seclusion in Longzhong. Liu Bei visited Zhuge Liang three times in his thatched cottage and invited him out. They allied with Sun Quan of Eastern Wu and defeated Cao Cao's army in the Battle of Chibi (Red Cliffs), establishing the tripartite confrontation of the Three Kingdoms, and then occupied Jingzhou and Yizhou. After Liu Bei entered Shu, Zhuge Liang guarded Jingzhou. After Liu Bei proclaimed himself emperor, Zhuge Liang was appointed prime minister and was in charge of court affairs. After Liu Chan succeeded to the throne, Zhuge Liang was enfeoffed as Marquis of Wuxiang and served as governor of Yizhou. He was diligent and cautious, handled all major and minor affairs personally, and was strict in rewards and punishments. He allied with Eastern Wu and improved relations with the ethnic groups of the southwest. He implemented the tuntian (military farming) policy to strengthen combat readiness. He launched five northern expeditions against the Central Plains, most of which ended in failure due to lack of food supplies. Finally, worn out by overwork, he died at Wuzhangyuan (now south of Qishan County, Shaanxi Province) in 234 at the age of 54. Liu Chan posthumously conferred on him the title of Marquis Zhongwu, and later generations often referred to Zhuge Liang as Wuhou ("Martial Marquis"). The Eastern Jin regime posthumously conferred on him the title of King Wuxing because of his military talent. Zhuge Liang's representative prose works include the Memorial Presented to the Emperor (Chu Shi Biao) and the Admonition to My Son. He invented the wooden ox and flowing horse and the Kongming lantern, and improved the repeating crossbow, known as the Zhuge repeating crossbow, which could fire ten bolts at once. Zhuge Liang dedicated his whole life to the country and is a representative figure of loyalty and wisdom in traditional Chinese culture.
Code Generation
Question: You are a programming developer. Can you write a regular expression in Python to determine whether a string is a valid mobile phone number?
Answer:
```python
import re

def is_valid_phone(phone):
    # Mainland China mobile numbers: 11 digits, starting with 1 followed by a digit from 3 to 9
    pattern = r'^1[3-9]\d{9}$'
    return bool(re.match(pattern, phone))

print(is_valid_phone('13800138000'))  # True
print(is_valid_phone('12345678901'))  # False
```
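As a small aside (not part of the model's original answer), the same check can be written with `re.fullmatch`, which matches the entire string and therefore needs no anchors:
```python
import re

def is_valid_phone(phone: str) -> bool:
    # re.fullmatch requires the whole string to match, so the ^ and $ anchors are unnecessary
    return re.fullmatch(r'1[3-9]\d{9}', phone) is not None

print(is_valid_phone('13800138000'))  # True
print(is_valid_phone('1380013800'))   # False (only 10 digits)
```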
📚 Documentation
Model Introduction
LingoWhale-8B is the first open-source Chinese-English bilingual large language model in the LingoWhale series launched by DeepLangAI. The model is pre-trained on trillions of high-quality Chinese and English tokens, giving it strong foundational capabilities and leading results on multiple public evaluation benchmarks. During pre-training, the model is trained with an 8K context length, enabling it to handle longer context understanding and generation tasks.
The LingoWhale-8B model is completely open for academic research. After users apply by email and obtain an official commercial license, they can also use it commercially free of charge. Alongside the open-source model weights, we provide familiar Hugging Face inference interfaces and parameter-efficient fine-tuning examples such as LoRA, so that developers can quickly start working with LingoWhale-8B.
Due to its parameter scale, LingoWhale-8B still exhibits inherent problems of large models, such as hallucination and relatively weak mathematical ability. Please understand these issues and evaluate the possible risks before use. Future versions of the LingoWhale models will focus on addressing these problems.
Evaluation Results
We conducted tests on the following public evaluation benchmarks:
- C-Eval is a Chinese basic model evaluation benchmark containing 13,948 multiple-choice questions covering 52 different disciplines and four difficulty levels. It aims to evaluate the capabilities of Chinese language models. We used the dev set of this dataset as the source for few-shot learning and conducted 5-shot tests on the test set.
- MMLU is an English basic model evaluation benchmark covering multiple fields such as basic mathematics, American history, computer science, and law, with a total of 57 tasks. It is used to evaluate the performance of language models on tasks in different fields. We conducted 5-shot tests on the model.
- CMMLU is a Chinese evaluation benchmark covering 67 topics from basic disciplines to advanced professional levels. It is used to evaluate the knowledge and reasoning abilities of Chinese language models. We used the dev set of this dataset as the source for few-shot learning and conducted 5-shot tests on the test set.
- Gaokao is an evaluation benchmark built from Chinese college entrance examination questions. It aims to evaluate the language understanding and logical reasoning abilities of Chinese language models. We kept only the four-option multiple-choice questions and conducted 5-shot tests on the model after randomly splitting the data.
- HumanEval is an English evaluation benchmark of 164 hand-written programming problems. It is used to evaluate the program understanding and generation abilities of language models. We tested the model zero-shot and report Pass@1.
- GSM8K is a dataset of high-quality, linguistically diverse grade-school math word problems that require multi-step reasoning. It is used to evaluate the mathematical application abilities of language models. We conducted 8-shot tests on the model.
- BBH (BIG-Bench Hard) is an evaluation benchmark consisting of challenging tasks selected from the 204 BIG-Bench tasks on which earlier models performed poorly. It is used to evaluate the performance of large language models on challenging tasks. We conducted 3-shot tests on the model.
- AGIEval is a benchmark examining the human-like abilities of foundation models, specifically their performance on tasks related to human cognition and problem solving. We kept only the four-option multiple-choice questions and conducted 5-shot tests on the model after randomly splitting the data (the few-shot prompting setup is sketched after this list).
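Most of these benchmarks are evaluated with few-shot, multiple-choice prompting. The sketch below shows one generic way such a 5-shot prompt can be assembled; it is a simplified illustration with hypothetical data, not the exact harness used to produce the numbers in the table.
```python
# A simplified, generic sketch of few-shot multiple-choice prompting (illustrative only;
# the questions and scoring rule here are hypothetical, not the actual evaluation harness).

def build_few_shot_prompt(dev_examples, question, choices):
    """Concatenate k solved dev-set examples before the unanswered test question."""
    blocks = []
    for ex in dev_examples:  # typically k = 5 examples drawn from the dev split
        opts = "\n".join(f"{label}. {text}" for label, text in ex["choices"])
        blocks.append(f"Question: {ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    opts = "\n".join(f"{label}. {text}" for label, text in choices)
    blocks.append(f"Question: {question}\n{opts}\nAnswer:")
    return "\n\n".join(blocks)

# Example usage with made-up data; the model's prediction is then commonly taken as the
# option letter it scores most highly (or generates first).
dev_examples = [
    {
        "question": "2 + 2 = ?",
        "choices": [("A", "3"), ("B", "4"), ("C", "5"), ("D", "6")],
        "answer": "B",
    }
]
print(build_few_shot_prompt(dev_examples, "3 + 5 = ?",
                            [("A", "7"), ("B", "8"), ("C", "9"), ("D", "10")]))
```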
These evaluation benchmarks provide standardized tests and metrics for evaluating the performance and capabilities of language models on different tasks and in different fields. The evaluation methods and results are shown in the following table:
| Model | C-Eval | MMLU | CMMLU | GAOKAO | HumanEval | GSM8K | BBH | AGIEval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 5-shot | 5-shot | 5-shot | 5-shot | 0-shot | 8-shot | 3-shot | 5-shot |
| GPT-4 | 68.4 | 83.9 | 70.3 | 66.2 | 69.5 | 90.0 | 75.1 | 63.3 |
| GPT-3.5 Turbo | 51.1 | 68.5 | 54.1 | 47.1 | 52.4 | 57.8 | 61.6 | 46.1 |
| LLaMA2-7B | 28.9 | 45.7 | 31.4 | 26.0 | 12.8 | 16.2 | 39.2 | 26.5 |
| ChatGLM2-6B* | 51.7 | 47.9 | - | - | - | 32.4 | 33.7 | - |
| Baichuan2-7B-Base* | 54.0 | 54.2 | 57.1 | 47.5 | 18.3 | 24.5 | 41.6 | 42.7 |
| Qwen-7B v1.1* | 63.5 | 58.2 | 62.2 | - | 29.9 | 51.7 | 45.0 | - |
| LingoWhale-8B-base | 63.6 | 60.2 | 62.8 | 50.3 | 32.9 | 55.0 | 47.5 | 43.8 |
* indicates that the model results are from the official source, and all results are accurate to one decimal place.
Fine-tuning Methods
The sample code for model fine-tuning is based on `transformers.Trainer`. For most of the parameters and their usage, you can refer to the `Trainer` tutorials and documentation on Hugging Face.
This section is intended to demonstrate the fine-tuning process; it does not guarantee model performance under this particular fine-tuning configuration.
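For orientation, a stripped-down `Trainer` setup might look like the sketch below. This is a simplified illustration only: the actual finetune/finetune.py script additionally handles conversation formatting, DeepSpeed integration, and the other options shown in the commands that follow, and the toy dataset here is purely hypothetical.
```python
# A simplified sketch of Trainer-based fine-tuning (illustrative only; the real
# finetune/finetune.py adds conversation formatting, DeepSpeed, and more options).
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "deeplang-ai/LingoWhale-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

class ToyDataset(Dataset):
    """A tiny placeholder dataset; replace with real instruction-tuning data."""

    def __init__(self, texts):
        self.encodings = [
            tokenizer(t, truncation=True, max_length=2048, return_tensors="pt")
            for t in texts
        ]

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        item = {k: v.squeeze(0) for k, v in self.encodings[idx].items()}
        item["labels"] = item["input_ids"].clone()  # causal LM: labels mirror the inputs
        return item

train_dataset = ToyDataset(["Hello, LingoWhale!", "LingoWhale-8B is a bilingual model."])

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    logging_steps=1,
    report_to="none",
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```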
Single-machine Training
The following is an example of single-machine fine-tuning. The data used is 10,000 instruction fine-tuning samples randomly selected from the COIG dataset; you can replace it with your own data.
hostfile=""
deepspeed --hostfile=$hostfile finetune/finetune.py \
--report_to "none" \
--data_path "finetune/data/coig_10k.json" \
--model_name_or_path deeplang-ai/LingoWhale-8B \
--output_dir "output" \
--model_max_length 2048 \
--num_train_epochs 4 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--save_strategy epoch \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--logging_steps 1 \
--gradient_checkpointing True \
--deepspeed finetune/ds_config.json \
--bf16 True \
--tf32 True
If you want to replace it with your own data, you can use a JSON file in the following format:
```json
[
    {
        "id": 0,
        "conversations": [
            {
                "from": "human",
                "value": "What is the 'modal age'?"
            },
            {
                "from": "model",
                "value": "The modal age refers to the age calculated by assuming the initial isotopic composition of geological samples using the law of radioactive decay."
            },
            ...
        ]
    },
    ...
]
```
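As an illustrative (not official) sanity check, the snippet below loads a file in this format and prints each conversation; the path and field names simply mirror the example above.
```python
# Illustrative check of the conversation-format JSON described above (not part of the repo).
import json

with open("finetune/data/coig_10k.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples[:3]:  # inspect the first few records
    print(f"--- sample id {sample['id']} ---")
    for turn in sample["conversations"]:
        # each turn is {"from": "human" | "model", "value": "..."}
        print(f"[{turn['from']}] {turn['value']}")
```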
Multi-machine Training
For multi-machine training, you need to edit a `hostfile` in the following format. Each line represents one machine, where `ip_address-X` is the IP address of that machine and the `slots` field indicates the number of available GPUs on it. The content format is as follows:
```
ip_address-1 slots=8
ip_address-2 slots=8
ip_address-3 slots=8
ip_address-4 slots=8
...
```
At the same time, set the `hostfile` parameter to the path of the `hostfile`, and then run the following command to start multi-machine training.
hostfile="/path/to/hostfile"
deepspeed --hostfile=$hostfile finetune/finetune.py \
--report_to "none" \
--data_path "finetune/data/coig_10k.json" \
--model_name_or_path deeplang-ai/LingoWhale-8B \
--output_dir "output" \
--model_max_length 2048 \
--num_train_epochs 4 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--save_strategy epoch \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--logging_steps 1 \
--gradient_checkpointing True \
--deepspeed finetune/ds_config.json \
--bf16 True \
--tf32 True
Parameter-Efficient Fine-tuning
By using peft, you can easily apply parameter-efficient fine-tuning methods such as LoRA and Prefix-Tuning. The LoRA training method is currently integrated into the code and can be enabled by adding `--use_lora True`.
The checkpoint trained with LoRA can be loaded and used with the following code:
```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained("output", trust_remote_code=True)
```
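For reference, attaching LoRA adapters with peft typically looks like the sketch below; the target modules and hyperparameters shown here are illustrative assumptions, not the exact configuration used by the `--use_lora` option.
```python
# Illustrative LoRA setup with peft (hyperparameters and target modules are assumptions,
# not the exact configuration used by finetune.py --use_lora True).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deeplang-ai/LingoWhale-8B", trust_remote_code=True
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters require gradients
```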
📄 License
The community's use of the LingoWhale-8B model must comply with the Apache 2.0 license and the LingoWhale-8B Model License Agreement. If you wish to use this open-source model or its derivatives for commercial purposes, please fill out the application questionnaire; after approval, you can use it commercially free of charge.
⚠️ Important Note
Due to its parameter scale, LingoWhale-8B still exhibits inherent problems of large models, such as hallucination and relatively weak mathematical ability. Please understand these issues and evaluate the possible risks before use. Future versions of the LingoWhale models will focus on addressing these problems.
💡 Usage Tip
When fine-tuning the model, you can adjust the parameters according to your own needs. If you need to use flash attention or parameter-efficient fine-tuning methods, make sure to install the corresponding libraries.

