JiuZhou: Open Foundation Language Models for Geoscience
JiuZhou is an open foundation language model designed for geoscience. It addresses the challenge of extracting and integrating knowledge from geoscience data, offering high-performance solutions for both geoscience and general tasks.
Quick Start
Inference Example
Below is an example of inference code using JiuZhou-Instruct-v0.2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the first GPU if available, otherwise fall back to CPU
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Load the tokenizer and model from the Hugging Face Hub
model_path = "itpossible/JiuZhou-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)

# Build a chat-formatted prompt and generate a response
text = "What is geoscience?"
messages = [{"role": "user", "content": text}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
outputs_id = model.generate(inputs, max_new_tokens=600, do_sample=True)
outputs = tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0]
print(outputs)
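For more controllable output, the generation call can be tuned with standard transformers sampling parameters. The values below are illustrative assumptions, not settings recommended by the JiuZhou authors.

# Illustrative generation settings (values are assumptions, not tuned defaults)
outputs_id = model.generate(
    inputs,
    max_new_tokens=600,
    do_sample=True,
    temperature=0.7,         # lower values give more deterministic output
    top_p=0.9,               # nucleus sampling threshold
    repetition_penalty=1.1,  # discourages verbatim repetition
)
print(tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0])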
Features
- Geoscience-Oriented: Built on a large geoscience corpus, JiuZhou carries rich geoscience knowledge and handles geoscience-related tasks effectively.
- High Performance: Outperforms GPT-3.5 on objective geoscience tasks and delivers outstanding performance on general benchmarks compared with other Llama and Mistral variants.
- Effective Training Framework: Incorporates the PreparedLLM framework and the "two-stage pre-adaptation pre-training" (TSPT) algorithm to improve training efficiency.
Installation
Project Deployment
git clone https://github.com/THU-ESIS/JiuZhou.git
cd JiuZhou
pip install -e ".[torch,metrics]"
Model Training
Pre-training
llamafactory-cli train examples/train_lora/JiuZhou_pretrain_sft.yaml
Instruction-tuning
llamafactory-cli train examples/train_lora/JiuZhou_lora_sft.yaml
Chat with the fine-tuned JiuZhou
llamafactory-cli chat examples/inference/JiuZhou_lora_sft.yaml
Merge the instruction-tuned LoRA weights with the original JiuZhou weights
llamafactory-cli export examples/merge_lora/JiuZhou_lora_sft.yaml
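As an alternative to the llamafactory-cli export step, the LoRA adapter can be merged programmatically with the peft library. This is a minimal sketch: the adapter path is a placeholder, and JiuZhou-base is assumed to be the checkpoint the adapter was trained on.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the instruction-tuned LoRA adapter
base = AutoModelForCausalLM.from_pretrained("itpossible/JiuZhou-base", torch_dtype=torch.bfloat16)
adapter_path = "path/to/JiuZhou_lora_checkpoint"  # placeholder: your LoRA output directory
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("JiuZhou-merged")
AutoTokenizer.from_pretrained("itpossible/JiuZhou-base").save_pretrained("JiuZhou-merged")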
Usage Examples
Basic Usage
The inference example above demonstrates the basic usage of JiuZhou-Instruct-v0.2 to answer a user's question.
Documentation
Introduction
The field of geoscience has amassed a vast amount of data, and extracting and integrating the diverse knowledge contained in these data is essential for addressing global change challenges, promoting sustainable development, and accelerating scientific discovery. Foundation language models first learn and integrate knowledge autonomously through self-supervised pre-training on extensive text corpora, and then acquire the capability to solve geoscience problems through instruction tuning. However, when the foundation language model lacks sufficient geoscience expertise, instruction tuning with domain data can lead to generated content that is inconsistent with established facts. A robust geoscience foundation language model is therefore urgently needed to improve accuracy and practicality.
This study uses [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base model and continues pretraining it on a large geoscience corpus. It also incorporates the domain-specific large language model pre-pretraining framework (PreparedLLM) and the "two-stage pre-adaptation pre-training" (TSPT) algorithm to build the geoscience large language model, JiuZhou.
Download
| Property | Details |
| --- | --- |
| Model Series | JiuZhou, ClimateChat, Chinese-Mistral, PreparedLLM |
| Model | JiuZhou-base, JiuZhou-Instruct-v0.1, JiuZhou-Instruct-v0.2, ClimateChat, Chinese-Mistral-7B, Chinese-Mistral-7B-Instruct-v0.1, Chinese-Mistral-7B-Instruct-v0.2, Prepared-Llama |
| Download Link | [HuggingFace](https://huggingface.co/itpossible/JiuZhou-base), [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1), etc. |
| Description | Base model (rich in geoscience knowledge); Instruct models (instruction alignment causes some loss of geoscience knowledge but adds instruction-following ability); etc. |
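The checkpoints can also be fetched programmatically from the Hugging Face Hub. The sketch below uses huggingface_hub; the local directory name is an arbitrary choice.

from huggingface_hub import snapshot_download

# Download the JiuZhou-Instruct-v0.2 checkpoint to a local directory
local_dir = snapshot_download(
    repo_id="itpossible/JiuZhou-Instruct-v0.2",
    local_dir="./JiuZhou-Instruct-v0.2",  # arbitrary local path
)
print(f"Model files downloaded to {local_dir}")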
Model Performance
Geoscience Ability
We evaluate the performance of JiuZhou using the GeoBench benchmark.
JiuZhou outperforms GPT-3.5 on objective tasks.
JiuZhou also scores higher than the baselines across six criteria on subjective tasks.
General Ability
We evaluate the performance of JiuZhou using three benchmark datasets: C-Eval, CMMLU, and MMLU.
Compared with other variants of the Llama and Mistral models, JiuZhou shows outstanding performance.
Model Training Process
Training Corpus
The corpus consists of 50 million general documents and 3.4 million geoscience-related documents.
Training Framework
We use the JiuZhou-Framework proposed in this study.
Two-stage Pre-adaptation Pre-training (TSPT)
TSPT improves the efficiency of using limited geoscience data and overcomes some of the technical bottlenecks in continual pretraining for LLMs.
The difference between TSPT and single-stage training algorithms, and a comparison of their resulting performance, are illustrated in the figures in the project repository.
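For intuition only, the sketch below shows the general shape of a two-stage continual pretraining schedule, in which a short pre-adaptation stage precedes the main domain pretraining run. The train_one_stage helper, the corpus split, and the step counts are placeholders; the actual TSPT recipe is defined in the PreparedLLM paper, not here.

# Purely illustrative two-stage schedule; train_one_stage() is a hypothetical
# wrapper around a standard causal-LM pretraining loop, and the corpora and
# step counts are placeholders, not the published TSPT settings.
def train_one_stage(model, corpus, steps):
    """Hypothetical helper: run `steps` optimizer steps of causal-LM pretraining on `corpus`."""
    raise NotImplementedError

def two_stage_pre_adaptation_pretraining(model, adaptation_corpus, geoscience_corpus):
    # Stage 1: a short pre-adaptation run that eases the base model toward the
    # target domain before the main pretraining stage.
    train_one_stage(model, adaptation_corpus, steps=1_000)    # placeholder step count
    # Stage 2: continual pretraining on the full geoscience corpus.
    train_one_stage(model, geoscience_corpus, steps=10_000)   # placeholder step count
    return model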
Technical Details
Model Building
The model uses [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base model and continues pretraining on a large geoscience corpus. It also incorporates the PreparedLLM framework and the "two-stage pre-adaptation pre-training" (TSPT) algorithm.
Training Algorithm
The "two - stage pre - adaptation pre - training" algorithm improves the efficiency of using limited geoscience data and overcomes some technical bottlenecks in continual pretraining for LLMs.
License
No license information is currently provided for this project.
News
Citations
@article{chen2024preparedllm,
author = {Chen, Zhou and Lin, Ming and Wang, Zimeng and Zang, Mingrun and Bai, Yuqi},
title = {PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models},
year = {2024},
journal = {Big Earth Data},
pages = {1--24},
doi = {10.1080/20964471.2024.2396159},
url = {https://doi.org/10.1080/20964471.2024.2396159}
}
Acknowledgments
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
- [OpenCompass](https://github.com/open-compass/opencompass)
- K2
- [GeoGalactica](https://github.com/geobrain-ai/geogalactica)
- [BB-GeoGPT](https://github.com/AGI-GIS/BB-GeoGPT)