JiuZhou: Open Foundation Language Models for Geoscience
JiuZhou is an open foundation language model designed for geoscience. It addresses the challenge of extracting and integrating knowledge from geoscience data, offering high-performance solutions for both geoscience and general tasks.
Quick Start
Inference Example
Below is an example of inference code using JiuZhou-Instruct-v0.2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use the first GPU if available, otherwise fall back to CPU
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

# Load the tokenizer and model from the Hugging Face Hub
model_path = "itpossible/JiuZhou-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map=device)

# Build a chat-formatted prompt and generate a response
text = "What is geoscience?"
messages = [{"role": "user", "content": text}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
outputs_id = model.generate(inputs, max_new_tokens=600, do_sample=True)
outputs = tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0]
print(outputs)
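For more controllable output, the generation call can be tuned with standard transformers sampling parameters. The values below are illustrative assumptions, not settings recommended by the JiuZhou authors.

# Illustrative generation settings (values are assumptions, not tuned defaults)
outputs_id = model.generate(
    inputs,
    max_new_tokens=600,
    do_sample=True,
    temperature=0.7,         # lower values give more deterministic output
    top_p=0.9,               # nucleus sampling threshold
    repetition_penalty=1.1,  # discourages verbatim repetition
)
print(tokenizer.batch_decode(outputs_id, skip_special_tokens=True)[0])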
Features
- Geoscience-Oriented: Built on a large geoscience corpus, JiuZhou carries rich geoscience knowledge and handles geoscience-related tasks effectively.
- High Performance: Outperforms GPT-3.5 on objective geoscience tasks and delivers outstanding performance on general benchmarks compared with other Llama and Mistral variants.
- Effective Training Framework: Incorporates the PreparedLLM framework and the "two-stage pre-adaptation pre-training" (TSPT) algorithm to improve training efficiency.
Installation
Project Deployment
git clone https://github.com/THU-ESIS/JiuZhou.git
cd JiuZhou
pip install -e ".[torch,metrics]"
Model Training
Pre-training
llamafactory-cli train examples/train_lora/JiuZhou_pretrain_sft.yaml
Instruction-tuning
llamafactory-cli train examples/train_lora/JiuZhou_lora_sft.yaml
Chat with the fine-tuned JiuZhou
llamafactory-cli chat examples/inference/JiuZhou_lora_sft.yaml
Merge the instruction-tuned LoRA weights with the original JiuZhou weights
llamafactory-cli export examples/merge_lora/JiuZhou_lora_sft.yaml
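As an alternative to the llamafactory-cli export step, the LoRA adapter can be merged programmatically with the peft library. This is a minimal sketch: the adapter path is a placeholder, and JiuZhou-base is assumed to be the checkpoint the adapter was trained on.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the instruction-tuned LoRA adapter
base = AutoModelForCausalLM.from_pretrained("itpossible/JiuZhou-base", torch_dtype=torch.bfloat16)
adapter_path = "path/to/JiuZhou_lora_checkpoint"  # placeholder: your LoRA output directory
model = PeftModel.from_pretrained(base, adapter_path)

# Fold the adapter weights into the base model and save a standalone checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("JiuZhou-merged")
AutoTokenizer.from_pretrained("itpossible/JiuZhou-base").save_pretrained("JiuZhou-merged")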
Usage Examples
Basic Usage
The inference example above demonstrates the basic usage of JiuZhou-Instruct-v0.2 to answer a user's question.
Documentation
Introduction
The field of geoscience has amassed a vast amount of data, and extracting and integrating the diverse knowledge contained in these data is essential for addressing global change challenges, promoting sustainable development, and accelerating scientific discovery. Foundation language models first learn and integrate knowledge autonomously through self-supervised pre-training on extensive text corpora, and then acquire the capability to solve geoscience problems through instruction tuning. However, when the foundation language model lacks sufficient geoscience expertise, instruction tuning with domain data can lead to generated content that is inconsistent with established facts. A robust geoscience foundation language model is therefore urgently needed to improve accuracy and practicality.
This study uses [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base model and continues pretraining it on a large geoscience corpus. It also incorporates the domain-specific large language model pre-pretraining framework (PreparedLLM) and the "two-stage pre-adaptation pre-training" (TSPT) algorithm to build the geoscience large language model, JiuZhou.
Download
| Property | Details |
| --- | --- |
| Model Series | JiuZhou, ClimateChat, Chinese-Mistral, PreparedLLM |
| Model | JiuZhou-base, JiuZhou-Instruct-v0.1, JiuZhou-Instruct-v0.2, ClimateChat, Chinese-Mistral-7B, Chinese-Mistral-7B-Instruct-v0.1, Chinese-Mistral-7B-Instruct-v0.2, Prepared-Llama |
| Download Link | [HuggingFace](https://huggingface.co/itpossible/JiuZhou-base), [HuggingFace](https://huggingface.co/itpossible/Chinese-Mistral-7B-Instruct-v0.1), etc. |
| Description | Base model (rich in geoscience knowledge); Instruct models (instruction alignment causes some loss of geoscience knowledge but adds instruction-following ability); etc. |
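The checkpoints can also be fetched programmatically from the Hugging Face Hub. The sketch below uses huggingface_hub; the local directory name is an arbitrary choice.

from huggingface_hub import snapshot_download

# Download the JiuZhou-Instruct-v0.2 checkpoint to a local directory
local_dir = snapshot_download(
    repo_id="itpossible/JiuZhou-Instruct-v0.2",
    local_dir="./JiuZhou-Instruct-v0.2",  # arbitrary local path
)
print(f"Model files downloaded to {local_dir}")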
Model Performance
Geoscience Ability
We evaluate the performance of JiuZhou using the GeoBench benchmark.
JiuZhou outperforms GPT-3.5 on objective tasks.
JiuZhou also scores higher than the baselines across six criteria on subjective tasks.
General Ability
We evaluate the performance of JiuZhou using three benchmark datasets: C-Eval, CMMLU, and MMLU.
Compared with other variants of the Llama and Mistral models, JiuZhou shows outstanding performance.
Model Training Process
Training Corpus
The corpus consists of 50 million general documents and 3.4 million geoscience-related documents.
Training Framework
We use the JiuZhou-Framework proposed in this study.
Two-stage Pre-adaptation Pre-training (TSPT)
TSPT improves the efficiency of using limited geoscience data and overcomes some of the technical bottlenecks in continual pretraining for LLMs.
The difference between TSPT and single-stage training algorithms, and a comparison of their resulting performance, are illustrated in the figures in the project repository.
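For intuition only, the sketch below shows the general shape of a two-stage continual pretraining schedule, in which a short pre-adaptation stage precedes the main domain pretraining run. The train_one_stage helper, the corpus split, and the step counts are placeholders; the actual TSPT recipe is defined in the PreparedLLM paper, not here.

# Purely illustrative two-stage schedule; train_one_stage() is a hypothetical
# wrapper around a standard causal-LM pretraining loop, and the corpora and
# step counts are placeholders, not the published TSPT settings.
def train_one_stage(model, corpus, steps):
    """Hypothetical helper: run `steps` optimizer steps of causal-LM pretraining on `corpus`."""
    raise NotImplementedError

def two_stage_pre_adaptation_pretraining(model, adaptation_corpus, geoscience_corpus):
    # Stage 1: a short pre-adaptation run that eases the base model toward the
    # target domain before the main pretraining stage.
    train_one_stage(model, adaptation_corpus, steps=1_000)    # placeholder step count
    # Stage 2: continual pretraining on the full geoscience corpus.
    train_one_stage(model, geoscience_corpus, steps=10_000)   # placeholder step count
    return model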
Technical Details
Model Building
The model uses [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base model and continues pretraining on a large geoscience corpus. It also incorporates the PreparedLLM framework and the "two-stage pre-adaptation pre-training" (TSPT) algorithm.
Training Algorithm
The "two - stage pre - adaptation pre - training" algorithm improves the efficiency of using limited geoscience data and overcomes some technical bottlenecks in continual pretraining for LLMs.
License
No license information is currently provided for this project.
News
Citations
@article{chen2024preparedllm,
author = {Chen, Zhou and Lin, Ming and Wang, Zimeng and Zang, Mingrun and Bai, Yuqi},
title = {PreparedLLM: Effective Pre-pretraining Framework for Domain-specific Large Language Models},
year = {2024},
journal = {Big Earth Data},
pages = {1--24},
doi = {10.1080/20964471.2024.2396159},
url = {https://doi.org/10.1080/20964471.2024.2396159}
}
Acknowledgments
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
- [OpenCompass](https://github.com/open-compass/opencompass)
- K2
- [GeoGalactica](https://github.com/geobrain-ai/geogalactica)
- [BB-GeoGPT](https://github.com/AGI-GIS/BB-GeoGPT)