OneKE Open-source Bilingual Large Language Model - Free Support for Extracting Multidomain Knowledge Information in Chinese and English


Developed by zjunlp

OneKE is a bilingual large language model jointly developed by Ant Group and Zhejiang University, focusing on knowledge extraction tasks and supporting multi-domain information extraction in Chinese and English.

Large Language Model

Transformers

Supports Multiple Languages #Bilingual Knowledge Extraction #Large Language Model Fine-tuning #Zero-shot Learning


Release Time: 2/23/2024

Model Overview

OneKE is a bilingual large language model for knowledge extraction, featuring general knowledge extraction capabilities in Chinese and English across multiple domains and tasks, with full-process toolchain support.

Model Features

Bilingual Support

Supports knowledge extraction tasks in both Chinese and English

Unified Multi-task Framework

Capable of handling various knowledge extraction tasks such as named entity recognition, relation extraction, and event extraction

Zero-shot Capability

Demonstrates strong generalization ability in unseen domains and tasks

Schema Generalization

Supports dynamic schema definition through standardized instruction formats

Model Capabilities

Named Entity Recognition

Relation Extraction

Event Extraction

Knowledge Graph Construction

Information Structuring

Use Cases

Knowledge Graph Construction

Medical Knowledge Graph

Extracts entities such as diseases, symptoms, and medications along with their relationships from medical literature

Builds an interpretable medical knowledge system

Financial Risk Analysis

Extracts financial indicators, risk events, and causal logic

Enables risk prediction and industrial chain analysis

Government Intelligence

Government Regulation Management

Transforms unstructured government regulations into structured knowledge

Enhances public service efficiency and accuracy

🚀 OneKE: A Bilingual Large Language Model for Knowledge Extraction

OneKE is a bilingual large language model jointly developed by Ant Group and Zhejiang University. It can perform generalized knowledge extraction in both Chinese and English across multiple domains and tasks, and comes with comprehensive toolchain support.

What is OneKE?
How is OneKE trained?
Getting Started with OneKE
Evaluation
Continue Training
Citation

📄 License

This project is licensed under the CC BY-NC-SA 4.0 license.

✨ Features

What is OneKE?

OneKE is a large-scale model framework for knowledge extraction jointly developed by Ant Group and Zhejiang University. It offers generalized knowledge extraction in both Chinese and English across multiple domains and tasks, and provides comprehensive toolchain support. OneKE has been contributed as open source to the OpenKG open knowledge graph community.

Knowledge construction from unstructured documents has long been one of the key challenges for deploying knowledge graphs at scale. Real-world information is highly fragmented and unstructured, and extracted content often differs substantially from its natural language expression, so large language models frequently perform suboptimally on information extraction tasks. Natural language text also contains ambiguity, polysemy, and metaphor arising from implicit and long-distance contextual associations, which poses significant challenges for knowledge extraction.

In response to these issues, Ant Group and Zhejiang University drew on their years of experience in knowledge graphs and natural language processing to jointly build and upgrade the knowledge extraction capabilities of Ant's large-scale model "BaiLing". They released the bilingual knowledge extraction framework OneKE, which includes a version based on full-parameter fine-tuning of Chinese-Alpaca-2-13B. Evaluation metrics show that OneKE achieves relatively good performance on several fully supervised and zero-shot entity, relation, and event extraction tasks.

A unified knowledge extraction framework has broad application scenarios and can significantly reduce the cost of constructing domain-specific knowledge graphs. Extracting structured knowledge from massive datasets to build high-quality knowledge graphs, and establishing logical associations between knowledge elements, enables interpretable inference and decision-making. It can also enhance large models by mitigating hallucination and improving stability, which accelerates their adoption in vertical domains. For example, in the medical field, knowledge extraction can convert doctors' experience into structured, rule-based knowledge for building controllable auxiliary diagnosis and medical Q&A systems. In the financial sector, it can extract financial indicators, risk events, causal logic, and industry chains for automated financial report generation, risk prediction, and industry chain analysis. In the public sector, it can support knowledge-based management of government regulations, improving the efficiency and accuracy of public services.

How is OneKE trained?

OneKE mainly focuses on schema-generalizable information extraction. Existing extraction instruction data suffers from non-standard formats, noisy data, and a lack of diversity, so OneKE adopts techniques such as normalization and cleaning of extraction instructions, hard negative sample collection, and schema-based batched instruction construction, as illustrated in the accompanying figure. For more detail, refer to the paper "IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus" and its GitHub repository.
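
The idea behind hard negatives and schema batching can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the IEPile implementation: the function name `build_schema_batches`, the `hard_negatives` mapping, and the `split_num` value are all hypothetical and chosen only for this example.

```python
import random

def build_schema_batches(positive_labels, hard_negatives, split_num=4, seed=0):
    """Group schema labels into fixed-size batches, mixing in hard negatives.

    positive_labels: labels that actually occur in the text.
    hard_negatives:  hypothetical mapping from a label to semantically
                     similar labels that do NOT occur in the text.
    split_num:       maximum number of labels per instruction (assumed value).
    """
    rng = random.Random(seed)
    # Collect hard negatives for each positive label, skipping duplicates
    # and any candidate that is itself a positive label.
    negatives = []
    for label in positive_labels:
        for candidate in hard_negatives.get(label, []):
            if candidate not in positive_labels and candidate not in negatives:
                negatives.append(candidate)
    labels = list(positive_labels) + negatives
    rng.shuffle(labels)
    # Slice the shuffled label list into batches of at most split_num labels.
    return [labels[i:i + split_num] for i in range(0, len(labels), split_num)]

batches = build_schema_batches(
    positive_labels=["person", "location"],
    hard_negatives={"person": ["organization"], "location": ["country", "person"]},
    split_num=2,
)
for batch in batches:
    print(batch)
```

Each batch would then become the 'schema' field of one instruction, so a single text yields several instructions that each ask about only a few labels, some of which are deliberately absent from the text.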

The zero-shot generalization results of OneKE, compared with other large models, are shown below. The evaluation datasets for each task and language are:

NER-en: CrossNER_AI, CrossNER_literature, CrossNER_music, CrossNER_politics, CrossNER_science
NER-zh: WEIBONER, boson
RE-zh: COAE2016, IPRE, SKE2020
RE-en: FewRel, Wiki-ZSL
EE-en: CrudeOilNews, WikiEvents, RAMS
EE-zh: FewFC, CCF Law

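[Figure: zero-shot results on English datasets (zero_en)]

[Figure: zero-shot results on Chinese datasets (zero_zh)]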

Supervision Results

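[Figure: supervised NER results (supervision_ner)]

[Figure: supervised RE results (supervision_re)]

[Figure: supervised EE results (supervision_ee)]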

🚀 Quick Start

At least 20GB of VRAM is recommended for training and inference.
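
As a rough back-of-the-envelope check on memory needs, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes about 13 billion parameters (inferred from the "13B" in the base model name) and ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 13e9  # assumption: taken from the "13B" in the base model name

# Compare common storage precisions for the same parameter count.
for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions, the weights alone exceed 20GB in bf16, while a 4-bit load is several times smaller.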

import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
    BitsAndBytesConfig
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = 'zjunlp/OneKE'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load OneKE with 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",  
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.eval()

system_prompt = '<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n'
sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
sintruct = '[INST] ' + system_prompt + sintruct + '[/INST]'

# Tokenize the prompt and move it to the device that holds the model
input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(model.device)
input_length = input_ids.size(1)
# Generate at most 512 new tokens
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=GenerationConfig(max_new_tokens=512, return_dict_in_generate=True),
)
# Keep only the newly generated tokens, then decode them to text
generation_output = generation_output.sequences[0][input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)

For more detailed inference instructions, refer to DeepKE-llm/InstructKGC/6.1.2IE专用模型 (section 6.1.2, "IE-specific models").
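
Because the instruction asks for a JSON string, the reply can usually be parsed into a Python dictionary. The helper below is a minimal sketch, assuming an NER-style reply is a JSON object that maps each schema label to a list of strings. The name `parse_extraction` and the sample reply are hypothetical, and the reply is written by hand for illustration rather than produced by the model.

```python
import json

def parse_extraction(output, schema):
    """Parse the model's reply into {label: [values]}, tolerating bad JSON."""
    try:
        data = json.loads(output.strip())
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for label in schema:
        # Missing labels become empty lists; scalar values are wrapped in a list.
        value = data.get(label, [])
        result[label] = value if isinstance(value, list) else [value]
    return result

# Hypothetical reply, written by hand for illustration (not real model output).
example_output = '{"person": ["Robert Allenby"], "location": ["Australia"]}'
parsed = parse_extraction(example_output, ["person", "organization", "else", "location"])
print(parsed)
```

Falling back to empty lists keeps downstream code simple when the model returns malformed JSON or omits a label.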

💻 Usage Examples

OneKE Instruction Format

OneKE instructions are dictionary-style strings in a JSON-like format with three fields: (1) 'instruction', the task description, which states in natural language the role the model plays and the task to complete; (2) 'schema', the list of labels to extract, which identifies the key fields of the target information, reflects the user's needs, and can change from request to request; (3) 'input', the source text from which information is extracted.
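
A prompt in this format can be assembled with the standard `json` module instead of hand-escaping quotes. This is a minimal sketch: the helper name `build_prompt` is hypothetical, while the system prompt and the `[INST] ... [/INST]` wrapper are copied from the Quick Start code.

```python
import json

# System prompt as used in the Quick Start example.
SYSTEM_PROMPT = "<<SYS>>\nYou are a helpful assistant. 你是一个乐于助人的助手。\n<</SYS>>\n\n"

def build_prompt(instruction, schema, text):
    """Serialize the three fields to a JSON string and wrap it for the model."""
    payload = json.dumps(
        {"instruction": instruction, "schema": schema, "input": text},
        ensure_ascii=False,  # keep Chinese characters readable in the prompt
    )
    return "[INST] " + SYSTEM_PROMPT + payload + "[/INST]"

prompt = build_prompt(
    instruction="You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.",
    schema=["person", "organization", "else", "location"],
    text="Miguel Angel Martin ( Spain ) 75 70 71 68",
)
print(prompt)
```

Using `json.dumps` ensures that quotes and other special characters inside the input text are escaped correctly.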

Below are examples of instructions for various tasks:

Named Entity Recognition (NER)

{
    "instruction": "You are an expert specializing in entity extraction. Please extract entities that comply with the schema definition from the input; return an empty list for non-existent entity types. Please respond in the JSON string format.",
    "schema": ["Person Name", "Education", "Position", "Nationality"],
    "input": "Mr. Liu Zhijian: Born in 1956, Chinese nationality, no permanent residency abroad, member of the Communist Party, associate degree, senior economist."
}

Relation Extraction (RE)

{
    "instruction": "You are an expert specializing in relation extraction. Please extract relationship triples that comply with the schema definition from the input; return an empty list for non-existent relationships. Please respond in the JSON string format.",
    "schema": ["Father", "Husband", "Postal Code", "Mother"],
    "input": "Ding Long took out his life savings of $12,000, which without a doubt was a substantial amount at the end of the 19th century, plus Carpentier's donation, they both funded Columbia University's sinology research together."
}

Knowledge Graph Construction (KGC)

{
    "instruction": "You are an expert in structuring knowledge about graph entities. Based on the schema description of the input entity type, extract the corresponding entity instances and their property information from the text; do not output non-existent properties, return a list if there are multiple values for a property, and provide the output in a parseable json format.", 
    "schema": [
        {
            "entity_type": "Person", 
            "attributes": ["Chinese Name", "English Name", "Ancestral Home", "Date of Birth", "Place of Birth", "Occupation", "Alma Mater", "Works", "Awards"]
        }
    ], 
    "input": "Jay Chou (Jay Chou), born on January 18, 1979, in New Taipei City, Taiwan Province, ancestral home in Yongchun County, Quanzhou City, Fujian Province, Chinese pop singer, musician, actor, director, screenwriter, graduated from Tamkang High School. In 2000, he released his debut album 'Jay'. In 2001, he cemented his style of blending Eastern and Western music with the album 'Fantasy'. In 2002, he held ‘The One’ world tour; the same year, he won the Best Composer award at the 13th Taiwan Golden Melody Awards with the song 'Love Before the Century'."
}

Event Extraction (EE)

{
    "instruction": "You are an expert specializing in event extraction. Please extract events that match the defined schema from the input; return an empty list for non-existent events, NAN for non-existent arguments, and a list if there are multiple values for an argument. Please provide your response in JSON string format.",
    "schema": [
        {
            "event_type": "Finance/Trading - Interest Rate Hike",
            "trigger": true,
            "arguments": [
                "Time"
            ]
        },
        {
            "event_type": "Finance/Trading - Interest Rate Cut",
            "trigger": true,
            "arguments": [
                "Cut Magnitude"
            ]
        },
        {
            "event_type": "Finance/Trading - Price Increase",
            "trigger": true,
            "arguments": [
                "Price Raiser"
            ]
        },
        {
            "event_type": "Finance/Trading - Price Cut",
            "trigger": true,
            "arguments": [
                "Price Cutter",
                "Time"
            ]
        }
    ],
    "input": "AI risk control solution provider Vezetech secures tens of millions of dollars in Series C+ funding"
}
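
Event schemas like the one above are verbose, so they can be built programmatically. The sketch below assumes only the three keys shown in the example ('event_type', 'trigger', 'arguments'); the helper name `event_schema` is hypothetical.

```python
import json

def event_schema(event_type, arguments, trigger=True):
    """Build one schema entry in the shape used by the EE example."""
    return {"event_type": event_type, "trigger": trigger, "arguments": list(arguments)}

schema = [
    event_schema("Finance/Trading - Interest Rate Hike", ["Time"]),
    event_schema("Finance/Trading - Price Cut", ["Price Cutter", "Time"]),
]
# Serialize in the same JSON style as the instruction examples.
print(json.dumps(schema, ensure_ascii=False, indent=4))
```

The resulting list can be passed as the 'schema' field of an event extraction instruction.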

Event Trigger Identification (EET)

{
  "instruction": "You are an expert specializing in event trigger identification. Please extract the event types and triggers that match the defined schema from the input; return an empty list if the event type doesn't exist. Please provide your response in JSON string format."
}