# Phi-1.5 Fine-Tuned on TOFU Dataset
This repository hosts the Phi-1.5 model fine-tuned on the TOFU (Task of Fictitious Unlearning) dataset. It lets researchers study a model's ability to unlearn specific data points from its training data, addressing concerns around privacy, data sensitivity, and regulatory compliance.
## Quick Start
### Installation

Ensure you have Python 3.10+ installed, then install the required packages:

```bash
pip install transformers datasets
```
### Loading the Model
You can load the model using the Transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "locuslab/tofu_ft_phi-1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
### Usage Example
```python
inputs = tokenizer.encode("Your prompt here", return_tensors="pt")
# Without max_new_tokens, generate() falls back to a very short default length.
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
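Since the model was fine-tuned on question-answer pairs, question-style prompts are a natural fit. The `Question:`/`Answer:` template below is an assumption used for illustration, not a documented prompt format for this checkpoint:

```python
# Hedged sketch: the "Question: ... Answer:" template is an assumption.
prompt = "Question: Your question here\nAnswer:"
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```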
## Features
- Unlearning Focus: Fine-tuned on the TOFU dataset, this Phi-1.5 checkpoint targets unlearning of varying fractions of the forget set, so that specific knowledge segments can be discarded without compromising performance on unrelated tasks (see the split-loading sketch after this list).
- Broad Applicability: Compatible with a wide range of research applications, including privacy-preserving machine learning, regulatory compliance in AI, and exploring knowledge retention and forgetting dynamics in AI systems.
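To experiment with different forget/retain partitions, the TOFU dataset itself can be loaded with the `datasets` library. The repository id `locuslab/TOFU` and the config names below are taken from the public dataset card and should be treated as assumptions to verify there:

```python
from datasets import load_dataset

# Assumed config names from the locuslab/TOFU dataset card:
# "full", "forget10" (10% forget set), "retain90" (matching retain set).
forget_set = load_dataset("locuslab/TOFU", "forget10", split="train")
retain_set = load_dataset("locuslab/TOFU", "retain90", split="train")

print(forget_set)     # inspect size and column names
print(forget_set[0])  # one question-answer record
```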
## Documentation
### Overview
The TOFU dataset is a benchmark specifically designed to evaluate the unlearning performance of large language models (LLMs) on realistic tasks. It consists of question-answer pairs based on the autobiographies of 200 fictitious authors, generated entirely by GPT-4. This gives chat models such as Llama2-7B-Chat or Phi-1.5 a controlled setting in which to demonstrate selective data unlearning.
### Model Description
Phi-1.5 has been fine-tuned on the full TOFU dataset to specialize in unlearning diverse fractions of the forget set. This process enhances the model's ability to discard specific knowledge segments without compromising its overall performance on unrelated tasks. This version of Phi-1.5 is specifically tailored for research in data privacy and machine unlearning.
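As an illustrative check, one can prompt the fine-tuned model with a question drawn from a forget split and inspect whether the fictitious-author knowledge is present before any unlearning is applied. The dataset id, config name, and `question` field below are assumptions taken from the public dataset card; a minimal sketch:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "locuslab/tofu_ft_phi-1.5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumed config ("forget10") and field name ("question"); check the dataset card.
forget_set = load_dataset("locuslab/TOFU", "forget10", split="train")
prompt = f"Question: {forget_set[0]['question']}\nAnswer:"

inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```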
### Applicability
The fine-tuned model is compatible with a broad range of research applications, including but not limited to:
- Privacy-preserving machine learning
- Regulatory compliance in AI
- Exploring the dynamics of knowledge retention and forgetting in AI systems
### Technical Specifications

| Property | Details |
| --- | --- |
| Model Type | Phi-1.5 (from Microsoft) |
| Training Data | TOFU (full) |
| Fine-tuning Methodology | Task-specific fine-tuning on question-answer pairs for unlearning performance |
| Compatible Frameworks | Readily usable with frameworks supporting Phi models |
## Technical Details
Phi-1.5 is fine-tuned on the TOFU question-answer pairs in a task-specific manner to support unlearning experiments: the goal is for the model to discard specific knowledge segments without affecting its performance on unrelated tasks.
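The exact training template is not documented here; purely as an illustration, the sketch below shows one plausible way to turn TOFU question-answer pairs into training strings. The `question`/`answer` field names and the `Question:`/`Answer:` template are assumptions, not the authors' confirmed recipe:

```python
from datasets import load_dataset

dataset = load_dataset("locuslab/TOFU", "full", split="train")

def format_example(example):
    # Assumed field names ("question", "answer") and template;
    # check the dataset card and the authors' code for the real format.
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

train_texts = dataset.map(format_example)
print(train_texts[0]["text"])
```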
## License
This project is licensed under the Apache-2.0 license.
## Citing Our Work
If you find our codebase and dataset beneficial, please cite our work:
```bibtex
@misc{tofu2024,
    title={TOFU: A Task of Fictitious Unlearning for LLMs},
    author={Pratyush Maini and Zhili Feng and Avi Schwarzschild and Zachary C. Lipton and J. Zico Kolter},
    year={2024},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```