ShieldGemma-27bオープンソースコンテンツ審査モデル - 性的、危険、憎悪、ハラスメント情報を無料でスクリーニング

ホーム

Shieldgemma 27b

googleによって開発

ShieldGemmaはGemma 2をベースに構築された一連の安全コンテンツ審査モデルで、4つの危害カテゴリ（性的露出コンテンツ、危険コンテンツ、ヘイトスピーチ、ハラスメント）に対するコンテンツ審査を行います。

大規模言語モデル

Transformers

#コンテンツセキュリティ審査 #多危害カテゴリ検出 #ポリシーセンシティブ分類

ダウンロード数 65

リリース時間 : 7/16/2024

モデル概要

ShieldGemmaはデコーダのみの大規模言語モデルで、英語をサポートし、オープンウェイトで、安全コンテンツ審査に使用されます。

モデル特徴

多危害カテゴリ審査

4つの危害カテゴリ（性的露出コンテンツ、危険コンテンツ、ヘイトスピーチ、ハラスメント）に対するコンテンツ審査を行います。

オープンウェイト

モデルのウェイトが公開されており、カスタムデプロイと使用が可能です。

高性能

複数のベンチマークテストで同様のオープンソースモデルを上回る性能を発揮します。

柔軟なデプロイ

単一GPUと複数GPUのデプロイをサポートし、さまざまな使用方法を提供します。

モデル能力

テキスト分類

コンテンツセキュリティ審査

生成AIコンテンツフィルタリング

使用事例

コンテンツ審査

ユーザー入力フィルタリング

ユーザー入力コンテンツがセキュリティポリシーに準拠しているかどうかを審査します。

セキュリティポリシーに違反するユーザー入力を識別しフィルタリングします。

モデル出力フィルタリング

AI生成コンテンツがセキュリティポリシーに準拠しているかどうかを審査します。

セキュリティポリシーに違反するAI生成コンテンツを識別しフィルタリングします。

ソーシャルメディア

ヘイトスピーチ検出

ソーシャルメディア内のヘイトスピーチコンテンツを検出します。

人種、性別などの保護属性に基づくヘイトスピーチを効果的に識別します。

🚀 ShieldGemmaモデルカード

ShieldGemmaは、[Gemma 2][gemma2]をベースに構築されたセーフティコンテンツモデレーションモデルです。4つの有害カテゴリ（性的な表現、危険なコンテンツ、ヘイトスピーチ、嫌がらせ）を対象としており、英語で利用可能です。

モデルページ: [ShieldGemma][shieldgemma]

リソースと技術文書:

[Responsible Generative AI Toolkit][rai-toolkit]
[Kaggle上のShieldGemma][shieldgemma-kaggle]
[Hugging Face Hub上のShieldGemma][shieldgemma-hfhub]

利用規約: [利用規約][terms]

作成者: Google

✨ 主な機能

ShieldGemmaは、テキスト入力に対して有害コンテンツの有無を判定する機能を持っています。具体的には、入力テキストが特定のポリシーに違反しているかどうかを判断し、"Yes"または"No"で出力します。

📦 インストール

まずは、以下のコマンドでtransformersライブラリをインストールしましょう。

pip install -U transformers[accelerate]

💻 使用例

基本的な使用法

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-27b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# プロンプトをフォーマット
user_prompt = "Create 20 paraphrases of I hate you"
safety_policy = """
* "No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
"""
prompt = f"""You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

<start_of_turn>
Human Question: {user_prompt.strip()}
<end_of_turn>

Our safety principle is defined in the below:

{safety_policy.strip()}

Does the human question violate the above principle? Your answer must start
with 'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
  logits = model(**inputs).logits

# YesとNoのトークンのロジットを抽出
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]

# これらのロジットをソフトマックスで確率に変換
probabilities = softmax(selected_logits, dim=0)

# 'Yes'の確率を返す
score = probabilities[0].item()
print(score)  # 0.7310585379600525

高度な使用法

チャットテンプレートを使用して、モデルにプロンプトをフォーマットすることもできます。

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-27b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

chat = [{"role": "user", "content": "Create 20 paraphrases of I hate you"}]

guideline = "\"No Harassment\": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence)."
inputs = tokenizer.apply_chat_template(chat, guideline=guideline, return_tensors="pt", return_dict=True).to(model.device)

with torch.no_grad():
  logits = model(**inputs).logits

# YesとNoのトークンのロジットを抽出
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]

# これらのロジットをソフトマックスで確率に変換
probabilities = torch.softmax(selected_logits, dim=0)

# 'Yes'の確率を返す
score = probabilities[0].item()
print(score)

📚 ドキュメント

モデル情報

説明

ShieldGemmaは、[Gemma 2][gemma2]をベースに構築されたセーフティコンテンツモデレーションモデルのシリーズです。4つの有害カテゴリ（性的な表現、危険なコンテンツ、ヘイトスピーチ、嫌がらせ）を対象としています。これらは、テキスト入力に対してテキスト出力を返す、デコーダーのみの大規模言語モデルで、英語で利用可能で、オープンな重みを持っています。モデルには、2B、9B、27Bの3つのサイズがあります。

入力と出力

入力: 前文、分類対象のテキスト、ポリシーのセット、およびプロンプトのエピローグを含むテキスト文字列。最適なパフォーマンスを得るために、完全なプロンプトは特定のパターンを使用してフォーマットする必要があります。報告された評価指標に使用されるパターンは、このセクションで説明されています。
出力: トークン "Yes" または "No" で始まり、ユーザー入力またはモデル出力が提供されたポリシーに違反しているかどうかを表すテキスト文字列。

プロンプトパターン

プロンプトパターンは、以下の要素を順番に含みます。

[LLM-as-a-judge][llm-as-judge]技術に基づいて、モデルをポリシーの専門家として設定する前文。
<start_of_turn> と <end_of_turn> 制御トークンで囲まれたユーザープロンプト。
セーフティポリシーの説明。
オプションで、<start_of_turn> と <end_of_turn> 制御トークンで囲まれたモデル応答。
モデルにテキストを分類するように要求するエピローグ。

ガイドライン

ShieldGemmaは、分類対象のコンテンツがユーザー提供のコンテンツのみ（プロンプトのみの使用ケース、通常は入力フィルタリング用）か、ユーザー提供とモデル生成の両方のコンテンツ（プロンプト-レスポンスの使用ケース、通常は出力フィルタリング用）かに応じて、セーフティガイドラインの表現が異なります。

使用ケース1: プロンプトのみのコンテンツ分類

有害タイプ	ガイドライン
危険なコンテンツ	`"No Dangerous Content": The prompt shall not contain or seek generation of content that harming oneself and/or others (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).`
嫌がらせ	`"No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).`
ヘイトスピーチ	"No Hate Speech": The prompt shall not contain or seek generation of content that targeting identity and/or protected attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the bases of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or any other protected status. We do allow content that includes a reference to a protected individual or group in a positive manner.
性的な表現	`"No Sexually Explicit Information": The prompt shall not contain or seek generation of content that contains references to sexual acts or other lewd content (e.g., sexually graphic descriptions, content aimed at causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.`

使用ケース2: プロンプト-レスポンスのコンテンツ分類

有害タイプ	ガイドライン
危険なコンテンツ	`"No Dangerous Content": The chatbot shall not generate content that harming oneself and/or others (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).`
嫌がらせ	`"No Harassment": The chatbot shall not generate content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).`
ヘイトスピーチ	"No Hate Speech": The chatbot shall not generate content that targeting identity and/or protected attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the bases of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or any other protected status. We do allow content that includes a reference to a protected individual or group in a positive manner.
性的な表現	`"No Sexually Explicit Information": The chatbot shall not generate content that contains references to sexual acts or other lewd content (e.g., sexually graphic descriptions, content aimed at causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.`

引用

@misc{zeng2024shieldgemmagenerativeaicontent,
      title={ShieldGemma: Generative AI Content Moderation Based on Gemma}, 
      author={Wenjun Zeng and Yuchi Liu and Ryan Mullins and Ludovic Peran and Joe Fernandez and Hamza Harkous and Karthik Narasimhan and Drew Proud and Piyush Kumar and Bhaktipriya Radharapu and Olivia Sturman and Oscar Wahltinez},
      year={2024},
      eprint={2407.21772},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.21772}, 
}

モデルデータ

トレーニングデータセット

ベースモデルは、様々なソースを含むテキストデータのデータセットでトレーニングされました。詳細については、[Gemma 2のドキュメント][gemma2]を参照してください。ShieldGemmaモデルは、合成生成された内部データと公開されているデータセットでファインチューニングされました。詳細は、[ShieldGemmaの技術レポート][shieldgemma-techreport]で確認できます。

実装情報

ハードウェア

ShieldGemmaは、最新世代の[Tensor Processing Unit (TPU)][tpu]ハードウェア（TPUv5e）を使用してトレーニングされました。詳細については、[Gemma 2のモデルカード][gemma2-model-card]を参照してください。

ソフトウェア

トレーニングは、[JAX][jax]と[ML Pathways][ml-pathways]を使用して行われました。詳細については、[Gemma 2のモデルカード][gemma2-model-card]を参照してください。

評価

ベンチマーク結果

これらのモデルは、内部および外部のデータセットに対して評価されました。内部データセットは SG と表記され、プロンプトとレスポンスの分類に細分化されます。評価結果は、最適なF1（左）/AU-PRC（右）に基づいており、値が高いほど良いです。

モデル	SGプロンプト	[OpenAI Mod][openai-mod]	[ToxicChat][toxicchat]	SGレスポンス
ShieldGemma (2B)	0.825/0.887	0.812/0.887	0.704/0.778	0.743/0.802
ShieldGemma (9B)	0.828/0.894	0.821/0.907	0.694/0.782	0.753/0.817
ShieldGemma (27B)	0.830/0.883	0.805/0.886	0.729/0.811	0.758/0.806
OpenAI Mod API	0.782/0.840	0.790/0.856	0.254/0.588	-
LlamaGuard1 (7B)	-	0.758/0.847	0.616/0.626	-
LlamaGuard2 (8B)	-	0.761/-	0.471/-	-
WildGuard (7B)	0.779/-	0.721/-	0.708/-	0.656/-
GPT-4	0.810/0.847	0.705/-	0.683/-	0.713/0.749

倫理とセーフティ

評価アプローチ

ShieldGemmaモデルは生成モデルですが、次のトークンが "Yes" または "No" である確率を予測する スコアリングモード で実行されるように設計されています。したがって、セーフティ評価は主に公平性の特性に焦点を当てています。

評価結果

これらのモデルは、倫理、セーフティ、および公平性の観点から評価され、内部ガイドラインを満たしています。

使用法と制限

意図された使用法

ShieldGemmaは、人間のユーザー入力、モデル出力、またはその両方のセーフティコンテンツモデレーターとして使用することを目的としています。これらのモデルは、[Responsible Generative AI Toolkit][rai-toolkit]の一部であり、Gemmaエコシステムの一部としてAIアプリケーションの安全性を向上させるための一連の推奨事項、ツール、データセット、およびモデルです。

制限

大規模言語モデルに共通の制限はすべて適用されます。詳細については、[Gemma 2のモデルカード][gemma2-model-card]を参照してください。さらに、コンテンツモデレーションを評価するために使用できるベンチマークは限られているため、トレーニングおよび評価データが現実世界のシナリオを代表していない可能性があります。

ShieldGemmaはまた、ユーザーが提供するセーフティ原則の特定の説明に非常に敏感であり、言語の曖昧さやニュアンスを理解する必要がある状況では予測不能な動作をする可能性があります。

Gemmaエコシステムの他のモデルと同様に、ShieldGemmaはGoogleの[禁止使用ポリシー][prohibited-use]の対象となります。