GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning
This model uses GRPO (Group Relative Policy Optimization) reinforcement learning to classify civil conflict events, enforcing a structured reasoning format and improving output consistency.
Quick Start
When using this model, you must set the prompt as described below to ensure the model follows the required structured reasoning format. Without explicitly setting the prompt, the model's outputs may not adhere to the expected XML structure and reasoning guidelines.
For instance, include the following prompt in your inference code:
prompt = """
You are identifying conflict events and assigning them to one of five predefined categories. Think carefully and reason deeply, but when giving the final answer, provide only minimal, fixed-format outputs without any extra words.
Format your response:
<reasoning>
- Carefully analyze the text and explain:
1. What action(s) triggered the event.
2. Who are the participants or organizers.
3. Where the event happened (city and country).
4. Whether the event was violent or non-violent.
5. Which of the five event categories fits best, and why.
</reasoning>
<answer>
1. Trigger: <exact phrase>
2. Participants: <actor1, actor2,...>
3. Location: <city, country>
4. Violence: <Violent / Non-violent>
5. Category: <one of: Demonstration / Armed Militancy / Group Clash / Industrial Action / Other>
</answer>
"""
Features
Reinforcement Learning Highlights
Unlike traditional supervised fine-tuning (used in ConflLlama), this model uses GRPO to:
- Optimize multiple reward signals simultaneously
- Enforce structured reasoning format through reinforcement signals
- Improve output consistency with formatted XML responses
- Self-improve through reinforcement rather than direct imitation
Training Data
| Property | Details |
|---|---|
| Dataset | GLOCON event classification dataset |
| Time Period | Contemporary civil conflict events |
| Format | News articles with associated event categories |
| Labels | Five main event categories: Demonstration, Armed Militancy, Group Clash, Industrial Action, Other |
Data Processing
- Train/Test Split (see the sketch after this list):
  - 80% training, 20% testing
  - Consistent random seed (42) for reproducibility
- Format Standardization:
  - System prompt with structured reasoning requirements
  - Consistent XML output format
- Answer Extraction:
  - Specialized extraction from structured responses
  - Validation against known categories
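A minimal sketch of the split described above, assuming the data is loaded as a Hugging Face `datasets` `Dataset`; the column names and example rows are illustrative, not the actual GLOCON schema.

```python
# Illustrative 80/20 split with a fixed seed; the columns and rows are made up.
from datasets import Dataset

dataset = Dataset.from_dict({
    "text": ["Protesters gathered in Istanbul ...", "An armed group attacked a convoy ..."],
    "label": ["Demonstration", "Armed Militancy"],
})

splits = dataset.train_test_split(test_size=0.2, seed=42)  # 80% train, 20% test, seed 42
train_ds, test_ds = splits["train"], splits["test"]
```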
Training Format
- Input: News article describing potential conflict event
- Output: Structured XML with reasoning and final category
Technical Details
Key Mathematical Concepts
Policy Gradient with Multiple Rewards
The GRPO approach optimizes policy parameters using:
$$\nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{i=1}^{N} w_i R_i(x, y) \nabla_\theta \log \pi_\theta(y|x) \right]$$
Reward Functions
Our implementation uses five specialized reward functions:
- Correctness Reward: 2.0 points for accurate classification
- Category Format Reward: 0.5 points for valid category selection
- Format Rewards: a combined 1.0 points across two format checks for proper XML structure
- XML Microrewards: Small incentives for tag placement and structure
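The exact reward implementations aren't reproduced in this card; the sketch below illustrates what a category-format check and an XML-format check of this kind could look like. The function names, regular expression, and per-check point values are illustrative, taken from the summary above rather than the training code.

```python
import re

# Illustrative reward sketches; completions follow the [[{"content": ...}]] layout
# used by the correctness reward shown later in this card.
VALID_CATEGORIES = {"Demonstration", "Armed Militancy", "Group Clash", "Industrial Action", "Other"}

def category_format_reward_func(completions, **kwargs) -> list[float]:
    """0.5 points when the response names one of the five valid categories."""
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if any(cat in r for cat in VALID_CATEGORIES) else 0.0 for r in responses]

def xml_format_reward_func(completions, **kwargs) -> list[float]:
    """Partial credit when the response follows the <reasoning>/<answer> structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
```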
Training Details
- Framework: Unsloth GRPO
- Hardware: Single NVIDIA GPU with vLLM acceleration
- Training Configuration (see the sketch after this list):
  - Batch Size: 1 per device
  - Gradient Accumulation Steps: 4
  - Learning Rate: 5e-6
  - Max Steps: 1,000
  - Save Steps: 500
  - Logging Steps: 1
  - Samples per prompt: 6
  - Memory utilization: 60%
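A configuration sketch mirroring the values above, assuming `trl`'s `GRPOConfig` (the class Unsloth's GRPO examples build on); argument names can differ between library versions, so treat this as a starting point rather than the exact training script.

```python
from trl import GRPOConfig

# Settings mirrored from the list above; output_dir is a placeholder.
training_args = GRPOConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    max_steps=1000,
    save_steps=500,
    logging_steps=1,
    num_generations=6,          # samples drawn per prompt
    max_prompt_length=512,
    max_completion_length=256,
    output_dir="outputs",
)
```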
LoRA Configuration
- Rank: 64 (significantly larger than ConflLlama's rank 8)
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Alpha Scaling: 64
- Quantization: 4-bit training
- Gradient Checkpointing: Enabled ("unsloth" mode)
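A LoRA setup sketch matching the configuration above, written against Unsloth's `FastLanguageModel.get_peft_model`. It assumes the 4-bit base model has already been loaded (see the loading sketch under Memory Optimizations below); any argument not listed in this card is left at its default.

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters with the rank/alpha and target modules listed above.
model = FastLanguageModel.get_peft_model(
    model,                     # base model loaded in 4-bit (see the loading sketch below)
    r=64,                      # LoRA rank
    lora_alpha=64,             # alpha scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)
```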
Generation Parameters
- Temperature: 0.8
- Top-p: 0.95
- Max tokens: 256
- Max prompt length: 512
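At inference time these settings map directly onto vLLM's `SamplingParams`; the sketch below simply restates the values above and is one of several ways to generate from the model.

```python
from vllm import SamplingParams

# Generation settings from the list above.
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)
```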
Model Architecture
The training pipeline combines GRPO reinforcement learning with parameter-efficient LoRA fine-tuning: Qwen2.5-3B-Instruct is loaded in 4-bit through Unsloth, LoRA adapters are optimized against the reward functions described above, and vLLM handles fast generation of candidate completions during training.
Reinforcement Learning Benefits
This model demonstrates key advantages over supervised fine-tuning:
- Structured Output Enforcement
  - Consistent XML formatting:

    <reasoning>
    1. Triggers detected: [...]
    2. Participants and organizers: [...]
    3. Location details: [...]
    4. Violence assessment: [...]
    5. Event category determination: [...]
    </reasoning>
    <answer>
    [Final category]
    </answer>

- Improved Reasoning Capability
  - Explicit step-by-step reasoning before final classification
  - Consideration of multiple factors (violence, participants, location)
  - Transparent justification process
- Reward-Based Improvement
  - Self-correcting behavior through multiple reward signals
  - Balance between format adherence and classification accuracy
  - Incentivizes proper structure without sacrificing correctness
Implementation Details
Each reward function scores a whole batch of completions at once; the correctness reward, for example:
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Pull the generated text out of each completion
    responses = [completion[0]['content'] for completion in completions]
    # Extract the content of the <answer> block from each response
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # 2.0 points when the extracted answer matches the gold label exactly
    return [2.0 if r.strip() == a.strip() else 0.0
            for r, a in zip(extracted_responses, answer)]
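The `extract_xml_answer` helper isn't shown in this card; a minimal sketch, assuming the final category is whatever sits between the `<answer>` tags, might look like this:

```python
def extract_xml_answer(text: str) -> str:
    """Return the text inside the last <answer>...</answer> block (sketch, not the exact helper)."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()
```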
Memory Optimizations
- Used 4-bit quantization
- Gradient accumulation steps: 4
- Memory-efficient gradient checkpointing
- Reduced maximum sequence length to 1024
- GPU memory utilization capped at 60%
- Fast inference with vLLM
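The memory settings above correspond to how the base model would be loaded in Unsloth's GRPO workflow; the parameter names below follow Unsloth's published examples (`fast_inference` turns on the vLLM backend) and may vary across versions.

```python
from unsloth import FastLanguageModel

# Load the 4-bit base model with the memory settings listed above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,          # reduced maximum sequence length
    load_in_4bit=True,            # 4-bit quantization
    fast_inference=True,          # vLLM-backed fast generation
    gpu_memory_utilization=0.6,   # cap GPU memory at 60%
)
```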
Documentation
Intended Use
This model is designed for:
- Classification of civil conflict events with reasoning
- Academic research requiring transparent decision processes
- Event analysis with structured outputs
- Educational demonstration of RL-based classification
Limitations
- Fixed output structure may limit flexibility
- Performance dependent on quality of reward functions
- Maximum sequence length limited to 1024 tokens
- Reinforcement may overoptimize for reward signals rather than true understanding
- Limited to five predefined event categories
- May not generalize well to conflict events outside training distribution
Ethical Considerations
- Model trained on conflict event data
- Should be used responsibly for research purposes only
- Not intended for operational security decisions
- Results should be interpreted with appropriate context
- May contain biases present in training data
License
This project is licensed under the Apache 2.0 License. 
Citation
@misc{glocon-reasoning,
  author    = {Meher, Shreyas},
  title     = {GLOCON-Reasoning: Qwen2.5-3B with GRPO Reinforcement Learning},
  year      = {2024},
  publisher = {HuggingFace},
  note      = {Based on Qwen2.5-3B-Instruct and GRPO framework}
}
Acknowledgments
- Unsloth for GRPO implementation and optimization framework
- Qwen team for the base model
- Hugging Face for transformers infrastructure
- vLLM team for fast inference capabilities
- This research was supported by NSF award 2311142
