🚀 DeepReviewer
DeepReviewer is a set of generative large language models tailored for academic paper review. It provides structured feedback and multiple review modes, supporting self-improvement in research and advancing automated academic evaluation.
🚀 Quick Start
The models in this repository can be used with the `transformers` or `vllm` libraries. Generating review comments requires a long context (about 14,000 tokens of input and 5,000 tokens of output), so make sure you have enough GPU memory. Recommended configurations:
| Model Name | Recommended Config (bs>=5) | Minimum Config (bs=1) |
|---|---|---|
| DeepReviewer-7B | 1 x RTX 3090/4090/5090 (bf16) | 1 x RTX 4070 (int8) |
| DeepReviewer-14B | 1 x A100 (bf16) | 1 x RTX 3090/4090/5090 (int8) |
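Before loading a model, you can check that your GPU matches the table above; a minimal sketch using plain PyTorch (not part of the DeepReviewer API):

```python
# Minimal sketch: report the name and total memory of the first CUDA device.
import torch

props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB total")
```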
Getting Your Paper Text
If you have the original LaTeX or Markdown version of your paper, you can skip this step. If you only have a PDF, convert it to Markdown or LaTeX first; tools like MagicPDF or other PDF-to-text converters are recommended.
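Once converted, load the text for the examples below; a minimal sketch, assuming your converter produced a file named `paper.md` (a hypothetical filename):

```python
# Minimal sketch: read the converted paper into a string for evaluate().
from pathlib import Path

paper_content = Path("paper.md").read_text(encoding="utf-8")
```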
Using with vllm
```python
from ai_researcher.deep_reviewer import DeepReviewer

# Initialize DeepReviewer
reviewer = DeepReviewer(
    model_size="14B",  # Use "7B" for the smaller model
    device="cuda",
    tensor_parallel_size=1,  # Increase for multi-GPU setups
    gpu_memory_utilization=0.95
)

# Load paper content
paper_content = "Your paper content here"  # Replace with actual paper content

# Generate reviews in different modes
# Fast Mode for a quick overview
fast_review = reviewer.evaluate([paper_content], mode="Fast Mode")

# Standard Mode with multiple simulated reviewers
standard_review = reviewer.evaluate([paper_content], mode="Standard Mode", reviewer_num=3)

# Parse the review results
for result in standard_review:
    print("--- Meta-Review ---")
    print(f"Summary: {result['meta_review'].get('summary', 'N/A')}")
    print(f"Rating: {result['meta_review'].get('rating', 'N/A')}")
    print(f"Decision: {result['decision']}")
```
✨ Features
- Multi-Mode Reviews: DeepReviewer offers three review modes: Fast Mode for quick reviews, Standard Mode for simulated multiple-reviewer perspectives, and Best Mode for comprehensive reviews.
- Near-Human Evaluation: It can automatically evaluate paper quality, providing comprehensive analysis, strengths, weaknesses, and suggestions.
- Diverse Purposes: Suitable for various research-related uses such as paper improvement, writing practice, and serving as a reward model for reinforcement learning systems.
💻 Usage Examples
Basic Usage
The basic workflow is identical to the Quick Start example above: initialize DeepReviewer, pass your paper text to evaluate(), and parse the returned meta-review. The full code is shown in the "Using with vllm" section.
Advanced Usage
In advanced scenarios, you can adjust the parameters to your needs, such as changing the number of reviewers in Standard Mode or using a different device configuration. For example, increase tensor_parallel_size for multi-GPU acceleration:

```python
# Multi-GPU setup: shard the model across 4 GPUs.
reviewer = DeepReviewer(
    model_size="14B",
    device="cuda",
    tensor_parallel_size=4,  # Increase for multi-GPU setups
    gpu_memory_utilization=0.95
)
```
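Best Mode, the most comprehensive setting, follows the same interface; a hedged sketch, assuming evaluate() accepts it like the other modes:

```python
# Hedged sketch: Best Mode review using the same interface as above.
best_review = reviewer.evaluate([paper_content], mode="Best Mode")
for result in best_review:
    print(f"Decision: {result['decision']}")
```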
📚 Documentation
Model Info
Homepage & Demo: http://ai-researcher.net
DeepReviewer is a set of generative large language models, available in 7B and 14B sizes, that have undergone additional supervised training for academic paper review. Both are text-only models built on the Phi-4 pre-trained language model. They use a multi-stage reasoning framework to generate in-depth, structured reviews of academic papers.
DeepReviewer offers three review modes to balance depth against efficiency (see the sketch after this list):
- Fast Mode: Quick reviews with summary, scores, and key points
- Standard Mode: Simulated multiple reviewer perspectives with verification
- Best Mode: Most comprehensive reviews with detailed analysis across all dimensions
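A minimal sketch of choosing among these modes; the helper function and its decision criteria are illustrative, not part of the documented API:

```python
# Illustrative sketch: pick a review mode by how much depth you need.
# The mode names are as documented above; the helper itself is hypothetical.
def pick_mode(want_multiple_reviewers: bool, want_full_depth: bool) -> str:
    if want_full_depth:
        return "Best Mode"
    if want_multiple_reviewers:
        return "Standard Mode"
    return "Fast Mode"
```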
Under the license, any model created, trained, distributed, or replicated from these models may not be used for any formal review work.
Intended Uses
Expected Use Cases
DeepReviewer models are suitable for research purposes in multiple languages, including but not limited to the following objectives:
- Paper Improvement: Assist in enhancing the quality and clarity of academic papers.
- Writing Practice: Provide a platform for users to practice and refine their academic writing skills.
- Self-assessment Tool: Enable researchers to evaluate their own work before submission.
- Learning Aid: Support students and researchers in understanding the peer review process.
- Feedback Simulation: Offer simulated peer review feedback to prepare authors for actual reviews.
- Revision Guide: Provide structured guidance for revising academic papers.
- Concept Validator: Help researchers validate their ideas and hypotheses.
- Reward Model: Serve as a component in machine learning systems for academic writing improvement (see the sketch after this list).
- Educational Resource: Act as a teaching tool for academic writing and peer review processes.
- Research Assistant: Aid in literature reviews and research methodology refinement.
- Supplementary Tool: Complement human review in informal, non - official settings.
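For the reward-model use case, a hedged sketch of turning a Fast Mode review into a scalar reward; the 1-10 rating scale and the normalization are assumptions, not part of the documented API:

```python
# Hedged sketch: use the meta-review rating as an RL reward signal.
# Assumes ratings on a 1-10 scale (as at ICLR); verify against real output.
def review_reward(reviewer, paper_text: str) -> float:
    result = reviewer.evaluate([paper_text], mode="Fast Mode")[0]
    rating = result["meta_review"].get("rating")
    return float(rating) / 10.0 if rating is not None else 0.0
```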
Out of Scope
The following uses are not permitted:
- Official Reviews: DeepReviewer explicitly prohibits use for official peer reviews in any capacity.
- Legal or Ethical Decisions: Not designed to make judgments on research ethics or legal compliance.
- Factual Verification: While it can offer feedback, it should not be the sole source for fact-checking or verifying scientific claims.
- Plagiarism Detection: Not equipped to serve as a plagiarism detection tool.
- Publication Decisions: Cannot be used to make final decisions on whether a paper should be published.
- Expert Consultation: Not a replacement for expert consultation in specialized fields.
If you are unsure whether your use meets the license requirements, please contact us for further inquiry.
Ethical Considerations
- Academic Integrity: Although DeepReviewer is designed to help researchers improve paper quality, it must not replace the real peer review process. We strongly recommend using this tool only as an aid for self-improvement and learning.
- Fairness: The model may carry biases, especially when evaluating interdisciplinary or emerging-field research. Users should keep this in mind and treat the model's feedback with caution.
- Responsible Use: We call on users to use this model responsibly; per our agreement, it must not be used to produce false review opinions or to manipulate the academic evaluation process.
- Transparency: When content generated by this model is used in any public setting, its DeepReviewer origin should be clearly stated to maintain transparency and honesty in academia.
Limitations
- Knowledge Cutoff Date: The model's knowledge is cut off at October 2024, so it may lack understanding of technologies, methods, or research trends that emerged after that date. This may lead it to undervalue some highly innovative research.
- Pure Text Limitations: As a text-only model, DeepReviewer cannot directly parse or evaluate images, charts, or complex formulas, which may affect its assessment of papers that rely heavily on visual elements.
- Depth in Specialized Fields: Although the model has been trained across various domains, its evaluation may not match human experts in very specialized or cutting-edge sub-fields.
- Lack of Real-time Information: The model cannot access real-time academic databases or the latest published papers, which may bias its assessment of research novelty.
- Disciplinary Bias: Due to limitations in the training data, the model may favor certain disciplines or research methods. Users should weigh its feedback against other opinions.
- Language and Cultural Limitations: The model may perform poorly on papers with cultural nuances or field-specific terminology outside its training distribution.
📄 License
The code in this repository is open-sourced under the Apache-2.0 license. The model weights are released under the DeepReviewer License, which adds terms to ensure the model is not misused.
📊 Model Performance
ICLR 2024
| Metric | DeepReviewer-7B | DeepReviewer-14B | CycleReviewer-70B | GPT-o1 | DeepSeek-R1 | Gemini-2.0-Flash-Thinking |
|---|---|---|---|---|---|---|
| Rating MSE ↓ | 1.8262 | 1.3137 | 2.4870 | 4.3414 | 4.1648 | 4.9297 |
| Rating MAE ↓ | 1.0870 | 0.9102 | 1.2514 | 1.7294 | 1.6526 | 1.8711 |
| Decision Accuracy ↑ | 0.5975 | 0.6406 | 0.6304 | 0.4500 | 0.5248 | 0.5743 |
| Decision F1 ↑ | 0.5428 | 0.6307 | 0.5696 | 0.4424 | 0.4988 | 0.5197 |
| Rating Spearman ↑ | 0.2126 | 0.3559 | 0.3356 | 0.2621 | 0.3256 | 0.0745 |
| Pairwise Rating Acc ↑ | 0.5749 | 0.6242 | 0.6160 | 0.5881 | 0.6206 | 0.5343 |
ICLR 2025
| Metric | DeepReviewer-7B | DeepReviewer-14B | CycleReviewer-70B | GPT-o1 | DeepSeek-R1 | Gemini-2.0-Flash-Thinking |
|---|---|---|---|---|---|---|
| Rating MSE ↓ | 1.6730 | 1.3410 | 2.4294 | 4.3072 | 4.7719 | 3.9232 |
| Rating MAE ↓ | 1.0379 | 0.9243 | 1.2128 | 1.7917 | 1.8099 | 1.6470 |
| Decision Accuracy ↑ | 0.6660 | 0.6878 | 0.6782 | 0.4167 | 0.4259 | 0.6139 |
| Decision F1 ↑ | 0.5564 | 0.6227 | 0.5737 | 0.4157 | 0.4161 | 0.4808 |
| Rating Spearman ↑ | 0.2973 | 0.4047 | 0.2674 | 0.2991 | 0.3237 | 0.2565 |
| Pairwise Rating Acc ↑ | 0.6038 | 0.6402 | 0.5928 | 0.6318 | 0.6289 | 0.6040 |
DeepReviewer significantly outperforms the other models on most metrics despite its smaller parameter count. The 14B model achieves particularly strong results on Decision Accuracy and Rating MSE, demonstrating its reliability in assessing overall paper quality.
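For reference, a sketch of how the rating metrics above are standardly computed; the toy numbers are illustrative, and the actual evaluation code is not shown in this document:

```python
# Illustrative sketch: standard definitions of the rating metrics.
import numpy as np
from scipy.stats import spearmanr

pred = np.array([5.0, 6.5, 3.0])  # model-predicted ratings (toy data)
gold = np.array([6.0, 6.0, 4.0])  # ground-truth ratings (toy data)

mse = np.mean((pred - gold) ** 2)   # Rating MSE (lower is better)
mae = np.mean(np.abs(pred - gold))  # Rating MAE (lower is better)
rho, _ = spearmanr(pred, gold)      # Rating Spearman (higher is better)
print(f"MSE={mse:.4f}  MAE={mae:.4f}  Spearman={rho:.4f}")
```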
📖 Citation
```bibtex
@inproceedings{weng2025cycleresearcher,
  title={CycleResearcher: Improving Automated Research via Automated Review},
  author={Yixuan Weng and Minjun Zhu and Guangsheng Bao and Hongbo Zhang and Jindong Wang and Yue Zhang and Linyi Yang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=bjcsVLoHYs}
}

@misc{zhu2025deepreviewimprovingllmbasedpaper,
  title={DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process},
  author={Minjun Zhu and Yixuan Weng and Linyi Yang and Yue Zhang},
  year={2025},
  eprint={2503.08569},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.08569}
}
```
📮 Contact
- Submit an Issue
- Email: zhuminjun@westlake.edu.cn

