Model Card for scholawrite-bert-classifier
This model predicts the next writing intention in LaTeX paper drafts. It is fine-tuned on the ScholaWrite dataset and is intended for English-language academic writing.
Quick Start
```python
import os

import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import BertForSequenceClassification, BertTokenizer

# Authenticate with the Hugging Face Hub using a token stored in a .env file
load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

TOTAL_CLASSES = 15

# Add the special tokens that mark the input and the "before" text
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["<INPUT>", "</INPUT>", "<BT>", "</BT>", "<PWA>", "</PWA>"])

model = BertForSequenceClassification.from_pretrained(
    "minnesotanlp/scholawrite-bert-classifier", num_labels=TOTAL_CLASSES
)

# Wrap the "before" text in the special tokens the model expects
before_text = "sample before text"
text = "<INPUT>" + "<BT>" + before_text + "</BT>" + "</INPUT>"

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(inputs["input_ids"]).logits.argmax(dim=1)
print("class:", pred.item())
```
Features
- Fine-Tuned Model: Based on `bert-base-uncased`, fine-tuned on the ScholaWrite dataset.
- Specific Task: Predicts the next writing intention in LaTeX paper drafts.
- Closed-Environment Inference: Designed for a specific task in a closed environment.
Installation
Installation mainly involves setting up the necessary Python libraries: you need `torch`, `transformers`, `python-dotenv`, and `huggingface_hub`. You can install them with `pip`:

```bash
pip install torch transformers python-dotenv huggingface_hub
```
Usage Examples
Basic Usage
The basic usage of this model is to predict the next writing intention in a LaTeX paper draft. The code in the "Quick Start" section demonstrates this basic usage. It takes 'before' text wrapped by special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
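Building on the Quick Start snippet (it reuses `model` and `inputs` from there), the sketch below shows one way to turn the argmax index into a readable label with a softmax confidence. Note that `model.config.id2label` falls back to generic names such as `LABEL_0` unless the checkpoint stores the ScholaWrite intention names, so the printed label may still need to be mapped against the dataset's own label list.

```python
import torch
import torch.nn.functional as F

# Reuses `model` and `inputs` from the Quick Start snippet above.
with torch.no_grad():
    logits = model(inputs["input_ids"]).logits

probs = F.softmax(logits, dim=-1).squeeze(0)
pred_id = int(probs.argmax())

# id2label defaults to generic names (LABEL_0 ... LABEL_14) unless the
# checkpoint was saved with the ScholaWrite intention names attached.
print("predicted class:", model.config.id2label[pred_id])
print("confidence:", round(probs[pred_id].item(), 3))
```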
Documentation
Model Details
Model Description
This model, referred to as BERT-SW-CLF in the paper, is fine-tuned on the `train` split of the ScholaWrite dataset, based on the `bert-base-uncased` model from Hugging Face. Its sole purpose is to predict the next writing intention given scholarly writing in LaTeX.
- Developed by: Linghe Wang, Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
- Language: English
- Finetuned from model: [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
Model Sources
- Paper: [ScholaWrite: A Dataset of End-to-End Scholarly Writing Process](https://arxiv.org/abs/2502.02904)
Uses
Direct Use
The model is intended for next writing intention prediction in LaTeX paper drafts. It takes 'before' text wrapped by special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
Out-of-Scope Use
The model is fine-tuned only for next-writing-intention prediction and is run in a closed environment. It is suitable for academic use but not for production, general public use, or consumer-oriented services. Applying it to tasks other than next-intention prediction in LaTeX paper drafts may not work well.
Bias and Limitations
The bias and limitations of this model mainly come from the ScholaWrite dataset it is fine-tuned on.
- Domain-Specific: The ScholaWrite dataset is currently limited to the computer science domain, which may restrict the model's generalizability to other scientific disciplines.
- Participant Limitation: All participants were early-career researchers at an R1 university in the United States, so the model may not capture the writing behaviors and cognitive processes of expert writers.
- Language Limitation: The dataset covers only English-language writing, which restricts the model's ability to predict the next writing intention in multilingual or non-English contexts.
Fine-tuning Details
Fine-tuning Data
This model is fine-tuned on the `train` split of the minnesotanlp/scholawrite dataset, which consists of keystroke logs from an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. No additional data pre-processing or filtering was performed.
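For a quick look at the data, the split can be loaded with the `datasets` library. This is only an inspection sketch; the column names referenced in this card (`before_text`, `intention`) should be verified against the dataset's actual schema.

```python
from datasets import load_dataset

# Load the training split used for fine-tuning.
train_split = load_dataset("minnesotanlp/scholawrite", split="train")

print(train_split)       # row count and column names
print(train_split[0])    # one annotated keystroke record
```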
Fine-tuning Procedure
The model was fine-tuned by passing the `before_text` portion of a prompt as the input and using the `intention` annotation as the ground truth. The model outputs an integer class index corresponding to one of the 15 intention labels.
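The sketch below illustrates how such a training example could be assembled, reusing `tokenizer` (with the added special tokens) from the Quick Start and `train_split` from the snippet above. It is a minimal approximation, not the authors' exact preprocessing; the special-token wrapping and the string-to-index label mapping are assumptions.

```python
# Minimal preprocessing sketch; the authors' exact pipeline may differ.
label_names = sorted(set(train_split["intention"]))   # assumed string labels
label2id = {name: i for i, name in enumerate(label_names)}

def encode(example):
    # Wrap the "before" text in the special tokens used at inference time.
    text = "<INPUT>" + "<BT>" + example["before_text"] + "</BT>" + "</INPUT>"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    enc["labels"] = label2id[example["intention"]]
    return enc

encoded_train = train_split.map(encode, remove_columns=train_split.column_names)
```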
Fine-tuning Hyperparameters

| Property | Details |
|----------|---------|
| Fine-tuning regime | fp32 |
| Learning rate | 2e-5 |
| Per-device train batch size | 2 |
| Per-device eval batch size | 8 |
| Num train epochs | 10 |
| Weight decay | 0.01 |
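As a rough guide, these hyperparameters map onto `transformers.TrainingArguments` as shown below. This is a sketch rather than the authors' training script; `output_dir` is a placeholder, and `model` and `encoded_train` come from the earlier snippets.

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters mirror the table above; output_dir is a placeholder and
# `model` / `encoded_train` come from the earlier sketches.
training_args = TrainingArguments(
    output_dir="scholawrite-bert-classifier-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=False,  # the card reports fp32 fine-tuning
)

trainer = Trainer(model=model, args=training_args, train_dataset=encoded_train)
# trainer.train()  # uncomment to launch fine-tuning
```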
Machine Specs

| Property | Details |
|----------|---------|
| Hardware | 2 x Nvidia RTX A6000 |
| Hours used | 3.5 hrs |
| Compute region | Minnesota |
Testing Procedure
Testing Data
minnesotanlp/scholawrite
Metrics
Both the training and testing splits are class-imbalanced, so weighted F1 is used to measure performance.
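For reference, a weighted F1 score can be computed with scikit-learn as in the generic sketch below; the label lists are placeholders, not the reported evaluation.

```python
from sklearn.metrics import f1_score

# Generic evaluation sketch, not the authors' script: y_true holds gold
# intention indices and y_pred the classifier's argmax predictions.
y_true = [0, 3, 3, 7, 1]   # placeholder gold labels
y_pred = [0, 3, 2, 7, 1]   # placeholder predictions

print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```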
Results

| | BERT | RoBERTa | Llama-8B-Instruct | GPT-4o |
|---|---|---|---|---|
| Base | 0.04 | 0.02 | 0.12 | 0.08 |
| + SW | 0.64 | 0.64 | 0.13 | - |
Summary
The table above reports weighted F1 scores for predicting writing intentions across the baseline and fine-tuned models. All models fine-tuned on ScholaWrite improve over their baselines. BERT and RoBERTa achieve the largest gains, while Llama-8B-Instruct shows a modest improvement after fine-tuning. These results demonstrate the effectiveness of the ScholaWrite dataset for aligning language models with writers' intentions.
Technical Details
- Model Architecture: Based on the BERT architecture (`bert-base-uncased`).
- Fine-Tuning Process: Passes `before_text` as the input and uses `intention` as the ground-truth label.
- Testing Metrics: Weighted F1, chosen because of the class imbalance in the dataset.
License
This model is released under the `apache-2.0` license.
BibTeX

```bibtex
@misc{wang2025scholawritedatasetendtoendscholarly,
  title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
  author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
  year={2025},
  eprint={2502.02904},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.02904},
}
```