Model Card for scholawrite-bert-classifier
This model predicts the next writing intention in LaTeX paper drafts. It is fine-tuned on the ScholaWrite dataset and is intended for English-language academic writing.
Quick Start
```python
import os

import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import BertForSequenceClassification, BertTokenizer

# Authenticate with the Hugging Face Hub using a token stored in a .env file
load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

TOTAL_CLASSES = 15

# Add the special tokens that mark the input and the "before" text
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["<INPUT>", "</INPUT>", "<BT>", "</BT>", "<PWA>", "</PWA>"])

model = BertForSequenceClassification.from_pretrained(
    "minnesotanlp/scholawrite-bert-classifier", num_labels=TOTAL_CLASSES
)

# Wrap the "before" text in the special tokens the model expects
before_text = "sample before text"
text = "<INPUT>" + "<BT>" + before_text + "</BT>" + "</INPUT>"

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(inputs["input_ids"]).logits.argmax(dim=1)
print("class:", pred.item())
```
Features
- Fine-Tuned Model: Based on `bert-base-uncased`, fine-tuned on the ScholaWrite dataset.
- Specific Task: Predicts the next writing intention in LaTeX paper drafts.
- Closed-Environment Inference: Designed for a specific task in a closed environment.
Installation
Installation mainly involves setting up the necessary Python libraries: you need `torch`, `transformers`, `python-dotenv`, and `huggingface_hub`. You can install them with `pip`:

```bash
pip install torch transformers python-dotenv huggingface_hub
```
Usage Examples
Basic Usage
The basic usage of this model is to predict the next writing intention in a LaTeX paper draft. The code in the "Quick Start" section demonstrates this basic usage. It takes 'before' text wrapped by special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
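Building on the Quick Start snippet (it reuses `model` and `inputs` from there), the sketch below shows one way to turn the argmax index into a readable label with a softmax confidence. Note that `model.config.id2label` falls back to generic names such as `LABEL_0` unless the checkpoint stores the ScholaWrite intention names, so the printed label may still need to be mapped against the dataset's own label list.

```python
import torch
import torch.nn.functional as F

# Reuses `model` and `inputs` from the Quick Start snippet above.
with torch.no_grad():
    logits = model(inputs["input_ids"]).logits

probs = F.softmax(logits, dim=-1).squeeze(0)
pred_id = int(probs.argmax())

# id2label defaults to generic names (LABEL_0 ... LABEL_14) unless the
# checkpoint was saved with the ScholaWrite intention names attached.
print("predicted class:", model.config.id2label[pred_id])
print("confidence:", round(probs[pred_id].item(), 3))
```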
Documentation
Model Details
Model Description
This model, referred to as BERT-SW-CLF in the paper, is fine-tuned on the `train` split of the ScholaWrite dataset, based on the `bert-base-uncased` model from Hugging Face. Its sole purpose is to predict the next writing intention given scholarly writing in LaTeX.
- Developed by: Linghe Wang, Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
- Language: English
- Finetuned from model: [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
Model Sources
- Paper: [ScholaWrite: A Dataset of End-to-End Scholarly Writing Process](https://arxiv.org/abs/2502.02904)
Uses
Direct Use
The model is intended for next writing intention prediction in LaTeX paper drafts. It takes 'before' text wrapped by special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
Out-of-Scope Use
The model is fine-tuned only for next-writing-intention prediction and is run in a closed environment. It is suitable for academic use but not for production, general public use, or consumer-oriented services. Applying it to tasks other than next-intention prediction in LaTeX paper drafts may not work well.
Bias and Limitations
The bias and limitations of this model mainly come from the ScholaWrite dataset it is fine-tuned on.
- Domain-Specific: The ScholaWrite dataset is currently limited to the computer science domain, which may restrict the model's generalizability to other scientific disciplines.
- Participant Limitation: All participants were early-career researchers at an R1 university in the United States, so the model may not capture the writing behaviors and cognitive processes of expert writers.
- Language Limitation: The dataset covers only English-language writing, which restricts the model's ability to predict the next writing intention in multilingual or non-English contexts.
Fine-tuning Details
Fine-tuning Data
This model is fine-tuned on the `train` split of the minnesotanlp/scholawrite dataset, which consists of keystroke logs from an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. No additional data pre-processing or filtering was performed.
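For a quick look at the data, the split can be loaded with the `datasets` library. This is only an inspection sketch; the column names referenced in this card (`before_text`, `intention`) should be verified against the dataset's actual schema.

```python
from datasets import load_dataset

# Load the training split used for fine-tuning.
train_split = load_dataset("minnesotanlp/scholawrite", split="train")

print(train_split)       # row count and column names
print(train_split[0])    # one annotated keystroke record
```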
Fine-tuning Procedure
The model was fine-tuned by passing the `before_text` portion of a prompt as the input and using the `intention` annotation as the ground truth. The model outputs an integer class index corresponding to one of the 15 intention labels.
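The sketch below illustrates how such a training example could be assembled, reusing `tokenizer` (with the added special tokens) from the Quick Start and `train_split` from the snippet above. It is a minimal approximation, not the authors' exact preprocessing; the special-token wrapping and the string-to-index label mapping are assumptions.

```python
# Minimal preprocessing sketch; the authors' exact pipeline may differ.
label_names = sorted(set(train_split["intention"]))   # assumed string labels
label2id = {name: i for i, name in enumerate(label_names)}

def encode(example):
    # Wrap the "before" text in the special tokens used at inference time.
    text = "<INPUT>" + "<BT>" + example["before_text"] + "</BT>" + "</INPUT>"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    enc["labels"] = label2id[example["intention"]]
    return enc

encoded_train = train_split.map(encode, remove_columns=train_split.column_names)
```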
Fine-tuning Hyperparameters

| Property | Details |
|----------|---------|
| Fine-tuning regime | fp32 |
| Learning rate | 2e-5 |
| Per-device train batch size | 2 |
| Per-device eval batch size | 8 |
| Num train epochs | 10 |
| Weight decay | 0.01 |
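As a rough guide, these hyperparameters map onto `transformers.TrainingArguments` as shown below. This is a sketch rather than the authors' training script; `output_dir` is a placeholder, and `model` and `encoded_train` come from the earlier snippets.

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters mirror the table above; output_dir is a placeholder and
# `model` / `encoded_train` come from the earlier sketches.
training_args = TrainingArguments(
    output_dir="scholawrite-bert-classifier-finetune",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    fp16=False,  # the card reports fp32 fine-tuning
)

trainer = Trainer(model=model, args=training_args, train_dataset=encoded_train)
# trainer.train()  # uncomment to launch fine-tuning
```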
Machine Specs

| Property | Details |
|----------|---------|
| Hardware | 2 x Nvidia RTX A6000 |
| Hours used | 3.5 hrs |
| Compute region | Minnesota |
Testing Procedure
Testing Data
minnesotanlp/scholawrite
Metrics
Both the training and testing splits are class-imbalanced, so weighted F1 is used to measure performance.
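For reference, a weighted F1 score can be computed with scikit-learn as in the generic sketch below; the label lists are placeholders, not the reported evaluation.

```python
from sklearn.metrics import f1_score

# Generic evaluation sketch, not the authors' script: y_true holds gold
# intention indices and y_pred the classifier's argmax predictions.
y_true = [0, 3, 3, 7, 1]   # placeholder gold labels
y_pred = [0, 3, 2, 7, 1]   # placeholder predictions

print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```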
Results

| | BERT | RoBERTa | Llama-8B-Instruct | GPT-4o |
|---|---|---|---|---|
| Base | 0.04 | 0.02 | 0.12 | 0.08 |
| + SW | 0.64 | 0.64 | 0.13 | - |
Summary
The table above reports weighted F1 scores for predicting writing intentions across the baseline and fine-tuned models. All models fine-tuned on ScholaWrite improve over their baselines. BERT and RoBERTa achieve the largest gains, while Llama-8B-Instruct shows a modest improvement after fine-tuning. These results demonstrate the effectiveness of the ScholaWrite dataset for aligning language models with writers' intentions.
Technical Details
- Model Architecture: Based on the BERT architecture (`bert-base-uncased`).
- Fine-Tuning Process: Passes `before_text` as the input and uses `intention` as the ground-truth label.
- Testing Metrics: Weighted F1, chosen because of the class imbalance in the dataset.
License
This model is released under the `apache-2.0` license.
BibTeX

```bibtex
@misc{wang2025scholawritedatasetendtoendscholarly,
  title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
  author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
  year={2025},
  eprint={2502.02904},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.02904},
}
```