# Khmer Financial Sentiment Analysis with XLM-RoBERTa
This project offers a fine-tuned [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model tailored for sentiment analysis of Khmer financial texts. It was trained on roughly 4,000 financial text samples, with a further 400 held out for testing, and aims to accurately classify sentiment in the Khmer-language financial domain.
## Quick Start
Financial texts like reports, news, and earnings statements are rich in information for market analysis. However, Khmer-language financial texts have been under-explored in NLP research. This project adapts the XLM-RoBERTa-base model for Khmer financial sentiment analysis. The model classifies financial text sentiment into two categories: Positive (indicating growth, profitability, or a positive outlook) and Negative (indicating loss, risk, or financial downturns).
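For a quick smoke test, the checkpoint can also be loaded through the `transformers` pipeline API, as in the minimal sketch below. Note that the pipeline reports whatever labels are stored in the model config, which may be generic names such as `LABEL_0`/`LABEL_1` rather than `Negative`/`Positive` (the mapping used in the usage example further down is 0 → Negative, 1 → Positive).

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a text-classification pipeline
classifier = pipeline(
    "text-classification",
    model="songhieng/khmer-sentiment-xlm-roberta-base",
)

# Replace "..." with a Khmer financial sentence
print(classifier("..."))  # e.g. [{'label': ..., 'score': ...}]
```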
## Features
- Domain-Specific Adaptation: Fine-tuned for Khmer financial sentiment analysis.
- Binary Classification: Clearly distinguishes between positive and negative financial sentiments.
- Good Performance: Achieves approximately 96% accuracy on the validation set.
## Installation
No special setup is required beyond the Hugging Face `transformers` library and PyTorch, which the usage examples below depend on; a typical installation is:
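```bash
pip install transformers torch
```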
## Usage Examples

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "songhieng/khmer-sentiment-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Replace with the Khmer financial sentence you want to classify
text = "..."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

predicted_class = outputs.logits.argmax(dim=1).item()
labels_mapping = {0: "Negative", 1: "Positive"}
print(f"Predicted Sentiment: {labels_mapping[predicted_class]}")
```
## Documentation

### Model Details
| Property | Details |
|----------|---------|
| Model Type | [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) |
| Task | Sentiment Analysis (Binary Classification: Positive / Negative) |
| Domain | Financial Data (Khmer Language) |
| Dataset Size | ~4,000 training samples, 400 test samples |
| Architecture | Transformer-based sequence classification model |
### Training Data
The model was fine-tuned on a dataset of Khmer-language financial texts, including bank reports, financial news articles, economic forecasts, and investment analysis. The dataset has 4,000 labeled examples for training and 400 samples for testing.
### Training Details
The model was fine-tuned over 3 epochs, using XLM-RoBERTa-base as the pretrained starting point.
| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.163500 | 0.511470 | XX% |
| 2 | 0.517700 | 0.581499 | XX% |
| 3 | 0.312900 | 0.526096 | XX% |
Training Configuration (see the sketch below):
- Learning Rate: 2e-5
- Batch Size: 8
- Optimizer: AdamW
- Evaluation Strategy: Per epoch
- Loss Function: CrossEntropyLoss
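As a rough, non-authoritative sketch, this configuration could be reproduced with the Hugging Face `Trainer`. The training dataset is not published, so `train_dataset`/`eval_dataset` below are placeholders; AdamW and cross-entropy loss are the `Trainer` defaults for sequence classification, matching the settings above.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

base_model = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

training_args = TrainingArguments(
    output_dir="khmer-sentiment-xlm-roberta-base",  # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)

# train_dataset / eval_dataset stand in for the (unreleased) tokenized
# Khmer financial sentiment dataset.
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     tokenizer=tokenizer,
# )
# trainer.train()
```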
### Results
- Accuracy: ~96% on the validation set.
- Strong Performance: The model effectively classifies Khmer financial sentiment.
- Domain-Specific Optimization: The fine-tuning process gives the model a better grasp of financial terminology in Khmer.
## License
The model is released under the MIT license.