EuroBERT-210m-Quality-CL Open-source Model - Automatically Evaluate the Quality of Natural and Programming Text Data

EuroBERT-210m-Quality-CL

Developed by TempestTeam

A model for automatically assessing the quality of text data in both natural and programming languages, offering both unified and dual-model solutions.

Text Classification

Transformers

Supports Multiple Languages
Open Source License: Apache-2.0
#Multilingual Quality Assessment #Code Quality Detection #Harmful Content Identification

Downloads 19

Release Time: 3/18/2025

Model Overview

This model automatically evaluates text data quality through a scoring system, supporting natural languages (French, English, Spanish) and programming languages (Python, Java, JavaScript, C/C++). It provides both unified and independent model solutions to meet different scenario requirements.

Model Features

Multilingual Support

Supports quality assessment for both natural languages (French, English, Spanish) and programming languages (Python, Java, JavaScript, C/C++)

Dual Evaluation Solutions

Provides both unified and independent model solutions, allowing selection of the most suitable evaluation method based on needs

Harmful Content Identification

Strong harmful content identification: with the separate models, the F1 score for the harmful class is 0.93 on natural language and 0.72 on code language

Clear Classification System

Offers a four-level classification (harmful, low, medium, and high quality) that is easy to understand and use

Model Capabilities

Natural language text quality assessment

Programming language code quality assessment

Harmful content detection

Multilingual support

Use Cases

NLP Preprocessing

Text Corpus Validation

Automatically validates text corpus quality before integration into NLP systems

Improves input data quality for NLP systems

Community Content Management

Technical Community Content Evaluation

Assesses content quality in forums, Stack Overflow, GitHub, and other technical communities

Helps filter high-quality content

Code Generation

Code Quality Assessment

Evaluates the quality of code generated by code generation systems

Improves the reliability of code generation systems

🚀 Automatic Evaluation Models for Textual Data Quality (NL & CL)

Automatically assess the quality of textual data using a clear and intuitive scale, suitable for both natural language (NL) and code language (CL).

This project offers two distinct approaches for evaluating the quality of textual data:

A unified model that jointly handles both NL and CL: EuroBERT-210m-Quality
A dual-model approach that treats NL and CL separately:
- EuroBERT-210m-Quality-NL for natural language
- EuroBERT-210m-Quality-CL for code language

🚀 Quick Start

This project provides models that automatically evaluate the quality of textual data. Depending on your needs, you can choose between a unified model and a dual-model approach.
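A minimal sketch of scoring code snippets with the Hugging Face `transformers` text-classification pipeline. The repository ID `TempestTeam/EuroBERT-210m-Quality-CL` is inferred from the developer and model names on this page, and `trust_remote_code=True` is an assumption about how the checkpoint loads; neither is confirmed here. The exact label strings the model returns are also not documented on this page, so inspect the output before relying on them.

```python
from typing import Callable, Dict, List

# Assumed Hub repository ID, inferred from the developer and model names on
# this page; verify it on the Hugging Face Hub before use.
MODEL_ID = "TempestTeam/EuroBERT-210m-Quality-CL"


def load_classifier(model_id: str = MODEL_ID) -> Callable[[List[str]], List[Dict]]:
    """Build a text-classification pipeline for the given model."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    # trust_remote_code=True is an assumption: the checkpoint may rely on
    # custom modeling code, depending on the installed transformers version.
    return pipeline("text-classification", model=model_id, trust_remote_code=True)


def score_snippets(classifier, snippets: List[str]) -> List[Dict]:
    """Pair each snippet with the label and score returned by the classifier."""
    predictions = classifier(snippets)
    return [
        {"text": text, "label": pred["label"], "score": pred["score"]}
        for text, pred in zip(snippets, predictions)
    ]
```

Calling `score_snippets(load_classifier(), ["def add(a, b):\n    return a + b"])` would return one dictionary per snippet. Keeping the classifier as a parameter lets the same helper work with the unified or the NL model.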

✨ Features

Classification Categories

Harmful: Data that is harmful, potentially incorrect, or dangerous.
Low: Low-quality data with significant issues.
Medium: Medium-quality data that is acceptable but could be improved.
High: Data of good to very good quality, ready for use without any concerns.
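Because the four categories form an ordered scale, a corpus can be filtered by a minimum level. The sketch below assumes the classifier emits the category names listed above as label strings; `Quality`, `parse_label`, and `keep_at_least` are hypothetical helper names, not part of the model.

```python
from enum import IntEnum
from typing import Iterable, List, Tuple


class Quality(IntEnum):
    """The four categories above, ordered from worst to best."""
    HARMFUL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def parse_label(label: str) -> Quality:
    """Map a label string such as "High" or "harmful" to a Quality level.

    Assumes the label text matches one of the four category names.
    """
    return Quality[label.strip().upper()]


def keep_at_least(rows: Iterable[Tuple[str, str]], minimum: Quality) -> List[str]:
    """Return the texts whose predicted label is at or above `minimum`."""
    return [text for text, label in rows if parse_label(label) >= minimum]
```

For example, `keep_at_least(rows, Quality.MEDIUM)` keeps only texts labelled Medium or High and drops Harmful and Low ones.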

Supported Languages

Natural Language: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
Code Language: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️

📊 Performance

F1-score: Unified Model (NL + CL)

Category	Global (NL + CL)	NL	CL
Harmful	0.86	0.93	0.79
Low	0.62	0.81	0.40
Medium	0.63	0.78	0.50
High	0.77	0.81	0.74
Accuracy	0.73	0.83	0.62

F1-score: Separate Models

Category	Global (NL + CL)	NL	CL
Harmful	0.83	0.93	0.72
Low	0.64	0.76	0.53
Medium	0.63	0.76	0.52
High	0.79	0.81	0.76
Accuracy	0.73	0.82	0.63
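To compare the two approaches on code language, the CL columns of both tables can be put side by side. The sketch below copies those values and computes the per-category difference plus an unweighted (macro) mean; the macro mean is a derived figure for illustration, not a number reported by the authors.

```python
# Per-category F1 on code language (CL), copied from the two tables above.
UNIFIED_CL = {"Harmful": 0.79, "Low": 0.40, "Medium": 0.50, "High": 0.74}
SEPARATE_CL = {"Harmful": 0.72, "Low": 0.53, "Medium": 0.52, "High": 0.76}


def f1_delta(separate, unified):
    """Separate-model F1 minus unified-model F1 per category, to 2 decimals."""
    return {cat: round(separate[cat] - unified[cat], 2) for cat in unified}


def macro_f1(scores):
    """Unweighted mean of the per-category F1 scores, to 4 decimals."""
    return round(sum(scores.values()) / len(scores), 4)


if __name__ == "__main__":
    print(f1_delta(SEPARATE_CL, UNIFIED_CL))
    print(macro_f1(UNIFIED_CL), macro_f1(SEPARATE_CL))
```

On CL, the separate model gains 0.13 F1 on Low and loses 0.07 on Harmful, so the better choice depends on which category matters most for the pipeline.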

Key Performance Metrics

Unified Model (NL + CL):
- Overall accuracy: ~73%
- High reliability on harmful data (F1-score: 0.86)
Separate Models:
- Natural Language (NL): ~82% accuracy
  - Excellent performance on harmful data (F1-score: 0.93)
- Code Language (CL): ~63% accuracy
  - Good detection of harmful data (F1-score: 0.72)

📦 Training Dataset

A public dataset is available: TempestTeam/dataset-quality

💡 Common Use Cases

Automatically validate text corpora before integrating them into NLP or code generation pipelines.
Assess the quality of community contributions (forums, Stack Overflow, GitHub).
Perform automated preprocessing to improve the performance of NLP or code generation systems.

💡 Usage Tip

For specialized contexts, use the separate NL and CL models for optimal results.
The unified model is suitable for quick assessments when the data context is unknown or mixed.
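The tip above amounts to a simple routing rule, sketched below. The model names are those listed in this document; the `kind` tag and the `choose_model` function are hypothetical and exist only for illustration.

```python
from typing import Optional

# Model names as listed in this document.
UNIFIED = "EuroBERT-210m-Quality"
NL_MODEL = "EuroBERT-210m-Quality-NL"
CL_MODEL = "EuroBERT-210m-Quality-CL"


def choose_model(kind: Optional[str]) -> str:
    """Pick a model name from a known data kind.

    `kind` is a hypothetical caller-supplied tag: "nl" for natural language,
    "cl" for code, and None (or anything else) when unknown or mixed.
    """
    if kind == "nl":
        return NL_MODEL
    if kind == "cl":
        return CL_MODEL
    # Unknown or mixed context: fall back to the unified model.
    return UNIFIED
```

A pipeline that already knows its input is source code would call `choose_model("cl")`, while a crawler ingesting mixed web pages would pass `None`.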

📄 License

This project is licensed under the Apache 2.0 license.

📖 Citation

Please cite or link back to this model on Hugging Face Hub if used in your projects.
