sgarbi/bert-fda-nutrition-ner
A BERT model designed for Named Entity Recognition (NER) in nutrition labeling, aiming to detect and categorize nutritional components from text data.
Quick Start
This is a BERT model specifically tailored for Named Entity Recognition (NER) in the nutrition labeling domain. Its core function is to detect and classify various nutritional components from textual data, offering a systematic way to understand the information commonly found on nutrition labels and other nutritional materials. This model serves as a benchmark and learning resource for training models with augmented data.
Features
- Comprehensive Data Utilization: Trained on data from multiple sources, including the U.S. Food and Drug Administration (FDA) FoodData Central, Yelp restaurant reviews, and Amazon food reviews, to enhance the model's understanding of diverse nutritional mentions.
- Robust Preprocessing: The training data undergoes a series of preprocessing steps, such as extraction, normalization, entity tagging, tokenization, and formatting, to meet the BERT model's input requirements.
- Noise Introduction: Deliberate noise, including sentence swaps and misspellings, is introduced into the training set to improve the model's ability to handle real-world, imperfect data.
Installation
The original card does not list installation steps. As a standard Hugging Face checkpoint, the model can be loaded with the transformers library (for example, after pip install transformers torch); see the usage examples below.
Usage Examples
Basic Usage
# Mapping from the model's output class indices to BIO entity labels.
label_map = {
    0: 'O',
    1: 'I-VITAMINS',
    2: 'I-STIMULANTS',
    3: 'I-PROXIMATES',
    4: 'I-PROTEIN',
    5: 'I-PROBIOTICS',
    6: 'I-MINERALS',
    7: 'I-LIPIDS',
    8: 'I-FLAVORING',
    9: 'I-ENZYMES',
    10: 'I-EMULSIFIERS',
    11: 'I-DIETARYFIBER',
    12: 'I-COLORANTS',
    13: 'I-CARBOHYDRATES',
    14: 'I-ANTIOXIDANTS',
    15: 'I-ALCOHOLS',
    16: 'I-ADDITIVES',
    17: 'I-ACIDS',
    18: 'B-VITAMINS',
    19: 'B-STIMULANTS',
    20: 'B-PROXIMATES',
    21: 'B-PROTEIN',
    22: 'B-PROBIOTICS',
    23: 'B-MINERALS',
    24: 'B-LIPIDS',
    25: 'B-FLAVORING',
    26: 'B-ENZYMES',
    27: 'B-EMULSIFIERS',
    28: 'B-DIETARYFIBER',
    29: 'B-COLORANTS',
    30: 'B-CARBOHYDRATES',
    31: 'B-ANTIOXIDANTS',
    32: 'B-ALCOHOLS',
    33: 'B-ADDITIVES',
    34: 'B-ACIDS'
}
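A minimal inference sketch is shown below. It loads the checkpoint with the Hugging Face transformers API and decodes predictions with the label_map above; the loading and decoding code is illustrative and not taken from the original card.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and token-classification model from the Hub.
model_name = "sgarbi/bert-fda-nutrition-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Ingredients: brown rice, sea salt, condensed milk, artificial flavor"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class per token and decode it with label_map (defined above).
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, label_map.get(pred, "O"))
```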
Advanced Usage
Input:
'Here are the ingredients to use: Tomato Paste, Sesame Oil, Cheese Cultures, Ground Corn, Vegetable Oil, Brown rice, sea salt, Tomatoes, Milk, Onions, Egg Yolks, Lime Juice Concentrate, Corn Starch, Condensed Milk, Spices, Artificial Flavor, red 5, roasted coffee'
Output:
['CLS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-PROBIOTICS', 'I-PROBIOTICS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-MINERALS', 'I-MINERALS', 'O', 'B-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'O', 'B-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'I-PROXIMATES', 'O', 'B-FLAVORING', 'O', 'B-FLAVORING', 'I-FLAVORING', 'O', 'B-COLORANTS', 'I-COLORANTS', 'O', 'B-STIMULANTS', 'I-STIMULANTS', 'O', 'I-STIMULANTS']
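To turn per-token tags like the output above into entity spans, consecutive B-/I- tags of the same type can be grouped together. The helper below is a generic BIO-grouping sketch, not part of the released code.

```python
def group_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans; a generic sketch, not from the original card."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

print(group_entities(
    ["Ground", "Corn", ",", "sea", "salt"],
    ["B-CARBOHYDRATES", "I-CARBOHYDRATES", "O", "B-MINERALS", "I-MINERALS"],
))
# [('CARBOHYDRATES', 'Ground Corn'), ('MINERALS', 'sea salt')]
```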
Documentation
Training Data Description
The training data for the sgarbi/bert-fda-nutrition-ner model was carefully curated from multiple sources:
Data Source

| Property | Details |
| --- | --- |
| Labeling Source | U.S. Food and Drug Administration (FDA) FoodData Central. The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information. |
| Yelp Restaurant Reviews | The Yelp Review Full dataset from Hugging Face, augmented with Mistral 7B for general tagging, to enrich the model's understanding of restaurant-related nutritional mentions. |
| Amazon Food Reviews | The Amazon Food Reviews dataset from Hugging Face, likewise augmented with Mistral 7B, enhancing the model's ability to recognize and classify a wide range of nutritional information from diverse food product reviews correlated with FDA data. |
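For reference, the Yelp review corpus mentioned above is available on the Hugging Face Hub; the snippet below assumes the standard yelp_review_full dataset identifier, which is not confirmed by the original card.

```python
from datasets import load_dataset

# Yelp Review Full as hosted on the Hugging Face Hub (assumed identifier).
yelp = load_dataset("yelp_review_full", split="train")
print(yelp[0]["text"][:200])  # preview a raw review before any tagging or augmentation
```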
Preprocessing and Augmentation Steps
- Extraction: Key textual data, including nutritional facts and ingredient lists, were extracted from the FDA dataset.
- Normalization: All text was normalized for consistency, including converting to lowercase and removing redundant formatting.
- Entity Tagging: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
- Tokenization and Formatting: The data was tokenized and formatted to meet the BERT model's input requirements.
- Introducing Noise: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set (a short sketch follows this list). This included:
- Sentence Swaps: Random swapping of sentences within the text to promote the model's understanding of varied sentence structures.
- Introducing Misspellings: Deliberately inserting common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios such as inaccurate document scans.
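The exact augmentation code was not released; the functions below are hypothetical sketches of the two noise types described above.

```python
import random

def swap_sentences(text, seed=0):
    """Randomly swap two sentences to vary sentence order (hypothetical augmentation sketch)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) > 1:
        i, j = rng.sample(range(len(sentences)), 2)
        sentences[i], sentences[j] = sentences[j], sentences[i]
    return ". ".join(sentences) + "."

def add_misspelling(text, seed=0):
    """Drop one character from a random word to mimic scan-style spelling errors (hypothetical sketch)."""
    rng = random.Random(seed)
    words = text.split()
    idx = rng.randrange(len(words))
    if len(words[idx]) > 3:
        pos = rng.randrange(len(words[idx]))
        words[idx] = words[idx][:pos] + words[idx][pos + 1:]
    return " ".join(words)

sample = "Contains wheat flour. Includes vitamin C. May contain traces of peanuts."
print(swap_sentences(sample))
print(add_misspelling(sample))
```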
Technical Details
Considerations
- The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
- Labeling tasks were performed with Mistral 7B-Instruct served by mistral.ai (https://docs.mistral.ai/). The model may have hallucinated while labeling data, which could result in imprecise taxonomy classifications.
- The tool only extracts nutritional entities from text; it should not be used for nutrition or health recommendations. Qualified experts should provide any nutrition advice.
- The language and phrasing on certain types of food product labels may introduce biases to the model.
- This model was created for exploring the BERT architecture and NER tasks.
License
This project is licensed under the MIT license.
GitHub
https://github.com/ESgarbi/bert-fda-nutrition-ner