sgarbi/bert-fda-nutrition-ner
A BERT model designed for Named Entity Recognition (NER) in nutrition labeling, aiming to detect and categorize nutritional components from text data.
Quick Start
This is a BERT model specifically tailored for Named Entity Recognition (NER) in the nutrition labeling domain. Its core function is to detect and classify various nutritional components from textual data, offering a systematic way to understand the information commonly found on nutrition labels and other nutritional materials. This model serves as a benchmark and learning resource for training models with augmented data.
Features
- Comprehensive Data Utilization: Trained on data from multiple sources, including the U.S. Food and Drug Administration (FDA) FoodData Central, Yelp restaurant reviews, and Amazon food reviews, to enhance the model's understanding of diverse nutritional mentions.
- Robust Preprocessing: The training data undergoes a series of preprocessing steps, such as extraction, normalization, entity tagging, tokenization, and formatting, to meet the BERT model's input requirements.
- Noise Introduction: Deliberate noise, including sentence swaps and misspellings, is introduced into the training set to improve the model's ability to handle real-world, imperfect data.
Installation
The original card does not list installation steps. As a standard Hugging Face checkpoint, the model can be loaded with the transformers library (for example, after pip install transformers torch); see the usage examples below.
Usage Examples
Basic Usage
# Mapping from the model's output class indices to BIO entity labels.
label_map = {
    0: 'O',
    1: 'I-VITAMINS',
    2: 'I-STIMULANTS',
    3: 'I-PROXIMATES',
    4: 'I-PROTEIN',
    5: 'I-PROBIOTICS',
    6: 'I-MINERALS',
    7: 'I-LIPIDS',
    8: 'I-FLAVORING',
    9: 'I-ENZYMES',
    10: 'I-EMULSIFIERS',
    11: 'I-DIETARYFIBER',
    12: 'I-COLORANTS',
    13: 'I-CARBOHYDRATES',
    14: 'I-ANTIOXIDANTS',
    15: 'I-ALCOHOLS',
    16: 'I-ADDITIVES',
    17: 'I-ACIDS',
    18: 'B-VITAMINS',
    19: 'B-STIMULANTS',
    20: 'B-PROXIMATES',
    21: 'B-PROTEIN',
    22: 'B-PROBIOTICS',
    23: 'B-MINERALS',
    24: 'B-LIPIDS',
    25: 'B-FLAVORING',
    26: 'B-ENZYMES',
    27: 'B-EMULSIFIERS',
    28: 'B-DIETARYFIBER',
    29: 'B-COLORANTS',
    30: 'B-CARBOHYDRATES',
    31: 'B-ANTIOXIDANTS',
    32: 'B-ALCOHOLS',
    33: 'B-ADDITIVES',
    34: 'B-ACIDS'
}
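A minimal inference sketch is shown below. It loads the checkpoint with the Hugging Face transformers API and decodes predictions with the label_map above; the loading and decoding code is illustrative and not taken from the original card.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and token-classification model from the Hub.
model_name = "sgarbi/bert-fda-nutrition-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Ingredients: brown rice, sea salt, condensed milk, artificial flavor"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring class per token and decode it with label_map (defined above).
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, label_map.get(pred, "O"))
```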
Advanced Usage
Input:
'Here are the ingredients to use: Tomato Paste, Sesame Oil, Cheese Cultures, Ground Corn, Vegetable Oil, Brown rice, sea salt, Tomatoes, Milk, Onions, Egg Yolks, Lime Juice Concentrate, Corn Starch, Condensed Milk, Spices, Artificial Flavor, red 5, roasted coffee'
Output:
['CLS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-PROBIOTICS', 'I-PROBIOTICS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-MINERALS', 'I-MINERALS', 'O', 'B-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'O', 'B-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'I-PROXIMATES', 'O', 'B-FLAVORING', 'O', 'B-FLAVORING', 'I-FLAVORING', 'O', 'B-COLORANTS', 'I-COLORANTS', 'O', 'B-STIMULANTS', 'I-STIMULANTS', 'O', 'I-STIMULANTS']
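To turn per-token tags like the output above into entity spans, consecutive B-/I- tags of the same type can be grouped together. The helper below is a generic BIO-grouping sketch, not part of the released code.

```python
def group_entities(tokens, tags):
    """Group BIO tags into (entity_type, text) spans; a generic sketch, not from the original card."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

print(group_entities(
    ["Ground", "Corn", ",", "sea", "salt"],
    ["B-CARBOHYDRATES", "I-CARBOHYDRATES", "O", "B-MINERALS", "I-MINERALS"],
))
# [('CARBOHYDRATES', 'Ground Corn'), ('MINERALS', 'sea salt')]
```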
Documentation
Training Data Description
The training data for the sgarbi/bert-fda-nutrition-ner model was carefully curated from multiple sources:
Data Source

| Property | Details |
| --- | --- |
| Labeling Source | U.S. Food and Drug Administration (FDA) FoodData Central. The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information. |
| Yelp Restaurant Reviews | The Yelp Review Full dataset from Hugging Face, augmented with Mistral 7B for general tagging, to enrich the model's understanding of restaurant-related nutritional mentions. |
| Amazon Food Reviews | The Amazon Food Reviews dataset from Hugging Face, likewise augmented with Mistral 7B, enhancing the model's ability to recognize and classify a wide range of nutritional information from diverse food product reviews correlated with FDA data. |
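For reference, the Yelp review corpus mentioned above is available on the Hugging Face Hub; the snippet below assumes the standard yelp_review_full dataset identifier, which is not confirmed by the original card.

```python
from datasets import load_dataset

# Yelp Review Full as hosted on the Hugging Face Hub (assumed identifier).
yelp = load_dataset("yelp_review_full", split="train")
print(yelp[0]["text"][:200])  # preview a raw review before any tagging or augmentation
```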
Preprocessing and Augmentation Steps
- Extraction: Key textual data, including nutritional facts and ingredient lists, were extracted from the FDA dataset.
- Normalization: All text was normalized for consistency, including converting to lowercase and removing redundant formatting.
- Entity Tagging: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
- Tokenization and Formatting: The data was tokenized and formatted to meet the BERT model's input requirements.
- Introducing Noise: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set (a short sketch follows this list). This included:
- Sentence Swaps: Random swapping of sentences within the text to promote the model's understanding of varied sentence structures.
- Introducing Misspellings: Deliberately inserting common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios such as inaccurate document scans.
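The exact augmentation code was not released; the functions below are hypothetical sketches of the two noise types described above.

```python
import random

def swap_sentences(text, seed=0):
    """Randomly swap two sentences to vary sentence order (hypothetical augmentation sketch)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) > 1:
        i, j = rng.sample(range(len(sentences)), 2)
        sentences[i], sentences[j] = sentences[j], sentences[i]
    return ". ".join(sentences) + "."

def add_misspelling(text, seed=0):
    """Drop one character from a random word to mimic scan-style spelling errors (hypothetical sketch)."""
    rng = random.Random(seed)
    words = text.split()
    idx = rng.randrange(len(words))
    if len(words[idx]) > 3:
        pos = rng.randrange(len(words[idx]))
        words[idx] = words[idx][:pos] + words[idx][pos + 1:]
    return " ".join(words)

sample = "Contains wheat flour. Includes vitamin C. May contain traces of peanuts."
print(swap_sentences(sample))
print(add_misspelling(sample))
```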
Technical Details
Considerations
- The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
- Labeling tasks were performed with Mistral 7B-Instruct served by mistral.ai (https://docs.mistral.ai/). The model may have hallucinated while labeling data, which could result in imprecise taxonomy classifications.
- The tool only extracts nutritional entities from text; it should not be used for nutrition or health recommendations. Qualified experts should provide any nutrition advice.
- The language and phrasing on certain types of food product labels may introduce biases to the model.
- This model was created for exploring the BERT architecture and NER tasks.
License
This project is licensed under the MIT license.
GitHub
https://github.com/ESgarbi/bert-fda-nutrition-ner