BART-TL-ng
This project presents the BART-TL-ng model, aiming to solve the topic labeling task with generative methods, offering an alternative to previous state-of-the-art works.
🚀 Quick Start
The BART-TL-ng model is designed to address the topic labeling task. You can start using it by referring to the usage example below.
✨ Features
- Generative Approach: Solves the topic labeling task using generative methods, unlike previous works that selected labels from a pool.
- Multiple Model Versions: Two models (BART-TL-all and BART-TL-ng) are available from the related paper.
📦 Installation
No specific installation is required. Install the transformers library if needed (e.g. pip install transformers) and use the model directly, as shown in the usage example below.
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-ng"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# A topic is a space-separated series of words (e.g. produced by LDA).
input_text = "site web google search website online internet social content user"
enc = tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Generate a short label with beam search; the repetition penalty
# discourages degenerate, repetitive labels.
outputs = model.generate(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    max_length=15,
    min_length=1,
    do_sample=False,
    num_beams=25,
    length_penalty=1.0,
    repetition_penalty=1.5,
)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # the generated topic label
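The same API can also label several topics in one batch. The following is a minimal sketch, not from the original card; the second topic string is made up for illustration:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-ng"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Two topics; the second is an illustrative, made-up example.
topics = [
    "site web google search website online internet social content user",
    "law court judge legal contract rights case trial attorney evidence",
]
enc = tokenizer(topics, return_tensors="pt", truncation=True, padding=True, max_length=128)
outputs = model.generate(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    max_length=15,
    num_beams=25,
    repetition_penalty=1.5,
)
labels = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(labels)  # one generated label per topic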
📚 Documentation
Model description
This is the BART-TL-ng model from the paper BART-TL: Weakly-Supervised Topic Label Generation. We aim to solve the topic labeling task using generative methods, rather than selecting from a pool of labels as was done in previous state-of-the-art works.
For more details not covered here, you can read the paper or look at the open-source implementation: https://github.com/CristianViorelPopa/BART-TL-topic-label-generation.
There are two models made available from the paper:
- BART-TL-all
- BART-TL-ng (this model)
Intended uses & limitations
How to use
The model takes in a topic, represented as a space-separated series of words. Such topics can be generated using LDA, as was done for gathering the fine-tuning dataset for the model.
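As a minimal sketch of producing such a topic string, assuming gensim's LDA implementation (the toy corpus and parameters below are illustrative, not the authors' setup):

from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized corpus; a real corpus would be much larger.
docs = [
    ["site", "web", "google", "search", "website"],
    ["online", "internet", "social", "content", "user"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Join the top terms of a topic into the space-separated format the model expects.
topic_terms = [dictionary[term_id] for term_id, _ in lda.get_topic_terms(0, topn=10)]
topic_string = " ".join(topic_terms)
print(topic_string)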
Limitations and bias
The model may not generate accurate labels for topics from domains unrelated to the ones it was fine-tuned on, such as gastronomy.
Training data
The model was fine-tuned on 5 different StackExchange corpora (see https://archive.org/download/stackexchange for a full list of such corpora): English, biology, economics, law, and photography. 100 topics were extracted using LDA for each of these corpora, filtered for coherence, and then used for obtaining the final model.
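A hedged sketch of coherence-based topic filtering, assuming gensim's CoherenceModel (the corpus, the u_mass measure, and the threshold are illustrative; the authors' exact filtering criteria are described in the paper):

from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

docs = [
    ["law", "court", "judge", "legal", "contract"],
    ["rights", "case", "trial", "attorney", "evidence"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Score each topic and keep only those above an (illustrative) threshold.
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
per_topic = cm.get_coherence_per_topic()
kept = [i for i, score in enumerate(per_topic) if score > -5.0]
print(kept)  # indices of topics retained for fine-tuning data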
Training procedure
The large Facebook BART model is fine-tuned in a weakly-supervised manner, making use of the unsupervised candidate selection of the NETL method, along with n-grams from the topics. The dataset is a one-to-many mapping from topics to labels. More details on training and parameters can be found in the paper or by following this notebook.
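To illustrate the one-to-many structure, a hypothetical example (the topic and labels below are made up, not taken from the actual dataset):

# Hypothetical example of the one-to-many topic-to-labels mapping.
topic_to_labels = {
    "site web google search website online internet social content user": [
        "web search",      # candidate from a weak labeler (illustrative)
        "social media",    # illustrative
        "online content",  # illustrative
    ],
}

# Each (topic, label) pair becomes one seq2seq fine-tuning example.
pairs = [(topic, label) for topic, labels in topic_to_labels.items() for label in labels]
print(len(pairs))  # 3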
Eval results
| Model | Top-1 Avg. | Top-3 Avg. | Top-5 Avg. | nDCG-1 | nDCG-3 | nDCG-5 |
|---|---|---|---|---|---|---|
| NETL (U) | 2.66 | 2.59 | 2.50 | 0.83 | 0.85 | 0.87 |
| NETL (S) | 2.74 | 2.57 | 2.49 | 0.88 | 0.85 | 0.88 |
| BART-TL-all | 2.64 | 2.52 | 2.43 | 0.83 | 0.84 | 0.87 |
| BART-TL-ng | 2.62 | 2.50 | 2.33 | 0.82 | 0.84 | 0.85 |
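The Top-k Avg. columns report average human ratings of the top k generated labels, and nDCG-k is the normalized discounted cumulative gain at rank k. As a worked sketch, one common nDCG formulation (the ratings below are made up; the paper's exact variant may differ):

import math

def dcg(ratings):
    # DCG with linear gains; the paper's exact variant may differ.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ratings))

def ndcg(ratings, k):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = sorted(ratings, reverse=True)
    return dcg(ratings[:k]) / dcg(ideal[:k])

# Made-up annotator ratings for five generated labels, best-first.
ratings = [3, 2, 3, 1, 0]
print(round(ndcg(ratings, 3), 2))  # 0.98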
BibTeX entry and citation info
@inproceedings{popa-rebedea-2021-bart,
title = "{BART}-{TL}: Weakly-Supervised Topic Label Generation",
author = "Popa, Cristian and
Rebedea, Traian",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-main.121",
pages = "1418--1425",
abstract = "We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.",
}
📄 License
The model is licensed under the Apache 2.0 license.