BERTopic_ArXiv: Open-Source Topic Modeling Model with Multi-Aspect Topic Representations and Classification

Bertopic ArXiv

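Developed by MaartenGr

A pretrained topic model built on the BERTopic framework, trained on roughly 30,000 ArXiv paper abstracts, with support for multi-aspect topic representations and classification.

Text classification #Multi-aspect topic modeling #Academic paper analysis #ChatGPT-enhanced

Downloads: 231

Published: 5/30/2023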

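Model overview

BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large datasets. This model demonstrates how several of BERTopic's topic representation methods can be combined.

Model features

Multi-aspect topic representations

Combines part-of-speech tagging, KeyBERT-inspired keyword extraction, MMR, and other techniques to produce rich topic representations.

ChatGPT enhancement

Uses ChatGPT to generate topic labels and summaries, improving interpretability.

Modular design

Supports flexible combinations of topic representations and clustering algorithms.

Model capabilities

Text classification

Topic extraction

Keyword generation

Topic summary generation

Use cases

Academic research

Paper topic analysis

Topic mining and classification over academic paper collections such as ArXiv

Identifies 107 distinct topics (including the outlier topic -1)

Content analysis

Document clustering

Automatic topic clustering of large document collections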

🚀 BERTopic_ArXiv

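BERTopic_ArXiv is a model built on BERTopic. BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large datasets. This pretrained model showcases several of the representation models available in BERTopic; it was trained on roughly 30,000 ArXiv abstracts using multiple topic representations.

🚀 Quick start

To use this model:

Install BERTopic: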

pip install -U bertopic
pip install -U safetensors

Use the model:

from bertopic import BERTopic
topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")

topic_model.get_topic_info()
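
Once loaded, the model can also be used to assign topics to new abstracts. The helper below is a minimal sketch of that step: `assign_topics` is a hypothetical name introduced here for illustration, and it assumes only that the object passed in exposes BERTopic's `transform` and `get_topic` methods.

```python
def assign_topics(topic_model, docs, top_n=5):
    """Pair each document with its predicted topic ID and that topic's top keywords."""
    # transform() returns the predicted topic IDs and (optionally) probabilities.
    topics, _ = topic_model.transform(docs)
    results = []
    for doc, topic_id in zip(docs, topics):
        # get_topic() returns (word, score) pairs, or False for an unknown topic ID.
        words = topic_model.get_topic(int(topic_id)) or []
        keywords = [word for word, _ in words[:top_n]]
        results.append((doc, int(topic_id), keywords))
    return results


# Usage (downloads the model and its embedding model on first run):
# from bertopic import BERTopic
# topic_model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
# assign_topics(topic_model, ["We propose an end-to-end speech recognition model."])
```

Returning plain tuples keeps the result easy to print or load into a DataFrame; the outlier topic `-1` is handled like any other topic ID.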

View the different topic representations:

>>> topic_model.get_topic(0, full=True)
{'Main': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'POS': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181],
  ['user', 0.008753551043296965]],
 'KeyBERTInspired': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'MMR': [['dialogue', 0.02704485163341523],
  ['dialog', 0.01677038224466311],
  ['response', 0.011692640237477233],
  ['responses', 0.01002788412923778],
  ['intent', 0.00990720856306287],
  ['oriented', 0.009217253131615378],
  ['slot', 0.009177118721490055],
  ['conversational', 0.009129311385144046],
  ['systems', 0.009101146153425574],
  ['conversation', 0.008845392252307181]],
 'KeyBERT + MMR': [['task oriented dialogue', 0.6559894680976868],
  ['dialogue systems', 0.6249060034751892],
  ['oriented dialogue', 0.5788208246231079],
  ['dialog systems', 0.530449628829956],
  ['dialogue state', 0.5167528390884399],
  ['response generation', 0.5143576860427856],
  ['spoken language understanding', 0.46739083528518677],
  ['oriented dialog', 0.4600704610347748],
  ['dialog', 0.4534587264060974],
  ['dialogues', 0.44082391262054443]],
 'OpenAI_Label': [['Challenges and Approaches in Developing Task-oriented Dialogue Systems',
   1]],
 'OpenAI_Summary': [['Task-oriented dialogue systems and their components, such as dialogue policy, natural language understanding, dialogue state tracking, response generation, and end-to-end training using neural networks. These components are crucial in assisting users to complete various activities such as booking tickets and restaurant reservations through spoken language understanding dialogue. The challenge lies in tracking dialogue states of multiple domains and obtaining annotations for training. Effective SLU is achieved by utilizing context from the prior dialogue history.',
   1]]}
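
The result is a dictionary that maps each representation name to a list of `[keyword, score]` pairs, so the representations can be compared directly. The sketch below uses an abbreviated copy of the output above (scores rounded, only `Main` and `POS` kept); `keywords` and `diff_keywords` are hypothetical helper names, not part of BERTopic.

```python
# Abbreviated copy of the output above: scores rounded, two representations kept.
topic_0 = {
    "Main": [["dialogue", 0.0270], ["dialog", 0.0168], ["response", 0.0117],
             ["responses", 0.0100], ["intent", 0.0099], ["oriented", 0.0092],
             ["slot", 0.0092], ["conversational", 0.0091], ["systems", 0.0091],
             ["conversation", 0.0088]],
    "POS": [["dialogue", 0.0270], ["dialog", 0.0168], ["response", 0.0117],
            ["responses", 0.0100], ["intent", 0.0099], ["slot", 0.0092],
            ["conversational", 0.0091], ["systems", 0.0091],
            ["conversation", 0.0088], ["user", 0.0088]],
}


def keywords(representation):
    """Drop the scores and keep only the keywords, in ranked order."""
    return [word for word, _ in representation]


def diff_keywords(topic, left, right):
    """Return the keywords unique to each of two representations."""
    left_words, right_words = keywords(topic[left]), keywords(topic[right])
    only_left = [w for w in left_words if w not in right_words]
    only_right = [w for w in right_words if w not in left_words]
    return only_left, only_right


print(diff_keywords(topic_0, "Main", "POS"))  # (['oriented'], ['user'])
```

For this topic the part-of-speech filter drops "oriented" and admits "user" in its place; the other nine keywords are shared.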

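✨ Key features

Multiple topic representations: POS, KeyBERTInspired, MaximalMarginalRelevance, KeyBERT + MaximalMarginalRelevance, ChatGPT labels, and ChatGPT summaries.
Visualization examples: the model card provides visualization examples for the default c-TF-IDF representation and for the ChatGPT-generated labels.
Detailed topic information: the keywords, frequency, and label of every topic can be inspected.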
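
The default representation mentioned above, c-TF-IDF, scores a word by how frequent it is inside one topic relative to how frequent it is across all topics. The following is a minimal pure-Python sketch of that idea, using the weighting tf × log(1 + A / f), where A is the average number of words per topic and f is the word's total frequency. It is an illustration under those assumptions, not the library's implementation; `c_tf_idf` and the toy documents are invented for the example.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """docs_per_topic: dict mapping topic ID -> list of tokenized documents."""
    # Treat all documents of one topic as a single "class document".
    class_counts = {
        topic: Counter(word for doc in docs for word in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of each word across all topics.
    total_per_word = Counter()
    for counts in class_counts.values():
        total_per_word.update(counts)
    avg_words_per_class = sum(total_per_word.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words)
            * math.log(1 + avg_words_per_class / total_per_word[word])
            for word, count in counts.items()
        }
    return scores


toy = {
    0: [["dialogue", "systems"], ["dialogue", "response"]],
    1: [["speech", "recognition"], ["speech", "systems"]],
}
for topic, word_scores in c_tf_idf(toy).items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(topic, ranked)
```

In the toy data "systems" appears in both topics, so it ranks below words that are specific to a single topic.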

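📚 Detailed documentation

Topic overview

Number of topics: 107
Number of training documents: 33189

Overview of all topics:

Topic ID	Topic Keywords	Topic Frequency	Label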
-1	language - models - model - data - based	20	-1_language_models_model_data
0	dialogue - dialog - response - responses - intent	14247	0_dialogue_dialog_response_responses
1	speech - asr - speech recognition - recognition - end	1833	1_speech_asr_speech recognition_recognition
2	tuning - tasks - prompt - models - language	1369	2_tuning_tasks_prompt_models
3	summarization - summaries - summary - abstractive - document	1109	3_summarization_summaries_summary_abstractive
4	question - answer - qa - answering - question answering	893	4_question_answer_qa_answering
5	sentiment - sentiment analysis - aspect - analysis - opinion	837	5_sentiment_sentiment analysis_aspect_analysis
6	clinical - medical - biomedical - notes - patient	691	6_clinical_medical_biomedical_notes
7	translation - nmt - machine translation - neural machine - neural machine translation	586	7_translation_nmt_machine translation_neural machine
8	generation - text generation - text - language generation - nlg	558	8_generation_text generation_text_language generation
9	hate - hate speech - offensive - speech - detection	484	9_hate_hate speech_offensive_speech
10	news - fake - fake news - stance - fact	455	10_news_fake_fake news_stance
11	relation - relation extraction - extraction - relations - entity	450	11_relation_relation extraction_extraction_relations
12	ner - named - named entity - entity - named entity recognition	376	12_ner_named_named entity_entity
13	parsing - parser - dependency - treebank - parsers	370	13_parsing_parser_dependency_treebank
14	event - temporal - events - event extraction - extraction	314	14_event_temporal_events_event extraction
15	emotion - emotions - multimodal - emotion recognition - emotional	300	15_emotion_emotions_multimodal_emotion recognition
16	word - embeddings - word embeddings - embedding - words	292	16_word_embeddings_word embeddings_embedding
17	explanations - explanation - rationales - rationale - interpretability	212	17_explanations_explanation_rationales_rationale
18	morphological - arabic - morphology - languages - inflection	204	18_morphological_arabic_morphology_languages
19	topic - topics - topic models - lda - topic modeling	200	19_topic_topics_topic models_lda
20	bias - gender - biases - gender bias - debiasing	195	20_bias_gender_biases_gender bias
21	law - frequency - zipf - words - length	185	21_law_frequency_zipf_words
22	legal - court - law - legal domain - case	182	22_legal_court_law_legal domain
23	adversarial - attacks - attack - adversarial examples - robustness	181	23_adversarial_attacks_attack_adversarial examples
24	commonsense - commonsense knowledge - reasoning - knowledge - commonsense reasoning	180	24_commonsense_commonsense knowledge_reasoning_knowledge
25	quantum - semantics - calculus - compositional - meaning	171	25_quantum_semantics_calculus_compositional
26	correction - error - error correction - grammatical - grammatical error	161	26_correction_error_error correction_grammatical
27	argument - arguments - argumentation - argumentative - mining	160	27_argument_arguments_argumentation_argumentative
28	sarcasm - humor - sarcastic - detection - humorous	157	28_sarcasm_humor_sarcastic_detection
29	coreference - resolution - coreference resolution - mentions - mention	156	29_coreference_resolution_coreference resolution_mentions
30	sense - word sense - wsd - word - disambiguation	153	30_sense_word sense_wsd_word
31	knowledge - knowledge graph - graph - link prediction - entities	149	31_knowledge_knowledge graph_graph_link prediction
32	parsing - semantic parsing - amr - semantic - parser	146	32_parsing_semantic parsing_amr_semantic
33	cross lingual - lingual - cross - transfer - languages	146	33_cross lingual_lingual_cross_transfer
34	mt - translation - qe - quality - machine translation	139	34_mt_translation_qe_quality
35	sql - text sql - queries - spider - schema	138	35_sql_text sql_queries_spider
36	classification - text classification - label - text - labels	136	36_classification_text classification_label_text
37	style - style transfer - transfer - text style - text style transfer	136	37_style_style transfer_transfer_text style
38	question - question generation - questions - answer - generation	129	38_question_question generation_questions_answer
39	authorship - authorship attribution - attribution - author - authors	127	39_authorship_authorship attribution_attribution_author
40	sentence - sentence embeddings - similarity - sts - sentence embedding	123	40_sentence_sentence embeddings_similarity_sts
41	code - identification - switching - cs - code switching	121	41_code_identification_switching_cs
42	story - stories - story generation - generation - storytelling	118	42_story_stories_story generation_generation
43	discourse - discourse relation - discourse relations - rst - discourse parsing	117	43_discourse_discourse relation_discourse relations_rst
44	code - programming - source code - code generation - programming languages	117	44_code_programming_source code_code generation
45	paraphrase - paraphrases - paraphrase generation - paraphrasing - generation	114	45_paraphrase_paraphrases_paraphrase generation_paraphrasing
46	agent - games - environment - instructions - agents	111	46_agent_games_environment_instructions
47	covid - covid 19 - 19 - tweets - pandemic	108	47_covid_covid 19_19_tweets
48	linking - entity linking - entity - el - entities	107	48_linking_entity linking_entity_el
49	poetry - poems - lyrics - poem - music	103	49_poetry_poems_lyrics_poem
50	image - captioning - captions - visual - caption	100	50_image_captioning_captions_visual
51	nli - entailment - inference - natural language inference - language inference	96	51_nli_entailment_inference_natural language inference
52	keyphrase - keyphrases - extraction - document - phrases	95	52_keyphrase_keyphrases_extraction_document
53	simplification - text simplification - ts - sentence - simplified	95	53_simplification_text simplification_ts_sentence
54	empathetic - emotion - emotional - empathy - emotions	95	54_empathetic_emotion_emotional_empathy
55	depression - mental - health - mental health - social media	93	55_depression_mental_health_mental health
56	segmentation - word segmentation - chinese - chinese word segmentation - chinese word	93	56_segmentation_word segmentation_chinese_chinese word segmentation
57	citation - scientific - papers - citations - scholarly	85	57_citation_scientific_papers_citations
58	agreement - syntactic - verb - grammatical - subject verb	85	58_agreement_syntactic_verb_grammatical
59	metaphor - literal - figurative - metaphors - idiomatic	83	59_metaphor_literal_figurative_metaphors
60	srl - semantic role - role labeling - semantic role labeling - role	82	60_srl_semantic role_role labeling_semantic role labeling
61	privacy - private - federated - privacy preserving - federated learning	82	61_privacy_private_federated_privacy preserving
62	change - semantic change - time - semantic - lexical semantic	82	62_change_semantic change_time_semantic
63	bilingual - lingual - cross lingual - cross - embeddings	80	63_bilingual_lingual_cross lingual_cross
64	political - media - news - bias - articles	77	64_political_media_news_bias
65	medical - qa - question - questions - clinical	75	65_medical_qa_question_questions
66	math - mathematical - math word - word problems - problems	73	66_math_mathematical_math word_word problems
67	financial - stock - market - price - news	69	67_financial_stock_market_price
68	table - tables - tabular - reasoning - qa	69	68_table_tables_tabular_reasoning
69	readability - complexity - assessment - features - reading	65	69_readability_complexity_assessment_features
70	layout - document - documents - document understanding - extraction	64	70_layout_document_documents_document understanding
71	brain - cognitive - reading - syntactic - language	62	71_brain_cognitive_reading_syntactic
72	sign - gloss - language - signed - language translation	61	72_sign_gloss_language_signed
73	vqa - visual - visual question - visual question answering - question	59	73_vqa_visual_visual question_visual question answering
74	biased - biases - spurious - nlp - debiasing	57	74_biased_biases_spurious_nlp
75	visual - dialogue - multimodal - image - dialog	55	75_visual_dialogue_multimodal_image
76	translation - machine translation - machine - smt - statistical	54	76_translation_machine translation_machine_smt
77	multimodal - visual - image - translation - machine translation	52	77_multimodal_visual_image_translation
78	geographic - location - geolocation - geo - locations	51	78_geographic_location_geolocation_geo
79	reasoning - prompting - llms - chain thought - chain	48	79_reasoning_prompting_llms_chain thought
80	essay - scoring - aes - essay scoring - essays	45	80_essay_scoring_aes_essay scoring
81	crisis - disaster - traffic - tweets - disasters	45	81_crisis_disaster_traffic_tweets
82	graph - text classification - text - gcn - classification	44	82_graph_text classification_text_gcn
83	annotation - tools - linguistic - resources - xml	43	83_annotation_tools_linguistic_resources
84	entity alignment - alignment - kgs - entity - ea	43	84_entity alignment_alignment_kgs_entity
85	personality - traits - personality traits - evaluative - text	42	85_personality_traits_personality traits_evaluative
86	ad - alzheimer - alzheimer disease - disease - speech	40	86_ad_alzheimer_alzheimer disease_disease
87	taxonomy - hypernymy - taxonomies - hypernym - hypernyms	39	87_taxonomy_hypernymy_taxonomies_hypernym
88	active learning - active - al - learning - uncertainty	37	88_active learning_active_al_learning
89	reviews - summaries - summarization - review - opinion	36	89_reviews_summaries_summarization_review
90	emoji - emojis - sentiment - message - anonymous	35	90_emoji_emojis_sentiment_message
91	table - table text - tables - table text generation - text generation	35	91_table_table text_tables_table text generation
92	domain - domain adaptation - adaptation - domains - source	35	92_domain_domain adaptation_adaptation_domains
93	alignment - word alignment - parallel - pairs - alignments	34	93_alignment_word alignment_parallel_pairs
94	indo - languages - indo european - names - family	34	94_indo_languages_indo european_names
95	patent - claim - claim generation - chemical - technical	32	95_patent_claim_claim generation_chemical
96	agents - emergent - communication - referential - games	32	96_agents_emergent_communication_referential
97	graph - amr - graph text - graphs - text generation	31	97_graph_amr_graph text_graphs
98	moral - ethical - norms - values - social	29	98_moral_ethical_norms_values
99	acronym - acronyms - abbreviations - abbreviation - disambiguation	27	99_acronym_acronyms_abbreviations_abbreviation
100	typing - entity typing - entity - type - types	27	100_typing_entity typing_entity_type
101	coherence - discourse - discourse coherence - coherence modeling - text	26	101_coherence_discourse_discourse coherence_coherence modeling
102	pos - taggers - tagging - tagger - pos tagging	25	102_pos_taggers_tagging_tagger
103	drug - social - social media - media - health	25	103_drug_social_social media_media
104	gender - translation - bias - gender bias - mt	24	104_gender_translation_bias_gender bias
105	job - resume - skills - skill - soft	21	105_job_resume_skills_skill
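
The Label column in the table follows a visible pattern: the topic ID followed by the first four keywords, joined with underscores. The sketch below parses a tab-separated row and rebuilds the label from the other columns; `parse_row` and `make_label` are hypothetical helper names, and the pattern is inferred from the rows shown here rather than taken from the library's source.

```python
def parse_row(row):
    """Split one tab-separated table row into (topic_id, keywords, frequency, label)."""
    topic_id, keywords, frequency, label = row.split("\t")
    return int(topic_id), keywords.split(" - "), int(frequency), label


def make_label(topic_id, keywords, n_words=4):
    """Rebuild a label as '<id>_<kw1>_<kw2>_<kw3>_<kw4>'."""
    return "_".join([str(topic_id)] + keywords[:n_words])


row = ("1\tspeech - asr - speech recognition - recognition - end"
       "\t1833\t1_speech_asr_speech recognition_recognition")
topic_id, keywords, frequency, label = parse_row(row)
print(make_label(topic_id, keywords) == label)  # True
```

Multi-word keywords such as "speech recognition" keep their internal space, so labels should be split on underscores only with care.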

Training procedure

The model was trained as follows:

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import PartOfSpeech, KeyBERTInspired, MaximalMarginalRelevance, OpenAI

# Prepare the sub-models (UMAP and HDBSCAN here are the GPU implementations from cuML)
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20, verbose=True)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Summarization with ChatGPT
summarization_prompt = """
I have a topic that is described by the following keywords: [KEYWORDS]
In this topic, the following documents are a small but representative subset of all documents in the topic:
[DOCUMENTS]

Based on the information above, please give a description of this topic in the following format:
topic: <description>
"""
summarization_model = OpenAI(model="gpt-3.5-turbo", chat=True, prompt=summarization_prompt, nr_docs=5, exponential_backoff=True, diversity=0.1)

# Representation models
representation_models = {
    "POS": PartOfSpeech("en_core_web_lg"),
    "KeyBERTInspired": KeyBERTInspired(),
    "MMR": MaximalMarginalRelevance(diversity=0.3),
    "KeyBERT + MMR": [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
    "OpenAI_Label": OpenAI(model="gpt-3.5-turbo", exponential_backoff=True, chat=True, diversity=0.1),
    "OpenAI_Summary": [KeyBERTInspired(), summarization_model],
}

# Fit BERTopic; `docs` is the list of training documents and is not defined in this snippet
topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_models,
        verbose=True
).fit(docs)
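
The training code sets `diversity=0.3` for `MaximalMarginalRelevance`. As a rough illustration of what that parameter trades off, here is a minimal pure-Python sketch of the general MMR selection idea: greedily pick the candidate that is most relevant to the topic while penalizing similarity to keywords already chosen. It is not the library's code; `mmr`, `cosine`, and the two-dimensional toy vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(topic_vec, word_vecs, top_n=3, diversity=0.3):
    """word_vecs: dict word -> embedding. Returns top_n words chosen greedily."""
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected word.
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


toy_vectors = {
    "dialogue": [1.0, 0.10],
    "dialog": [1.0, 0.12],
    "response": [0.6, 0.80],
}
print(mmr([1.0, 0.0], toy_vectors, top_n=2, diversity=0.0))  # ['dialogue', 'dialog']
print(mmr([1.0, 0.0], toy_vectors, top_n=2, diversity=0.9))  # ['dialogue', 'response']
```

With `diversity=0.0` the near-duplicate "dialog" is kept because only relevance counts; a high diversity value swaps it for the less similar "response".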

Training hyperparameters

calculate_probabilities: False
language: None
low_memory: False
min_topic_size: 10
n_gram_range: (1, 1)
nr_topics: None
seed_topic_list: None
top_n_words: 10
verbose: True

Framework versions

Attribute	Details
Model type	BERTopic
Training data	Roughly 30,000 ArXiv abstracts
Numpy	1.22.4
HDBSCAN	0.8.29
UMAP	0.5.3
Pandas	1.5.3
Scikit-Learn	1.2.2
Sentence-transformers	2.2.2
Transformers	4.29.2
Numba	0.56.4
Plotly	5.13.1
Python	3.10.11
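
To see how a local environment compares with the versions listed above, a short standard-library check can help. This is a sketch only: the table gives display names, so the PyPI distribution names used as keys below (for example `umap-learn` for UMAP) are assumptions, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the table above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.29",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.29.2",
    "numba": "0.56.4",
    "plotly": "5.13.1",
}


def compare_versions(expected):
    """Return {name: (listed, installed_or_None, matches)} for each package."""
    report = {}
    for name, listed in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (listed, installed, installed == listed)
    return report


for name, (listed, installed, _) in compare_versions(EXPECTED).items():
    print(f"{name}: listed {listed}, installed {installed or 'not installed'}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply those recorded when the model was trained.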