TY - JOUR
T1 - Topic Modeling: A Consistent Framework for Comparative Studies
AU - Amaro, Ana
AU - Bação, Fernando
N1 - info:eu-repo/grantAgreement/FCT/Concurso de Projetos de Investigação Científica e Desenvolvimento Tecnológico em Ciência dos dados e inteligência artificial na Administração Pública - 2019/DSAIPA%2FDS%2F0116%2F2019/PT#
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04152%2F2020/PT#
Amaro, A., & Bação, F. (2024). Topic Modeling: A Consistent Framework for Comparative Studies. Emerging Science Journal, 8(1), 125-139. https://doi.org/10.28991/ESJ-2024-08-01-09 --- This work was supported by a grant of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), DSAIPA/DS/0116/2019, and project UIDB/04152/2020—Centro de Investigação em Gestão de Informação (MagIC)
PY - 2024/2/1
Y1 - 2024/2/1
AB - In recent years, the field of Topic Modeling (TM) has grown in importance due to the increasing availability of digital text data. TM is an unsupervised learning technique that helps uncover latent semantic structures in large sets of documents, making it a valuable tool for finding relevant patterns. However, evaluating the performance of TM algorithms can be challenging because different metrics and datasets are often used, leading to inconsistent results. In addition, many current surveys of TM algorithms focus on a limited number of models and exclude state-of-the-art approaches. This paper addresses these issues by presenting a comprehensive comparative study of five TM algorithms across three benchmark datasets using five metrics. We offer an updated survey of the latest TM approaches and evaluation metrics, providing a consistent framework for comparing algorithms while introducing state-of-the-art approaches that have been disregarded in the literature. The experiments, which primarily use Context Vectors (CV) Topic Coherence as an evaluation metric, show that Top2Vec is the best-performing model across all datasets, disrupting the tendency for Latent Dirichlet Allocation to be the best performer.
KW - Natural Language Processing
KW - Top2Vec
KW - Topic Coherence
KW - Topic Modeling
KW - Unsupervised Learning
UR - https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
UR - https://huggingface.co/datasets/yahoo_answers_qa
UR - https://huggingface.co/datasets/big_patent
UR - http://www.scopus.com/inward/record.url?scp=85186246398&partnerID=8YFLogxK
U2 - 10.28991/ESJ-2024-08-01-09
DO - 10.28991/ESJ-2024-08-01-09
M3 - Article
SN - 2610-9182
VL - 8
SP - 125
EP - 139
JO - Emerging Science Journal
JF - Emerging Science Journal
IS - 1
ER -