TY - JOUR
T1 - Improving Active Learning Performance through the Use of Data Augmentation
AU - Fonseca, João
AU - Bação, Fernando
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04152%2F2020/PT#
info:eu-repo/grantAgreement/FCT/3599-PPCDT/DSAIPA%2FDS%2F0116%2F2019/PT#
info:eu-repo/grantAgreement/FCT/3599-PPCDT/PTDC%2FCTA-AMB%2F28438%2F2017/PT#
Fonseca, J., & Bação, F. (2023). Improving Active Learning Performance through the Use of Data Augmentation. International Journal of Intelligent Systems, 2023, 1-17. https://doi.org/10.1155/2023/7941878 --- Funding: This research was supported by three research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”): SFRH/BD/151473/2021 - MIT Portugal PhD Grant; DSAIPA/DS/0116/2019; and PCIF/SSI/0102/2017.
PY - 2023/2/20
Y1 - 2023/2/20
N2 - Active learning (AL) is a well-known technique for optimizing data usage in training through the interactive selection of unlabeled observations, drawn from a large pool of unlabeled data, to be labeled by a supervisor. Its focus is to find the unlabeled observations that, once labeled, will maximize the informativeness of the training dataset, thereby reducing data-related costs. The literature describes several methods to improve the effectiveness of this process. Nonetheless, there is little research on the application of artificial data sources in AL, especially outside image classification or NLP. This paper proposes a new AL framework that relies on the effective use of artificial data. It may be used with any classifier, generation mechanism, and data type and can be integrated with multiple other state-of-the-art AL contributions. This combination is expected to increase the ML classifier’s performance and reduce both the supervisor’s involvement and the amount of required labeled data at the expense of a marginal increase in computational time. The proposed method introduces a hyperparameter optimization component to improve the generation of artificial instances during the AL process, as well as an uncertainty-based data generation mechanism. We compare the proposed method to the standard framework and to an oversampling-based active learning method for more informed data generation in an AL context. The models’ performance was tested using four different classifiers, two AL-specific performance metrics, and three classification performance metrics over 15 different datasets. We demonstrate that the proposed framework, using data augmentation, significantly improved the performance of AL, both in terms of classification performance and data selection efficiency (all code and preprocessed data developed for this study are available at https://github.com/joaopfonseca/publications/).
UR - https://github.com/joaopfonseca/publications/tree/master/2023-active-learning-augmentation
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:000943691000006
UR - http://www.scopus.com/inward/record.url?scp=85176447050&partnerID=8YFLogxK
U2 - 10.1155/2023/7941878
DO - 10.1155/2023/7941878
M3 - Article
SN - 0884-8173
VL - 2023
SP - 1
EP - 17
JO - International Journal of Intelligent Systems
JF - International Journal of Intelligent Systems
M1 - 7941878
ER -