TY - JOUR
T1 - UMAP-SMOTENC
T2 - A Simple, Efficient, and Consistent Alternative for Privacy-Aware Synthetic Data Generation
AU - Almeida, Gonçalo
AU - Bação, Fernando
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F04152%2F2020/PT#
https://doi.org/10.54499/UIDB/04152/2020#
info:eu-repo/grantAgreement/FCT/3599-PPCDT/DSAIPA%2FDS%2F0116%2F2019/PT#
https://doi.org/10.54499/DSAIPA/DS/0116/2019#
Almeida, G., & Bação, F. (2024). UMAP-SMOTENC: A Simple, Efficient, and Consistent Alternative for Privacy-Aware Synthetic Data Generation. Knowledge-Based Systems, 300, 1-14. Article 112174. https://doi.org/10.1016/j.knosys.2024.112174 --- This work was supported by a grant of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), DSAIPA/DS/0116/2019, and project UIDB/04152/2020—Centro de Investigação em Gestão de Informação (MagIC).
PY - 2024/9/27
Y1 - 2024/9/27
N2 - The intensification of governmental legislation and social awareness around data privacy protection severely constrains organizations' data utilization capabilities. As a result, interest in data anonymization techniques, which should preserve the patterns present in the original data while mitigating the risk of privacy leakage, has also increased. While conventional methods may compromise privacy, recently proposed deep learning generative approaches are computationally expensive and unreliable when applied to tabular datasets, hindering the democratization and usability of data. In this paper, we explore this trade-off between privacy and the quality of the anonymized data, establishing a new equilibrium by applying a synthetic oversampling technique, SMOTE-NC, to a non-linearly compressed version of the input space obtained with UMAP. The introduced approach, UMAP-SMOTENC, constitutes an efficient and consistent solution that can be used without significant hyperparameter tuning effort or resorting to massive computing infrastructure. An experiment was conducted to evaluate the robustness of the proposed solution, comparing several metrics and models across eight datasets with diverse characteristics. The results suggest that the presented method can efficiently synthesize privacy-aware data while conserving the relevant patterns of the real dataset, particularly those required for classification tasks.
KW - Anonymization Techniques
KW - Machine Learning
KW - SMOTE
KW - Synthetic Data Generation
KW - UMAP
UR - http://www.scopus.com/inward/record.url?scp=85196937066&partnerID=8YFLogxK
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001261574700001
U2 - 10.1016/j.knosys.2024.112174
DO - 10.1016/j.knosys.2024.112174
M3 - Article
SN - 0950-7051
VL - 300
SP - 1
EP - 14
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 112174
ER -