TY - JOUR
T1 - Low-cost and scalable machine learning model for identifying children and adolescents with poor oral health using survey data
T2 - An empirical study in Portugal
AU - Lavado, Susana
AU - Costa, Eduardo
AU - Sturm, Niclas F
AU - Tafferner, Johannes S
AU - Rodrigues, Octávio
AU - Pita Barros, Pedro
AU - Zejnilovic, Leid
N1 - Lavado, S., Costa, E., Sturm, N. F., Tafferner, J. S., Rodrigues, O., Pita Barros, P., & Zejnilovic, L. (2025). Low-cost and scalable machine learning model for identifying children and adolescents with poor oral health using survey data: An empirical study in Portugal. PLoS ONE, 20(1), Article e0312075. https://doi.org/10.1371/journal.pone.0312075 -- Copyright: © 2025 Lavado et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2025/1/24
Y1 - 2025/1/24
N2 - This empirical study assessed the potential of developing a machine-learning model to identify children and adolescents with poor oral health using only self-reported survey data. Such a model could enable scalable and cost-effective screening and targeted interventions, optimizing limited resources to improve oral health outcomes. To train and test the model, we used data from 2,133 students attending schools in a Portuguese municipality. Poor oral health (the dependent variable) was defined as having a Decayed, Missing, and Filled Teeth index for deciduous teeth (dmft) or permanent teeth (DMFT) at or above expert-defined thresholds (dmft/DMFT ≥ 3 or 4). The survey provided information about the students' oral health habits, knowledge, beliefs, and food and physical activity habits, which served as independent variables. Based on the precision@k metric, logistic regression models with variables selected through low-variance filtering and recursive feature elimination outperformed various other models trained with complex machine learning algorithms, as well as random selection and expert rule-based models, in identifying students with poor oral health. The proposed models are inherently explainable and broadly applicable, which, given the context, could compensate for their lower performance (Area Under the Curve = 0.64-0.70) compared to similar approaches and models. This study is one of the few in oral health care that includes bias auditing of classification models. The audit surfaced potential biases related to demographic factors such as age and social assistance status. Addressing these biases without significantly compromising model performance remains a challenge. The results confirm the feasibility of survey-based machine learning models for identifying individuals with poor oral health, but further validation of this approach and pilot testing in field trials are necessary.
AB - This empirical study assessed the potential of developing a machine-learning model to identify children and adolescents with poor oral health using only self-reported survey data. Such a model could enable scalable and cost-effective screening and targeted interventions, optimizing limited resources to improve oral health outcomes. To train and test the model, we used data from 2,133 students attending schools in a Portuguese municipality. Poor oral health (the dependent variable) was defined as having a Decayed, Missing, and Filled Teeth index for deciduous teeth (dmft) or permanent teeth (DMFT) at or above expert-defined thresholds (dmft/DMFT ≥ 3 or 4). The survey provided information about the students' oral health habits, knowledge, beliefs, and food and physical activity habits, which served as independent variables. Based on the precision@k metric, logistic regression models with variables selected through low-variance filtering and recursive feature elimination outperformed various other models trained with complex machine learning algorithms, as well as random selection and expert rule-based models, in identifying students with poor oral health. The proposed models are inherently explainable and broadly applicable, which, given the context, could compensate for their lower performance (Area Under the Curve = 0.64-0.70) compared to similar approaches and models. This study is one of the few in oral health care that includes bias auditing of classification models. The audit surfaced potential biases related to demographic factors such as age and social assistance status. Addressing these biases without significantly compromising model performance remains a challenge. The results confirm the feasibility of survey-based machine learning models for identifying individuals with poor oral health, but further validation of this approach and pilot testing in field trials are necessary.
KW - Humans
KW - Machine Learning
KW - Portugal
KW - Oral Health
KW - Adolescent
KW - Child
KW - Female
KW - Male
KW - Surveys and Questionnaires
KW - Dental Caries/diagnosis
UR - https://www.scopus.com/pages/publications/85216376937
UR - https://www.webofscience.com/wos/woscc/full-record/WOS:001466879700114
U2 - 10.1371/journal.pone.0312075
DO - 10.1371/journal.pone.0312075
M3 - Article
C2 - 39854338
SN - 1932-6203
VL - 20
JO - PLoS ONE
JF - PLoS ONE
IS - 1
M1 - e0312075
ER -