TY - JOUR
T1 - Regularization Methods for High-Dimensional Data as a Tool for Seafood Traceability
AU - Yokochi, Clara
AU - Bispo, Regina
AU - Ricardo, Fernando
AU - Calado, Ricardo
N1 - info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDB%2F00297%2F2020/PT#
info:eu-repo/grantAgreement/FCT/6817 - DCRRNI ID/UIDP%2F00297%2F2020/PT#
PY - 2023/9
Y1 - 2023/9
N2 - Seafood traceability, needed to regulate food safety, control fisheries, combat fraud, and prevent risks to public health from harvesting in polluted locations, depends heavily on predicting the geographic origin of seafood. When the datasets available to study traceability are high-dimensional, standard classical statistical models fail. Under these circumstances, alternative methods are needed to accurately predict the geographic origin of seafood. In this study, we propose an analytical approach combining regularization methods and resampling techniques to overcome the high-dimensionality problem. In particular, we compare the Ridge regression, LASSO and Elastic net penalty-based approaches. These methods were applied to predict the origin of the saltwater clam Ruditapes philippinarum, a non-indigenous and commercially important marine bivalve species that occurs commonly in European estuaries. Further, the resampling method of Monte Carlo Cross-Validation was implemented to overcome challenges related to the small sample size. The results of the three methods were compared. For full reproducibility, an R Markdown file and the dataset used are provided. We conclude by highlighting the insights that this methodology may bring to modeling a multi-categorical response based on a high-dimensional dataset with highly correlated explanatory variables, and to combating the mislabeling of the geographic origin of seafood.
AB - Seafood traceability, needed to regulate food safety, control fisheries, combat fraud, and prevent risks to public health from harvesting in polluted locations, depends heavily on predicting the geographic origin of seafood. When the datasets available to study traceability are high-dimensional, standard classical statistical models fail. Under these circumstances, alternative methods are needed to accurately predict the geographic origin of seafood. In this study, we propose an analytical approach combining regularization methods and resampling techniques to overcome the high-dimensionality problem. In particular, we compare the Ridge regression, LASSO and Elastic net penalty-based approaches. These methods were applied to predict the origin of the saltwater clam Ruditapes philippinarum, a non-indigenous and commercially important marine bivalve species that occurs commonly in European estuaries. Further, the resampling method of Monte Carlo Cross-Validation was implemented to overcome challenges related to the small sample size. The results of the three methods were compared. For full reproducibility, an R Markdown file and the dataset used are provided. We conclude by highlighting the insights that this methodology may bring to modeling a multi-categorical response based on a high-dimensional dataset with highly correlated explanatory variables, and to combating the mislabeling of the geographic origin of seafood.
KW - Elastic net
KW - LASSO
KW - Regularization
KW - Ridge regression
KW - Traceability
UR - http://www.scopus.com/inward/record.url?scp=85169556874&partnerID=8YFLogxK
U2 - 10.1007/s42519-023-00341-8
DO - 10.1007/s42519-023-00341-8
M3 - Article
AN - SCOPUS:85169556874
SN - 1559-8608
VL - 17
JO - Journal of Statistical Theory and Practice
JF - Journal of Statistical Theory and Practice
IS - 3
M1 - 44
ER -