TY - GEN
T1 - Identification of Common Gene Signatures in Microarray and RNA-Sequencing Data Using Network-Based Regularization
AU - Diegues, Inês
AU - Vinga, Susana
AU - Lopes, Marta B.
N1 - Partially funded by the Portuguese Foundation for Science and Technology (UIDB/50021/2020 (INESC-ID), UIDB/00297/2020 (CMA), UIDB/04516/2020 (NOVA LINCS), and PTDC/CCI-CIF/29877/2017).
PY - 2020
Y1 - 2020
N2 - Microarray and RNA-sequencing (RNA-seq) gene expression data, combined with machine learning algorithms, are promising for the discovery of new cancer biomarkers. However, although the two techniques serve a similar purpose, they differ in fundamental ways. We propose a methodology for cross-platform integration and biomarker discovery based on network-based regularization via the Twin Networks Recovery (twiner) penalty, as a strategy to enhance the selection of breast cancer gene signatures that have similar correlation patterns in both platforms. In a classification setting based on sparse logistic regression (LR), taking as classes tumor samples from both RNA-seq and microarray and normal tissue samples, twiner achieved precision-recall accuracies of 99.71% and 99.57% in the training and test sets, respectively. Moreover, the survival analysis results validated the biological relevance of the signatures identified by twiner. Therefore, by leveraging the existing volume of microarray and RNA-seq data, a single biological conclusion can be reached, independent of the technology.
AB - Microarray and RNA-sequencing (RNA-seq) gene expression data, combined with machine learning algorithms, are promising for the discovery of new cancer biomarkers. However, although the two techniques serve a similar purpose, they differ in fundamental ways. We propose a methodology for cross-platform integration and biomarker discovery based on network-based regularization via the Twin Networks Recovery (twiner) penalty, as a strategy to enhance the selection of breast cancer gene signatures that have similar correlation patterns in both platforms. In a classification setting based on sparse logistic regression (LR), taking as classes tumor samples from both RNA-seq and microarray and normal tissue samples, twiner achieved precision-recall accuracies of 99.71% and 99.57% in the training and test sets, respectively. Moreover, the survival analysis results validated the biological relevance of the signatures identified by twiner. Therefore, by leveraging the existing volume of microarray and RNA-seq data, a single biological conclusion can be reached, independent of the technology.
KW - Biomarkers
KW - Machine learning
KW - Microarray
KW - Network-based regularization
KW - RNA-sequencing
UR - http://www.scopus.com/inward/record.url?scp=85085178547&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-45385-5_2
DO - 10.1007/978-3-030-45385-5_2
M3 - Conference contribution
AN - SCOPUS:85085178547
SN - 978-3-030-45384-8
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 15
EP - 26
BT - Bioinformatics and Biomedical Engineering - 8th International Work-Conference, IWBBIO 2020, Proceedings
A2 - Rojas, Ignacio
A2 - Valenzuela, Olga
A2 - Rojas, Fernando
A2 - Herrera, Luis Javier
A2 - Ortuño, Francisco
PB - Springer
CY - Cham
T2 - 8th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2020
Y2 - 6 May 2020 through 8 May 2020
ER -