TY - JOUR
T1 - ProPythia
T2 - A Python package for protein classification based on machine and deep learning
AU - Sequeira, Ana Marta
AU - Lousa, Diana
AU - Rocha, Miguel
N1 - Funding Information:
This study was supported by FCT through project PTDC/CCI-BIO/28200/2017 and the strategic funding of UID/BIO/04469/2020, and also by the European Regional Development Fund under the scope of Norte2020, through the project DeepBio (Ref. NORTE-01-0247-FEDER-039831). This work was also financially supported by Project LISBOA-01-0145-FEDER-007660 (Microbiologia Molecular, Estrutural e Celular) funded by FEDER funds through COMPETE2020 - Programa Operacional Competitividade e Internacionalização (POCI) and by national funds through FCT - Fundação para a Ciência e a Tecnologia.
Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2021
Y1 - 2021
N2 - The field of protein data mining has grown rapidly in recent years. Characterizing proteins and determining their function from their amino acid sequences are challenging and long-standing problems, where Bioinformatics and Machine Learning have an emerging role. A myriad of machine and deep learning algorithms have been applied to these tasks with exciting results. However, tools and platforms that calculate protein features and run both Machine Learning (ML) and Deep Learning (DL) pipelines, taking protein sequences as input, are still scarce, and those that exist are limited in performance, user-friendliness and domain of application. Here, to address these limitations, we propose ProPythia, a generic and modular Python package that makes it easy to deploy ML and DL approaches for a wide range of problems in protein sequence analysis and classification. It facilitates the implementation, comparison and validation of the major tasks in ML or DL pipelines, including modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature selection and dimensionality reduction, perform clustering and manifold analysis, as well as to train and optimize ML/DL models and use them to make predictions. ProPythia has an adaptable modular architecture, making it a versatile and easy-to-use tool for transforming protein data into valuable knowledge, even for people unfamiliar with ML code. The platform was tested in several applications and compared with results from the literature. Here, we illustrate its applicability in two case studies: the prediction of antimicrobial peptides and the prediction of Enzyme Commission (EC) numbers of enzymes. Furthermore, we assess the performance of the different descriptors on four different protein classification challenges. Its source code and documentation, including a user guide and case studies, are freely available at https://github.com/BioSystemsUM/propythia.
AB - The field of protein data mining has grown rapidly in recent years. Characterizing proteins and determining their function from their amino acid sequences are challenging and long-standing problems, where Bioinformatics and Machine Learning have an emerging role. A myriad of machine and deep learning algorithms have been applied to these tasks with exciting results. However, tools and platforms that calculate protein features and run both Machine Learning (ML) and Deep Learning (DL) pipelines, taking protein sequences as input, are still scarce, and those that exist are limited in performance, user-friendliness and domain of application. Here, to address these limitations, we propose ProPythia, a generic and modular Python package that makes it easy to deploy ML and DL approaches for a wide range of problems in protein sequence analysis and classification. It facilitates the implementation, comparison and validation of the major tasks in ML or DL pipelines, including modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature selection and dimensionality reduction, perform clustering and manifold analysis, as well as to train and optimize ML/DL models and use them to make predictions. ProPythia has an adaptable modular architecture, making it a versatile and easy-to-use tool for transforming protein data into valuable knowledge, even for people unfamiliar with ML code. The platform was tested in several applications and compared with results from the literature. Here, we illustrate its applicability in two case studies: the prediction of antimicrobial peptides and the prediction of Enzyme Commission (EC) numbers of enzymes. Furthermore, we assess the performance of the different descriptors on four different protein classification challenges. Its source code and documentation, including a user guide and case studies, are freely available at https://github.com/BioSystemsUM/propythia.
KW - Antimicrobial peptide
KW - Deep learning
KW - Enzyme
KW - Machine learning
KW - Protein/peptide classification
KW - Python Package
UR - http://www.scopus.com/inward/record.url?scp=85119582124&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2021.07.102
DO - 10.1016/j.neucom.2021.07.102
M3 - Article
AN - SCOPUS:85119582124
SN - 0925-2312
JO - Neurocomputing
JF - Neurocomputing
ER -