TY - JOUR
T1 - Exploring Molecular Heteroencoders with Latent Space Arithmetic
T2 - Atomic Descriptors and Molecular Operators
AU - Gao, Xinyue
AU - Baimacheva, Natalia
AU - Aires-de-Sousa, João
N1 - Funding Information:
This work was supported by the Associate Laboratory for Green Chemistry (LAQV), which is financed by national funds from the Fundação para a Ciência e Tecnologia (FCT/MECI), Portugal, under grants LA/P/0008/2020 DOI 10.54499/LA/P/0008/2020, UIDP/50006/2020 DOI 10.54499/UIDP/50006/2020, and UIDB/50006/2020 DOI 10.54499/UIDB/50006/2020. This work was co-funded by the European Union through scholarships awarded to N.B. and X.G. by the Erasmus Mundus Joint Masters ChEMoinformaticsplus project (program ERASMUS2027, ERASMUS-EDU-2021-PEX-EMJM-MOB; project number 101050809).
Publisher Copyright:
© 2024 by the authors.
PY - 2024/8/22
Y1 - 2024/8/22
AB - A variational heteroencoder based on recurrent neural networks, trained with SMILES linear notations of molecular structures, was used to derive atomic descriptors: delta latent space vectors (DLSVs), obtained as the difference between the latent space vectors (LSVs) of the original SMILES of the whole molecule and of the SMILES of the same molecule with the target atom replaced. Different replacements were explored, namely, changing the atomic element, replacement with a character of the model vocabulary not used in the training set, or the removal of the target atom from the SMILES. Unsupervised mapping of the DLSV descriptors with t-distributed stochastic neighbor embedding (t-SNE) revealed a remarkable clustering according to the atomic element, hybridization, atomic type, and aromaticity. Atomic DLSV descriptors were used to train machine learning (ML) models to predict ¹⁹F NMR chemical shifts. An R² of up to 0.89 and mean absolute errors of up to 5.5 ppm were obtained for an independent test set of 1046 molecules with random forests or a gradient-boosting regressor. Intermediate representations from a Transformer model yielded comparable results. Furthermore, DLSVs were applied as molecular operators in the latent space: the DLSV of a halogenation (H→F substitution) was added to the LSVs of 4135 new molecules with no fluorine atom and decoded into SMILES, yielding 99% valid SMILES, with 75% of the SMILES incorporating fluorine and 56% of the structures incorporating fluorine with no other structural change.
KW - atomic descriptors
KW - molecular operators
KW - natural language models
KW - QSPR
UR - http://www.scopus.com/inward/record.url?scp=85202686998&partnerID=8YFLogxK
U2 - 10.3390/molecules29163969
DO - 10.3390/molecules29163969
M3 - Article
C2 - 39203047
AN - SCOPUS:85202686998
SN - 1420-3049
VL - 29
JO - Molecules
JF - Molecules
IS - 16
M1 - 3969
ER -