Volume 7, Issue 1

Molecular Generation and Optimization of Molecular Properties Using a Transformer Model

Zhongyin Xu (1), Xiujuan Lei (1), Mei Ma (1), Yi Pan (2)
(1) School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
(2) Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

Abstract

Generating novel molecules with specified properties is a challenging task in modern drug discovery: a target objective must be optimized while the generated structures still obey chemical rules. Herein, we aim to optimize a given source molecule so that the generated molecule satisfies specified property requirements. Matched Molecular Pairs (MMPs), each consisting of a source molecule and a target molecule, are used, and logD and solubility are selected as the properties to optimize. The main contribution is an analysis of the transformer's computations from the perspective of matrix dimensions. Threshold intervals and state changes are then used to encode logD and solubility for the subsequent experiments. During the experiments, we screen the data based on the proportion of heavy atoms to all atoms in the groups and select 12,365, 1,503, and 1,570 MMPs as the training, validation, and test sets, respectively. The transformer models are compared with baseline models in their ability to generate molecules with the specified properties. The results show that the transformer model can accurately optimize source molecules to satisfy the specified properties.
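As a rough illustration of the property encoding mentioned above, the following Python sketch turns a logD change into an interval token and a solubility change into a state-change token, then prepends both to the source molecule. The assignment of intervals to logD and state changes to solubility, the bin width, the clipping range, the solubility threshold, the token spellings, and all function names are assumptions made for this example; the abstract does not specify them.

```python
import math

# Hypothetical settings; the abstract does not state the actual values.
LOGD_BIN_WIDTH = 0.5        # assumed width of each logD-change interval
LOGD_MAX_ABS = 2.0          # assumed: larger changes fall into open-ended bins
SOLUBILITY_THRESHOLD = 1.7  # assumed cut-off separating "low" from "high"


def encode_logd_change(source_logd: float, target_logd: float) -> str:
    """Map the logD difference to a half-open interval token (lo, hi]."""
    delta = target_logd - source_logd
    if delta <= -LOGD_MAX_ABS:
        return f"logD_(-inf,{-LOGD_MAX_ABS}]"
    if delta > LOGD_MAX_ABS:
        return f"logD_({LOGD_MAX_ABS},inf)"
    # Rounding up selects the upper edge of the (lo, hi] bin holding delta.
    hi = math.ceil(delta / LOGD_BIN_WIDTH) * LOGD_BIN_WIDTH
    lo = hi - LOGD_BIN_WIDTH
    return f"logD_({lo:.1f},{hi:.1f}]"


def encode_solubility_change(source_sol: float, target_sol: float) -> str:
    """Map a pair of solubility values to a state-change token."""
    src = "high" if source_sol > SOLUBILITY_THRESHOLD else "low"
    tgt = "high" if target_sol > SOLUBILITY_THRESHOLD else "low"
    return "sol_no_change" if src == tgt else f"sol_{src}->{tgt}"


def build_source_sequence(source_smiles: str, logd_token: str, sol_token: str) -> list:
    """Prepend the property tokens to the source molecule.

    Splitting the SMILES string character by character is a simplification;
    a real tokenizer would keep multi-character atoms such as "Cl" together.
    """
    return [logd_token, sol_token] + list(source_smiles)


if __name__ == "__main__":
    logd_token = encode_logd_change(1.0, 1.75)
    sol_token = encode_solubility_change(1.0, 2.5)
    print(build_source_sequence("CCO", logd_token, sol_token))
```

In this sketch the desired property change is expressed as ordinary vocabulary tokens, so a sequence-to-sequence transformer can condition on it without any architectural change.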

Keywords: solubility, transformer, molecular optimization, Matched Molecular Pairs (MMPs), logD

Publication history

Received: 06 January 2023
Revised: 27 March 2023
Accepted: 04 May 2023
Published: 25 December 2023
Issue date: March 2024

Copyright

© The author(s) 2023.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 62272288, 61972451, and U22A2041) and the Shenzhen Key Laboratory of Intelligent Bioinformatics (No. ZDSYS20220422103800001).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
