Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

DiffSeqMol: A Non-Autoregressive Diffusion-Based Approach for Molecular Sequence Generation and Optimization

Author(s): Zixu Wang*, Yangyang Chen, Xiulan Guo, Yayang Li, Pengyong Li*, Chunyan Li, Xiucai Ye* and Tetsuya Sakurai

Volume 20, Issue 1, 2025

Published on: 01 April, 2024

Page: [46 - 58] Pages: 13

DOI: 10.2174/0115748936285493240307071916

Abstract

Background: The application of deep generative models to molecular discovery has surged in recent years. Currently, molecular generation and molecular optimization are dominated by autoregressive models, regardless of how molecular data is represented. However, diffusion models, which treat data non-autoregressively, are an emerging generative paradigm and have achieved significant breakthroughs in areas such as image generation.

Methods: The potential of diffusion models in molecular generation and optimization tasks remains largely unexplored. To investigate their applicability to molecular exploration, we proposed DiffSeqMol, a molecular sequence generation model underpinned by a diffusion process.

Results & Discussion: DiffSeqMol distinguishes itself from traditional autoregressive methods by its capacity to draw samples from random noise and directly generate the entire molecule. Through experimental evaluations, we demonstrated that DiffSeqMol can match, and even surpass, the performance of established state-of-the-art models on unconditional generation tasks and molecular optimization tasks.

Conclusion: Taken together, our results show that DiffSeqMol can be considered a promising molecular generation method. It opens new pathways to traverse the expansive chemical space and to discover novel molecules.

Keywords: Diffusion model, molecule generation, molecule optimization, autoregressive approach, Gaussian noise, encode models.
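
As an illustration of the Gaussian-noise idea referred to in the abstract, the following is a minimal sketch of the forward (noising) half of a standard denoising diffusion process applied to a whole embedded sequence at once. It is not DiffSeqMol's implementation: the character-level vocabulary, the random embeddings, the linear noise schedule, and all function names are assumptions chosen for illustration, and the learned reverse (denoising) network that a real model needs is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary of SMILES characters (an assumption, not the paper's tokenizer).
VOCAB = ["C", "O", "N", "(", ")", "=", "1"]
EMBED_DIM = 16
# Fixed random token embeddings; a real model would learn these.
EMBEDDINGS = rng.normal(size=(len(VOCAB), EMBED_DIM))

NUM_STEPS = 1000
# Linear beta schedule (assumed); alpha_bar[t] is the fraction of signal variance kept at step t.
betas = np.linspace(1e-4, 0.02, NUM_STEPS)
alpha_bar = np.cumprod(1.0 - betas)


def embed(smiles):
    """Map each character of a SMILES string to its embedding vector."""
    ids = [VOCAB.index(ch) for ch in smiles]
    return EMBEDDINGS[ids]


def forward_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): every position is noised in parallel."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def round_to_tokens(x):
    """Decode each position to the token with the nearest embedding."""
    dist = ((x[:, None, :] - EMBEDDINGS[None, :, :]) ** 2).sum(axis=-1)
    return "".join(VOCAB[i] for i in dist.argmin(axis=1))


x0 = embed("CC(=O)O")                        # acetic acid, 7 characters
slightly_noisy = forward_noise(x0, 0, rng)   # almost all signal retained
fully_noisy = forward_noise(x0, NUM_STEPS - 1, rng)  # close to pure Gaussian noise
print(round_to_tokens(slightly_noisy))
print(round_to_tokens(fully_noisy))
```

A non-autoregressive sampler would start from a tensor like `fully_noisy` and repeatedly apply a trained denoiser to all positions simultaneously before rounding to tokens, in contrast to an autoregressive model that emits one token at a time.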


© 2025 Bentham Science Publishers