Multimodal Transformer for Property Prediction in Polymers.
In this work, we designed a multimodal transformer that combines both the Simplified Molecular Input Line Entry System (SMILES) and molecular graph representations to enhance the prediction of polymer properties. Three models with different embeddings (SMILES, SMILES + monomer, and SMILES + dimer) were employed to assess the performance of incorporating multimodal features into transformer architectures. Fine-tuning results across five properties (i.e., density, glass-transition temperature (Tg), melting temperature (Tm), volume resistivity, and conductivity) demonstrated that the multimodal transformer with both the SMILES and the dimer configuration as inputs outperformed the transformer using only SMILES across all five properties. Furthermore, our model facilitates in-depth analysis by examining attention scores, providing deeper insights into the relationship between the deep learning model and the polymer attributes. We believe that our work, shedding light on the potential of multimodal transformers in predicting polymer properties, paves a new direction for understanding and refining polymer properties.
- Research Article
3
- 10.1038/s41598-025-01890-7
- May 15, 2025
- Scientific Reports
The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity and a lack of chemical information within individual tokens. To address these limitations while maintaining its simplicity, we propose a molecular representation method through the hybridization of standard SMILES tokens with Atom-In-SMILES (AIS) tokens, which incorporate local chemical environment information into a single token. This hybrid representation, termed SMI + AIS, combines SMILES and AIS tokens, allowing AIS tokens to differentiate chemical elements based on their chemical context without introducing additional tokens for less frequent elements. Using the SMI + AIS representation, we evaluated its performance by comparing the predefined metric of generated structures in chemical structure generation based on latent space optimization. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility in the enhancement of machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation provides a more effective and informative approach to encapsulate chemical context and presents potential for performance enhancement in other machine learning tasks in chemistry.
- Preprint Article
- 10.26434/chemrxiv.14450313.v1
- Apr 20, 2021
- ChemRxiv
The hit-to-lead process makes the physicochemical properties of the hit compounds that show the desired type of activity obtained in the screening assay more drug-like. Deep learning-based molecular generative models are expected to contribute to the hit-to-lead process. The simplified molecular input line entry system (SMILES), which is a string of alphanumeric characters representing the chemical structure of a molecule, is one of the most commonly used representations of molecules, and molecular generative models based on SMILES have achieved significant success. However, in contrast to molecular graphs, a SMILES string that is still being generated is not itself a valid SMILES. Further, it is quite difficult to generate molecules starting from a certain molecule, thus making it difficult to apply SMILES to the hit-to-lead process. In this study, we have developed a SMILES-based generative model that can generate molecules starting from a certain compound. This method generates partial SMILES and inserts it into the original SMILES using Monte Carlo Tree Search and a Recurrent Neural Network. We validated our method using a molecule dataset obtained from the ZINC database and successfully generated molecules that were well optimized for both objectives: the quantitative estimate of drug-likeness (QED) and the penalized octanol-water partition coefficient (PLogP). The source code is available at https://github.com/sekijima-lab/mermaid.
- Research Article
26
- 10.1186/s13321-021-00572-6
- Nov 27, 2021
- Journal of Cheminformatics
The hit-to-lead process makes the physicochemical properties of the hit molecules that show the desired type of activity obtained in the screening assay more drug-like. Deep learning-based molecular generative models are expected to contribute to the hit-to-lead process. The simplified molecular input line entry system (SMILES), which is a string of alphanumeric characters representing the chemical structure of a molecule, is one of the most commonly used representations of molecules, and molecular generative models based on SMILES have achieved significant success. However, in contrast to molecular graphs, a SMILES string that is still being generated is not itself a valid SMILES. Further, it is quite difficult to generate molecules starting from a certain molecule, thus making it difficult to apply SMILES to the hit-to-lead process. In this study, we have developed a SMILES-based generative model that can generate molecules starting from a certain molecule. This method generates partial SMILES and inserts it into the original SMILES using Monte Carlo Tree Search and a Recurrent Neural Network. We validated our method using a molecule dataset obtained from the ZINC database and successfully generated molecules that were well optimized for both objectives: the quantitative estimate of drug-likeness (QED) and the penalized octanol-water partition coefficient (PLogP). The source code is available at https://github.com/sekijima-lab/mermaid.
- Research Article
8
- 10.1186/s13321-025-00959-9
- Feb 5, 2025
- Journal of Cheminformatics
Recently, advancements in cheminformatics, such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimization of chemical data handling, have led to increased demands for handling chemical simplified molecular input line entry system (SMILES) data, particularly in text analysis tasks. These advancements have driven the need to optimize components like positional encoding and positional embeddings (PEs) in transformer models to better capture the sequential and contextual information embedded in molecular representations. SMILES data represent complex relationships among atoms or elements, rendering them critical for various learning tasks within the field of cheminformatics. This study addresses the critical challenge of encoding complex relationships among atoms in SMILES strings by exploring various PEs within a transformer-based framework to increase the accuracy and generalization of molecular property predictions. The success of transformer-based models, such as the bidirectional encoder representations from transformers (BERT) models, in natural language processing tasks has sparked growing interest from the domain of cheminformatics. However, the performance of these models during pretraining and fine-tuning is significantly influenced by positional information such as PEs, which help in understanding the intricate relationships within sequences. Integrating position information within transformer architectures has emerged as a promising approach. This encoding mechanism provides essential supervision for modeling dependencies among elements situated at different positions within a given sequence. In this study, we first conduct pretraining experiments using various PEs to explore diverse methodologies for incorporating positional information into the BERT model for chemical text analysis using SMILES strings.
Next, for each PE, we fine-tune the best-performing BERT (masked language modeling) model on downstream tasks for molecular-property prediction. Here, we use two molecular representations, SMILES and DeepSMILES, to comprehensively assess the potential and limitations of the PEs in zero-shot learning analysis, demonstrating the model’s proficiency in predicting properties of unseen molecular representations in the context of newly proposed and existing datasets. Scientific contribution: This study explores the unexplored potential of PEs using the BERT model for molecular property prediction. The study involved pretraining and fine-tuning the BERT model on various datasets related to COVID-19, bioassay data, and other molecular and biological properties using SMILES and DeepSMILES representations. The study details the pretraining architecture, fine-tuning datasets, and the performance of the BERT model with different PEs. It also explores zero-shot learning analysis and the model’s performance on various classification and regression tasks. In this study, newly proposed datasets from different domains were introduced during fine-tuning in addition to the existing and commonly used datasets. The study highlights the robustness of the BERT model in predicting chemical properties and its potential applications in cheminformatics and bioinformatics.
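As a rough illustration of what a positional encoding adds to a tokenized SMILES string, the sketch below builds the fixed sinusoidal table commonly used with transformers. The abstract does not list which PEs were compared, so the sinusoidal variant, the function name, and the token list are assumptions chosen for illustration only.

```python
import math


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table of fixed sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        # Each (sin, cos) pair shares one frequency; i is the even dimension index.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


# Hypothetical atom-level tokens for acetic acid, CC(=O)O.
tokens = ["C", "C", "(", "=", "O", ")", "O"]
pe = sinusoidal_positional_encoding(len(tokens), d_model=8)
```

Each row of `pe` would be added to the embedding of the token at that position, so the two `C` tokens receive different inputs even though they share one token embedding.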
- Research Article
37
- 10.1186/s12859-024-05847-x
- Jun 26, 2024
- BMC Bioinformatics
Purpose: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations. Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT.
- Research Article
11
- 10.3390/polym16172464
- Aug 29, 2024
- Polymers
Polymer materials have garnered significant attention due to their exceptional mechanical properties and diverse industrial applications. Understanding the glass transition temperature (Tg) of polymers is critical to prevent operational failures at specific temperatures. Traditional methods for measuring Tg, such as differential scanning calorimetry (DSC) and dynamic mechanical analysis, while accurate, are often time-consuming, costly, and susceptible to inaccuracies due to random and uncertain factors. To address these limitations, the aim of the present study is to investigate the potential of Simplified Molecular Input Line Entry System (SMILES) strings as descriptors in simple machine learning models to predict Tg efficiently and reliably. Five models were utilized: k-nearest neighbors (KNN), support vector regression (SVR), extreme gradient boosting (XGBoost, XGB), artificial neural network (ANN), and recurrent neural network (RNN). SMILES descriptors were converted into numerical data using either One Hot Encoding (OHE) or Natural Language Processing (NLP). The study found that SMILES inputs with fewer than 200 characters were inadequate for accurately describing compound structures, while inputs exceeding 200 characters diminished model performance due to the curse of dimensionality. The ANN model achieved the highest R² value of 0.79; however, the XGB model, with an R² value of 0.774, exhibited the highest stability and shorter training times compared to other models, making it the preferred choice for Tg prediction. The efficiency of the OHE method over NLP was demonstrated by faster training times across the KNN, SVR, XGB, and ANN models. Validation of new polymer data showed the XGB model’s robustness, with an average prediction deviation of 9.76 from actual Tg values. These findings underscore the importance of optimizing SMILES conversion methods and model parameters to enhance prediction reliability.
Future research should focus on improving model accuracy and generalizability by incorporating additional features and advanced techniques. This study contributes to the development of efficient and reliable predictive models for polymer properties, facilitating the design and application of new polymer materials.
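The one-hot conversion mentioned above can be sketched in a few lines. This is a minimal character-level version under stated assumptions: the three repeat-unit strings, the helper names, and the choice to pad every input to 200 positions are illustrative and are not taken from the study's code.

```python
def build_vocab(smiles_list):
    """Map every character seen in the corpus to a column index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: idx for idx, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x len(vocab) 0/1 matrix; unused rows stay all-zero (padding)."""
    matrix = [[0] * len(vocab) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):  # longer inputs are truncated
        matrix[pos][vocab[ch]] = 1
    return matrix


# Toy repeat units written with '*' as the attachment point (an assumed convention).
polymers = ["*CC*", "*CC(C)*", "*CC(c1ccccc1)*"]
vocab = build_vocab(polymers)
encoded = [one_hot_encode(s, vocab, max_len=200) for s in polymers]

# Flatten each matrix into one feature vector for models such as KNN, SVR, or XGBoost.
flat = [[bit for row in m for bit in row] for m in encoded]
```

The flattened vector has `max_len * len(vocab)` entries, which shows why a longer maximum length inflates the feature dimension so quickly.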
- Research Article
3
- 10.1093/bioinformatics/btaf275
- May 6, 2025
- Bioinformatics (Oxford, England)
Molecular property prediction with deep learning has accelerated drug discovery and retrosynthesis. However, the shortage of labeled molecular data and the challenge of generalizing across the vast chemical spaces pose significant hurdles for leveraging deep learning in molecular property prediction. This study proposes a self-supervised framework designed to acquire a Simplified Molecular Input Line Entry System (SMILES) representation, which we have dubbed Simple SMILES contrastive learning (SimSon). SimSon was pre-trained using unlabeled SMILES data through contrastive learning to grasp the SMILES representations. Our findings demonstrate that contrastive learning with randomized SMILES enriches the ability of the model to generalize and its robustness as it captures the global semantic context at the molecular level. In downstream tasks, SimSon performs competitively when compared to graph-based methods and even outperforms them on certain benchmark datasets. These results indicate that SimSon effectively captures structural information from SMILES, exhibiting remarkable generalization and robustness. The potential applications of SimSon extend to bioinformatics and cheminformatics, encompassing areas such as drug discovery and drug-drug interaction prediction. The source code is available at https://github.com/lee00206/SimSon.
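To make the contrastive objective concrete, the sketch below computes an NT-Xent style loss over two "views" of each molecule, where a view stands for the embedding of one randomized SMILES string. The abstract does not state which loss SimSon uses, so NT-Xent, the temperature value, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """view_a[i] and view_b[i] embed two randomized SMILES of the same molecule i."""
    n = len(view_a)
    embeddings = view_a + view_b  # 2n embeddings in one batch
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same molecule
        sims = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(2 * n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        total += log_denominator - sims[positive]
    return total / (2 * n)


# Toy embeddings: matching views point in nearly the same direction.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = nt_xent_loss(view_a, view_b)
mismatched_loss = nt_xent_loss(view_a, list(reversed(view_b)))
```

The loss is small when the two views of each molecule sit close together and far from the other molecules in the batch, which is the behavior the pre-training stage is meant to encourage.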
- Research Article
114
- 10.1021/acs.jcim.0c01127
- Mar 15, 2021
- Journal of Chemical Information and Modeling
Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance on both molecular generation and quantitative structure-activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in terms of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models was evaluated using 24 benchmark datasets, where SPE consistently either matched or outperformed atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
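The two stages described above, learning frequent substrings and then tokenizing with them, can be approximated with a byte-pair-encoding style loop over atom-level tokens. This is a sketch of the general idea, not the SmilesPE implementation: the simplified regular expression, the function names, and the four-molecule corpus are assumptions.

```python
import re
from collections import Counter

# Simplified atom-level pattern (assumed): bracket atoms, two-letter halogens,
# two-digit ring closures, single letters, digits, then bond/branch symbols.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-/\\.@:~*$]")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Repeatedly merge the most frequent adjacent token pair in the corpus."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        best_pair = counts.most_common(1)[0][0]
        merges.append(best_pair)
        sequences = [merge_pair(seq, best_pair) for seq in sequences]
    return merges


def spe_like_tokenize(smiles, merges):
    """Tokenize at atom level, then apply the learned merges in order."""
    tokens = atom_tokenize(smiles)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


corpus = ["CCO", "CCN", "CCC", "CC(=O)O"]
merges = learn_merges(corpus, num_merges=1)
tokens = spe_like_tokenize("CCO", merges)
```

Because merges only concatenate neighbouring tokens, joining the output tokens always reproduces the input string, and the learned substrings stay readable as SMILES fragments.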
- Research Article
35
- 10.1007/s11224-020-01568-y
- Jul 9, 2020
- Structural Chemistry
Ionic liquids (ILs) are widely used in industrial and chemical processes, for example as antimicrobial agents, as solvents, and in the synthesis of new compounds with antioxidant activity. Because of the significance of these applications, the minimal inhibitory concentration (MIC) of 204 ILs and the minimal bactericidal concentration (MBC) of 114 of them against Staphylococcus aureus (S. aureus) were predicted using a quantitative structure-activity relationship (QSAR) approach based on the Monte Carlo method. The molecular structures of all ILs were represented using the simplified molecular input line entry system (SMILES) notation. A hybrid optimal descriptor, obtained by combining the molecular graph and SMILES, was employed in developing the models for pMIC and pMBC. For pMIC, hybrid optimal descriptors were calculated via SMILES and the hydrogen-suppressed molecular graph (HSG), whereas for pMBC they were calculated via SMILES and the hydrogen-filled graph (HFG). The total dataset was randomly split three times into training, invisible training, calibration, and validation sets. QSAR models for the pMIC and pMBC of ILs were developed from the calculated descriptors, and the index of ideality of correlation (IIC) was examined as a benchmark for the predictive potential of these models. The correlation coefficient (R²) values of the training, invisible training, calibration, and validation sets for the three splits were 0.8585–0.8853, 0.8523–0.8898, 0.8809–0.9240, and 0.8036–0.8903 for pMIC and 0.8357–0.8991, 0.8223–0.9306, 0.8372–0.9170, and 0.8171–0.8901 for pMBC, respectively. The results show that the predictive ability of the QSAR models is high for all splits.
- Book Chapter
2
- 10.1007/978-1-4020-6845-4_14
- Jan 1, 2008
Optimal descriptors, calculated with simplified molecular input line entry system (SMILES), have been used for modeling solubility of fullerene C60 in organic solvents. Local and global attributes of the SMILES have been involved in the modeling algorithm. Local attributes represent symbols, which are images of chemical elements (“O”, “N”, “Cl”, “Br”, etc.) or chemical environment (double bonds, i.e., “=”; triple bonds, i.e., “#”, etc.). Global SMILES attributes are expressed as number of a given chemical element in given SMILES as well as superposition of chemical elements (for instance, SMILES contains both “Cl” and “Br”). Statistical characteristics of the derived model are given by n = 92, r² = 0.8865, q² = 0.8807, s = 0.363, F = 703 (training set); and n = 30, r² = 0.9069, q² = 0.8932, s = 0.399, F = 273 (test set). Keywords: Optimal descriptor; QSPR; SMILES; Solubility fullerene C60
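The distinction between local and global SMILES attributes can be shown with a short extraction routine. This is only a sketch of the attribute definitions given above: the regular expression, the function names, the default element list, and the example solvent are assumptions, and the step that turns attributes into a fitted descriptor is not reproduced.

```python
import re
from collections import Counter

# Two-letter halogens are matched before single letters so "Cl" is not read as "C" + "l".
SYMBOL_PATTERN = re.compile(r"Cl|Br|[A-Za-z]|[=#()]|\d")


def local_attributes(smiles):
    """Local attributes: the individual symbols (elements, bonds, branches) in order."""
    return SYMBOL_PATTERN.findall(smiles)


def global_attributes(smiles, elements=("Cl", "Br", "O", "N")):
    """Global attributes: per-element counts plus a Cl/Br co-occurrence flag."""
    counts = Counter(local_attributes(smiles))
    attrs = {f"count_{el}": counts[el] for el in elements}
    attrs["has_Cl_and_Br"] = counts["Cl"] > 0 and counts["Br"] > 0
    return attrs


solvent = "ClCCBr"  # 1-bromo-2-chloroethane, used here only as an example string
local = local_attributes(solvent)
global_ = global_attributes(solvent)
```

Here `local` lists the symbols one by one, while `global_` summarizes the whole string, including the "contains both Cl and Br" case mentioned in the abstract.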
- Research Article
- 10.64336/001c.120720
- Jul 8, 2024
- Journal of High School Science
Parkinson’s Disease (PD) is a neurodegenerative disease that causes the gradual impairment of movement primarily by mutations in the SNCA (synuclein alpha) gene, excessively producing the alpha-synuclein protein. This protein aggregates in formations known as Lewy Bodies which correlate with neurotransmitter dysfunction, hence leading to the common side-effects of PD, such as loss of balance and memory. Currently, there exist no widely available drugs for PD, partly due to the long process for drugs to be identified, FDA certified, and then marketed. This process, however, can be expedited through the use of drug repurposing by Machine Learning models. The model constructed in this paper utilized the Simplified Molecular Input Line Entry System (SMILES) encoding and vector representations of specific, FDA-approved drugs in order to extract important features of a drug’s molecular structure. Then, disease representation vectors for PD were combined with a drug’s vector representations and were used as inputs to both Deep Learning and Regression models to predict the efficacy of various drug-disease pairs. The use of these models was then applied specifically to PD, identifying the drugs that held promise for therapeutic efficacy in PD. Valproic Acid emerged as a potential drug for further research in its applications to PD, with the best predictive models utilizing Deep Learning. Drugs that are folate or pyridoxine modulators, carbonic anhydrase inhibitors, PPAR-γ modulators, anti-inflammatory agents (prostaglandin modulators, antioxidants, SIRT1 modulators), antifibrinolytics (plasminogen modulators), or those used to treat hyperthyroidism (thyroid hormone inhibitors), were scored high by the model with respect to their ability to modulate brain GABA levels, as were antimetabolites or alkylating agents that modulate nucleotides like adenosine and purine. 
Deep Learning models with low validation loss can predict drugs that are structurally adequate to provide therapeutic effects for PD and other medicinally under-served diseases.
- Research Article
3
- 10.1007/s10822-025-00614-3
- Jul 18, 2025
- Journal of computer-aided molecular design
In recent years, the emergence of large language models (LLMs), particularly the advent of ChatGPT, has positioned natural language sequence-based representation learning and generative models as the dominant research paradigm in AI for science. Within the domains of drug discovery and computational chemistry, compound representation learning and molecular generation stand out as two of the most significant tasks. Currently, the predominant molecular representation sequences used for molecular characterization and generation include SMILES (Simplified Molecular-Input Line-Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In the context of AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of information encoded by different molecular representation forms varies significantly. The selection of an appropriate molecular representation as the input format for model training is crucial, yet this issue has not been thoroughly explored. Furthermore, the state-of-the-art models currently employed for molecular generation and optimization are diffusion models. Therefore, this study investigates the characteristics of the four mainstream molecular representation languages within the same diffusion model for training generative molecular sets. First, a single molecule is represented in four different ways through varying methodologies, followed by training a denoising diffusion model using identical parameters. Subsequently, thirty thousand molecules are generated for evaluation and analysis.
The results indicate that the four molecular representation languages exhibit both similarities and differences in attribute distribution and spatial distribution; notably, SELFIES and SMARTS demonstrate a high degree of similarity, while IUPAC and SMILES show substantial differences. Additionally, IUPAC's primary advantage lies in the novelty and diversity of generated molecules, whereas SMILES excels in QEPPI and SAscore metrics, with SELFIES and SMARTS performing best on the QED metric. The findings of this research will provide crucial insights into the selection of molecular representations in AI drug design tasks, thereby contributing to enhanced efficiency in drug development.
- Research Article
69
- 10.1093/bib/bbab327
- Aug 24, 2021
- Briefings in Bioinformatics
Computational methods have become indispensable tools to accelerate the drug discovery process and alleviate the excessive dependence on time-consuming and labor-intensive experiments. Traditional feature-engineering approaches heavily rely on expert knowledge to devise useful features, which could be costly and sometimes biased. The emerging deep learning (DL) methods deliver a data-driven method to automatically learn expressive representations from complex raw data. Inspired by this, researchers have attempted to apply various deep neural network models to simplified molecular input line entry system (SMILES) strings, which contain all the composition and structure information of molecules. However, current models usually suffer from the scarcity of labeled data. This results in a low generalization ability of SMILES-based DL models, which prevents them from competing with the state-of-the-art computational methods. In this study, we utilized the BiLSTM (bidirectional long short-term memory) attention network (BAN), in which we employed a novel multi-step attention mechanism to facilitate the extraction of key features from the SMILES strings. Meanwhile, SMILES enumeration was utilized as a data augmentation method in the training phase to substantially increase the amount of labeled data and raise the probability of mining more patterns from complex SMILES. We again took advantage of SMILES enumeration in the prediction phase to rectify model prediction bias and provide a more accurate prediction. Combined with the BAN model, our strategies can greatly improve the performance of latent features learned from SMILES strings. In 11 canonical absorption, distribution, metabolism, excretion and toxicity-related tasks, our method outperformed the state-of-the-art approaches.
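The prediction-phase use of SMILES enumeration can be illustrated by averaging a model's outputs over several equivalent strings of one molecule. The abstract does not say how the per-string predictions are combined, so the plain mean, the hand-written toluene variants, and the deliberately order-sensitive stand-in predictor are all assumptions; a real pipeline would generate the variants with a cheminformatics toolkit and call a trained model.

```python
from statistics import mean


def predict_with_enumeration(variants, predict):
    """Average the predictions made on each enumerated SMILES of one molecule."""
    return mean(predict(s) for s in variants)


def toy_predict(smiles):
    """Stand-in for a trained model: its output depends on atom order in the string."""
    return float(smiles.index("C"))


# Three hand-written SMILES strings that all describe toluene.
toluene_variants = ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"]
single_predictions = [toy_predict(s) for s in toluene_variants]
averaged = predict_with_enumeration(toluene_variants, toy_predict)
```

The individual outputs differ only because the same molecule is written in a different order, and the average removes that dependence on one arbitrary string.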
- Research Article
2
- 10.1021/acs.jcim.5c00436
- May 9, 2025
- Journal of chemical information and modeling
In molecular property prediction tasks, most methods rely on single-view representations, such as simplified molecular input line entry system (SMILES) strings. Some scholars have attempted to combine two graphical views for joint representation purposes, such as SMILES and molecular graphs, but few have utilized three or more graphical views for molecular representation. Additionally, these methods typically extract features through pretraining models and then fine-tune them for specific tasks. This type of approach is not suitable for tasks with limited data and fails to fully leverage the correlations between tasks. To improve molecular representations, we propose a method that integrates traditional molecular representation learning by combining molecular sequences, molecular graphs, and molecular images. We design three different encoders to extract three graphical views of the same features from a molecule and use contrastive learning to align these views. Moreover, we adopt a multitask optimization strategy that effectively utilizes the shared information and correlations between tasks, thereby improving the generalizability and predictive performance of the model. Finally, we use low-rank adaptation (LoRA) fine-tuning for specific tasks to further improve the output prediction results. The experimental results show that this method enhances the accuracy and robustness of molecular property prediction across multiple benchmark data sets.
- Research Article
12
- 10.1038/s42004-025-01423-3
- Jan 29, 2025
- Communications Chemistry
Generative models have revolutionized de novo drug design, making it possible to produce molecules on demand with desired physicochemical and pharmacological properties. String-based molecular representations, such as SMILES (Simplified Molecular Input Line Entry System) and SELFIES (Self-Referencing Embedded Strings), have played a pivotal role in the success of generative approaches, thanks to their capacity to encode atom and bond information and their ease of generation. However, such ‘atom-level’ string representations could have certain limitations in terms of capturing information on chirality and the synthetic accessibility of the corresponding designs. In this paper, we present fragSMILES, a novel fragment-based molecular representation in the form of a string. fragSMILES encodes fragments in a ‘chemically-meaningful’ way via a novel graph-reduction approach, yielding an efficient, interpretable, and expressive molecular representation that also avoids fragment redundancy. fragSMILES contributes to the field of fragment-based representation by reporting fragments and their ‘breaking’ bonds independently. Moreover, fragSMILES also embeds information on molecular chirality, thereby overcoming known limitations of existing string notations. When compared with SMILES, SELFIES and t-SMILES for de novo design, the fragSMILES notation showed promise in generating molecules with desirable biochemical and scaffold properties.