gSelformer-MV: Multiview, Subgraph-Augmented Group SELFIES Transformer for Molecular Property Prediction.
Data-driven approaches are essential for relating properties to the chemical structure. Atom-focused views of individual compounds are common in molecular representation learning: graph neural networks and chemical language models, the two main algorithm classes, take atomic-level graphs and atom-wise token sequences as input, respectively. However, directly integrating information about functional groups into advanced architectures remains nearly unexplored. To fill this gap, we introduce gSelformer-MV, a transformer that operates on multiple views of Group SELFIES (a SELFIES variant augmented with tokens for functional groups) that enables representation at both the atomic and substructure levels. Unlike prior Group SELFIES approaches that produce a single string per molecule, gSelformer-MV constructs multiple subgraph-partitioned Group SELFIES views and uses them jointly during training and inference. We show that gSelformer-MV is superior in terms of accuracy and explainability to the models trained exclusively on SELFIES strings. Moreover, gSelformer-MV achieves state-of-the-art performance on several regression benchmarks; further gains are obtained when restricting to high-confidence predictions. These results indicate that subgraph augmentation is a simple and effective route for advancing string-based molecular property prediction.
- Research Article
88
- 10.1021/acs.jcim.0c01489
- May 19, 2021
- Journal of Chemical Information and Modeling
Determining the properties of chemical molecules is essential for screening candidates similar to a specific drug. These candidate molecules are further evaluated for their target binding affinities, side effects, target missing probabilities, etc. Conventional machine learning algorithms have demonstrated satisfactory prediction accuracies for molecular properties. However, a molecule cannot be loaded directly into a machine learning model; a set of engineered features must first be designed and calculated from it. Such hand-crafted features rely heavily on the experience of the investigating researchers. The concept of graph neural networks (GNNs) was recently introduced to describe chemical molecules. Features may be extracted automatically and objectively from the molecules through various types of GNNs, e.g., GCN (graph convolution network), GGNN (gated graph neural network), DMPNN (directed message passing neural network), etc. However, training a stable GNN model requires a huge number of training samples and a large amount of computing power compared with conventional machine learning strategies. This study proposes the integrated framework XGraphBoost, which extracts features using a GNN and builds an accurate prediction model of molecular properties using the classifier XGBoost. The framework fully inherits the merits of GNN-based automatic molecular feature extraction and XGBoost-based accurate prediction performance. Both classification and regression problems were evaluated using XGraphBoost. The experimental results strongly suggest that XGraphBoost may facilitate efficient and accurate predictions of various molecular properties. The source code is freely available to academic users at https://github.com/chenxiaowei-vincent/XGraphBoost.git.
- Research Article
- 10.34133/csbj.0036
- Mar 15, 2026
- Computational and structural biotechnology journal
HYG-mol: An Interpretable Multimodal Hypergraph Framework for Molecular Property Prediction.
- Preprint Article
- 10.21203/rs.3.rs-6756851/v1
- May 30, 2025
- Research Square
Discovering molecules with desirable molecular properties, including ADMET profiles, is of great importance in drug discovery. Existing approaches typically employ deep learning models, such as Graph Neural Networks and Transformers, to predict these molecular properties by learning from diverse chemical information. However, these models often lack mechanisms for effective interaction among multi-level features. To address these limitations, we propose a Hierarchical Interaction Message Passing Mechanism, which serves as the foundation of our novel model, the Hierarchical Interaction Message Net (HimNet). Our method enables interaction-aware representation learning across atomic, motif, and molecular levels via hierarchical attention-guided message passing. This design allows HimNet to effectively balance global and local information, ensuring rich and task-relevant feature extraction for downstream property prediction tasks. We systematically evaluate HimNet on eleven datasets, including eight widely-used MoleculeNet benchmarks and three challenging, high-value datasets for metabolic stability, malaria activity, and liver microsomal clearance, covering a broad range of pharmacologically relevant properties. Extensive experiments demonstrate that HimNet achieves the best or near-best performance in most molecular property prediction tasks. We believe that HimNet offers an accurate and efficient solution for molecular activity and ADMET property prediction, contributing significantly to advanced decision-making in the early stages of drug discovery.
- Research Article
- 10.1038/s42004-026-01922-x
- Feb 14, 2026
- Communications chemistry
Discovering molecules with desirable molecular properties, including ADMET profiles, is of great importance in drug discovery. Existing approaches typically employ deep learning models, such as Graph Neural Networks and Transformers, to predict these molecular properties by learning from diverse chemical information. However, these models often lack mechanisms for effective interaction among multi-level features. To address these limitations, we propose a Hierarchical Interaction Message Passing Mechanism, which serves as the foundation of our model, the Hierarchical Interaction Message Net (HimNet). Our method enables interaction-aware representation learning across atomic, motif, and molecular levels via hierarchical attention-guided message passing. This design allows HimNet to effectively balance global and local information, ensuring rich and task-relevant feature extraction for downstream property prediction tasks. We systematically evaluate HimNet on eleven datasets, including eight widely-used MoleculeNet benchmarks and three challenging, high-value datasets for metabolic stability, malaria activity, and liver microsomal clearance, covering a broad range of pharmacologically relevant properties. Extensive experiments demonstrate that HimNet achieves the best or near-best performance in most molecular property prediction tasks. We believe that HimNet offers an accurate and efficient solution for molecular activity and ADMET property prediction, contributing significantly to advanced decision-making in the early stages of drug discovery.
- Research Article
1
- 10.1109/tpami.2026.3664098
- Jan 1, 2026
- IEEE transactions on pattern analysis and machine intelligence
Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and material science. Though transformer-based language models have shown great potential in accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal the molecular structure-property relationships. In this work, we develop a framework for explainable molecular property prediction based on language models, dubbed Lamole, which can provide chemical concepts-aligned explanations. We take a string-based molecular representation - Group SELFIES - as input tokens to pre-train and fine-tune Lamole, as it provides chemically meaningful semantics. By disentangling the information flows of Lamole, we propose considering both self-attention weights and gradients for better quantification of each chemically meaningful substructure's impact on the model's output. To make the explanations more faithful to the structure-property relationship, we then carefully craft a marginal loss to explicitly optimize the explanations to align with the chemists' annotations. We bridge the manifold hypothesis with the elaborated marginal loss to prove that the loss can align the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results over eight datasets demonstrate that Lamole can achieve comparable prediction accuracy and boost the explanation accuracy by up to 14.3%, making it the state of the art in explainable molecular property prediction. To further illustrate the actionable utility of the explanations derived from Lamole, we integrated the framework with an evolutionary algorithm. This integration established an interpretable optimization pipeline for molecular editing, demonstrating that Lamole goes beyond simple post-hoc analysis and serves as a practical guide for molecule discovery.
- Research Article
5
- 10.1089/cmb.2023.0452
- Jul 31, 2024
- Journal of computational biology : a journal of computational molecular cell biology
The development of new drugs is a vital effort that has the potential to improve human health, well-being and life expectancy. Molecular property prediction is a crucial step in drug discovery, as it helps to identify potential therapeutic compounds. However, experimental methods for drug development can often be time-consuming and resource-intensive, with a low probability of success. To address such limitations, deep learning (DL) methods have emerged as a viable alternative due to their ability to identify highly discriminative patterns in molecular data. In particular, graph neural networks (GNNs) operate on graph-structured data to identify promising drug candidates with desirable molecular properties. These methods represent molecules as a set of node (atom) and edge (chemical bond) features to aggregate local information for molecular graph representation learning. Despite the availability of several GNN frameworks, each approach has its own shortcomings: although some GNNs may excel in certain tasks, they may not perform as well in others. In this work, we propose a hybrid approach that incorporates different graph-based methods to combine their strengths and mitigate their limitations to accurately predict molecular properties. The proposed approach consists of a multi-layered hybrid GNN architecture that integrates multiple GNN frameworks to compute graph embeddings for molecular property prediction. Furthermore, we conduct extensive experiments on multiple benchmark datasets to demonstrate that our hybrid approach significantly outperforms the state-of-the-art graph-based models. The data and code scripts to reproduce the results are available in the repository, https://github.com/pedro-quesado/HybridGNN.
- Research Article
64
- 10.1021/acs.jcim.2c00495
- May 31, 2022
- Journal of Chemical Information and Modeling
Deep learning has become prevalent in computational chemistry and is widely applied to molecular property prediction. Recently, self-supervised learning (SSL), especially contrastive learning (CL), has gathered growing attention for its potential to learn molecular representations that generalize to the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled data, which greatly reduces the effort to acquire molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow insights from the machine learning community but neglect the unique cheminformatics (e.g., molecular fingerprints) and multilevel graphical structures (e.g., functional groups) of molecules. In this work, we propose iMolCLR, an improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs) in two aspects: (1) mitigating faulty negative contrastive instances by considering cheminformatics similarities between molecule pairs and (2) fragment-level contrasting between intramolecule and intermolecule substructures decomposed from molecules. Experiments have shown that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. In comparison to the previous CL framework, iMolCLR demonstrates an averaged 1.2% improvement of ROC-AUC on eight classification benchmarks and an averaged 10.1% decrease of the error on six regression benchmarks. On most benchmarks, the generic GNN pretrained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architectures and engineered features. Further investigations demonstrate that representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can be used to reason about molecule similarities.
- Research Article
- 10.1109/tcbbio.2025.3577899
- Sep 1, 2025
- IEEE transactions on computational biology and bioinformatics
Molecular property prediction is crucial for advancing medical research in areas like retrosynthesis analysis and drug discovery. The challenge of obtaining accurate molecular property labels has led to the use of multi-level pretrained Graph Neural Networks (GNNs) with self-supervised learning methods. However, these multi-level approaches do not adequately address relationships across molecular graph levels, particularly at the motif and atom levels, and neglect how representations of different grains should be fused. To overcome these limitations, we introduce the Motif-centric Multi-grain Graph Pretraining and Finetuning Strategy Framework (MMGSF). This framework consists of two components: the Motif-centric Molecular Graph Pretraining Strategy (MMGS), which focuses on motif-centric contrastive learning on a multi-level graph without disturbing molecular structure, and Multi-grain Finetuning (MGF), which refines node representations across grains using a novel mol-adapter module with cross-attention for adaptive feature fusion. Our MGF captures complex feature interactions, ensuring structural and semantic information from different grains contributes effectively to molecular property predictions. Superior results in molecular property classification tasks demonstrate the effectiveness of MMGSF, and its visualization performance shows that the learned representations capture molecular multi-grain information and properties successfully. This study offers fresh insights into the design of more effective self-supervised learning frameworks for molecular property prediction.
- Research Article
2
- 10.1088/2632-2153/ad979b
- Nov 26, 2024
- Machine Learning: Science and Technology
Recently, graph neural networks (GNNs) have been widely used in various domains, including social networks, recommender systems, protein classification, molecular property prediction, and genetic networks. In bioinformatics and chemical engineering, considerable research is being actively conducted to represent molecules or proteins as graphs by conceptualizing atoms or amino acids as nodes and the relationships between nodes as edges. The overall structures of proteins and their interconnections are crucial for predicting and classifying their properties. However, as GNNs stack more layers to create deeper networks, the node embeddings may become excessively similar, causing an oversmoothing problem that reduces performance on downstream tasks. To avoid this, GNNs typically use a limited number of layers, which leads to the problem of reflecting only the local structure and neighborhood information rather than the global structure of the graph. Therefore, we propose a structurally informed convolutional GNN (SICGNN) that utilizes information that can express the overall topological structure of a protein graph during GNN training and prediction. By explicitly including information on the entire graph topology, the proposed model can utilize both local neighborhood and global structural information. We applied the SICGNN to representative GNNs such as GraphSAGE, graph isomorphism network, and graph attention network, and confirmed performance improvements across various datasets. We also demonstrate the robustness of SICGNN using multiple stratified 10-fold cross-validations and various hyperparameter settings, and demonstrate that its accuracy is comparable to or better than that of existing GNN models.
- Research Article
1
- 10.3390/sym17060873
- Jun 4, 2025
- Symmetry
Molecular property prediction, one of the important tasks in cheminformatics, is attracting more and more attention. The structure of a molecule is closely related to its properties, and a symmetrical molecular structure may differ significantly from an asymmetrical structure in terms of properties such as the melting point, boiling point, and water solubility. However, a single molecular representation does not provide a comprehensive representation of the molecule. It is also a challenge to make better use of graph neural networks to aggregate the information of neighboring nodes in the molecular graph. In this paper, we therefore constructed a novel graph neural network with additive attention (termed Add-GNN) for molecular property prediction, which fuses the molecular graph and molecular descriptors to jointly represent molecular features in order to make the molecular representations more comprehensive. Then, in the message-passing stage, we designed an additive attention mechanism that can effectively fuse the features of neighboring nodes and the features of edges to better capture the intrinsic information of molecules. In addition, we applied the L2-norm to calculate the importance of each atom to the predicted results and visualized it, providing interpretability to the model. We validated the proposed model on public datasets and showed that the model outperforms graph-based baseline methods and some graph neural network variants, indicating that our proposed method is feasible and competitive.
- Research Article
9
- 10.1002/jcc.70011
- Jan 22, 2025
- Journal of computational chemistry
In the realm of artificial intelligence-driven drug discovery (AIDD), accurately predicting the influence of molecular structures on their properties is a critical research focus. While deep learning models based on graph neural networks (GNNs) have made significant advancements in this area, prior studies have primarily concentrated on molecule-level representations, often neglecting the impact of functional group structures and the potential relationships between fragments on molecular property predictions. To address this gap, we introduce the multi-scale feature attention graph neural network (MfGNN), which enhances traditional atom-based molecular graph representations by incorporating fragment-level representations derived from chemically synthesizable BRICS fragments. MfGNN not only effectively captures both the structural information of molecules and the features of functional groups but also pays special attention to the potential relationships between fragments, exploring how they collectively influence molecular properties. This model integrates two core mechanisms: a graph attention mechanism that captures embeddings of molecules and functional groups, and a feature extraction module that systematically processes BRICS fragment-level features to uncover relationships among the fragments. Our comprehensive experiments demonstrate that MfGNN outperforms leading machine learning and deep learning models, achieving state-of-the-art performance in 8 out of 11 learning tasks across various domains, including physical chemistry, biophysics, physiology, and toxicology. Furthermore, ablation studies reveal that the integration of multi-scale feature information and the feature extraction module enhances the richness of molecular features, thereby improving the model's predictive capabilities.
- Conference Article
1
- 10.1145/3638529.3654055
- Jul 14, 2024
Graph representation of molecular data enables extracting stereoscopic features, with graph neural networks (GNNs) excelling in molecular property prediction. However, selecting optimal hyper-parameters for GNN construction is challenging due to the vast search space and high computational costs. To tackle this, we introduce a hierarchical evaluation strategy integrated with a genetic algorithm (HESGA). HESGA combines full and fast evaluations of GNNs. Full evaluation involves training a GNN with preset epochs, using root mean square error (RMSE) to measure hyperparameter quality. Fast evaluation interrupts training early, using the difference in RMSE values as a score for GNN potential. HESGA integrates these evaluations, with fast evaluation guiding candidate selection for full evaluation, maintaining elite individuals. Applying HESGA to optimise deep GNNs for molecular property prediction, experimental results on three datasets demonstrate its superiority over traditional Bayesian optimisation, Tree-structured Parzen Estimator, and CMA-ES. HESGA efficiently navigates the complex GNN hyperparameter space, offering a promising approach for molecular property prediction.
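The two-level evaluation described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `rmse_curve` is a toy stand-in for actual GNN training, the candidate "hyperparameter" is a single learning-rate-like number, and the genetic operators (crossover, mutation, elitism) are left out. It is not the authors' implementation; it only shows fast evaluation shortlisting candidates for full evaluation.

```python
import math

def rmse_curve(lr, epochs):
    """Toy stand-in for GNN training (hypothetical): validation RMSE per epoch."""
    # Assumed shape: error decays exponentially at a rate set by lr, with a floor.
    return [math.exp(-lr * e) + 0.1 for e in range(1, epochs + 1)]

def fast_score(lr, early_epochs=3):
    """Fast evaluation: stop training early and score the candidate by how
    much the RMSE dropped over the few epochs that were run."""
    curve = rmse_curve(lr, early_epochs)
    return curve[0] - curve[-1]

def full_score(lr, epochs=30):
    """Full evaluation: train for the preset number of epochs, return final RMSE."""
    return rmse_curve(lr, epochs)[-1]

def hierarchical_select(candidates, top_k=2):
    """Rank all candidates by the cheap score, then spend full evaluations
    only on the top_k shortlist and keep the one with the lowest RMSE."""
    shortlist = sorted(candidates, key=fast_score, reverse=True)[:top_k]
    return min(shortlist, key=full_score)

candidates = [0.01, 0.05, 0.2, 0.5]
best = hierarchical_select(candidates)
```

In this sketch only two of the four candidates receive a full evaluation, which is the source of the cost saving; how the shortlist size is chosen is a design decision the abstract does not specify.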
- Conference Article
7
- 10.1109/ism52913.2021.00049
- Nov 1, 2021
Graph Neural Networks (GNNs) are deep learning models that take graph data as inputs, and they are applied to various tasks such as traffic prediction and molecular property prediction. However, owing to the complexity of the GNNs, it has been difficult to analyze which parts of inputs affect the GNN model’s outputs. In this study, we extend explainability methods for Convolutional Neural Networks (CNNs), such as Local Interpretable Model-Agnostic Explanations (LIME), Gradient-Based Saliency Maps, and Gradient-Weighted Class Activation Mapping (Grad-CAM) to GNNs, and predict which edges in the input graphs are important for GNN decisions. The experimental results indicate that the LIME-based approach is the most efficient explainability method for multiple tasks in the real-world situation, outperforming even the state-of-the-art method in GNN explainability.
- Research Article
1
- 10.1371/journal.pone.0327636
- Jul 8, 2025
- PloS one
Machine learning is a powerful tool to develop algorithms for clinical diagnosis. However, standard machine learning algorithms are not perfectly suited for clinical data since the data are interconnected and may contain time series. As shown for recommender systems and molecular property predictions, Graph Neural Networks (GNNs) may represent a powerful alternative to exploit the inherently graph-based properties of clinical data. The main goal of this study is to evaluate when GNNs represent a valuable alternative for analyzing large clinical data from the clinical routine, using Complete Blood Count data as an example. In this study, we evaluated the performance and time consumption of several GNNs (e.g., Graph Attention Networks) on similarity graphs compared to simpler, state-of-the-art machine learning algorithms (e.g., XGBoost) on the classification of sepsis from blood count data, as well as the importance and slope of each feature for the final classification. Additionally, we connected complete blood count samples of the same patient based on their measured time (patient-centric graphs) to incorporate time series information in the GNNs. As our main evaluation metric, we used the Area Under Receiver Operating Curve (AUROC) to have a threshold-independent metric that can handle class imbalance. Standard GNNs on the evaluated similarity graphs achieved an AUROC of up to 0.8747, comparable to the performance of ensemble-based machine learning algorithms and a neural network. However, our integration of time series information using patient-centric graphs with GNNs achieved a superior AUROC of up to 0.9565. Finally, we discovered that feature slope and importance differ strongly between trained algorithms (e.g., XGBoost and GNN) on the same data basis.
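As a brief illustration of why AUROC is described above as threshold independent, the following minimal sketch computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The labels and scores are made-up toy values, not data from the study, and the function name `auroc` is chosen here for illustration.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy, imbalanced example (2 positives, 4 negatives); values are illustrative only.
labels = [1, 0, 0, 0, 0, 1]
scores = [0.9, 0.1, 0.4, 0.35, 0.8, 0.7]
result = auroc(labels, scores)
```

Because only the ordering of scores matters, any strictly increasing rescaling of the scores leaves the value unchanged, and no classification threshold has to be chosen. The quadratic loop is kept for clarity; a rank-based computation would be preferable for large datasets.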
- Research Article
- 10.1371/journal.pone.0327636.r006
- Jul 8, 2025
- PLOS One
Purpose: Machine learning is a powerful tool to develop algorithms for clinical diagnosis. However, standard machine learning algorithms are not perfectly suited for clinical data since the data are interconnected and may contain time series. As shown for recommender systems and molecular property predictions, Graph Neural Networks (GNNs) may represent a powerful alternative to exploit the inherently graph-based properties of clinical data. The main goal of this study is to evaluate when GNNs represent a valuable alternative for analyzing large clinical data from the clinical routine on the example of Complete Blood Count Data. Methods: In this study, we evaluated the performance and time consumption of several GNNs (e.g., Graph Attention Networks) on similarity graphs compared to simpler, state-of-the-art machine learning algorithms (e.g., XGBoost) on the classification of sepsis from blood count data as well as the importance and slope of each feature for the final classification. Additionally, we connected complete blood count samples of the same patient based on their measured time (patient-centric graphs) to incorporate time series information in the GNNs. As our main evaluation metric, we used the Area Under Receiver Operating Curve (AUROC) to have a threshold independent metric that can handle class imbalance. Results and Conclusion: Standard GNNs on evaluated similarity-graphs achieved an Area Under Receiver Operating Curve (AUROC) of up to 0.8747 comparable to the performance of ensemble-based machine learning algorithms and a neural network. However, our integration of time series information using patient-centric graphs with GNNs achieved a superior AUROC of up to 0.9565. Finally, we discovered that feature slope and importance highly differ between trained algorithms (e.g., XGBoost and GNN) on the same data basis.