Abstract

BackgroundData preprocessing is a major step in data mining. In data preprocessing, several known techniques can be applied, or new ones developed, to improve data quality such that the mining results become more accurate and intelligible. Bioinformatics is one area with a high demand for generation of comprehensive models from large datasets. In this article, we propose a context-based data preprocessing approach to mine data from molecular docking simulation results. The test cases used a fully-flexible receptor (FFR) model of Mycobacterium tuberculosis InhA enzyme (FFR_InhA) and four different ligands.ResultsWe generated an initial set of attributes as well as their respective instances. To improve this initial set, we applied two selection strategies. The first was based on our context-based approach while the second used the CFS (Correlation-based Feature Selection) machine learning algorithm. Additionally, we produced an extra dataset containing features selected by combining our context strategy and the CFS algorithm. To demonstrate the effectiveness of the proposed method, we evaluated its performance based on various predictive (RMSE, MAE, Correlation, and Nodes) and context (Precision, Recall and FScore) measures.ConclusionsStatistical analysis of the results shows that the proposed context-based data preprocessing approach significantly improves predictive and context measures and outperforms the CFS algorithm. Context-based data preprocessing improves mining results by producing superior interpretable models, which makes it well-suited for practical applications in molecular docking simulations using FFR models.

Highlights

  • Freitas et al [3] support the use of decision trees models, instead of black box algorithms, to represent, graphically, patterns revealed by data mining, for example, Support Vector Machine (SVM) or Neural Networks models

  • We model the full receptor explicit flexibility in the molecular docking simulations [10] using a set of different conformations for the receptor, generated by molecular dynamics (MD) simulations [11]

  • We evaluated the models by means of the predictive and context measures presented

Read more

Summary

Introduction

Freitas et al [3] support the use of decision trees models, instead of black box algorithms, to represent, graphically, patterns revealed by data mining, for example, Support Vector Machine (SVM) or Neural Networks models. Still according to these authors [3], the hierarchical structure developed can emphasize the importance of the attributes used for prediction

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call