Abstract

Knowledge Discovery in Databases (KDD) refers to the use of methodologies from machine learning, pattern recognition, statistics, and other fields to extract knowledge from large collections of data, where the knowledge is not explicitly available as part of the database structure. In this paper, we describe four modern data mining techniques, Rough Set Theory (RST), Association Rule Mining (ARM), Emerging Pattern Mining (EP), and Formal Concept Analysis (FCA), and we have attempted to give an exhaustive list of their chemoinformatics applications. One of the main strengths of these methods is their descriptive ability. When used to derive rules, for example, in structure-activity relationships, the rules have clear physical meaning. This review has shown that there are close relationships between the methods. Often, apparent differences lie in the way in which the problem under investigation has been formulated, which can lead to the natural adoption of one or other method. For example, the idea of a structural alert, as a structure which is present in toxic and absent in nontoxic compounds, leads to the natural formulation of an Emerging Pattern search. Despite the similarities between the methods, each has its strengths. RST is useful for dealing with uncertain and noisy data. Its main chemoinformatics applications so far have been in feature extraction and feature reduction, the latter often as input to another data mining method, such as a Support Vector Machine (SVM). ARM has mostly been used for frequent subgraph mining. EP and FCA have each been used to mine structural and nonstructural patterns for the classification of both active and inactive molecules. Since their introduction in the 1980s and 1990s, RST, ARM, EP, and FCA have found wide-ranging applications, with many thousands of citations in Web of Science, but their adoption by the chemoinformatics community has been relatively slow. Advances, both in computer power and in algorithm development, mean that there is the potential to apply these techniques to larger data sets and thus to different problems in the future.

Highlights

  • Knowledge Discovery in Databases (KDD) refers to the use of methodologies from machine learning, pattern recognition, statistics, and other fields to extract knowledge from large collections of data, where the knowledge is not explicitly available as part of the database structure

  • Often apparent differences lie in the way in which the problem under investigation has been formulated, which can lead to the natural adoption of one or other method

  • The equivalence between Association Rule Mining (ARM) and Rough Set Theory (RST) when searching for decision rules is very interesting but does not mean that RST should be disregarded since its treatment of inconsistent rules is unique and is an important feature of the method

Summary

INTRODUCTION

Knowledge Discovery in Databases (KDD) refers to the use of methodologies from machine learning, pattern recognition, statistics, and other fields to extract knowledge from large collections of data, where the knowledge is not explicitly available as part of the database structure. There has been much interest in methodologies which offer explanations of molecular activity. Although such interpretable methods can be poorer in terms of predictive power, they can be of greater value to medicinal chemists since they can provide useful guidance on what compounds to make next. This has led to several different but related methodologies from the KDD field being introduced to chemoinformatics, including Rough Set Theory (RST), Association Rule Mining (ARM), Emerging Pattern Mining (EP), and Formal Concept Analysis (FCA). We give a comprehensive review of the use of these methods in chemoinformatics.
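
To make the notion of an interpretable pattern concrete, the following is a minimal sketch of an Emerging Pattern search in Python. It is not the implementation used in any of the reviewed work. It assumes the commonly used growth-rate definition (support of a pattern in the target class divided by its support in the background class), and the compounds, substructure labels, and function names are all hypothetical and chosen only for illustration. Patterns with infinite growth rate occur in the target class and never in the background class, which matches the idea of a structural alert present in toxic and absent in nontoxic compounds.

```python
from itertools import combinations

# Hypothetical toy data: each compound is a set of substructure labels.
toxic = [
    {"nitro", "aromatic_ring"},
    {"nitro", "aromatic_ring", "halide"},
    {"epoxide", "aromatic_ring"},
]
nontoxic = [
    {"aromatic_ring", "hydroxyl"},
    {"aromatic_ring", "halide"},
    {"hydroxyl"},
]


def support(pattern, compounds):
    """Fraction of compounds that contain every feature in the pattern."""
    if not compounds:
        return 0.0
    return sum(pattern <= c for c in compounds) / len(compounds)


def growth_rate(pattern, background, target):
    """Support in the target class divided by support in the background class."""
    s_bg = support(pattern, background)
    s_tg = support(pattern, target)
    if s_tg == 0:
        return 0.0  # pattern never occurs in the target class
    if s_bg == 0:
        return float("inf")  # occurs only in the target class
    return s_tg / s_bg


def emerging_patterns(background, target, min_growth, max_size=2):
    """Brute-force enumeration of feature sets up to max_size (toy scale only)."""
    features = sorted(set().union(*background, *target))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(features, size):
            pattern = frozenset(combo)
            rate = growth_rate(pattern, background, target)
            if rate >= min_growth:
                found[pattern] = rate
    return found


# Candidate structural alerts: patterns seen in toxic compounds only.
alerts = emerging_patterns(nontoxic, toxic, min_growth=float("inf"))
for pattern in alerts:
    print(sorted(pattern))
```

The exhaustive enumeration is exponential in the number of features and is workable only for toy data; it is used here purely to keep the definition visible in the code.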

MOLECULAR ANALYSIS
DATA SETS
KDD ALGORITHMS
COMPARISON OF KDD METHODS
APPLICATIONS
CONCLUSIONS AND FUTURE DIRECTIONS
ACKNOWLEDGMENTS
REFERENCES