Abstract

Keyword search plays a critical role in helping bioinformatics researchers retrieve structured, semi-structured, and unstructured data. At the same time, data mining has drawn increasing attention as a way to fully exploit the rich repository of biological databases. An interesting question is how database keyword search (DB KWS) relates to in-depth database exploration (or data mining) in the context of bioinformatics, and in particular what DB KWS can contribute to data mining. So far, however, no systematic investigation of this relationship is known. In this paper, we provide a preliminary discussion of how DB KWS can be used for in-depth exploration of biological databases, and describe a case study on the association between genetic variants and diseases. The case study is motivated by the fact that the advent of high-throughput sequencing technologies has facilitated the generation of a huge amount of genomic data. A wealth of genomic information in publicly available databases remains underutilized as a resource for uncovering functionally relevant markers underlying complex human traits. Discovering genetic associations is important for understanding human illness: it helps derive disease pathways and a range of other information, such as disease-gene associations and the variants associated with diseases. A database of genome-wide association studies was curated, and keyword search over it was implemented in Java using an algorithm inspired by DBXplorer. The case study further proposes ways to incorporate association rule mining, a data mining technique useful for discovering interesting relationships hidden in large data sets, to further investigate the results of keyword searches performed with different yet sensible combinations of diseases and genes.
We believe that such an integrated study, exploring how a single bioinformatics application can take advantage of both techniques, is of both theoretical and practical importance.
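
To make the association rule mining step more concrete, the following is a minimal sketch of how support and confidence could be computed over records returned by a keyword search. It is an illustration under stated assumptions, not the implementation used in the case study: the class name `RuleSketch`, the methods `support` and `confidence`, and the placeholder labels such as `gene:G1` and `disease:D1` are all hypothetical, and each record is assumed to be a set of items taken from one search hit.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: support and confidence for a rule such as {gene} -> {disease}.
public class RuleSketch {

    // Fraction of records that contain every item in `items`.
    public static double support(List<Set<String>> records, Set<String> items) {
        if (records.isEmpty()) {
            return 0.0;
        }
        long hits = records.stream().filter(r -> r.containsAll(items)).count();
        return (double) hits / records.size();
    }

    // Confidence of lhs -> rhs: support(lhs and rhs together) / support(lhs).
    public static double confidence(List<Set<String>> records,
                                    Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        double lhsSupport = support(records, lhs);
        return lhsSupport == 0.0 ? 0.0 : support(records, both) / lhsSupport;
    }

    public static void main(String[] args) {
        // Placeholder records, not real data: one set of items per search hit.
        List<Set<String>> records = List.of(
                Set.of("gene:G1", "disease:D1"),
                Set.of("gene:G1", "disease:D1"),
                Set.of("gene:G1", "disease:D2"),
                Set.of("gene:G2", "disease:D2"));

        Set<String> gene = Set.of("gene:G1");
        Set<String> disease = Set.of("disease:D1");
        System.out.println("support    = " + support(records, Set.of("gene:G1", "disease:D1")));
        System.out.println("confidence = " + confidence(records, gene, disease));
    }
}
```

In this sketch, a rule with high support and high confidence would be a candidate relationship worth inspecting further. A full association rule miner (for example, an Apriori-style search over candidate itemsets) would enumerate such rules rather than score a single one.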
