Application of Kernel Functions for Accurate Similarity Search in Large Chemical Databases

Xiaohong Wang, Jun Huan, Aaron Smalter, Gerald H Lushington

doi:10.1109/bibm.2009.72

Abstract

Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing similarity search for large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support a new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller index size and faster query processing time than state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep.

Highlights

Elucidating the roles of small organic molecules in biological systems, as studied in chemical genomics, is an emerging and challenging task
The analysis of chemical genomics data was done mainly within pharmaceutical companies for therapeutics discovery, and it was estimated that only 1% of chemical information was in the public domain [1]
Before discussing the algorithmic details, we present some general background material, including the concept of graphs and the representation of chemical structures as graphs

Summary

Introduction

Elucidating the roles of small organic molecules in biological systems, as studied in chemical genomics, is an emerging and challenging task. Most 3D structure-based approaches compare three-dimensional shapes using a range of molecular descriptors [5][6]. Such methods provide fast query processing in large chemical databases but relatively poor accuracy, since they may lose much of the structural information when compressing the three-dimensional shapes. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing similarity search for large databases. Below we introduce details of the feature extraction process, the index structure for fast similarity queries, and the kernel function for similarity measurement. Based on the hash table, we calculate distances between the query graph and the graphs in the database.
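The following is a minimal sketch of the general idea of a hash-table-backed kernel search, not the G-hash implementation itself. It assumes a deliberately simple, hypothetical feature (each atom's label plus the sorted labels of its neighbors), a kernel defined as the dot product of feature counts, and the standard kernel-induced distance d(q, g) = sqrt(k(q, q) + k(g, g) - 2 k(q, g)). The toy molecules and all function names (`node_features`, `build_index`, `knn`) are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import sqrt

# A molecule as a labeled graph: atom labels plus an undirected bond list.
# (Hypothetical toy encoding; hydrogens omitted.)
ETHANOL = (["C", "C", "O"], [(0, 1), (1, 2)])
METHANOL = (["C", "O"], [(0, 1)])
PROPANE = (["C", "C", "C"], [(0, 1), (1, 2)])


def node_features(graph):
    """Count one discrete key per atom: its label plus its sorted neighbor labels."""
    labels, bonds = graph
    neighbors = defaultdict(list)
    for a, b in bonds:
        neighbors[a].append(labels[b])
        neighbors[b].append(labels[a])
    return Counter(
        (labels[i], tuple(sorted(neighbors[i]))) for i in range(len(labels))
    )


def build_index(database):
    """Build the hash table: feature key -> {graph id: atoms with that key}."""
    index = defaultdict(dict)
    self_kernel = {}  # k(g, g) for every database graph, computed once
    for gid, graph in database.items():
        feats = node_features(graph)
        for key, count in feats.items():
            index[key][gid] = count
        self_kernel[gid] = sum(c * c for c in feats.values())
    return index, self_kernel


def knn(query, index, self_kernel, k=1):
    """Return the k nearest database graphs as (distance, graph id) pairs."""
    q_feats = node_features(query)
    k_qq = sum(c * c for c in q_feats.values())
    # Only graphs sharing a hashed key with the query contribute to k(q, g).
    cross = defaultdict(int)
    for key, q_count in q_feats.items():
        for gid, g_count in index.get(key, {}).items():
            cross[gid] += q_count * g_count
    dists = []
    for gid, k_gg in self_kernel.items():
        d2 = k_qq + k_gg - 2 * cross[gid]
        dists.append((sqrt(max(d2, 0)), gid))
    return sorted(dists)[:k]


index, self_kernel = build_index(
    {"ethanol": ETHANOL, "methanol": METHANOL, "propane": PROPANE}
)
result = knn(ETHANOL, index, self_kernel, k=3)
```

In this sketch a query touches only the hash buckets for its own feature keys, so the cross-kernel terms are accumulated without comparing the query graph to each database graph pairwise. The neighbor labels returned by `knn` could then feed a majority vote for k-NN classification.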

Methods

Results

Discussion

Conclusion

Publication Date: Nov 1, 2009
Citations: 23
License type: cc-by

Similar Papers

Application of kernel functions for accurate similarity search in large chemical databases.
Xiaohong Wang ... Jun Huan
BMC Bioinformatics | VOL. Suppl 11 3
01 Apr 2010

G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.
Xiaohong Wang ... Aaron Smalter
Advances in database technology : proceedings. International Conference on Extending Database Technology | VOL. 360
24 Mar 2009

Virtual screening applications: a study of ligand-based methods and different structure representations in four different scenarios
Dimitar P Hristozov ... Johann Gasteiger
Journal of Computer-Aided Molecular Design | VOL. 21
01 Oct 2007

Comparison of the NCI open database with seven large chemical structural databases.
Johannes H Voigt ... Marc C Nicklaus
Journal of Chemical Information and Computer Sciences | VOL. 41
01 May 2001
