Machine Learning Methods for Life Sciences: Intelligent Data Analysis in Bio- and Chemoinformatics

Johannes Mohr

doi:10.14279/depositonce-2052

Abstract

In the past few years, experimental techniques in the life sciences have undergone rapid progress. Moreover, the integration of methods from different disciplines has led to the formation of new fields of research, such as imaging genetics, molecular medicine and biological psychology. This experimental progress has been accompanied by an increasing need for intelligent data analysis, which aims to analyze a given dataset in the most promising way while taking domain knowledge into account. This includes the representation of the data, the choice of variables, the preprocessing, the handling of missing values, the model assumptions, the choice of methods for prediction, model selection and regularization, as well as the interpretation of the results. The topic of this thesis is intelligent data analysis in the fields of bioinformatics and chemoinformatics using machine learning techniques.

The goal of imaging genetics is to gain insight into genetically determined psychiatric diseases through association studies between potentially relevant genetic variants and endophenotypes. In this thesis, two different methods for exploratory analysis are developed. The first method is based on P-SVM feature selection for multiple regression and models additive and multiplicative gene effects on an endophenotype using a sparse regression model. The second method introduces a new learning paradigm called target selection to model the association between a single genetic variable and a multidimensional endophenotype. Often, several different models for genetic association are suggested in the literature, and the question is how much evidence a measured dataset provides for each of them. For this purpose, a method for model comparison in imaging genetics is suggested in this thesis, which is based on the use of information criteria.

The aim of quantitative structure-activity relationship (QSAR) analysis is to predict the biological activity of compounds from their molecular structure. Traditionally, QSAR methods are based on extracting a set of molecular descriptors and using them to build a predictive model. In this thesis, a descriptor-free method for 3D QSAR analysis is proposed, which introduces the concept of molecule kernels to measure the similarity between the 3D structures of a pair of molecules. The molecule kernels can be used together with the P-SVM, a recently proposed support vector machine for dyadic data, to build explanatory QSAR models which do not require an explicit descriptor construction. The resulting models make direct use of the structural similarities between the compounds which are to be predicted and a set of support molecules. The proposed method is applied to QSAR and genotoxicity datasets.
