Abstract

Identifying the exact gene mutations that cause a genetic disease remains a challenging task. Despite advances in information technology, extracting gene-disease associations is still largely a manual, time-consuming process in which experts read relevant research papers and record the associations by hand. This paper aims to develop an automated approach for extracting and classifying gene-disease associations from the research literature using natural language processing and machine learning techniques. Data were extracted from free-text research papers, and four dataset formats were built to find the most effective representation. Machine learning and deep learning models (NB, KNN, SVM, NN, CNN, and LSTM) with TF-IDF features were applied to these datasets. The dataset format containing only Positive and Negative instances proved to be the best representation for extracting gene-disease associations, with accuracy between 74% and 91%. Across the four dataset representations, the multilayer neural network was able to predict all classes in most experiments, with accuracy between 64% and 91%. These initial results highlight the need for further work to improve both the performance of the models and the data extraction method, in order to build a more accurate dataset representation.
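
To make the TF-IDF classification step concrete, the following is a minimal sketch, not the implementation used in this work. It assumes whitespace tokenization, a smoothed IDF, and a cosine-similarity nearest-neighbour classifier standing in for the KNN model. The sentences, the placeholder names (GENE_A, DISEASE_X, and so on), and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real pipelines do more.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(doc, idf):
    # Term frequency times IDF, L2-normalized; unseen terms are dropped.
    counts = Counter(tokenize(doc))
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(query_vec, train_vecs, labels, k=1):
    # Majority vote among the k most similar training sentences.
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy sentences, invented for this sketch (not from the datasets).
train = [
    ("mutations in GENE_A cause DISEASE_X", "Positive"),
    ("GENE_B is associated with DISEASE_Y", "Positive"),
    ("GENE_C was not associated with DISEASE_X", "Negative"),
    ("no evidence links GENE_D to DISEASE_Z", "Negative"),
]
sentences = [s for s, _ in train]
labels = [y for _, y in train]

idf = fit_idf(sentences)
train_vecs = [tfidf_vector(s, idf) for s in sentences]

for query in ("mutations in GENE_E cause DISEASE_Z",
              "no evidence links GENE_E to DISEASE_W"):
    print(query, "->", knn_predict(tfidf_vector(query, idf), train_vecs, labels))
```

A toy example of this size says nothing about the accuracy figures reported above. It only shows how a sentence is turned into a weighted term vector and compared against labelled Positive and Negative instances.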
