Design of a Flexible, User Friendly Feature Matrix Generation System and its Application on Biomedical Datasets

M Ghorbani, S J E Taylor, A M Payne, S Swift

doi:10.1007/s10723-020-09518-y

Open Access

https://doi.org/10.1007/s10723-020-09518-y

Journal: Journal of Grid Computing
Publication Date: Apr 27, 2020
Citations: 1
License type: open-access

Affiliation: Brunel University London

Abstract

The generation of a feature matrix is the first step in conducting machine learning analyses on complex data sets such as those containing DNA, RNA or protein sequences. These matrices contain information for each object, which has to be identified using complex algorithms to interrogate the data. They are normally generated by combining the results of running such algorithms across various datasets from different, distributed data sources. Thus, for non-computing experts, the generation of such matrices proves a barrier to employing machine learning techniques. Further, since datasets are becoming larger, this barrier is compounded by the limitations of the single personal computer most often used by investigators to carry out such analyses. Here we propose a user-friendly system to generate feature matrices in a way that is flexible, scalable and extendable. Additionally, by making use of the Berkeley Open Infrastructure for Network Computing (BOINC) software, the process can be sped up using the distributed volunteer computing that is possible in most institutions. The system combines the Grid and Cloud User Support Environment (gUSE) with the Web Services Parallel Grid Runtime and Developer Environment Portal (WS-PGRADE) to create workflow-based science gateways that allow users to submit work to the distributed computing resources. This report demonstrates the use of our proposed WS-PGRADE/gUSE BOINC system to identify features to populate matrices from very large DNA sequence data repositories. However, we propose that this system could also be used to analyse a wide variety of feature sets, including image, numerical and text data.

Highlights

Machine learning techniques have proved to be important tools in many research areas to aid knowledge discovery from complex data sets.
Machine learning analysis is preceded by the important stage of feature matrix generation, which selects the features to be analyzed from these data sets.
Often the features are generated by running algorithms across the data to draw out derived features or values not in the original data set. It follows that the successful outcome of machine learning techniques is highly dependent upon the feature generation stage [4].

Summary

Introduction

Machine learning techniques have proved to be important tools in many research areas to aid knowledge discovery from complex data sets. Machine learning analysis is preceded by the important stage of feature matrix generation, which selects the features to be analyzed from these data sets. In some cases these features can be a subset of the features in the data set, chosen using expert knowledge of the subject area from which the data was collected. Often the features are generated by running algorithms across the data to draw out derived features or values not in the original data set. It follows that the successful outcome of machine learning techniques is highly dependent upon the feature generation stage [4]. This adds a layer of complexity to an already difficult analysis for the non-expert.
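To make the idea of a feature matrix concrete, the following is a minimal Python sketch of deriving features from DNA sequences, with one row per sequence and one column per feature. It is an illustration only and not the WS-PGRADE/gUSE BOINC system described in the paper: the feature set (`length`, `gc_content`, `count_TATA`), the function names and the example sequences are all assumptions chosen for brevity.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_count(seq: str, kmer: str) -> int:
    """Number of occurrences of kmer in seq, counting overlapping matches."""
    k = len(kmer)
    return sum(seq[i:i + k] == kmer for i in range(len(seq) - k + 1))


# Hypothetical feature set: column name -> function applied to one sequence.
# Each function stands in for an "algorithm run across the data".
FEATURES = {
    "length": len,
    "gc_content": gc_content,
    "count_TATA": lambda s: kmer_count(s, "TATA"),
}


def build_feature_matrix(sequences: dict[str, str]):
    """Return (column names, row ids, rows) with one row per input sequence."""
    columns = list(FEATURES)
    ids = list(sequences)
    rows = [
        [float(FEATURES[name](seq.upper())) for name in columns]
        for seq in sequences.values()
    ]
    return columns, ids, rows


if __name__ == "__main__":
    # Made-up sequences, used only to show the shape of the result.
    cols, ids, rows = build_feature_matrix({"seq_a": "GGCC", "seq_b": "tatata"})
    print(cols)
    for seq_id, row in zip(ids, rows):
        print(seq_id, row)
```

Each feature is computed independently for each sequence, so in principle the rows (or individual cells) could be computed as separate work units and merged afterwards, which is the property a distributed approach relies on.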


