Abstract

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data used to train it. The presence of unwanted outliers in the data can significantly reduce the accuracy of a model or, worse, produce a biased model that leads to inaccurate classification. Identifying and eliminating outliers is therefore crucial for building good-quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our method uses a technique called t-distributed stochastic neighbour embedding (t-SNE) to reduce the high-dimensional feature map to a lower, two-dimensional, probability density distribution, and then applies a simple descriptive statistical method, the interquartile range (IQR), to identify any outlier values in the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset of images used to train a convolutional neural network (ConvNet) model for an image classification problem. The dataset contains four classes of images: three classes of defects in construction (mould, stain, and paint deterioration) and a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model, which served both as a feature extractor and as a benchmark for evaluating our method.
We have shown that this method can identify and remove the outlier images in the dataset. After removing the outlier images and re-training the VGG-16 model, the results also show that the classification accuracy improved significantly and the number of misclassified cases dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on techniques for handling outliers in high-dimensional data that can improve the quality of machine learning problems involving images, such as ConvNet models for image classification and object detection.
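As a concrete illustration of the IQR stage described above, the sketch below flags points whose coordinates fall outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] in either dimension. It assumes the two-dimensional coordinates have already been produced by a prior t-SNE embedding (e.g. scikit-learn's `TSNE`); the data and the `iqr_outliers` helper are illustrative, not the paper's exact implementation:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Hypothetical 2-D embedding coordinates, as t-SNE might produce;
# a point is flagged if either coordinate is an IQR outlier.
xs = [0.9, 1.1, 1.0, 0.8, 1.2, 9.5]   # dimension 1
ys = [2.0, 2.1, 1.9, 2.2, 2.0, 2.1]   # dimension 2
flagged = sorted(set(iqr_outliers(xs)) | set(iqr_outliers(ys)))
print(flagged)  # → [5]: the last point sits far from the cluster in dimension 1
```

In the paper's setting, each index would correspond to a training image, so the flagged indices identify candidate outlier images to remove before re-training.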

Highlights

  • Machine learning (ML) has shown huge advances in recent years, across a wide range of applications including image recognition [1,2,3,4], speech recognition [5,6,7], medical diagnosis [8,9,10], and defect detection and construction health assessment [11,12,13,14,15,16,17]. These advances are attributed to the development of self-learning statistical models which allow computer systems to perform specific tasks relying only on the learnt patterns, and to the increase in computer processing power which supports the analytical capabilities of these models [18,19,20]

  • The classification accuracy was tested using the unused (held-out) images mentioned earlier, which are dedicated to evaluating our model

  • The test accuracy without t-distributed stochastic neighbour embedding (t-SNE) was recorded at 81.25%

Introduction

Machine learning (ML) has shown huge advances in recent years. The potential of this field has been elevated across a wide range of applications including image recognition [1,2,3,4], speech recognition [5,6,7], medical diagnosis [8,9,10], and defect detection and construction health assessment [11,12,13,14,15,16,17]. These recent advances are attributed to several factors, including the development of self-learning statistical models which allow computer systems to perform specific (human-like) tasks relying only on the learnt patterns, and the increase in computer processing power which supports the analytical capabilities of these models [18,19,20]. A simple example of the effect of unwanted outliers on the results of data analysis comes from statistical analysis, where the presence of outliers in the data can significantly affect the estimation of the mean and/or standard deviation of a sample, leading to either over- or under-estimated values [25]
