Abstract

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data used to train it. The presence of unwanted outliers in the data can significantly reduce the accuracy of a model or, worse, produce a biased model that leads to inaccurate classification. Identifying and eliminating outliers is therefore crucial for building good-quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our method uses a technique called t-distributed stochastic neighbour embedding (t-SNE) to reduce the high-dimensional feature map to a lower, two-dimensional, probability density distribution, and then applies a simple descriptive statistical method, the interquartile range (IQR), to identify any outlier values in the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well suited to embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset of images used to train a convolutional neural network (ConvNet) model for an image classification problem. The dataset contains four classes of images: three classes of defects in construction (mould, stain, and paint deterioration) and a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model, which served both as a feature extractor and as a benchmark for evaluating our method.
We have shown that this method can identify and remove the outlier images in the dataset. After removing the outlier images and re-training the VGG-16 model, the results also show that the classification accuracy improved significantly and the number of misclassified cases dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on techniques for handling outliers in high-dimensional data that can improve the quality of machine learning problems involving images, such as ConvNet models for image classification and object detection.
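As a concrete illustration of the IQR stage described above, the sketch below flags points whose coordinates fall outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] in either dimension. It assumes the two-dimensional coordinates have already been produced by a prior t-SNE embedding (e.g. scikit-learn's `TSNE`); the data and the `iqr_outliers` helper are illustrative, not the paper's exact implementation:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Hypothetical 2-D embedding coordinates, as t-SNE might produce;
# a point is flagged if either coordinate is an IQR outlier.
xs = [0.9, 1.1, 1.0, 0.8, 1.2, 9.5]   # dimension 1
ys = [2.0, 2.1, 1.9, 2.2, 2.0, 2.1]   # dimension 2
flagged = sorted(set(iqr_outliers(xs)) | set(iqr_outliers(ys)))
print(flagged)  # → [5]: the last point sits far from the cluster in dimension 1
```

In the paper's setting, each index would correspond to a training image, so the flagged indices identify candidate outlier images to remove before re-training.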

Highlights

  • Machine learning (ML) has shown huge advances in recent years, across a wide range of applications including image recognition [1,2,3,4], speech recognition [5,6,7], medical diagnosis [8,9,10], and defect detection and construction health assessment [11,12,13,14,15,16,17]. These advances are attributed to the development of self-learning statistical models which allow computer systems to perform specific tasks relying only on the learnt patterns, and to the increase in computer processing power which supports the analytical capabilities of these models [18,19,20]

  • The classification accuracy was tested using the unused (held-out) images mentioned earlier, which are dedicated to evaluating our model

  • The test accuracy without t-distributed stochastic neighbour embedding (t-SNE) was recorded at 81.25%

Introduction

Machine learning (ML) has shown huge advances in recent years. The potential of this field has been elevated across a wide range of applications including image recognition [1,2,3,4], speech recognition [5,6,7], medical diagnosis [8,9,10], and defect detection and construction health assessment [11,12,13,14,15,16,17]. These recent advances are attributed to several factors, including the development of self-learning statistical models which allow computer systems to perform specific (human-like) tasks relying only on the learnt patterns, and the increase in computer processing power which supports the analytical capabilities of these models [18,19,20]. A simple example of the effect of unwanted outliers on the results of data analysis comes from statistical analysis, where the presence of outliers in the data can significantly affect the estimation of the mean and/or standard deviation of a sample, leading to either over- or under-estimated values [25]
