Abstract

In computational pathology, a typical Whole Slide Image may easily reach a size of 3–4 gigabytes, while a database with hundreds of cases may reach terabytes. In this scenario, training any model is expensive, so the possibility of reducing the size of the training data is appealing. This paper presents a method to summarize the information of a dataset, as a function of its variance, by selecting the most informative samples. The first step of the strategy consists in projecting small image patches (100 × 100 pixels) into a feature space and discretizing that space with a simple k-means to obtain a feature-space vocabulary. The clusters produced by k-means are the vocabulary words, so each small patch is represented by the centroid of the cluster to which it is assigned. In a second step, a probabilistic Latent Semantic Analysis (pLSA) constructs groups of words, known as topics, by computing the frequencies of words in documents, which are larger patches containing between 450 and 500 small patches. A third step collects the documents representing 80 % of the topic variance, assembles their patches, and applies a Singular Value Decomposition (SVD) to them. The data distillation process then keeps only the small patches belonging to the topics showing the highest variance according to the singular values in the matrix S of the SVD. The method's efficacy was evaluated by comparing the performance of models trained either with 40 % of an actual ovarian cancer dataset selected using this method or with the entire dataset without any selection. Results show that the F-scores obtained with these two sets were similar, about 0.87, for different classifiers, namely a Support Vector Machine and a Multilayer Perceptron.
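The vocabulary-and-distillation pipeline described above can be sketched as follows. This is a minimal illustration with synthetic data and hypothetical sizes (2000 patches, 64-dimensional features, a 50-word vocabulary, documents of 400 patches); for brevity, the pLSA topic-modeling step is replaced here by a direct SVD of the document–word count matrix, with the singular values playing the role of the topic-variance measure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Step 1: vocabulary. Cluster small-patch feature vectors with k-means;
# each cluster centroid is one vocabulary "word".
patch_features = rng.normal(size=(2000, 64))   # hypothetical patch features
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(patch_features)
words = kmeans.labels_                          # word index per small patch

# Step 2: documents. Group consecutive small patches into "documents"
# (a stand-in for the 450-500 small patches per large patch) and build
# the document-word frequency matrix.
docs = words.reshape(5, 400)                    # 5 documents of 400 patches
counts = np.stack([np.bincount(d, minlength=50) for d in docs]).astype(float)

# Step 3: decompose the count matrix with SVD; the singular values in S
# measure how much variance each component explains.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)

# Distillation criterion: keep the components that together cover 80 %
# of the variance; their associated documents/patches would be retained.
explained = np.cumsum(S**2) / np.sum(S**2)
k = int(np.searchsorted(explained, 0.80)) + 1
print(f"{k} of {len(S)} components cover 80% of the variance")
```

In the actual method, the components of this decomposition would correspond to pLSA topics, and the small patches belonging to the highest-variance topics would form the distilled training set.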
