Bayesian Mixture Models on Connected Components for Newspaper Article Segmentation

Giorgos Sfikas, Basilis Gatos, Georgios Louloudis, Nikolaos Stamatopoulos

doi:10.1145/2960811.2967165


Publication Date: Sep 13, 2016

#Bayesian Mixture Model #Bayesian Model


Abstract

We propose a new method for automated segmentation of scanned newspaper pages into articles. Article regions are produced by merging sub-article-level content and title regions. A Bayesian Gaussian mixture model is fitted to connected component information extracted from the page and clusters the input into sub-article components. The model is conditioned on a prior distribution over region features, which aids classification of regions into titles and content. A Dirichlet prior lets the model automatically estimate the correct number of title and article regions. The method is tested on a dataset of digitized historical newspapers, where the experimental results, assessed visually, are very promising.
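To make the input to the mixture model concrete, the following is a minimal sketch of extracting connected components from a binarized page and turning each one into a feature vector. It is an illustration only, not the authors' implementation: the abstract does not say which connected component features are used, so the choice of position, height and width, the 8-connectivity, and the names `connected_components`, `feature_vectors` and `PAGE` are all assumptions made for this example.

```python
from collections import deque


def connected_components(grid):
    """Find 8-connected groups of foreground (1) pixels.

    Returns one dict per component with its bounding box and pixel count.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append({
                "top": top,
                "left": left,
                "height": bottom - top + 1,
                "width": right - left + 1,
                "area": area,
            })
    return boxes


def feature_vectors(boxes):
    """Assumed feature set: (top, left, height, width) per component."""
    return [(b["top"], b["left"], b["height"], b["width"]) for b in boxes]


# Toy binarized "page": one large glyph (title-like) and three small ones.
PAGE = [
    "1110000",
    "1110000",
    "1110010",
    "0000000",
    "0101000",
]
grid = [[int(ch) for ch in row] for row in PAGE]
components = connected_components(grid)
features = feature_vectors(components)
```

Feature vectors of this kind are what a Gaussian mixture model would cluster. In the toy page, the large component differs from the small ones mainly in height and width, which is the sort of cue that could help separate title text from body text.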
