Abstract
Mixed Raster Content (MRC) is a standard for efficient document compression that can dramatically improve the compression/quality tradeoff compared with traditional lossy image compression algorithms. The key to MRC's performance is the separation of the document into foreground and background layers, with the segmentation between them represented as a binary mask. Typically, the foreground layer contains text colors, the background layer contains images and graphics, and the binary mask layer represents the fine detail of text fonts. The quality and compression ratio achieved by an MRC document encoder depend heavily on the segmentation algorithm used to compute the binary mask. In this paper, we propose a novel segmentation method based on the MRC standard (ITU-T T.44). The algorithm consists of two components: Cost Optimized Segmentation (COS) and Connected Component Classification (CCC). COS is a blockwise segmentation algorithm formulated in a global cost optimization framework, while CCC is based on feature vector classification of connected components. Experimental results show that the new algorithm achieves the same text detection accuracy as state-of-the-art commercial MRC products, but with fewer false detections of non-text features. The result is high-quality MRC-encoded documents with fewer non-text artifacts and a lower bit rate.
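To make the three-layer model concrete, the following minimal sketch shows how a decoder could recombine the layers: each output pixel is taken from the foreground where the mask is set and from the background otherwise. It is an illustration only, not the paper's segmentation method. The function name `mrc_compose`, the convention that mask value 1 selects the foreground, and the tiny sample data are all assumptions made for the example. The sketch also ignores that real MRC encoders typically store the layers at different resolutions and with different codecs.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), hypothetical pixel type for this sketch
Layer = List[List[Pixel]]      # rows of pixels
Mask = List[List[int]]         # rows of 0/1 values


def mrc_compose(mask: Mask, foreground: Layer, background: Layer) -> Layer:
    """Recombine three same-sized layers into one image.

    Assumption: mask value 1 selects the foreground pixel,
    mask value 0 selects the background pixel.
    """
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# Tiny 2x3 example: black "text" pixels over an off-white page.
BLACK: Pixel = (0, 0, 0)
PAPER: Pixel = (250, 248, 240)

mask = [
    [0, 1, 0],
    [1, 1, 0],
]
foreground = [[BLACK] * 3 for _ in range(2)]   # text color layer
background = [[PAPER] * 3 for _ in range(2)]   # image/graphics layer

image = mrc_compose(mask, foreground, background)
```

Because the mask alone carries the sharp text edges, the foreground and background layers can be smooth and compress well, which is why the quality of the segmentation that produces the mask matters so much.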