Abstract

We discuss operational issues in detecting duplicates in databases of bitmapped binary document images, and argue that duplicate document detection that is both efficient and effective probably requires combining an efficient primary detector with an effective subordinate detector. We propose an algorithm that performs binary pattern template matching by cross-correlation as a duplicate document detection method. The template matching operation is amenable to pixel-parallel computation on serial-architecture computers through bitwise integer operations. We describe the algorithm and discuss issues that arise in its practical implementation. Duplicate detection by template matching is especially well suited to facsimile (fax) databases, in particular for detecting the single-feed, multiple-transmission duplicates that often dominate such databases. Detailed experimental results for fax documents demonstrate that template matching is suitable both as a primary detector, when conducted with small template and search area sizes, and as a subordinate detector, when conducted with moderate template and search area sizes.
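The idea of pixel-parallel matching through bitwise integer operations can be illustrated with a minimal sketch. It is not the algorithm from the paper: the function names (pack_rows, match_score, best_match), the row-packing layout, and the use of a simple agreeing-pixel count in place of the paper's cross-correlation measure are all assumptions made for illustration. Each image row is packed into one integer, so a single XOR compares a whole row of template pixels against a window of the image at once.

```python
def pack_rows(image):
    """Pack each row of a 0/1 pixel grid into one integer.

    The leftmost pixel becomes the most significant bit (assumed layout).
    """
    return [int("".join(str(p) for p in row), 2) for row in image]


def match_score(t_rows, t_width, i_rows, i_width, top, left):
    """Count pixels that agree between the template and the image window
    whose upper-left corner is at (top, left)."""
    mask = (1 << t_width) - 1
    shift = i_width - t_width - left  # moves the window into the low bits
    agree = 0
    for dy, t_row in enumerate(t_rows):
        window = (i_rows[top + dy] >> shift) & mask
        # XOR marks differing pixels; inverting under the mask marks agreements.
        agree += bin(~(t_row ^ window) & mask).count("1")
    return agree


def best_match(template, image):
    """Slide the template over every position in the image (the search area)
    and return (fraction of agreeing pixels, top, left) for the best position."""
    t_h, t_w = len(template), len(template[0])
    i_h, i_w = len(image), len(image[0])
    t_rows, i_rows = pack_rows(template), pack_rows(image)
    best = (-1, 0, 0)
    for top in range(i_h - t_h + 1):
        for left in range(i_w - t_w + 1):
            score = match_score(t_rows, t_w, i_rows, i_w, top, left)
            if score > best[0]:
                best = (score, top, left)
    score, top, left = best
    return score / (t_h * t_w), top, left


# Hypothetical toy data: a 2x3 template embedded in a 4x6 image at row 1, column 2.
template = [[1, 0, 1],
            [0, 1, 1]]
image = [[0, 0, 0, 0, 0, 0],
         [0, 0, 1, 0, 1, 0],
         [0, 0, 0, 1, 1, 0],
         [0, 0, 0, 0, 0, 0]]
print(best_match(template, image))
```

In this sketch, a score near 1.0 at some offset would suggest that the two images share the template region. How large the template and search area should be, and what threshold separates duplicates from non-duplicates, are the practical questions the abstract refers to and are not settled here.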
