Abstract

Daily newspapers publish a tremendous amount of information that is disseminated through the Internet. These large online repositories are freely available and easily accessible, but they are not indexed and are stored in a format that cannot be processed. A major hindrance to developing and evaluating existing or new methods for monolingual text in images is that such content is neither linked nor indexed. Online news images cannot be reused because standardized benchmark corpora are unavailable, especially for South Asian languages. A corpus is a vital resource for developing and evaluating text-in-image systems that reuse local news, both in general and specifically for the Urdu language. The lack of indexing, primarily semantic indexing, of daily news items makes them impractical to query. Moreover, even the most straightforward search facility does not support these unindexed news resources. Our study addresses this gap by associating and marking newspaper images for Urdu, a widely spoken but under-resourced language. The present work proposes a method to build a benchmark corpus of news in image form using a web crawler. The corpus is then semantically linked and annotated with daily news items. Two techniques are proposed for image annotation: free annotation and fixed cross-examination annotation. The second technique achieved higher accuracy. A news ontology was built in Protégé using the Web Ontology Language (OWL), and the annotations were indexed under it. An application was also built and linked with Protégé so that readers and journalists have an interface to query the news items directly. Similarly, linking news items together provides complete coverage and brings different opinions together in a single location so that readers can do the analysis themselves.

