Recovering Damaged Documents to Improve Information Retrieval Processes

Angel Luis Garrido, Alvaro Peiró

doi:10.5584/jiomics.v8i3.230

Abstract

Although computer forensics is usually associated with the investigation of computer crimes, it can also be used in civil proceedings. One use case is information retrieval from damaged documents, in which words have been altered either accidentally or intentionally. In this paper, we present a new tool that can retrieve information from large volumes of documents whose contents have been damaged. We have designed a new approach to recover the original words, composed of two stages: a text cleaning filter, which removes irrelevant information, and a text correction unit, which combines a general-purpose spell checker with an N-gram based spell checker built specifically for the domain of the documents. The benefits of this combined approach are twofold: on the one hand, the general spell checker lets us leverage the general-purpose techniques that are usually used to perform corrections; on the other hand, the N-gram based model lets us adapt them to the particular domain we are tackling by exploiting text regularities detected in successfully processed domain documents. The corrected output allows us to improve automatic information retrieval tasks on the texts. We have tested the tool on a real data set, using an information extraction tool based on semantic technologies, in collaboration with the Spanish company InSynergy Consulting.
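The two-stage idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`clean_text`, `train_domain_model`, `correct`) are hypothetical, the standard library's `difflib.get_close_matches` stands in for the general-purpose spell checker, and a simple word-bigram count stands in for the domain-specific N-gram model. The cleaning rule (drop tokens that contain no letters) is likewise only an assumption about what "non-relevant information" might mean.

```python
import string
from collections import Counter
from difflib import get_close_matches


def clean_text(text):
    """Stage 1 (assumed rule): strip edge punctuation, drop tokens with no letters."""
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if any(ch.isalpha() for ch in token):
            tokens.append(token)
    return tokens


def train_domain_model(clean_docs):
    """Collect bigram counts and a vocabulary from successfully processed documents."""
    bigrams = Counter()
    vocabulary = set()
    for doc in clean_docs:
        words = clean_text(doc)
        vocabulary.update(words)
        bigrams.update(zip(words, words[1:]))
    return bigrams, vocabulary


def correct(text, vocabulary, bigrams):
    """Stage 2: generic candidate generation, then domain bigram ranking."""
    known = sorted(vocabulary)  # fixed order keeps the results deterministic
    corrected = []
    for word in clean_text(text):
        if word in vocabulary:
            corrected.append(word)
            continue
        # Stand-in for a general-purpose spell checker: similar known words,
        # returned most similar first.
        candidates = get_close_matches(word, known, n=5, cutoff=0.6)
        if not candidates:
            corrected.append(word)  # nothing plausible, keep the damaged word
            continue
        previous = corrected[-1] if corrected else None
        # Prefer the candidate seen most often after the previous word; on a
        # tie, max() keeps the earliest, i.e. the most similar, candidate.
        corrected.append(max(candidates, key=lambda c: bigrams[(previous, c)]))
    return " ".join(corrected)


if __name__ == "__main__":
    docs = [
        "the lease contract was signed",
        "the lease contract expires soon",
        "contact the tenant",
    ]
    bigrams, vocabulary = train_domain_model(docs)
    print(correct("the lease ### c0ntract was sigmed", vocabulary, bigrams))
```

In this toy setup both "contract" and "contact" are plausible generic candidates for the damaged token "c0ntract"; the domain bigram counts are what favour "contract" after "lease", which is the kind of domain regularity the abstract refers to.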

More From: Journal of Integrated OMICS