Abstract

Nowadays, the generation of information through digital text documents has increased exponentially, so there is a need to store documents in mass storage devices such as high capacity hard discs, storage servers, the cloud and others. However, the storage that is carried out lacks a thematic organization, therefore, a search for information becomes complex. Given this problem, this publication describes the development of a system that has the purpose of classifying a digital text document based on the thematic content. This system implements ontologies to achieve a better classification by taking advantage of its characteristics. The system is divided into five tasks: the first is the implementation of a word count to create a frequency vector; The second task performs a refinement on the frequency vector to eliminate the sentence connectors and prepositions; the third task orders the vector from the highest to the lowest frequency; the fourth task takes the most significant set of frequencies vector, in which the ontology of a domain is applied and the relation that the words have to determine the thematic of the document is sought; and the fifth task is to organize the documents in a folder structure based on the identified domains. The system was developed with the incremental development methodology. To validate the operation of the system, a set of tests was carried out in a controlled scenario in order to verify the correct classification of the documents.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.