Abstract

This preliminary study proposes a classification technique that uses text mining to classify documents, together with an algorithm that automatically assigns each document to a folder based on its content and that can also perform sentiment analysis on the data sets and summarize them. The objectives of this paper are to identify an efficient text mining classification technique that yields the highest accuracy when classifying documents into folders, to extract valuable information from context-based terms that the algorithm can use for automatic classification, and to evaluate the classification technique. The methodology comprises five modules: 1) Document Collection, 2) Pre-Processing Stage, 3) Term Frequency-Inverse Document Frequency, 4) Classification Technique and Algorithm, and 5) Evaluation and Visualization of the classification result. The proposed framework uses Term Frequency-Inverse Document Frequency (TF-IDF) and a Decision Tree technique: TF-IDF ranks all terms from most to least frequent, while the decision tree decides which folder a document belongs to.

Highlights

  • Text Mining (TM) is an analytics process formulated to analyze a collection of unstructured textual materials in order to derive high-quality information and essential knowledge from raw text; TM specializes in handling unstructured information

  • Combined with a Term Document Matrix (TDM) [2], TM can index and count all terms that appear in each document in tabular form, with columns for the terms that appear in a document and rows for the document identifiers, or vice versa

  • In this research, we suggest term frequency-inverse document frequency (TF-IDF) as an attribute reduction technique, combined with a decision tree, as a statistical method for assessing the significance of words based on their frequency of occurrence in the document and in its related corpus
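
The term-document matrix and TF-IDF weighting mentioned in the highlights can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' implementation. The function names (`tokenize`, `term_document_matrix`, `tf_idf`) and the three sample documents are hypothetical, and the weighting shown (term count divided by document length, multiplied by the natural log of the number of documents over the document frequency) is one common TF-IDF variant that may differ from the one used in the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a stand-in for real pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocab, rows): rows[i][j] is the raw count of vocab[j] in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


def tf_idf(rows):
    """Weight each count by tf = count / doc length and idf = ln(N / document frequency)."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    # Document frequency: number of documents in which each term occurs at least once.
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in rows:
        total = sum(row) or 1  # guard against empty documents
        weighted.append([(count / total) * idf[j] for j, count in enumerate(row)])
    return weighted


# Hypothetical sample documents, assumed for illustration only.
docs = [
    "invoice payment due",
    "payment received invoice",
    "meeting agenda notes",
]
vocab, rows = term_document_matrix(docs)
weights = tf_idf(rows)
```

In this sketch each row is a document and each column a term. A term that occurs in only one document (such as "due") receives a higher weight than a term shared by several documents (such as "invoice"), which is the property that makes the weights usable as input attributes for a classifier such as a decision tree.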

Summary

Related Work

We live in an era in which computing technology grows rapidly and data collection has become prominent, contributing to many fields of work such as medicine, business, education, reference, and reporting. Real-world data come in many types, such as qualitative, quantitative, and discrete. These data can be recorded and visualized in a variety of media, such as electronic documents and databases. Data mining, also known as knowledge discovery in databases, is the process of extracting hidden, useful knowledge from large data sets with the help of data analysis tools. Classification is a data mining component used to analyse data and predict the target class to which each item belongs. The first sub-section introduces text mining, followed by term frequency-inverse document frequency, and ends with a comparison of classification techniques in text mining.

Text Mining
Term Frequency-Inverse Document Frequency
Text Classification Technique and Comparison
Framework of Keyword-Based Text Classification
Proposed Framework
RStudio
Expected Outcomes
Conclusion
