Abstract

Automated information retrieval is critical for enterprise information systems to acquire knowledge from the vast amount of data sets. One challenge in information retrieval is text classification. Current practices rely heavily on the classical naïve Bayes algorithm due to its simplicity and robustness. However, results from this algorithm are not always satisfactory. In this article, the limitations of the naïve Bayes algorithm are discussed, and it is found that the assumption on the independence of terms is the main reason for an unsatisfactory classification in many real-world applications. To overcome the limitations, the dependent factors are considered by integrating a term frequency–inverse document frequency (TF-IDF) weighting algorithm in the naïve Bayes classification. Moreover, the TF-IDF algorithm itself is improved so that both frequencies and distribution information are taken into consideration. To illustrate the effectiveness of the proposed method, two simulation experiments were conducted, and the comparisons with other classification methods have shown that the proposed method has outperformed other existing algorithms in terms of precision and index recall rate.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.