Abstract

This study involves a methodology for the automatic identification of semantic features and document clusters in a heterogeneous text collection. The methodology is based upon encoding the data using low rank non-negative matrix factorization algorithms to preserve natural data non-negativity and thus avoid subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Some existing non-negative matrix factorization techniques are reviewed and some new ones are proposed. Numerical experiments are reported on the use of a hybrid NMF algorithm to produce a parts-based approximation of a sparse term-by-document matrix. The resulting basis vectors and matrix projection can be used to identify underlying semantic features (topics) and document clusters of the corresponding text collection.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.