Abstract

In this paper, we apply discriminant analysis to a large set of historical news articles published at www.bloomberg.com and investigate which features distinguish articles with short shelf lives from those with long shelf lives. We define the shelf life of an article as the time it takes to reach 60% of the total hits it receives over its lifetime. The bag-of-words model is used to represent the content of an article as a vector of features, which are uni-, bi-, or tri-gram keywords. A thesaurus approach is applied to map words with similar meanings to a set of root words, which reduces the size of the feature space. A normalized TF-IDF (Term Frequency-Inverse Document Frequency) scheme is used to encode the feature vectors. By applying Linear Discriminant Analysis (LDA) to the articles with short and long shelf lives, we achieve precision and recall near or above 80% for both categories. Surprisingly, we also find that the sentiment of news articles has little correlation with their shelf lives.
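The shelf-life definition above can be illustrated with a minimal Python sketch. It assumes that hits are available as counts per fixed time period (for example, per hour); the abstract does not state the time granularity, and the function name `shelf_life` and the sample hit counts are hypothetical, invented only for illustration.

```python
from typing import Sequence


def shelf_life(hits_per_period: Sequence[int], fraction: float = 0.6) -> int:
    """Return the number of periods needed to accumulate `fraction` of all hits.

    `hits_per_period[i]` is the hit count in period i + 1 after publication.
    The period length (hour, day, ...) is an assumption of this sketch.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(hits_per_period)
    if total <= 0:
        raise ValueError("article has no recorded hits")

    threshold = fraction * total
    cumulative = 0
    for period, hits in enumerate(hits_per_period, start=1):
        cumulative += hits
        if cumulative >= threshold:
            return period
    # Only reachable through floating-point rounding of the threshold.
    return len(hits_per_period)


# Hypothetical hit counts, not data from the paper.
breaking_news = [500, 300, 100, 50, 30, 20]
evergreen = [50, 60, 70, 80, 90, 100, 110, 120, 130, 190]

print(shelf_life(breaking_news))  # front-loaded traffic: short shelf life
print(shelf_life(evergreen))      # traffic spread out: long shelf life
```

Under this sketch, an article whose traffic is concentrated right after publication crosses the 60% threshold early, while one with steady or growing traffic crosses it late. Those two groups correspond to the short and long shelf-life categories that the discriminant analysis separates.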
