Semi-Supervised Lexicon-Aware Embedding for News Article Time Estimation

Usman Ahmed,Jerry Chun-Wei Lin,Gautam Srivastava,Unil Yun

doi:10.1145/3592604

Abstract

In the information retrieval community, Temporal Information Retrieval (TIR) has become increasingly popular. Documents focused on the time surrounding their publication are more likely to be accurate and contain information relevant to the reader. In this study, we explore the inverted pyramid paradigm by extracting temporal expressions from news documents, standardizing their values, and evaluating them based on their position within the text. We present a lexicon expansion method that employs WordNet as input. This approach enhances the lexicon by grouping words with similar meanings, potentially improving the accuracy of event detection algorithms. Additionally, this process can introduce new words and phrases to the lexicon, expanding the vocabulary. Using each tagged dataset, a classifier is trained with a pre-trained network. A pool of unlabeled data are processed, and high-confidence pseudo-labels are assigned. Pseudo-labels are generated by leveraging the partially trained model and the original labelled data. As the classifier predicts the correct label for a data sample, the pseudo-labels of other data samples are updated, and vice versa. At the end of this process, the predictions from different matching classifiers are combined. It takes several rounds to label the unlabeled inputs using this method. To evaluate the proposed solutions, we conducted experiments on 4,500 online news articles relevant to temporal retrieval. LSTM, BiLSTM, and BERT models with and without lexicon expansion were assessed based on log loss and relative divergence of entropy. A jointly trained semi-supervised learning model achieved a mean KL divergence of 0.89, an F1 score of 0.74 for temporal events, and 0.63 for non-temporal events. Besides alleviating data sparsity issues and enabling the training of more complex networks, this technique can also serve as an alternative to data augmentation methods.

Full Text