Abstract

The potential impact of a scientific article correlates significantly with its ability to establish novel connections between pre-existing knowledge [1-2]. Discovering hidden connections within the existing scientific literature is an interesting yet highly challenging information retrieval (IR) problem [2]. Literature-based discovery (LBD) uses computational algorithms to discover potential hidden connections between previously disconnected sets of literature [3]. Most current LBD methods focus on analyzing latent semantic features in texts, but they are usually computationally demanding and do not aim to predict novel discovery links between clusters of literature. Combining latent semantic and structural features of literature is a promising yet unexplored LBD approach that is potentially both scalable and effective. For example, incorporating structural features of Web pages has increased the effectiveness of many large-scale IR systems [4].

The bibliographic structure of scientific papers makes it possible to view a corpus of literature as a complex network of nodes (articles) and links (citation relationships) in which recognizable communities, or clusters, can be observed, each representing a distinct research field [5]. Consequently, potential hidden connections between disparate fields might be found among non-overlapping clusters that have no existing links between their members yet exhibit a high propensity to converge in the future. This work approaches LBD as a cluster link prediction problem. We view disjoint literature sets as disjoint clusters in citation networks. Our method searches for hidden connections between disjoint clusters whose member nodes show a high probability of forming future links.

To this end, we address two research problems. The first problem is to group papers into clusters of distinct research areas. We compare the accuracy of well-known community detection algorithms, such as LOUVAIN and INFOMAP [5], in detecting research field clusters from citation networks of physics literature. We evaluate the quality of these clusters using purity, Rand Index, F-measure and Normalized Mutual Information [5-6]. Since ground truth communities are usually unknown, we also propose using alternative textual coherence measures such as Jensen-Shannon divergence [7]. The second problem is to predict the future formation of links between nodes in previously disconnected clusters. We introduce a novel algorithm, Latent Domain Similarity (LDS), which uses combinations of semantic features (e.g. the distribution of technical terms in titles and abstracts) and structural features (e.g. cited references, citing articles) of two or more articles to infer shared latent domains between them. We assume that although two sets of literature may have been published separately in two seemingly unrelated fields, they may share many similar domains previously unknown to researchers in each field. The goal is to explore whether these shared latent domains correlate with the probability that previously disconnected clusters will form future citation links with each other.
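As an illustration of the textual coherence measure mentioned above, the following is a minimal sketch of Jensen-Shannon divergence computed between the term distributions of two sets of texts (for example, the titles in two clusters). The abstract does not say how the divergence is applied, so the whitespace tokenization and the function names `term_distribution` and `jensen_shannon_divergence` are assumptions made for this example and are not the authors' implementation.

```python
import math
from collections import Counter


def term_distribution(texts):
    """Relative frequency of lowercase whitespace tokens across texts.

    Assumption: naive whitespace tokenization stands in for whatever
    technical-term extraction the authors actually use.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1].

    p and q map terms to probabilities; terms missing from a mapping
    are treated as having probability zero.
    """
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        # KL(A || M); zero-probability terms contribute nothing
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy clusters, invented purely for illustration
cluster_a = ["quantum spin chains", "spin glass dynamics"]
cluster_b = ["protein folding dynamics", "protein network motifs"]

divergence = jensen_shannon_divergence(
    term_distribution(cluster_a), term_distribution(cluster_b)
)
print(f"JSD between toy clusters: {divergence:.3f}")
```

Under this reading, a value near 0 means the two sets of texts use nearly the same vocabulary, while a value near 1 means their vocabularies barely overlap. Whether low divergence should count as high coherence within a cluster or as similarity between clusters depends on how the measure is used in the full paper.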
