Large-scale structure of a network of co-occurring MeSH terms: statistical analysis of macroscopic properties.

Andrej Kastrin, Thomas C Rindflesch, Dimitar Hristovski, Alejandro Raul Hernandez Montoya

doi:10.1371/journal.pone.0102188

Open Access

https://doi.org/10.1371/journal.pone.0102188

Abstract

Concept associations can be represented by a network that consists of a set of nodes representing concepts and a set of edges representing their relationships. Complex networks exhibit some common topological features, including small diameter, high degree of clustering, power-law degree distribution, and modularity. We investigated the topological properties of a network constructed from co-occurrences between MeSH descriptors in the MEDLINE database. We conducted the analysis on two networks, one constructed from all MeSH descriptors and another using only major descriptors. Network reduction was performed using Pearson's chi-square test for independence. To characterize the topological properties of the networks we used several measures, including diameter, average path length, clustering coefficient, and degree distribution. For the full MeSH network the average path length was 1.95 edges, with a diameter of three edges and a clustering coefficient of 0.26. The Kolmogorov-Smirnov test rejected the power law as a plausible model for the degree distribution. For the major MeSH network the average path length was 2.63 edges, with a diameter of seven edges and a clustering coefficient of 0.15. The Kolmogorov-Smirnov test failed to reject the power law as a plausible model. The power-law exponent was 5.07. In both networks, nodes with a lower degree exhibited higher clustering than those with a higher degree. After a simulated attack, in which we removed the 10% of nodes with the highest degrees, the giant component of each of the two networks contained about 90% of all nodes. Because of its small average path length and high degree of clustering, the MeSH network is small-world. A power-law distribution is not a plausible model for the degree distribution. The network is highly modular, highly resistant to targeted and random attack, and shows minimal disassortativity.
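The measures named in the abstract can be illustrated on a toy graph. The following is a minimal sketch in plain Python, not the authors' implementation: the five-node graph, the node labels, and all function names are hypothetical, and the giant component is assumed here to be reported as a share of the original node count.

```python
from collections import deque

def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(adj):
    """Average shortest path length and diameter over all connected pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return (total / pairs if pairs else 0.0), diameter

def top_degree_nodes(adj, fraction):
    """The `fraction` of nodes with the highest degree (targeted attack set)."""
    n_remove = int(len(adj) * fraction)
    ranked = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    return frozenset(ranked[:n_remove])

def giant_component_fraction(adj, removed=frozenset()):
    """Largest surviving component as a share of the ORIGINAL node count
    (an assumption of this sketch)."""
    seen, best = set(removed), 0
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:  # BFS restricted to nodes that were not removed
            u = queue.popleft()
            for v in adj[u]:
                if v not in removed and v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best / len(adj)

# Hypothetical toy graph: a triangle A-B-C with a tail C-D-E.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

mean_clustering = sum(local_clustering(toy, n) for n in toy) / len(toy)
avg_path, diameter = path_stats(toy)
attacked = top_degree_nodes(toy, 0.2)          # removes the single hub "C"
survivors = giant_component_fraction(toy, attacked)
```

On a graph this small, removing the hub splits the network in two; the abstract reports the opposite behaviour for the MeSH networks, where the giant component stays close to 90% of nodes after the top 10% by degree are removed.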

Highlights

The proliferation of scientific knowledge during the past decades makes it difficult even for domain experts to keep abreast of the relevant information in their specific field of interest.
We characterize the statistical properties of the Medical Subject Headings (MeSH) networks.
Our experiments were conducted on two types of co-occurrence networks: (i) the full network, which consists of all MeSH descriptors in each MEDLINE citation, and (ii) the reduced network, which contains only major MeSH terms.

Summary

Introduction

The proliferation of scientific knowledge during the past decades makes it difficult even for domain experts to keep abreast of the relevant information in their specific field of interest. At the time of this writing, the MEDLINE database [1] contains over 23 million bibliographic citations, with a continuous growth rate of about 2,000–4,000 citations per day. Associations between entities based on the co-occurrence of biomedical terms, such as chemical substances, biological processes, diseases, or genes, constitute an important part of knowledge representation. A simple link between two concepts can be further weighted by the number of times a concept is found in a document or by the closeness of one concept to another within a sentence [4]. Literature mining technologies complement information extracted from structured biomedical sources (e.g., Gene Ontology) by providing researchers with more relevant and interpretable knowledge. A plethora of applications have been developed that exploit co-occurrence for mining interesting patterns in biomedical resources (e.g., BITOLA [5], iHOP [6], AliBaba [7], EBIMed [8], FACTA [9], PLAN2L [10], STRING [11], LAITOR [12]).
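As an illustration of the co-occurrence idea described above, the sketch below counts how many citations mention each pair of terms; the count can then serve as an edge weight. It is a minimal example under stated assumptions: the three citations and the function name `cooccurrence_counts` are hypothetical, and real MEDLINE records would first need to be parsed into lists of descriptors.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(citations):
    """Count the citations in which each unordered pair of terms appears."""
    pair_counts = Counter()
    for terms in citations:
        # set() drops repeated terms within one citation; sorting makes
        # (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(terms)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Hypothetical toy citations, each given as a list of descriptor names.
citations = [
    ["Humans", "Neoplasms", "Mutation"],
    ["Humans", "Neoplasms"],
    ["Humans", "Mutation", "Genes"],
]

counts = cooccurrence_counts(citations)
# Each key is a candidate edge; its value is the co-occurrence frequency.
edges = sorted(counts.items())
```

Counting each pair once per citation treats co-occurrence as document-level; weighting by within-sentence closeness, as mentioned above, would need positional information that this sketch does not model.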
