Abstract
An ongoing project explores the extent to which artificial intelligence (AI), specifically natural language processing and semantic reasoning, can be used to facilitate the study of science by deploying software agents equipped with natural language understanding capabilities to read scholarly publications on the web. The knowledge extracted by these AI agents is organized into a heterogeneous graph, called Microsoft Academic Graph (MAG), in which the nodes represent the entities engaging in scholarly communications and the edges represent the relationships among them. The frequently updated data set and a few software tools central to the underlying AI components are distributed under an open data license for research and commercial applications. This paper describes the design, schema, and technical and business motivations behind MAG and elaborates on how MAG can be used in analytics, search, and recommendation scenarios. It also discusses how AI plays an important role in avoiding the various biases and human-induced errors found in other data sets, and how the technologies can be further improved in the future.
Highlights
The field of science that studies the structure and the evolution of science has been established firmly on quantitative methodologies
The knowledge extracted by artificial intelligence (AI) agents is organized into a heterogeneous graph, called Microsoft Academic Graph (MAG), in which the nodes represent the entities engaging in scholarly communications and the edges represent the relationships among them
MAG shows the growth rate in annual publication output has been on an exponential pace for almost two centuries and shows no sign of abating
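To make the notion of a heterogeneous graph concrete, here is a minimal sketch of typed nodes connected by typed edges. It is not MAG's schema or API: the class names, the node types ("paper", "author"), and the edge types ("writes", "cites") are illustrative assumptions chosen only to show the idea of entities and relationships of different kinds living in one graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A typed entity. The type names used below are hypothetical."""
    node_type: str  # e.g. "paper" or "author" (illustrative, not MAG's schema)
    node_id: str


class HeterogeneousGraph:
    """Directed graph whose nodes and edges both carry a type label."""

    def __init__(self):
        # (source node, edge type) -> set of target nodes
        self._forward = defaultdict(set)
        # (target node, edge type) -> set of source nodes
        self._reverse = defaultdict(set)

    def add_edge(self, source, edge_type, target):
        self._forward[(source, edge_type)].add(target)
        self._reverse[(target, edge_type)].add(source)

    def neighbors(self, source, edge_type):
        """Nodes reached from `source` along edges of `edge_type`."""
        return set(self._forward.get((source, edge_type), ()))

    def sources(self, target, edge_type):
        """Nodes that point at `target` along edges of `edge_type`."""
        return set(self._reverse.get((target, edge_type), ()))


# Toy data, invented for illustration only.
graph = HeterogeneousGraph()
paper_a = Node("paper", "P1")
paper_b = Node("paper", "P2")
author = Node("author", "A1")

graph.add_edge(author, "writes", paper_a)
graph.add_edge(author, "writes", paper_b)
graph.add_edge(paper_b, "cites", paper_a)

# Number of citing papers for paper_a in the toy graph.
citation_count = len(graph.sources(paper_a, "cites"))
```

Keeping a reverse index alongside the forward one lets a question such as "which papers cite this one" be answered without scanning every edge, which is the kind of lookup analytics over citation relationships tends to need.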
Summary
The field of science that studies the structure and the evolution of science has been established firmly on quantitative methodologies. The move allows MAS to replicate the success of Google Scholar, which utilizes the massive document index from a web search engine to achieve comprehensive coverage of contemporary scholarly materials, many of which are not published or distributed through traditional channels and are not assigned DOIs. Compared with an index size of 40 million in 2014, the web crawl approach has enabled MAS to improve its coverage dramatically to include, by the end of November 2019, more than 225 million publications with more than 2 billion unique citations, growing at more than 1 million new publications a month in recent years. This improved coverage is key to alleviating concerns about sampling biases in studies using incomplete data sets. To understand the potential of MAG and why it is organized in its current form, a deeper understanding of the technologies behind its creation is warranted and provided below.