Abstract

Target identification and prioritisation are prominent first steps in modern drug discovery. Traditionally, individual scientists have used their expertise to manually interpret scientific literature and prioritise opportunities. However, increasing publication rates and the wider routine coverage of human genes by omic-scale research make it difficult to maintain meaningful overviews from which to identify promising new trends. Here we propose an automated yet flexible pipeline that identifies trends in the scientific corpus that align with the specific interests of a researcher and facilitates an initial prioritisation of opportunities. In a procedure based on co-citation networks and machine learning, genes and diseases are first parsed from PubMed articles, together with publication date and supporting information, using a novel named entity recognition system. Recurrent neural networks are then trained to predict the publication dynamics of all human genes. For a user-defined therapeutic focus, genes generating more publications or citations are identified as high-interest targets. We also use topic detection routines to help understand why a gene is trendy and implement a system that proposes the most prominent review articles for a potential target. This TrendyGenes pipeline detects emerging targets and pathways and provides a new way to explore the literature for individual researchers, pharmaceutical companies and funding agencies.

Highlights

  • Target identification and prioritisation are prominent first steps in modern drug discovery

  • Evidence is emerging that machine learning can be used to speed up and reduce costs at all stages of drug discovery[5,6]: drug repurposing[7,8], clinical trials[9,10], de novo drug design[11,12,13,14,15,16,17,18,19,20], and target-disease associations[21,22,23,24,25]

  • We present TrendyGenes as a first attempt to (i) establish a systematic analysis of contemporary topics associated with human genes and diseases, (ii) develop an alert system for emerging targets and trends in the scientific literature across the human protein-coding genome, and (iii) use topic modelling to rapidly generate timelines of phrases that help explain why these genes are trendy


Summary

Introduction

Target identification and prioritisation are prominent first steps in modern drug discovery. We used topic detection routines to help understand why a gene is trendy and implemented a system to propose the most prominent review articles for a potential target. This TrendyGenes pipeline detects emerging targets and pathways and provides a new way to explore the literature for individual researchers, pharmaceutical companies and funding agencies. We propose a new disambiguation algorithm based on co-citation networks and natural language processing to obtain accurate publication dynamics for every coding gene in the human genome. These time-series data were used to train recurrent neural networks (RNNs) on historical data and to predict the state of the literature in recent years. The aims of this paper are fourfold: (i) to unambiguously detect genes and diseases within articles with a novel named entity recogniser, (ii) to generate a ranking of genes and diseases based on a novel metric that defines their trendiness, (iii) to generate an automatic pipeline to analyse why these biological entities may be trendy, and (iv) to generate a recommendation system that suggests which articles to read in order to maximise information coverage in subnetworks.
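
To make the idea of per-gene publication dynamics concrete, the following is a minimal Python sketch under stated assumptions. It is not the TrendyGenes implementation: the paper's trendiness metric and its RNN forecasts are not specified in this summary, so a simple smoothed growth ratio between two time windows stands in for them. The function names, the gene labels and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def yearly_counts(records):
    """Build {gene: Counter({year: n_publications})} from (gene, year) pairs."""
    series = defaultdict(Counter)
    for gene, year in records:
        series[gene][year] += 1
    return series


def growth_score(counts, recent_years, baseline_years):
    """Illustrative stand-in for a trendiness metric (an assumption, not the
    paper's definition): mean yearly publications in the recent window divided
    by the mean in the baseline window, with add-one smoothing so that genes
    with an empty baseline do not cause a division by zero."""
    recent = sum(counts[y] for y in recent_years) / len(recent_years)
    baseline = sum(counts[y] for y in baseline_years) / len(baseline_years)
    return (recent + 1.0) / (baseline + 1.0)


def rank_genes(records, recent_years, baseline_years):
    """Return (gene, score) pairs sorted from most to least 'trendy'."""
    series = yearly_counts(records)
    scores = {
        gene: growth_score(counts, recent_years, baseline_years)
        for gene, counts in series.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one (gene, publication year) pair per tagged article.
records = (
    [("GENE_A", 2015), ("GENE_A", 2016)]
    + [("GENE_A", 2019)] * 2
    + [("GENE_A", 2020)] * 3
    + [("GENE_B", 2015)] * 2
    + [("GENE_B", 2016)] * 2
    + [("GENE_B", 2019)]
)

ranking = rank_genes(records, recent_years=[2019, 2020], baseline_years=[2015, 2016])
for gene, score in ranking:
    print(f"{gene}: {score:.2f}")
```

In this toy example the gene whose yearly publication count is rising ranks above the one whose count is falling. A pipeline such as the one described would replace the growth ratio with predictions from a trained model and restrict the records to a user-defined therapeutic focus.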

