Abstract

This article introduces the Variome Annotation Schema, a schema that aims to capture the core concepts and relations relevant to cataloguing and interpreting human genetic variation and its relationship to disease, as described in the published literature. The schema was inspired by the needs of the database curators of the International Society for Gastrointestinal Hereditary Tumours (InSiGHT) database, but is intended to have application to genetic variation information in a range of diseases. The schema has been applied to a small corpus of full text journal publications on the subject of inherited colorectal cancer. We show that the inter-annotator agreement on annotation of this corpus ranges from 0.78 to 0.95 F-score across different entity types when exact matching is measured, and improves to a minimum F-score of 0.87 when boundary matching is relaxed. Relations show more variability in agreement, but several are reliable, with the highest, cohort-has-size, reaching 0.90 F-score. We also explore the relevance of the schema to the InSiGHT database curation process. The schema and the corpus represent an important new resource for the development of text mining solutions that address relationships among patient cohorts, disease and genetic variation, and therefore, we also discuss the role text mining might play in the curation of information related to the human variome. The corpus is available at http://opennicta.com/home/health/variome.
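The difference between exact and relaxed boundary matching can be illustrated with a minimal sketch. The matching criteria used for the corpus are not spelled out here, so everything below is an assumption: annotations are modelled as character-offset spans with an entity label, "relaxed" is taken to mean any overlap between same-label spans, and the names `Span`, `exact_match`, `relaxed_match` and `agreement_f_score`, as well as the example labels and offsets, are hypothetical.

```python
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # entity type (hypothetical labels are used below)


def exact_match(a: Span, b: Span) -> bool:
    # Same label and identical boundaries.
    return a == b


def relaxed_match(a: Span, b: Span) -> bool:
    # Assumption: "relaxed" means same label and any character overlap.
    return a.label == b.label and a.start < b.end and b.start < a.end


def agreement_f_score(
    ann_a: List[Span],
    ann_b: List[Span],
    match: Callable[[Span, Span], bool],
) -> float:
    # Treat annotator B as the reference: precision is the share of A's
    # spans matched by some span of B, recall is the share of B's spans
    # matched by some span of A.
    if not ann_a or not ann_b:
        return 0.0
    precision = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    recall = sum(any(match(a, b) for a in ann_a) for b in ann_b) / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations from two annotators over the same text.
annotator_a = [Span(0, 4, "gene"), Span(10, 18, "mutation"), Span(30, 45, "disease")]
annotator_b = [Span(0, 4, "gene"), Span(10, 20, "mutation"), Span(50, 55, "cohort")]

exact = agreement_f_score(annotator_a, annotator_b, exact_match)
relaxed = agreement_f_score(annotator_a, annotator_b, relaxed_match)
print(f"exact: {exact:.2f}, relaxed: {relaxed:.2f}")
```

In this toy example the two annotators disagree only on the right boundary of the mutation span, so that pair counts as a miss under exact matching and as a hit under relaxed matching, which is the kind of effect that raises agreement when boundaries are relaxed.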

Highlights

  • The identification of associations between human genetic variation and disease phenotypes is a major thrust of current biomedical research

  • We introduce a schema for annotation of the biomedical literature that targets the core information relevant to genetic variation and lays the foundation for text mining of this information

  • International Society for Gastrointestinal Hereditary Tumours (InSiGHT) maintains a database of genetic variants for both of these syndromes, but for this work, we focus on Lynch Syndrome, which is caused by mutations in the mismatch repair (MMR) genes

Summary

Introduction

The identification of associations between human genetic variation and disease phenotypes is a major thrust of current biomedical research. Such associations not only facilitate our understanding of the genetic basis for disease, but will also open the door to personalized medicine, where treatment of patients can be tailored to their unique genetic characteristics. Other work extends the methods to relate such gene/mutation pairs to a specific disease [10]. These approaches require annotated textual data for training and evaluation of text mining systems. InSiGHT maintains a database of genetic variants for both of these syndromes, but for this work, we focus on Lynch Syndrome, which is caused by mutations in the mismatch repair (MMR) genes. The original database was established in the 1990s, with mutations reported by individual laboratories [12].

