Abstract

Background: Melanoma is one of the least common but deadliest skin cancers. It begins when the genes of a cell are damaged or fail, and identifying the genes involved is crucial for understanding melanoma tumorigenesis. Thousands of publications about human melanoma appear every year. However, manual curation of these data is costly and time-consuming, and to date the application of machine learning to gene-melanoma relation extraction from text has been severely limited by the lack of annotated resources.

Results: To overcome this lack of resources for melanoma, we exploited the information in the Melanoma Gene Database (MGDB, a manually curated database of genes involved in human melanoma) to automatically build an annotated dataset of binary relations between gene and melanoma entities occurring in PubMed abstracts. The entities were automatically annotated by state-of-the-art text-mining tools. Their annotation includes both the mention text spans and normalized concept identifiers. The relations among the entities were annotated at both concept level and mention level. The concept-level annotation was produced by using the gene information in MGDB to decide whether a relation holds between a gene concept and a melanoma concept in the abstract as a whole. The exploitability of this dataset was tested with both traditional machine learning models and neural network-based models such as BERT. The models were then used to automatically extract gene-melanoma relations from the biomedical literature. Most current models use context-aware representations of the target entities to establish relations between them. To support researchers in their experiments, we therefore generated a mention-level annotation to complement the concept-level annotation. The mention-level annotation was generated by automatically linking gene and melanoma mentions that co-occur within the sentences that, in MGDB, establish the association of the gene with melanoma.

Conclusions: This paper presents a corpus of annotated gene-melanoma relations. It also discusses experiments that show the usefulness of such a corpus for training a system capable of mining gene-melanoma relationships from the literature. Researchers can use the corpus to develop and compare their own models, and to produce results that might be integrated with existing structured knowledge databases, which in turn might facilitate medical research.
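The two annotation levels described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mention` record, the function names, the identifiers (`GENE_1`, `GENE_2`, `MELANOMA`) and the toy inputs are all hypothetical, and it assumes that entity mentions have already been recognized and normalized and that the MGDB evidence sentences have been mapped to sentence indices in the abstract.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Mention:
    concept_id: str  # normalized concept identifier (hypothetical values below)
    kind: str        # "gene" or "melanoma"
    sentence: int    # index of the sentence containing the mention
    text: str        # surface text span


def concept_level_relations(mentions, curated_genes):
    """Label every (gene concept, melanoma concept) pair found in one abstract.

    Assumption: a pair is positive when the gene is curated in MGDB.
    """
    genes = {m.concept_id for m in mentions if m.kind == "gene"}
    diseases = {m.concept_id for m in mentions if m.kind == "melanoma"}
    return {(g, d): g in curated_genes for g, d in product(genes, diseases)}


def mention_level_relations(mentions, evidence_sentences):
    """Link gene and melanoma mentions that co-occur in an evidence sentence.

    evidence_sentences maps a gene concept id to the sentence indices that
    (by assumption) correspond to the MGDB evidence for that gene.
    """
    genes = [m for m in mentions if m.kind == "gene"]
    diseases = [m for m in mentions if m.kind == "melanoma"]
    links = []
    for g, d in product(genes, diseases):
        same_sentence = g.sentence == d.sentence
        is_evidence = g.sentence in evidence_sentences.get(g.concept_id, set())
        if same_sentence and is_evidence:
            links.append((g, d))
    return links


# Toy abstract with two sentences; all values are made up for illustration.
mentions = [
    Mention("GENE_1", "gene", 0, "geneA"),
    Mention("MELANOMA", "melanoma", 0, "melanoma"),
    Mention("GENE_2", "gene", 1, "geneB"),
    Mention("MELANOMA", "melanoma", 1, "melanomas"),
]
concept = concept_level_relations(mentions, curated_genes={"GENE_1"})
links = mention_level_relations(mentions, evidence_sentences={"GENE_1": {0}})
```

In this toy case only `GENE_1` receives a positive concept-level label, and only its co-occurrence with the melanoma mention in sentence 0 is linked at mention level. The second gene co-occurs with a melanoma mention as well but stays unlinked, which mirrors the point that genes and melanoma can be mentioned together without a relation between them.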

Highlights

  • Melanoma is one of the least common but deadliest skin cancers

  • Many genes related to human melanoma have been studied, and many publications reporting new genes associated with melanoma prognosis appear every year (Fig. 1)

  • The percentage of relations in the mention-level annotation of the dataset that are found by BioBERT and by a Convolutional Neural Network (CNN) (Table 4, recall values in brackets) falls within the range of recall values obtained by the best system at SemEval-2013 Task 9 on the MedLine dataset (51%) and the DrugBank dataset (84%)


Summary

Introduction

Melanoma is one of the least common but deadliest skin cancers. It starts when the genes that control cell division and reproduction are damaged [2], which causes the cell to divide and multiply without control. Many genes related to human melanoma have been studied, and many publications reporting new genes associated with melanoma prognosis appear every year (Fig. 1). However, genes and melanoma can be mentioned together without any causal relation between them, which forces researchers to analyze a large number of documents to find the actual relation of interest.
