Standard Corpus Research Articles

The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate the progress of text mining in facilitating integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system (namely UET-CAM) that participated in the BioCreative V CDR track. The original UET-CAM system was ranked fourth among 18 participating systems by the BioCreative CDR track committee. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The coreference resolution module used a multi-pass sieve to extend entity recall. In this article, the UET-CAM system was improved by adding a ‘silver’ CID corpus to train the prediction model. This silver standard corpus of more than 50 thousand sentences was built automatically from the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system could reach state-of-the-art performance, with an F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits of both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%).

Database URL: SilverCID, the silver-standard corpus for CID relation extraction, is freely available online at https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).
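The abstract does not say how the silver corpus was derived from CTD. One common way to build such a corpus is distant supervision: a sentence that mentions a chemical and a disease is labeled positive when that pair appears in a curated database. The sketch below illustrates that general idea only; the pair set, the function name `silver_label`, and the plain substring matching are all hypothetical stand-ins, not the UET-CAM procedure.

```python
from itertools import product

# Hypothetical chemical-induced-disease pairs, standing in for curated CTD records.
KNOWN_CID_PAIRS = {("lithium", "tremor"), ("cisplatin", "nephrotoxicity")}


def silver_label(sentence, chemicals, diseases, known_pairs=KNOWN_CID_PAIRS):
    """Return (chemical, disease, label) for every pair co-occurring in the sentence.

    A pair is labeled True when it is listed in known_pairs, otherwise False.
    Matching is naive lowercase substring search (an assumption for brevity).
    """
    text = sentence.lower()
    present_chemicals = [c for c in chemicals if c in text]
    present_diseases = [d for d in diseases if d in text]
    return [
        (c, d, (c, d) in known_pairs)
        for c, d in product(present_chemicals, present_diseases)
    ]


examples = silver_label(
    "Lithium therapy was associated with tremor and nausea.",
    chemicals=["lithium", "cisplatin"],
    diseases=["tremor", "nausea"],
)
for chemical, disease, label in examples:
    print(chemical, disease, label)
```

Labels produced this way are noisy, since co-occurrence in a sentence does not guarantee that the sentence itself expresses the relation; that noise is why such a corpus is called silver rather than gold.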

Motivation. Finding relevant scientific literature is one of the essential tasks researchers face on a daily basis. Digital libraries and web information retrieval techniques provide rapid access to a vast amount of scientific literature. However, no further automated support is available that would enable fine-grained access to the knowledge ‘stored’ in these documents. The emerging domain of Semantic Publishing aims at making scientific knowledge accessible to both humans and machines, by adding semantic annotations to content, such as a publication’s contributions, methods, or application domains. However, despite the promises of better knowledge access, the manual annotation of existing research literature is prohibitively expensive for widespread adoption. We argue that a novel combination of three distinct methods can significantly advance this vision in a fully automated way: (i) Natural Language Processing (NLP) for Rhetorical Entity (RE) detection; (ii) Named Entity (NE) recognition based on the Linked Open Data (LOD) cloud; and (iii) automatic knowledge base construction for both NEs and REs using semantic web ontologies that interconnect entities in documents with the machine-readable LOD cloud.

Results. We present a complete workflow to transform scientific literature into a semantic knowledge base, based on the W3C standards RDF and RDFS. A text mining pipeline, implemented based on the GATE framework, automatically extracts rhetorical entities of type Claims and Contributions from full-text scientific literature. These REs are further enriched with named entities, represented as URIs to the linked open data cloud, by integrating the DBpedia Spotlight tool into our workflow. Text mining results are stored in a knowledge base through a flexible export process that provides for a dynamic mapping of semantic annotations to LOD vocabularies through rules stored in the knowledge base. We created a gold standard corpus from computer science conference proceedings and journal articles, where Claim and Contribution sentences are manually annotated with their respective types using LOD URIs. The performance of the RE detection phase is evaluated against this corpus, where it achieves an average F-measure of 0.73. We further demonstrate a number of semantic queries that show how the generated knowledge base can provide support for numerous use cases in managing scientific literature.

Availability. All software presented in this paper is available under open source licenses at http://www.semanticsoftware.info/semantic-scientific-literature-peerj-2015-supplements. Development releases of individual components are additionally available on our GitHub page at https://github.com/SemanticSoftwareLab.
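To make the idea of exporting annotated sentences to an RDF knowledge base concrete, here is a minimal sketch that serializes one rhetorical entity and its linked named entities as N-Triples lines. The `http://example.org/` namespace and the property names `hasText` and `mentions` are hypothetical placeholders; the abstract does not name the vocabularies the authors actually map to, and their export is rule-driven rather than hard-coded like this.

```python
# rdf:type is a real W3C URI; everything under EX is a hypothetical namespace.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"


def uri(value):
    """Wrap an absolute URI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and double quotes only."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def sentence_to_triples(sentence_id, text, re_type, entity_uris):
    """Build N-Triples lines for one annotated sentence.

    re_type is the rhetorical entity type (e.g. "Claim" or "Contribution");
    entity_uris are LOD URIs of named entities found in the sentence.
    """
    subject = uri(EX + "sentence/" + sentence_id)
    triples = [
        f"{subject} {uri(RDF_TYPE)} {uri(EX + 'vocab/' + re_type)} .",
        f"{subject} {uri(EX + 'vocab/hasText')} {literal(text)} .",
    ]
    triples += [
        f"{subject} {uri(EX + 'vocab/mentions')} {uri(entity)} ."
        for entity in entity_uris
    ]
    return triples


triples = sentence_to_triples(
    "s1",
    "We present a complete workflow.",
    "Contribution",
    ["http://dbpedia.org/resource/Workflow"],
)
print("\n".join(triples))
```

Once sentences are stored this way, a query for all subjects typed as a contribution that mention a given DBpedia resource becomes a simple graph pattern, which is the kind of semantic query the abstract refers to.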

Articles published on Standard Corpus

Deep learning with word embeddings improves biomedical named entity recognition

A New Image Mining Approach for Detecting Micro-Calcification in Digital Mammograms

Argumentation Mining in User-Generated Web Discourse

Popular vs. Professional Aspects of Economics Texts in English

Sentence selection with neural networks using string kernels

Evaluating Urdu to Arabic Machine Translation Tools

Study of sub-word acoustical models for Kannada isolated word recognition system

COUNTER: corpus of Urdu news text reuse

A spam filtering multi-objective optimization study covering parsimony maximization and three-way classification

Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction.

Cloud-Based Evaluation of Anatomical Structure Segmentation and Landmark Detection Algorithms: VISCERAL Anatomy Benchmarks.

NUWT: JAWI-SPECIFIC BUCKWALTER CORPUS FOR MALAY WORD TOKENIZATION

ZeuScansion: A tool for scansion of English poetry

The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis

Local binary pattern based face recognition with automatically detected fiducial points

How well does Google work with Persian documents?

Boosting Accuracy of Classical Machine Learning Antispam Classifiers in Real Scenarios by Applying Rough Set Theory

Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.

Semantic representation of scientific literature: bringing claims, contributions and named entities onto the Linked Open Data cloud

A CLUSTERED SEMANTIC GRAPH APPROACH FOR MULTI-DOCUMENT ABSTRACTIVE SUMMARIZATION
