Chemical-induced disease (CID) relation extraction is pivotal to understanding biological processes. A CID relation between a chemical entity and a disease entity may be extracted either from a single sentence or from two or more adjacent sentences. We use 'intrasentence level' to refer to mentions of the two entities in the same sentence and 'intersentence level' to refer to mentions of these entities in two or more adjacent sentences. This study proposes a three-phase architecture for extracting CID relations from biomedical literature that considers both sentence levels and, additionally, their combination, which we call the 'joint level'. In phase 1, we construct relation instances at the intra- and intersentence levels, which are subsequently combined to form the joint level. In phase 2, we extract features for each individual relation instance at the three levels. At each of these levels, we train three classifier models, each consisting of a combination of two classifiers. The models are trained on the training dataset and then used to classify the CID relation instances in the test dataset. Phase 3 consists of two steps: in step 1, the classifier outputs from the intra- and intersentence levels are combined; in step 2, the results from step 1 are combined with the results from the classifier trained at the joint level, using a prediction probability-based voting algorithm to determine the final result. Using the BioCreative V corpus for validation, we obtain results that outperform all the state-of-the-art systems for CID relation extraction on the standard chemical-disease relation corpus.
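
The two-step combination in phase 3 can be illustrated with a minimal Python sketch. The abstract does not specify the exact combination or voting rules, so the ones used here are assumptions for illustration only: step 1 keeps the higher positive-class probability for each chemical-disease pair, and step 2 averages the available probabilities and applies a threshold. The names `merge_levels`, `probability_vote`, and `threshold` are hypothetical and are not taken from the study.

```python
from typing import Dict, Tuple

# A candidate relation is keyed by (chemical_id, disease_id); the value is the
# classifier's predicted probability that the pair is a true CID relation.
Pair = Tuple[str, str]
Scores = Dict[Pair, float]


def merge_levels(intra: Scores, inter: Scores) -> Scores:
    """Step 1 (assumed rule): keep the higher probability per pair."""
    merged = dict(intra)
    for pair, prob in inter.items():
        merged[pair] = max(prob, merged.get(pair, 0.0))
    return merged


def probability_vote(step1: Scores, joint: Scores,
                     threshold: float = 0.5) -> Dict[Pair, bool]:
    """Step 2 (assumed rule): average the available probabilities, then threshold."""
    decisions = {}
    for pair in step1.keys() | joint.keys():
        probs = [scores[pair] for scores in (step1, joint) if pair in scores]
        decisions[pair] = sum(probs) / len(probs) >= threshold
    return decisions


# Toy probabilities, invented for illustration (not results from the study).
intra = {("C1", "D1"): 0.9, ("C2", "D2"): 0.25}
inter = {("C2", "D2"): 0.5, ("C3", "D3"): 0.25}
joint = {("C1", "D1"): 0.5, ("C2", "D2"): 0.75, ("C3", "D3"): 0.25}

step1 = merge_levels(intra, inter)
final = probability_vote(step1, joint)
```

In this toy run, the pair ("C2", "D2") is rejected at the intrasentence level alone but accepted once the intersentence and joint-level probabilities are taken into account, which is the kind of recovery a multi-level combination is meant to allow.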