A semi-automatic method was proposed for building PASBio+, a biomedical semantic role corpus. The corpus was annotated with predicate-argument structures, which capture the main content of a sentence. Because more than 86% of the arguments in the biomedical domain differed significantly from those in the general domain, the corpus was built on top of 317 labeled sentences from PASBio, an argument frameset designed specifically for the biomedical domain. From these sentences, the semi-automatic method generated an additional 87 sentences, which were manually annotated by our experts. Further instances were generated with the virtual example method, a flexible data augmentation technique that has been applied successfully to a wide range of tasks. Specifically, two sequential rules (the swap rule and the replace rule) were proposed to ensure that the generated sentences always remained correct with respect to biomedical knowledge. PASBio+ was also augmented with grammatical variants of the original sentences so that the corpus covers a wide range of natural writing styles. In addition, from the very beginning, PASBio's original sentence set was enriched with additional sentences selected from the GREC biomedical corpus. As a result, a corpus of 2,500 fully labeled sentences with a uniform frequency distribution among predicates was obtained, thereby eliminating the data sparsity problem and helping to limit overfitting in machine learning. The experimental results showed that a semantic role labeling model trained on the augmented corpus achieved an F score 52.2% and 22.5% higher than models trained on the original PASBio corpus and on a general-domain corpus, respectively.
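The abstract does not define the swap rule or the replace rule, so the following is only a minimal sketch of how sequential, knowledge-preserving augmentation rules of this kind could look. It assumes (hypothetically) that the swap rule reorders coordinated fillers inside one argument and that the replace rule substitutes a filler with a known alias of the same entity; the `Instance` class, the `aliases` table, and all entity names are illustrative placeholders, not part of PASBio+.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One labeled sentence (hypothetical representation, not the PASBio+ format)."""
    predicate: str
    template: str   # sentence with one slot per role, e.g. "{Arg0} encodes {Arg1}"
    args: dict      # role name -> list of coordinated filler strings

    def render(self) -> str:
        # Join coordinated fillers with "and" and fill the role slots.
        filled = {role: " and ".join(fillers) for role, fillers in self.args.items()}
        return self.template.format(**filled)


def swap_rule(inst):
    """Assumed swap rule: reverse the order of coordinated fillers in one argument."""
    out = []
    for role, fillers in inst.args.items():
        if len(fillers) > 1:
            new_args = dict(inst.args)
            new_args[role] = fillers[::-1]
            out.append(Instance(inst.predicate, inst.template, new_args))
    return out


def replace_rule(inst, aliases):
    """Assumed replace rule: substitute a filler with an alias of the same entity."""
    out = []
    for role, fillers in inst.args.items():
        for i, filler in enumerate(fillers):
            for alias in aliases.get(filler, []):
                new_args = dict(inst.args)
                new_args[role] = fillers[:i] + [alias] + fillers[i + 1:]
                out.append(Instance(inst.predicate, inst.template, new_args))
    return out


def augment(inst, aliases):
    """Apply the rules sequentially: swap first, then replace on every result."""
    swapped = swap_rule(inst)
    virtual = list(swapped)
    for candidate in [inst] + swapped:
        virtual.extend(replace_rule(candidate, aliases))
    return virtual


# Placeholder entities; no real biomedical fact is implied.
seed = Instance("encode", "{Arg0} encodes {Arg1}",
                {"Arg0": ["GENE_A"], "Arg1": ["protein X", "protein Y"]})
aliases = {"GENE_A": ["GA1"]}

for virtual_instance in augment(seed, aliases):
    print(virtual_instance.render())
```

Because both assumed rules only reorder or rename existing fillers, every generated sentence keeps the role labels of its seed sentence, so the virtual examples need no new manual annotation in this sketch.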