Abstract
Treebanks are a linguistic resource: a large database in which the morphological, syntactic and lexical information for each sentence has been explicitly marked. The critical need for treebanks in various NLP activities (research and application) is well known. This also implies that treebanks need to be as error-free as possible. However, manual validation of a treebank is very costly in terms of both time and money. This paper describes an approach to automatically detect errors in a treebank after a complete manual annotation. Over and above improving an earlier error detection tool (Ambati et al., 2011) for a Hindi treebank, we also present a user study to show that our system reduces the validation time significantly while detecting 81.49% of the errors at the dependency level.
Highlights
Treebanks have proved to be a crucial resource for NLP research and for developing solutions for various NLP-related applications
A treebank should be error-free, considering its role in providing appropriate linguistic knowledge
The PBSM proposed by Ambati et al. (2011) extracts contextual features, trains on gold-standard data validated by linguistic experts, builds a model with the maximum entropy (MAXENT) classification algorithm, runs the model on the test data, and obtains the probabilities of all possible dependency tags
Summary
Treebanks have proved to be a crucial resource for NLP research and for developing solutions for various NLP-related applications. Automatic error detection tools are required to reduce the time of validation. A semi-automatic procedure involves annotating the grammatical information using tools. The output of these tools is then manually checked and corrected. Both of these procedures may leave errors in the treebank on the first attempt. We improve over the mechanism proposed by Ambati et al. (2011) to detect dependency annotation errors. For more details on the types of errors that we extract from the Hindi dependency treebank, please refer to our previous work (Ambati et al., 2011).
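The probability-based step attributed to Ambati et al. (2011) (contextual features, a maximum entropy model trained on validated data, and a probability for every possible dependency tag) can be illustrated with a small sketch. This is not the authors' implementation. The feature names, the toy tag set (`k1`, `k2`, `k7p`), the plain gradient-ascent training loop, and the rule of flagging an annotation when its tag probability falls below a `threshold` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def predict_proba(weights, tags, feats):
    """Return P(tag | feats) for every tag under a softmax (maxent) model."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def train_maxent(samples, epochs=200, lr=0.5):
    """Fit a maxent model by batch gradient ascent on the log-likelihood.

    samples: list of (feature_list, gold_tag) pairs with string features.
    """
    tags = sorted({tag for _, tag in samples})
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, tags, feats)
            for tag in tags:
                # Observed count minus expected count for each (feature, tag).
                delta = (1.0 if tag == gold else 0.0) - probs[tag]
                for f in feats:
                    grad[(f, tag)] += delta
        for key, g in grad.items():
            weights[key] += lr * g / len(samples)
    return weights, tags


def flag_suspects(weights, tags, annotated, threshold=0.3):
    """Flag annotations whose tag gets a low model probability.

    The threshold rule is an assumption of this sketch.
    """
    suspects = []
    for feats, tag in annotated:
        p = predict_proba(weights, tags, feats).get(tag, 0.0)
        if p < threshold:
            suspects.append((feats, tag, p))
    return suspects


# Hypothetical gold-standard training data (toy features and tags).
TRAIN = [
    (["pos=NN", "case=ne", "head=VM"], "k1"),
    (["pos=NN", "case=ko", "head=VM"], "k2"),
    (["pos=PRP", "case=ne", "head=VM"], "k1"),
    (["pos=PRP", "case=ko", "head=VM"], "k2"),
    (["pos=NN", "case=meM", "head=VM"], "k7p"),
    (["pos=NNP", "case=meM", "head=VM"], "k7p"),
]

# Hypothetical annotations to validate; the second one disagrees with TRAIN.
ANNOTATED = [
    (["pos=NN", "case=ne", "head=VM"], "k1"),
    (["pos=NN", "case=ko", "head=VM"], "k1"),
]

weights, tags = train_maxent(TRAIN)
suspects = flag_suspects(weights, tags, ANNOTATED)
for feats, tag, p in suspects:
    print(f"suspect: tag={tag} p={p:.3f} feats={feats}")
```

In this sketch only the flagged items would be passed to a human validator, which is the sense in which such a tool could reduce validation time. A real system would use richer contextual features and a tuned threshold.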