Abstract

In recent years, various types of tagged corpora have been constructed, and much research has been done using them. However, tagged corpora contain errors, which impedes the progress of research, so correcting these errors is an important research issue. In this study we investigate the correction of such errors, which we call corpus correction. We applied corpus correction to a verb modality corpus for machine translation, using two machine-learning methods: the maximum-entropy method and the decision-list method. We compared several corpus-correction methods experimentally and used a statistical test to determine which was most effective. We obtained several noteworthy findings: (1) Precision was almost the same for detection and for correction, so it is more convenient to perform both detection and correction than detection only. (2) In general, the maximum-entropy method worked better than the decision-list method, but the two methods had almost the same precision for the top 50 pieces of extracted data when closed data was used. (3) In terms of precision, closed data was better than open data; in terms of the total number of extracted errors, however, open data was better than closed data. Based on our analysis of these results, we developed a method for corpus correction and confirmed its effectiveness in machine-translation experiments. As corpus-based machine translation continues to be developed, the corpus correction discussed in this article should prove increasingly significant.
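
To make the idea of corpus correction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified decision-list-style learner in which the single most confident feature-to-tag rule decides the predicted tag, and it proposes a correction whenever the prediction disagrees with the annotated tag. It also assumes the usual reading of the terms: with closed data the instance being checked is included in the training data, and with open data it is held out. The function names, the feature strings such as `suffix=ta`, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict


def train_rules(examples):
    """Count how often each feature co-occurs with each tag."""
    counts = defaultdict(Counter)
    for features, tag in examples:
        for feature in features:
            counts[feature][tag] += 1
    return counts


def predict(counts, features):
    """Decision-list style: let the most confident feature -> tag rule decide."""
    best_tag, best_p = None, 0.0
    for feature in features:
        tag_counts = counts.get(feature)
        if not tag_counts:
            continue  # feature never seen in the training data
        tag, n = tag_counts.most_common(1)[0]
        p = n / sum(tag_counts.values())  # relative frequency of tag given feature
        if p > best_p:
            best_tag, best_p = tag, p
    return best_tag, best_p


def propose_corrections(corpus, closed=True):
    """Return (confidence, index, annotated_tag, proposed_tag), most confident first.

    closed=True  : train on the whole corpus, including the instance checked.
    closed=False : hold the checked instance out of training (leave-one-out).
    """
    proposals = []
    all_counts = train_rules(corpus) if closed else None
    for i, (features, tag) in enumerate(corpus):
        counts = all_counts if closed else train_rules(corpus[:i] + corpus[i + 1:])
        guess, p = predict(counts, features)
        if guess is not None and guess != tag:
            proposals.append((p, i, tag, guess))
    return sorted(proposals, key=lambda item: -item[0])


# Hypothetical toy corpus: (features, annotated tag). Item 3 looks mis-tagged.
toy_corpus = [
    (["suffix=ta"], "past"),
    (["suffix=ta"], "past"),
    (["suffix=ta"], "past"),
    (["suffix=ta"], "present"),
    (["suffix=ru"], "present"),
    (["suffix=ru"], "present"),
]

closed_proposals = propose_corrections(toy_corpus, closed=True)
open_proposals = propose_corrections(toy_corpus, closed=False)
```

In this sketch, detection and correction come from the same step: any instance that appears in the returned list is a detected candidate error, and its proposed tag is the suggested correction. Taking only the top-ranked entries corresponds to inspecting the most confident pieces of extracted data first.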
