Abstract
This thesis proposes several methods for automatically acquiring bilingual corpora from different websites, including the "iciba" website, CNKI, and a patent network. It describes the methods and procedures used to acquire each kind of corpus. Because each site has different characteristics, we propose a different acquisition method for each, and achieve fast, accurate, automatic acquisition of a large-scale bilingual corpus. For the "iciba" website, we rely mainly on the Nutch crawler, which performs relatively well, retrieving content accurately and with good relevance. In addition, rather than obtaining bilingual text from the entire Internet, we take an entirely new approach: we use the basic information of scholarly theses in CNKI to obtain a large-scale, high-quality English-Chinese bilingual corpus. The result is an aligned bilingual corpus on the gigabyte scale, which manual evaluation shows to be very accurate. The corpus lays the groundwork for further research on cross-language information retrieval.