Abstract

Existing Chinese-Khmer parallel corpora are limited to a single domain, small in scale, and poorly up to date. To address these problems, this paper proposes a method for extracting Chinese-Khmer parallel fragments from comparable corpora based on the Dirichlet process. The method first obtains the topic distribution of the bilingual comparable corpus through a bilingual topic model. It then uses a Poisson distribution to randomly segment the bilingual texts, applies a threshold to pre-filter candidate parallel fragments, and estimates the matching probability between fragments with a Dirichlet process. The final parallel fragments are obtained by Gibbs sampling. Comparison experiments show that the method obtains higher-quality parallel fragments without requiring any parallel data, which makes it better suited to language pairs with scarce bilingual resources.
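The segmentation and pre-filtering steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the Poisson rate, the similarity measure, or the threshold value, so the helper names (`sample_poisson`, `poisson_segments`, `filter_candidates`), the use of cosine similarity over per-fragment topic vectors, and all numeric values below are assumptions made for illustration only. The Dirichlet process matching and Gibbs sampling stages are not shown.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poisson_segments(tokens, lam, rng):
    """Split tokens into contiguous fragments of length 1 + Poisson(lam).

    The "+ 1" is an assumption that keeps every fragment non-empty.
    """
    fragments, i = [], 0
    while i < len(tokens):
        length = 1 + sample_poisson(lam, rng)
        fragments.append(tokens[i:i + length])
        i += length
    return fragments


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(src_topics, tgt_topics, threshold):
    """Keep (i, j, score) for fragment pairs whose topic vectors are similar.

    src_topics and tgt_topics are hypothetical per-fragment topic
    distributions, e.g. as produced by some bilingual topic model.
    """
    kept = []
    for i, u in enumerate(src_topics):
        for j, v in enumerate(tgt_topics):
            score = cosine(u, v)
            if score >= threshold:
                kept.append((i, j, score))
    return kept


# Toy usage: segment a token list, then filter made-up topic vectors.
rng = random.Random(0)
tokens = "t1 t2 t3 t4 t5 t6 t7 t8 t9 t10".split()
fragments = poisson_segments(tokens, lam=2.0, rng=rng)
candidates = filter_candidates([[1, 0], [0, 1]], [[1, 0], [0.6, 0.8]], 0.7)
```

In this sketch the threshold only prunes the candidate set; in the method described above, the surviving pairs would then be scored by the Dirichlet process and resolved by Gibbs sampling.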
