Bicleaner at WMT 2020: Universitat d'Alacant-Prompsit's submission to the parallel corpus filtering shared task

Miquel Esplà-Gomis, Jaume Zaragoza-Bernabeu, Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez

doi:10.5281/zenodo.6580098

Journal: Zenodo (CERN European Organization for Nuclear Research)
Publication Date: Nov 1, 2020
Citations: 3
License type: cc-by

Affiliation: University of Alicante

#Character-level Language Models #Parallel Corpus Filtering

Abstract

This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission builds on the free/open-source tool Bicleaner and enhances it with Extremely Randomised Trees and with lexical similarity features that take the frequency of the words in the parallel sentences into account when determining whether two sentences are parallel. To train this classifier, we used the clean corpora provided for the task together with synthetic noisy parallel sentences. In addition, we re-score the output of Bicleaner using character-level language models and n-gram saturation.
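The abstract does not say how the synthetic noisy sentence pairs are produced. As an illustration only, the sketch below builds a labelled training set from clean pairs using two noise types that are assumptions of this example, not details taken from the paper: misalignment (a source sentence paired with the target of a different sentence) and target truncation. The function names, the `keep` parameter, and the toy English-Spanish data are all hypothetical; the resulting examples could then be turned into features for a classifier such as Extremely Randomised Trees.

```python
from typing import List, Tuple

Pair = Tuple[str, str]
Example = Tuple[str, str, int]  # (source, target, label): 1 = parallel, 0 = noisy


def misalign(pairs: List[Pair], shift: int = 1) -> List[Pair]:
    """Pair each source with the target of another sentence (rotation by `shift`)."""
    n = len(pairs)
    return [(pairs[i][0], pairs[(i + shift) % n][1]) for i in range(n)]


def truncate_target(pairs: List[Pair], keep: float = 0.5) -> List[Pair]:
    """Keep only the first `keep` fraction of the target tokens (at least one token)."""
    out = []
    for src, tgt in pairs:
        tokens = tgt.split()
        cut = max(1, int(len(tokens) * keep))
        out.append((src, " ".join(tokens[:cut])))
    return out


def build_training_set(clean: List[Pair]) -> List[Example]:
    """Label clean pairs as positives and synthetic noisy pairs as negatives."""
    positives = [(s, t, 1) for s, t in clean]
    noisy = misalign(clean) + truncate_target(clean)
    # Discard any "noisy" pair that happens to be identical to a clean pair,
    # which would otherwise be labelled both 1 and 0.
    clean_set = set(clean)
    negatives = [(s, t, 0) for s, t in noisy if (s, t) not in clean_set]
    return positives + negatives


# Toy data, invented for this example (assumption: not from the shared task).
clean = [
    ("the house is red", "la casa es roja"),
    ("I like coffee", "me gusta el café"),
    ("good morning", "buenos días"),
]
examples = build_training_set(clean)
```

Rotation is used here instead of random shuffling so that the output is deterministic; a real pipeline would more likely sample misaligned pairs at random and mix in further noise types.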
