Abstract

In this paper, we take Indonesian as the research object and propose a multiple filter correction framework (MFCF). The main idea of MFCF is to remove noise from the candidate words so that the correct word is more likely to be selected. In MFCF, we use a window search algorithm (WSA) to filter the candidate words in the dictionary. When searching for candidate words at Levenshtein distance 1, WSA reduces the candidate word search space by an average of 71%; at Levenshtein distance 2, it reduces the search space by an average of 55%. The smaller search space makes the search faster: for candidate words at Levenshtein distance 1 and 2, WSA is faster than current advanced search algorithms. We also introduce a character vector-based candidate word scoring model (CWSM-CV), a simple, unsupervised method. In MFCF, we use CWSM-CV to select the correct word from the candidate word list. By exploring the feasibility of scoring candidate words with a word vector-based candidate word scoring model (CWSM-WV), we find that the candidate word list needs to be denoised, and we verify this with experiments. To apply this finding to text correction, we propose a new set of evaluation indicators to replace accuracy. Finally, we recommend that researchers who correct text in low-resource languages make the model an open system and release it to users. Such a system receives user feedback as new data, gradually reducing the negative impact of limited data volume.
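
To make the candidate-filtering step concrete, the following is a minimal sketch of a dictionary search bounded by Levenshtein distance. It is not the WSA proposed in the paper, whose mechanics the abstract does not describe. It uses a generic length pre-filter as a stand-in to show how pruning the search space avoids distance computations. The function names `levenshtein` and `candidates` and the toy Indonesian word list in the usage note are assumptions made for illustration only.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def candidates(word: str, dictionary: Iterable[str], max_dist: int) -> List[str]:
    """Return dictionary entries within max_dist edits of word.

    Illustrative pruning only (NOT the paper's WSA): two strings whose
    lengths differ by more than max_dist cannot be within max_dist edits,
    so such entries are skipped without computing the distance.
    """
    result = []
    for entry in dictionary:
        if abs(len(entry) - len(word)) > max_dist:
            continue  # pruned by the length pre-filter
        if levenshtein(word, entry) <= max_dist:
            result.append(entry)
    return result
```

For example, with the toy dictionary `["makan", "makam", "malam", "main", "minum", "makanan"]` and the misspelling `"makab"`, `candidates(..., 1)` keeps `makan` and `makam`, while `candidates(..., 2)` additionally keeps `malam`. A scoring model such as CWSM-CV would then rank the surviving candidates.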

