Abstract
Spell checking and correction is still in its infancy for the Yorùbá language, and existing tools cannot be applied directly because Yorùbá requires extensive use of diacritics to mark phonemes and tone. We addressed this problem by collecting data from online sources and from optical character recognition of hard-copy books. The features relevant to spell checking and correction in a language that marks tone (and uses the underdot) were identified through a review of existing spell checking solutions, analysis of the collected data, and consultation of relevant Yorùbá linguistics textbooks. A conceptual model was formulated as a parallel combination of a unigram language model and a language diacritic model, forming a dictionary sub-model that is used by the Error Detection and Candidate Generation modules. The Candidate Generation module was implemented as an inverse Levenshtein edit-distance algorithm.
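As a rough illustration of how dictionary-based error detection and edit-distance candidate generation can treat diacritics as first-class symbols, consider the minimal sketch below. It is not the implementation described here: it assumes that "inverse" edit distance means generating every string one Levenshtein edit away from the input and keeping those found in the unigram dictionary, that words are decomposed (Unicode NFD) so that each tone mark or underdot is its own symbol, and that candidates are ranked by unigram frequency. The names `edits1`, `build_unigrams`, `is_error` and `candidates` are hypothetical.

```python
import unicodedata
from collections import Counter

# Base letters of the Yorùbá alphabet plus the combining marks used in its
# orthography: grave (U+0300), acute (U+0301) and dot below (U+0323).
BASE = "abdefghijklmnoprstuwy"
MARKS = "\u0300\u0301\u0323"
SYMBOLS = BASE + MARKS


def nfd(word):
    """Lower-case and decompose so every diacritic is a separate symbol."""
    return unicodedata.normalize("NFD", word.lower())


def edits1(word):
    """All strings one Levenshtein edit (delete, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in SYMBOLS]
    inserts = [l + c + r for l, r in splits for c in SYMBOLS]
    return set(deletes + replaces + inserts)


def build_unigrams(tokens):
    """Unigram counts over decomposed tokens (assumed dictionary form)."""
    return Counter(nfd(t) for t in tokens)


def is_error(word, unigrams):
    """Flag a word as misspelled when it is absent from the dictionary."""
    return nfd(word) not in unigrams


def candidates(word, unigrams):
    """Dictionary words one edit away, most frequent first (assumed ranking)."""
    known = [w for w in edits1(nfd(word)) if w in unigrams]
    known.sort(key=lambda w: (-unigrams[w], w))
    return [unicodedata.normalize("NFC", w) for w in known]
```

With a toy dictionary built from the tokens `ilé` and `ọmọ`, the undiacritized input `ile` is flagged as an error and `ilé` is proposed, because adding the acute accent counts as a single insertion.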
The system was evaluated using Detection Accuracy (calculated from Precision and Recall) and Suggestion Accuracy (SA) as metrics. Our experimental setups compared the performance of the component subsystems when used alone with that of their combination into a unified model. The detection accuracies of the models were 93.23%, 94.10% and 95.01% respectively, while their suggestion accuracies were 26.94%, 72.10% and 65.89% respectively. With respect to the size of the training corpus, the unified model achieved an increase of 1.83% in detection accuracy and 5.27% in suggestion accuracy for a 70% increase in corpus size. The results indicate that each of the sub-models in the dictionary plays a different role, and that increasing the training data does not yield a proportional increase in the performance of the spell checker. The study also showed that spell checking of Yorùbá text improves when attention is paid to the diacritical aspects of the language.
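The evaluation metrics can be sketched as follows. The abstract does not say how Precision and Recall are combined into Detection Accuracy, nor how Suggestion Accuracy is counted, so this sketch makes two explicit assumptions: the combination is shown as the harmonic mean (F1), and a suggestion counts as correct when the intended word appears anywhere in the suggestion list. The function names `detection_scores` and `suggestion_accuracy` are hypothetical.

```python
def detection_scores(flagged, true_errors):
    """Precision, recall and their harmonic mean for error detection.

    flagged:     positions (or tokens) the checker marked as misspelled
    true_errors: positions (or tokens) that really are misspelled
    The harmonic mean is an assumed way of combining the two scores.
    """
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = len(flagged & true_errors)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_errors) if true_errors else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def suggestion_accuracy(suggestions, gold):
    """Share of misspellings whose intended word is among the suggestions.

    suggestions: misspelled word -> list of proposed corrections
    gold:        misspelled word -> intended correct word
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for wrong, right in gold.items()
        if right in suggestions.get(wrong, [])
    )
    return hits / len(gold)
```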