Abstract

Hausa, although spoken by a large population, is considered a low-resource language in Natural Language Processing (NLP). Despite increasing efforts to address this, the quality of existing resources, particularly datasets, remains uncertain, and a basic task such as stop word identification is often hindered by the absence of standardized resources. This study addresses that gap by combining the Term Frequency-Inverse Document Frequency (TF-IDF) approach with manual evaluation to develop a comprehensive stop word list for Hausa. Using datasets from four reputable online Hausa news sources, comprising 4,501 articles and 1,202,822 tokens, we applied TF-IDF with a threshold of 0.001 to each dataset and intersected the results across the datasets, identifying 91 candidate stop words. After manual examination, the list was narrowed to 76 final stop words. Compared to a prior study, our list increases the number of identified stop words by 6%. This standardized resource advances Hausa NLP by facilitating more effective text processing tasks, such as sentiment analysis and machine translation, and lays the groundwork for further research in low-resource languages.
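The selection step described above (score terms with TF-IDF in each dataset, keep those under the threshold, then intersect across datasets) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how per-document scores are aggregated or which TF-IDF variant is used, so the sketch assumes the mean TF-IDF of each term over a dataset's documents, an unsmoothed natural-log IDF, and that terms scoring at or below the threshold are the candidates. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mean_tfidf(docs):
    """Mean TF-IDF of each term over the tokenized documents of one dataset.

    Assumption: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    totals = defaultdict(float)
    for tokens in docs:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            totals[term] += tf * idf

    return {term: totals[term] / n_docs for term in doc_freq}


def candidate_stop_words(datasets, threshold=0.001):
    """Terms whose mean TF-IDF is at or below the threshold in every dataset."""
    per_dataset = []
    for docs in datasets:
        scores = mean_tfidf(docs)
        per_dataset.append({t for t, s in scores.items() if s <= threshold})
    if not per_dataset:
        return []
    # Intersect the per-dataset candidate sets.
    return sorted(set.intersection(*per_dataset))


# Toy data for illustration only (not taken from the study's corpora).
dataset_a = [
    ["da", "gwamnati", "a", "kasa"],
    ["da", "wasanni", "a", "filin"],
    ["da", "kasuwa", "a", "gari"],
]
dataset_b = [
    ["da", "labarai", "a", "duniya"],
    ["da", "lafiya", "a", "asibiti"],
]

print(candidate_stop_words([dataset_a, dataset_b]))
```

In this toy example the terms that occur in every document get an IDF of zero and therefore fall under the threshold in both datasets. On a real corpus, very rare terms can also receive low mean scores under this aggregation, which is one reason the intersection across sources and the manual review described in the abstract matter.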
