Abstract

AbstractThe development of a benchmark for part-of-speech (PoS) tagging of spoken dialectal European Spanish is presented, which will serve as the foundation for a future treebank. The benchmark is constructed using transcriptions of the Corpus Oral y Sonoro del Español Rural (COSER;“Audible corpus of spoken rural Spanish”) and follows the Universal Dependencies project guidelines. We describe the methodology used to create a gold standard, which serves to evaluate different state-of-the-art PoS taggers (spaCy, Stanza NLP, and UDPipe), originally trained on written data and to fine-tune and evaluate a model for spoken Spanish. It is shown that the accuracy of these taggers drops from 0.98$$-$$ - 0.99 to 0.94$$-$$ - 0.95 when tested on spoken data. Of these three taggers, the spaCy’s trf (transformers) and Stanza NLP models performed the best. Finally, the spaCy trf model is fine-tuned using our gold standard, which resulted in an accuracy of 0.98 for coarse-grained tags (UPOS) and 0.97 for fine-grained tags (FEATS). Our benchmark will enable the development of more accurate PoS taggers for spoken Spanish and facilitate the construction of a treebank for European Spanish varieties.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.