Abstract

As in many other natural language processing (NLP) fields, the use of statistical methods is now part of mainstream natural language generation (NLG). In the development of systems of this kind, however, there is the issue of data sparseness, a problem that is particularly evident in the case of morphologically rich languages such as Portuguese. This work presents a shallow surface realisation system that makes use of factored language models (FLMs) of Portuguese to overcome some of these difficulties. The system combines FLMs trained on a large corpus with a number of NLP resources that have been made publicly available by the Brazilian NLP research community in recent years, such as corpora, dictionaries, thesauri and others. Our FLM-based approach to surface realisation has been successfully applied to the generation of Brazilian newspaper headlines, and the results are shown to outperform a number of statistical and non-statistical baseline systems alike.

Highlights

  • In natural language generation (NLG) systems, surface realisation is known as the task of mapping abstract sentence representations to a surface form, that is, a sequence of words.

  • As in many other NLG applications, surface realisation systems have successfully relied upon statistical methods ranging from language modelling techniques [2,3] to full-blown probabilistic generation-space models [4] and grammar acquisition [5].

  • As a means to overcome the data sparseness that affects morphologically rich languages, in this work we address the use of factored language models (FLMs) [8] in the development of a shallow surface realisation system for the Brazilian Portuguese language.


Summary

Introduction

In natural language generation (NLG) systems, surface realisation is known as the task of mapping abstract sentence representations to a surface form, that is, a sequence of words. (Footnote: FAPESP grant 2009/08499-9.) Our own work focuses on two-stage, or generate-and-select, NLG architectures as introduced in [2] and by others. Systems of this kind produce text from an abstract input representation by separating the generation space from decision-making: they overgenerate a large number of alternative surface realisations (often including ungrammatical or ill-formed candidates) and subsequently select the most likely output string with the aid of a statistical language model.
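
The generate-and-select idea described above can be illustrated with a minimal Python sketch. It is not the system presented in this work: a plain add-one smoothed bigram model stands in for the factored language models, and the toy corpus, the `slots` input format and all function names are assumptions made up for illustration only.

```python
import itertools
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a candidate string."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


def overgenerate(slots):
    """Stage 1: enumerate every combination of surface variants, one per slot.

    No agreement checking is done here, so ill-formed candidates are included.
    """
    return [" ".join(choice) for choice in itertools.product(*slots)]


def realise(slots, unigrams, bigrams):
    """Stage 2: select the candidate that the language model scores highest."""
    candidates = overgenerate(slots)
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Hypothetical toy training data (invented for this sketch, not from the paper).
toy_corpus = [
    "o governo anuncia novas medidas",
    "os governos anunciam novas medidas",
    "o presidente anuncia o plano",
]
unigrams, bigrams = train_bigram_model(toy_corpus)

# Hypothetical abstract input: each slot lists the inflected variants of one word.
slots = [["o", "os"], ["governo"], ["anuncia", "anunciam"], ["novas"], ["medidas"]]
best = realise(slots, unigrams, bigrams)
print(best)
```

With this toy data the four candidates include ill-formed strings such as "os governo anunciam novas medidas", and the selection step returns "o governo anuncia novas medidas". A word-based model like this one only scores well on sequences it has seen, which is the data sparseness problem that motivates FLMs for morphologically rich languages.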

