Language modeling is a statistical technique for representing text data in a machine-readable form. It estimates the probability distribution over sequences of words in a text and thereby the likelihood of upcoming words in spoken or written discourse. Under the Markov assumption, a language model predicts the next word from the previous n − 1 words of the sentence, an approach known as the n-gram technique. A limitation of the n-gram technique is that it relies only on the preceding words themselves when predicting the upcoming word. Factored language modeling extends the n-gram technique by incorporating grammatical and linguistic knowledge about words, such as number, gender, and part-of-speech tag, into the prediction of the next word. Back-off is a method of falling back on a shorter context when a longer history of preceding words is unavailable. This work studies the effect of various combinations of linguistic features and generalized back-off strategies on the next-word prediction capability of language models for Hindi. The paper empirically compares factored language models that use linguistic features of Hindi words against a baseline n-gram model, evaluating all models with the perplexity metric. In summary, the factored language model with the product combination strategy achieves the lowest perplexity, 1.881235, which is about 50% lower than that of the traditional baseline trigram model.
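As background for the comparison summarized above, the following standard formulation (assumed here; the abstract itself does not spell it out) sketches the n-gram approximation under the Markov assumption and the perplexity metric used to rank the models:

P(w_1, \dots, w_T) \approx \prod_{t=1}^{T} P\big(w_t \mid w_{t-n+1}, \dots, w_{t-1}\big)

PP = P(w_1, \dots, w_T)^{-1/T}

Lower perplexity means the model assigns higher probability to held-out text, which is the sense in which the factored model with the product combination strategy improves on the trigram baseline.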