Modeling Linguistic Variables With Regression Models: Addressing Non-Gaussian Distributions, Non-independent Observations, and Non-linear Predictors With Random Effects and Generalized Additive Models for Location, Scale, and Shape.

Christophe Coupé

doi:10.3389/fpsyg.2018.00513

Abstract

As statistical approaches are increasingly used in linguistics, attention must be paid to the choice of methods and algorithms. This is especially true since these methods require assumptions to be satisfied in order to provide valid results, and because scientific articles still often fall short of reporting whether such assumptions are met. Progress is, however, being made in various directions, one of them being the introduction of techniques able to model data that cannot be properly analyzed with simpler linear regression models. We report recent advances in statistical modeling in linguistics. We first describe linear mixed-effects regression models (LMM), which address grouping of observations, and generalized linear mixed-effects models (GLMM), which offer a family of distributions for the dependent variable. Generalized additive models (GAM) are then introduced, which allow modeling non-linear parametric or non-parametric relationships between the dependent variable and the predictors. We then highlight the possibilities offered by generalized additive models for location, scale, and shape (GAMLSS). We explain how they make it possible to go beyond common distributions, such as Gaussian or Poisson, and offer the appropriate inferential framework to account for ‘difficult’ variables such as count data with strong overdispersion. We also demonstrate how they offer interesting perspectives on data when not only the mean of the dependent variable is modeled, but also its variance, skewness, and kurtosis. As an illustration, the case of phonemic inventory size is analyzed throughout the article. For over 1,500 languages, we consider as predictors the number of speakers, the distance from Africa, an estimation of the intensity of language contact, and linguistic relationships. We discuss the use of random effects to account for genealogical relationships, the choice of appropriate distributions to model count data, and non-linear relationships. Relying on GAMLSS, we assess a range of candidate distributions, including the Sichel, Delaporte, Box-Cox Cole and Green, and Box-Cox t distributions. We find that the Box-Cox t distribution, with appropriate modeling of its parameters, best fits the conditional distribution of phonemic inventory size. We finally discuss the specificities of phoneme counts, weak effects, and how GAMLSS should be considered for other linguistic variables.

Highlights

Comparing the two models with penalization, one sees that cubic splines lead to high degrees of non-linearity for Distance from Africa and Local linguistic density, which is reflected by the larger effective degrees of freedom of these two smooth terms (8.90 and 8.72, respectively, compared with 5.82 and 4.47 for P-splines), while discarding an influence of Number of speakers.
Generalized additive models for location, scale, and shape (GAMLSS) are an extension of GAM(M) that allows one to consider a wide range of options for the conditional distribution of the dependent variable, while GLM(M) and GAM(M) are restricted to the exponential family of distributions (Rigby and Stasinopoulos, 2005).
One can observe here that, strictly referring to AIC values, the Box-Cox Cole and Green (BCCG) and Box-Cox t (BCT) distributions provide better fits than the Sichel (SICHEL) and Delaporte (DEL) distributions. Do these results suggest that the BCCG should be the distribution to use in a GAMLSS with our various predictors? One must be cautious here, since the marginal distribution is not the same as the conditional distribution of the dependent variable, i.e., its distribution when factoring in the various predictors.
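The AIC-based comparison of marginal distributions mentioned above can be illustrated with a minimal sketch. It does not reproduce the article's analysis: the Sichel, Delaporte, BCCG, and BCT distributions are not available in the Python standard library, so a Poisson and a negative binomial distribution are swapped in as stand-ins, and the `counts` list is a hypothetical toy sample, not the article's data. The negative binomial `size` parameter is found by a simple grid search, which is an assumption made for brevity.

```python
import math

def poisson_loglik(counts, mu):
    """Log-likelihood of the counts under a Poisson distribution with mean mu."""
    return sum(k * math.log(mu) - mu - math.lgamma(k + 1) for k in counts)

def negbin_loglik(counts, mu, size):
    """Log-likelihood under a negative binomial with mean mu and
    variance mu + mu**2 / size (smaller size means more overdispersion)."""
    total = 0.0
    for k in counts:
        total += (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                  + size * math.log(size / (size + mu))
                  + k * math.log(mu / (size + mu)))
    return total

def aic(loglik, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik

# Hypothetical toy counts with a long right tail (NOT the article's data).
counts = [11, 14, 18, 20, 22, 23, 25, 27, 29, 31, 34, 38, 45, 58, 90, 141]

mean = sum(counts) / len(counts)
variance = sum((k - mean) ** 2 for k in counts) / (len(counts) - 1)
dispersion_index = variance / mean  # about 1 for Poisson-like data

# Poisson: the maximum-likelihood estimate of the mean is the sample mean.
aic_poisson = aic(poisson_loglik(counts, mean), n_params=1)

# Negative binomial: for a fixed size, the MLE of the mean is the sample
# mean, so only size needs to be searched (coarse log-spaced grid).
size_grid = [0.05 * 1.1 ** i for i in range(120)]
best_loglik, best_size = max((negbin_loglik(counts, mean, s), s) for s in size_grid)
aic_negbin = aic(best_loglik, n_params=2)

print(f"dispersion index: {dispersion_index:.2f}")
print(f"AIC Poisson: {aic_poisson:.1f}")
print(f"AIC negative binomial: {aic_negbin:.1f} (size = {best_size:.2f})")
```

As in the text, such a comparison only concerns the marginal distribution of the counts; it says nothing yet about which distribution fits best once predictors are included.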

Summary

A CASE STUDY

What drives linguistic diversity? What phenomena, and in particular what external factors, explain the distribution of linguistic structures across the globe? These questions are at the heart of linguistics, and can be considered at various levels of linguistic analysis, either with qualitative or more quantitative approaches.

Comparing the two models with penalization, one sees that cubic splines lead to high degrees of non-linearity for Distance from Africa and Local linguistic density, which is reflected by the larger effective degrees of freedom of these two smooth terms (8.90 and 8.72, respectively, compared with 5.82 and 4.47 for P-splines), while discarding an influence of Number of speakers (owing to the modified penalty introduced above). It looks as if canceling the influence of this predictor resulted in increased non-linearity in the two other continuous predictors.

Generalized additive models for location, scale, and shape (GAMLSS) are an extension of GAM(M) that allows one to consider a wide range of options for the conditional distribution of the dependent variable, while GLM(M) and GAM(M) are restricted to the exponential family of distributions (Rigby and Stasinopoulos, 2005).

FIGURE 4 | Smooth terms for Distance from Africa, Number of Speakers, and Local linguistic density, for three smoothing approaches in an inverse-Gaussian GAMM: cubic splines (top), P-splines (middle), and cubic splines with a fixed smoothing parameter equal to 3.

This could be due to less satisfying statistical approaches, but should serve as a warning of the limited trust one should put in this result.
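The link between penalization and the effective degrees of freedom of a smooth term can be sketched as follows. This is an illustration under simplifying assumptions, not the article's model: it uses a Whittaker smoother, i.e., a penalized smoother with an identity basis and a second-order difference penalty, for which the hat matrix is H = (I + λDᵀD)⁻¹ and the effective degrees of freedom are trace(H). The function names and the values of `lam` are hypothetical, and `lam` is not on the same scale as the smoothing parameter reported in the article.

```python
def second_difference_penalty(n):
    """Return P = D^T D, where D is the (n-2) x n second-order difference matrix."""
    penalty = [[0.0] * n for _ in range(n)]
    for row in range(n - 2):
        coeffs = {row: 1.0, row + 1: -2.0, row + 2: 1.0}
        for i, ci in coeffs.items():
            for j, cj in coeffs.items():
                penalty[i][j] += ci * cj
    return penalty

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def effective_df(n, lam):
    """Effective degrees of freedom, trace((I + lam * D^T D)^-1), for n points."""
    penalty = second_difference_penalty(n)
    system = [[(1.0 if i == j else 0.0) + lam * penalty[i][j] for j in range(n)]
              for i in range(n)]
    hat = invert(system)
    return sum(hat[i][i] for i in range(n))

# Heavier penalization (larger lam) gives a smoother, more linear fit and
# hence fewer effective degrees of freedom.
for lam in (0.0, 0.1, 10.0, 1000.0):
    print(f"lam = {lam:>7}: edf = {effective_df(20, lam):.2f}")
```

With no penalty, the smoother interpolates the data and the effective degrees of freedom equal the number of points; as the penalty grows, they approach 2, the dimension of a straight line. Larger values, such as those reported above for the cubic splines, therefore indicate a more strongly non-linear estimated relationship.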



Journal: Frontiers in Psychology
Publication Date: Apr 16, 2018
Citations: 26
License type: CC BY 4.0
