Abstract

Logistic regression is a powerful and widely used analytical tool in linguistics for modelling a binary outcome variable against a set of explanatory variables. One challenge that can arise when applying logistic regression to linguistic data is complete or quasi-complete separation, phenomena that occur when (paradoxically) the model has too much explanatory power, resulting in effectively infinite coefficient estimates and standard errors. Instead of seeing this as a drawback of the method, or naïvely removing covariates that cause separation, we demonstrate a straightforward and user-friendly modification of logistic regression, based on penalising the coefficient estimates, that is capable of systematically handling separation. We illustrate the use of penalised, multi-level logistic regression on two clustered datasets relating to second language acquisition and corpus data, showing in both cases how penalisation remedies the problem of separation and thus allows sensible and valid statistical conclusions to be drawn. We also show via simulation that results are not overly sensitive to the amount of penalisation employed for handling separation.
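The phenomenon the abstract describes can be illustrated with a minimal sketch. The example below is not the paper's method or data: it fits a one-parameter (no-intercept) logistic model by gradient ascent on a tiny, perfectly separated dataset, and uses a simple ridge (L2) penalty as a stand-in for the coefficient penalisation the paper discusses. Without the penalty, the slope estimate grows without bound as the optimiser runs (the maximum-likelihood estimate is infinite under complete separation); with the penalty, the estimate converges to a finite value.

```python
import numpy as np

def fit_slope(x, y, lam=0.0, steps=20000, lr=0.1):
    """Fit a no-intercept logistic slope b by gradient ascent on the
    log-likelihood, optionally adding an L2 penalty -lam/2 * b**2."""
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-b * x))          # fitted probabilities
        grad = np.sum((y - p) * x) - lam * b       # penalty shrinks b toward 0
        b += lr * grad
    return b

# Completely separated data: y = 0 whenever x < 0, y = 1 whenever x > 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

b_mle = fit_slope(x, y)            # unpenalised: keeps growing with steps
b_pen = fit_slope(x, y, lam=0.5)   # penalised: converges to a finite slope
```

Running the unpenalised fit for more iterations pushes `b_mle` ever higher (it grows roughly logarithmically in the number of steps), which is the numerical signature of separation; `b_pen` settles at a finite value regardless of how long the optimiser runs. The penalty strength `lam = 0.5` is an arbitrary illustrative choice, echoing the abstract's point that results need not be overly sensitive to the exact amount of penalisation.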
