Abstract

Paired t-tests can be generalized as linear mixed-effects regression models (LME).

After reviewing the t-test work of his colleague Janet Hart [1], Dr. Steve Carpenter decides to explore the use of regression models in R to study her BMI data. He begins by loading two additional packages, reshape2 and lme4, and then repeating her data preparation steps (Figure 1).

At this point, Steve has the same dataframe object, aps.exam, that Dr. Hart used for her final paired t-test. He then uses melt() to create a new dataframe with the aps.exam data in a “long” format (Figure 2).

Dr. Carpenter uses this dataframe as the data source for a simple linear regression. The summary() function returns the results of the model (Figure 3).

The intercept of 26.8917 is the mean of the APS BMI values. The coefficient for bmi.src, −0.1025, is the “step” going from the APS mean to the exam mean. This coefficient is not statistically significant. Linear regression does not account for the within-person correlation of BMI values.

Dr. Carpenter turns his attention to a linear mixed-effects model (LME) using the lmer() function from the lme4 package. The fixed-effects portion is specified as “bmi ~ bmi.src” and the random-effects component is “(1|personID)” (Figure 4).

The intercept and the coefficient are unchanged from the linear regression model, but now the t value tells us the coefficient is highly significant (see comments in the Discussion section regarding p values). Just as a paired t-test takes into account the correlation between BMI values within persons, so does the LME with its random-effects component.

Each person in aps.exam can be thought of as a cluster with a pair of observations. While paired t-tests require paired observations, LME models have no such constraint. The LME clusters can have more than two observations, and they need not be balanced.

Steve recalls that dataframes aps.select and exam.bmi each have multiple duplicate observations that were removed for the paired t-test.
For example, note person IDs 413115 and 4558173 in the partial listings in Figure 5. Only observations 4 and 7 from aps.select and 2 and 4 from exam.bmi were used for Dr. Hart's paired t-test.

The unbalanced multiple observations in aps.select and exam.bmi can be used in an LME model. Steve begins by stacking the two dataframes into one new one (Figure 6).

The model specification for the unbalanced data is the same as for the first LME model (Figure 7), but this model has the advantage of using all the available data. The intercept and the coefficient for bmi.src are different, yet both remain highly significant.

Having had success with the unbalanced data, Dr. Carpenter decides to experiment with another feature that is available in LME: multiple fixed-effects predictors. He notes that his dataframe me.long has two additional factor variables, sex and smkr (for smoker status), and the continuous variable age. The factors are both binary: sex has two possible values, male and female; smkr also has two levels, NS and SM. Review of the age variable reveals that it begins at 18 and has a very wide range (Figure 8).

Steve decides to transform the age variable such that it starts at zero and is measured in decade units rather than years (Figure 9). He is now ready to run his LME model using his transformed age variable, age10 (Figure 10).

Linear mixed-effects models (LME) are models containing both fixed effects and random effects. The simplest LMEs provide a way to derive results identical to what is obtained with t-tests.
However, a regression model specification is far more flexible and provides for unbalanced datasets, multiple fixed effects, multiple random effects, covariate interactions, and many other possibilities.

A brief search for recent studies that used mixed-effects models returned these topics: rates of decline in Alzheimer Disease [2]; outcomes of bariatric surgery [3]; differences in the interpretation of angiograms [4]; and survival of HIV-positive patients [5]. This Research Note provides a brief introduction to mixed-effects models.

Synthetic data was developed for this article using Stata 14 [6]. The 229,586 synthetic BMI records developed for a previous article were chosen as a starting point [7]. Each observation had both height and weight values. Of these, 33,972 records were chosen on the basis of: BMI ≥ 15; BMI ≤ 45; and an APS date that was both non-missing and ≥ 01Nov2010. Duplicate records were dropped such that there remained only one BMI per policy number. These 23,842 observations had these five variables: personID, apsDate, height, weight, and apsBMI. Integer (whole number) versions of height and weight were created and used in an outer join with a large (140,180 obs) dataset of actual exam data. The large Cartesian product that resulted from the outer join provided a single set of age, sex, and smoker status values for 22,075 of the 23,842 BMI observations. The remaining 1767 BMI observations had multiple possibilities for age, sex, and smoker status. Final values for these were chosen through random sampling. Those observations with APS dates ≥ 01Jul2012 were dropped. The final 23,346 observations were exported to a pipe-delimited text file named “apsBMI_23.psv”.

APS dataset observations with dates ≥ 01Nov2010 were the basis of the exam dataset. An exam date was created such that it was a random number of days (range 30 to 365) after the APS date. Observations with an “exam date” other than those in a business calendar were dropped.
Random values with a normal distribution centered on 0.12 and having an SD of 0.5 were subtracted from the APS BMI values. Of the 8056 observations at this point in the process, 4950 were randomly selected for export into examBMI_23.psv. Further data manipulation is detailed in the R code.

Both data files and R script files are available to members of AAIM. Please e-mail your request to: Data.Wrangler@LilRedHen.org.

Version 0.98.1091 of RStudio [8] and version 3.1.2 of R [9] were used for R programming. The following functions are part of the base R distribution: read.csv(); as.Date(); head(); t.test(); lm(); and summary(). The following functions and operators are from the package dplyr [10]: summarize(); semi_join(); tally(); %>%; rename(); group_by(); arrange(); and distinct(). The reshape2 package [11] contributed the melt() function. The plotting functions qplot() and ggplot() are from package ggplot2 [12]. The lme4 package [13] contributed lmer().

RStudio's “Compile Notebook…” feature was used to create the MS Word output in the Vignette.

For the BMI data, the linear regression model is of the form

$$y_{ij} = \beta_0 + \beta_1 x_i + \varepsilon_{ij}$$

where $y_{ij}$ is the BMI value, $x_i = i$ indicates the data source with i = 0, 1 (0 = APS, 1 = exam), and j = 1,…,4599 persons. The intercept $\beta_0$ and the coefficient $\beta_1$ are determined through ordinary least squares. The $\varepsilon_{ij}$ is the random error term or “random effects” in this linear regression model.

In Steve's linear model lm.m1, the coefficient $\beta_1$ is equivalent to the difference between the mean of APS BMI values (with their high variance) and the mean of the exam BMI values (again, high variance). It is not surprising that $\beta_1$ is not significant.

The simple linear mixed-effects (LME) model that corresponds to the paired t-test is

$$y_{ij} = \beta_0 + \beta_1 x_i + b_j + \varepsilon_{ij}$$

where $\beta_0 + \beta_1 x_i$ are the “fixed effects” and the person-specific intercepts $b_j$ are the “random effects”. Together they make it a “mixed-effects” model.

The R formula is written as

    bmi ~ bmi.src + (1 | personID)

The fixed portion of the model, bmi ~ bmi.src, is what one is interested in: the overall regression line for the population. The random effect, (1 | personID), serves to cluster BMI values by person.
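
The contrast between the linear regression, the paired t-test, and the random-intercept LME can be tried on a small simulated dataset. The following is a minimal sketch, not the article's R script: the data are randomly generated, and the object names (toy.long, lm.fit, lme.fit, paired) and all simulation parameters are assumptions chosen for illustration. Only lm(), t.test(), lmer(), and the formula bmi ~ bmi.src + (1 | personID) come from the text.

```r
# Hypothetical toy data; none of these values come from the article's datasets.
library(lme4)

set.seed(42)
n <- 200                                    # number of persons (clusters)
person.mean <- rnorm(n, mean = 27, sd = 5)  # large between-person spread in BMI

# One APS and one exam value per person; the exam value is shifted down slightly.
aps  <- person.mean + rnorm(n, sd = 0.35)
exam <- person.mean - 0.12 + rnorm(n, sd = 0.35)

# "Long" format: one row per observation, with a factor marking the source.
toy.long <- data.frame(
  personID = factor(rep(seq_len(n), times = 2)),
  bmi.src  = factor(rep(c("aps", "exam"), each = n), levels = c("aps", "exam")),
  bmi      = c(aps, exam)
)

# Ordinary linear regression: ignores which observations belong to one person.
lm.fit <- lm(bmi ~ bmi.src, data = toy.long)

# Random-intercept LME: clusters the observations by person.
lme.fit <- lmer(bmi ~ bmi.src + (1 | personID), data = toy.long)

# Paired t-test on the same data (exam minus APS).
paired <- t.test(exam, aps, paired = TRUE)

# Compare the estimate and t value for the source effect across approaches.
coef(summary(lm.fit))["bmi.srcexam", c("Estimate", "t value")]
coef(summary(lme.fit))["bmi.srcexam", c("Estimate", "t value")]
paired$statistic
```

With balanced pairs such as these, the bmi.src estimate should be the same in the lm() and lmer() fits, and the lmer() t value should agree with the paired t statistic up to optimizer tolerance. The lm() t value is much smaller in magnitude because the between-person variance remains in its residual.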
Where the paired t-test essentially subtracts the exam value from the APS value for each pair (cluster) and evaluates the set of differences, LME performs a regression within each cluster and evaluates the set of results. Where each cluster contains only two values, the processes are equivalent. The distinction is that with regressions, LME can handle more than two values per cluster.

For the sake of clarity, the following plots of regressions consider only the first 20 persons in the dataset. The first panel shows a linear regression of APS and exam BMI data points. The regression line is almost flat and the standard error (the grayed area) is very wide. The second panel depicts the within-cluster regressions performed in LME, while the third panel shows the accumulated result of those regressions. The intercept and the slope of the regression line are the same as in the first panel, but the standard error is much smaller (Figure 11).

The specification for the LME unbalanced data model is the same as that for the balanced (paired) data. Just as the personID random effect grouped the paired observations, it also groups the multiple observations in unbalanced data and performs a regression within each cluster.

The other simple LME model in the vignette featured multiple fixed effects. This LME model adds age10, smkr, and sex as fixed effects alongside bmi.src and keeps the personID random effect, and the R formula is written as

    bmi ~ bmi.src + age10 + smkr + sex + (1 | personID)

Review of the fixed-effects coefficients shows how informative this model can be (Figure 12). All of the coefficients are significant. The authors of the lme4 package feel that the calculation of p values is controversial and choose not to include them in the lmer() output. One can calculate p values for the coefficients in simple models by using anova() testing of nested models, but the process is beyond the scope of this paper.

The intercept value is now quite a bit lower: it now represents the mean BMI for an 18-year-old, non-smoking female as reported in an APS.
Going through the coefficients, we see that the predicted mean BMI increases by 0.3746 for every decade of life after age 18. The BMI increases by 0.4750 if one is a smoker and by 1.6727 if male. In comparison, the 0.1295 decrease in the mean BMI when measured by paramed examiners is relatively small.

Other models are possible with these covariates. Age and sex could be thought to have an interaction. Perhaps smkr is better as a random effect. These hypotheses could be modeled by changing the model formula accordingly.

Mixed-effects models are also known as “multilevel” or “hierarchical” models. These terms reflect different approaches to grouping data and/or multiple levels of nested groups. In addition to the simple linear models, logit, Poisson, and survival mixed-effects models are possible. Given their flexibility, it is not surprising that mixed-effects models are being encountered more frequently in the literature. With a modest amount of effort, one can use these models for insurance medicine research.
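
The multiple-predictor model and the alternative specifications mentioned above (an age-by-sex interaction, smkr as a random effect) might be written in lme4 roughly as follows. This is a hedged sketch on simulated data, not the article's code or results: the dataframe toy, the object names m.main, m.int, lrt, and f.smkr, and every simulated coefficient are assumptions for illustration. The variable names bmi, bmi.src, age10, sex, smkr, and personID follow the text, and the nested-model comparison uses anova() as the text suggests.

```r
# Hypothetical toy data; the simulated effects below are illustrative only.
library(lme4)

set.seed(1)
n <- 300
persons <- data.frame(
  personID = factor(seq_len(n)),
  age      = sample(18:80, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smkr     = factor(sample(c("NS", "SM"), n, replace = TRUE, prob = c(0.8, 0.2))),
  u        = rnorm(n, sd = 4)            # person-level random intercept
)

# Unbalanced clusters: each person contributes between 1 and 4 observations.
obs.per.person <- sample(1:4, n, replace = TRUE)
toy <- persons[rep(seq_len(n), times = obs.per.person), ]
toy$bmi.src <- factor(sample(c("aps", "exam"), nrow(toy), replace = TRUE))

# Age transformed to start at zero (age 18) and to be measured in decades.
toy$age10 <- (toy$age - 18) / 10

toy$bmi <- 24 + 0.4 * toy$age10 + 1.5 * (toy$sex == "male") +
  0.5 * (toy$smkr == "SM") - 0.1 * (toy$bmi.src == "exam") +
  toy$u + rnorm(nrow(toy), sd = 0.4)

# Main-effects model, fitted by maximum likelihood so that nested
# models can be compared directly with a likelihood ratio test.
m.main <- lmer(bmi ~ bmi.src + age10 + sex + smkr + (1 | personID),
               data = toy, REML = FALSE)

# Same model plus an age-by-sex interaction.
m.int <- lmer(bmi ~ bmi.src + age10 * sex + smkr + (1 | personID),
              data = toy, REML = FALSE)

# Likelihood ratio test of the nested models; yields a p value for the interaction.
lrt <- anova(m.main, m.int)
lrt

# One possible formula with smkr as a second random intercept (not fitted here).
f.smkr <- bmi ~ bmi.src + age10 + sex + (1 | personID) + (1 | smkr)
```

The last formula is shown only as a specification. A grouping factor with just two levels, such as smkr, carries very little information for estimating a variance component, so a fit of that model may be unstable.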
