Abstract

Incomplete data is a pervasive problem in the analysis of longitudinal survey information in social science research. Popular techniques for analysing partially observed datasets assume that the incomplete variables follow an underlying normal distribution. However, the variables in social science research that are prone to missing observations are often highly skewed and have thick distributional tails. When the assumptions of the adopted imputation technique do not accurately represent the underlying behaviour of the variable being imputed, imputation can introduce further bias into the resulting statistical analysis. As the use and availability of complex social survey datasets expand, it becomes imperative to develop more flexible imputation and modelling techniques that accurately capture the behaviour of the variables being analysed.
This study is motivated by the recent digitisation and availability of the East Laguna Village cross-sectional surveys from the Philippines. The dataset is a collection of cross-sectional household surveys of an agricultural village, conducted since the 1960s. The survey information became publicly available only recently and has not previously been linked and analysed as a long panel dataset. As with most household surveys, the dataset suffers from missing data when analysed jointly over time. This research has improved the usability of the East Laguna Village survey information by forming a longitudinal dataset and simultaneously managing the resulting missing data through imputation.
The study focused both on improving existing modelling and imputation techniques for incomplete non-normal continuous variables with a two-level hierarchical structure, and on using these techniques to analyse the newly constructed longitudinal dataset.
This thesis developed three improvements to existing modelling and imputation techniques for incomplete and non-normal continuous variables with a hierarchical data structure. First, an extension of the two-level hierarchical linear model is proposed in which the error and random terms are modified to be skew-t distributed. This modification allowed non-normal continuous dependent variables to be modelled directly. Second, this model is further expanded into a flexible technique for generating multiple imputations through joint modelling. The flexible multiple imputation proposal enabled several incomplete non-normal continuous variables with a two-level hierarchical structure to be imputed directly, without the need for variable transformation or modification. Third, an alternative imputation proposal is developed using a two-level hierarchical extension of Seemingly Unrelated Regression with a flexible skew-t distribution. This imputation model facilitated simultaneous modelling of multiple incomplete variables while imposing cross-equation parameter restrictions, which allowed faster and more pragmatic implementation. These methodological improvements made the imputation of missing variables more general and lifted the restriction usually imposed by the commonly assumed normal distribution.
The East Laguna Village dataset was analysed, in conjunction with a missing data simulation study, to illustrate the improvements gained from these methodological proposals. The empirical analysis focused on assessing the effectiveness of modern rice varieties, fertilizers, and herbicides in improving farmers’ rice production in the village over time.
Part of this research accomplished the formation of a person-level and household-level longitudinal dataset by linking the different rice production and demographic information obtained from selected village surveys between 1974 and 2007. In the process of longitudinal merging, some of the important rice production covariates suffered from missing information, particularly the variable indicating the rice variety planted and the levels of herbicide and fertilizer used in each planting period. The latter two variables are both continuous and non-normally distributed. Without proper imputation, the empirical analysis of the rice production trends suggested weaker support for the overall effectiveness of modern rice varieties in bringing significant yield improvements to farmers’ productivity over the decades. Furthermore, the estimated contributions of fertilizers and herbicides were both significantly understated in comparison with the related literature on rice production. When the estimation was limited to the incomplete dataset, or when the imputation method applied did not accurately capture the shape of the underlying distribution of the variables with missing information, the results of the empirical analysis were inconsistent and did not support the findings of larger agricultural studies. However, when the empirical investigation incorporated all the available information through the flexible modelling and imputation methods introduced in this research, the empirical estimates provided more support for the positive contribution of modern rice strains and increased rice production inputs to improving farmers’ rice yields. These latter findings are more consistent with related agricultural studies on modern rice technology.
This research primarily uses Bayesian inference for model estimation, applying a Markov chain Monte Carlo technique with the Adaptive Metropolis-within-Gibbs (AMWG) algorithm.
All simulations and empirical models were executed using the R software.
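
For readers unfamiliar with AMWG, the following is a minimal, generic sketch of the idea: each parameter is updated in turn with a random-walk Metropolis step, and its proposal scale is nudged after every batch of iterations towards an acceptance rate of about 0.44. It is written in Python for illustration and is not the thesis's R implementation. The function name `amwg`, the batch size of 50, the adaptation step, and the toy two-parameter normal target are all assumptions made for this example, not details taken from the thesis.

```python
import math
import random


def amwg(log_target, init, n_iter=5000, batch=50, seed=0):
    """Illustrative Adaptive Metropolis-within-Gibbs sampler (hypothetical helper).

    log_target: function returning the log-density (up to a constant) of a list x.
    Returns the list of draws and the final per-coordinate proposal scales.
    """
    rng = random.Random(seed)
    x = list(init)
    d = len(x)
    log_sd = [0.0] * d      # log proposal standard deviation per coordinate
    accepted = [0] * d      # acceptances per coordinate in the current batch
    current = log_target(x)
    draws = []
    for it in range(1, n_iter + 1):
        for i in range(d):
            # Random-walk proposal that changes coordinate i only.
            proposal = x[:]
            proposal[i] += rng.gauss(0.0, math.exp(log_sd[i]))
            cand = log_target(proposal)
            # Symmetric proposal, so accept with probability min(1, ratio).
            # 1 - random() lies in (0, 1], which keeps the log finite.
            if math.log(1.0 - rng.random()) < cand - current:
                x, current = proposal, cand
                accepted[i] += 1
        draws.append(x[:])
        if it % batch == 0:
            # Diminishing adaptation: step size is capped and shrinks with batch count.
            delta = min(0.01, (it // batch) ** -0.5)
            for i in range(d):
                log_sd[i] += delta if accepted[i] / batch > 0.44 else -delta
                accepted[i] = 0
    return draws, [math.exp(s) for s in log_sd]


def toy_log_target(x):
    """Assumed toy target: independent normals with standard deviations 1 and 10."""
    return -0.5 * (x[0] ** 2 + (x[1] / 10.0) ** 2)


if __name__ == "__main__":
    draws, scales = amwg(toy_log_target, [0.0, 0.0], seed=1)
    print("final proposal scales:", scales)
```

In an application like the one described here, the toy target would be replaced by the log-posterior of the model being estimated, and early draws would normally be discarded as burn-in while the proposal scales settle.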
