Abstract

Advances in electronic data capture have increased the amount of repeated categorical data being collected and stored. Data at this scale are complex and pose significant statistical challenges in both methodology and computation. The divide and recombine method is commonly used to analyze such big data: a large data set is first partitioned into subsets, each subset is analyzed separately, and the results are then recombined in a manner that produces statistically valid output. However, the divide and recombine methods available in the literature accommodate only cross-sectional data. We propose a new, simpler approach to analyzing large, repeated categorical data using a joint modeling framework. In the proposed method, follow-up time is a natural conditioning variable that allows big data to be divided into subsets. The results are then recombined in a statistically valid way using the relationship between joint, marginal, and conditional probabilities. Using the proposed framework, several machine learning algorithms for cross-sectional data are extended to repeated outcomes in order to predict trajectories. As an illustration, the proposed methodology is used to analyze repeatedly measured activities of daily living (ADL) data from the Health and Retirement Study (HRS), USA. We also assess model performance under multiple machine learning algorithms using bootstrap simulations.
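
The divide and recombine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical toy data set with one binary outcome per subject at three follow-up waves and no covariates, and it uses raw conditional frequencies as a stand-in for the regression or machine learning models that would be fitted at each follow-up in practice. The names `trajectories`, `fit_wave`, and `joint_probability` are illustrative only.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data (an assumption, not HRS data):
# one binary outcome per subject at three follow-up waves.
trajectories = [
    (0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1),
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 1),
]
n_waves = 3


def fit_wave(t):
    """Divide step: estimate P(Y_t | earlier outcomes) for wave index t.

    For t = 0 the history is empty, so this is the marginal P(Y_1).
    Frequencies stand in for a fitted model at each follow-up.
    """
    history_counts = Counter(traj[:t] for traj in trajectories)
    full_counts = Counter(traj[:t + 1] for traj in trajectories)
    return {seq: full_counts[seq] / history_counts[seq[:t]] for seq in full_counts}


# Each wave is fitted separately, so the fits could run on different workers.
wave_models = [fit_wave(t) for t in range(n_waves)]


def joint_probability(traj):
    """Recombine step: joint = marginal x successive conditionals (chain rule)."""
    prob = 1.0
    for t, model in enumerate(wave_models):
        # Histories never observed in the data contribute probability zero.
        prob *= model.get(traj[:t + 1], 0.0)
    return prob


# The recombined trajectory probabilities form a valid distribution.
total = sum(joint_probability(traj) for traj in product((0, 1), repeat=n_waves))
```

Because every wave conditions on the outcomes observed before it, multiplying the wave-specific estimates gives a proper joint distribution over trajectories, which is what makes follow-up time a convenient variable to split on.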
