Lacking access to the entire data distribution induces severe biases in the parameter estimates of any empirical investigation and hampers the understanding and analysis of the data-generating process. We integrate and synthesize knowledge from various scientific disciplines to develop semiparametric, endogenous-truncation-proof algorithms that correct for truncation bias due to endogenous self-selection. Computer science (pattern recognition), machine learning (unsupervised learning), electrical engineering (signal extraction), economics, psychology, and management science all contribute ideas and techniques to the proposed algorithms. This synthesis enriches the algorithms' accuracy, efficiency, and applicability. Under truncation, the unobserved portion of the distribution is analogous to the missing-pixels problem in pattern recognition addressed by computer science, machine learning, and artificial intelligence algorithms. However, data in the social sciences are intrinsically affected, and largely generated, by the behavior (cognition) of the units themselves. Incorporating behavioral aspects from economics, psychology, and management science therefore allows endogeneity to be modeled and controlled. The proposed algorithms improve upon the covariate-shift assumption in machine learning in that each data point's decision to truncate itself out of the original distribution is a key building block of the estimation algorithms. Refining the concept of Vox Populi (Wisdom of Crowds) allows data points to sort themselves according to their estimated latent reference group's opinion space. The opinion space is composed of experts' observed and unobserved characteristics; the latter are captured by latent classes.
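To see why truncation biases naive estimates, consider a minimal sketch (an illustrative assumption, not the paper's model: it uses exogenous truncation of a normal population, a far simpler setting than the endogenous self-selection treated here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: standard normal, true mean 0 and true s.d. 1.
full = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Truncation: observations below the threshold never enter the sample.
threshold = 0.0
observed = full[full > threshold]

# The naive sample mean of the observed data is biased upward: for
# N(0, 1) truncated below at 0 it converges to the inverse Mills
# ratio phi(0) / (1 - Phi(0)) ~ 0.798, not to the true mean 0.
print(full.mean())      # close to the true mean 0
print(observed.mean())  # close to 0.798, far from 0
```

An estimator that ignores the unobserved lower tail therefore overstates the mean; correcting for the (here exogenous, in the paper endogenous) truncation mechanism is what restores consistency.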
The resulting algorithms are semiparametric and belong to the family of orthonormal polynomial series estimators (such as Fourier series), known for their flexibility and usefulness in nonparametric analysis. The most attractive feature of this estimator for our purpose is that it intrinsically prevents potential multicollinearity problems, a feature the kernel estimator does not possess. Each datum is generated by a different distribution function, characterized as a finite mixture of continuous distribution functions that are not restricted to be unimodal or symmetric; the proposed algorithm is therefore distribution-free. The number of reference groups is not arbitrarily imposed but is estimated using the SCAD (smoothly clipped absolute deviation) penalization mechanism. Monte Carlo simulations based on 2,000,000 different distribution functions, generating in practice 100 million non-i.i.d. realizations, attest to the very high accuracy of our model.
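For reference, the SCAD penalty has a standard closed form (Fan and Li's formulation, with the conventional tuning constant a = 3.7); a minimal sketch, where the function name and its scalar/array interface are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan & Li (2001): linear (LASSO-like) near zero,
    a quadratic blend in the middle, and constant beyond a*lam, so
    large coefficients are not shrunk.  `theta` may be an array."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                            # t <= lam
    quad = -(t**2 - 2.0 * a * lam * t + lam**2) / (2.0 * (a - 1.0))
    const = (a + 1.0) * lam**2 / 2.0                            # t > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))
```

Penalizing differences between group-level parameters with such a function lets similar groups merge while leaving well-separated groups unshrunk, so the number of distinct reference groups is selected by the data rather than imposed in advance.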