Abstract

The distribution of multivariate quantitative survey data is usually not normal. Skewed and semi-continuous distributions occur often. In addition, missing values and non-response are common. Together, these problems make multivariate outlier detection difficult. Examples of surveys where these problems occur are most business surveys and some household surveys, such as the Survey for the Statistics of Income and Living Conditions (SILC) of the European Union. Several methods for multivariate outlier detection are collected in the R package modi. This paper gives an overview of modi and its functions for outlier detection and the corresponding imputation. The use of the methods is explained with a business survey dataset. The discussion covers pre- and post-processing to deal with skewness and zero-inflation, the advantages and disadvantages of the methods, and the choice of parameters.

Highlights

  • In surveys on monetary values, several monetary variables are often collected to capture the economic situation of an entity

  • Several multivariate outlier detection and imputation procedures are contained in Version 1.6 of the package modi

  • The sepe data set was first prepared for the FP5 project EUREDIT (Charlton 2003) and was later used as protected data for educational purposes. For this demonstration of the modi package, we focus on 8 variables representing the most important expenditure and investment areas

Summary

Introduction

In surveys on monetary values, several monetary variables are often collected to capture the economic situation of an entity. This holds for business surveys, where respondents may be asked about many particular types of expenditure. Non-monetary quantitative variables may also be collected, such as various health indicators in a health survey, or physical production parameters in a business survey or in a survey on farm livestock. All these surveys have some common features: they have a complex sample design, including stratification and possibly sub-sampling; they have elaborate questionnaires; they have unit and item non-response; and they typically have zero-inflated distributions because of the multi-faceted economic situation.

Overview of the modi package
The SEPE data set
Applying the methods
EA – Epidemic Algorithm
Findings
Conclusion
