Dirty Data Research Articles

The purpose of this study is to empirically address questions pertaining to the effects of data screening practices in survey research. This study addresses questions about the impact of screening techniques on data and statistical analyses. It also serves as an initial attempt to estimate descriptive statistics and graphically display the distributions of popular screening techniques. Data were obtained from an online sample who completed demographic items and measures of character strengths (N = 307). Screening indices demonstrate minimal overlap and differ in the number of participants flagged. Existing cutoff scores for most screening techniques seem appropriate, but cutoff values for consistency-based indices may be too liberal. Screens differ in the extent to which they impact survey results. The use of screening techniques can impact inter-item correlations, inter-scale correlations, reliability estimates, and statistical results. While data screening can improve the quality and trustworthiness of data, screening techniques are not interchangeable. Researchers and practitioners should be aware of the differences between data screening techniques and apply appropriate screens for their survey characteristics and study design. Low-impact direct and unobtrusive screens such as self-report indicators, bogus items, instructed items, longstring, individual response variability, and response time are relatively simple to administer and analyze. The fact that data screening can influence the statistical results of a study demonstrates that low-quality data can distort hypothesis testing in organizational research and practice. We recommend analyzing results both before and after screens have been applied.
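Two of the unobtrusive screens named above, longstring and individual response variability (IRV), can be illustrated with a minimal sketch. It assumes longstring is the longest run of identical consecutive answers and IRV is the sample standard deviation of one respondent's answers. The function names, the toy data, and the cutoff values are hypothetical and are not taken from the study.

```python
from itertools import groupby
from statistics import stdev


def longstring(responses):
    """Length of the longest run of identical consecutive responses."""
    return max(len(list(run)) for _, run in groupby(responses))


def irv(responses):
    """Individual response variability: sample SD of one respondent's answers."""
    return stdev(responses)


def flag_respondents(data, longstring_cutoff, irv_cutoff):
    """Flag each respondent on both screens.

    The cutoffs are supplied by the caller; the values used below are
    illustrative assumptions, not recommendations from the study.
    """
    flags = {}
    for pid, responses in data.items():
        flags[pid] = {
            "longstring": longstring(responses) >= longstring_cutoff,
            "low_irv": irv(responses) <= irv_cutoff,
        }
    return flags


# Toy data: p1 answers "3" to every item, p2 varies their answers.
sample = {
    "p1": [3, 3, 3, 3, 3, 3, 3, 3],
    "p2": [1, 4, 2, 5, 3, 4, 2, 5],
}
flags = flag_respondents(sample, longstring_cutoff=6, irv_cutoff=0.5)
```

Because the screens are not interchangeable, a sketch like this would typically be run once per screen, with the analysis repeated on the unscreened and screened samples.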

Determining if two sets are related (that is, if they have similar values or if one set contains the other) is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can be a useful tool to discover whether columns from two different databases are joinable; if enough of the values in the columns match, it may make sense to join them. A common metric measures the relatedness of two sets by treating the elements as vertices of a bipartite graph and calculating the score of the maximum matching between elements. Compared to other metrics, which require exact matches between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric is computationally expensive, taking O(n^3) time for each set-to-set comparison, where n is the number of elements in the sets. Thus, for applications that search for all pairings of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set which is related must match the signature. SilkMoth then uses these signatures to prune the search space, so only sets that match the signatures are left as candidates. Finally, SilkMoth applies the maximum matching metric to the remaining candidates to verify which of them are truly related sets. An important property of SilkMoth is that it is guaranteed to output exactly the same related set pairings as the brute-force method, unlike approximate techniques. Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself based on the triangle inequality. Compared to related approaches, SilkMoth is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.
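The maximum matching metric described above can be illustrated with a small sketch. It is not the system's implementation: it enumerates every pairing by brute force (feasible only for tiny sets, whereas the abstract refers to an O(n^3) matching algorithm), it assumes token-level Jaccard as the element similarity function, and the function names and the normalization by the larger set's size are hypothetical choices made for the example.

```python
from itertools import permutations


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def max_matching_score(r, s, sim=jaccard):
    """Score of the best one-to-one pairing between elements of r and s.

    Brute force over all pairings, for illustration only; a real system
    would use a polynomial-time bipartite matching algorithm.
    """
    small, large = (r, s) if len(r) <= len(s) else (s, r)
    best = 0.0
    for perm in permutations(large, len(small)):
        score = sum(sim(x, y) for x, y in zip(small, perm))
        best = max(best, score)
    return best


def relatedness(r, s, sim=jaccard):
    """Hypothetical normalization: matching score over the larger set size."""
    if not r or not s:
        return 0.0
    return max_matching_score(r, s, sim) / max(len(r), len(s))


# Two columns with near-duplicate values, as in the joinable-columns example.
col_a = ["new york city", "boston"]
col_b = ["new york", "boston", "chicago"]
score = max_matching_score(col_a, col_b)
```

Here "new york city" pairs with "new york" at similarity 2/3 and "boston" pairs with itself at 1, so inexact values still contribute to the score, which is the robustness to dirty data that the abstract highlights.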

Articles published on Dirty Data

Efficient histogram-based range query estimation for dirty data

DiffusionInsighter: Visual Analysis of Traffic Diffusion Flow Patterns

Big Data's Dirty Secret

A Survey on Cleaning Dirty Data Using Machine Learning Paradigm for Big Data Analytics

Improvement of training set structure in fusion data cleaning using Time-Domain Global Similarity method

HAP

Dirty Data: The Effects of Screening Respondents Who Provide Low-Quality Data in Survey Research

CleanM

Mvp - an open-source preprocessor for cleaning duplicate records and missing values in mass spectrometry data.

SilkMoth

Time series data cleaning

EntityManager: Managing Dirty Data Based on Entity Resolution

VSMURF: A Novel Sliding Window Cleaning Algorithm for RFID Networks

Efficient Sequential Data Migration Scheme Considering Dying Data for HDD/SSD Hybrid Storage Systems

Eager Synching

Qualitative data cleaning

ActiveClean

Non-Volatile Memory-Based Buffer Cache Management for Improving the Performance of Smartphone Storage

Caching Strategies for Distributed System: A Critical Research Study

Eliminating Periodic Flush Overhead of File I/O with Non-Volatile Buffer Cache
