Abstract
Given that data is readily available at our fingertips, data scientists face a difficult task when integrating multiple databases. This task is commonly known as entity resolution: merging structured databases whose records may refer to the same individual (business, etc.) when unique identifiers are unavailable, only partially available, or unreliable. Applications of this kind are only growing in the literature. As recently noted, such data sets “in the wild may have temporal variation, missing values, data distortions, and large amounts of noise.” Examples include electronic health data, human rights conflict data, official statistics and survey data, web-scraped data, financial data, and others. As statistical data scientists, our goal is to clean such data for predictive or inferential analyses. This article reviews the data cleaning pipeline and the seminal entity resolution tasks that have inspired the rise of Bayesian entity resolution, which inherently allows one to quantify the uncertainty of the entity resolution task through posterior inference.
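To make the posterior-inference idea concrete, the following is a minimal toy sketch (in Python, not from the article) of how Bayes' rule can quantify the uncertainty of a single match decision. It uses a Fellegi–Sunter-style comparison model; all field names, agreement probabilities, and the prior match probability are assumed purely for illustration.

```python
# Toy illustration of posterior inference for one record pair.
# All numbers below are hypothetical and chosen only for this sketch.

# Probability that each field agrees under the match (m) and
# non-match (u) hypotheses, in a Fellegi-Sunter-style model.
m_probs = {"name": 0.95, "dob": 0.98, "zip": 0.90}  # P(field agrees | match)
u_probs = {"name": 0.05, "dob": 0.01, "zip": 0.10}  # P(field agrees | non-match)
prior_match = 0.01                                   # P(match) before seeing the data

def posterior_match(agreements: dict) -> float:
    """Posterior probability that two records refer to the same entity,
    assuming fields are conditionally independent given match status."""
    like_m = prior_match
    like_u = 1.0 - prior_match
    for field, agrees in agreements.items():
        like_m *= m_probs[field] if agrees else 1.0 - m_probs[field]
        like_u *= u_probs[field] if agrees else 1.0 - u_probs[field]
    return like_m / (like_m + like_u)

# Records agree on name and date of birth but disagree on zip code
# (e.g., one person who recently moved).
print(posterior_match({"name": True, "dob": True, "zip": False}))
```

The output is a posterior probability rather than a hard yes/no link, which is the sense in which Bayesian entity resolution carries uncertainty through to downstream analyses; the full Bayesian models reviewed in the article place priors over entire linkage structures rather than scoring one pair at a time.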