Abstract

In business intelligence for retail, consistent and unambiguous product dimension information is critical. Ensuring it is challenging, especially when an organization does not fully control the source of its transaction or master data. This is the case when brands rely on data provided directly by consumers through images of receipts. Product name strings obtained by digitizing receipts often contain substitution, insertion, and deletion errors, which prevent product names from serving as a useful dimension for further analysis. This paper proposes a clustering-based approach that removes this noise by linking error-laden product names to their underlying SKUs. The problem can be modeled as an entity resolution problem: each digitized product name is a reference to an underlying entity, the SKU. The entity resolution problem can in turn be modeled as a clique-partitioning problem, which can be solved in reasonable time with an agglomerative clustering heuristic. Results from clustering a synthetic data set show that the approach can successfully resolve product references to reveal coarse-grained (i.e., category, generic product) groupings. Future work includes implementing blocking strategies, optimizing the model parameters, and understanding the limits of the model for fine-grained (i.e., size variation) groupings.
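
As a rough illustration of the idea described above, the following Python sketch clusters noisy product name strings with a greedy agglomerative heuristic. It is not the paper's implementation: the abstract does not specify the similarity measure, linkage rule, or stopping criterion, so the normalized Levenshtein distance, the average linkage, and the `threshold` parameter used here are all assumptions, and the function names and sample receipt strings are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed measure)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def average_linkage(c1: list[str], c2: list[str]) -> float:
    """Mean pairwise distance between two clusters (assumed linkage rule)."""
    total = sum(normalized_distance(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster_names(names: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Greedily merge the closest pair of clusters until none is close enough.

    Each resulting cluster stands for one hypothesized underlying SKU.
    The threshold value is an illustrative assumption, not a tuned parameter.
    """
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), average_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical digitized receipt strings with OCR-style errors.
    sample = [
        "COCA COLA 12OZ", "C0CA COLA 12OZ", "COCA COLA 12Z",
        "WHOLE MILK 1GAL", "WHOLE MLK 1GAL", "WH0LE MILK 1GAL",
    ]
    for group in cluster_names(sample):
        print(group)
```

This brute-force version compares every pair of clusters on each pass, so it is only practical for small inputs. The blocking strategies mentioned as future work would reduce the number of comparisons by restricting them to candidate pairs that are likely to match.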
