Steered Training Data Generation for Learned Semantic Type Detection

Sven Langenecker, Christian Schalles, Christoph Sturm, Carsten Binnig

doi:10.1145/3589786

Journal: Proceedings of the ACM on Management of Data

Publication Date: Jun 13, 2023

Affiliation: Technical University of Darmstadt, Ansbach University of Applied Sciences

#Data Lakes #Training Data

Abstract

In this paper, we introduce STEER, an approach for adapting learned semantic type extraction models to a new, unseen data lake. STEER provides a data programming framework for semantic labeling that generates new labeled training data with minimal overhead. At its core is a novel training data generation procedure called Steered-Labeling, which can generate high-quality training data for both non-numeric and numeric columns. With this generated training data, STEER fine-tunes existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that it significantly improves the performance of two different types of learned models across all data lakes.
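
To make the data programming idea in the abstract more concrete, the following is a minimal sketch of how labeling functions might assign semantic types to columns. It is not STEER's API and does not reproduce the Steered-Labeling procedure, which the abstract does not describe. All function names, the example types ("country", "year"), the thresholds, and the majority-vote aggregation are assumptions chosen for illustration only.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical signature: (column header, column values) -> semantic type or ABSTAIN
LabelingFunction = Callable[[str, Sequence[object]], Optional[str]]


def lf_header_country(header: str, values: Sequence[object]) -> Optional[str]:
    """Vote 'country' when the header mentions it (assumed heuristic)."""
    return "country" if "country" in header.lower() else ABSTAIN


def lf_known_countries(header: str, values: Sequence[object]) -> Optional[str]:
    """Vote 'country' when at least 80% of values appear in a small lookup set."""
    known = {"germany", "france", "spain", "italy"}  # illustrative dictionary
    hits = sum(1 for v in values if str(v).strip().lower() in known)
    return "country" if values and hits / len(values) >= 0.8 else ABSTAIN


def lf_year_range(header: str, values: Sequence[object]) -> Optional[str]:
    """Numeric rule: vote 'year' when every value is an integer in 1900..2100."""
    if not values:
        return ABSTAIN
    all_years = all(
        isinstance(v, int) and not isinstance(v, bool) and 1900 <= v <= 2100
        for v in values
    )
    return "year" if all_years else ABSTAIN


def label_column(
    header: str, values: Sequence[object], lfs: Sequence[LabelingFunction]
) -> Optional[str]:
    """Aggregate labeling-function votes by simple majority; abstain on ties."""
    votes = [vote for vote in (lf(header, values) for lf in lfs) if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave the column unlabeled
    return ranked[0][0]


LFS = [lf_header_country, lf_known_countries, lf_year_range]
```

Columns that receive a label this way could serve as additional training examples for fine-tuning a learned model, while columns on which every function abstains stay unlabeled. Majority voting is the simplest aggregation choice; data programming systems typically use a learned label model instead.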
