Abstract

Web tables have become popular and important in many real applications, such as search engines and knowledge base enrichment. Given their usefulness, understanding web tables is an urgent problem. An important task in web table understanding is column-type detection, which identifies the most likely types (categories) that describe the columns of a web table. Some existing studies use knowledge bases to determine the column types. However, this problem poses three challenges. (i) Web tables are often too dirty to be interpreted reliably. (ii) Knowledge bases are not comprehensive enough to cover all the columns. (iii) Both knowledge bases and web tables are extremely large. As a result, traditional approaches suffer from low quality and poor scalability. They also cannot automatically select the best type from the top-k types. To address these limitations, we propose a collective inference approach (CIA) based on Topic-Sensitive PageRank, which considers not only the types of detected columns but also the collective information of web tables, and automatically produces more accurate top-k types, especially the top-1 type, for both incorrectly detected columns and undetectable columns whose cells do not exist in the knowledge base. We also propose three methods to improve the inference performance and implement the techniques of CIA in MapReduce. Experimental results on real-world datasets show that CIA achieves much higher quality in both top-1 type detection and entity enrichment, and significantly outperforms state-of-the-art approaches.
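To make the underlying ranking idea concrete, the following is a minimal sketch of generic Topic-Sensitive PageRank, in which the random surfer teleports only to a chosen set of seed nodes instead of to all nodes. It is not the CIA algorithm itself: the abstract does not specify how the graph is built, so the toy graph, the node names (`col_A`, `type:City`, and so on), the function name `topic_sensitive_pagerank`, and the damping factor of 0.85 are all assumptions made for illustration.

```python
from collections import defaultdict


def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration for PageRank with teleportation restricted to `seeds`."""
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    # Teleport vector: uniform over the seed (topic) nodes, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        # Every node starts the round with its share of the teleport mass.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                # Split the damped rank evenly across outgoing links.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: return its damped mass to the seed set.
                for s in nodes:
                    new_rank[s] += damping * rank[n] * teleport[s]
        rank = new_rank
    return rank


# Hypothetical toy graph: columns linked to candidate types (names are made up).
undirected = [
    ("col_A", "type:City"),
    ("col_A", "type:Place"),
    ("col_B", "type:City"),
    ("col_B", "type:Country"),
]
# Treat each link as bidirectional by adding both directions.
edges = undirected + [(dst, src) for src, dst in undirected]

# Bias the walk toward col_A, standing in for a column whose type is trusted.
ranks = topic_sensitive_pagerank(edges, seeds={"col_A"})
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node:14s} {score:.4f}")
```

In this toy graph, `type:City` is directly linked to the seed column while `type:Country` is reachable only through `col_B`, so `type:City` receives the higher score. That is the general effect of biasing the teleport step toward trusted nodes: rank concentrates near the seeds and decays with distance from them.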
