Abstract

Extracting information from Web pages requires the ability to work at Web scale in terms of the number of documents, the number of domains and domain complexity. Recent approaches have used existing knowledge bases to learn to extract information with promising results. In this paper we propose the use of distant supervision for relation extraction from the Web. Distant supervision is a method which uses background information from the Linking Open Data cloud to automatically label sentences with relations to create training data for relation classifiers. Although the method is promising, existing approaches are still not suitable for Web extraction as they suffer from three main issues: data sparsity, noise and lexical ambiguity. Our approach reduces the impact of data sparsity by making entity recognition tools more robust across domains, as well as extracting relations across sentence boundaries. We reduce the noise caused by lexical ambiguity by employing statistical methods to strategically select training data. Our experiments show that using a more robust entity recognition approach and expanding the scope of relation extraction results in about 8 times the number of extractions, and that strategically selecting training data can result in an error reduction of about 30%.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.