Abstract

Information Extraction (IE) systems that can exploit the vast body of textual information on the internet would be a major step forward in delivering large volumes of content cheaply and precisely, enabling a wide range of new knowledge-driven applications and services. Despite this potential, few IE systems have made the transition from laboratory to commercial application. The reason may be purely practical: building usable, scalable IE systems requires bringing together a range of different technologies and providing clear, reproducible guidelines for configuring and deploying those technologies together. This paper attempts to address these issues and has two primary goals. First, we show that an information extraction system for real-world applications across different domains can be built from autonomous, cooperative components (agents). Such a system has several advanced properties, including clear separation of extraction tasks and steps, portability across application domains, trainability, and extensibility. Second, we show that machine learning, and in particular learning in different ways and at different levels, can be used to build practical IE systems. We show that carefully matching the machine learning technique to the task, combined with selective sampling, can reduce the human effort required to annotate examples for building such systems.
