Abstract
Online information resources that are based on a structured data collection provide a rich set of operators for accessing their content. However, the vast majority of online information sources are based on collections of documents in unstructured form, and are not amenable to searching or navigation other than by the relatively unsophisticated methods of keyword-based search and document-at-a-time retrieval. Manually creating large structured collections from large sets of unstructured documents is not feasible. Thus, there is a need to develop tools that can automate, as much as possible, the process of extracting information from a wide variety of unstructured documents into structured form. Over the past decade, there has been intense research toward achieving the goal of effective information extraction. However, most research to date has traded off the level of automation against the degree of structure required of the input documents. Some systems have focused on achieving a high level of automation but require well-structured input texts. Other systems require manual interaction as part of the extraction process, but work with relatively unstructured input texts. Still other systems have taken a middle road, with medium levels of automation on a reasonable range of documents. The main contribution of this dissertation is a novel approach to information extraction that fills a gap in the space of solutions to this problem: we make minimal assumptions about the structure or format of input documents, and we require minimal manual effort from users. The key idea behind our approach is that, instead of designing extraction rules manually, we incorporate machine learning algorithms into our system, using multiple learners to handle the different tasks involved in information extraction: feature selection, region identification, text classification, synopsis extraction, pattern discovery and pattern matching.
In this dissertation, we describe complete solutions, including architectures, algorithms and implementations, that address three of the most important problems in information extraction today: document decomposition, text classification and data extraction. Our solutions achieve information extraction effectiveness that is as good as or better than that of related systems.