Abstract

Discovering potentially useful and previously unknown information or knowledge from heterogeneous web content, for example answering the query "list all laptop prices from Walmart and Staples between 2013 and 2015, including type, screen size, CPU power, and year of make", requires the difficult tasks of finding the schema of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse before web content can be extracted and mined from the database. Wrappers that extract target information from web pages can be manual, semi-supervised, or automatic systems. Automatic systems such as the WebOMiner system use data extraction techniques that parse the HTML source code of a web page into a document object model (DOM) tree and then traverse the DOM for pattern discovery. Limitations of these existing systems include their reliance on complicated matching techniques, such as tree matching and finite state automata, and their failure to yield accurate results for complex queries such as historical and derived queries. This paper proposes WebOMiner S, which applies web structure and content mining approaches to the DOM tree of the HTML code in order to simplify the web data extraction process of the WebOMiner system and make it more easily extendable. The WebOMiner system is based on non-deterministic finite state automata (NFA) that recognize and extract different types of web content (e.g., text, images, links, and lists). The proposed WebOMiner S replaces the NFA of WebOMiner with a frequent structure finder algorithm, which uses regular expression matching with the Java XPath parser and its methods (such as compile() and evaluate()) to dynamically discover the most frequent structure in the DOM tree, that is, the most frequently repeated block in the HTML code, represented as tags. This approach eliminates the need for supervised training and for updating the wrapper for each new B2C web page, making the approach simpler, more easily extendable, and automated.
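The idea of finding the most frequently repeated block with the Java XPath methods compile() and evaluate() can be illustrated with a minimal sketch. This is not the paper's frequent structure finder algorithm: the class name FrequentStructureSketch, the method mostFrequentStructure, the XPath expression //*[@class], and the use of a "tag.class" pair as the block signature are all assumptions chosen for illustration, the regular expression matching step is omitted, and the input is assumed to be well-formed XHTML so that the standard DOM parser accepts it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the WebOMiner S implementation.
class FrequentStructureSketch {

    // Returns the "tag.class" signature that occurs most often in the page,
    // or null if no element carries a class attribute.
    // Assumption: the page is well-formed XHTML.
    static String mostFrequentStructure(String xhtml) {
        try {
            // Parse the page source into a DOM tree.
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Compile an XPath expression selecting every element with a class attribute.
            XPathExpression expr = XPathFactory.newInstance()
                    .newXPath()
                    .compile("//*[@class]");

            // Evaluate the expression against the DOM to obtain the candidate blocks.
            NodeList nodes = (NodeList) expr.evaluate(dom, XPathConstants.NODESET);

            // Count how often each signature is seen, remembering first-seen order.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String signature = e.getTagName() + "." + e.getAttribute("class");
                counts.merge(signature, 1, Integer::sum);
            }

            // Pick the signature with the highest count; ties keep the first one seen.
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            return best;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse or query the page", ex);
        }
    }

    public static void main(String[] args) {
        // Made-up product listing page used only to exercise the sketch.
        String page =
                "<html><body>"
              + "<div class='nav'><a class='link'>Home</a></div>"
              + "<div class='product'><span>Laptop A</span></div>"
              + "<div class='product'><span>Laptop B</span></div>"
              + "<div class='product'><span>Laptop C</span></div>"
              + "</body></html>";
        System.out.println(mostFrequentStructure(page));
    }
}
```

Because the signature is derived from the page itself rather than from a trained or hand-written wrapper, the same call can be applied to a new page without modification, which is the property the abstract attributes to the proposed approach. Real B2C pages are rarely well-formed XML, so a practical version would first need an HTML-tolerant parser or a cleanup step.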
