Abstract

The web has established itself as the dominant medium for doing electronic commerce. Realizing that its global reach provides significant market and business opportunities, service providers, both large and small are advertising their services on the web. A number of them operate their own web sites promoting their services at length while others are merely listed in a referral site. Aggregating all of the providers into a queriable service directory makes it easy for customers to locate the one most suited for his/her needs.YellowPager is a tool for creating service directories by mining web sources. Service directories created by YellowPager have several merits compared to those generated by existing practices, which typically require participation by service providers (e.g. Verizon's SuperYellowPages.com). Firstly, the information content will be rich. Secondly since the process is automated and repeatable the content can always be kept current. Finally the same process can be readily adapted to different domains.YellowPager builds service directories by mining the web through a combination of keyword-based search engines,web agents, text classifiers and novel extraction algorithms.The extraction is driven by a services ontology consisting of a taxonomy of service concepts and their associated attributes (such as names and addresses) and type descriptions for the attributes. In addition the ontology also associates an extractor function with each attribute. Applying the function to a web page will identify all the occurrences of the attribute in that page.YellowPager's mining algorithm consists of a training step followed by classification and extraction steps. In the training step a classifier is trained to identify web pages relevant to the service of interest. The classification step proceeds by doing a search for the particular service of interest using a keyword based web search engine and retrieves all the matching web pages. From these pages the relevant ones are identified using the classifier. The final step is extraction of attribute values, associated with the service, from these pages. Each web page is parsed into a DOM tree and the extractor functions are applied. All of the attributes corresponding to a service provider are then correctly aggregated. This can pose difficulties especially in the presence of multiple service providers in a page. Using a novel concept of scoring and conflict resolution to prevent erroneous associations of attributes with service provider entities in the page, the algorithm aggregates all the attribute occurrences correctly. The extractor function may not be complete in the sense that it cannot always identify all the attributes in a page. By exploiting the regularity of the sequence in which attributes occurr in referral pages, the mining algorithm automatically learns generalized patterns to locate attributes that the extractor function misses. The distinguishing aspects of YellowPager's extraction algorithm are: (i) it is unsupervised, and (ii) the attribute values in the pages are extracted independent of any page-specific relationships that may exist among the markup tags.YellowPager has been used by a large pet food producer to build a directory of veterinarian service providers in the United States. The resulting database was found to be much larger and richer than that found in Vetquest, Vetworld, and the Super Yellow pages.YellowPager is implemented in JAVA and is interfaced to Rainbow, a library utility in C that is used for classification. The tool will demonstrate the creation of a service directory for any service domain by mining web sources.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.