APPLICATIONS OF GRAPH PROBING TO WEB DOCUMENT ANALYSIS

Daniel Lopresti, Gordon Wilfong

doi:10.1142/9789812775375_0002

Abstract

Graphs are a fundamental representation in much of computer science, including the analysis of both traditional and Web documents. Algorithms for higher-level document understanding tasks often use graphs to encode logical structure. HTML pages are usually regarded as tree-structured, while the WWW itself is an enormous, dynamic multigraph. Much work on extracting information from Web pages makes explicit or implicit use of graph representations [1, 3, 4, 7, 11]. It follows, then, that the ability to compare two graphs is basic functionality, as demonstrated in such applications as query-by-structure, wrapper generation for information extraction, and performance evaluation. Because most problems relating to graph comparison have no known efficient, guaranteed-optimal solution, researchers have developed a wide range of heuristics. For the problem of determining isomorphism, for example, many heuristics rely on the existence of certain vertex invariants, which consist of a value f(v) assigned to each vertex v, so that under any isomorphism I, if I(v) = v' then f(v) = f(v'). One commonly used invariant is the degree of a vertex. In fact, nauty, a successful software package for determining graph isomorphism (see [9]), relies on such vertex invariants. This observation can be seen as forming the basis for graph probing, a paradigm we have recently begun exploring for graph comparison [5, 8]. However, we desire more than a simple “yes/no” answer; we are interested in quantifying the similarity between two graphs, not just in whether they may be isomorphic. Conceptually, the idea of probing is to place each of the two graphs under study inside a “black box” capable of evaluating a set of graph-oriented operations (e.g., returning a list of all the leaf vertices, or all vertices labeled in a certain way). We then pose a series of probes and correlate the responses of the two systems.
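
As an illustration only, the degree invariant mentioned above can be turned into a simple probe: ask each graph how many vertices it has of each degree, then sum the disagreements. The sketch below is not the authors' implementation; the function names (degree_probe, probe_distance) and the (vertices, edges) representation are assumptions made for this example. Isomorphic graphs always score 0, but a score of 0 does not by itself establish isomorphism.

```python
from collections import Counter


def degree_probe(vertices, edges):
    """Answer the probe 'how many vertices have degree d?' for every d.

    Hypothetical helper: vertices is an iterable of hashable labels and
    edges is an iterable of (u, v) pairs of an undirected graph.
    """
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Histogram: degree value -> number of vertices with that degree.
    return Counter(degree.values())


def probe_distance(graph_a, graph_b):
    """Sum of absolute differences between the two degree histograms."""
    hist_a = degree_probe(*graph_a)
    hist_b = degree_probe(*graph_b)
    return sum(abs(hist_a[d] - hist_b[d]) for d in hist_a.keys() | hist_b.keys())


# A three-vertex path, a relabeled copy of it, and a triangle.
path = (["a", "b", "c"], [("a", "b"), ("b", "c")])
relabeled_path = (["x", "y", "z"], [("y", "x"), ("y", "z")])
triangle = (["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])

print(probe_distance(path, relabeled_path))  # relabeling does not change the score
print(probe_distance(path, triangle))        # structurally different graphs disagree
```

Because the degree of a vertex is preserved under any isomorphism, relabeling the vertices leaves the histogram unchanged, which is why the first comparison cannot report a difference.
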
Our past work in the area treats graph probing as an online process; both the query graph and the database graph are available for synthesizing the probe set. While this is an appropriate assumption when one is comparing, say, the output of a recognition algorithm with its associated ground-truth, it is not a workable model for retrieval applications when the database contains anything other than a small number of documents. In this paper, we describe our first steps towards adapting the graph probing paradigm to allow pre-computation of a compact, efficient probe set for databases of graph-structured documents in general, and Web pages coded in HTML in particular. This new model is shown in Figure 1, where the portion of the computation bounded by dashed lines is performed off-line. We consider both comparing two graphs in their entirety and determining whether one graph contains a subgraph that closely matches the other. We present an overview of work in progress, as well as some preliminary experimental results.
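
To make the off-line/on-line split concrete, the following minimal sketch pre-computes one probe response per stored HTML page (a count of vertices per tag label) and, at query time, probes only the query page and ranks the stored responses by how much they disagree with it. This is an assumed, simplified probe set chosen for illustration, not the probe set developed in the paper; TagProbe, signature, build_index, and rank are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagProbe(HTMLParser):
    """Counts start tags, i.e. tree vertices grouped by their tag label."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def signature(html):
    """Probe response for one page: tag label -> number of occurrences."""
    probe = TagProbe()
    probe.feed(html)
    probe.close()
    return probe.counts


def build_index(pages):
    """Off-line step: compute and store one signature per database page."""
    return {name: signature(html) for name, html in pages.items()}


def rank(index, query_html):
    """On-line step: probe only the query, then sort stored pages by disagreement."""
    query = signature(query_html)

    def disagreement(name):
        stored = index[name]
        return sum(abs(stored[t] - query[t]) for t in stored.keys() | query.keys())

    return sorted(index, key=disagreement)


pages = {
    "text_page": "<html><body><p>x</p><p>y</p></body></html>",
    "table_page": "<html><body><table><tr><td>1</td></tr></table></body></html>",
}
index = build_index(pages)
ranking = rank(index, "<html><body><p>q</p></body></html>")
print(ranking)
```

The design point is that the stored pages are never re-parsed at query time; only their pre-computed signatures are consulted, so query cost grows with the size of the signatures rather than with the size of the documents.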

Full Text