Abstract

Considerable quantities of valuable data, such as product information and financial statements, are often available in sources and formats that are not amenable to querying with traditional database techniques. One such important source is text documents, in which these kinds of data often appear in tabular form. A data item in these text tables may span several words (e.g., a product description). Furthermore, items supposedly within the same column do not necessarily begin or end at the same position. The absence of any regularity in column separators thus makes it difficult to automatically mine text tables, i.e., to extract data items from them. Nevertheless, an interesting characteristic often exhibited by these tables is that intra-column items are “closer” to each other than inter-column items. We exploit this observation to develop a clustering-based technique to extract data items from these tables. In contrast to previous approaches, a unique and important aspect of using clustering is that it makes the technique robust in the presence of misalignments. We provide a characterization theorem for text tables on which this technique will always produce a correct extraction. We discuss the design and implementation of a system for extracting tabular data based on this clustering technique, and we present experimental evidence of its effectiveness and usability on real industrial data.

∗This work was supported by industry and university grant – NSF IIS0072927.
†XSB, Inc., Suite 115, Nassau Hall, Stony Brook, NY 11790. davulcu@cs.sunysb.edu
‡XSB, Inc., Suite 115, Nassau Hall, Stony Brook, NY 11790 and Department of Computer Science, SUNY Stony Brook, Stony Brook, NY 11794. saikat@cs.sunysb.edu
§XSB, Inc., Suite 115, Nassau Hall, Stony Brook, NY 11790 and Department of Computer Science, SUNY Stony Brook, Stony Brook, NY 11794. ram@cs.sunysb.edu
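
The abstract's key observation, that intra-column items lie closer together than inter-column items, can be illustrated with a small Python sketch. This is not the authors' algorithm, which the abstract does not spell out. It is a deliberately simplified stand-in that assumes horizontal character distance is the only notion of closeness, and that words separated by at most `max_gap` spaces (a hypothetical parameter) belong to the same column. Under those assumptions, single-linkage clustering of word positions reduces to merging overlapping or nearly touching character intervals. The function names `word_spans`, `column_bands`, and `extract_rows` are likewise made up for this illustration.

```python
import re
from typing import List, Tuple


def word_spans(line: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for each whitespace-delimited word; end is exclusive."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]


def column_bands(lines: List[str], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Cluster the word intervals of all lines into horizontal column bands.

    Assumption of this sketch: two words are linked when the horizontal gap
    between their intervals is at most max_gap characters (single linkage).
    """
    spans = sorted((s, e) for line in lines for s, e, _ in word_spans(line))
    bands: List[List[int]] = []
    for s, e in spans:
        if bands and s - bands[-1][1] <= max_gap:
            # Close enough to the current band: extend it.
            bands[-1][1] = max(bands[-1][1], e)
        else:
            # Gap is too wide: start a new band (a new column).
            bands.append([s, e])
    return [(s, e) for s, e in bands]


def extract_rows(lines: List[str], max_gap: int = 1) -> List[List[str]]:
    """Assign each word to its band and join the words that share a cell."""
    bands = column_bands(lines, max_gap)
    rows = []
    for line in lines:
        cells: List[List[str]] = [[] for _ in bands]
        for s, e, text in word_spans(line):
            for i, (band_start, band_end) in enumerate(bands):
                if band_start <= s and e <= band_end:
                    cells[i].append(text)
                    break
        rows.append([" ".join(cell) for cell in cells])
    return rows


# A made-up table whose columns do not start or end at fixed positions.
lines = [
    "Steel bolt M8        12    0.35",
    "Hex nut               7    0.10",
    "Flat washer large   100     0.05",
]
rows = extract_rows(lines)
for row in rows:
    print(row)
```

In this toy input the multi-word descriptions end up in a single cell each, even though the quantity and price columns are not aligned from line to line. A table in which some line bridges two columns with only a single space would defeat this simplified version, which is the kind of case a characterization of "well-behaved" tables has to rule out.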
