Abstract
The rapid progress of computer and network technologies makes it easy to collect and store large amounts of unstructured or semi-structured text such as Web pages, HTML/XML archives, e-mails, and text files. These text data can be regarded as large-scale text databases, and thus it becomes important to develop efficient tools to discover interesting knowledge from such text databases.

There is a large body of data mining research on discovering interesting rules or patterns from well-structured data such as transaction databases with boolean or numeric attributes [1,8,13]. However, it is difficult to directly apply traditional data mining technologies to the text or semi-structured data mentioned above, since these text databases consist of (i) heterogeneous and (ii) huge collections of (iii) unstructured or semi-structured data. Consequently, there have so far been only a small number of studies on text mining, e.g., [4,5,12,17].

Our research goal is to devise an efficient semi-automatic tool that supports human discovery from large text databases. To this end, we require a fast pattern discovery algorithm that runs in, e.g., O(n) to O(n log n) time, so that it can respond in real time on an unstructured data set of total size n. Furthermore, such an algorithm has to be robust in the sense that it can work on a large amount of noisy and incomplete data without relying on the assumption of an unknown hypothesis class.