Abstract
Semi-structured documents are a common type of data containing free text in natural language (unstructured data) as well as additional information about the document, or metadata, typically following a schema or controlled vocabulary (structured data). Simultaneous analysis of unstructured and structured data enables the discovery of hidden relationships that cannot be identified when either source is analyzed independently. In this work, we present a visual text analytics tool for semi-structured documents (ViTA-SSD) that aims to support the user in exploring a semi-structured document collection and finding insightful patterns in a visual and interactive manner. It achieves this goal by presenting to the user a set of coordinated visualizations that link the metadata with interactively generated clusters of documents in such a way that relevant patterns can be easily spotted. The system contains two novel approaches in its back end: a feature-learning method to learn a compact representation of the corpus and a fast-clustering approach that has been redesigned to allow user supervision. These novel contributions make it possible for the user to interact with a large and dynamic document collection and to perform several text analytical tasks more efficiently. Finally, we present two use cases that illustrate the suitability of the system for in-depth interactive exploration of semi-structured document collections, two user studies, and results of several evaluations of our text-mining components.
More From: ACM Transactions on Interactive Intelligent Systems