Abstract
Dimensionality reduction transforms data from a high-dimensional space into a visual space while preserving the existing relationships. This abstract representation of complex data enables the exploration of data similarities, but it makes analysis and interpretation challenging when users' expectations do not match the visual representation. One way to model these different understandings is through different feature extractors, since each feature encodes characteristics in its own way. Because there is no perfect feature extractor, the combination of multiple sets of features has been explored through a process called feature fusion. Feature fusion can be readily performed when machine learning or data mining algorithms have a cost function. When such a function does not exist, however, user support must be provided; otherwise the process is impractical. In this project, we present a novel feature fusion approach that employs data samples and visualization to allow users not only to effortlessly control the combination of different feature sets but also to understand the attained results. The effectiveness of our approach is confirmed by a comprehensive set of qualitative and quantitative experiments, opening up different possibilities for user-guided analytical scenarios. The ability of our approach to provide real-time feedback for feature fusion is exploited in the context of unsupervised clustering techniques, where users can perform an exploratory process to discover the combination of features that best reflects their individual perceptions of similarity.

A traditional way to visualize data similarities is the scatter plot; however, scatter plots suffer from overlap. Overlap hides data distributions and makes the relationships among data instances difficult to observe, which hampers data exploration. To tackle this issue, we developed a technique called Distance-preserving Grid (DGrid). DGrid employs a binary space partitioning process in combination with the output of a dimensionality reduction technique to create orthogonal regular grid layouts. DGrid ensures that instances do not overlap because each data instance is assigned to exactly one grid cell. Our results show that DGrid outperforms the existing state-of-the-art techniques while requiring only a fraction of the running time and computational resources, making DGrid a very attractive method for large datasets.
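The abstract does not spell out the partitioning procedure, so the following Python sketch only illustrates the general idea of combining a 2D projection with binary space partitioning to obtain a non-overlapping grid; it should not be read as the authors' implementation. The function name `grid_layout`, the rule of always splitting the longer grid side, and the proportional division of points between the two halves are all assumptions made for illustration.

```python
import numpy as np

def grid_layout(points, rows, cols):
    """Assign each 2D point to a distinct (row, col) cell of a rows x cols grid.

    Illustrative sketch only: recursively bisects the grid and the points
    together, so that points that are close in the projection tend to end up
    in nearby cells. Requires len(points) <= rows * cols.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    if n > rows * cols:
        raise ValueError("grid has fewer cells than points")
    cells = np.full((n, 2), -1, dtype=int)

    def split(idx, r0, c0, nrows, ncols):
        if len(idx) == 0:
            return  # empty region: its cells stay unused
        if nrows == 1 and ncols == 1:
            cells[idx[0]] = (r0, c0)  # at most one point can reach a single cell
            return
        if nrows > ncols:
            # Bisect horizontally: sort by y and give the first block of rows
            # a share of the points proportional to its number of rows.
            half = (nrows + 1) // 2
            order = idx[np.argsort(points[idx, 1], kind="stable")]
            k = (len(idx) * half + nrows - 1) // nrows  # ceiling division
            split(order[:k], r0, c0, half, ncols)
            split(order[k:], r0 + half, c0, nrows - half, ncols)
        else:
            # Bisect vertically: same idea along x and the grid columns.
            half = (ncols + 1) // 2
            order = idx[np.argsort(points[idx, 0], kind="stable")]
            k = (len(idx) * half + ncols - 1) // ncols  # ceiling division
            split(order[:k], r0, c0, nrows, half)
            split(order[k:], r0, c0 + half, nrows, ncols - half)

    split(np.arange(n), 0, 0, rows, cols)
    return cells

# Usage: place 20 projected points on a 5 x 5 grid.
rng = np.random.default_rng(0)
projection = rng.random((20, 2))  # stand-in for a dimensionality reduction output
layout = grid_layout(projection, rows=5, cols=5)
```

In this sketch the row index grows with the y coordinate and the column index with the x coordinate. Because every recursive call receives no more points than its sub-grid has cells, no two instances can share a cell, which is the non-overlap property described above.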