Abstract

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because multivariate data provide a more complete description. Where the data are high-dimensional, however, their complexity can defy analysis using traditional philological methods. The first step in dealing with such data is to visualize them using graphical methods in order to identify any latent structure. If found, such structure facilitates the formulation of hypotheses which can be tested using a range of mathematical and statistical methods. Where, however, the dimensionality is greater than 3, direct graphical investigation is impossible. This discussion offers a roadmap of how that obstacle can be overcome, and is in three main parts: the first presents some fundamental data concepts, the second describes an example corpus and a high-dimensional data set derived from it, and the third outlines two approaches to visualization of that data set: dimensionality reduction and cluster analysis.

Highlights

  • Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because, assuming that the variables describe different aspects of the texts in question, multivariate data provide a more complete description

  • Exemplification is based on data abstracted from a corpus of English historical texts with a known temporal distribution, allowing the efficacy of the methods covered in the discussion to be readily verified by the reader

  • Four distinct clusters of points are visually identifiable, and these correspond to four conventional periods in the development of English, as labelled; dimensionality reduction via selection of the highest-variance variables has yielded a good indication of the known chronological structure of the example corpus
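The variable selection mentioned in the last highlight can be illustrated with a minimal sketch. It assumes the data are held as a matrix with one row per text and one column per variable, and it keeps the k columns with the largest variance. The function name `select_highest_variance`, the toy matrix, and the choice of population variance are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import pvariance

def select_highest_variance(matrix, k):
    """Keep the k columns (variables) with the largest variance.

    Returns the kept column indices (in original order) and the reduced matrix.
    """
    n_cols = len(matrix[0])
    # Population variance of each column across all rows (texts).
    variances = [pvariance([row[j] for row in matrix]) for j in range(n_cols)]
    # Rank column indices by variance, largest first.
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in matrix]
    return keep, reduced

# Hypothetical toy data: 4 texts (rows) described by 4 variables (columns).
toy = [
    [10, 5, 0, 3],
    [12, 5, 1, 3],
    [1, 5, 9, 3],
    [0, 5, 8, 4],
]
keep, reduced = select_highest_variance(toy, 2)
print(keep)     # indices of the two retained variables
print(reduced)  # the 4 x 2 matrix, now directly plottable
```

With k set to 2 or 3 the reduced matrix can be plotted directly, which is the point of the reduction. Variables that are constant or nearly constant across the texts, such as the second and fourth columns here, contribute little to distinguishing them and are dropped.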


Summary

INTRODUCTION

Discovery of the chronological or geographical distribution of collections of historical text can be more reliable when based on multivariate rather than on univariate data because, assuming that the variables describe different aspects of the texts in question, multivariate data provide a more complete description. The discussion is in three parts. The first presents some fundamental data concepts: the nature of data, its representation using vectors and matrices, and its interpretation in terms of vector space and manifold. The second describes the corpus and a high-dimensional data set abstracted from it. The third outlines approaches to visualization of that data set, applying the concepts from the first part to the data from the second. Of these approaches, the second, cluster analysis, represents the structure of data in high-dimensional space directly, without dimensionality reduction.
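How cluster analysis works directly on high-dimensional vectors can be sketched as follows. This is a minimal agglomerative (hierarchical) procedure that assumes Euclidean distance and single linkage; the paper's own choice of distance measure and linkage is not stated in this summary, so both are assumptions, as are the function name `single_linkage` and the toy vectors.

```python
from math import dist

def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Cluster distance is the smallest Euclidean distance between any pair
    of members (single linkage). Returns clusters as lists of row indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a, then drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: 5 texts, each a vector in 4-dimensional space.
vectors = [
    (0, 0, 0, 0),
    (0, 1, 0, 0),
    (5, 5, 5, 5),
    (5, 6, 5, 5),
    (10, 0, 10, 0),
]
groups = single_linkage(vectors, 3)
print(groups)
```

No dimensions are discarded at any point: distances are computed in the full space, and only the resulting grouping is displayed. A full hierarchical analysis would record the sequence of merges and draw it as a tree rather than stopping at a fixed number of clusters.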

FUNDAMENTAL DATA CONCEPTS
CORPUS AND DATA
VISUALIZATION OF HIGH-DIMENSIONAL DATA
Variable selection
Varieties of cluster analysis
Hierarchical cluster analysis
Findings
Conclusion
