INTRODUCTION
The advent of inexpensive computing and the creation of large machine-actionable corpora consisting of well-structured digital texts have made it possible to analyze and mark for morphosyntactic features significant amounts of text (> 1,000,000 tokens) with a high degree of accuracy (> 80 percent) rapidly and automatically. Although the problem of automatically tagging text with part-of-speech (POS) information has been largely solved for languages with little morphonological complexity, (2) more complex languages, such as Old Icelandic (OIc) and other ancient languages, continue to pose problems for automated systems. Despite these difficulties, rich morphosyntactic markup that includes lemmatization holds great promise for both linguistic and textual scholarship. Accurate markup would enable the development of sophisticated online study environments that allow researchers to perform complex searches, make comparisons across multiple texts, and generate calculations concerning word-use and syntactical patterns. Our work, focusing on Old Icelandic, confirms that even for morphonologically complex Indo-European languages, the information gain from automatic morphosyntactic analysis of texts, measured as the percentage of correctly tagged tokens, sentences, and complete texts over the extant corpus, represents a marked improvement over previously available hand-marked texts (Rognvaldsson and Helgadottir 2008, 2011). (3) Even the most detailed and accurate indexes produced in the past centuries--such as Ordförrådet i de älsta islänska handskrifterna (Larsson 1891), which provides an accurate and exhaustive word-form index for a number of the oldest Old Icelandic manuscripts (ranging from the late twelfth to the mid-thirteenth century)--offer only minimal coverage when compared to the very large number of extant Old Icelandic texts.
For a researcher interested in the study of the entire Old Icelandic corpus (or a large sub-corpus of Old Icelandic literature), these early handbooks, no matter how accurately compiled, are of limited use. Unfortunately, it is not economically feasible to extend the earlier practice of manual encoding to a greater number of manuscripts; the manual compilation of handbooks is costly and requires tremendous amounts of time, expertise, and energy. The old paper-and-pen approach does not, to borrow a term from computer science, scale well. A dream of many researchers in Old Icelandic is to be able to work with a large number of texts (and manuscript witnesses to texts)--or even a comprehensive corpus--that include the high level of morphosyntactic detail of the early handbooks mentioned above. Similarly, historical linguists (especially syntacticians) are eager to work with a much larger parsed corpus of Old Icelandic texts than is currently available. Recent work, such as that of the Icelandic Parsed Historical Corpus group (IcePaHC) (Wallenberg et al. 2011), is a major step toward making such resources available, as it provides a considerable number of texts tagged in a semi-supervised fashion, and moves us closer to a comprehensive parsed Old Icelandic corpus. Yet it is unlikely that IcePaHC alone will provide adequate coverage for Old Icelandic textual research, in part because it is focused on the historical development of Icelandic up through the present, and in part because it provides limited lemmatization of the texts. As such, IcePaHC diverges from our project, which has as its sole focus the morphosyntactic analysis and lemmatization of Old Icelandic texts.
We believe that the computational methods developed by our group can augment those of IcePaHC and others, and have the potential not only to extend the necessarily limited scope of the earlier historical handbooks, but also to increase considerably the number of richly marked texts available to researchers. (4) Automatic morphosyntactic analysis of Old Icelandic offers an efficient method for accurately tagging millions of tokens in the growing corpus of machine-actionable texts. …