Abstract

While digital corpora have enabled new perspectives on the variation and continuums of human communication, they often pose problems related to implicit biases in the data and to the limited ability of current methods to recognise similarity in linguistically complex data, especially in small languages. The digital corpus of historical Finnic oral poetry in alliterative tetrametre is characterised by significant poetic, linguistic and orthographic variation. At the extreme, a word may be written in hundreds of different ways. The current corpus comprises 189,189 poetic texts in six Finnic languages (Karelian, Ingrian, Votic, Estonian, Seto and Finnish) recorded in 1564–1957 by 5,287 recorders. It has a long curation history and a significant bias towards the genres, poetic forms and regions that collectors preferred. In this poetic tradition, an idea is typically expressed with several parallel, partly alternative poetic lines or motifs, and similar verse types may be used in different contexts. Finding all occurrences of widely used expressions or motifs in the corpus manually is not feasible. While digital tools, from simple queries to more advanced methods, make wider intertextual analyses possible, some of the relevant material is typically not reached. It therefore becomes central to estimate the amount and quality of the relevant data that each method fails to recognise. Here, we discuss two strategies for mapping intertextuality in the corpus: 1) proceeding with text queries and 2) recognising similar poetic lines computationally, based on string similarity. We compare these approaches with one another, and then compare the results they yield with the existing type index and with the results of manual early 20th-century research.
While the methodological and theoretical foundations of that earlier research no longer hold, and while our own interest lies in intertextuality and variation rather than in the problematic concept of poem types, parts of the earlier analyses can still be used to evaluate the performance of digital approaches.
