Abstract

To enable quantitative studies of large volumes of data, it is often appropriate to create machine-readable forms of existing printed works. We have undertaken such a project (Embleton & Wheeler, 1997) for Finnish using an important, but out-of-print, dialect atlas (Kettunen, 1940), and have reached a stage where the primary data entry has been completed. Next, we need to confirm the accuracy of the data entry in a way that is both efficient for us and convincing to a potential user of the data or other outside party. We describe our testing protocol, our testing tools, and the practical concerns of selecting appropriate sample sizes for statistically based tests. A critical issue, however, is the inherent ambiguity in the data itself. Because the original dialect atlas used typographic conventions for marking dialect areas, the delineation of these areas has a different precision than the digital form. For example, Village A may be on the edge of an area marked with X’s, and on the edge of an area marked by O’s, but not definitely inside or outside either or both areas. For the atlas reader, the marginal relationship of the village to each of the two dialect features is obvious. In digitizing the map with the categories we have chosen, by contrast, it is necessary to assign the village to ‘X’ or ‘not X’, and to ‘O’ or ‘not O’. We outline our approach to resolving these issues for Finnish. The problem is much more general, though, and needs to be considered in the design of any such conversion of data for quantitative study.
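As an illustration of the sample-size concern mentioned above, the following is a minimal sketch of one common approach, zero-defect acceptance sampling. It is not taken from the paper, which does not state its method here. It assumes that entries are sampled independently from a data set large enough that sampling without replacement can be ignored, and that the check passes only if no errors are found. The function names and the example thresholds (1% error rate, 95% confidence) are hypothetical.

```python
import math


def zero_error_sample_size(max_error_rate: float, confidence: float) -> int:
    """Smallest n such that finding 0 errors in n randomly sampled entries
    supports the claim 'true error rate < max_error_rate' at the given
    confidence level. Assumes independent draws from a large data set."""
    if not (0.0 < max_error_rate < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("max_error_rate and confidence must lie in (0, 1)")
    # We need (1 - p)^n <= 1 - confidence, so n >= ln(1 - confidence) / ln(1 - p).
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_error_rate))


def error_rate_upper_bound(sample_size: int, confidence: float) -> float:
    """Upper confidence bound on the error rate after checking sample_size
    entries and finding no errors (same assumptions as above)."""
    if sample_size <= 0 or not (0.0 < confidence < 1.0):
        raise ValueError("sample_size must be positive, confidence in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)


if __name__ == "__main__":
    # Hypothetical target: at most 1% of entries wrong, 95% confidence.
    n = zero_error_sample_size(0.01, 0.95)
    print(n)  # 299
    print(error_rate_upper_bound(n, 0.95))
```

Under these assumptions, a tighter error-rate target or a higher confidence level increases the number of entries that must be rechecked against the atlas, which is the efficiency trade-off the abstract alludes to. If any errors are found in the sample, this simple bound no longer applies and a different calculation is needed.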

