Abstract

Statement of the problem. Digitization and technology are rapidly transforming the humanities, including cognitive linguistics, which is developing new algorithms and methods of language research; linguists are encouraged to acquire new skills in data collection and processing and to expand their sources of language material. In cognitive semantics and discourse studies, the biggest game-changer has been the ever-increasing (to the point of apparent limitlessness) volume of language data from digitized texts available on the Internet, which effectively accumulates in and functions as Big Data – data sets too numerous, large, and complex to be processed by traditional, ‘by-hand’ methods. Such extensive language data requires dedicated mining and analysis tools and techniques and imposes certain restrictions on its processing and interpretation. In this paper, the authors discuss approaches to the processing (‘intellectualization’) of empirical linguistic data obtained from various online sources via search engines, and the overall reliability of such data for cognitive semantics research.

Results. The main argument is that, to ensure the validity of the digital language data under study and the credibility of the results, researchers should employ several data collection methods and tools, i.e., opt for methodological triangulation. The authors describe their methodology with a practical example and conclude that big data systems such as indexed digital texts can effectively replace native speakers as a source of credible information on natural language semantics, as well as improve the overall quality and verifiability of semantic studies.

Conclusion. Treating the national segment of the Internet in a given language (English, Russian, etc.) as a natural text corpus of high representativeness, the researcher can use it as a tool for testing various linguistic hypotheses. At the same time, it is necessary to account for the intrinsic properties of big data, such as volume, velocity (swiftness of change), variability, lack of structure, value, and the presence of “noise”, and to carefully design and test procedures for using search engines to collect big data, so as to ensure the quality and verifiability of the results obtained.
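
To make the triangulation idea concrete, the following Python sketch compares the relative frequency of two competing collocations across several independent hit-count sources. It is a minimal illustration only: the source functions, phrases, and figures are hypothetical placeholders, not the authors' actual tools, engines, or data.

    # Illustrative sketch: methodological triangulation of web-derived
    # frequency estimates for competing phrases. The three estimator
    # functions are hypothetical stand-ins for independent data sources
    # (e.g. different search engines or corpora); they are not real APIs.

    def hits_source_a(phrase: str) -> int:
        # Placeholder: would query source A for the number of indexed
        # documents containing the exact phrase.
        return {"strong tea": 1_200_000, "powerful tea": 28_000}.get(phrase, 0)

    def hits_source_b(phrase: str) -> int:
        # Placeholder: a second, independent hit-count source.
        return {"strong tea": 950_000, "powerful tea": 31_500}.get(phrase, 0)

    def hits_source_c(phrase: str) -> int:
        # Placeholder: a third source, e.g. a reference corpus.
        return {"strong tea": 14_300, "powerful tea": 410}.get(phrase, 0)

    def triangulate(phrase_a: str, phrase_b: str) -> None:
        # Compare the preference for phrase_a over phrase_b across several
        # independent sources; agreement in the direction of the ratio is
        # what lends support to the linguistic hypothesis.
        for source in (hits_source_a, hits_source_b, hits_source_c):
            a, b = source(phrase_a), source(phrase_b)
            ratio = a / b if b else float("inf")
            print(f"{source.__name__}: {phrase_a}={a}, {phrase_b}={b}, ratio={ratio:.1f}")

    triangulate("strong tea", "powerful tea")

If all sources agree on the direction (here, a strong preference for one collocation), the hypothesis gains support; divergent estimates would signal noise, indexing artefacts, or sampling problems that the researcher must address before drawing semantic conclusions.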
