This study quantitatively compares lexical diversity across registers of the Mongolian language. To this end, ten sample corpora of 90,605 tokens each were compiled, one for each of ten registers, and lexical diversity indices were calculated for each corpus and compared. The register with the greatest lexical diversity was literature textbooks, followed by newspaper articles (culture), newspaper articles (world), interviews, newspaper articles (sports), newspaper articles (society), newspaper articles (economy), podcasts, newspaper articles (politics), and law. In other words, in Mongolian the most varied vocabulary is used in literature textbooks and in newspaper articles on culture and world news, and the least varied vocabulary is used in law, newspaper articles on politics, and impromptu conversation (podcasts). In addition, newspaper articles were divided into six registers for analysis, and these registers differed considerably in lexical diversity. This result quantitatively demonstrates that newspaper articles should not be treated as a single register in corpus compilation or corpus-based research, but should be divided into registers. The study also confirmed that most lexical diversity indices yield the same results when every corpus has the same number of tokens, but yield different results when corpus sizes differ. Finally, the study shows that one way to reduce errors in lexical diversity research is to compute as many indices as possible and compare them with one another, rather than relying on a single index.
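The point about corpus size can be illustrated with a minimal sketch. The abstract does not name the indices used, so the three below (type-token ratio, Guiraud's R, and Herdan's C) are assumptions chosen only because they are common and simple. The helper names and the synthetic token lists are hypothetical and are not the study's data.

```python
import math


def lexical_diversity(tokens):
    """Return three common lexical diversity indices for a list of tokens."""
    n = len(tokens)       # number of tokens (N)
    v = len(set(tokens))  # number of types (V)
    if n < 2:
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                            # type-token ratio: V / N
        "guiraud_r": v / math.sqrt(n),           # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),   # Herdan's C: log V / log N
    }


def rank_registers(corpora):
    """Rank register names from most to least diverse under each index."""
    scores = {name: lexical_diversity(toks) for name, toks in corpora.items()}
    return {
        index: sorted(scores, key=lambda name: scores[name][index], reverse=True)
        for index in ("ttr", "guiraud_r", "herdan_c")
    }


def synthetic_corpus(n_tokens, n_types):
    """Hypothetical stand-in for real data: n_tokens tokens over n_types types."""
    return [f"w{i % n_types}" for i in range(n_tokens)]


# Equal corpus sizes: all three indices rank the registers identically,
# because each index increases with V when N is held fixed.
equal = rank_registers({
    "register_a": synthetic_corpus(100, 50),
    "register_b": synthetic_corpus(100, 30),
})

# Unequal corpus sizes: the indices no longer agree on the ranking.
unequal = rank_registers({
    "register_a": synthetic_corpus(100, 50),
    "register_b": synthetic_corpus(10_000, 2_000),
})

print(equal)
print(unequal)
```

With the token count held fixed, every index here is an increasing function of the type count alone, so the rankings cannot differ. Once the sizes differ, each index corrects for length differently, which is one reason to compare several indices rather than rely on one.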