Abstract
This paper investigates the applicability of Benford's law to Chinese texts. First, a Chinese corpus was collected and segmented into words. The first-digit distributions of frequency were calculated for words, low-frequency words, and single characters, and the relative entropy (Kullback-Leibler divergence) between each distribution and the general Benford's law was computed. Second, the range of parameter values of the Generalized Benford's law was studied; because Zipf's law is applicable only to large amounts of data, a statistical analysis of small-scale data was also carried out. Third, the first-digit probabilities of single-character frequencies were analyzed experimentally to explore the applicability of the Generalized Benford's law to single-character data. Finally, the applicability of Benford's law to artificially modified corpora was investigated. The results show that words and characters in Chinese texts conform to Benford's law, that Benford's law overcomes the data-set size limitation of Zipf's law, and that the Generalized Benford's law can discriminate whether a corpus is natural, which has practical significance for Chinese information processing.
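The core measurement described above can be illustrated with a minimal sketch. It assumes the text has already been segmented into tokens, and it compares against the standard Benford's law, P(d) = log10(1 + 1/d), because the abstract does not give the parametrization of the Generalized Benford's law. The function names and the toy token list are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def benford_probs():
    """Standard Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit_distribution(frequencies):
    """Empirical distribution of the leading digit of positive integer counts."""
    digits = [int(str(f)[0]) for f in frequencies if f > 0]
    counts = Counter(digits)
    total = len(digits)
    return [counts.get(d, 0) / total for d in range(1, 10)]


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy input: a real experiment would use segmented corpus tokens.
tokens = "a b a c a b d e a f".split()
word_freqs = Counter(tokens).values()

observed = first_digit_distribution(word_freqs)
print(kl_divergence(observed, benford_probs()))
```

A smaller divergence indicates a first-digit distribution closer to Benford's law; a corpus this small is only meant to show the mechanics, not to reproduce any result reported in the paper.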