Abstract

The lengths of sentences in written texts have been reported to follow characteristic distributions that resemble the lognormal distribution. However, the mechanism responsible for this lognormality is unclear. In this quantitative study, we analyze over 10,000 sentences from out-of-copyright Japanese texts stored on Aozora Bunko. We first confirm that sentence length distributions are better represented by the lognormal distribution than by alternatives (e.g., the gamma distribution). Next, assuming that each sentence is generated by a hierarchical branching process described by its dependency tree, we use the Japanese dependency analyzer CaboCha to test whether the composition of sentences can be explained by a simple multiplicative process. The results imply that the lognormality of sentence length distributions originates from the dependency tree depth and that a simple multiplicative model cannot accurately describe the processes involved in generating sentences.
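
The general link between a multiplicative process and lognormality can be illustrated with a toy simulation. This is only a minimal sketch, not the model tested in the study: the function names, the uniform factor range [1, 3], the fixed depth of 8, and the sample size are all assumptions chosen for illustration. If a length is the product of several independent positive factors, its logarithm is a sum of independent terms, so the log-lengths should look roughly symmetric while the raw lengths remain strongly right-skewed.

```python
import math
import random
import statistics


def simulate_length(depth, rng):
    """Length produced by a toy multiplicative process with `depth` steps.

    Each step multiplies the running length by an independent positive
    random factor (hypothetical choice: uniform on [1, 3]).
    """
    length = 1.0
    for _ in range(depth):
        length *= rng.uniform(1.0, 3.0)
    return length


def skewness(values):
    """Sample skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
lengths = [simulate_length(depth=8, rng=rng) for _ in range(20000)]
log_lengths = [math.log(x) for x in lengths]

# Raw lengths should be strongly right-skewed; log-lengths close to symmetric.
raw_skew = skewness(lengths)
log_skew = skewness(log_lengths)
print(f"skewness of lengths:     {raw_skew:.3f}")
print(f"skewness of log-lengths: {log_skew:.3f}")
```

The sketch only shows why a multiplicative process is a natural candidate explanation for lognormal lengths. It says nothing about whether real sentences are generated this way, which is the question the study examines using dependency trees.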
