Text Segmentation Via Processes that Count the Number of Different Words Forward and Backward

Berhane Abebe, Mikhail Chebunin, Artyom Kovalevskii

doi:10.1080/09296174.2023.2275342

Journal: Journal of Quantitative Linguistics

Publication Date: Nov 10, 2023

Affiliation: Novosibirsk State University, Sobolev Institute of Mathematics, Novosibirsk State Technical University

#Number Of Different Words #Text Segmentation Methods

Abstract

The paper develops a new statistical approach to the automatic partitioning of texts into parts belonging to different authors. It is based on the analysis of processes that count the number of different words forward and backward. The theoretical study of these processes is based on an elementary probability model with a change point. We prove the consistency of our statistical estimate of the concatenation point in the case when the concatenated texts have different Zipf exponents. The method is tested on the Brown corpus and on newspaper texts in different languages. Testing shows a good estimate of the concatenation point. The method can be used in parallel with other text segmentation methods.
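
As a rough illustration of the two counting processes named in the abstract, the sketch below computes the number of different words seen so far when a tokenized text is read from the start (forward) and from the end (backward). It is a minimal sketch, not the authors' estimator: the abstract does not state how the two processes are combined to estimate the concatenation point, so that step is left out. The function names and the toy token list are hypothetical.

```python
from typing import List, Sequence


def distinct_counts_forward(words: Sequence[str]) -> List[int]:
    """Return counts where counts[k] is the number of different words
    among the first k + 1 tokens of the text."""
    seen = set()
    counts = []
    for word in words:
        seen.add(word)            # adding a repeated word leaves the set unchanged
        counts.append(len(seen))  # running number of different words
    return counts


def distinct_counts_backward(words: Sequence[str]) -> List[int]:
    """Return counts where counts[k] is the number of different words
    among the last k + 1 tokens of the text (the text read in reverse)."""
    return distinct_counts_forward(list(reversed(words)))


# Toy example (hypothetical data): six tokens, four different words.
tokens = "a b a c c d".split()
forward = distinct_counts_forward(tokens)
backward = distinct_counts_backward(tokens)
print(forward)
print(backward)
```

Both processes are non-decreasing and end at the same value, the total vocabulary size of the text. In a text concatenated from two sources with different Zipf exponents, the two curves would be expected to grow at different rates on either side of the junction, which is the kind of asymmetry the abstract's change-point model refers to.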
